Running TAO training with the following command:
!ssd train --gpus 1 --gpu_index $GPU_INDEX \
    -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
    -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
    -m $USER_EXPERIMENT_DIR/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5
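For reference, the training_config in $SPECS_DIR/ssd_train_resnet18_kitti.txt looks like the sketch below. I have not copied it verbatim; the two values shown are inferred from the log (575 training images over 36 steps per epoch implies a batch size of 16, and the run reports Epoch 1/10):

training_config {
  batch_size_per_gpu: 16   # inferred: 575 images / 36 steps per epoch
  num_epochs: 10           # matches "Epoch 1/10" in the log below
}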
Training starts, but it crashes partway through the first epoch with a DALI out-of-memory error. Full log:
2024-03-12 07:20:24,820 [TAO Toolkit] [INFO] __main__ 356: Number of images in the training dataset: 575
2024-03-12 07:20:24,820 [TAO Toolkit] [INFO] __main__ 358: Number of images in the validation dataset: 144
2024-03-12 07:20:25,450 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/ssd/experiment_dir_unpruned/status.json
2024-03-12 07:20:29,271 [TAO Toolkit] [INFO] root 2102: Starting Training Loop.
Epoch 1/10
19/36 [==============>…] - ETA: 38s - loss: 36.5114
DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
2024-03-12 07:21:13,426 [TAO Toolkit] [INFO] root 2102: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
[[cond_14/SliceReplace_5/range/4975]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 586, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 717, in return_func
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/utils.py", line 705, in return_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 582, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 562, in main
    run_experiment(config_path=args.experiment_spec_file,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/ssd/scripts/train.py", line 469, in run_experiment
    model.fit(steps_per_epoch=iters_per_epoch,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1027, in fit
    return training_arrays.fit_loop(self, f, ins,
  File "/usr/local/lib/python3.8/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1470, in __call__
    ret = tf_session.TF_SessionRunCallable(self._session._session,
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
[[cond_14/SliceReplace_5/range/4975]]
(1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Slice encountered:
Can't allocate 6383730688 bytes on device 0.
Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
terminate called after throwing an instance of 'dali::CUDAError'
what(): CUDA runtime API error cudaErrorIllegalAddress (700):
an illegal memory access was encountered
[5bb4629a2b41:12083] *** Process received signal ***
[5bb4629a2b41:12083] Signal: Aborted (6)
[5bb4629a2b41:12083] Signal code: (-6)
[5bb4629a2b41:12083] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f1006ed5090]
[5bb4629a2b41:12083] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f1006ed500b]
[5bb4629a2b41:12083] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f1006eb4859]
[5bb4629a2b41:12083] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f100635e911]
[5bb4629a2b41:12083] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f100636a38c]
[5bb4629a2b41:12083] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f100636a3f7]
[5bb4629a2b41:12083] [ 6] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_core.so(+0x286b9)[0x7f0fd24556b9]
[5bb4629a2b41:12083] [ 7] /usr/local/lib/python3.8/dist-packages/nvidia/dali/python_function_plugin.cpython-38-x86_64-linux-gnu.so(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x46)[0x7f0fd55e1ed6]
[5bb4629a2b41:12083] [ 8] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali_core.so(+0x17046)[0x7f0fd2444046]
[5bb4629a2b41:12083] [ 9] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x468a7)[0x7f1006ed88a7]
[5bb4629a2b41:12083] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x7f1006ed8a60]
[5bb4629a2b41:12083] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfa)[0x7f1006eb608a]
[5bb4629a2b41:12083] [12] python(_start+0x2e)[0x5faa2e]
[5bb4629a2b41:12083] *** End of error message ***
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: Insufficient Permissions
Execution status: FAIL
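The failing allocation is roughly 6.4 GB on device 0, so this looks like the DALI pipeline (the GPU Slice operator, apparently the crop step of the augmentation pipeline) running out of GPU memory mid-epoch. A quick way to check how much memory is actually free on that device before training, using a standard nvidia-smi query (nothing TAO-specific):

# Report total/used/free memory per GPU; device 0 is the one DALI failed to allocate on
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free --format=csv

If anything else is sharing device 0, it will show up in the used column; otherwise the free column tells me how much headroom the DALI pipeline has at the current batch size.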