I’m using an AWS g6.12xlarge instance (4x NVIDIA L4 GPUs) to train a Mask R-CNN model.
I’m running the toolkit with the command:
docker run -it --rm --gpus all -v /home/ubuntu:/workspace nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
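For completeness, this is the variant of the launch command I have seen recommended in the NGC container docs for multi-GPU workloads, with a larger shared-memory segment and relaxed memlock/stack limits. The exact --shm-size value is my own guess, and I have not yet confirmed whether these flags change anything here:

# same container, but with the shared-memory and ulimit settings that multi-GPU NCCL jobs often need
docker run -it --rm --gpus all \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size=16g \
  -v /home/ubuntu:/workspace \
  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5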
NVIDIA-SMI:
smi.txt (2.8 KB)
specs:
specs.txt (2.2 KB)
The command I run inside the container:
mask_rcnn train -e /workspace/tao/specs/maskrcnn_train_resnet18.txt -d /workspace/tao/mask_rcnn/experiment_dir_unpruned --gpus 4
fails with the error below (full log attached):
error.txt (18.0 KB)
[2024-08-15 21:16:55.177738: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
1: [DistributedMomentumOptimizer_Allreduce/cond_81/HorovodAllreduce_gradients_AddN_42_0, DistributedMomentumOptimizer_Allreduce/cond_82/HorovodAllreduce_gradients_AddN_41_0, DistributedMomentumOptimizer_Allreduce/cond_83/HorovodAllreduce_gradients_AddN_49_0, DistributedMomentumOptimizer_Allreduce/cond_84/HorovodAllreduce_gradients_AddN_48_0, DistributedMomentumOptimizer_Allreduce/cond_89/HorovodAllreduce_gradients_AddN_13_0, DistributedMomentumOptimizer_Allreduce/cond_92/HorovodAllreduce_gradients_box_predict_BiasAdd_grad_tuple_control_dependency_1_0 ...]
[502ac2b7bffe:411 :0:1429] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 1429) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000006bb17 ncclGroupEnd() ???:0
2 0x0000000000008609 start_thread() ???:0
3 0x000000000011f133 clone() ???:0
=================================
[502ac2b7bffe:00411] *** Process received signal ***
[502ac2b7bffe:00411] Signal: Segmentation fault (11)
[502ac2b7bffe:00411] Signal code: (-6)
[502ac2b7bffe:00411] Failing at address: 0x19b
[502ac2b7bffe:00411] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x73dea6695090]
[502ac2b7bffe:00411] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6bb17)[0x73ddc3cbbb17]
[502ac2b7bffe:00411] [ 2] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x73dea6637609]
[502ac2b7bffe:00411] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x73dea6771133]
[502ac2b7bffe:00411] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 502ac2b7bffe exited on signal 11 (Segmentation fault).
If I run with 1 GPU, everything works fine; I just get an OOM after some steps, which is expected. Running with 2, 3, or 4 GPUs gives the same error.
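In case it helps narrow this down, I can rerun with NCCL debug logging enabled, and also try disabling peer-to-peer transfers as a workaround. NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and NCCL_P2P_DISABLE are standard NCCL environment variables; whether they reveal or fix anything in this particular setup is an assumption on my part:

# rerun with NCCL debug output to see where the allreduce hangs or crashes
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET mask_rcnn train \
  -e /workspace/tao/specs/maskrcnn_train_resnet18.txt \
  -d /workspace/tao/mask_rcnn/experiment_dir_unpruned \
  --gpus 4

# possible workaround: disable peer-to-peer transfers between the GPUs
NCCL_P2P_DISABLE=1 mask_rcnn train \
  -e /workspace/tao/specs/maskrcnn_train_resnet18.txt \
  -d /workspace/tao/mask_rcnn/experiment_dir_unpruned \
  --gpus 4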
I have read some similar posts, but none of them are about toolkit 5.0.0. It looks like a version mismatch between CUDA, TAO, and other components. Do I need to run an older version? If so, which one?
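If the GPU interconnect topology is relevant, I can also attach the output of the following (not yet included in the attachments):

# show how the four GPUs are connected (PCIe/NVLink topology as seen by the driver)
nvidia-smi topo -m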