I’m using an AWS g6.12xlarge instance (4x NVIDIA L4 GPUs) to train a Mask R-CNN model.
I’m running the toolkit with the command:
docker run -it --rm --gpus all -v /home/ubuntu:/workspace nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
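For completeness, this is the variant of the launch command I have seen recommended in the NGC container docs for multi-GPU workloads, with a larger shared-memory segment and relaxed memlock/stack limits. The exact --shm-size value is my own guess, and I have not yet confirmed whether these flags change anything here:

# same container, but with the shared-memory and ulimit settings that multi-GPU NCCL jobs often need
docker run -it --rm --gpus all \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size=16g \
  -v /home/ubuntu:/workspace \
  nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5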
NVIDIA-SMI:
smi.txt (2.8 KB)
specs:
specs.txt (2.2 KB)
The command I run inside the container:
mask_rcnn train -e /workspace/tao/specs/maskrcnn_train_resnet18.txt -d /workspace/tao/mask_rcnn/experiment_dir_unpruned --gpus 4
fails with the error below (full log attached):
error.txt (18.0 KB)
[2024-08-15 21:16:55.177738: W /tmp/pip-install-gz1q68mo/horovod_94237439d5f64637a082acc92487fc68/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
1: [DistributedMomentumOptimizer_Allreduce/cond_81/HorovodAllreduce_gradients_AddN_42_0, DistributedMomentumOptimizer_Allreduce/cond_82/HorovodAllreduce_gradients_AddN_41_0, DistributedMomentumOptimizer_Allreduce/cond_83/HorovodAllreduce_gradients_AddN_49_0, DistributedMomentumOptimizer_Allreduce/cond_84/HorovodAllreduce_gradients_AddN_48_0, DistributedMomentumOptimizer_Allreduce/cond_89/HorovodAllreduce_gradients_AddN_13_0, DistributedMomentumOptimizer_Allreduce/cond_92/HorovodAllreduce_gradients_box_predict_BiasAdd_grad_tuple_control_dependency_1_0 ...]
[502ac2b7bffe:411 :0:1429] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 1429) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000006bb17 ncclGroupEnd() ???:0
2 0x0000000000008609 start_thread() ???:0
3 0x000000000011f133 clone() ???:0
=================================
[502ac2b7bffe:00411] *** Process received signal ***
[502ac2b7bffe:00411] Signal: Segmentation fault (11)
[502ac2b7bffe:00411] Signal code: (-6)
[502ac2b7bffe:00411] Failing at address: 0x19b
[502ac2b7bffe:00411] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x73dea6695090]
[502ac2b7bffe:00411] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6bb17)[0x73ddc3cbbb17]
[502ac2b7bffe:00411] [ 2] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x73dea6637609]
[502ac2b7bffe:00411] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x73dea6771133]
[502ac2b7bffe:00411] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 502ac2b7bffe exited on signal 11 (Segmentation fault).
If I run with 1 GPU, everything works fine; I just get an OOM after some steps, which is expected. Running with 2, 3, or 4 GPUs gives the same error.
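In case it helps narrow this down, I can rerun with NCCL debug logging enabled, and also try disabling peer-to-peer transfers as a workaround. NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and NCCL_P2P_DISABLE are standard NCCL environment variables; whether they reveal or fix anything in this particular setup is an assumption on my part:

# rerun with NCCL debug output to see where the allreduce hangs or crashes
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET mask_rcnn train \
  -e /workspace/tao/specs/maskrcnn_train_resnet18.txt \
  -d /workspace/tao/mask_rcnn/experiment_dir_unpruned \
  --gpus 4

# possible workaround: disable peer-to-peer transfers between the GPUs
NCCL_P2P_DISABLE=1 mask_rcnn train \
  -e /workspace/tao/specs/maskrcnn_train_resnet18.txt \
  -d /workspace/tao/mask_rcnn/experiment_dir_unpruned \
  --gpus 4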
I have read some similar posts, but none of them are about toolkit 5.0.0. It looks like a version mismatch between CUDA, TAO, and other components. Do I need to run an older version? If so, which one?
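If the GPU interconnect topology is relevant, I can also attach the output of the following (not yet included in the attachments):

# show how the four GPUs are connected (PCIe/NVLink topology as seen by the driver)
nvidia-smi topo -m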