Please provide the following information when requesting support.
• Hardware
NVIDIA A100-SXM4-40GB
• Network Type
YOLOv4
• TAO Version
tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021
Hi there,
I am trying to train YOLOv4 on an AWS P4 instance created from the NVIDIA Deep Learning Base AMI 2024.03.4-676eed8d-dcf5-4784-87d7-0de463205c17.
I expected everything to run smoothly, but that is not the case.
When I try to start a training run with tao yolo_v4 train, I get the following error:
tao yolo_v4 train
2024-04-23 06:08:09,250 [INFO] root: Registry: ['nvcr.io']
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-9z6ezlfr because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/bin/yolo_v4", line 8, in <module>
sys.exit(main())
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/export.py", line 8, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/export/yolov4_exporter.py", line 31, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
cuda.init()
pycuda._driver.LogicError: cuInit failed: system not yet initialized
2024-04-23 06:08:13,150 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
I followed another topic, pycuda-driver-logicerror-cuinit-failed-system-not-yet-initialized/, and tried to run pycuda directly from the container:
docker run --gpus all --entrypoint "" -it -v /home/ubuntu/tao/:/workspace/tao-experiments nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 /bin/bash
But I get the same error:
root@fae8148ba2ce:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
pycuda._driver.LogicError: cuInit failed: system not yet initialized
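To see the raw driver error code behind that message without going through pycuda, a minimal sketch like the following could be run inside the same container. It assumes libcuda.so.1 is on the loader path; check_cuinit is a hypothetical helper name used only for illustration, not part of TAO or pycuda.

```python
import ctypes


def check_cuinit(libname="libcuda.so.1"):
    """Call cuInit(0) through ctypes and return (code, name, description).

    check_cuinit is an illustrative helper, not a TAO or pycuda function.
    """
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError as exc:
        # The driver library is not visible to this process at all.
        return (None, "LIBRARY_NOT_FOUND", str(exc))

    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)

    # Ask the driver for the symbolic name and the description of the code.
    name = ctypes.c_char_p()
    desc = ctypes.c_char_p()
    for fn, out in ((libcuda.cuGetErrorName, name),
                    (libcuda.cuGetErrorString, desc)):
        fn.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
        fn.restype = ctypes.c_int
        fn(code, ctypes.byref(out))

    return (code,
            (name.value or b"unknown").decode(),
            (desc.value or b"unknown").decode())


if __name__ == "__main__":
    print(check_cuinit())
```

Run as a script, it prints a (code, name, description) tuple. A code of 802 is CUDA_ERROR_SYSTEM_NOT_READY in the CUDA driver API, and the documentation for that code says to verify that all required driver daemons are running.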
I also installed nvidia-modprobe:
sudo apt-get install nvidia-modprobe
I ran TAO two weeks ago on another P4 EC2 instance without any issue, so I am not sure what is going on. To install it, I followed these instructions:
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.7
export VIRTUALENVWRAPPER_VIRTUALENV=/home/ubuntu/.local/bin/virtualenv
export WORKON_HOME=$HOME/.virtualenvs
source /home/ubuntu/.local/bin/virtualenvwrapper.sh
mkvirtualenv tao-v3.21.08
(tao-v3.21.08) pip install nvidia-pyindex
(tao-v3.21.08) pip install nvidia-tao==0.1.19
(tao-v3.21.08) python --version
Python 3.7.17
(tao-v3.21.08) tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021
NVIDIA/CUDA info:
root@3cbe58ae05b8:/workspace# nvidia-smi
Tue Apr 23 06:40:59 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1C.0 Off |                    0 |
| N/A   33C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1D.0 Off |                    0 |
| N/A   30C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1C.0 Off |                    0 |
| N/A   31C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1D.0 Off |                    0 |
| N/A   29C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1C.0 Off |                    0 |
| N/A   32C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1D.0 Off |                    0 |
| N/A   30C    P0              45W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1C.0 Off |                    0 |
| N/A   33C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1D.0 Off |                    0 |
| N/A   30C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
root@3cbe58ae05b8:/workspace# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
root@3cbe58ae05b8:/workspace# dpkg -l |grep cuda
ii cuda-command-line-tools-11-1 11.1.1-1 amd64 CUDA command-line tools
ii cuda-compat-11-1 455.45.01-1 amd64 CUDA Compatibility Platform
ii cuda-compiler-11-1 11.1.1-1 amd64 CUDA compiler
ii cuda-cudart-11-1 11.1.74-1 amd64 CUDA Runtime native Libraries
ii cuda-cudart-dev-11-1 11.1.74-1 amd64 CUDA Runtime native dev links, headers
ii cuda-cuobjdump-11-1 11.1.74-1 amd64 CUDA cuobjdump
ii cuda-cupti-11-1 11.1.105-1 amd64 CUDA profiling tools runtime libs.
ii cuda-cupti-dev-11-1 11.1.105-1 amd64 CUDA profiling tools interface.
ii cuda-driver-dev-11-1 11.1.74-1 amd64 CUDA Driver native dev stub library
ii cuda-gdb-11-1 11.1.105-1 amd64 CUDA-GDB
ii cuda-libraries-11-1 11.1.1-1 amd64 CUDA Libraries 11.1 meta-package
ii cuda-libraries-dev-11-1 11.1.1-1 amd64 CUDA Libraries 11.1 development meta-package
ii cuda-memcheck-11-1 11.1.105-1 amd64 CUDA-MEMCHECK
ii cuda-minimal-build-11-1 11.1.1-1 amd64 Minimal CUDA 11.1 toolkit build packages.
ii cuda-nvcc-11-1 11.1.105-1 amd64 CUDA nvcc
ii cuda-nvdisasm-11-1 11.1.74-1 amd64 CUDA disassembler
ii cuda-nvml-dev-11-1 11.1.74-1 amd64 NVML native dev links, headers
ii cuda-nvprof-11-1 11.1.105-1 amd64 CUDA Profiler tools
ii cuda-nvprune-11-1 11.1.74-1 amd64 CUDA nvprune
ii cuda-nvrtc-11-1 11.1.105-1 amd64 NVRTC native runtime libraries
ii cuda-nvrtc-dev-11-1 11.1.105-1 amd64 NVRTC native dev links, headers
ii cuda-nvtx-11-1 11.1.74-1 amd64 NVIDIA Tools Extension
ii cuda-sanitizer-11-1 11.1.105-1 amd64 CUDA Sanitizer
hi libcudnn8 8.1.1.33-1+cuda11.2 amd64 cuDNN runtime libraries
ii libcudnn8-dev 8.1.1.33-1+cuda11.2 amd64 cuDNN development libraries and headers
hi libnccl-dev 2.7.8-1+cuda11.1 amd64 NVIDIA Collectives Communication Library (NCCL) Development Files
hi libnccl2 2.7.8-1+cuda11.1 amd64 NVIDIA Collectives Communication Library (NCCL) Runtime
ii libnvinfer-dev 7.2.3-1+cuda11.1 amd64 TensorRT development libraries and headers
ii libnvinfer-plugin-dev 7.2.3-1+cuda11.1 amd64 TensorRT plugin libraries
ii libnvinfer-plugin7 7.2.3-1+cuda11.1 amd64 TensorRT plugin libraries
ii libnvinfer7 7.2.3-1+cuda11.1 amd64 TensorRT runtime libraries
ii libnvonnxparsers-dev 7.2.3-1+cuda11.1 amd64 TensorRT ONNX libraries
ii libnvonnxparsers7 7.2.3-1+cuda11.1 amd64 TensorRT ONNX libraries
ii libnvparsers-dev 7.2.3-1+cuda11.1 amd64 TensorRT parsers libraries
ii libnvparsers7 7.2.3-1+cuda11.1 amd64 TensorRT parsers libraries
Any ideas? Thanks for the help.