Channel: TAO Toolkit - NVIDIA Developer Forums

TAO v3.21.08 - pycuda._driver.LogicError: cuInit failed: system not yet initialized

Please provide the following information when requesting support.

• Hardware
NVIDIA A100-SXM4-40GB
• Network Type
YOLOv4
• TAO Version

tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

Hi there,

I am trying to train YOLOv4 on an AWS P4 instance created from the NVIDIA Deep Learning Base AMI 2024.03.4-676eed8d-dcf5-4784-87d7-0de463205c17.
I expected everything to run smoothly, but that is not the case.

When I start a training run with tao yolo_v4 train, I get the following error:

tao yolo_v4 train
2024-04-23 06:08:09,250 [INFO] root: Registry: ['nvcr.io']
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-9z6ezlfr because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 256, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/export.py", line 8, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/export/yolov4_exporter.py", line 31, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py", line 27, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.LogicError: cuInit failed: system not yet initialized
2024-04-23 06:08:13,150 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I followed another topic, pycuda-driver-logicerror-cuinit-failed-system-not-yet-initialized/, and tried to run pycuda directly from the container:

docker run --gpus all --entrypoint ""  -it -v /home/ubuntu/tao/:/workspace/tao-experiments nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3 /bin/bash

But I get the same error:

root@fae8148ba2ce:/workspace# python
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pycuda
>>> import pycuda.driver as cuda
>>> cuda.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pycuda._driver.LogicError: cuInit failed: system not yet initialized
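
To rule out pycuda itself, the same driver call can be made directly with ctypes, which also exposes the numeric result that pycuda hides behind the message. This is only a minimal sketch: it assumes libcuda.so.1 is on the loader path inside the container, the helper names (describe_cu_result, call_cuinit) are made up for illustration, and the table lists just a handful of CUresult values from cuda.h.

```python
import ctypes

# A small subset of CUresult values from the CUDA driver API header (cuda.h).
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    3: "CUDA_ERROR_NOT_INITIALIZED",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    803: "CUDA_ERROR_SYSTEM_DRIVER_MISMATCH",
    804: "CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code):
    """Map a numeric CUresult to its enum name (hypothetical helper)."""
    return CU_RESULT_NAMES.get(code, "unlisted CUresult %d" % code)


def call_cuinit(library_name="libcuda.so.1"):
    """Call cuInit(0) through ctypes.

    Returns the integer CUresult, or None if the driver library
    cannot be loaded at all.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


if __name__ == "__main__":
    result = call_cuinit()
    if result is None:
        print("libcuda.so.1 could not be loaded")
    else:
        print("cuInit returned %d (%s)" % (result, describe_cu_result(result)))
```

If this also fails, the problem sits below pycuda, in the driver stack visible to the container. The "system not yet initialized" message is the driver's text for CUDA_ERROR_SYSTEM_NOT_READY (802), so that is the value I would expect to see here.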

I’ve also installed nvidia-modprobe:
sudo apt-get install nvidia-modprobe

I ran TAO two weeks ago on another P4 EC2 instance with no issues, so I am not sure what is going on. For the installation, I followed these instructions:

sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7

export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.7
export VIRTUALENVWRAPPER_VIRTUALENV=/home/ubuntu/.local/bin/virtualenv
export WORKON_HOME=$HOME/.virtualenvs
source /home/ubuntu/.local/bin/virtualenvwrapper.sh

mkvirtualenv tao-v3.21.08
(tao-v3.21.08) pip install nvidia-pyindex
(tao-v3.21.08) pip install nvidia-tao==0.1.19

(tao-v3.21.08) python --version
Python 3.7.17
(tao-v3.21.08)  tao info
Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 1.0
toolkit_version: 3.21.08
published_date: 08/17/2021

NVIDIA/CUDA info:

root@3cbe58ae05b8:/workspace# nvidia-smi
Tue Apr 23 06:40:59 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1C.0 Off |                    0 |
| N/A   33C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:10:1D.0 Off |                    0 |
| N/A   30C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1C.0 Off |                    0 |
| N/A   31C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:20:1D.0 Off |                    0 |
| N/A   29C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1C.0 Off |                    0 |
| N/A   32C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-40GB          On  | 00000000:90:1D.0 Off |                    0 |
| N/A   30C    P0              45W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1C.0 Off |                    0 |
| N/A   33C    P0              44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-40GB          On  | 00000000:A0:1D.0 Off |                    0 |
| N/A   30C    P0              42W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
root@3cbe58ae05b8:/workspace# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
root@3cbe58ae05b8:/workspace# dpkg -l |grep cuda
ii  cuda-command-line-tools-11-1  11.1.1-1                            amd64        CUDA command-line tools
ii  cuda-compat-11-1              455.45.01-1                         amd64        CUDA Compatibility Platform
ii  cuda-compiler-11-1            11.1.1-1                            amd64        CUDA compiler
ii  cuda-cudart-11-1              11.1.74-1                           amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-11-1          11.1.74-1                           amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-11-1           11.1.74-1                           amd64        CUDA cuobjdump
ii  cuda-cupti-11-1               11.1.105-1                          amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-11-1           11.1.105-1                          amd64        CUDA profiling tools interface.
ii  cuda-driver-dev-11-1          11.1.74-1                           amd64        CUDA Driver native dev stub library
ii  cuda-gdb-11-1                 11.1.105-1                          amd64        CUDA-GDB
ii  cuda-libraries-11-1           11.1.1-1                            amd64        CUDA Libraries 11.1 meta-package
ii  cuda-libraries-dev-11-1       11.1.1-1                            amd64        CUDA Libraries 11.1 development meta-package
ii  cuda-memcheck-11-1            11.1.105-1                          amd64        CUDA-MEMCHECK
ii  cuda-minimal-build-11-1       11.1.1-1                            amd64        Minimal CUDA 11.1 toolkit build packages.
ii  cuda-nvcc-11-1                11.1.105-1                          amd64        CUDA nvcc
ii  cuda-nvdisasm-11-1            11.1.74-1                           amd64        CUDA disassembler
ii  cuda-nvml-dev-11-1            11.1.74-1                           amd64        NVML native dev links, headers
ii  cuda-nvprof-11-1              11.1.105-1                          amd64        CUDA Profiler tools
ii  cuda-nvprune-11-1             11.1.74-1                           amd64        CUDA nvprune
ii  cuda-nvrtc-11-1               11.1.105-1                          amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-11-1           11.1.105-1                          amd64        NVRTC native dev links, headers
ii  cuda-nvtx-11-1                11.1.74-1                           amd64        NVIDIA Tools Extension
ii  cuda-sanitizer-11-1           11.1.105-1                          amd64        CUDA Sanitizer
hi  libcudnn8                     8.1.1.33-1+cuda11.2                 amd64        cuDNN runtime libraries
ii  libcudnn8-dev                 8.1.1.33-1+cuda11.2                 amd64        cuDNN development libraries and headers
hi  libnccl-dev                   2.7.8-1+cuda11.1                    amd64        NVIDIA Collectives Communication Library (NCCL) Development Files
hi  libnccl2                      2.7.8-1+cuda11.1                    amd64        NVIDIA Collectives Communication Library (NCCL) Runtime
ii  libnvinfer-dev                7.2.3-1+cuda11.1                    amd64        TensorRT development libraries and headers
ii  libnvinfer-plugin-dev         7.2.3-1+cuda11.1                    amd64        TensorRT plugin libraries
ii  libnvinfer-plugin7            7.2.3-1+cuda11.1                    amd64        TensorRT plugin libraries
ii  libnvinfer7                   7.2.3-1+cuda11.1                    amd64        TensorRT runtime libraries
ii  libnvonnxparsers-dev          7.2.3-1+cuda11.1                    amd64        TensorRT ONNX libraries
ii  libnvonnxparsers7             7.2.3-1+cuda11.1                    amd64        TensorRT ONNX libraries
ii  libnvparsers-dev              7.2.3-1+cuda11.1                    amd64        TensorRT parsers libraries
ii  libnvparsers7                 7.2.3-1+cuda11.1                    amd64        TensorRT parsers libraries
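
One thing that may be worth checking, although nothing in the output above confirms it: the CUDA driver API documentation describes the "system not yet initialized" condition (CUDA_ERROR_SYSTEM_NOT_READY) as a sign that a required driver daemon may not be running. On NVSwitch-based A100-SXM4 machines one such daemon is NVIDIA Fabric Manager. The sketch below queries systemd for it and has to run on the EC2 host, not inside the TAO container. It assumes the unit is named nvidia-fabricmanager on this AMI; the helper names are hypothetical.

```python
import subprocess


def service_state(unit="nvidia-fabricmanager"):
    """Return the state printed by `systemctl is-active <unit>`.

    Returns None if systemctl is missing or prints nothing
    (for example, inside a container without systemd).
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError:
        return None
    return result.stdout.strip() or None


def summarize(state):
    """Turn the raw state string into a short human-readable verdict."""
    if state is None:
        return "could not query systemd"
    if state == "active":
        return "service is running"
    return "service is not running (state: %s)" % state


if __name__ == "__main__":
    print("nvidia-fabricmanager: " + summarize(service_state()))
```

If the service turns out to be inactive or not installed, that would be one candidate explanation for cuInit failing while nvidia-smi still lists all eight GPUs. This is a guess to verify, not a confirmed diagnosis.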

Any idea? Thanks for the help.
