Please provide the following information when requesting support.
• Hardware (T4)
• Network Type (Yolo_v4)
• TLT Version (TAO 5.0.0)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (For errors, please share the command line and the detailed log here.)
Hi, some background info on my issue:
I am trying to use NVIDIA TAO 5.0.0 to train a YOLOv4 model. I am running a VM on Google Cloud with an NVIDIA T4 GPU.
I followed the setup steps in this guide: https://docs.nvidia.com/tao/tao-toolkit/text/running_in_cloud/running_tao_toolkit_on_gcp.html
I start Jupyter from the terminal using this command:
andrewh@us-west4-t4:~$ jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root --NotebookApp.token='password'
I get to step 2.3 and run the following command:
!tao model yolo_v4 dataset_convert -d $SPECS_DIR/yolo_v4_tfrecords_kitti_train.txt \
-o $DATA_DOWNLOAD_DIR/yolo_v4/tfrecords/train \
-r $USER_EXPERIMENT_DIR/
I then get the following output:
2024-08-12 18:37:14,420 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-08-12 18:37:14,513 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-08-12 18:37:14,560 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/andrewh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-08-12 18:37:14,560 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
Using TensorFlow backend.
2024-08-12 18:37:17.451564: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-08-12 18:37:17,786 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2024-08-12 18:37:21,340 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:21,470 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:21,489 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2024-08-12 18:37:25,637 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2024-08-12 18:37:26,844 [TAO Toolkit] [WARNING] nvidia_tao_tf1.cv.common.export.keras_exporter 36: Failed to import TensorRT package, exporting TLT to a TensorRT engine will not be available.
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
    launch_job(nvidia_tao_tf1.cv.yolo_v4.scripts, "yolo_v4", sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 276, in launch_job
    modules = get_modules(package)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 47, in get_modules
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/scripts/export.py", line 21, in <module>
    from nvidia_tao_tf1.cv.yolo_v4.export.yolov4_exporter import YOLOv4Exporter as Exporter
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/yolo_v4/export/yolov4_exporter.py", line 42, in <module>
    from nvidia_tao_tf1.cv.common.export.keras_exporter import KerasExporter as Exporter
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/export/keras_exporter.py", line 46, in <module>
    from nvidia_tao_tf1.core.export.app import get_model_input_dtype
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/app.py", line 40, in <module>
    from nvidia_tao_tf1.core.export._tensorrt import keras_to_tensorrt
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/core/export/_tensorrt.py", line 39, in <module>
    import pycuda.autoinit # noqa pylint: disable=W0611
  File "/usr/local/lib/python3.8/dist-packages/pycuda/autoinit.py", line 5, in <module>
    cuda.init()
pycuda._driver.RuntimeError: cuInit failed: no CUDA-capable device is detected
2024-08-12 18:37:28,159 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
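One thing I noticed in the log above is the warning about adding "user":"UID:GID" to the DockerOptions portion of /home/andrewh/.tao_mounts.json. Based on my reading of the docs, I believe that file should look roughly like this (the mount paths here are placeholders for my actual directories, and 1000:1000 stands in for whatever `id -u` and `id -g` return):

{
    "Mounts": [
        {
            "source": "/home/andrewh/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}

Could something in this file (or missing from it) prevent the container from seeing the GPU?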
Here is the output of nvidia-smi on the host VM:
nvidia-smi
Mon Aug 12 20:15:05 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.256.02 Driver Version: 470.256.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 57C P0 28W / 70W | 514MiB / 15109MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1070 G /usr/lib/xorg/Xorg 67MiB |
| 0 N/A N/A 1926 G /usr/lib/xorg/Xorg 131MiB |
| 0 N/A N/A 2053 G /usr/bin/gnome-shell 27MiB |
| 0 N/A N/A 2456 C /usr/NX/bin/nxnode.bin 132MiB |
| 0 N/A N/A 4758 G /usr/lib/firefox/firefox 141MiB |
+-----------------------------------------------------------------------------+
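So nvidia-smi works on the host, but the container reports that no CUDA-capable device is detected, which makes me wonder whether Docker itself can see the GPU. If it helps, I could run a check like the following, reusing the same container image the TAO launcher pulls (I am assuming Docker's --gpus flag is the right way to expose the T4 here):

docker run --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 nvidia-smi

If that command also fails, I assume the problem is in my Docker/GPU setup rather than in TAO itself.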
And here are the CUDA-related packages installed on the host:
dpkg -l | grep cuda
ii libcudart10.1:amd64 10.1.243-3 amd64 NVIDIA CUDA Runtime Library
ii nvidia-cuda-dev 10.1.243-3 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 10.1.243-3 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 10.1.243-3 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 10.1.243-3 amd64 NVIDIA CUDA development toolkit
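I also notice that dpkg only lists CUDA 10.1 packages, while the driver above reports CUDA 11.4; I don't know whether that matters inside the container. To check whether the Docker GPU integration is installed at all, I believe a command like this should list it (I am assuming nvidia-container-toolkit is the relevant package name):

dpkg -l | grep -i nvidia-container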
I’ve read this forum post describing a similar issue: No CUDA-capable device is detected on tao detectnet_v2 dataset convert - #4 by NilsAI
However, I am unsure whether it applies, since I am running TAO in a different way than the author of that post.
Any advice on how to proceed would be much appreciated. I apologize in advance: I am very new to Linux, so some things that may be obvious or simple may not be for me. If any more info is needed, please let me know. I am running Ubuntu 20.04.6, 64-bit.
Thanks,
Andrew