
Error during TAO training in WSL2

I am training a model using the TAO Toolkit in WSL2, but I am encountering the following error during training. Could you help me identify the issue?

2024-12-05 05:33:38,306 [TAO Toolkit] [INFO] main 356: Number of images in the training dataset: 1634
2024-12-05 05:33:38,306 [TAO Toolkit] [INFO] main 358: Number of images in the validation dataset: 163
2024-12-05 05:33:38,644 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/ssd/experiment_dir_unpruned/status.json
2024-12-05 05:33:40,920 [TAO Toolkit] [INFO] root 2102: Starting Training Loop.
Epoch 1/100
DALI daliCreatePipeline(&pipe_handle_, serialized_pipeline.c_str(), serialized_pipeline.length(), max_batch_size, num_threads, device_id, exec_separated, prefetch_queue_depth_, cpu_prefetch_queue_depth, prefetch_queue_depth_, enable_memory_stats_) failed: Critical error when building pipeline:
Error when constructing operator: decoders__Image encountered:
Error in thread 0: nvml error (3): The nvml requested operation is not available on target device
Current pipeline object is no longer valid.
[318a1815ba11:00248] *** Process received signal ***
[318a1815ba11:00248] Signal: Segmentation fault (11)
[318a1815ba11:00248] Signal code: Address not mapped (1)
[318a1815ba11:00248] Failing at address: 0x29d00000270
[318a1815ba11:00248] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fd309356090]
[318a1815ba11:00248] [ 1] /usr/lib/wsl/drivers/nv_dispi.inf_amd64_adf5a840df867035/libcuda.so.1.1(+0x2b0cd6)[0x7fd1505c6cd6]
[318a1815ba11:00248] [ 2] /usr/lib/wsl/drivers/nv_dispi.inf_amd64_adf5a840df867035/libcuda.so.1.1(+0x2b10f0)[0x7fd1505c70f0]
[318a1815ba11:00248] [ 3] /usr/lib/wsl/drivers/nv_dispi.inf_amd64_adf5a840df867035/libcuda.so.1.1(+0x2b26d5)[0x7fd1505c86d5]
[318a1815ba11:00248] [ 4] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x5fa2f0)[0x7fd267ed32f0]
[318a1815ba11:00248] [ 5] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(+0x658158)[0x7fd267f31158]
[318a1815ba11:00248] [ 6] /usr/local/lib/python3.8/dist-packages/nvidia/dali/libdali.so(daliDeletePipeline+0x3c)[0x7fd267aba26c]
[318a1815ba11:00248] [ 7] /usr/local/lib/python3.8/dist-packages/nvidia/dali_tf_plugin/libdali_tf_1_15.so(_ZN12dali_tf_impl6DaliOpD0Ev+0x54)[0x7fd214c86ea4]
[318a1815ba11:00248] [ 8] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow14CreateOpKernelENS_10DeviceTypeEPNS_10DeviceBaseEPNS_9AllocatorEPNS_22FunctionLibraryRuntimeERKNS_7NodeDefEiPPNS_8OpKernelE+0x98d)[0x7fd2862f81cd]
[318a1815ba11:00248] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow21CreateNonCachedKernelEPNS_6DeviceEPNS_22FunctionLibraryRuntimeERKNS_7NodeDefEiPPNS_8OpKernelE+0xf2)[0x7fd28659ff52]
[318a1815ba11:00248] [10] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNS_7NodeDefEPNS_22FunctionLibraryRuntimeEPPNS_8OpKernelE+0x9a3)[0x7fd2865c0003]
[318a1815ba11:00248] [11] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow26FunctionLibraryRuntimeImpl12CreateKernelERKNS_7NodeDefEPPNS_8OpKernelE+0x18)[0x7fd2865c03f8]
[318a1815ba11:00248] [12] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(+0x60356a3)[0x7fd28d4886a3]
[318a1815ba11:00248] [13] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow9OpSegment12FindOrCreateERKSsS2_PPNS_8OpKernelESt8functionIFNS_6StatusES5_EE+0x1ba)[0x7fd2862f93ba]
[318a1815ba11:00248] [14] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(+0x6035c82)[0x7fd28d488c82]
[318a1815ba11:00248] [15] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x1141f58)[0x7fd2865aef58]
[318a1815ba11:00248] [16] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow16NewLocalExecutorERKNS_19LocalExecutorParamsESt10unique_ptrIKNS_5GraphESt14default_deleteIS5_EEPPNS_8ExecutorE+0x6b)[0x7fd2865b057b]
[318a1815ba11:00248] [17] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x114360d)[0x7fd2865b060d]
[318a1815ba11:00248] [18] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow11NewExecutorERKSsRKNS_19LocalExecutorParamsESt10unique_ptrIKNS_5GraphESt14default_deleteIS7_EEPS5_INS_8ExecutorES8_ISB_EE+0x66)[0x7fd2865b0e56]
[318a1815ba11:00248] [19] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(_ZN10tensorflow13DirectSession15CreateExecutorsERKNS_15CallableOptionsEPSt10unique_ptrINS0_16ExecutorsAndKeysESt14default_deleteIS5_EEPS4_INS0_12FunctionInfoES6_ISA_EEPNS0_12RunStateArgsE+0xd31)[0x7fd28d49acd1]
[318a1815ba11:00248] [20] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(_ZN10tensorflow13DirectSession12MakeCallableERKNS_15CallableOptionsEPx+0x129)[0x7fd28d49d5a9]
[318a1815ba11:00248] [21] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10SessionRef12MakeCallableERKNS_15CallableOptionsEPx+0x31d)[0x7fd3041d4fed]
[318a1815ba11:00248] [22] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xec3a2)[0x7fd3041ce3a2]
[318a1815ba11:00248] [23] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x8db9a)[0x7fd30416fb9a]
[318a1815ba11:00248] [24] python(PyCFunction_Call+0xfa)[0x5f5bda]
[318a1815ba11:00248] [25] python(_PyObject_MakeTpCall+0x296)[0x5f6706]
[318a1815ba11:00248] [26] python(_PyEval_EvalFrameDefault+0x5db3)[0x571143]
[318a1815ba11:00248] [27] python(_PyFunction_Vectorcall+0x1b6)[0x5f5ee6]
[318a1815ba11:00248] [28] python[0x59c39d]
[318a1815ba11:00248] [29] python(_PyObject_MakeTpCall+0x1ff)[0x5f666f]
[318a1815ba11:00248] *** End of error message ***

Execution status: FAIL
2024-12-05 11:03:47,046 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
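
The failure appears to originate in DALI's NVML calls ("nvml error (3)", which seems to correspond to NVML_ERROR_NOT_SUPPORTED). As a sanity check, here is a small script I can run inside the same container to see which NVML queries actually work under WSL2. It is only a diagnostic sketch: it assumes the nvidia-ml-py (pynvml) package is available, and the particular queries are just examples.

    # Probe NVML from inside the TAO container running under WSL2.
    # Some NVML queries may be unsupported on WSL2 and return error
    # code 3 (NVML_ERROR_NOT_SUPPORTED), which looks like what DALI
    # is reporting above.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print("Device:", pynvml.nvmlDeviceGetName(handle))

    checks = [
        ("driver version", pynvml.nvmlSystemGetDriverVersion),
        ("CUDA compute capability",
         lambda: pynvml.nvmlDeviceGetCudaComputeCapability(handle)),
        ("memory info", lambda: pynvml.nvmlDeviceGetMemoryInfo(handle)),
    ]
    for name, query in checks:
        try:
            print(name, "->", query())
        except pynvml.NVMLError as err:
            print(name, "-> NVML error:", err)

    pynvml.nvmlShutdown()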

My setup is as follows:

  • WSL2 Distribution: Ubuntu 20.04 (WSL version 2)
  • CUDA Toolkit Version: 12.2
  • NVIDIA Driver Version: 566.14
  • GPU: NVIDIA GeForce RTX 4090
  • NVIDIA DALI Version: 0.31.0
  • Python Version: 3.7.0

I would appreciate any insights into what might be causing this issue.
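
In case it helps narrow things down, below is a minimal decode-only DALI pipeline I could try inside the same container, to check whether the NVML failure reproduces outside of the TAO training loop. The image directory is a placeholder, and it assumes the DALI build inside the container exposes the fn.readers / fn.decoders API (the decoders__Image operator named in the trace suggests a 1.x release rather than the 0.31.0 listed above).

    # Minimal GPU JPEG-decode pipeline, independent of TAO/TensorFlow.
    # The image directory below is a placeholder.
    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn

    @pipeline_def(batch_size=4, num_threads=1, device_id=0)
    def decode_pipe(image_dir):
        jpegs, labels = fn.readers.file(file_root=image_dir)
        # "mixed" decodes on the GPU (nvJPEG), the same kind of decoder
        # the failing decoders__Image operator builds.
        images = fn.decoders.image(jpegs, device="mixed")
        return images, labels

    pipe = decode_pipe("/workspace/tao-experiments/data/training/image_2")
    pipe.build()
    images, labels = pipe.run()
    print("Decoded a batch of", len(images), "images")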
