Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) RTX 3080 Ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) DINO
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) 5.3.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I am trying to train DINO on my custom dataset, following the documentation from NGC and the TAO docs. After spending a whole day on it, I still get the errors below. Please help me check what is going wrong.
Specs
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/train/images
      json_file: /ws/tao_trainer/data/dino/train/train.json
  val_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/valid/images
      json_file: /ws/tao_trainer/data/dino/valid/valid.json
  num_classes: 6
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
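The Hydra error in the log below says "Key 'results_dir' is not in struct" when the -r flag tries to set results_dir. My guess, based only on the hint in the log ("To append to your config use +results_dir=..."), is that the spec may also need a top-level results_dir entry, something like:

results_dir: /ws/tao_trainer/dino/training_models

I have not confirmed that this is the intended fix, so please correct me if the -r flag alone should be enough.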
Reproduce
docker run -it --rm --gpus all -v /home/tmp/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt dino train -e /ws/tao_trainer/dino/train.yml -r /ws/tao_trainer/dino/training_models -k threat_detection --gpus 1
===========================
=== TAO Toolkit PyTorch ===
===========================
NVIDIA Release 5.3.0-PyT (build 76438008)
TAO Toolkit Version 5.3.0
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement
WARNING: CUDA Minor Version Compatibility mode ENABLED.
Using driver version 530.41.03 which has support for CUDA 12.1. This container
was built with CUDA 12.3 and will be run in Minor Version Compatibility mode.
CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
with this container but was unavailable:
[[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for TAO Toolkit. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
/usr/local/lib/python3.10/dist-packages/hydra/plugins/config_source.py:124: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
deprecation_warning(
Could not override 'results_dir'.
To append to your config use +results_dir=/ws/tao_trainer/dino/training_models
Key 'results_dir' is not in struct
full_key: results_dir
object_type=dict
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Execution status: FAIL
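For what it is worth, if I follow the NOTE in the log about the SHMEM limit, I assume the command should be rerun with the recommended flags added (everything else unchanged; I have not verified that this changes the result):

docker run -it --rm --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/tmp/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt dino train -e /ws/tao_trainer/dino/train.yml -r /ws/tao_trainer/dino/training_models -k threat_detection --gpus 1

I also see the Hydra warning about the .yml extension being deprecated, so I plan to rename train.yml to train.yaml as well.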