Channel: TAO Toolkit - NVIDIA Developer Forums

Cannot run Dino with tao-5.3.0


Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) RTX 3080ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) DINO
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) 5.3.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I am trying to train DINO on my custom dataset, following the documentation from NGC and the TAO docs. After spending a whole day on it, I still get errors such as the one below. Please help me check it.

Specs

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/train/images
      json_file: /ws/tao_trainer/data/dino/train/train.json
  val_data_sources:
    - image_dir: /ws/tao_trainer/data/dino/valid/images
      json_file: /ws/tao_trainer/data/dino/valid/valid.json
  num_classes: 6
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: True
  pretrained_backbone_path: /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

Reproduce

docker run -it --rm --gpus all -v /home/tmp/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt dino train -e /ws/tao_trainer/dino/train.yml -r /ws/tao_trainer/dino/training_models -k threat_detection --gpus 1
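The startup banner printed by this container notes that the default 64MB SHMEM limit may be insufficient and recommends `--ipc=host --ulimit memlock=-1 --ulimit stack=67108864`. A minimal sketch of assembling the same command with those flags added is shown here; `build_docker_cmd` is a hypothetical helper, not part of TAO, and the flags address only the shared-memory note, not the `results_dir` failure.

```python
import shlex

# Flags quoted from the container's own startup NOTE about shared memory.
RECOMMENDED_FLAGS = ["--ipc=host", "--ulimit", "memlock=-1",
                     "--ulimit", "stack=67108864"]


def build_docker_cmd(host_dir, image, tao_args, extra_flags=RECOMMENDED_FLAGS):
    """Return the docker argv for a TAO run (hypothetical helper)."""
    return (
        ["docker", "run", "-it", "--rm", "--gpus", "all"]
        + list(extra_flags)
        + ["-v", f"{host_dir}:/ws", image]  # same /ws mount as the post
        + list(tao_args)
    )


image = "nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt"
cmd = build_docker_cmd(
    "/home/tmp/Documents",
    image,
    ["dino", "train",
     "-e", "/ws/tao_trainer/dino/train.yml",
     "-r", "/ws/tao_trainer/dino/training_models",
     "-k", "threat_detection",
     "--gpus", "1"],
)
print(shlex.join(cmd))  # copy-pasteable shell command
```

Building the command as a list and joining it with `shlex.join` keeps the quoting correct if a path ever contains spaces.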

===========================
=== TAO Toolkit PyTorch ===
===========================

NVIDIA Release 5.3.0-PyT (build 76438008)
TAO Toolkit Version 5.3.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

WARNING: CUDA Minor Version Compatibility mode ENABLED.
  Using driver version 530.41.03 which has support for CUDA 12.1.  This container
  was built with CUDA 12.3 and will be run in Minor Version Compatibility mode.
  CUDA Forward Compatibility is preferred over Minor Version Compatibility for use
  with this container but was unavailable:
  [[System has unsupported display driver / cuda driver combination (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH) cuInit()=803]]
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

/usr/local/lib/python3.10/dist-packages/hydra/plugins/config_source.py:124: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
  deprecation_warning(
Could not override 'results_dir'.
To append to your config use +results_dir=/ws/tao_trainer/dino/training_models
Key 'results_dir' is not in struct
    full_key: results_dir
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Execution status: FAIL
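The message above follows Hydra's override rules: `key=value` may only change a key that already exists in the loaded config, while `+key=value` adds a new one. The toy model below illustrates that rule on a flat dict holding the three top-level keys of the spec; it is a sketch of the semantics only, not Hydra's implementation, and `apply_override` is a hypothetical name. Whether appending `results_dir` is the right fix for this TAO run is not established by the log.

```python
def apply_override(cfg, override):
    """Apply 'key=value' or '+key=value' to a flat dict (toy model)."""
    append = override.startswith("+")
    key, _, value = (override[1:] if append else override).partition("=")
    if append:
        # '+key=value' adds a new key and rejects one that already exists.
        if key in cfg:
            raise KeyError(f"Could not append '{key}': key already exists")
    elif key not in cfg:
        # Plain 'key=value' may only change an existing key.
        raise KeyError(f"Could not override '{key}': key is not in struct")
    cfg[key] = value
    return cfg


# Top-level keys taken from the spec in the post; values elided for brevity.
spec = {"train": {}, "dataset": {}, "model": {}}

try:
    apply_override(spec, "results_dir=/ws/tao_trainer/dino/training_models")
except KeyError as err:
    print("override failed:", err)

apply_override(spec, "+results_dir=/ws/tao_trainer/dino/training_models")
print(sorted(spec))
```

The spec in the post has no top-level `results_dir` key, which matches the plain override being rejected.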

6 posts - 2 participants
