
Fine Tuning DINO Retail Object detector - errors out as it expects unspecified/unknown configurations

Previous thread: Fine Tuning Retail Object Detection Models provided in NGC - #6 by Morganh, where we are attempting to fine-tune the DINO Retail Object detector with TAO 5.5.

I am getting the following error when trying to train the model in TAO 5.5.
It is looking for the configuration read by cudnn.benchmark = cfg["train"]["cudnn"]["benchmark"], but I cannot find any such configuration in the TAO DINO documentation.

 tao model dino train \
-e  /workspace/tao-experiments/specs/train.yml
2024-11-22 03:25:19,278 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-11-22 03:25:19,368 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-11-22 03:25:19,382 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-11-22 03:25:27,199 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
/usr/local/lib/python3.10/dist-packages/hydra/plugins/config_source.py:124: UserWarning: Support for .yml files is deprecated. Use .yaml extension for Hydra config files
  deprecation_warning(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /workspace/tao-experiments/results/trainings/training1/status.json
  rank_zero_warn(
Seed set to 1234
Train results will be saved at: /workspace/tao-experiments/results/trainings/training1
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 146, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 36, in run_experiment
    results_dir, resume_ckpt, gpus, ptl_loggers = initialize_train_experiment(experiment_config, key)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/initialize_experiments.py", line 56, in initialize_train_experiment
    cudnn.benchmark = cfg["train"]["cudnn"]["benchmark"]
omegaconf.errors.ConfigKeyError: Key 'cudnn' is not in struct
    full_key: train.cudnn
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-11-22 03:25:35,916 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-11-22 03:25:35,916 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-11-22 03:25:35,916 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'dino', 'gpu': ['Tesla-V100-SXM2-16GB'], 'success': False, 'time_lapsed': 8} to https://api.tao.ngc.nvidia.com.
[2024-11-22 03:25:37,147 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-11-22 03:25:37,148 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-11-22 03:25:37,148 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-11-22 03:25:38,297 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
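
If I read the traceback correctly, this is the usual OmegaConf struct-mode behaviour: reading a key that was never declared (neither in my spec nor in some default) raises ConfigKeyError instead of falling back to a default value. A minimal reproduction outside TAO (my own sketch, not TAO code) fails the same way:

from omegaconf import OmegaConf

# Minimal reproduction of the failure (plain OmegaConf, outside TAO):
# in struct mode, reading a key that was never declared raises
# ConfigKeyError instead of returning a default value.
cfg = OmegaConf.create({"train": {"num_gpus": 1, "num_epochs": 12}})
OmegaConf.set_struct(cfg, True)

try:
    benchmark = cfg["train"]["cudnn"]["benchmark"]
except Exception as err:  # omegaconf.errors.ConfigKeyError
    print(type(err).__name__, err)  # -> ConfigKeyError: Key 'cudnn' is not in struct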

And the following is the configuration file:

train:
  freeze: ['backbone', 'transformer.encoder']
  pretrained_model_path: /workspace/tao-experiments/models/retail_object_detection_vtrainable_retail_object_detection_binary_v2.2.2.3/dino_model_epoch011.pth
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  checkpoint_interval: 1
  seed: 1234
  results_dir: /workspace/tao-experiments/results/trainings/training1
  optim:
    lr_backbone: 1e-6
    lr: 1e-5
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
dataset:
  train_data_sources:
    - image_dir: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/train
      json_file: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/annotations/instances_train.json
  val_data_sources:
    - image_dir: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/test
      json_file: /workspace/tao-experiments/data/dataset_2024-22-11T0942_1732228936/annotations/instances_test.json
  num_classes: 2
  batch_size: 4
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: fan_base
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 900
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048
results_dir: /workspace/tao-experiments/results/trainings/training1
encryption_key: nvidia_tao

Based on the PyTorch repo, it seems it is also looking for other configurations such as cfg["train"]["cudnn"]["deterministic"] and cfg["train"]["cudnn"]["benchmark"], which are not defined in the documentation.
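
My guess is that the fix is to add a cudnn block under train in the spec file, something like the sketch below. The two values are assumptions on my part (the plain PyTorch cuDNN boolean flags), not documented defaults:

train:
  # ... existing train keys unchanged ...
  cudnn:
    benchmark: false      # assumption: leave the cuDNN autotuner off
    deterministic: false  # assumption: allow non-deterministic kernels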

  1. Can you please explain why I am getting these errors? (Don't they have default values specified?)
  2. If I am supposed to specify values, can you let me know the values for the above two configurations? Thanks.


