
The tao 5.5.0 launcher dino notebook is not working with default settings


Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
RTX 4090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Dino as configured in the tao launcher notebook
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.5.0
• Training spec file(If have, please share here)
Defaults from the notebook/GitHub.
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

I have exactly the same issue as this user: Very low evaluation results for dino model by dino.ipynb in tao-getting-started_v5.3

I just went through the notebook and could reproduce the issue on two different systems, one running Ubuntu 20.04 and one running Ubuntu 22.04.

After 12 epochs, the AP is around zero, while it should be around 50. That thread suggested two things:

  • to check the category ID numbering, but since this is COCO 2017, the IDs are already set correctly (see the sketch below)
  • to increase num_queries from 300 back to 900

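For reference, this is how the IDs can be checked (a minimal sketch, assuming the standard COCO instances JSON layout and the annotation path used by the notebook):

import json

# Annotation file path as used in the notebook's dataset setup
ann_path = "/data/raw-data/annotations/instances_val2017.json"

with open(ann_path) as f:
    categories = json.load(f)["categories"]

ids = sorted(c["id"] for c in categories)
print(f"{len(ids)} categories, ids from {ids[0]} to {ids[-1]}")
# COCO 2017 has 80 categories with non-contiguous ids up to 90,
# which is consistent with num_classes: 91 in the spec (max id + 1).
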
How does the author of the notebook validate that it is actually correct, if the claimed AP cannot be reproduced? These are other warnings reported while running the notebook:

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None

No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.

Loaded pretrained weights from /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
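
To see which checkpoint keys those missing/unexpected entries correspond to, the .pth file can be inspected directly (a sketch, assuming it holds either a raw state_dict or one wrapped under a 'state_dict' key):

import torch

# Checkpoint path taken from the log line above
ckpt_path = ("/workspace/tao-experiments/dino/"
             "pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/"
             "fan_small_hybrid_nvimagenet.pth")

ckpt = torch.load(ckpt_path, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

# Print a few keys and tensor shapes to compare against the model definition
for key in list(state_dict)[:20]:
    print(key, tuple(state_dict[key].shape))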

To save memory, I set the precision to fp16, so I was not running entirely with the defaults, but this is allowed per the NVIDIA DINO documentation (DINO - NVIDIA Docs). My full edits to the train.yaml file are:

diff --git a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
index e63f8cd..e9b6d22 100644
--- a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
+++ b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
@@ -7,7 +7,9 @@ train:
     lr: 2e-4
     lr_steps: [11]
     momentum: 0.9
     num_epochs: 12
+  precision: fp16
+  activation_checkpoint: True
 dataset:
   train_data_sources:
     - image_dir: /data/raw-data/train2017/
@@ -17,9 +19,9 @@ dataset:
       json_file: /data/raw-data/annotations/instances_val2017.json
   num_classes: 91
   batch_size: 4
-  workers: 8
+  workers: 16
   augmentation:
-    fixed_padding: False
+    fixed_padding: True
 model:
   backbone: fan_small
   train_backbone: True

I have now restarted training with this setup plus model.num_queries: 900, but if that fixes it, the notebook should be updated upstream.
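
For anyone who wants to apply the same overrides without hand-editing the spec, here is a minimal sketch (assuming PyYAML is installed and the spec layout matches the diff above):

import yaml

# Spec path as laid out in the notebook repo; adjust if yours differs
spec_path = "notebooks/tao_launcher_starter_kit/dino/specs/train.yaml"

with open(spec_path) as f:
    spec = yaml.safe_load(f)

# Same overrides as the diff above, plus the num_queries change
spec["train"]["precision"] = "fp16"
spec["train"]["activation_checkpoint"] = True
spec["dataset"]["workers"] = 16
spec["dataset"]["augmentation"]["fixed_padding"] = True
spec["model"]["num_queries"] = 900  # notebook default is 300

with open(spec_path, "w") as f:
    yaml.safe_dump(spec, f, sort_keys=False)

Note that dumping the file this way drops any comments in the original spec, so hand-editing is still the safer option if you want to keep them.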
