Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
RTX 4090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Dino as configured in the tao launcher notebook
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.5.0
• Training spec file(If have, please share here)
Defaults from the notebook/github.
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I have exactly the same issue as this user: Very low evaluation results for dino model by dino.ipynb in tao-getting-started_v5.3
I just went through the notebook and could reproduce the issue on two different systems, one running Ubuntu 20.04 and one running Ubuntu 22.04.
After 12 epochs, the AP is around zero, while it should be around 50. That thread suggested two things:
- to check the category ID numbering, but since this is COCO 2017, the IDs are already correct (a quick check is sketched below)
- to increase model.num_queries from 300 back to 900
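For completeness, this is the kind of category check I mean; a minimal sketch in Python, assuming pycocotools is installed and the default /data/raw-data paths from the notebook:

# Sanity-check the COCO 2017 category IDs against the spec's num_classes.
from pycocotools.coco import COCO

coco = COCO("/data/raw-data/annotations/instances_val2017.json")
cat_ids = sorted(coco.getCatIds())
print("number of categories:", len(cat_ids))                    # 80 for COCO 2017
print("lowest/highest category id:", cat_ids[0], cat_ids[-1])   # 1 and 90
# num_classes: 91 in the spec matches the maximum COCO category ID (90) plus one,
# even though only 80 of those IDs are actually used.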
How does the author of the notebook validate that it actually works if the claimed AP cannot be reproduced? These are other warnings reported while running the notebook:
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Loaded pretrained weights from /workspace/tao-experiments/dino/pretrained_dino_nvimagenet_vfan_small_hybrid_nvimagenet/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
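As far as I can tell, the _IncompatibleKeys line by itself is not an error: it is just what PyTorch's load_state_dict returns when called with strict=False, and dropping the classification head/norm keys of an ImageNet backbone is expected when it is reused for detection. A minimal sketch of that generic PyTorch pattern (not the actual TAO loading code), assuming torchvision is available:

import torchvision

# Loading a pretrained state_dict non-strictly reports missing/unexpected
# keys via _IncompatibleKeys instead of raising an error.
model = torchvision.models.resnet18()
pretrained = torchvision.models.resnet18(weights="DEFAULT").state_dict()
state = {k: v for k, v in pretrained.items() if not k.startswith("fc.")}
result = model.load_state_dict(state, strict=False)
print(result)  # _IncompatibleKeys(missing_keys=['fc.weight', 'fc.bias'], unexpected_keys=[])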
To save memory, I set the precision to fp16, so I was not running entirely with defaults, but this change is allowed per the NVIDIA DINO documentation (DINO - NVIDIA Docs). My full edits to the train.yaml file are:
diff --git a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
index e63f8cd..e9b6d22 100644
--- a/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
+++ b/notebooks/tao_launcher_starter_kit/dino/specs/train.yaml
@@ -7,7 +7,9 @@ train:
lr: 2e-4
lr_steps: [11]
momentum: 0.9
num_epochs: 12
+ precision: fp16
+ activation_checkpoint: True
dataset:
train_data_sources:
- image_dir: /data/raw-data/train2017/
@@ -17,9 +19,9 @@ dataset:
json_file: /data/raw-data/annotations/instances_val2017.json
num_classes: 91
batch_size: 4
- workers: 8
+ workers: 16
augmentation:
- fixed_padding: False
+ fixed_padding: True
model:
backbone: fan_small
train_backbone: True
I have now restarted training this setup with model.num_queries: 900, but if this fixes it, the notebook should be updated upstream.
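In case it helps anyone reproducing this, this is roughly how I apply that override by script instead of editing the spec by hand; a minimal sketch assuming PyYAML and that train.yaml sits at specs/train.yaml (adjust the path to your SPECS_DIR):

import yaml

# Set model.num_queries back to 900 in the DINO training spec.
spec_path = "specs/train.yaml"  # assumed location, adjust as needed
with open(spec_path) as f:
    spec = yaml.safe_load(f)

spec.setdefault("model", {})["num_queries"] = 900

with open(spec_path, "w") as f:
    yaml.safe_dump(spec, f, sort_keys=False)

print("model.num_queries =", spec["model"]["num_queries"])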