
Deformable DETR model keeps failing to train


Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): RTX 4090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : D-DETR
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) : 5.2.0
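
The instance information below is the output of the launcher's info command (TAO 5.x renamed the old tlt entry point to tao, so the invocation shown here is the 5.x form):

tao info --verbose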

Configuration of the TAO Toolkit Instance

task_group:         
    model:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.0.0-tf2.11.0:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. classification_tf2
                        2. efficientdet_tf2
                5.0.0-tf1.15.5:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. bpnet
                        2. classification_tf1
                        3. converter
                        4. detectnet_v2
                        5. dssd
                        6. efficientdet_tf1
                        7. faster_rcnn
                        8. fpenet
                        9. lprnet
                        10. mask_rcnn
                        11. multitask_classification
                        12. retinanet
                        13. ssd
                        14. unet
                        15. yolo_v3
                        16. yolo_v4
                        17. yolo_v4_tiny
                5.2.0-pyt2.1.0:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. action_recognition
                        2. centerpose
                        3. deformable_detr
                        4. dino
                        5. mal
                        6. ml_recog
                        7. ocdnet
                        8. ocrnet
                        9. optical_inspection
                        10. pointpillars
                        11. pose_classification
                        12. re_identification
                        13. visual_changenet
                5.2.0-pyt1.14.0:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. classification_pyt
                        2. segformer
    dataset:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.2.0-data-services:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. augmentation
                        2. auto_label
                        3. annotations
                        4. analytics
    deploy:             
        dockers:                 
            nvidia/tao/tao-toolkit:                     
                5.2.0-deploy:                         
                    docker_registry: nvcr.io
                    tasks: 
                        1. visual_changenet
                        2. centerpose
                        3. classification_pyt
                        4. classification_tf1
                        5. classification_tf2
                        6. deformable_detr
                        7. detectnet_v2
                        8. dino
                        9. dssd
                        10. efficientdet_tf1
                        11. efficientdet_tf2
                        12. faster_rcnn
                        13. lprnet
                        14. mask_rcnn
                        15. ml_recog
                        16. multitask_classification
                        17. ocdnet
                        18. ocrnet
                        19. optical_inspection
                        20. retinanet
                        21. segformer
                        22. ssd
                        23. trtexec
                        24. unet
                        25. yolo_v3
                        26. yolo_v4
                        27. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.2.0
published_date: 12/06/2023

• Training spec file (If you have one, please share it here):

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-5
    lr: 2e-4
    lr_steps: [10, 20, 30, 40]
    momentum: 0.9
  num_epochs: 50
  precision: fp16
dataset:
  train_data_sources:
    - image_dir: /home/xint/TAO-Toolkit/D-DETR/data/raw-data/train2017/
      json_file: /home/xint/TAO-Toolkit/D-DETR/data/raw-data/annotations/instances_train2017.json
  val_data_sources:
    - image_dir: /home/xint/TAO-Toolkit/D-DETR/data/raw-data/val2017/
      json_file: /home/xint/TAO-Toolkit/D-DETR/data/raw-data/annotations/instances_val2017.json
  num_classes: 91
  batch_size: 2
  workers: 8
  augmentation:
    fixed_padding: False
model:
  backbone: resnet_50
  train_backbone: True
  pretrained_backbone_path: /home/xint/TAO-Toolkit/D-DETR/results/pretrained_deformable_detr_nvimagenet_vresnet50/resnet50_nvimagenetv2.pth.tar
  num_feature_levels: 2
  return_interm_indices: [1, 2]
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  with_box_refine: True
  dropout_ratio: 0.3
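
For completeness, training was launched from the notebook, which boils down to the standard CLI call below (a sketch: the spec path and the results_dir override are assumptions based on my directory layout, not a copy of the exact notebook cell):

tao model deformable_detr train \
    -e /home/xint/TAO-Toolkit/D-DETR/specs/train.yaml \
    results_dir=/home/xint/TAO-Toolkit/D-DETR/results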

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.):
I really didn't change anything except the precision and batch size in the spec file, and the notebook file name, but the training process keeps failing. Here is the training log.

For multi-GPU, change num_gpus in train.yaml based on your machine or pass --gpus to the cli.
For multi-node, change num_gpus and num_nodes in train.yaml based on your machine or pass --num_nodes to the cli.
2024-01-16 16:33:41,593 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-01-16 16:33:41,694 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt2.1.0
2024-01-16 16:33:41,812 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/xint/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-01-16 16:33:41,812 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
INFO: Generating grammar tables from /usr/lib/python3.10/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.10/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
INFO: Generating grammar tables from /usr/lib/python3.10/lib2to3/Grammar.txt
INFO: Generating grammar tables from /usr/lib/python3.10/lib2to3/PatternGrammar.txt
/usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
sys:1: UserWarning: 
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
<frozen core.hydra.hydra_runner>:-1: UserWarning: 
'train.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /home/xint/TAO-Toolkit/D-DETR/results/train
Loaded pretrained weights from /home/xint/TAO-Toolkit/D-DETR/results/pretrained_deformable_detr_nvimagenet_vresnet50/resnet50_nvimagenetv2.pth.tar
<All keys matched successfully>
<frozen core.loggers.api_logging>:245: UserWarning: Log file already exists at /home/xint/TAO-Toolkit/D-DETR/results/train/status.json
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Serializing 118287 elements to byte tensors and concatenating them all ...
Serialized dataset takes 74.13 MiB
Serializing 5000 elements to byte tensors and concatenating them all ...
Serialized dataset takes 3.15 MiB
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /home/xint/TAO-Toolkit/D-DETR/results/train exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type             | Params
----------------------------------------------------
0 | model          | DDModel          | 20.1 M
1 | matcher        | HungarianMatcher | 0     
2 | criterion      | SetCriterion     | 0     
3 | box_processors | PostProcess      | 0     
----------------------------------------------------
19.8 M    Trainable params
222 K     Non-trainable params
20.1 M    Total params
40.141    Total estimated model params size (MB)
Sanity Checking DataLoader 0:   0%|                       | 0/2 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3516.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Sanity Checking DataLoader 0: 100%|███████████████| 2/2 [00:00<00:00,  2.29it/s]
 Validation mAP : 0.0


 Validation mAP50 : 0.0

Training: 0it [00:00, ?it/s]
Starting Training Loop.
Epoch 0:  96%|█████▊| 59143/61643 [1:24:39<03:34, 11.64it/s, loss=23.1, v_num=1]
Validation: 0it [00:00, ?it/s]
Validation:   0%|                                      | 0/2500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                         | 0/2500 [00:00<?, ?it/s]
Epoch 0:  96%|█████▊| 59144/61643 [1:24:41<03:34, 11.64it/s, loss=23.1, v_num=1]
Epoch 0: 100%|██████| 61643/61643 [1:26:47<00:00, 11.84it/s, loss=23.1, v_num=1]
 Validation mAP : 0.0002509435487899291


 Validation mAP50 : 0.0005570636727930148

Epoch 0: 100%|█| 61643/61643 [1:26:50<00:00, 11.83it/s, loss=23.1, v_num=1, val_
Train and Val metrics generated.
Epoch 0: 100%|█| 61643/61643 [1:26:51<00:00, 11.83it/s, loss=23.1, v_num=1, val_
Training loop in progress
Epoch 1:  96%|▉| 59143/61643 [1:23:29<03:31, 11.81it/s, loss=22.1, v_num=1, val_
Validation: 0it [00:00, ?it/s]
Validation:   0%|                                      | 0/2500 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                         | 0/2500 [00:00<?, ?it/s]
Epoch 1:  96%|▉| 59144/61643 [1:23:33<03:31, 11.80it/s, loss=22.1, v_num=1, val_
Epoch 1: 100%|█| 61643/61643 [1:25:40<00:00, 11.99it/s, loss=22.1, v_num=1, val_
 Validation mAP : 0.000244808033156113


 Validation mAP50 : 0.0005862341813355832

Epoch 1: 100%|█| 61643/61643 [1:25:43<00:00, 11.98it/s, loss=22.1, v_num=1, val_
Train and Val metrics generated.
Epoch 1: 100%|█| 61643/61643 [1:25:44<00:00, 11.98it/s, loss=22.1, v_num=1, val_
Training loop in progress
Epoch 2:  49%|▍| 30116/61643 [9:05:59<9:31:34,  1.09s/it, loss=22.7, v_num=1, val_
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: FAIL
2024-01-17 04:32:53,046 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

I even tried running the training from the terminal and directly inside the TAO docker container, and both attempts failed after 3-4 epochs.

What could be the reason here? I have attached the notebook file just in case.

D-DETR.zip (28.0 KB)
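
As a side note on the permissions warning near the top of the log: per the message, a "user" entry can be added to the DockerOptions section of /home/xint/.tao_mounts.json. A minimal sketch of that file (the mount paths and the 1000:1000 UID:GID are placeholder assumptions; the real values come from id -u and id -g):

{
    "Mounts": [
        {
            "source": "/home/xint/TAO-Toolkit/D-DETR",
            "destination": "/home/xint/TAO-Toolkit/D-DETR"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}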
