
DINO: Error executing job with overrides


Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) RTX 3080ti
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) DINO
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) 5.3.0
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

Training specs:

train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  optim:
    lr_backbone: 2e-05
    lr: 2e-4
    lr_steps: [11]
    momentum: 0.9
  num_epochs: 12
  precision: fp16
dataset:
  train_data_sources:
    - image_dir: /ws/mm_trainer/data/pgie/train/images
      json_file: /ws/mm_trainer/data/pgie/train/train.json
  val_data_sources:
    - image_dir: /ws/mm_trainer/data/pgie/valid/images
      json_file: /ws/mm_trainer/data/pgie/valid/valid.json
  num_classes: 6
  batch_size: 4
  workers: 1
  augmentation:
    fixed_padding: False
model:
  backbone: fan_small
  train_backbone: False
  pretrained_backbone_path: /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
  num_feature_levels: 4
  dec_layers: 6
  enc_layers: 6
  num_queries: 300
  num_select: 100
  dropout_ratio: 0.0
  dim_feedforward: 2048

Reproduce

docker run --runtime=nvidia -it --ipc=host -v /home/tmp/Documents:/ws nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt /bin/bash
dino train -e /ws/tao_trainer/dino/train_total.yaml results_dir=/ws/tao_trainer/dino/training_models -k detection

Logs

sys:1: UserWarning: 
'train_total.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'train_total.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /ws/tao_trainer/dino/training_models/train
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Loaded pretrained weights from /ws/tao_trainer/dino/fan_small_hybrid_nvimagenet.pth
_IncompatibleKeys(missing_keys=['out_norm1.weight', 'out_norm1.bias', 'out_norm2.weight', 'out_norm2.bias', 'out_norm3.weight', 'out_norm3.bias', 'learnable_downsample.weight', 'learnable_downsample.bias'], unexpected_keys=['norm.weight', 'norm.bias', 'head.fc.weight', 'head.fc.bias'])
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:240: UserWarning: Log file already exists at /ws/tao_trainer/dino/training_models/train/status.json
  rank_zero_warn(
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /ws/tao_trainer/dino/training_models/train/lightning_logs
Serializing 95898 elements to byte tensors and concatenating them all ...
Serialized dataset takes 23.86 MiB
Serializing 13820 elements to byte tensors and concatenating them all ...
Serialized dataset takes 3.40 MiB
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:604: UserWarning: Checkpoint directory /ws/tao_trainer/dino/training_models/train exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name           | Type             | Params
----------------------------------------------------
0 | model          | DINOModel        | 48.1 M
1 | matcher        | HungarianMatcher | 0     
2 | criterion      | SetCriterion     | 0     
3 | box_processors | PostProcess      | 0     
----------------------------------------------------
19.7 M    Trainable params
28.4 M    Non-trainable params
48.1 M    Total params
96.206    Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Sanity Checking DataLoader 0:   0%|                                                                                                                                                   | 0/2 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:459: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/functional.py:507: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3549.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Sanity Checking DataLoader 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.16it/s]
 Validation mAP : 0.0


 Validation mAP50 : 0.0

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 24 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Training: 0it [00:00, ?it/s]Starting Training Loop.
Epoch 0:   0%|                                                                                                                                                                    | 0/27429 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py:199: UserWarning: torch.range is deprecated and will be removed in a future release because its behavior is inconsistent with Python's range builtin. Instead, use torch.arange, which produces values in [start, end).
  t = torch.range(0, len(targets[i]['labels']) - 1).long().cuda()
Epoch 0:  62%|███████████████████████████████████████████████████████████████████████████████▉                                                | 17132/27429 [1:58:45<1:11:22,  2.40it/s, loss=37.8, v_num=0]
Error executing job with overrides: ['encryption_key=threat_detection', 'results_dir=/ws/tao_trainer/dino/training_models']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 222, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 204, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/scripts/train.py", line 188, in run_experiment
    trainer.fit(pt_model, dm, ckpt_path=resume_ckpt or None)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1342, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
    closure_result = closure()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 147, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 133, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 406, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 378, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/pl_dino_model.py", line 203, in training_step
    loss_dict = self.criterion(outputs, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/criterion.py", line 173, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/dino/model/matcher.py", line 89, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/deformable_detr/utils/box_ops.py", line 80, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Epoch 0:  62%|███████████████████████████████████████████████████████████████████████████████▉                                                | 17132/27429 [1:58:45<1:11:22,  2.40it/s, loss=37.8, v_num=0]
Execution status: FAIL
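
For reference, the assertion that fails (box_ops.py, line 80) checks that every box still satisfies x_max >= x_min and y_max >= y_min after the cxcywh-to-xyxy conversion, so a degenerate box in the COCO annotations (negative width or height, or NaN values) could trigger it. Below is a minimal sketch, not part of the TAO Toolkit, for scanning an annotation file for such boxes; it assumes the standard COCO bbox format [x, y, width, height] and reuses the train.json path from the spec above:

import json

# Hypothetical sanity check (not a TAO command): scan a COCO-format annotation file
# for degenerate boxes that could violate the xyxy assertion in generalized_box_iou.
json_path = "/ws/mm_trainer/data/pgie/train/train.json"  # path taken from the spec above

with open(json_path) as f:
    coco = json.load(f)

bad = []
for ann in coco.get("annotations", []):
    x, y, w, h = ann["bbox"]  # standard COCO format: [x, y, width, height]
    # A negative width/height yields x_max < x_min (or y_max < y_min) after the
    # cx,cy,w,h -> x1,y1,x2,y2 conversion; zero-sized boxes are also worth flagging.
    # "not (w > 0 and h > 0)" also catches NaN, since NaN comparisons are False.
    if not (w > 0 and h > 0):
        bad.append((ann.get("image_id"), ann.get("id"), ann["bbox"]))

print(f"{len(bad)} annotations with non-positive (or NaN) width/height")
for image_id, ann_id, bbox in bad[:20]:
    print(f"  image_id={image_id} ann_id={ann_id} bbox={bbox}")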
