Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc): A6000 x 2
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): MAL
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here): 5.2.0
• Training spec file (if you have one, please share here): see below
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.): Running tao dataset auto_label -e $SPEC_DIR/spec.yaml
Hi,
I'm running a very simple MAL job to get used to the API. I have a single image with an annotation in COCO format:
{
  "images": [
    {
      "width": 640,
      "height": 640,
      "id": 0,
      "file_name": "a5b58e42-1.png"
    }
  ],
  "categories": [
    {
      "id": 0,
      "name": "Roof"
    }
  ],
  "annotations": [
    {
      "id": 0,
      "image_id": 0,
      "category_id": 0,
      "segmentation": ,
      "bbox": [
        342.53521126760563,
        291.7285531370038,
        71.29321382842508,
        46.70934699103714
      ],
      "ignore": 0,
      "iscrowd": 0,
      "area": 3330.0594628181143
    }, …
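For reference, a rough sanity check of the annotation file (this is just a sketch; the /data/result.json path is assumed from the inference.ann_path in the spec below, so adjust it to your layout):

import json

# Load the COCO-format annotation file and make sure every annotation
# points at a known image and category. The path is an assumption taken
# from the spec file further down.
with open("/data/result.json") as f:
    coco = json.load(f)

image_ids = {img["id"] for img in coco["images"]}
category_ids = {cat["id"] for cat in coco["categories"]}

for ann in coco["annotations"]:
    assert ann["image_id"] in image_ids, f"annotation {ann['id']} references a missing image"
    assert ann["category_id"] in category_ids, f"annotation {ann['id']} references a missing category"

print(len(coco["images"]), "images,", len(coco["annotations"]), "annotations look consistent")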
Here is the spec file:
results_dir: '/results'
checkpoint: '/workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth'
strategy: 'ddp'
model:
  arch: 'vit-mae-base/16'
train:
  lr: 0.000001
  num_epochs: 10
  warmup_epochs: 0
  batch_size: 4
  use_amp: True
inference:
  load_mask: False
  ann_path: /data/result.json
  img_dir: /data/images
  label_dump_path: '/results/instances_mal.json'
dataset:
  crop_size: 512
  train_ann_path: /data/raw-data/annotations/instances_train2017.json
  train_img_dir: /data/raw-data/train2017/
  val_ann_path: /data/raw-data/annotations/instances_val2017.json
  val_img_dir: /data/raw-data/val2017/
evaluate:
  batch_size: 4
  use_teacher_test: False
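As a pre-flight check I also verify that every path referenced in the spec resolves inside the container (a rough sketch; "spec.yaml" is a placeholder for the file passed with -e):

import os
import yaml

# Load the spec and report any referenced path that does not exist inside
# the container. "spec.yaml" is a placeholder, not the real mounted path.
with open("spec.yaml") as f:
    cfg = yaml.safe_load(f)

paths = [
    cfg["checkpoint"],
    cfg["inference"]["ann_path"],
    cfg["inference"]["img_dir"],
    cfg["dataset"]["train_ann_path"],
    cfg["dataset"]["train_img_dir"],
    cfg["dataset"]["val_ann_path"],
    cfg["dataset"]["val_img_dir"],
]
for p in paths:
    print(p, "OK" if os.path.exists(p) else "MISSING")

All of these report OK, so the mounts themselves look fine.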
And here is the error:
Restoring states from the checkpoint path at /workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth
'state_dict'
Error executing job with overrides:
'state_dict'
Here's the full log. I've downloaded the pretrained model, and the job clearly finds and loads it, but it fails with a KeyError while restoring the checkpoint:
> 2024-02-19 14:11:21,277 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
> 2024-02-19 14:11:21,425 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-data-services
> 2024-02-19 14:11:21,577 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
> Matplotlib created a temporary cache directory at /tmp/matplotlib-uuo0ffxd because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
> /usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
> warnings.warn(
> /usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
> warnings.warn(
> Log file already exists at /results/status.json
> Starting Data-services Auto-label.
> Loading validation set…
> loading annotations into memory…
> Done (t=0.00s)
> creating index…
> index created!
> Validation set is loaded successfully.
> Loading pretrained weights…
> Loading pretrained weights…
> Using 16bit native Automatic Mixed Precision (AMP)
> GPU available: True (cuda), used: True
> TPU available: False, using: 0 TPU cores
> IPU available: False, using: 0 IPUs
> HPU available: False, using: 0 HPUs
> Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
> /usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
> warnings.warn(
> Log file already exists at /results/status.json
> Starting Data-services Auto-label.
> Loading validation set…
> loading annotations into memory…
> Done (t=0.00s)
> creating index…
> index created!
> Validation set is loaded successfully.
> Loading pretrained weights…
> Loading pretrained weights…
> Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
> ----------------------------------------------------------------------------------------------------
> distributed_backend=nccl
> All distributed processes registered. Starting with 2 processes
> ----------------------------------------------------------------------------------------------------
>
> Missing logger folder: /results/lightning_logs
> Missing logger folder: /results/lightning_logs
> Restoring states from the checkpoint path at /workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth
> 'state_dict'
> Error executing job with overrides: []
> 'state_dict'
> Error executing job with overrides: []
> Traceback (most recent call last):
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 30, in main
> run_inference(cfg=cfg)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 90, in _func
> raise e
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 62, in _func
> runner(cfg, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 75, in run_inference
> trainer.validate(model, ckpt_path=cfg.checkpoint, dataloaders=data_loader.val_dataloader())
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 687, in validate
> return call._call_and_handle_interrupt(
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
> return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
> return function(*args, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 736, in _validate_impl
> results = self._run(model, ckpt_path=self.ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1042, in _run
> self._restore_modules_and_callbacks(ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _restore_modules_and_callbacks
> self._checkpoint_connector.restore_model()
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 271, in restore_model
> self.trainer.strategy.load_model_state_dict(self._loaded_checkpoint)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 363, in load_model_state_dict
> self.lightning_module.load_state_dict(checkpoint["state_dict"])
> KeyError: 'state_dict'
>
> Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
> Traceback (most recent call last):
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 30, in main
> run_inference(cfg=cfg)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 90, in _func
> raise e
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 62, in _func
> runner(cfg, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 75, in run_inference
> trainer.validate(model, ckpt_path=cfg.checkpoint, dataloaders=data_loader.val_dataloader())
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 687, in validate
> return call._call_and_handle_interrupt(
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
> return trainer_fn(*args, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 736, in _validate_impl
> results = self._run(model, ckpt_path=self.ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1042, in _run
> self._restore_modules_and_callbacks(ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _restore_modules_and_callbacks
> self._checkpoint_connector.restore_model()
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 271, in restore_model
> self.trainer.strategy.load_model_state_dict(self._loaded_checkpoint)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 363, in load_model_state_dict
> self.lightning_module.load_state_dict(checkpoint["state_dict"])
> KeyError: 'state_dict'
>
> Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
> Sending telemetry data.
> Execution status: FAIL
> 2024-02-19 14:11:49,668 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
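From the traceback, pytorch_lightning's trainer.validate() reads checkpoint["state_dict"] when restoring from ckpt_path, and that key is apparently missing from the downloaded file. A rough diagnostic sketch I can run to confirm what the .pth actually contains (same checkpoint path as in the spec):

import torch

# Inspect the top-level structure of the downloaded checkpoint to see whether
# it is a Lightning-style dict containing a "state_dict" entry or raw weights.
ckpt = torch.load(
    "/workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth",
    map_location="cpu",
)
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:20])
else:
    print(type(ckpt))

If that prints raw parameter names rather than a dict with a "state_dict" entry, it would explain the KeyError above. Is the pretrained MAL checkpoint from NGC expected to work directly as ckpt_path here, or does it need to go through training first?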
Thank you.