Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc): A6000 x 2
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): MAL
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here): 5.2.0
• Training spec file (if you have one, please share here): see below
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.): Running tao dataset auto_label -e $SPEC_DIR/spec.yaml
Hi,
I'm running a very simple MAL job to get used to the API. I have a single image with an annotation in COCO format:
{
  "images": [
    {
      "width": 640,
      "height": 640,
      "id": 0,
      "file_name": "a5b58e42-1.png"
    }
  ],
  "categories": [
    {
      "id": 0,
      "name": "Roof"
    }
  ],
  "annotations": [
    {
      "id": 0,
      "image_id": 0,
      "category_id": 0,
      "segmentation": ,
      "bbox": [
        342.53521126760563,
        291.7285531370038,
        71.29321382842508,
        46.70934699103714
      ],
      "ignore": 0,
      "iscrowd": 0,
      "area": 3330.0594628181143
    }, …
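For reference, a rough sanity check of the annotation file (this is just a sketch; the /data/result.json path is assumed from the inference.ann_path in the spec below, so adjust it to your layout):

import json

# Load the COCO-format annotation file and make sure every annotation
# points at a known image and category. The path is an assumption taken
# from the spec file further down.
with open("/data/result.json") as f:
    coco = json.load(f)

image_ids = {img["id"] for img in coco["images"]}
category_ids = {cat["id"] for cat in coco["categories"]}

for ann in coco["annotations"]:
    assert ann["image_id"] in image_ids, f"annotation {ann['id']} references a missing image"
    assert ann["category_id"] in category_ids, f"annotation {ann['id']} references a missing category"

print(len(coco["images"]), "images,", len(coco["annotations"]), "annotations look consistent")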
Here is the spec file:
results_dir: '/results'
checkpoint: '/workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth'
strategy: 'ddp'
model:
  arch: 'vit-mae-base/16'
train:
  lr: 0.000001
  num_epochs: 10
  warmup_epochs: 0
  batch_size: 4
  use_amp: True
inference:
  load_mask: False
  ann_path: /data/result.json
  img_dir: /data/images
  label_dump_path: '/results/instances_mal.json'
dataset:
  crop_size: 512
  train_ann_path: /data/raw-data/annotations/instances_train2017.json
  train_img_dir: /data/raw-data/train2017/
  val_ann_path: /data/raw-data/annotations/instances_val2017.json
  val_img_dir: /data/raw-data/val2017/
evaluate:
  batch_size: 4
  use_teacher_test: False
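As a pre-flight check I also verify that every path referenced in the spec resolves inside the container (a rough sketch; "spec.yaml" is a placeholder for the file passed with -e):

import os
import yaml

# Load the spec and report any referenced path that does not exist inside
# the container. "spec.yaml" is a placeholder, not the real mounted path.
with open("spec.yaml") as f:
    cfg = yaml.safe_load(f)

paths = [
    cfg["checkpoint"],
    cfg["inference"]["ann_path"],
    cfg["inference"]["img_dir"],
    cfg["dataset"]["train_ann_path"],
    cfg["dataset"]["train_img_dir"],
    cfg["dataset"]["val_ann_path"],
    cfg["dataset"]["val_img_dir"],
]
for p in paths:
    print(p, "OK" if os.path.exists(p) else "MISSING")

All of these report OK, so the mounts themselves look fine.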
And here is the error:
Restoring states from the checkpoint path at /workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth
'state_dict'
Error executing job with overrides:
'state_dict'
Here's the full log. I've downloaded the pretrained model, and the job clearly finds and loads it, but it fails with a KeyError while restoring the checkpoint:
> 2024-02-19 14:11:21,277 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
> 2024-02-19 14:11:21,425 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-data-services
> 2024-02-19 14:11:21,577 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
> Matplotlib created a temporary cache directory at /tmp/matplotlib-uuo0ffxd because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
> /usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
> warnings.warn(
> /usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
> warnings.warn(
> Log file already exists at /results/status.json
> Starting Data-services Auto-label.
> Loading validation set…
> loading annotations into memory…
> Done (t=0.00s)
> creating index…
> index created!
> Validation set is loaded successfully.
> Loading pretrained weights…
> Loading pretrained weights…
> Using 16bit native Automatic Mixed Precision (AMP)
> GPU available: True (cuda), used: True
> TPU available: False, using: 0 TPU cores
> IPU available: False, using: 0 IPUs
> HPU available: False, using: 0 HPUs
> Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
> /usr/local/lib/python3.10/dist-packages/mmcv_full-1.7.1-py3.10-linux-x86_64.egg/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
> warnings.warn(
> Log file already exists at /results/status.json
> Starting Data-services Auto-label.
> Loading validation set…
> loading annotations into memory…
> Done (t=0.00s)
> creating index…
> index created!
> Validation set is loaded successfully.
> Loading pretrained weights…
> Loading pretrained weights…
> Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
> ----------------------------------------------------------------------------------------------------
> distributed_backend=nccl
> All distributed processes registered. Starting with 2 processes
> ----------------------------------------------------------------------------------------------------
>
> Missing logger folder: /results/lightning_logs
> Missing logger folder: /results/lightning_logs
> Restoring states from the checkpoint path at /workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth
> 'state_dict'
> Error executing job with overrides: []
> 'state_dict'
> Error executing job with overrides: []
> Traceback (most recent call last):
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 30, in main
> run_inference(cfg=cfg)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 90, in _func
> raise e
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 62, in _func
> runner(cfg, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 75, in run_inference
> trainer.validate(model, ckpt_path=cfg.checkpoint, dataloaders=data_loader.val_dataloader())
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 687, in validate
> return call._call_and_handle_interrupt(
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
> return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
> return function(*args, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 736, in _validate_impl
> results = self._run(model, ckpt_path=self.ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1042, in _run
> self._restore_modules_and_callbacks(ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _restore_modules_and_callbacks
> self._checkpoint_connector.restore_model()
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 271, in restore_model
> self.trainer.strategy.load_model_state_dict(self._loaded_checkpoint)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 363, in load_model_state_dict
> self.lightning_module.load_state_dict(checkpoint["state_dict"])
> KeyError: 'state_dict'
>
> Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
> Traceback (most recent call last):
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 30, in main
> run_inference(cfg=cfg)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 90, in _func
> raise e
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/core/decorators.py", line 62, in _func
> runner(cfg, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_ds/auto_label/scripts/generate.py", line 75, in run_inference
> trainer.validate(model, ckpt_path=cfg.checkpoint, dataloaders=data_loader.val_dataloader())
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 687, in validate
> return call._call_and_handle_interrupt(
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
> return trainer_fn(*args, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 736, in _validate_impl
> results = self._run(model, ckpt_path=self.ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1042, in _run
> self._restore_modules_and_callbacks(ckpt_path)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _restore_modules_and_callbacks
> self._checkpoint_connector.restore_model()
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 271, in restore_model
> self.trainer.strategy.load_model_state_dict(self._loaded_checkpoint)
> File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 363, in load_model_state_dict
> self.lightning_module.load_state_dict(checkpoint["state_dict"])
> KeyError: 'state_dict'
>
> Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
> Sending telemetry data.
> Execution status: FAIL
> 2024-02-19 14:11:49,668 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
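From the traceback, pytorch_lightning's trainer.validate() reads checkpoint["state_dict"] when restoring from ckpt_path, and that key is apparently missing from the downloaded file. A rough diagnostic sketch I can run to confirm what the .pth actually contains (same checkpoint path as in the spec):

import torch

# Inspect the top-level structure of the downloaded checkpoint to see whether
# it is a Lightning-style dict containing a "state_dict" entry or raw weights.
ckpt = torch.load(
    "/workspace/tao-experiments/PreTrained/pretrained_mask_auto_label_vvit-base/checkpoint-99.pth",
    map_location="cpu",
)
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:20])
else:
    print(type(ckpt))

If that prints raw parameter names rather than a dict with a "state_dict" entry, it would explain the KeyError above. Is the pretrained MAL checkpoint from NGC expected to work directly as ckpt_path here, or does it need to go through training first?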
Thank you.