Please provide the following information when requesting support.
• Hardware : T4
• Network Type : ActionRecognitionNet
• Training spec file(If have, please share here)
results_dir: /results/joint_2d
encryption_key: nvidia_tao
model:
  model_type: joint
  backbone: resnet_18
  rgb_seq_length: 32
  rgb_pretrained_model_path: /workspace/tao-experiments/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_2d_rgb_hmdb5_32.tlt
  input_height: 224
  of_seq_length: 32
  of_pretrained_model_path: /workspace/tao-experiments/pretrained/resnet18_2d_of_hmdb5_32_a100.tlt
  input_width: 224
  input_type: 2d
  sample_strategy: consecutive
  dropout_ratio: 0.0
  of_pretrained_num_classes: 2
  rgb_pretrained_num_classes: 2
dataset:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    normal: 0
    shoplifting: 1
  batch_size: 16
  workers: 4
  clips_per_video: 5
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False
train:
  optim:
    lr: 0.001
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: MultiStep
    lr_steps: [5, 15, 20]
    lr_decay: 0.1
  num_epochs: 20
  checkpoint_interval: 1
evaluate:
  checkpoint: "??"
  test_dataset_dir: "??"
inference:
  checkpoint: "??"
  inference_dataset_dir: "??"
export:
  checkpoint: "??"
• How to reproduce the issue?
(launcher) root:~/training$ tao model action_recognition train -e /specs/train_joint_2d.yaml -k nvidia_tao results_dir=/results/joint_2d
2024-11-02 04:15:47,317 [TAO Toolkit] [INFO] root 160:
2024-11-02 04:15:47,407 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container:
2024-11-02 04:15:47,435 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/contact/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-11-02 04:15:47,435 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
sys:1: UserWarning:
'train_joint_2d.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'train_joint_2d.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
loading trained weights from /workspace/tao-experiments/pretrained/actionrecognitionnet_vtrainable_v1.0/resnet18_2d_rgb_hmdb5_32.tlt
Error executing job with overrides: ['encryption_key=nvidia_tao', 'results_dir=/results/joint_2d']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 142, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 124, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/scripts/train.py", line 40, in run_experiment
    ar_model = ActionRecognitionModel(experiment_config)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/pl_ar_model.py", line 49, in __init__
    self._build_model(experiment_spec, export)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/pl_ar_model.py", line 68, in _build_model
    self.model = build_ar_model(experiment_config=experiment_spec,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/build_nn_model.py", line 76, in build_ar_model
    model = JointModel(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/ar_model.py", line 225, in __init__
    self.model_rgb = get_basemodel(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/ar_model.py", line 110, in get_basemodel
    model = resnet2d(backbone=backbone,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/action_recognition/model/resnet.py", line 276, in resnet2d
    model.load_state_dict(model_dict)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ResNet:
    size mismatch for fc_cls.weight: copying a param with shape torch.Size([5, 512]) from checkpoint, the shape in current model is torch.Size([2, 512]).
    size mismatch for fc_cls.bias: copying a param with shape torch.Size([5]) from checkpoint, the shape in current model is torch.Size([2]).
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-11-02 04:16:03,741 - TAO Toolkit - root - ERROR] Execution status: FAIL
2024-11-02 04:16:04,665 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
I am trying to train the joint ActionRecognitionNet model to recognize 2 classes, but training fails with the shape mismatch error shown above. I have already trained the RGB-only model successfully. I also tried changing the number of pretrained classes in the config file to 5, but I got the same error.
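To make the mismatch concrete, here is a minimal sketch of the shape comparison that the traceback reports. It is plain Python and does not use TAO or PyTorch; `find_shape_mismatches` is a hypothetical helper written only for illustration, and the shapes are copied from the error message above. It only restates what the error says (the checkpoint's `fc_cls` layer has 5 outputs, while the model being built has 2) and is not a fix.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(checkpoint: Dict[str, Shape],
                          model: Dict[str, Shape]) -> List[str]:
    """Hypothetical helper (not a TAO or PyTorch API).

    Returns one message for every parameter that exists in both
    dictionaries but has a different shape in each.
    """
    messages = []
    for name, ckpt_shape in checkpoint.items():
        model_shape = model.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            messages.append(
                f"size mismatch for {name}: "
                f"checkpoint {list(ckpt_shape)}, model {list(model_shape)}"
            )
    return messages


# Shapes taken from the RuntimeError in the traceback.
checkpoint_shapes = {"fc_cls.weight": (5, 512), "fc_cls.bias": (5,)}
model_shapes = {"fc_cls.weight": (2, 512), "fc_cls.bias": (2,)}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for line in mismatches:
    print(line)
```

Both classifier parameters are reported, which matches the two "size mismatch" lines in the log.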