Please provide the following information when requesting support.
- Hardware: GeForce RTX 4090 Laptop GPU
- Software: Ubuntu 22.04
- Network Type: centerpose_fan from centerpose_synth quickstart notebook, no changes made
- TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): tlt is not installed locally for this notebook; the containers in use are ISAAC Sim 4.0.0 and the nvidia/tao/tao-toolkit 5.5.0 dataset/deploy/model images
- Training spec file: default spec file from notebook train_synthetic.yaml:
results_dir: /results
dataset:
  train_data: /data/results/images
  val_data: /data/results/images
  num_classes: 1
  batch_size: 4
  workers: 8
  category: "pallet"
  num_symmetry: 1
  max_objs: 10
train:
  num_gpus: 1
  validation_interval: 20
  checkpoint_interval: ${train.validation_interval}
  num_epochs: 40
  clip_grad_val: 100.0
  seed: 317
  pretrained_model_path: /results/pretrained_models/centerpose_vtrainable_fan_small/centerpose_trainable_FAN_small.pth
  precision: "fp32"
  optim:
    lr: 6e-05
    lr_steps: [90, 120]
model:
  down_ratio: 4
  use_pretrained: False
  backbone:
    model_type: fan_small
    pretrained_backbone_path: /results/pretrained_models/centerpose_vtrainable_fan_small/centerpose_trainable_FAN_small.pth
- Tao Mounts file:
{
    "Mounts": [
        {
            "source": "/home/mb/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/mb/tao-experiments/data/centerpose",
            "destination": "/data"
        },
        {
            "source": "/home/mb/tao_tutorials/notebooks/tao_launcher_starter_kit/centerpose/specs",
            "destination": "/specs"
        },
        {
            "source": "/home/mb/tao-experiments/centerpose/results",
            "destination": "/results"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "network": "host"
    }
}
- How to reproduce the issue?
Run the latest centerpose_synthetic notebook, specifically the step
print("For multi-GPU, change train.num_gpus in train.yaml based on your machine.")
# If you face out of memory issue, you may reduce the batch size in the spec file by passing dataset.batch_size=2
!tao model centerpose train \
    -e $SPECS_DIR/train_synthetic.yaml \
    results_dir=$RESULTS_DIR/
This step throws the following error with the default dataset:
Error executing job with overrides: ['results_dir=/results/']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
    runner(cfg, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/scripts/train.py", line 84, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/scripts/train.py", line 70, in run_experiment
    trainer.fit(pt_model, dm, ckpt_path=resume_ckpt)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 986, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1030, in _run_stage
    self._run_sanity_check()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1059, in _run_sanity_check
    val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/strategy.py", line 412, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/model/pl_centerpose_model.py", line 136, in validation_step
    self.val_cp_evaluator.evaluate(final_output, batch)
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/centerpose/utils/centerpose_evaluator.py", line 279, in evaluate
    center = np.asarray(anns['AR_data']['plane_center'])
KeyError: 'plane_center'
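The last frame shows the evaluator reading `anns['AR_data']['plane_center']` during the validation sanity check, so at least one annotation appears to lack that key. A minimal sketch for checking this on the host is below. It rests on assumptions not confirmed by the notebook: that the annotations are `.json` files stored somewhere under the image directory, and that each one is a JSON object with a top-level `AR_data` entry as the traceback suggests. The helper name `find_missing_plane_center` is hypothetical.

```python
import json
from pathlib import Path


def find_missing_plane_center(data_dir):
    """Return the JSON files under data_dir whose AR_data has no 'plane_center'.

    Assumption: annotations are JSON objects with a top-level 'AR_data' dict,
    as implied by the traceback. Unreadable or non-object files are skipped.
    """
    missing = []
    for path in sorted(Path(data_dir).rglob("*.json")):
        try:
            with open(path) as f:
                anns = json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # not a readable JSON file, ignore it
        if not isinstance(anns, dict):
            continue  # not an annotation-style object
        ar_data = anns.get("AR_data")
        # Flag the file if AR_data is absent or lacks the key the evaluator reads.
        if not isinstance(ar_data, dict) or "plane_center" not in ar_data:
            missing.append(path)
    return missing
```

With the mounts above, `/data/results/images` in the container would correspond to `/home/mb/tao-experiments/data/centerpose/results/images` on the host, so calling `find_missing_plane_center` on that path and printing the result should list the files that trigger the `KeyError`. Any JSON file in that tree that is not an annotation will also be listed.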
What should be modified so that the default notebook runs successfully?
Best regards