• Hardware (Tesla P40)
• Network Type (Classification)
• nvidia-tao version: 5.2.0.1
I am running the classification_tf2 example from v5.1.0, and my command is:
tao model classification_tf2 train -e path/to/spec/bind/mount
but I am getting this error:
2024-01-22 18:48:31,242 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-01-22 18:48:31,502 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
2024-01-22 18:48:33,158 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-01-22 13:18:35.096853: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Error executing job with overrides: []
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
_ = ret.return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
return task_function(a_config, *a_args, **a_kwargs)
File "<frozen cv.classification.scripts.train>", line 215, in main
File "<frozen common.utils>", line 62, in update_results_dir
File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 369, in __getitem__
self._format_and_raise(
File "/usr/local/lib/python3.8/dist-packages/omegaconf/base.py", line 190, in _format_and_raise
format_and_raise(
File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 741, in format_and_raise
_raise(ex, cause)
File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 719, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set end OC_CAUSE=1 for full backtrace
File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 367, in __getitem__
return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 438, in _get_impl
node = self._get_node(key=key, throw_on_missing_key=True)
File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 465, in _get_node
self._validate_get(key)
File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 166, in _validate_get
self._format_and_raise(
File "/usr/local/lib/python3.8/dist-packages/omegaconf/base.py", line 190, in _format_and_raise
format_and_raise(
File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 821, in format_and_raise
_raise(ex, cause)
File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 719, in _raise
raise ex.with_traceback(sys.exc_info()[2]) # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigKeyError: Key 'results_dir' is not in struct
full_key: train.results_dir
object_type=dict
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in <module>
File "<frozen cv.classification.scripts.train>", line 221, in <module>
File "<frozen common.hydra.hydra_runner>", line 99, in wrapper
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
run_and_report(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
assert mdl is not None
AssertionError
Sending telemetry data.
Execution status: FAIL
2024-01-22 18:48:54,045 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
I have put the key results_dir in the spec file; here it is:
results_dir: '/workspace/'
encryption_key: 'nvidia_tlt'
dataset:
  train_dataset_path: "/workspace/tao-experiments/data/split/train"
  val_dataset_path: "/workspace/tao-experiments/data/split/val"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: False
  checkpoint: ''
  batch_size_per_gpu: 64
  num_epochs: 5
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
model:
  backbone: 'byom'
  input_width: 227
  input_height: 227
  input_channels: 3
  input_image_depth: 8
  byom_model: '/workspace/tao-experiments/gender_net.tltb'
evaluate:
  dataset_path: "/workspace/tao-experiments/data/split/test"
  checkpoint: "/workspace/tao-experiments/class_net.tltb"
  top_k: 3
  batch_size: 256
  n_workers: 8
prune:
  checkpoint: '/workspace/tao-experiments/class_net.tltb'
  threshold: 0.68
  byom_model_path: '/workspace/tao-experiments/class_net.tltb'
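To illustrate what I think the error means (this is just my reading of the traceback, not TAO internals): OmegaConf refuses the lookup of train.results_dir because the struct-locked config's train section has no results_dir key, even though I set one at the top level. A minimal, standalone reproduction of that OmegaConf behaviour, using a made-up config layout:

    from omegaconf import OmegaConf
    from omegaconf.errors import ConfigKeyError

    # Hypothetical layout for illustration only: results_dir exists at the
    # top level but not inside the 'train' section.
    cfg = OmegaConf.create({
        "results_dir": "/workspace/",
        "train": {"num_epochs": 5},
    })
    OmegaConf.set_struct(cfg, True)  # struct mode forbids access to missing keys

    try:
        _ = cfg["train"]["results_dir"]  # same item access the traceback shows
    except ConfigKeyError as err:
        print(err)  # Key 'results_dir' is not in struct; full_key: train.results_dir

If that matches what update_results_dir does internally, it would suggest the schema in the 5.0.0-tf2.11.0 container expects results_dir per task section rather than only at the top level, but I can't confirm that from the log alone.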
Any idea what’s causing the issue?