Error in TAO-toolkit classification_tf2 train

• Hardware: Tesla P40
• Network Type: Classification
• nvidia-tao version: 5.2.0.1

I am running the classification_tf2 example from v5.1.0, and my command is:

tao model classification_tf2 train -e path/to/spec/bind/mount 
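
(The spec path above is just a placeholder; the real call points at the container-side path that the launcher's bind mounts expose, something like the following, where the path is purely illustrative:)

tao model classification_tf2 train -e /workspace/tao-experiments/specs/spec.yaml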

but I am getting this error:

2024-01-22 18:48:31,242 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-01-22 18:48:31,502 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0
2024-01-22 18:48:33,158 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-01-22 13:18:35.096853: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Error executing job with overrides: []
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 368, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 88, in _patched_hydra_run
    return PatchHydra._original_hydra_run(self, config_name, task_function, overrides, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 110, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/usr/local/lib/python3.8/dist-packages/clearml/binding/hydra_bind.py", line 170, in _patched_task_function
    return task_function(a_config, *a_args, **a_kwargs)
  File "<frozen cv.classification.scripts.train>", line 215, in main
  File "<frozen common.utils>", line 62, in update_results_dir
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 369, in __getitem__
    self._format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 741, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 367, in __getitem__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 438, in _get_impl
    node = self._get_node(key=key, throw_on_missing_key=True)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 465, in _get_node
    self._validate_get(key)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/dictconfig.py", line 166, in _validate_get
    self._format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/base.py", line 190, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 821, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/_utils.py", line 719, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set end OC_CAUSE=1 for full backtrace
omegaconf.errors.ConfigKeyError: Key 'results_dir' is not in struct
    full_key: train.results_dir
    object_type=dict

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_tf2/cv/classification/scripts/train.py>", line 3, in <module>
  File "<frozen cv.classification.scripts.train>", line 221, in <module>
  File "<frozen common.hydra.hydra_runner>", line 99, in wrapper
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 367, in _run_hydra
    run_and_report(
  File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 251, in run_and_report
    assert mdl is not None
AssertionError
Sending telemetry data.
Execution status: FAIL
2024-01-22 18:48:54,045 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

I have put the results_dir key in the spec file; here it is:

results_dir: '/workspace/'
encryption_key: 'nvidia_tlt'
dataset:
  train_dataset_path: "/workspace/tao-experiments/data/split/train"
  val_dataset_path: "/workspace/tao-experiments/data/split/val"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: False
  checkpoint: ''
  batch_size_per_gpu: 64
  num_epochs: 5
  optim_config:
    optimizer: 'sgd'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
model:
  backbone: 'byom'
  input_width: 227
  input_height: 227
  input_channels: 3
  input_image_depth: 8
  byom_model: '/workspace/tao-experiments/gender_net.tltb'
evaluate:
  dataset_path: "/workspace/tao-experiments/data/split/test"
  checkpoint: "/workspace/tao-experiments/class_net.tltb"
  top_k: 3
  batch_size: 256
  n_workers: 8
prune:
  checkpoint: '/workspace/tao-experiments/class_net.tltb'
  threshold: 0.68
  byom_model_path: '/workspace/tao-experiments/class_net.tltb'
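
Since the error reports full_key: train.results_dir, I also wonder whether each task section needs its own results_dir on top of the global one. This is roughly what I mean (I am not sure this is the schema the tf2 container actually expects, and the paths below are only illustrative):

train:
  results_dir: '/workspace/results/train'
  # ...rest of the train section as above
evaluate:
  results_dir: '/workspace/results/evaluate'
  # ...rest of the evaluate section as above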

Any idea what’s causing the issue?
