Channel: TAO Toolkit - NVIDIA Developer Forums

Cannot integrate WandB with Classification-TF2

Hello, I am trying to integrate WandB with Classification-TF2 by following this tutorial: TAO WandB Integration - NVIDIA Docs.

The integration works for DetectNet-v2, but it fails for Classification-TF2.

Here are the steps to reproduce.

  1. Log in to the wandb account with
import os
import wandb

# Set the API key before calling wandb.login() so the login can pick it up.
os.environ["WANDB_API_KEY"] = "my api key"
WANDB_LOGGED_IN = wandb.login()
if WANDB_LOGGED_IN:
    print("WANDB successfully logged in.")
  2. Set the ~/.tao_mounts.json file to
{
    "Mounts": [
        {
            "source": "/home/nvidia/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/nvidia/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs",
            "destination": "/workspace/tao-experiments/classification_tf2/tao_voc/specs"
        }
    ],
    "DockerOptions": {},
    "Envs": [
        {
            "variable": "WANDB_API_KEY",
            "value": "my api key"
        }
    ]
}
  3. Set the training spec file spec.yaml to
results_dir: '/workspace/tao-experiments/classification_tf2/output'
dataset:
  train_dataset_path: "/workspace/tao-experiments/data/split/training_set"
  val_dataset_path: "/workspace/tao-experiments/data/split/val_set"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: False
  checkpoint: ''
  batch_size_per_gpu: 32
  num_epochs: 120
  optim_config:
    optimizer: 'adam'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
  wandb:
    entity: "name_of_entity"
    name: "name_of_the_experiment"
    project: "name_of_the_project"
model:
  backbone: 'efficientnet-b0'
  input_width: 256
  input_height: 256
  input_channels: 3
  input_image_depth: 8
evaluate:
  dataset_path: "/workspace/tao-experiments/data/split/test_set"
  checkpoint: "/workspace/tao-experiments/classification_tf2/output/train/efficientnet-b0_098.tlt"
  top_k: 1
  batch_size: 256
  n_workers: 8
prune:
  checkpoint: '/workspace/tao-experiments/classification_tf2/output/train/efficientnet-b0_120.tlt'
  threshold: 0.68
  byom_model_path: ''
  4. Train the model in the sample Jupyter notebook with this command: !tao model classification_tf2 train -e $SPECS_DIR/spec.yaml

  5. Training then prints this error:

2024-12-12 09:24:28,366 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-12-12 09:24:28,437 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2
2024-12-12 09:24:28,462 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/nvidia/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-12-12 09:24:28,462 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-12-12 00:24:30.011626: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-12 00:24:30.011686: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-12 00:24:30.013371: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-12 00:24:30.020559: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Train results will be saved at: /workspace/tao-experiments/classification_tf2/output/train
wandb: Currently logged in as: 99 (99-personal). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Initializing wandb.
wandb: Currently logged in as: 99. Use `wandb login --relogin` to force relogin
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1176, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 633, in init
    run = Run(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 566, in __init__
    self._init(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 676, in _init
    self._config._update(config, ignore_locked=True)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 177, in _update
    sanitized = self._sanitize_dict(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 264, in _sanitize_dict
    k, v = self._sanitize(k, v, allow_val_change)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 282, in _sanitize
    val = json_friendly_val(val)
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 671, in json_friendly_val
    converted = asdict(val)
  File "/usr/lib/python3.10/dataclasses.py", line 1238, in asdict
    return _asdict_inner(obj, dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1275, in _asdict_inner
    return type(obj)((_asdict_inner(k, dict_factory),
TypeError: first argument must be callable or None
Problem at: <frozen common.mlops.wandb> 119 initialize_wandb
Wandb logging failed with error An unexpected error occurred
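
A note on the traceback: the last frame is the dict branch of dataclasses.asdict in Python 3.10, which rebuilds a mapping with type(obj)(...). That expression raises this exact TypeError when the mapping is a collections.defaultdict, because defaultdict expects a callable default_factory (or None) as its first argument, not an iterable of pairs. Whether the config that TAO passes to wandb actually contains a defaultdict is an assumption and is not confirmed by the log. The helper name rebuild_like_asdict is hypothetical and only mimics that one line. A minimal sketch:

```python
from collections import defaultdict


def rebuild_like_asdict(obj):
    """Mimic the dict branch seen in the traceback: type(obj)(<pairs generator>)."""
    return type(obj)((k, v) for k, v in obj.items())


# A plain dict is rebuilt without trouble.
plain = {"project": "name_of_the_project"}
rebuilt_plain = rebuild_like_asdict(plain)

# A defaultdict is not: its constructor treats the generator as default_factory.
nested = defaultdict(dict, plain)
try:
    rebuild_like_asdict(nested)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this is indeed the cause, the failure would come from the type of a config value that reaches wandb, not from the wandb credentials, which the log shows were accepted.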

Thanks in advance for your support!


Please provide the following information when requesting support.

• Hardware (A40-16q)
• Network Type (Classification-TF2)
• TLT Version (
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
)
• Training spec file(as shown above)
• How to reproduce the issue ? (as shown above)