Hello, I am trying to integrate WandB with Classification-TF2 following this tutorial: TAO WandB Integration - NVIDIA Docs.
While it works for DetectNet-v2, it does not work for Classification-TF2.
Here are the steps to reproduce:
- Log in to the wandb account with:
import os
os.environ["WANDB_API_KEY"] = "my api key"
import wandb
WANDB_LOGGED_IN = wandb.login()
if WANDB_LOGGED_IN:
    print("WANDB successfully logged in.")
- Set the ~/.tao_mounts.json file as follows:
{
    "Mounts": [
        {
            "source": "/home/nvidia/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/nvidia/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs",
            "destination": "/workspace/tao-experiments/classification_tf2/tao_voc/specs"
        }
    ],
    "DockerOptions": {},
    "Envs": [
        {
            "variable": "WANDB_API_KEY",
            "value": "my api key"
        }
    ]
}
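A quick throwaway check on the host (not part of the TAO workflow) confirms that the mounts file is valid JSON and that the WANDB_API_KEY entry is declared for the container:
import json
import os

# Parse ~/.tao_mounts.json and list the env variables declared for the container.
with open(os.path.expanduser("~/.tao_mounts.json")) as f:
    mounts = json.load(f)

print([env["variable"] for env in mounts.get("Envs", [])])
# Prints: ['WANDB_API_KEY']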
- Set the training spec file spec.yaml as follows:
results_dir: '/workspace/tao-experiments/classification_tf2/output'
dataset:
  train_dataset_path: "/workspace/tao-experiments/data/split/training_set"
  val_dataset_path: "/workspace/tao-experiments/data/split/val_set"
  preprocess_mode: 'torch'
  num_classes: 2
  augmentation:
    enable_color_augmentation: True
    enable_center_crop: True
train:
  qat: False
  checkpoint: ''
  batch_size_per_gpu: 32
  num_epochs: 120
  optim_config:
    optimizer: 'adam'
  lr_config:
    scheduler: 'cosine'
    learning_rate: 0.05
    soft_start: 0.05
  reg_config:
    type: 'L2'
    scope: ['conv2d', 'dense']
    weight_decay: 0.00005
  wandb:
    entity: "name_of_entity"
    name: "name_of_the_experiment"
    project: "name_of_the_project"
model:
  backbone: 'efficientnet-b0'
  input_width: 256
  input_height: 256
  input_channels: 3
  input_image_depth: 8
evaluate:
  dataset_path: "/workspace/tao-experiments/data/split/test_set"
  checkpoint: "/workspace/tao-experiments/classification_tf2/output/train/efficientnet-b0_098.tlt"
  top_k: 1
  batch_size: 256
  n_workers: 8
prune:
  checkpoint: '/workspace/tao-experiments/classification_tf2/output/train/efficientnet-b0_120.tlt'
  threshold: 0.68
  byom_model_path: ''
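Since YAML indentation is easy to get wrong, the spec can be sanity-checked on the host with a small PyYAML snippet (this only assumes the wandb block is meant to sit under train, as written above):
import yaml

# Parse the spec and print the wandb block; the path below is the
# host-side specs directory from ~/.tao_mounts.json.
with open("/home/nvidia/tao_tutorials/notebooks/tao_launcher_starter_kit/classification_tf2/tao_voc/specs/spec.yaml") as f:
    spec = yaml.safe_load(f)

print(spec["train"]["wandb"])
# {'entity': 'name_of_entity', 'name': 'name_of_the_experiment', 'project': 'name_of_the_project'}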
- Train the model in the sample Jupyter notebook with this command:
!tao model classification_tf2 train -e $SPECS_DIR/spec.yaml
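For reference, $SPECS_DIR in the notebook points at the container-side specs path from the second mount entry (my assumption about how the sample notebook maps it):
import os

# Container-side path of the specs directory, matching the second entry
# in ~/.tao_mounts.json (assumed to be what the notebook sets SPECS_DIR to).
os.environ["SPECS_DIR"] = "/workspace/tao-experiments/classification_tf2/tao_voc/specs"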
Then this error shows up:
2024-12-12 09:24:28,366 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-12-12 09:24:28,437 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2
2024-12-12 09:24:28,462 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/nvidia/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-12-12 09:24:28,462 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-12-12 00:24:30.011626: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-12 00:24:30.011686: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-12 00:24:30.013371: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-12 00:24:30.020559: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Train results will be saved at: /workspace/tao-experiments/classification_tf2/output/train
wandb: Currently logged in as: 99 (99-personal). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Initializing wandb.
wandb: Currently logged in as: 99. Use `wandb login --relogin` to force relogin
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1176, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 633, in init
    run = Run(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 566, in __init__
    self._init(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 676, in _init
    self._config._update(config, ignore_locked=True)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 177, in _update
    sanitized = self._sanitize_dict(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 264, in _sanitize_dict
    k, v = self._sanitize(k, v, allow_val_change)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_config.py", line 282, in _sanitize
    val = json_friendly_val(val)
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 671, in json_friendly_val
    converted = asdict(val)
  File "/usr/lib/python3.10/dataclasses.py", line 1238, in asdict
    return _asdict_inner(obj, dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1245, in _asdict_inner
    value = _asdict_inner(getattr(obj, f.name), dict_factory)
  File "/usr/lib/python3.10/dataclasses.py", line 1275, in _asdict_inner
    return type(obj)((_asdict_inner(k, dict_factory),
TypeError: first argument must be callable or None
Problem at: <frozen common.mlops.wandb> 119 initialize_wandb
Wandb logging failed with error An unexpected error occurred
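From the traceback, the TypeError appears to come from dataclasses.asdict() choking on a dict-subclass field (e.g. a defaultdict) inside the config object that wandb tries to sanitize during init. A minimal sketch that reproduces the same error (FakeConfig is hypothetical, not the actual TAO experiment config class):
from collections import defaultdict
from dataclasses import dataclass, field, asdict


# Any dataclass carrying a defaultdict field hits the same failure:
# dataclasses.asdict() rebuilds dict fields with type(obj)(items), and
# defaultdict's first positional argument must be a callable or None.
@dataclass
class FakeConfig:
    params: dict = field(default_factory=lambda: defaultdict(list, {"a": [1]}))


asdict(FakeConfig())
# TypeError: first argument must be callable or None
So my guess is that some field of the Classification-TF2 experiment config ends up as a dict subclass that wandb cannot serialize, but I don't know which one or how to work around it.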
Thanks in advance for your support!
Please provide the following information when requesting support.
• Hardware (A40-16q)
• Network Type (Classification-TF2)
• TLT Version (
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024
)
• Training spec file(as shown above)
• How to reproduce the issue ? (as shown above)