Please provide the following information when requesting support.
• Hardware: NVIDIA A10G
• Network Type: EfficientDet-d0
• TLT Version: 5.5.0-tf2
I have successfully trained EfficientDet on my custom dataset with the following command:
docker run -d --rm --gpus all -v /mnt/rod_efs/:/workspace/tao-experiments nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2 efficientdet_tf2 train -e /workspace/tao-experiments/tao/specs/train.yaml results_dir=/workspace/tao-experiments/tao/results/training num_gpus=4
The accuracy I got at the last epoch (200) is the following:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.364
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.720
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.327
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.006
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.324
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.405
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.107
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.442
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.538
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.350
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.495
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.581
However, when I run the command below to evaluate the model, I get essentially 0 accuracy in all metrics.
tao model efficientdet_tf2 evaluate -e /workspace/tao-experiments/tao/specs/train.yaml results_dir=/workspace/tao-experiments/tao/results/training evaluate.checkpoint=/workspace/tao-experiments/tao/results/training/train/efficientdet-d0_200.tlt
2025-04-04 05:38:36,139 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-04-04 05:38:36,213 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-tf2
2025-04-04 05:38:36,240 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-04-04 05:38:37.417230: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9373] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-04-04 05:38:37.417299: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-04-04 05:38:37.418948: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1534] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-04 05:38:37.426734: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2025-04-04 05:38:40,820 - TAO Toolkit - matplotlib - WARNING] Matplotlib created a temporary cache directory at /tmp/matplotlib-ierd380j because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[2025-04-04 05:38:41,010 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
/usr/local/lib/python3.10/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:
TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).
For more information see: https://github.com/tensorflow/addons/issues/2807
warnings.warn(
/usr/local/lib/python3.10/dist-packages/tensorflow_addons/utils/ensure_tf_install.py:53: UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.12.0 and strictly below 2.15.0 (nightly versions are not supported).
The versions of TensorFlow you are currently using is 2.15.0 and is not supported.
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version.
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
warnings.warn(
sys:1: UserWarning:
'train38.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
<frozen common.hydra.hydra_runner>:-1: UserWarning:
'train38.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Evaluate results will be saved at: /workspace/tao-experiments/tao/results/retail_object_detection/training38_2/evaluate
Starting efficientdet evaluation.
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x77887d832830> and will run it as-is.
Cause: Unable to locate the source code of <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x77887d832830>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x77887d832830> and will run it as-is.
Cause: Unable to locate the source code of <function CocoDataset.__call__.<locals>._prefetch_dataset at 0x77887d832830>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d8328c0> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d8328c0>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d8328c0> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d8328c0>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d833400> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d833400>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d833400> and will run it as-is.
Cause: could not parse the source code of <function CocoDataset.__call__.<locals>.<lambda> at 0x77887d833400>: no matching AST found among candidates:
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x77887d367730>> and will run it as-is.
Cause: Unable to locate the source code of <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x77887d367730>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x77887d367730>> and will run it as-is.
Cause: Unable to locate the source code of <bound method ImageResizeLayer.call of <nvidia_tao_tf2.cv.efficientdet.layers.image_resize_layer.ImageResizeLayer object at 0x77887d367730>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/tensorflow/python/util/dispatch.py:1260: resize_nearest_neighbor (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.image.resize(...method=ResizeMethod.NEAREST_NEIGHBOR...)` instead.
From /usr/local/lib/python3.10/dist-packages/tensorflow/python/util/dispatch.py:1260: resize_nearest_neighbor (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.image.resize(...method=ResizeMethod.NEAREST_NEIGHBOR...)` instead.
WARNING:tensorflow:AutoGraph could not transform <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x77887d8d8fa0>> and will run it as-is.
Cause: Unable to locate the source code of <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x77887d8d8fa0>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x77887d8d8fa0>> and will run it as-is.
Cause: Unable to locate the source code of <bound method WeightedFusion.call of <nvidia_tao_tf2.cv.efficientdet.layers.weighted_fusion_layer.WeightedFusion object at 0x77887d8d8fa0>>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function run_experiment.<locals>.eval_model_fn at 0x77887bb2c670> and will run it as-is.
Cause: Unable to locate the source code of <function run_experiment.<locals>.eval_model_fn at 0x77887bb2c670>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
AutoGraph could not transform <function run_experiment.<locals>.eval_model_fn at 0x77887bb2c670> and will run it as-is.
Cause: Unable to locate the source code of <function run_experiment.<locals>.eval_model_fn at 0x77887bb2c670>. Note that functions defined in certain environments, like the interactive Python shell, do not expose their source code. If that is the case, you should define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.experimental.do_not_convert. Original error: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
use max_nms_inputs for pre-nms topk.
15/16 [===========================>..] - ETA: 0s
loading annotations into memory...Done (t=0.02s)
creating index...
index created!
Loading and preparing results...
Converting ndarray to lists...
(12800, 7)
0/12800
DONE (t=0.03s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=1.27s).
Accumulating evaluation results...
DONE (t=0.07s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.003
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.007
Evaluation finished successfully.
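For reference, the checkpoint passed via evaluate.checkpoint above is the last one written by the training run. Below is a minimal sketch to double-check it is actually on disk; the host-side path is an assumption based on the -v /mnt/rod_efs/:/workspace/tao-experiments mount in the train command:

import os
from glob import glob

# Assumption: the host directory /mnt/rod_efs is what the docker command above
# mounts at /workspace/tao-experiments, so container paths map back like this.
HOST_ROOT = "/mnt/rod_efs"
ckpt_container = "/workspace/tao-experiments/tao/results/training/train/efficientdet-d0_200.tlt"
ckpt_host = ckpt_container.replace("/workspace/tao-experiments", HOST_ROOT)

print("checkpoint exists:", os.path.isfile(ckpt_host))

# List every saved checkpoint in the same train directory
# (one every 10 epochs, per checkpoint_interval in the spec below).
for path in sorted(glob(os.path.join(os.path.dirname(ckpt_host), "*.tlt"))):
    print(path, os.path.getsize(path), "bytes")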
The path to the test tfrecords and annotations is the same in both cases because I am using the same spec file (a quick check of the annotation file is sketched below). I have seen other topics discussing a similar issue, but did not find a solution there. I have also tried evaluating checkpoints from different epochs.
Any ideas? What is different in the implementation of the evaluate function compared to the evaluation that runs during training?
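To rule out a broken annotation file, this is the kind of minimal check I would run against the val_json_file from the spec below (plain Python and json only; the container path is the one from the spec, swap in the host mount if running outside docker):

import json
from collections import Counter

# val_json_file from the spec below (container path; replace the prefix with the
# host mount, e.g. /mnt/rod_efs, if running outside the container).
ANN_PATH = "/workspace/tao-experiments/tao/dataset/dataset_2025-26-03T1647_1742968074/coco/annotations/instances_test.json"

with open(ANN_PATH) as f:
    coco = json.load(f)

print("images:     ", len(coco["images"]))
print("annotations:", len(coco["annotations"]))
print("categories: ", [(c["id"], c["name"]) for c in coco["categories"]])
# With num_classes: 2 in the spec, all boxes should fall into these categories.
print("boxes per category:", Counter(a["category_id"] for a in coco["annotations"]))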
My spec file:
dataset:
  loader:
    prefetch_size: 4
    shuffle_file: False
    shuffle_buffer: 10000
    cycle_length: 32
    block_length: 16
  max_instances_per_image: 100
  skip_crowd_during_training: True
  use_fake_data: False
  num_classes: 2
  train_tfrecords:
    - "/workspace/tao-experiments/tao/dataset/dataset_2025-26-03T1647_1742968074/tfrecords/train"
  val_tfrecords:
    - "/workspace/tao-experiments/tao/dataset/dataset_2025-26-03T1647_1742968074/tfrecords/test"
  val_json_file: "/workspace/tao-experiments/tao/dataset/dataset_2025-26-03T1647_1742968074/coco/annotations/instances_test.json"
  augmentation:
    rand_hflip: True
    random_crop_min_scale: 0.1
    random_crop_max_scale: 2
    auto_color_distortion: False
    auto_translate_xy: False
train:
  optimizer:
    name: 'sgd'
    momentum: 0.9
  lr_schedule:
    name: 'cosine'
    warmup_epoch: 1
    warmup_init: 0.0001
    learning_rate: 0.2
    annealing_epoch: 10
  amp: False
  num_examples_per_epoch: 106
  checkpoint: "/workspace/tao-experiments/tao/models/pre-trained_object_detection/pretrained_efficientdet_tf2_efficientnet_b0/"
  #checkpoint: "/workspace/tao-experiments/tao/models/retail_object_detection/retail_object_detection_trainable_binary_v1.1/" #pre-trained retail
  #checkpoint: "/workspace/tao-experiments/tao/models/pre-trained_object_detection/pretrained_efficientdet_vefficientnet_b0/efficientnet_b0.hdf5"
  moving_average_decay: 0.999
  batch_size: 8
  checkpoint_interval: 10
  l2_weight_decay: 0.00004
  l1_weight_decay: 0.0
  clip_gradients_norm: 10.0
  image_preview: False
  qat: False
  random_seed: 42
  pruned_model_path: ''
  num_epochs: 200
  label_smoothing: 0.0
  box_loss_weight: 50.0
  iou_loss_type: 'giou'
  iou_loss_weight: 1.0
model:
  name: 'efficientdet-d0'
  aspect_ratios: '[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]'
  anchor_scale: 4
  min_level: 3
  max_level: 7
  num_scales: 3
  freeze_bn: False
  freeze_blocks: []
  input_width: 800
  input_height: 608
evaluate:
  batch_size: 8
  num_samples: 128
  max_detections_per_image: 100
  checkpoint: ''
encryption_key: 'nvidia_tlt'
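One cross-check that this spec is the one the evaluate run actually reads: num_samples: 128 with batch_size: 8 matches the 16 batches in the progress bar above, and 128 images times max_detections_per_image: 100 matches the (12800, 7) prediction array in the log. Trivial arithmetic, but it suggests the evaluate block is being picked up correctly:

# Values copied from the evaluate block above.
num_samples = 128   # evaluate.num_samples
batch_size = 8      # evaluate.batch_size
max_dets = 100      # evaluate.max_detections_per_image

print("eval batches:", num_samples // batch_size)   # 16 -> the "15/16" progress bar
print("prediction rows:", num_samples * max_dets)   # 12800 -> the "(12800, 7)" array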
Thanks