Channel: TAO Toolkit - NVIDIA Developer Forums

Triton Server Error with TAO FasterRCNN model: Validation failed: libNamespace == nullptr

Please provide the following information when requesting support.

• Hardware: Ubuntu 22.04 RTX 4090
• Network Type: FasterRCNN TAO model
• TAO version: 5.5.0
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Training spec:

# Copyright (c) 2017-2020, NVIDIA CORPORATION.  All rights reserved.
random_seed: 42

verbose: True
model_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_height_width {
      height: 640
      width: 640
    }
    image_channel_mean {
      key: 'b'
      value: 103.939
    }
    image_channel_mean {
      key: 'g'
      value: 116.779
    }
    image_channel_mean {
      key: 'r'
      value: 123.68
    }
    image_scaling_factor: 1.0
    max_objects_num_per_image: 100
  }
  arch: "resnet:18"
  anchor_box_config {
    scale: 64.0
    scale: 128.0
    scale: 256.0
    ratio: 1.0
    ratio: 0.5
    ratio: 2.0
  }
  freeze_bn: True
  freeze_blocks: 0
  freeze_blocks: 1
  roi_mini_batch: 256
  rpn_stride: 16
  use_bias: False
  roi_pooling_config {
    pool_size: 7
    pool_size_2x: False
  }
  all_projections: True
  use_pooling: False
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/faster_rcnn/tfrecords/new_trainval/new_trainval*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  image_extension: 'png'
  target_class_mapping {
    key: 'item'
    value: 'item'
  }
  target_class_mapping {
    key: 'person'
    value: 'person'
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 640
    output_image_height: 640
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    enable_auto_resize: True
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 0
    translate_max_y: 0
  }
  color_augmentation {
    hue_rotation_max: 0.0
    saturation_shift_max: 0.0
    contrast_scale_max: 0.0
    contrast_center: 0.5
  }
}
training_config {
  visualizer {
    enabled: False
    num_images: 3
  }
  enable_augmentation: True
  enable_qat: False
  batch_size_per_gpu: 8
  num_epochs: 12
  rpn_min_overlap: 0.3
  rpn_max_overlap: 0.7
  classifier_min_overlap: 0.0
  classifier_max_overlap: 0.5
  gt_as_roi: False
  std_scaling: 1.0
  classifier_regr_std {
    key: 'x'
    value: 10.0
  }
  classifier_regr_std {
    key: 'y'
    value: 10.0
  }
  classifier_regr_std {
    key: 'w'
    value: 5.0
  }
  classifier_regr_std {
    key: 'h'
    value: 5.0
  }

  rpn_mini_batch: 256
  rpn_pre_nms_top_N: 12000
  rpn_nms_max_boxes: 2000
  rpn_nms_overlap_threshold: 0.7

  regularizer {
    type: L2
    weight: 1e-4
  }

  optimizer {
    sgd {
      lr: 0.02
      momentum: 0.9
      decay: 0.0
      nesterov: False
    }
  }

  learning_rate {
    soft_start {
      base_lr: 0.02
      start_lr: 0.002
      soft_start: 0.1
      annealing_points: 0.8
      annealing_points: 0.9
      annealing_divider: 10.0
    }
  }

  lambda_rpn_regr: 1.0
  lambda_rpn_class: 1.0
  lambda_cls_regr: 1.0
  lambda_cls_class: 1.0
}
inference_config {
  images_dir: '/workspace/tao-experiments/data/test_samples'
  batch_size: 1
  detection_image_output_dir: '/workspace/tao-experiments/faster_rcnn/inference_results_imgs_retrain'
  labels_dump_dir: '/workspace/tao-experiments/faster_rcnn/inference_dump_labels_retrain'
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.7
  object_confidence_thres: 0.0001
  bbox_visualize_threshold: 0.6
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
  nms_score_bits: 8
}
evaluation_config {
  batch_size: 1
  validation_period_during_training: 1
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.7
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
  object_confidence_thres: 0.0001
  use_voc07_11point_metric: False
  gt_matching_iou_threshold: 0.5
}

Hello, I am having trouble with a TAO FasterRCNN model trained via transfer learning, specifically when serving it with Triton Inference Server after exporting it as a TensorRT engine. I trained it following the guidelines in the following notebook:

Training was successful and the inference results looked normal. However, while running inference I received the following error:

[02/10/2025-16:19:07] [TRT] [F] Validation failed: libNamespace == nullptr
/workspace/trt_oss_src/TensorRT/plugin/proposalPlugin/proposalPlugin.cpp:528
[02/10/2025-16:19:07] [TRT] [E] std::exception

Note: I get the same error with only the tutorial data and no custom data, so you can reproduce it with the tutorial data, or I can send the model trained on it.

The error did not affect inference through the TAO CLI. But when I launched a Triton Server instance with this model to measure inference times, the server crashed because of it. Is there a way to make the server ignore this validation failure, or to fix the error in the model?

Note that this issue is listed as a limitation of TAO Toolkit 5.2.0 in the 5.3.0 release notes, linked below:
https://docs.nvidia.com/tao/archive/5.3.0/text/release_notes.html
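
To see whether the failure is tied to Triton or to engine deserialization in general, the same engine file can be loaded directly with the TensorRT Python API. The sketch below is only a debugging aid under assumptions: it presumes the `tensorrt` Python package is installed at the same TensorRT version as the serving container, and the helper name `try_deserialize` is made up for illustration. It does not fix the error.

```python
from pathlib import Path


def try_deserialize(engine_path):
    """Try to deserialize a serialized TensorRT engine; return True on success.

    Hypothetical helper for debugging only. TensorRT messages, including any
    plugin validation errors, are printed by the logger while loading.
    """
    # Read the file first so a wrong path fails with FileNotFoundError
    # before TensorRT is imported.
    data = Path(engine_path).read_bytes()

    import tensorrt as trt  # imported lazily; assumed to be installed

    logger = trt.Logger(trt.Logger.VERBOSE)
    # Register the bundled TensorRT plugins under the default ("") namespace.
    trt.init_libnvinfer_plugins(logger, "")
    runtime = trt.Runtime(logger)
    # Expected to return None when deserialization fails.
    engine = runtime.deserialize_cuda_engine(data)
    return engine is not None
```

It could be called as `try_deserialize("/models/FRCNN-resnet50/1/model.plan")`, where the path assumes Triton's default `model.plan` filename for version 1 of the model in the `/models` repository used for the Triton container.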

I used Triton Server 24.04 because it is the last release that ships TensorRT 8 and, from what I can see, TAO Toolkit does not yet support TensorRT 10. Here is the command I used to launch the Triton server:

docker run --gpus=1 --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/ubuntu-testing/model_repository:/models nvcr.io/nvidia/tritonserver:24.04-py3 tritonserver --model-repository=/models

Here is my Triton config file for the model. I am not certain the output shapes are correct, but as far as I can tell the server stops before it even gets to the config file.

name: "FRCNN-resnet50"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "input_image"
    data_type: TYPE_FP16
    dims: [ 3, 640, 640 ]
    reshape { shape: [ 1, 3, 640, 640 ] }
  }
]
output [
  {
    name: "nms_out"
    data_type: TYPE_FP32
    dims: [ 1, 1, 100, 7 ]
    reshape { shape: [ 1, 1, 100, 7 ] }
  },
  {
    name: "nms_out_1"
    data_type: TYPE_FP32
    dims: [ 1, 1, 1, 1 ]
    reshape { shape: [ 1, 1, 1, 1 ] }
  }
]
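
Since the output shapes in this config are uncertain, one cheap check that needs no running server is that every `reshape` describes the same number of elements as its `dims`, which Triton requires. A minimal Python sketch, with the shapes copied by hand from the config (the helper name `reshape_is_consistent` is made up for illustration, and the check says nothing about whether the shapes match the engine's actual bindings):

```python
from math import prod

# Shapes copied by hand from the config.pbtxt: name -> (dims, reshape).
TENSORS = {
    "input_image": ([3, 640, 640], [1, 3, 640, 640]),
    "nms_out": ([1, 1, 100, 7], [1, 1, 100, 7]),
    "nms_out_1": ([1, 1, 1, 1], [1, 1, 1, 1]),
}


def reshape_is_consistent(dims, reshape):
    """Return True if dims and reshape describe the same number of elements."""
    return prod(dims) == prod(reshape)


for name, (dims, reshape) in TENSORS.items():
    status = "ok" if reshape_is_consistent(dims, reshape) else "MISMATCH"
    print(f"{name}: dims={dims} reshape={reshape} -> {status}")
```

All three tensors pass this check, so the element counts are not what stops the server. Whether the names, data types, and shapes agree with the exported engine still has to be confirmed against the engine itself.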

And here is the output from the Triton server when it fails to launch:

NVIDIA Release 24.04 (build 90085237)
Triton Server Version 2.45.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0210 23:05:46.603013 1 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7cb6d6000000' with size 268435456
I0210 23:05:46.604848 1 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0210 23:05:46.608765 1 model_lifecycle.cc:469] loading: FRCNN-resnet50:1
I0210 23:05:46.634964 1 tensorrt.cc:65] TRITONBACKEND_Initialize: tensorrt
I0210 23:05:46.634975 1 tensorrt.cc:75] Triton TRITONBACKEND API version: 1.19
I0210 23:05:46.634977 1 tensorrt.cc:81] 'tensorrt' TRITONBACKEND API version: 1.19
I0210 23:05:46.634979 1 tensorrt.cc:105] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0210 23:05:46.636848 1 tensorrt.cc:231] TRITONBACKEND_ModelInitialize: FRCNN-resnet50 (version 1)
I0210 23:05:46.691943 1 logging.cc:46] Loaded engine size: 84 MiB
E0210 23:05:46.707516 1 logging.cc:40] Validation failed: libNamespace == nullptr
plugin/proposalPlugin/proposalPlugin.cpp:528
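
The server log uses a glog-style prefix on each line (severity letter, MMDD date, time, thread id, source file and line number), so the failing entries can be filtered out of a long startup log. A minimal sketch, assuming that prefix format and using three lines copied from the log; `error_records` is a hypothetical helper name, not part of Triton:

```python
import re

# glog-style prefix: <level><MMDD> <time> <thread> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def error_records(lines):
    """Collect error (E) and fatal (F) entries, keeping their continuation lines."""
    records = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_LINE.match(line)
        if match:
            current = None
            if match.group("level") in ("E", "F"):
                current = {
                    "source": f'{match["file"]}:{match["line"]}',
                    "message": match["msg"],
                }
                records.append(current)
        elif current is not None and line.strip():
            # A line without a prefix right after an error belongs to that error.
            current["message"] += "\n" + line.strip()
    return records


SAMPLE = [
    "I0210 23:05:46.691943 1 logging.cc:46] Loaded engine size: 84 MiB",
    "E0210 23:05:46.707516 1 logging.cc:40] Validation failed: libNamespace == nullptr",
    "plugin/proposalPlugin/proposalPlugin.cpp:528",
]

for record in error_records(SAMPLE):
    print(record["source"])
    print(record["message"])
```

On the sample lines this keeps a single record from `logging.cc:40`, with the plugin source location attached as a continuation of the message.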

Thanks for your help!

7 posts - 2 participants
