Dear @Morganh
I am trying to run inference with a mask_rcnn model inside the nvcr.io/nvidia/tao/tao-toolkit:5.5.0-deploy container, but I am getting the issue below.
root@ca83379e98f1:/home/data/Model-Training/BOX-SEGMENTATION_V1/mask_rcnn/experiment_dir_unpruned# python3 load_engine_infer_mask_rcnn.py
[01/15/2025-07:00:34] [TRT] [E] 3: getPluginCreator could not find plugin: ResizeNearest_TRT version: 1
[01/15/2025-07:00:34] [TRT] [E] 1: [pluginV2Runner.cpp::load::303] Error Code 1: Serialization (Serialization assertion creator failed.Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
Traceback (most recent call last):
File "/home/data/Model-Training/BOX-SEGMENTATION_V1/mask_rcnn/experiment_dir_unpruned/load_engine_infer_mask_rcnn.py", line 84, in <module>
inputs, outputs, bindings, stream = allocate_buffers(engine)
File "/home/data/Model-Training/BOX-SEGMENTATION_V1/mask_rcnn/experiment_dir_unpruned/load_engine_infer_mask_rcnn.py", line 20, in allocate_buffers
for binding in engine:
TypeError: 'NoneType' object is not iterable
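From the two [TRT] [E] messages, my understanding is that the ResizeNearest_TRT plugin creator is not registered in my script, so runtime.deserialize_cuda_engine() returns None and the loop over the engine then fails with the TypeError. A minimal sketch of the registration I think is missing (assuming trt.init_libnvinfer_plugins is the correct call here, and that the required plugins ship in the container's libnvinfer_plugin):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Register the standard TensorRT plugin creators (including ResizeNearest_TRT)
# before deserializing the engine; without this, deserialize_cuda_engine()
# seems to return None.
trt.init_libnvinfer_plugins(TRT_LOGGER, "")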
First, I converted the .uff file to a TensorRT engine using the steps below.
- Execute the TAO 5.5 deploy container:
docker run -d --gpus all -it --rm --shm-size=4g -v /home/smarg/Documents/TAO/:/home/data nvcr.io/nvidia/tao/tao-toolkit:5.5.0-deploy /bin/bash
- Command for engine generation:
mask_rcnn gen_trt_engine -m ./model.epoch-6.uff \
--batch_size 1 \
--data_type fp16 \
--engine_file ./model.epoch-6.uff.engine \
--results_dir ./exportT
root@ca83379e98f1:/home/data/Model-Training/BOX-SEGMENTATION_V1/mask_rcnn/experiment_dir_unpruned# mask_rcnn gen_trt_engine -m ./model.epoch-6.uff \
--batch_size 1 \
--data_type fp16 \
--engine_file ./model.epoch-6.uff.engine \
--results_dir ./exportT
Loading uff directly from the package source code
Loading uff directly from the package source code
2025-01-15 06:40:15,915 [TAO Toolkit] [INFO] root 167: Starting mask_rcnn gen_trt_engine.
[01/15/2025-06:40:15] [TRT] [I] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 35, GPU 1019 (MiB)
[01/15/2025-06:40:20] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1453, GPU +268, now: CPU 1565, GPU 1287 (MiB)
2025-01-15 06:40:20,484 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.mask_rcnn.engine_builder 96: Parsing UFF model
[01/15/2025-06:40:20] [TRT] [W] The implicit batch dimension mode has been deprecated. Please create the network with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag whenever possible.
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 150: TensorRT engine build configurations:
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 163:
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 165: BuilderFlag.FP16
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 179: BuilderFlag.TF32
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 195:
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 197: Note: max representabile value is 2,147,483,648 bytes or 2GB.
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 199: MemoryPoolType.WORKSPACE = 2147483648 bytes
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 201: MemoryPoolType.DLA_MANAGED_SRAM = 0 bytes
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 203: MemoryPoolType.DLA_LOCAL_DRAM = 1073741824 bytes
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 205: MemoryPoolType.DLA_GLOBAL_DRAM = 536870912 bytes
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 207:
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 209: PreviewFeature.FASTER_DYNAMIC_SHAPES_0805
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 211: PreviewFeature.DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805
2025-01-15 06:40:21,075 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 215: Tactic Sources = 31
[01/15/2025-06:40:21] [TRT] [I] Graph optimization time: 0.0165025 seconds.
[01/15/2025-06:40:21] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +10, now: CPU 1793, GPU 1297 (MiB)
[01/15/2025-06:40:21] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1793, GPU 1297 (MiB)
[01/15/2025-06:40:21] [TRT] [W] cuDNN tactic soruce is always disabled in this TensorRT version
[01/15/2025-06:40:21] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/15/2025-06:42:07] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[01/15/2025-06:42:07] [TRT] [I] Total Host Persistent Memory: 245568
[01/15/2025-06:42:07] [TRT] [I] Total Device Persistent Memory: 11776
[01/15/2025-06:42:07] [TRT] [I] Total Scratch Memory: 51951616
[01/15/2025-06:42:07] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 92 MiB, GPU 350 MiB
[01/15/2025-06:42:07] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 107 steps to complete.
[01/15/2025-06:42:07] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 9.0624ms to assign 21 blocks to 107 nodes requiring 86775296 bytes.
[01/15/2025-06:42:07] [TRT] [I] Total Activation Memory: 86773248
[01/15/2025-06:42:07] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2090, GPU 1359 (MiB)
[01/15/2025-06:42:07] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 2090, GPU 1359 (MiB)
[01/15/2025-06:42:07] [TRT] [W] cuDNN tactic soruce is always disabled in this TensorRT version
[01/15/2025-06:42:07] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[01/15/2025-06:42:07] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[01/15/2025-06:42:07] [TRT] [W] Check verbose logs for the list of affected weights.
[01/15/2025-06:42:07] [TRT] [W] - 57 weights are affected by this issue: Detected subnormal FP16 values.
[01/15/2025-06:42:07] [TRT] [W] - 13 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[01/15/2025-06:42:07] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +51, GPU +48, now: CPU 51, GPU 48 (MiB)
Export finished successfully.
2025-01-15 06:42:07,650 [TAO Toolkit] [INFO] root 167: Gen_trt_engine finished successfully.
[2025-01-15 06:42:07,829 - TAO Toolkit - nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto - INFO] Sending telemetry data.
[2025-01-15 06:42:07,829 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-15 06:42:07,829 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'gen_trt_engine', 'network': 'mask_rcnn', 'gpu': ['NVIDIA-RTX-A4000'], 'success': True, 'time_lapsed': 112.39582538604736} to https://api.tao.ngc.nvidia.com.
[2025-01-15 06:42:09,555 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-15 06:42:09,555 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-15 06:42:09,555 - TAO Toolkit - nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto - INFO] Execution status: PASS
- Execute the script below inside the same container; this produces the error shown above.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import cv2


def load_engine(engine_file_path):
    """Load the TensorRT engine from file."""
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def allocate_buffers(engine):
    """Allocate input and output buffers for the TensorRT engine."""
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings
        bindings.append(int(device_mem))
        # Store the buffers
        if engine.binding_is_input(binding):
            inputs.append({"host": host_mem, "device": device_mem})
        else:
            outputs.append({"host": host_mem, "device": device_mem})
    return inputs, outputs, bindings, stream


def preprocess_image(image_path, input_shape):
    """Preprocess input image to match the engine's input size."""
    image = cv2.imread(image_path)
    original_image = image.copy()
    resized_image = cv2.resize(image, (input_shape[2], input_shape[1]))
    normalized_image = resized_image.astype(np.float32) / 255.0  # Normalize to [0, 1]
    transposed_image = np.transpose(normalized_image, (2, 0, 1))  # HWC to CHW
    batch_image = np.expand_dims(transposed_image, axis=0)  # Add batch dimension
    return batch_image, original_image  # Return original for visualization


def do_inference(engine, inputs, outputs, bindings, stream, input_image):
    """Run inference on the TensorRT engine."""
    # Copy input data to the input buffer
    np.copyto(inputs[0]["host"], input_image.ravel())
    # Transfer input data to the GPU
    cuda.memcpy_htod_async(inputs[0]["device"], inputs[0]["host"], stream)
    # Run inference
    context = engine.create_execution_context()
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back to the host
    cuda.memcpy_dtoh_async(outputs[0]["host"], outputs[0]["device"], stream)
    stream.synchronize()
    return outputs[0]["host"]


def postprocess_output(output, image, threshold=0.5):
    """Post-process the output to overlay detections on the image."""
    # Parse the output (example assumes output contains boxes, scores, and masks)
    # You must modify this part based on your model's output structure.
    boxes, scores, masks = output[0], output[1], output[2]  # Adjust as needed
    for i, score in enumerate(scores):
        if score > threshold:
            box = boxes[i]
            x1, y1, x2, y2 = map(int, box)
            cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
            # Overlay mask on the image
            mask = masks[i]
            mask = (mask > threshold).astype(np.uint8)
            colored_mask = np.zeros_like(image, dtype=np.uint8)
            colored_mask[:, :, 1] = mask * 255
            image = cv2.addWeighted(image, 1, colored_mask, 0.5, 0)
    return image


if __name__ == "__main__":
    engine_file = "model.epoch-6.uff.engine"  # Path to your TensorRT engine
    image_path = "net-5809-_jpg.rf.4e20462228dd67b33cbbda88966dbbae.jpg"  # Path to the input image
    input_shape = (1, 3, 640, 640)  # Batch size 1, 3 channels, 640x640 resolution

    # Load TensorRT engine
    engine = load_engine(engine_file)
    inputs, outputs, bindings, stream = allocate_buffers(engine)

    # Preprocess input image
    input_image, original_image = preprocess_image(image_path, input_shape)

    # Run inference
    output = do_inference(engine, inputs, outputs, bindings, stream, input_image)

    # Post-process and visualize the result
    result_image = postprocess_output(output, original_image)
    cv2.imshow("Result", result_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
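Also, the gen_trt_engine log above reports 1 input and 2 output network tensors, while my postprocess_output() assumes three separate arrays (boxes, scores, masks), so that part probably needs to change as well. Once the engine deserializes, I plan to dump the bindings with a small helper like the sketch below (standard TensorRT Python API, nothing model-specific assumed) and adapt the pre/post-processing to the actual names and shapes:

def print_bindings(engine):
    """Print each binding's name, shape and dtype so pre/post-processing can be matched to the engine."""
    for i in range(engine.num_bindings):
        kind = "input" if engine.binding_is_input(i) else "output"
        name = engine.get_binding_name(i)
        shape = engine.get_binding_shape(i)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        print(f"{kind}: {name} shape={tuple(shape)} dtype={dtype}")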
Please suggest where the gaps are, or share any other inference script you can provide for running inference with a mask_rcnn model.