Please provide the following information when requesting support.
• Hardware: Jetson Nano / Jetson AGX Orin / dGPU
• Network Type: Mask_rcnn
• TAO version: 5.3
• Training spec file: default
I want to learn how to deploy a MaskRCNN UFF model on targets like the Jetson Nano and the AGX Orin. I already have it working on the Jetson Nano: I set up config_infer_primary.txt and so on, and it works because I built the TensorRT OSS plugins. What I don't understand is why tao-converter doesn't work on either the Nano or the AGX Orin.
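For context, this is roughly what my working config_infer_primary.txt on the Nano looks like. A minimal sketch only; the paths, class count, and input dims are placeholders for my real values, and the parser names come from the deepstream_tao_apps repository:

[property]
gpu-id=0
# exported UFF model; nvinfer builds the engine once and caches it at model-engine-file
uff-file=model.epoch-20.uff
model-engine-file=model.engine
uff-input-blob-name=Input
infer-dims=3;256;256
output-blob-names=generate_detections;mask_fcn_logits/BiasAdd
num-detected-classes=2
# instance segmentation with the MaskRCNN parser from deepstream_tao_apps
network-type=3
cluster-mode=4
output-instance-mask=1
parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLTV2
custom-lib-path=post_processor/libnvds_infercustomparser_tao.so

And this is the tao-converter attempt that fails on both devices: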
./tao-converter -d 3,256,256 -k key -o generate_detections,mask_fcn_logits/BiasAdd model.epoch-20.tlt
[INFO] [MemUsageChange] Init CUDA: CPU +203, GPU +0, now: CPU 285, GPU 3889 (MiB)
[ERROR] UffParser: Unsupported number of graph 0
[ERROR] Failed to parse the model, please check the encoding key to make sure it's correct
[ERROR] 4: [network.cpp::validate::2411] Error Code 4: Internal Error (Network must have at least one output)
[ERROR] Unable to create engine
zsh: segmentation fault (core dumped) ./tao-converter -d 3,256,256 -k key -o model.epoch-20.tlt
I got the key value from the TAO 5.3 source code. Again, this fails on both Jetsons.
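For comparison, this is the invocation I would expect based on the TAO docs. A sketch only, assuming the checkpoint was first exported with tao model mask_rcnn export; the .etlt filename, engine path, and fp16 choice are placeholders. Note that tao-converter takes the exported .etlt, not the raw .tlt checkpoint:

# hypothetical paths; -e names the engine to write, -t sets the precision
./tao-converter -k <encoding_key> \
                -d 3,256,256 \
                -o generate_detections,mask_fcn_logits/BiasAdd \
                -t fp16 \
                -e model.engine \
                model.epoch-20.etlt

I wonder whether feeding the raw .tlt instead of an exported .etlt could explain the UffParser error above.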
Then I saw that recent JetPack releases for the AGX Orin ship TensorRT 8.6.1 and, more recently, TensorRT 10, and with TensorRT 10 trtexec no longer supports UFF models (the documentation and forums also say the format is deprecated). Because of this, on the AGX Orin I used a container instead:
docker run --runtime=nvidia --gpus all -it --rm -v $(pwd):/workspace/deep nvcr.io/nvidia/tensorrt:24.01-py3
and inside the container:
trtexec --uff=model.epoch-20.uff --maxBatch=1 --uffInput=Input,3,256,256 --output=generate_detections,mask_fcn_logits/BiasAdd --fp16 --best --saveEngine=model.engine
The process then got stuck, with one CPU core at 100%:
[11/15/2024-12:41:09] [I] === Model Options ===
[11/15/2024-12:41:09] [I] Format: UFF
[11/15/2024-12:41:09] [I] Model: model.epoch-20.uff
[11/15/2024-12:41:09] [I] Uff Inputs Layout: NCHW
[11/15/2024-12:41:09] [I] Input: Input,3,256,256
[11/15/2024-12:41:09] [I] Output: generate_detections mask_fcn_logits/BiasAdd
[11/15/2024-12:41:09] [I] === Build Options ===
[11/15/2024-12:41:09] [I] Max batch: 1
[11/15/2024-12:41:09] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[11/15/2024-12:41:09] [I] minTiming: 1
[11/15/2024-12:41:09] [I] avgTiming: 8
[11/15/2024-12:41:09] [I] Precision: FP32+FP16+INT8
[11/15/2024-12:41:09] [I] LayerPrecisions:
[11/15/2024-12:41:09] [I] Layer Device Types:
[11/15/2024-12:41:09] [I] Calibration: Dynamic
[11/15/2024-12:41:09] [I] Refit: Disabled
[11/15/2024-12:41:09] [I] Version Compatible: Disabled
[11/15/2024-12:41:09] [I] TensorRT runtime: full
[11/15/2024-12:41:09] [I] Lean DLL Path:
[11/15/2024-12:41:09] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[11/15/2024-12:41:09] [I] Exclude Lean Runtime: Disabled
[11/15/2024-12:41:09] [I] Sparsity: Disabled
[11/15/2024-12:41:09] [I] Safe mode: Disabled
[11/15/2024-12:41:09] [I] Build DLA standalone loadable: Disabled
[11/15/2024-12:41:09] [I] Allow GPU fallback for DLA: Disabled
[11/15/2024-12:41:09] [I] DirectIO mode: Disabled
[11/15/2024-12:41:09] [I] Restricted mode: Disabled
[11/15/2024-12:41:09] [I] Skip inference: Disabled
[11/15/2024-12:41:09] [I] Save engine: model.engine
[11/15/2024-12:41:09] [I] Load engine:
[11/15/2024-12:41:09] [I] Profiling verbosity: 0
[11/15/2024-12:41:09] [I] Tactic sources: Using default tactic sources
[11/15/2024-12:41:09] [I] timingCacheMode: local
[11/15/2024-12:41:09] [I] timingCacheFile:
[11/15/2024-12:41:09] [I] Heuristic: Disabled
[11/15/2024-12:41:09] [I] Preview Features: Use default preview flags.
[11/15/2024-12:41:09] [I] MaxAuxStreams: -1
[11/15/2024-12:41:09] [I] BuilderOptimizationLevel: -1
[11/15/2024-12:41:09] [I] Input(s)s format: fp32:CHW
[11/15/2024-12:41:09] [I] Output(s)s format: fp32:CHW
[11/15/2024-12:41:09] [I] Input build shapes: model
[11/15/2024-12:41:09] [I] Input calibration shapes: model
[11/15/2024-12:41:09] [I] === System Options ===
[11/15/2024-12:41:09] [I] Device: 0
[11/15/2024-12:41:09] [I] DLACore:
[11/15/2024-12:41:09] [I] Plugins:
[11/15/2024-12:41:09] [I] setPluginsToSerialize:
[11/15/2024-12:41:09] [I] dynamicPlugins:
[11/15/2024-12:41:09] [I] ignoreParsedPluginLibs: 0
[11/15/2024-12:41:09] [I]
[11/15/2024-12:41:09] [I] === Inference Options ===
[11/15/2024-12:41:09] [I] Batch: 1
[11/15/2024-12:41:09] [I] Input inference shapes: model
[11/15/2024-12:41:09] [I] Iterations: 10
[11/15/2024-12:41:09] [I] Duration: 3s (+ 200ms warm up)
[11/15/2024-12:41:09] [I] Sleep time: 0ms
[11/15/2024-12:41:09] [I] Idle time: 0ms
[11/15/2024-12:41:09] [I] Inference Streams: 1
[11/15/2024-12:41:09] [I] ExposeDMA: Disabled
[11/15/2024-12:41:09] [I] Data transfers: Enabled
[11/15/2024-12:41:09] [I] Spin-wait: Disabled
[11/15/2024-12:41:09] [I] Multithreading: Disabled
[11/15/2024-12:41:09] [I] CUDA Graph: Disabled
[11/15/2024-12:41:09] [I] Separate profiling: Disabled
[11/15/2024-12:41:09] [I] Time Deserialize: Disabled
[11/15/2024-12:41:09] [I] Time Refit: Disabled
[11/15/2024-12:41:09] [I] NVTX verbosity: 0
[11/15/2024-12:41:09] [I] Persistent Cache Ratio: 0
[11/15/2024-12:41:09] [I] Inputs:
[11/15/2024-12:41:09] [I] === Reporting Options ===
[11/15/2024-12:41:09] [I] Verbose: Disabled
[11/15/2024-12:41:09] [I] Averages: 10 inferences
[11/15/2024-12:41:09] [I] Percentiles: 90,95,99
[11/15/2024-12:41:09] [I] Dump refittable layers:Disabled
[11/15/2024-12:41:09] [I] Dump output: Disabled
[11/15/2024-12:41:09] [I] Profile: Disabled
[11/15/2024-12:41:09] [I] Export timing to JSON file:
[11/15/2024-12:41:09] [I] Export output to JSON file:
[11/15/2024-12:41:09] [I] Export profile to JSON file:
[11/15/2024-12:41:09] [I]
[11/15/2024-12:41:09] [I] === Device Information ===
[11/15/2024-12:41:09] [I] Selected Device: Orin
[11/15/2024-12:41:09] [I] Compute Capability: 8.7
[11/15/2024-12:41:09] [I] SMs: 16
[11/15/2024-12:41:09] [I] Device Global Memory: 30696 MiB
[11/15/2024-12:41:09] [I] Shared Memory per SM: 164 KiB
[11/15/2024-12:41:09] [I] Memory Bus Width: 256 bits (ECC disabled)
[11/15/2024-12:41:09] [I] Application Compute Clock Rate: 1.3 GHz
[11/15/2024-12:41:09] [I] Application Memory Clock Rate: 1.3 GHz
[11/15/2024-12:41:09] [I]
[11/15/2024-12:41:09] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[11/15/2024-12:41:09] [I]
[11/15/2024-12:41:09] [I] TensorRT version: 8.6.1
[11/15/2024-12:41:09] [I] Loading standard plugins
[11/15/2024-12:41:09] [I] [TRT] [MemUsageChange] Init CUDA: CPU +1, GPU +0, now: CPU 18, GPU 4105 (MiB)
^C
I also don't know what the nvinfer spec for DeepStream 7.1 should look like. I tried letting DeepStream build the engine on the first run as well, but that doesn't work either, I think because of the deprecation of UFF models.
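My guess is that on DeepStream 7.1 the spec could only reference a pre-built engine, with all the uff-/tlt- properties removed. A sketch of what I mean, with placeholder paths:

[property]
# DeepStream 7.1 ships TensorRT 10, which has no UFF parser, so the
# engine would have to be built elsewhere and only loaded here
model-engine-file=model.engine
network-type=3
output-instance-mask=1
parse-bbox-instance-mask-func-name=NvDsInferParseCustomMrcnnTLTV2
custom-lib-path=post_processor/libnvds_infercustomparser_tao.so

But as far as I know, an engine serialized with TensorRT 8.6 will not deserialize under TensorRT 10, so this does not really solve the problem.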
On a PC with an RTX 4090, trtexec works.
What should I do to run this TAO mask_rcnn model on the Jetson AGX Orin, given that TAO officially exports mask_rcnn models only to the UFF format?
Best regards,
Darek