Problems with PointPillar export

Required Information

  • Hardware: NVIDIA GeForce RTX 4080
  • Network Type: PointPillar
  • GitHub Repository: tao_pytorch_backend
  • TLT Version (tlt info --verbose doesn’t work): docker is nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt-base → command used to run: docker run -it --rm --gpus all -v /path/to/project/tao_pytorch_backend:/tao-pt -e PYTHONPATH=/tao-pt:$PYTHONPATH --shm-size 16G --net=host nvcr.io/nvidia/tao/tao-toolkit:5.0.0-pyt-base
  • Training spec file: pointpillar_general.yaml
  • How to reproduce the issue: see later

Documentation I followed

Forum topics I already checked

My training

I managed to train from scratch on the KITTI dataset, on cars only.
log_train_20240507-082745.txt

2024-05-07 08:27:45,090   INFO  **********************Start logging**********************
2024-05-07 08:27:45,090   INFO  CUDA_VISIBLE_DEVICES=ALL
2024-05-07 08:27:45,436   INFO  Loading point cloud dataset
2024-05-07 08:27:45,489   INFO  Total samples for point cloud dataset: 5366
2024-05-07 08:27:45,658   INFO  **********************Start training**********************
2024-05-07 18:23:57,833   INFO  **********************End training**********************

status.json

{"date": "5/7/2024", "time": "8:27:45", "status": "STARTED", "verbosity": "INFO", "message": "Starting PointPillars training"}
{"epoch": 0, "time_per_epoch": "0:07:23.530897", "max_epoch": 80, "eta": "9:51:22.471797", "date": "5/7/2024", "time": "8:35:9", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0003064909424133253, "loss": 1.3810601234436035}}
{"epoch": 1, "time_per_epoch": "0:07:35.166232", "max_epoch": 80, "eta": "9:59:18.132325", "date": "5/7/2024", "time": "8:42:45", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0003259206078755463, "loss": 1.2703205347061157}}
{"epoch": 2, "time_per_epoch": "0:07:42.412781", "max_epoch": 80, "eta": "10:01:08.196897", "date": "5/7/2024", "time": "8:50:28", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0003581018816993485, "loss": 0.8312548995018005}}
{"epoch": 3, "time_per_epoch": "0:07:43.532270", "max_epoch": 80, "eta": "9:54:51.984766", "date": "5/7/2024", "time": "8:58:12", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0004027248406257354, "loss": 0.601682186126709}}
{"epoch": 4, "time_per_epoch": "0:07:35.183181", "max_epoch": 80, "eta": "9:36:33.921729", "date": "5/7/2024", "time": "9:5:48", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.00045935974116685435, "loss": 0.7014744877815247}}
{"epoch": 5, "time_per_epoch": "0:07:36.934350", "max_epoch": 80, "eta": "9:31:10.076217", "date": "5/7/2024", "time": "9:13:25", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0005274611582707094, "loss": 0.9280206561088562}}
{"epoch": 6, "time_per_epoch": "0:07:35.909267", "max_epoch": 80, "eta": "9:22:17.285785", "date": "5/7/2024", "time": "9:21:2", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0006063732380625683, "loss": 0.8185919523239136}}
{"epoch": 7, "time_per_epoch": "0:07:34.102590", "max_epoch": 80, "eta": "9:12:29.489096", "date": "5/7/2024", "time": "9:28:36", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0006953360140763047, "loss": 0.5776488780975342}}
{"epoch": 8, "time_per_epoch": "0:07:31.029943", "max_epoch": 80, "eta": "9:01:14.155902", "date": "5/7/2024", "time": "9:36:8", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0007934927261469067, "loss": 0.7977230548858643}}
{"epoch": 9, "time_per_epoch": "0:07:35.350255", "max_epoch": 80, "eta": "8:58:49.868115", "date": "5/7/2024", "time": "9:43:44", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0008998980714792163, "loss": 0.759836733341217}}
{"epoch": 10, "time_per_epoch": "0:07:34.159572", "max_epoch": 80, "eta": "8:49:51.170060", "date": "5/7/2024", "time": "9:51:19", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0010135273084306063, "loss": 0.5445941686630249}}
{"epoch": 11, "time_per_epoch": "0:07:35.378166", "max_epoch": 80, "eta": "8:43:41.093459", "date": "5/7/2024", "time": "9:58:55", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0011332861253331745, "loss": 0.6107680201530457}}
{"epoch": 12, "time_per_epoch": "0:07:29.786740", "max_epoch": 80, "eta": "8:29:45.498304", "date": "5/7/2024", "time": "10:6:25", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0012580211793133203, "loss": 0.5008313059806824}}
{"epoch": 13, "time_per_epoch": "0:07:34.622371", "max_epoch": 80, "eta": "8:27:39.698833", "date": "5/7/2024", "time": "10:14:0", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.001386531203614099, "loss": 0.6656970977783203}}
{"epoch": 14, "time_per_epoch": "0:07:30.476321", "max_epoch": 80, "eta": "8:15:31.437162", "date": "5/7/2024", "time": "10:21:31", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0015175785764507683, "loss": 0.7663007974624634}}
{"epoch": 15, "time_per_epoch": "0:07:31.792888", "max_epoch": 80, "eta": "8:09:26.537725", "date": "5/7/2024", "time": "10:29:4", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0016499012399851304, "loss": 0.46346515417099}}
{"epoch": 16, "time_per_epoch": "0:07:33.283206", "max_epoch": 80, "eta": "8:03:30.125179", "date": "5/7/2024", "time": "10:36:38", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0017822248546324234, "loss": 0.3259502351284027}}
{"epoch": 17, "time_per_epoch": "0:07:24.865216", "max_epoch": 80, "eta": "7:47:06.508636", "date": "5/7/2024", "time": "10:44:3", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.001913275071648148, "loss": 0.6114525198936462}}
{"epoch": 18, "time_per_epoch": "0:07:22.195047", "max_epoch": 80, "eta": "7:36:56.092898", "date": "5/7/2024", "time": "10:51:26", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002041789805803107, "loss": 0.5467955470085144}}
{"epoch": 19, "time_per_epoch": "0:07:24.841465", "max_epoch": 80, "eta": "7:32:15.329358", "date": "5/7/2024", "time": "10:58:51", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0021665313899540883, "loss": 0.6311314702033997}}
{"epoch": 20, "time_per_epoch": "0:07:23.714800", "max_epoch": 80, "eta": "7:23:42.887972", "date": "5/7/2024", "time": "11:6:16", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0022862984944550316, "loss": 0.4225645959377289}}
{"epoch": 21, "time_per_epoch": "0:07:25.572585", "max_epoch": 80, "eta": "7:18:08.782537", "date": "5/7/2024", "time": "11:13:42", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002399937696618234, "loss": 0.4868691563606262}}
{"epoch": 22, "time_per_epoch": "0:07:24.633668", "max_epoch": 80, "eta": "7:09:48.752744", "date": "5/7/2024", "time": "11:21:7", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0025063545888053566, "loss": 0.6387386322021484}}
{"epoch": 23, "time_per_epoch": "0:07:24.312719", "max_epoch": 80, "eta": "7:02:05.824984", "date": "5/7/2024", "time": "11:28:32", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0026045243181712467, "loss": 0.5239708423614502}}
{"epoch": 24, "time_per_epoch": "0:07:25.113793", "max_epoch": 80, "eta": "6:55:26.372427", "date": "5/7/2024", "time": "11:35:58", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0026935014565570774, "loss": 0.407818466424942}}
{"epoch": 25, "time_per_epoch": "0:07:23.972846", "max_epoch": 80, "eta": "6:46:58.506531", "date": "5/7/2024", "time": "11:43:22", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002772429105480342, "loss": 0.4710182845592499}}
{"epoch": 26, "time_per_epoch": "0:07:22.670298", "max_epoch": 80, "eta": "6:38:24.196090", "date": "5/7/2024", "time": "11:50:45", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0028405471485356687, "loss": 0.6307605504989624}}
{"epoch": 27, "time_per_epoch": "0:07:24.125681", "max_epoch": 80, "eta": "6:32:18.661090", "date": "5/7/2024", "time": "11:58:10", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0028971995717313234, "loss": 0.8641102313995361}}
{"epoch": 28, "time_per_epoch": "0:07:20.112568", "max_epoch": 80, "eta": "6:21:25.853554", "date": "5/7/2024", "time": "12:5:31", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002941840781262567, "loss": 0.6356572508811951}}
{"epoch": 29, "time_per_epoch": "0:07:22.350795", "max_epoch": 80, "eta": "6:15:59.890523", "date": "5/7/2024", "time": "12:12:54", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002974040857878247, "loss": 0.7764367461204529}}
{"epoch": 30, "time_per_epoch": "0:07:21.374408", "max_epoch": 80, "eta": "6:07:48.720419", "date": "5/7/2024", "time": "12:20:16", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002993489697238202, "loss": 0.4349867105484009}}
{"epoch": 31, "time_per_epoch": "0:07:29.639962", "max_epoch": 80, "eta": "6:07:12.358129", "date": "5/7/2024", "time": "12:27:46", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0029999999963875776, "loss": 0.441621333360672}}
{"epoch": 32, "time_per_epoch": "0:07:31.809968", "max_epoch": 80, "eta": "6:01:26.878446", "date": "5/7/2024", "time": "12:35:18", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0029967931997491124, "loss": 0.5041215419769287}}
{"epoch": 33, "time_per_epoch": "0:07:36.250737", "max_epoch": 80, "eta": "5:57:23.784623", "date": "5/7/2024", "time": "12:42:55", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002987176967241224, "loss": 0.40223047137260437}}
{"epoch": 34, "time_per_epoch": "0:07:29.651408", "max_epoch": 80, "eta": "5:44:43.964753", "date": "5/7/2024", "time": "12:50:25", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0029711924788763493, "loss": 0.5588961839675903}}
{"epoch": 35, "time_per_epoch": "0:07:27.479540", "max_epoch": 80, "eta": "5:35:36.579287", "date": "5/7/2024", "time": "12:57:53", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0029489081826876507, "loss": 0.36693549156188965}}
{"epoch": 36, "time_per_epoch": "0:07:28.327920", "max_epoch": 80, "eta": "5:28:46.428481", "date": "5/7/2024", "time": "13:5:22", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0029204195034525557, "loss": 0.9166396260261536}}
{"epoch": 37, "time_per_epoch": "0:07:29.150868", "max_epoch": 80, "eta": "5:21:53.487321", "date": "5/7/2024", "time": "13:12:52", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.00288584843406921, "loss": 0.7163593173027039}}
{"epoch": 38, "time_per_epoch": "0:07:37.383134", "max_epoch": 80, "eta": "5:20:10.091611", "date": "5/7/2024", "time": "13:20:30", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002845343013164161, "loss": 0.5028563737869263}}
{"epoch": 39, "time_per_epoch": "0:07:35.479606", "max_epoch": 80, "eta": "5:11:14.663833", "date": "5/7/2024", "time": "13:28:6", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0027990766911682287, "loss": 0.5387132167816162}}
{"epoch": 40, "time_per_epoch": "0:07:31.724850", "max_epoch": 80, "eta": "5:01:08.994012", "date": "5/7/2024", "time": "13:35:38", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002747247587575135, "loss": 0.4849264919757843}}
{"epoch": 41, "time_per_epoch": "0:07:30.573824", "max_epoch": 80, "eta": "4:52:52.379118", "date": "5/7/2024", "time": "13:43:10", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002690077642563414, "loss": 0.5199539661407471}}
{"epoch": 42, "time_per_epoch": "0:07:36.183149", "max_epoch": 80, "eta": "4:48:54.959673", "date": "5/7/2024", "time": "13:50:46", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002627811666614496, "loss": 0.4742547273635864}}
{"epoch": 43, "time_per_epoch": "0:07:23.678617", "max_epoch": 80, "eta": "4:33:36.108835", "date": "5/7/2024", "time": "13:58:11", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002560716292196641, "loss": 0.3814888298511505}}
{"epoch": 44, "time_per_epoch": "0:07:21.946651", "max_epoch": 80, "eta": "4:25:10.079438", "date": "5/7/2024", "time": "14:5:33", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.002489078832003774, "loss": 0.34325650334358215}}
{"epoch": 45, "time_per_epoch": "0:07:21.781962", "max_epoch": 80, "eta": "4:17:42.368685", "date": "5/7/2024", "time": "14:12:55", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0024132060486384255, "loss": 0.5689999461174011}}
{"epoch": 46, "time_per_epoch": "0:07:30.936750", "max_epoch": 80, "eta": "4:15:31.849487", "date": "5/7/2024", "time": "14:20:27", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0023334228410071675, "loss": 0.4784387946128845}}
{"epoch": 47, "time_per_epoch": "0:07:25.671313", "max_epoch": 80, "eta": "4:05:07.153333", "date": "5/7/2024", "time": "14:27:53", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0022500708530536163, "loss": 0.589332103729248}}
{"epoch": 48, "time_per_epoch": "0:07:22.110762", "max_epoch": 80, "eta": "3:55:47.544388", "date": "5/7/2024", "time": "14:35:16", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0021635070107866206, "loss": 0.27218320965766907}}
{"epoch": 49, "time_per_epoch": "0:07:19.994904", "max_epoch": 80, "eta": "3:47:19.842036", "date": "5/7/2024", "time": "14:42:37", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.00207410199386829, "loss": 0.48348644375801086}}
{"epoch": 50, "time_per_epoch": "0:07:24.929756", "max_epoch": 80, "eta": "3:42:27.892692", "date": "5/7/2024", "time": "14:50:2", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0019822386483067766, "loss": 0.37582147121429443}}
{"epoch": 51, "time_per_epoch": "0:07:20.617776", "max_epoch": 80, "eta": "3:32:57.915507", "date": "5/7/2024", "time": "14:57:23", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0018883103470508924, "loss": 0.39562439918518066}}
{"epoch": 52, "time_per_epoch": "0:07:29.628477", "max_epoch": 80, "eta": "3:29:49.597356", "date": "5/7/2024", "time": "15:4:53", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.00179271930550675, "loss": 0.4453161954879761}}
{"epoch": 53, "time_per_epoch": "0:07:24.416505", "max_epoch": 80, "eta": "3:19:59.245644", "date": "5/7/2024", "time": "15:12:18", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0016958748591896452, "loss": 0.6267533302307129}}
{"epoch": 54, "time_per_epoch": "0:07:34.661344", "max_epoch": 80, "eta": "3:17:01.194940", "date": "5/7/2024", "time": "15:19:54", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0015981917108865375, "loss": 0.23352468013763428}}
{"epoch": 55, "time_per_epoch": "0:07:25.929298", "max_epoch": 80, "eta": "3:05:48.232458", "date": "5/7/2024", "time": "15:27:20", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.001500088154835051, "loss": 0.5205101370811462}}
{"epoch": 56, "time_per_epoch": "0:07:24.857243", "max_epoch": 80, "eta": "2:57:56.573843", "date": "5/7/2024", "time": "15:34:46", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.00140198428552333, "loss": 0.3533079922199249}}
{"epoch": 57, "time_per_epoch": "0:07:36.734620", "max_epoch": 80, "eta": "2:55:04.896266", "date": "5/7/2024", "time": "15:42:23", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0013043001987809468, "loss": 0.5288237929344177}}
{"epoch": 58, "time_per_epoch": "0:07:22.240888", "max_epoch": 80, "eta": "2:42:09.299538", "date": "5/7/2024", "time": "15:49:46", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0012074541928640665, "loss": 0.5628782510757446}}
{"epoch": 59, "time_per_epoch": "0:07:21.112908", "max_epoch": 80, "eta": "2:34:23.371059", "date": "5/7/2024", "time": "15:57:8", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.001111860977238095, "loss": 0.31243863701820374}}
{"epoch": 60, "time_per_epoch": "0:07:21.676104", "max_epoch": 80, "eta": "2:27:13.522089", "date": "5/7/2024", "time": "16:4:30", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0010179298967280793, "loss": 0.24063703417778015}}
{"epoch": 61, "time_per_epoch": "0:07:22.304313", "max_epoch": 80, "eta": "2:20:03.781941", "date": "5/7/2024", "time": "16:11:53", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0009260631786413252, "loss": 0.4758918583393097}}
{"epoch": 62, "time_per_epoch": "0:07:20.590280", "max_epoch": 80, "eta": "2:12:10.625033", "date": "5/7/2024", "time": "16:19:14", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0008366542103683161, "loss": 0.48947787284851074}}
{"epoch": 63, "time_per_epoch": "0:07:20.320319", "max_epoch": 80, "eta": "2:04:45.445420", "date": "5/7/2024", "time": "16:26:35", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0007500858548375109, "loss": 0.38016995787620544}}
{"epoch": 64, "time_per_epoch": "0:07:19.587276", "max_epoch": 80, "eta": "1:57:13.396410", "date": "5/7/2024", "time": "16:33:55", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0006667288110375084, "loss": 0.39876535534858704}}
{"epoch": 65, "time_per_epoch": "0:07:19.526198", "max_epoch": 80, "eta": "1:49:52.892974", "date": "5/7/2024", "time": "16:41:15", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0005869400266270666, "loss": 0.4667430520057678}}
{"epoch": 66, "time_per_epoch": "0:07:20.005884", "max_epoch": 80, "eta": "1:42:40.082382", "date": "5/7/2024", "time": "16:48:36", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0005110611694304271, "loss": 0.29991692304611206}}
{"epoch": 67, "time_per_epoch": "0:07:19.748235", "max_epoch": 80, "eta": "1:35:16.727050", "date": "5/7/2024", "time": "16:55:56", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0004394171643632412, "loss": 0.341851145029068}}
{"epoch": 68, "time_per_epoch": "0:07:19.018176", "max_epoch": 80, "eta": "1:27:48.218115", "date": "5/7/2024", "time": "17:3:16", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.000372314802054194, "loss": 0.3857216238975525}}
{"epoch": 69, "time_per_epoch": "0:07:19.552546", "max_epoch": 80, "eta": "1:20:35.078006", "date": "5/7/2024", "time": "17:10:36", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0003100414251204348, "loss": 0.31028807163238525}}
{"epoch": 70, "time_per_epoch": "0:07:20.258756", "max_epoch": 80, "eta": "1:13:22.587557", "date": "5/7/2024", "time": "17:17:57", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0002528636977223765, "loss": 0.4571707844734192}}
{"epoch": 71, "time_per_epoch": "0:07:18.766678", "max_epoch": 80, "eta": "1:05:48.900103", "date": "5/7/2024", "time": "17:25:16", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.00020102646366682223, "loss": 0.4598176181316376}}
{"epoch": 72, "time_per_epoch": "0:07:19.548373", "max_epoch": 80, "eta": "0:58:36.386986", "date": "5/7/2024", "time": "17:32:36", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.0001547516979481945, "loss": 0.2587343454360962}}
{"epoch": 73, "time_per_epoch": "0:07:19.405268", "max_epoch": 80, "eta": "0:51:15.836877", "date": "5/7/2024", "time": "17:39:56", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 0.00011423755621753237, "loss": 0.8911759853363037}}
{"epoch": 74, "time_per_epoch": "0:07:19.128135", "max_epoch": 80, "eta": "0:43:54.768811", "date": "5/7/2024", "time": "17:47:16", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 7.965752624957046e-05, "loss": 0.45557257533073425}}
{"epoch": 75, "time_per_epoch": "0:07:19.406747", "max_epoch": 80, "eta": "0:36:37.033734", "date": "5/7/2024", "time": "17:54:36", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 5.1159685041454155e-05, "loss": 0.26564571261405945}}
{"epoch": 76, "time_per_epoch": "0:07:19.508782", "max_epoch": 80, "eta": "0:29:18.035129", "date": "5/7/2024", "time": "18:1:56", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 2.8866064724304957e-05, "loss": 0.48253774642944336}}
{"epoch": 77, "time_per_epoch": "0:07:20.132161", "max_epoch": 80, "eta": "0:22:00.396483", "date": "5/7/2024", "time": "18:9:16", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 1.2872130002899723e-05, "loss": 0.6122196912765503}}
{"epoch": 78, "time_per_epoch": "0:07:20.184826", "max_epoch": 80, "eta": "0:14:40.369652", "date": "5/7/2024", "time": "18:16:37", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 3.2463693611489186e-06, "loss": 0.1481819897890091}}
{"epoch": 79, "time_per_epoch": "0:07:18.947498", "max_epoch": 80, "eta": "0:07:18.947498", "date": "5/7/2024", "time": "18:23:57", "status": "RUNNING", "verbosity": "INFO", "message": "Train metrics generated.", "kpi": {"learning_rate": 3.0001783894512056e-08, "loss": 0.2755662798881531}}
{"date": "5/7/2024", "time": "18:23:57", "status": "SUCCESS", "verbosity": "INFO", "message": "Training finished successfully.", "kpi": {"learning_rate": 3.0001783894512056e-08, "loss": 0.2755662798881531}}

This is the output model: checkpoint_epoch_80.tlt

Performance of this model with the evaluation.py script on my validation set is:

Average predicted number of objects(1318 samples): 6.969

Car AP@0.50, 0.50:
bev  AP:83.1364
3d   AP:78.0459
bev mAP: 83.1364
3d mAP: 78.0459

Problem: conversion to TensorRT

Note: for the conversion, max_points_num in the inference section of pointpillar_general.yaml was set to 204800.
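
For context, as far as I understand the exported engine takes a fixed-size points input, so every point cloud has to be padded or truncated to max_points_num before inference; below is a minimal sketch of what I mean (illustrative only, not the actual TAO preprocessing code):

import numpy as np

def pad_points(points: np.ndarray, max_points_num: int = 204800):
    """Pad or truncate an (N, 4) point cloud to the engine's fixed input size.

    Returns the (1, max_points_num, 4) float32 array fed as `points` and the
    real point count fed as the separate `num_points` input.
    """
    n = min(points.shape[0], max_points_num)
    padded = np.zeros((1, max_points_num, points.shape[1]), dtype=np.float32)
    padded[0, :n] = points[:n]
    return padded, np.array([n], dtype=np.int32)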

I tried to convert the model into a TensorRT engine in two ways.

Method 1: export script

First I tried to use the provided export.py script.

Question: why do we set dummy_voxel_num_points and dummy_coords to torch.int32 instead of keeping them as float?
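
My current guess on this (not taken from the script itself): coords holds voxel grid indices and voxel_num_points holds per-voxel point counts, so they are integral quantities and the exported ONNX therefore declares them as INT32 inputs. A rough sketch of what I imagine the dummy export inputs look like, with purely illustrative names and shapes:

import torch

# Illustrative shapes only, not the real export.py values.
max_voxels, max_points_per_voxel, num_point_features = 10000, 32, 4

dummy_voxels = torch.zeros(max_voxels, max_points_per_voxel, num_point_features,
                           dtype=torch.float32)                      # raw point features -> float
dummy_voxel_num_points = torch.zeros(max_voxels, dtype=torch.int32)  # per-voxel point counts
dummy_coords = torch.zeros(max_voxels, 4, dtype=torch.int32)         # (batch, z, y, x) grid indices

# If the count/index tensors were exported as float, the ONNX graph would declare
# FLOAT inputs there, and downstream consumers expecting INT32 would no longer match.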

python nvidia_tao_pytorch/pointcloud/pointpillars/scripts/export.py --cfg_file nvidia_tao_pytorch/pointcloud/pointpillars/tools/cfgs/pointpillar_general.yaml --save_engine path/to/output/checkpoint_epoch_80.engine  --key tlt_encode

I obtain an engine, checkpoint_epoch_80.engine.

However, all metrics drop to zero.

Evaluation with checkpoint_epoch_80.engine

Average predicted number of objects(1318 samples): 1.174

2024-05-16 12:47:50,937   INFO  Car AP@0.50, 0.50:
bev  AP:0.0000
3d   AP:0.0000
bev mAP: 0.0000
3d mAP: 0.0000
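
To rule out a simple I/O mismatch between what my evaluation feeds and what the engine expects, I want to dump the engine bindings (names, shapes, dtypes) first. A minimal sketch using the pre-TRT-9 binding API; the plugin-library path is a placeholder for whatever registers the custom PointPillars plugins on your system:

import ctypes
import tensorrt as trt

# Placeholder path: the custom PointPillars plugins must be registered
# before the engine can be deserialized.
ctypes.CDLL("/path/to/libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)

logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

with open("/path/to/checkpoint_epoch_80.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = "input " if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i),
          tuple(engine.get_binding_shape(i)),
          engine.get_binding_dtype(i))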

Method 2: trtexec

I took the ONNX generated by the export script (checkpoint_epoch_80.onnx, the one after simplification with ONNX GraphSurgeon) and generated the engine checkpoint_epoch_80_trtexec.engine with trtexec:

trtexec --onnx=/path/to/checkpoint_epoch_80.onnx \
        --maxShapes=points:1x204800x4,num_points:1 \
        --minShapes=points:1x204800x4,num_points:1 \
        --optShapes=points:1x204800x4,num_points:1 \
        --fp16 \
        --saveEngine=/path/to/checkpoint_epoch_80_trtexec.engine

However, the metrics are still zero:

Average predicted number of objects(1318 samples): 1.169

Car AP@0.50, 0.50:
bev  AP:0.0000
3d   AP:0.0000
bev mAP: 0.0000
3d mAP: 0.0000

Experiments

I tried to:

  • remove the --fp16 flag when using trtexec
  • use a different docker image, as suggested in other topics
  • convert the non-simplified ONNX with trtexec → this kind of worked, but it was outputting NaNs. To overcome this problem I used the --best flag during conversion. However, the predictions are still wrong (see the raw-output check below).
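
This is the check I plan to run on the trtexec engine to see whether the raw outputs contain NaNs on a single padded frame. It is only a sketch (TensorRT Python API plus pycuda; the plugin-library path is again a placeholder, and the binding names points / num_points are taken from the trtexec shapes above):

import ctypes
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  creates a CUDA context
import pycuda.driver as cuda

ctypes.CDLL("/path/to/libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)  # placeholder path
logger = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(logger, "")

with open("/path/to/checkpoint_epoch_80_trtexec.engine", "rb") as f, trt.Runtime(logger) as rt:
    engine = rt.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Pin the input shapes in case the engine reports them as dynamic (-1).
context.set_binding_shape(engine.get_binding_index("points"), (1, 204800, 4))
context.set_binding_shape(engine.get_binding_index("num_points"), (1,))

# One padded frame; replace the zeros with a real point cloud (see the padding sketch above).
inputs = {"points": np.zeros((1, 204800, 4), dtype=np.float32),
          "num_points": np.array([204800], dtype=np.int32)}

host, dev = [], []
for i in range(engine.num_bindings):
    name = engine.get_binding_name(i)
    shape = tuple(context.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    arr = inputs.get(name, np.zeros(shape, dtype=dtype)).astype(dtype).reshape(shape)
    host.append(arr)
    dev.append(cuda.mem_alloc(arr.nbytes))
    if engine.binding_is_input(i):
        cuda.memcpy_htod(dev[i], arr)

context.execute_v2([int(d) for d in dev])

# Copy the outputs back and report whether any float output contains NaNs.
for i in range(engine.num_bindings):
    if not engine.binding_is_input(i):
        cuda.memcpy_dtoh(host[i], dev[i])
        out = host[i]
        has_nan = bool(np.isnan(out).any()) if np.issubdtype(out.dtype, np.floating) else "n/a"
        print(engine.get_binding_name(i), out.shape, "NaNs:", has_nan)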
