Dear @Morganh
I am trying to train Mask R-CNN on a custom dataset, but training fails with the error below after a number of steps.
For multi-GPU, change --gpus based on your machine.
2025-01-16 10:59:36,595 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-16 10:59:36,670 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2025-01-16 10:59:36,714 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-01-16 05:29:37.366563: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2025-01-16 05:29:37,403 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2025-01-16 05:29:38.895241: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2025-01-16 05:29:39,015 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:39,046 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:39,050 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:39,316 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-xvt6aibf because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-01-16 05:29:39,496 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2025-01-16 05:29:40.181244: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2025-01-16 05:29:40.195485: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:42,155 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:42,185 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:42,188 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt
[MaskRCNN] INFO : Loading weights from /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt
[MaskRCNN] INFO : Loading weights from /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt
[MaskRCNN] INFO : current step from checkpoint: 3964
[INFO] Log file already exists at /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/status.json
[INFO] Starting MaskRCNN training.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmphgad4qnt', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 4
gpu_options {
allow_growth: true
force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: TWO
}
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb304eebd90>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO : Create EncryptCheckpointSaverHook.
[MaskRCNN] INFO : =================================
[MaskRCNN] INFO : Start training cycle 01
[MaskRCNN] INFO : =================================
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : Building model graph...
[MaskRCNN] INFO : ***********************
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO : [Training Compute Statistics] 216.3 GFLOPS/image
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmphgad4qnt/model.ckpt-3964
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (307 Tensors)
[MaskRCNN] INFO : Pretrained weights loaded with success...
[MaskRCNN] INFO : Saving checkpoints for epoch 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt.
[MaskRCNN] INFO : Global step 3970 (epoch 1/60): total loss: 1.97247 (rpn score loss: 0.11123 rpn box loss: 0.01986 fast_rcnn class loss: 0.34988 fast_rcnn box loss: 0.28576) learning rate: 0.00081
[MaskRCNN] INFO : Global step 3980 (epoch 1/60): total loss: 1.62314 (rpn score loss: 0.03085 rpn box loss: 0.01711 fast_rcnn class loss: 0.25197 fast_rcnn box loss: 0.17000) learning rate: 0.00082
[MaskRCNN] INFO : Global step 3990 (epoch 1/60): total loss: 1.79059 (rpn score loss: 0.07045 rpn box loss: 0.03234 fast_rcnn class loss: 0.34788 fast_rcnn box loss: 0.19229) learning rate: 0.00082
[MaskRCNN] INFO : Global step 4000 (epoch 1/60): total loss: 1.69635 (rpn score loss: 0.07163 rpn box loss: 0.02998 fast_rcnn class loss: 0.25212 fast_rcnn box loss: 0.19454) learning rate: 0.00082
[MaskRCNN] INFO : Global step 4010 (epoch 1/60): total loss: 2.10775 (rpn score loss: 0.09997 rpn box loss: 0.01796 fast_rcnn class loss: 0.39998 fast_rcnn box loss: 0.29925) learning rate: 0.00082
[MaskRCNN] INFO : Global step 4020 (epoch 1/60): total loss: 1.84139 (rpn score loss: 0.07552 rpn box loss: 0.03666 fast_rcnn class loss: 0.29757 fast_rcnn box loss: 0.26355) learning rate: 0.00082
[MaskRCNN] INFO : Global step 4030 (epoch 1/60): total loss: 2.12437 (rpn score loss: 0.16273 rpn box loss: 0.02378 fast_rcnn class loss: 0.42083 fast_rcnn box loss: 0.29021) learning rate: 0.00083
[MaskRCNN] INFO : Global step 4040 (epoch 1/60): total loss: 2.04857 (rpn score loss: 0.12899 rpn box loss: 0.03501 fast_rcnn class loss: 0.34103 fast_rcnn box loss: 0.29424) learning rate: 0.00083
[MaskRCNN] INFO : Global step 4050 (epoch 1/60): total loss: 1.90148 (rpn score loss: 0.05795 rpn box loss: 0.03936 fast_rcnn class loss: 0.32264 fast_rcnn box loss: 0.26968) learning rate: 0.00083
[MaskRCNN] INFO : Global step 4060 (epoch 1/60): total loss: 2.05801 (rpn score loss: 0.14668 rpn box loss: 0.03704 fast_rcnn class loss: 0.37895 fast_rcnn box loss: 0.27110) learning rate: 0.00083
[MaskRCNN] INFO : Global step 4070 (epoch 1/60): total loss: 1.79224 (rpn score loss: 0.06716 rpn box loss: 0.03116 fast_rcnn class loss: 0.29775 fast_rcnn box loss: 0.22957) learning rate: 0.00083
[MaskRCNN] INFO : Global step 4080 (epoch 1/60): total loss: 1.59308 (rpn score loss: 0.04801 rpn box loss: 0.01114 fast_rcnn class loss: 0.27881 fast_rcnn box loss: 0.16689) learning rate: 0.00083
[MaskRCNN] INFO : Global step 4090 (epoch 1/60): total loss: 2.12668 (rpn score loss: 0.11774 rpn box loss: 0.04123 fast_rcnn class loss: 0.40081 fast_rcnn box loss: 0.33850) learning rate: 0.00084
[MaskRCNN] INFO : Global step 4100 (epoch 1/60): total loss: 1.71191 (rpn score loss: 0.06594 rpn box loss: 0.01962 fast_rcnn class loss: 0.30247 fast_rcnn box loss: 0.18799) learning rate: 0.00084
[MaskRCNN] INFO : Global step 4110 (epoch 1/60): total loss: 1.89657 (rpn score loss: 0.05287 rpn box loss: 0.02735 fast_rcnn class loss: 0.33230 fast_rcnn box loss: 0.29379) learning rate: 0.00084
[MaskRCNN] INFO : Global step 4120 (epoch 1/60): total loss: 2.03658 (rpn score loss: 0.13105 rpn box loss: 0.04745 fast_rcnn class loss: 0.36375 fast_rcnn box loss: 0.28187) learning rate: 0.00084
[MaskRCNN] INFO : Global step 4130 (epoch 1/60): total loss: 1.93204 (rpn score loss: 0.07883 rpn box loss: 0.03727 fast_rcnn class loss: 0.37148 fast_rcnn box loss: 0.23915) learning rate: 0.00084
[MaskRCNN] INFO : Global step 4140 (epoch 1/60): total loss: 2.08287 (rpn score loss: 0.11430 rpn box loss: 0.03124 fast_rcnn class loss: 0.39994 fast_rcnn box loss: 0.27192) learning rate: 0.00085
[MaskRCNN] INFO : Global step 4150 (epoch 1/60): total loss: 2.11710 (rpn score loss: 0.09923 rpn box loss: 0.02143 fast_rcnn class loss: 0.37232 fast_rcnn box loss: 0.36772) learning rate: 0.00085
[MaskRCNN] INFO : Global step 4160 (epoch 1/60): total loss: 1.76273 (rpn score loss: 0.04779 rpn box loss: 0.01587 fast_rcnn class loss: 0.29432 fast_rcnn box loss: 0.26516) learning rate: 0.00085
[MaskRCNN] INFO : Global step 4170 (epoch 1/60): total loss: 1.71858 (rpn score loss: 0.05151 rpn box loss: 0.03616 fast_rcnn class loss: 0.25626 fast_rcnn box loss: 0.19708) learning rate: 0.00085
[MaskRCNN] INFO : Global step 4180 (epoch 1/60): total loss: 1.94979 (rpn score loss: 0.07615 rpn box loss: 0.04014 fast_rcnn class loss: 0.34977 fast_rcnn box loss: 0.26231) learning rate: 0.00085
[MaskRCNN] INFO : Global step 4190 (epoch 1/60): total loss: 2.23382 (rpn score loss: 0.13469 rpn box loss: 0.02437 fast_rcnn class loss: 0.46003 fast_rcnn box loss: 0.32868) learning rate: 0.00085
[MaskRCNN] INFO : Global step 4200 (epoch 1/60): total loss: 1.79038 (rpn score loss: 0.04073 rpn box loss: 0.02832 fast_rcnn class loss: 0.32231 fast_rcnn box loss: 0.21682) learning rate: 0.00086
[MaskRCNN] INFO : Global step 4210 (epoch 1/60): total loss: 1.99114 (rpn score loss: 0.10510 rpn box loss: 0.04607 fast_rcnn class loss: 0.36540 fast_rcnn box loss: 0.26908) learning rate: 0.00086
[MaskRCNN] INFO : Global step 4220 (epoch 1/60): total loss: 1.96724 (rpn score loss: 0.13704 rpn box loss: 0.02670 fast_rcnn class loss: 0.34379 fast_rcnn box loss: 0.25319) learning rate: 0.00086
[MaskRCNN] INFO : Global step 4230 (epoch 1/60): total loss: 1.98791 (rpn score loss: 0.12831 rpn box loss: 0.03703 fast_rcnn class loss: 0.31502 fast_rcnn box loss: 0.27176) learning rate: 0.00086
[MaskRCNN] INFO : Global step 4240 (epoch 1/60): total loss: 2.07191 (rpn score loss: 0.10061 rpn box loss: 0.01850 fast_rcnn class loss: 0.31984 fast_rcnn box loss: 0.37057) learning rate: 0.00086
[MaskRCNN] INFO : Global step 4250 (epoch 1/60): total loss: 1.66154 (rpn score loss: 0.06737 rpn box loss: 0.05524 fast_rcnn class loss: 0.22960 fast_rcnn box loss: 0.17090) learning rate: 0.00086
[MaskRCNN] INFO : Global step 4260 (epoch 1/60): total loss: 1.95168 (rpn score loss: 0.04300 rpn box loss: 0.03719 fast_rcnn class loss: 0.35215 fast_rcnn box loss: 0.30919) learning rate: 0.00087
[MaskRCNN] INFO : Global step 4270 (epoch 1/60): total loss: 1.66803 (rpn score loss: 0.05000 rpn box loss: 0.01810 fast_rcnn class loss: 0.29840 fast_rcnn box loss: 0.17104) learning rate: 0.00087
[MaskRCNN] INFO : Global step 4280 (epoch 1/60): total loss: 2.07519 (rpn score loss: 0.11418 rpn box loss: 0.03251 fast_rcnn class loss: 0.38866 fast_rcnn box loss: 0.27565) learning rate: 0.00087
[MaskRCNN] INFO : Global step 4290 (epoch 1/60): total loss: 1.52216 (rpn score loss: 0.02900 rpn box loss: 0.00757 fast_rcnn class loss: 0.22077 fast_rcnn box loss: 0.15350) learning rate: 0.00087
[MaskRCNN] INFO : Global step 4300 (epoch 1/60): total loss: 1.75907 (rpn score loss: 0.11032 rpn box loss: 0.02966 fast_rcnn class loss: 0.27160 fast_rcnn box loss: 0.21476) learning rate: 0.00087
.
.
.
[MaskRCNN] INFO : Global step 11240 (epoch 1/60): total loss: 1.87170 (rpn score loss: 0.08811 rpn box loss: 0.01817 fast_rcnn class loss: 0.33792 fast_rcnn box loss: 0.22661) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11250 (epoch 1/60): total loss: 1.52081 (rpn score loss: 0.03515 rpn box loss: 0.01659 fast_rcnn class loss: 0.24017 fast_rcnn box loss: 0.13908) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11260 (epoch 1/60): total loss: 1.56932 (rpn score loss: 0.04662 rpn box loss: 0.03519 fast_rcnn class loss: 0.22036 fast_rcnn box loss: 0.14993) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11270 (epoch 1/60): total loss: 1.69349 (rpn score loss: 0.04376 rpn box loss: 0.01922 fast_rcnn class loss: 0.30813 fast_rcnn box loss: 0.16890) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11280 (epoch 1/60): total loss: 1.94548 (rpn score loss: 0.11634 rpn box loss: 0.02167 fast_rcnn class loss: 0.35849 fast_rcnn box loss: 0.23329) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11290 (epoch 1/60): total loss: 1.45143 (rpn score loss: 0.02866 rpn box loss: 0.00907 fast_rcnn class loss: 0.20220 fast_rcnn box loss: 0.11106) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11300 (epoch 1/60): total loss: 1.57107 (rpn score loss: 0.06977 rpn box loss: 0.02299 fast_rcnn class loss: 0.22030 fast_rcnn box loss: 0.13345) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11310 (epoch 1/60): total loss: 1.79068 (rpn score loss: 0.12115 rpn box loss: 0.01796 fast_rcnn class loss: 0.32999 fast_rcnn box loss: 0.19124) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11320 (epoch 1/60): total loss: 1.62610 (rpn score loss: 0.06707 rpn box loss: 0.01374 fast_rcnn class loss: 0.27831 fast_rcnn box loss: 0.16727) learning rate: 0.00100
[MaskRCNN] INFO : Global step 11330 (epoch 1/60): total loss: 1.65370 (rpn score loss: 0.05514 rpn box loss: 0.04094 fast_rcnn class loss: 0.27206 fast_rcnn box loss: 0.16764) learning rate: 0.00100
[INFO] Input to reshape is a tensor with 3067968 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[cluster_14_1/xla_compile]]
[[IteratorGetNext]]
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_1115}} Input to reshape is a tensor with 3067968 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[cluster_14_1/xla_compile]]
[[IteratorGetNext]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 321, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 313, in main
raise e
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 300, in main
run_executer(RUN_CONFIG, train_input_fn, eval_input_fn)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 106, in run_executer
executer.train_and_eval(train_input_fn=train_input_fn, eval_input_fn=eval_input_fn)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py", line 412, in train_and_eval
train_estimator.train(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1193, in _train_model_default
return self._train_with_estimator_spec(estimator_spec, worker_hooks,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 750, in run
return self._sess.run(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1255, in run
return self._sess.run(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
raise value
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1413, in run
outputs = _WrappedSession.run(
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 3067968 values, but the requested shape has 2691200
[[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
[[cluster_14_1/xla_compile]]
[[IteratorGetNext]]
Execution status: FAIL
2025-01-16 11:49:44,668 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
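The two numbers in the reshape error can be decomposed with a bit of arithmetic. The sketch below assumes that the data loader stores each ground-truth mask as a (gt_mask_size + 4) x (gt_mask_size + 4) grid, which is how the open-source Mask R-CNN loader this op name resembles behaves; I have not confirmed that TAO does the same. The helper name `instances_implied` is hypothetical. Under that assumption, both numbers divide exactly, which suggests one image carries more instances than the loader's fixed limit.

```python
# Numbers copied from the InvalidArgumentError in the log.
ACTUAL_VALUES = 3067968      # size of the tensor that was fed to reshape
REQUESTED_VALUES = 2691200   # size of the shape reshape was asked to produce

GT_MASK_SIZE = 112           # gt_mask_size from maskrcnn_config


def instances_implied(num_values, gt_mask_size, pad=4):
    """Return how many masks fit exactly into num_values.

    ASSUMPTION: each mask occupies (gt_mask_size + pad) ** 2 values.
    Raises ValueError if the division is not exact, which would mean
    the assumption does not hold for this number.
    """
    side = gt_mask_size + pad
    count, remainder = divmod(num_values, side * side)
    if remainder:
        raise ValueError(f"{num_values} is not a multiple of {side}x{side}")
    return count


found = instances_implied(ACTUAL_VALUES, GT_MASK_SIZE)
limit = instances_implied(REQUESTED_VALUES, GT_MASK_SIZE)
print(found, limit)  # 228 200
```

If this reading is right, the failing record has 228 annotated instances while the loader expects at most 200.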
I am using the config below.
I have only a single class, i.e. Carton.
seed: 123
use_amp: False
warmup_steps: 5000
checkpoint: "/workspace/tao-experiments/mask_rcnn/pretrained_resnet50/pretrained_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[12500, 125000, 375000]"
learning_rate_decay_levels: "[0.1, 0.05, 0.01]"
total_steps: 750000
train_batch_size: 4
eval_batch_size: 4
num_steps_per_eval: 12500
momentum: 0.9
l2_weight_decay: 0.00004
warmup_learning_rate: 0.0001
init_learning_rate: 0.001
num_examples_per_epoch: 50000
data_config {
image_size: "(640, 640)"
augment_input_data: True
eval_samples: 7927
training_file_pattern: "/workspace/tao-experiments/data/maskrcnn/train*.tfrecord"
validation_file_pattern: "/workspace/tao-experiments/data/maskrcnn/val*.tfrecord"
val_json_file: "/workspace/tao-experiments/data/raw-data/annotations/val.json"
# dataset specific parameters
num_classes: 2 # Including background
skip_crowd_during_training: True
}
maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: True
freeze_blocks: "[0,1]"
gt_mask_size: 112
# Region Proposal Network
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 128
rpn_fg_fraction: 0.5
rpn_min_size: 0.
# Proposal layer.
batch_size_per_im: 256
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.
# Faster-RCNN heads.
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"
# Mask-RCNN heads.
include_mask: True
mrcnn_resolution: 28
# training
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7
# evaluation
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000
test_rpn_nms_thresh: 0.7
# model architecture
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8
# localization loss
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0
}
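As a sanity check on the schedule values in this spec, here is a small sketch that derives the steps per epoch and the warmup learning rate. It assumes a single GPU and a linear warmup from warmup_learning_rate to init_learning_rate over warmup_steps; the linear form is my assumption, although it reproduces the learning rates printed in the training log. The function name `warmup_lr` is hypothetical, and decay after warmup is ignored.

```python
# Values copied from the training spec.
TOTAL_STEPS = 750000
TRAIN_BATCH_SIZE = 4
NUM_EXAMPLES_PER_EPOCH = 50000
WARMUP_STEPS = 5000
WARMUP_LEARNING_RATE = 0.0001
INIT_LEARNING_RATE = 0.001

steps_per_epoch = NUM_EXAMPLES_PER_EPOCH // TRAIN_BATCH_SIZE
num_epochs = TOTAL_STEPS // steps_per_epoch


def warmup_lr(step):
    """Learning rate at a given global step.

    ASSUMPTION: linear interpolation during warmup, constant afterwards
    (the decay steps in learning_rate_steps are not modelled here).
    """
    if step >= WARMUP_STEPS:
        return INIT_LEARNING_RATE
    fraction = step / WARMUP_STEPS
    return WARMUP_LEARNING_RATE + (INIT_LEARNING_RATE - WARMUP_LEARNING_RATE) * fraction


print(steps_per_epoch, num_epochs)   # 12500 60
print(f"{warmup_lr(3970):.5f}")      # 0.00081
print(f"{warmup_lr(4030):.5f}")      # 0.00083
```

With 12500 steps per epoch, the crash near global step 11330 happens before the first epoch and the first evaluation cycle (num_steps_per_eval: 12500) complete.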
TFRecords conversion log for the training set:
2025-01-15 14:56:30,067 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-15 14:56:30,130 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2025-01-15 14:56:30,177 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-01-15 09:26:30.801687: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2025-01-15 09:26:30,843 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2025-01-15 09:26:32.312489: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2025-01-15 09:26:32,432 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-15 09:26:32,463 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-15 09:26:32,466 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-15 09:26:32,739 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-xg0743af because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-01-15 09:26:32,920 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2025-01-15 09:26:33.619547: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2025-01-15 09:26:33.633191: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/train
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/train
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:On image 0 of 50000
INFO:tensorflow:On image 0 of 50000
INFO:tensorflow:On image 100 of 50000
INFO:tensorflow:On image 100 of 50000
INFO:tensorflow:On image 200 of 50000
INFO:tensorflow:On image 200 of 50000
INFO:tensorflow:On image 300 of 50000
INFO:tensorflow:On image 300 of 50000
INFO:tensorflow:On image 400 of 50000
INFO:tensorflow:On image 400 of 50000
INFO:tensorflow:On image 500 of 50000
INFO:tensorflow:On image 500 of 50000
.
.
.
INFO:tensorflow:On image 49900 of 50000
INFO:tensorflow:On image 49900 of 50000
INFO:tensorflow:Finished writing, skipped 0 annotations.
INFO:tensorflow:Finished writing, skipped 0 annotations.
Execution status: PASS
2025-01-15 15:20:06,391 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
TFRecords conversion log for the validation set:
2025-01-15 15:20:07,150 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-15 15:20:07,219 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2025-01-15 15:20:07,267 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-01-15 09:50:07.954723: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2025-01-15 09:50:07,992 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2025-01-15 09:50:09.528172: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2025-01-15 09:50:09,655 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-15 09:50:09,686 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-15 09:50:09,690 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2025-01-15 09:50:09,974 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-5yf7qto5 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-01-15 09:50:10,158 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2025-01-15 09:50:10.859848: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2025-01-15 09:50:10.874129: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/val
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/val
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:On image 0 of 7927
INFO:tensorflow:On image 0 of 7927
INFO:tensorflow:On image 100 of 7927
.
.
.
.
INFO:tensorflow:On image 7900 of 7927
INFO:tensorflow:Finished writing, skipped 0 annotations.
INFO:tensorflow:Finished writing, skipped 0 annotations.
Execution status: PASS
2025-01-15 15:23:52,878 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
I also went through the post "Training doesn't converge for Mapillary Vistas Dataset training with MaskRCNN" (#36 by edit_or), but it did not make things clear to me. I am unable to find max_num_instances in my config.
Please suggest how to resolve this issue.
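To help narrow this down, one check I can run on my side is to count annotations per image in the COCO-style JSON that the TFRecords were built from. The sketch below is a minimal version of that check. The function names and the limit of 200 are assumptions on my part (200 is only what the reshape error seems to imply), and I do not know whether crowd annotations count toward the limit, so that is left as a parameter.

```python
import json
from collections import Counter


def count_instances_per_image(coco, skip_crowd=False):
    """Count annotations per image_id in a COCO-style dict."""
    counts = Counter()
    for ann in coco.get("annotations", []):
        if skip_crowd and ann.get("iscrowd", 0):
            continue
        counts[ann["image_id"]] += 1
    return counts


def images_over_limit(coco, limit=200, skip_crowd=False):
    """Return (count, image_id, file_name) for images above the limit.

    ASSUMPTION: limit=200 is a guess at the loader's per-image cap.
    Results are sorted with the most crowded image first.
    """
    names = {img["id"]: img.get("file_name", "") for img in coco.get("images", [])}
    counts = count_instances_per_image(coco, skip_crowd)
    over = [(n, image_id, names.get(image_id, ""))
            for image_id, n in counts.items() if n > limit]
    return sorted(over, reverse=True)


# Tiny in-memory example; for real data, load the annotation file instead:
#   with open("train.json") as f:   # hypothetical path
#       coco = json.load(f)
demo = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [{"image_id": 1}, {"image_id": 1}, {"image_id": 1}, {"image_id": 2}],
}
print(images_over_limit(demo, limit=2))  # [(3, 1, 'a.jpg')]
```

If this reports images above the limit in my training annotations, the options would be either to drop or split those images, or to raise the limit in the spec if such a parameter exists.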