Channel: TAO Toolkit - NVIDIA Developer Forums

Input to reshape is a tensor with 3067968 values, but the requested shape has 2691200

Dear @Morganh

I am trying to train Mask R-CNN on a custom dataset, but training fails with the error below after a number of steps.

For multi-GPU, change --gpus based on your machine.
2025-01-16 10:59:36,595 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-16 10:59:36,670 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2025-01-16 10:59:36,714 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-01-16 05:29:37.366563: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2025-01-16 05:29:37,403 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2025-01-16 05:29:38.895241: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2025-01-16 05:29:39,015 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:39,046 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:39,050 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:39,316 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-xvt6aibf because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-01-16 05:29:39,496 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2025-01-16 05:29:40.181244: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2025-01-16 05:29:40.195485: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:42,155 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:42,185 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-16 05:29:42,188 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
[INFO] Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt
[MaskRCNN] INFO    : Loading weights from /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt
[MaskRCNN] INFO    : Loading weights from /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt
[MaskRCNN] INFO    : current step from checkpoint: 3964
[INFO] Log file already exists at /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/status.json
[INFO] Starting MaskRCNN training.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmphgad4qnt', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 4
gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb304eebd90>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO    : Create EncryptCheckpointSaverHook.

[MaskRCNN] INFO    : =================================
[MaskRCNN] INFO    :      Start training cycle 01
[MaskRCNN] INFO    : =================================
    
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Building model graph...
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
4 ops no flops stats due to incomplete shapes.
Parsing Inputs...
[MaskRCNN] INFO    : [Training Compute Statistics] 216.3 GFLOPS/image
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmphgad4qnt/model.ckpt-3964
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[GPU 00] Restoring pretrained weights (307 Tensors)
[MaskRCNN] INFO    : Pretrained weights loaded with success...
    
[MaskRCNN] INFO    : Saving checkpoints for epoch 0 into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-0.tlt.
[MaskRCNN] INFO    : Global step 3970 (epoch 1/60): total loss: 1.97247 (rpn score loss: 0.11123 rpn box loss: 0.01986 fast_rcnn class loss: 0.34988 fast_rcnn box loss: 0.28576) learning rate: 0.00081
[MaskRCNN] INFO    : Global step 3980 (epoch 1/60): total loss: 1.62314 (rpn score loss: 0.03085 rpn box loss: 0.01711 fast_rcnn class loss: 0.25197 fast_rcnn box loss: 0.17000) learning rate: 0.00082
[MaskRCNN] INFO    : Global step 3990 (epoch 1/60): total loss: 1.79059 (rpn score loss: 0.07045 rpn box loss: 0.03234 fast_rcnn class loss: 0.34788 fast_rcnn box loss: 0.19229) learning rate: 0.00082
[MaskRCNN] INFO    : Global step 4000 (epoch 1/60): total loss: 1.69635 (rpn score loss: 0.07163 rpn box loss: 0.02998 fast_rcnn class loss: 0.25212 fast_rcnn box loss: 0.19454) learning rate: 0.00082
[MaskRCNN] INFO    : Global step 4010 (epoch 1/60): total loss: 2.10775 (rpn score loss: 0.09997 rpn box loss: 0.01796 fast_rcnn class loss: 0.39998 fast_rcnn box loss: 0.29925) learning rate: 0.00082
[MaskRCNN] INFO    : Global step 4020 (epoch 1/60): total loss: 1.84139 (rpn score loss: 0.07552 rpn box loss: 0.03666 fast_rcnn class loss: 0.29757 fast_rcnn box loss: 0.26355) learning rate: 0.00082
[MaskRCNN] INFO    : Global step 4030 (epoch 1/60): total loss: 2.12437 (rpn score loss: 0.16273 rpn box loss: 0.02378 fast_rcnn class loss: 0.42083 fast_rcnn box loss: 0.29021) learning rate: 0.00083
[MaskRCNN] INFO    : Global step 4040 (epoch 1/60): total loss: 2.04857 (rpn score loss: 0.12899 rpn box loss: 0.03501 fast_rcnn class loss: 0.34103 fast_rcnn box loss: 0.29424) learning rate: 0.00083
[MaskRCNN] INFO    : Global step 4050 (epoch 1/60): total loss: 1.90148 (rpn score loss: 0.05795 rpn box loss: 0.03936 fast_rcnn class loss: 0.32264 fast_rcnn box loss: 0.26968) learning rate: 0.00083
[MaskRCNN] INFO    : Global step 4060 (epoch 1/60): total loss: 2.05801 (rpn score loss: 0.14668 rpn box loss: 0.03704 fast_rcnn class loss: 0.37895 fast_rcnn box loss: 0.27110) learning rate: 0.00083
[MaskRCNN] INFO    : Global step 4070 (epoch 1/60): total loss: 1.79224 (rpn score loss: 0.06716 rpn box loss: 0.03116 fast_rcnn class loss: 0.29775 fast_rcnn box loss: 0.22957) learning rate: 0.00083
[MaskRCNN] INFO    : Global step 4080 (epoch 1/60): total loss: 1.59308 (rpn score loss: 0.04801 rpn box loss: 0.01114 fast_rcnn class loss: 0.27881 fast_rcnn box loss: 0.16689) learning rate: 0.00083
[MaskRCNN] INFO    : Global step 4090 (epoch 1/60): total loss: 2.12668 (rpn score loss: 0.11774 rpn box loss: 0.04123 fast_rcnn class loss: 0.40081 fast_rcnn box loss: 0.33850) learning rate: 0.00084
[MaskRCNN] INFO    : Global step 4100 (epoch 1/60): total loss: 1.71191 (rpn score loss: 0.06594 rpn box loss: 0.01962 fast_rcnn class loss: 0.30247 fast_rcnn box loss: 0.18799) learning rate: 0.00084
[MaskRCNN] INFO    : Global step 4110 (epoch 1/60): total loss: 1.89657 (rpn score loss: 0.05287 rpn box loss: 0.02735 fast_rcnn class loss: 0.33230 fast_rcnn box loss: 0.29379) learning rate: 0.00084
[MaskRCNN] INFO    : Global step 4120 (epoch 1/60): total loss: 2.03658 (rpn score loss: 0.13105 rpn box loss: 0.04745 fast_rcnn class loss: 0.36375 fast_rcnn box loss: 0.28187) learning rate: 0.00084
[MaskRCNN] INFO    : Global step 4130 (epoch 1/60): total loss: 1.93204 (rpn score loss: 0.07883 rpn box loss: 0.03727 fast_rcnn class loss: 0.37148 fast_rcnn box loss: 0.23915) learning rate: 0.00084
[MaskRCNN] INFO    : Global step 4140 (epoch 1/60): total loss: 2.08287 (rpn score loss: 0.11430 rpn box loss: 0.03124 fast_rcnn class loss: 0.39994 fast_rcnn box loss: 0.27192) learning rate: 0.00085
[MaskRCNN] INFO    : Global step 4150 (epoch 1/60): total loss: 2.11710 (rpn score loss: 0.09923 rpn box loss: 0.02143 fast_rcnn class loss: 0.37232 fast_rcnn box loss: 0.36772) learning rate: 0.00085
[MaskRCNN] INFO    : Global step 4160 (epoch 1/60): total loss: 1.76273 (rpn score loss: 0.04779 rpn box loss: 0.01587 fast_rcnn class loss: 0.29432 fast_rcnn box loss: 0.26516) learning rate: 0.00085
[MaskRCNN] INFO    : Global step 4170 (epoch 1/60): total loss: 1.71858 (rpn score loss: 0.05151 rpn box loss: 0.03616 fast_rcnn class loss: 0.25626 fast_rcnn box loss: 0.19708) learning rate: 0.00085
[MaskRCNN] INFO    : Global step 4180 (epoch 1/60): total loss: 1.94979 (rpn score loss: 0.07615 rpn box loss: 0.04014 fast_rcnn class loss: 0.34977 fast_rcnn box loss: 0.26231) learning rate: 0.00085
[MaskRCNN] INFO    : Global step 4190 (epoch 1/60): total loss: 2.23382 (rpn score loss: 0.13469 rpn box loss: 0.02437 fast_rcnn class loss: 0.46003 fast_rcnn box loss: 0.32868) learning rate: 0.00085
[MaskRCNN] INFO    : Global step 4200 (epoch 1/60): total loss: 1.79038 (rpn score loss: 0.04073 rpn box loss: 0.02832 fast_rcnn class loss: 0.32231 fast_rcnn box loss: 0.21682) learning rate: 0.00086
[MaskRCNN] INFO    : Global step 4210 (epoch 1/60): total loss: 1.99114 (rpn score loss: 0.10510 rpn box loss: 0.04607 fast_rcnn class loss: 0.36540 fast_rcnn box loss: 0.26908) learning rate: 0.00086
[MaskRCNN] INFO    : Global step 4220 (epoch 1/60): total loss: 1.96724 (rpn score loss: 0.13704 rpn box loss: 0.02670 fast_rcnn class loss: 0.34379 fast_rcnn box loss: 0.25319) learning rate: 0.00086
[MaskRCNN] INFO    : Global step 4230 (epoch 1/60): total loss: 1.98791 (rpn score loss: 0.12831 rpn box loss: 0.03703 fast_rcnn class loss: 0.31502 fast_rcnn box loss: 0.27176) learning rate: 0.00086
[MaskRCNN] INFO    : Global step 4240 (epoch 1/60): total loss: 2.07191 (rpn score loss: 0.10061 rpn box loss: 0.01850 fast_rcnn class loss: 0.31984 fast_rcnn box loss: 0.37057) learning rate: 0.00086
[MaskRCNN] INFO    : Global step 4250 (epoch 1/60): total loss: 1.66154 (rpn score loss: 0.06737 rpn box loss: 0.05524 fast_rcnn class loss: 0.22960 fast_rcnn box loss: 0.17090) learning rate: 0.00086
[MaskRCNN] INFO    : Global step 4260 (epoch 1/60): total loss: 1.95168 (rpn score loss: 0.04300 rpn box loss: 0.03719 fast_rcnn class loss: 0.35215 fast_rcnn box loss: 0.30919) learning rate: 0.00087
[MaskRCNN] INFO    : Global step 4270 (epoch 1/60): total loss: 1.66803 (rpn score loss: 0.05000 rpn box loss: 0.01810 fast_rcnn class loss: 0.29840 fast_rcnn box loss: 0.17104) learning rate: 0.00087
[MaskRCNN] INFO    : Global step 4280 (epoch 1/60): total loss: 2.07519 (rpn score loss: 0.11418 rpn box loss: 0.03251 fast_rcnn class loss: 0.38866 fast_rcnn box loss: 0.27565) learning rate: 0.00087
[MaskRCNN] INFO    : Global step 4290 (epoch 1/60): total loss: 1.52216 (rpn score loss: 0.02900 rpn box loss: 0.00757 fast_rcnn class loss: 0.22077 fast_rcnn box loss: 0.15350) learning rate: 0.00087
[MaskRCNN] INFO    : Global step 4300 (epoch 1/60): total loss: 1.75907 (rpn score loss: 0.11032 rpn box loss: 0.02966 fast_rcnn class loss: 0.27160 fast_rcnn box loss: 0.21476) learning rate: 0.00087
.
.
.

[MaskRCNN] INFO    : Global step 11240 (epoch 1/60): total loss: 1.87170 (rpn score loss: 0.08811 rpn box loss: 0.01817 fast_rcnn class loss: 0.33792 fast_rcnn box loss: 0.22661) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11250 (epoch 1/60): total loss: 1.52081 (rpn score loss: 0.03515 rpn box loss: 0.01659 fast_rcnn class loss: 0.24017 fast_rcnn box loss: 0.13908) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11260 (epoch 1/60): total loss: 1.56932 (rpn score loss: 0.04662 rpn box loss: 0.03519 fast_rcnn class loss: 0.22036 fast_rcnn box loss: 0.14993) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11270 (epoch 1/60): total loss: 1.69349 (rpn score loss: 0.04376 rpn box loss: 0.01922 fast_rcnn class loss: 0.30813 fast_rcnn box loss: 0.16890) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11280 (epoch 1/60): total loss: 1.94548 (rpn score loss: 0.11634 rpn box loss: 0.02167 fast_rcnn class loss: 0.35849 fast_rcnn box loss: 0.23329) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11290 (epoch 1/60): total loss: 1.45143 (rpn score loss: 0.02866 rpn box loss: 0.00907 fast_rcnn class loss: 0.20220 fast_rcnn box loss: 0.11106) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11300 (epoch 1/60): total loss: 1.57107 (rpn score loss: 0.06977 rpn box loss: 0.02299 fast_rcnn class loss: 0.22030 fast_rcnn box loss: 0.13345) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11310 (epoch 1/60): total loss: 1.79068 (rpn score loss: 0.12115 rpn box loss: 0.01796 fast_rcnn class loss: 0.32999 fast_rcnn box loss: 0.19124) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11320 (epoch 1/60): total loss: 1.62610 (rpn score loss: 0.06707 rpn box loss: 0.01374 fast_rcnn class loss: 0.27831 fast_rcnn box loss: 0.16727) learning rate: 0.00100
[MaskRCNN] INFO    : Global step 11330 (epoch 1/60): total loss: 1.65370 (rpn score loss: 0.05514 rpn box loss: 0.04094 fast_rcnn class loss: 0.27206 fast_rcnn box loss: 0.16764) learning rate: 0.00100
[INFO]  Input to reshape is a tensor with 3067968 values, but the requested shape has 2691200
	 [[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
	 [[cluster_14_1/xla_compile]]
	 [[IteratorGetNext]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_1115}} Input to reshape is a tensor with 3067968 values, but the requested shape has 2691200
	 [[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
	 [[cluster_14_1/xla_compile]]
	 [[IteratorGetNext]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 321, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 313, in main
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 300, in main
    run_executer(RUN_CONFIG, train_input_fn, eval_input_fn)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/train.py", line 106, in run_executer
    executer.train_and_eval(train_input_fn=train_input_fn, eval_input_fn=eval_input_fn)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/executer/distributed_executer.py", line 412, in train_and_eval
    train_estimator.train(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1193, in _train_model_default
    return self._train_with_estimator_spec(estimator_spec, worker_hooks,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 750, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1413, in run
    outputs = _WrappedSession.run(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Input to reshape is a tensor with 3067968 values, but the requested shape has 2691200
	 [[{{node parser/process_gt_masks_for_training/Reshape_2}}]]
	 [[cluster_14_1/xla_compile]]
	 [[IteratorGetNext]]
Execution status: FAIL
2025-01-16 11:49:44,668 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
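
A quick arithmetic check on the two numbers in the error, as a sketch only. It assumes that the parser pads each ground-truth mask by 2 pixels per side, so one mask holds (gt_mask_size + 4)^2 values; that padding is my assumption and is not confirmed by the log. The helper name `mask_instances` is hypothetical.

```python
# Sketch: factor the tensor sizes from the reshape error.
# Assumption (not confirmed by the log): each ground-truth mask is padded
# to (gt_mask_size + 4) x (gt_mask_size + 4) before the reshape.
GT_MASK_SIZE = 112  # maskrcnn_config.gt_mask_size from the spec file


def mask_instances(num_values, gt_mask_size=GT_MASK_SIZE):
    """Return (instance_count, remainder) for a flat mask tensor size."""
    per_mask = (gt_mask_size + 4) ** 2  # 116 * 116 = 13456
    return divmod(num_values, per_mask)


actual = mask_instances(3067968)     # size of the tensor fed to reshape
requested = mask_instances(2691200)  # size of the requested shape

print("actual:", actual)        # actual: (228, 0)
print("requested:", requested)  # requested: (200, 0)
```

Both sizes divide evenly under that assumption, which would mean one training image carries 228 mask instances while the reshape expects a fixed 200. If so, the crash depends on which image is drawn, which would fit the failure appearing only after several thousand steps.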

I am using the config below.
I have only a single class, i.e. Carton.

seed: 123
use_amp: False
warmup_steps: 5000
checkpoint: "/workspace/tao-experiments/mask_rcnn/pretrained_resnet50/pretrained_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[12500, 125000, 375000]"
learning_rate_decay_levels: "[0.1, 0.05, 0.01]"
total_steps: 750000
train_batch_size: 4
eval_batch_size: 4
num_steps_per_eval: 12500
momentum: 0.9
l2_weight_decay: 0.00004
warmup_learning_rate: 0.0001
init_learning_rate: 0.001
num_examples_per_epoch: 50000

data_config {
    image_size: "(640, 640)"
    augment_input_data: True
    eval_samples: 7927
    training_file_pattern: "/workspace/tao-experiments/data/maskrcnn/train*.tfrecord"
    validation_file_pattern: "/workspace/tao-experiments/data/maskrcnn/val*.tfrecord"
    val_json_file: "/workspace/tao-experiments/data/raw-data/annotations/val.json"

    # dataset specific parameters
    num_classes: 2  # Including background
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 128
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 256
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
    
}

TF-Records logs for training:

2025-01-15 14:56:30,067 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-15 14:56:30,130 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2025-01-15 14:56:30,177 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-01-15 09:26:30.801687: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2025-01-15 09:26:30,843 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2025-01-15 09:26:32.312489: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2025-01-15 09:26:32,432 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-15 09:26:32,463 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-15 09:26:32,466 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-15 09:26:32,739 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-xg0743af because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-01-15 09:26:32,920 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2025-01-15 09:26:33.619547: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2025-01-15 09:26:33.633191: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/train
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/train
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:On image 0 of 50000
INFO:tensorflow:On image 0 of 50000
INFO:tensorflow:On image 100 of 50000
INFO:tensorflow:On image 100 of 50000
INFO:tensorflow:On image 200 of 50000
INFO:tensorflow:On image 200 of 50000
INFO:tensorflow:On image 300 of 50000
INFO:tensorflow:On image 300 of 50000
INFO:tensorflow:On image 400 of 50000
INFO:tensorflow:On image 400 of 50000
INFO:tensorflow:On image 500 of 50000
INFO:tensorflow:On image 500 of 50000

.
.
.
INFO:tensorflow:On image 49900 of 50000
INFO:tensorflow:On image 49900 of 50000
INFO:tensorflow:Finished writing, skipped 0 annotations.
INFO:tensorflow:Finished writing, skipped 0 annotations.
Execution status: PASS
2025-01-15 15:20:06,391 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

TF-Records logs for Validation:

2025-01-15 15:20:07,150 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-15 15:20:07,219 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2025-01-15 15:20:07,267 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-01-15 09:50:07.954723: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2025-01-15 09:50:07,992 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2025-01-15 09:50:09.528172: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2025-01-15 09:50:09,655 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-15 09:50:09,686 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-15 09:50:09,690 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-15 09:50:09,974 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-5yf7qto5 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-01-15 09:50:10,158 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2025-01-15 09:50:10.859848: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2025-01-15 09:50:10.874129: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/val
INFO:tensorflow:writing to output path: /workspace/tao-experiments/data/maskrcnn/val
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:Building bounding box index.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:0 images are missing bboxes.
INFO:tensorflow:On image 0 of 7927
INFO:tensorflow:On image 0 of 7927
INFO:tensorflow:On image 100 of 7927
.
.
.
.

INFO:tensorflow:On image 7900 of 7927
INFO:tensorflow:Finished writing, skipped 0 annotations.
INFO:tensorflow:Finished writing, skipped 0 annotations.
Execution status: PASS
2025-01-15 15:23:52,878 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.


I also went through the post "Training doesn't converge for Mapillary Vistas Dataset training with MaskRCNN - #36 by edit_or",
but it did not make things clear to me. I am unable to find max_num_instances in my config.
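
Since max_num_instances is not exposed in the spec file, one way to test whether the dataset exceeds a fixed instance cap is to count annotations per image in the COCO-format training JSON. The sketch below is only illustrative: the function name `images_over_limit` is hypothetical, and the default limit of 200 is inferred from the error message (2691200 = 200 x 116 x 116), not taken from TAO documentation.

```python
# Sketch: list images whose annotation count exceeds an assumed cap.
from collections import Counter


def images_over_limit(coco, limit=200, skip_crowd=True):
    """Return {image_id: annotation_count} for images above `limit`.

    `coco` is a dict loaded from a COCO-format annotation file.
    `limit=200` is an assumption inferred from the reshape error.
    """
    counts = Counter()
    for ann in coco.get("annotations", []):
        # Mirror skip_crowd_during_training: optionally ignore crowd boxes.
        if skip_crowd and ann.get("iscrowd", 0):
            continue
        counts[ann["image_id"]] += 1
    return {img_id: n for img_id, n in counts.items() if n > limit}


# Tiny in-memory example with a cap of 2 instead of 200.
sample = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [
        {"image_id": 1, "iscrowd": 0},
        {"image_id": 1, "iscrowd": 0},
        {"image_id": 1, "iscrowd": 0},
        {"image_id": 2, "iscrowd": 0},
        {"image_id": 2, "iscrowd": 1},
    ],
}
print(images_over_limit(sample, limit=2))  # {1: 3}
```

For the real dataset, load the training annotation file with `json.load` and pass the resulting dict in. Any image it reports could then be removed or have its annotations trimmed before regenerating the TFRecords, assuming the cap really is the cause.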

Please suggest how to resolve this issue.
