When checkpoint_interval=10 is specified, no checkpoints are saved while training the NVIDIA TAO 5.3 ReIdentification model.
When the checkpoint interval is set to 1, training generates checkpoints. However, it’s undesirable to use 1 because it saves too many files.
When checkpoint_interval=5 is specified, only some checkpoints are saved. Following is an example:
In the TAO documentation, checkpoint_interval is defined as the interval at which checkpoints are saved, and no other explanation is provided.
Can you please explain how it is determined which epochs are saved, and how we should set checkpoint_interval in the training configuration to get predictable checkpoints?
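To make "predictable" concrete, here is a minimal sketch of the behavior we would naively expect: a checkpoint at every epoch that is a multiple of the interval. This is only an assumption for illustration, not TAO's documented or actual logic. The function name `expected_checkpoint_epochs` and the epoch counts are hypothetical.

```python
def expected_checkpoint_epochs(num_epochs: int, checkpoint_interval: int) -> list[int]:
    """Return the 1-based epochs at which a checkpoint would be saved,
    ASSUMING a plain 'every N-th epoch' rule (not confirmed for TAO)."""
    if checkpoint_interval < 1:
        raise ValueError("checkpoint_interval must be >= 1")
    return [
        epoch
        for epoch in range(1, num_epochs + 1)
        if epoch % checkpoint_interval == 0
    ]


# Hypothetical run lengths, chosen only to illustrate the rule.
print(expected_checkpoint_epochs(30, 10))  # [10, 20, 30]
print(expected_checkpoint_epochs(12, 5))   # [5, 10]
```

Under this assumed rule, an interval larger than the total number of epochs would produce no checkpoints at all, so knowing the real rule matters for choosing the value.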