Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
H100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
LPRnet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.0.0-tf1.15.5
• Training spec file(If have, please share here)
random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 12
  arch: "baseline"
  nlayers: 18 #setting nlayers to be 10 to use baseline10 model
}
training_config {
  batch_size_per_gpu: 2048
  num_epochs: 150
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 1e-5
      soft_start: 0.001
      annealing: 0.5
    }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
  output_width: 100
  output_height: 48
  output_channel: 3
  max_rotate_degree: 5
  rotate_prob: 0.5
  gaussian_kernel_size: 5
  gaussian_kernel_size: 7
  gaussian_kernel_size: 15
  blur_prob: 0.5
  reverse_color_prob: 0.5
  keep_original_prob: 0.3
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/train/labels"
    image_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/train/images"
  }
  characters_list_file: "/workspace/tao-training/LPRnet_training/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/val/labels"
    image_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/val/images"
  }
}
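For reference, the learning rate implied by the soft_start_annealing_schedule values in the spec can be roughed out as follows. This is only a sketch under an assumption: that the schedule ramps log-linearly from min_learning_rate to max_learning_rate over the soft_start fraction of training, holds at the maximum, and then decays log-linearly back to the minimum once progress passes the annealing fraction. The function name is hypothetical and is not part of the TAO API; the real scheduler may differ in detail.

```python
import math


def soft_start_annealing_lr(progress, min_lr=1e-6, max_lr=1e-5,
                            soft_start=0.001, annealing=0.5):
    """Return an approximate learning rate for training progress in [0, 1].

    Assumption: log-linear ramp up, plateau, log-linear ramp down.
    Default values are copied from the spec file in this post.
    """
    if progress < soft_start:
        # Warm-up: interpolate in log space from min_lr to max_lr.
        frac = progress / soft_start
        return math.exp(math.log(min_lr)
                        + frac * (math.log(max_lr) - math.log(min_lr)))
    if progress < annealing:
        # Plateau at the maximum learning rate.
        return max_lr
    # Annealing: interpolate in log space from max_lr back to min_lr.
    frac = (progress - annealing) / (1.0 - annealing)
    return math.exp(math.log(max_lr)
                    + frac * (math.log(min_lr) - math.log(max_lr)))


if __name__ == "__main__":
    num_epochs = 150  # from training_config
    for epoch in (0, 5, 75, 112, 150):
        lr = soft_start_annealing_lr(epoch / num_epochs)
        print(f"epoch {epoch:3d}: lr ~ {lr:.3e}")
```

Under this assumption, soft_start: 0.001 with 150 epochs means the warm-up lasts well under one epoch, so by the fifth epoch the learning rate would already sit at max_learning_rate.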
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Missing ranks:
0: [training/DistributedSGD_Allreduce/cond/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_58_0, training/DistributedSGD_Allreduce/cond_1/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_57_0, training/DistributedSGD_Allreduce/cond_10/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_53_0, training/DistributedSGD_Allreduce/cond_11/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_51_0, training/DistributedSGD_Allreduce/cond_12/HorovodAllreduce_training_DistributedSGD_gradients_gradients_bn2a_branch1_FusedBatchNormV3_grad_FusedBatchNormGradV3_1, training/DistributedSGD_Allreduce/cond_13/HorovodAllreduce_training_DistributedSGD_gradients_gradients_bn2a_branch1_FusedBatchNormV3_grad_FusedBatchNormGradV3_2 ...]
This is the error I get during training, when the model is saved at the fifth epoch.
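Note that validation_period_during_training is 5 in the spec, so the fifth epoch would also be the first epoch at which validation runs. That may be relevant to when the error shows up. A minimal sketch of which epochs trigger validation, assuming epochs are counted from 1 and validation runs at every multiple of the period (the helper name is hypothetical, not a TAO function):

```python
def validation_epochs(num_epochs=150, period=5):
    """List the epochs at which validation would run.

    Assumption: epochs are 1-based and validation fires whenever
    the epoch number is a multiple of the period. Defaults come from
    num_epochs and validation_period_during_training in the spec.
    """
    return [epoch for epoch in range(1, num_epochs + 1)
            if epoch % period == 0]


if __name__ == "__main__":
    epochs = validation_epochs()
    print("first validation epoch:", epochs[0])
    print("number of validation runs:", len(epochs))
```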