Channel: TAO Toolkit - NVIDIA Developer Forums

Missing ranks


Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
H100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
LPRnet
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
5.0.0-tf1.15.5
• Training spec file (if you have one, please share it here)

random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 12
  arch: "baseline"
  nlayers: 18 # baseline18 model (set nlayers to 10 for baseline10)
}
training_config {
  batch_size_per_gpu: 2048
  num_epochs: 150
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 1e-5
      soft_start: 0.001
      annealing: 0.5
    }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
    output_width: 100
    output_height: 48
    output_channel: 3
    max_rotate_degree: 5
    rotate_prob: 0.5
    gaussian_kernel_size: 5
    gaussian_kernel_size: 7
    gaussian_kernel_size: 15
    blur_prob: 0.5
    reverse_color_prob: 0.5
    keep_original_prob: 0.3
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/train/labels"
    image_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/train/images"
  }
  characters_list_file: "/workspace/tao-training/LPRnet_training/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/val/labels"
    image_directory_path: "/workspace/tao-training/LPRnet_training/dataset/char/val/images"
  }
}

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

Missing ranks:
0: [training/DistributedSGD_Allreduce/cond/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_58_0, training/DistributedSGD_Allreduce/cond_1/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_57_0, training/DistributedSGD_Allreduce/cond_10/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_53_0, training/DistributedSGD_Allreduce/cond_11/HorovodAllreduce_training_DistributedSGD_gradients_gradients_AddN_51_0, training/DistributedSGD_Allreduce/cond_12/HorovodAllreduce_training_DistributedSGD_gradients_gradients_bn2a_branch1_FusedBatchNormV3_grad_FusedBatchNormGradV3_1, training/DistributedSGD_Allreduce/cond_13/HorovodAllreduce_training_DistributedSGD_gradients_gradients_bn2a_branch1_FusedBatchNormV3_grad_FusedBatchNormGradV3_2 ...]

This is the error I am facing during training; it appears when the checkpoint is saved at the fifth epoch.
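For context: "Missing ranks" is Horovod's stall-check warning. It fires when some workers have submitted gradient tensors for allreduce but the listed ranks have not joined within the timeout (60 seconds by default), which can happen when one rank is busy writing a checkpoint at an epoch boundary. A hedged workaround sketch, assuming your container allows setting Horovod environment variables before launching training (the variable names are Horovod's; the values here are illustrative):

```shell
# Raise Horovod's stall-check warning threshold (default is 60 seconds),
# so a slow checkpoint save at the epoch boundary no longer trips the warning.
export HOROVOD_STALL_CHECK_TIME_SECONDS=600

# Or silence the stall-check warning entirely (this only suppresses the
# message; it does not change how long ranks actually wait on allreduce):
# export HOROVOD_STALL_CHECK_DISABLE=1
```

If the run actually hangs rather than merely printing the warning, it is worth verifying that every GPU process launched and that the GPU count passed to the train command matches the devices visible inside the container.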

2 posts - 2 participants
