Channel: TAO Toolkit - NVIDIA Developer Forums

The container stops in the middle of TAO training


I am training a model using the TAO Toolkit in WSL2 with GPU support, but the container stops automatically after 1-2 epochs once training starts, and the training speed is also relatively slow.

My setup is as follows:

WSL2 Distribution: Ubuntu 20.04 (Version: 2)
CUDA Toolkit Version: 12.2
NVIDIA Driver Version: 535.183.01
GPU: NVIDIA GeForce RTX 4090
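
For reference, GPU visibility can be checked both from the WSL2 shell and from inside a Docker container like this (the CUDA image tag below is only an example, not the image TAO itself uses):

# Inside the WSL2 Ubuntu shell: confirm the driver and GPU are visible
nvidia-smi

# Inside a container with GPU passthrough (requires the NVIDIA Container Toolkit;
# the image tag is only an example)
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi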

This is my spec file:

random_seed: 42
ssd_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
  scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
  two_boxes_for_ar1: true
  clip_boxes: false
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
  freeze_bn: false
  freeze_blocks: 0
}
training_config {
  batch_size_per_gpu: 1
  num_epochs: 100
  enable_qat: false
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2
      soft_start: 0.15
      annealing: 0.8
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 1
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 1920
  output_height: 1080
  output_channel: 3
}
dataset_config {
  data_sources: {
    #tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_train*"
    image_directory_path: "/workspace/tao-experiments/data/training/image"
    label_directory_path: "/workspace/tao-experiments/data/training/label"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "person"
    value: "person"
  }
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "heavy_vehicle"
    value: "heavy_vehicle"
  }
  target_class_mapping {
    key: "motor"
    value: "motor"
  }
  target_class_mapping {
    key: "tricycle"
    value: "tricycle"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/val/label"
    image_directory_path: "/workspace/tao-experiments/data/val/image"
  }
}

And this is the command I run, followed by the resulting output:
!tao model ssd train --gpus 1 --gpu_index=$GPU_INDEX \
    -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
    -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
    -k $KEY \
    -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/weights/ssd_resnet18_epoch_001.hdf5 \
    --initial_epoch 3

Total params: 13,402,472
Trainable params: 13,379,624
Non-trainable params: 22,848


2024-12-07 03:47:51,360 [TAO Toolkit] [INFO] main 356: Number of images in the training dataset: 1634
2024-12-07 03:47:51,360 [TAO Toolkit] [INFO] main 358: Number of images in the validation dataset: 163

2024-12-07 03:47:51,844 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/ssd/experiment_dir_unpruned/status.json
2024-12-07 03:47:53,883 [TAO Toolkit] [INFO] root 2102: Starting Training Loop.
Epoch 3/100
1634/1634 [==============================] - 2537s 2s/step - loss: 14.7520
[1733545829.422149] [9e597e0f07e8:249 :f] vfs_fuse.c:424 UCX WARN failed to connect to vfs socket '': Invalid argument

Epoch 00003: saving model to /workspace/tao-experiments/ssd/experiment_dir_unpruned/weights/ssd_resnet18_epoch_003.hdf5
2024-12-07 04:30:44,682 [TAO Toolkit] [INFO] root 2102: Training loop in progress
Epoch 4/100
710/1634 [============>…] - ETA: 19:09 - loss: 13.8059
Execution status: FAIL
2024-12-07 10:15:44,050 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
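
In case it helps narrow things down, this is a rough sketch of checks for whether the container was killed for running out of memory under WSL2, and for whether the GPU is actually busy during training (the container ID and memory sizes are placeholders, not values from my setup):

# Check whether Docker recorded an out-of-memory kill for the stopped container
docker ps -a --last 5
docker inspect <container_id> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'

# Look for OOM kills in the WSL2 kernel log
dmesg | grep -i -E 'killed process|out of memory'

# Watch GPU utilization from a second WSL2 terminal while training runs
nvidia-smi dmon -s u

# WSL2 memory/swap caps can be raised in %UserProfile%\.wslconfig on the Windows side
# (example values only), followed by `wsl --shutdown` and restarting the distribution:
# [wsl2]
# memory=32GB
# swap=16GB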

I would appreciate any insights into what might be causing this issue.
