Hi NVIDIA Devs,
I am trying to train a UNET with the TAO Toolkit in WSL2. Training mostly works, but every couple of epochs it crashes while saving a checkpoint.
Error message:
INFO:tensorflow:Saving checkpoints for step-57304.
2024-07-30 13:15:02,828 [TAO Toolkit] [INFO] tensorflow 76: Saving checkpoints for step-57304.
2024-07-30 13:17:55,288 [TAO Toolkit] [INFO] root 2102: Dst tensor is not initialized.
[[node block_3a_conv_shortcut/kernel/Adam (defined at /tensorflow_core/python/framework/ops.py:1748) ]]
I can continue training from the last checkpoint, but it's a pain to constantly watch the training and restart it every couple of epochs.
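As a stopgap I restart it with a shell loop like the sketch below (adjust the train command to whatever invocation you use; spec path, results dir, and $KEY are placeholders for my actual values; re-running the same command picks up the latest checkpoint in the results dir for me):

# Restart training until it exits cleanly; resuming continues from the
# latest checkpoint in the results dir.
until tao model unet train \
    -e /workspace/unet/specs/train.txt \
    -r /workspace/unet/results \
    -k "$KEY"; do
  echo "Training crashed (exit $?), restarting in 10 s..." >&2
  sleep 10
done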
I googled a bit and only found threads saying there isn't enough GPU memory or CPU RAM, but I have plenty of both (see hardware below). I also reduced batch_size to 1, and the error still occurs.
I am tracing GPU memory with nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1
memory.used: 5070 MiB
memory.total: 16384 MiB
And CPU RAM with free -m -s 1
total: 54223 MiB
used: 3017 MiB
free: 48336 MiB
So I don't see how a lack of memory could trigger the error, if memory is even the problem.
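To correlate the readings with the exact moment of the crash, I now log both in one place with a loop like this (a minimal sketch; the log file name is arbitrary):

# Log GPU and host memory together, once per second, with timestamps.
while true; do
  echo "$(date +%T) | GPU: $(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader -i 0) | RAM: $(free -m | awk '/^Mem:/ {print $3 " / " $2 " MiB"}')"
  sleep 1
done | tee -a mem_trace.log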
Thank you in advance.
More Information:
• Hardware: RTX A4500 Mobile (16 GB VRAM); 64 GB CPU RAM
• Network Type: UNET
• OS: Windows 10 Enterprise → WSL2 Ubuntu 22.04, Docker v24.0.7
• TAO Info:
Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.3.0
published_date: 03/14/2024
• Training Data: 6000 images, ~ 1GB of data in total
• .tao_mounts.json
{
    "Mounts": [
        {
            "source": "/mnt/c/TAO-Toolkit",
            "destination": "/workspace"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
• Training spec file:
random_seed: 42
model_config {
  model_input_width: 400
  model_input_height: 224
  model_input_channels: 3
  num_layers: 18
  all_projections: True
  arch: "resnet"
  use_batch_norm: False
  training_precision {
    backend_floatx: FLOAT32
  }
}
training_config {
  batch_size: 1
  epochs: 20
  log_summary_steps: 10
  checkpoint_interval: 1
  loss: "cross_entropy"
  learning_rate: 0.0001
  regularizer {
    type: L2
    weight: 2e-5
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  visualizer {
    enabled: true
  }
}
dataset_config {
  dataset: "custom"
  augment: False
  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.0
      crop_and_resize_prob: 0.5
    }
    brightness_augmentation {
      delta: 0.2
    }
  }
  input_image_type: "color"
  train_images_path: "/workspace/unet/dataset/train/img"
  train_masks_path: "/workspace/unet/dataset/train/labels"
  val_images_path: "/workspace/unet/dataset/validate/img"
  val_masks_path: "/workspace/unet/dataset/validate/labels"
  test_images_path: "/workspace/unet/dataset/test"
  data_class_config {
    target_classes {
      name: "foreground"
      mapping_class: "foreground"
      label_id: 0
    }
    target_classes {
      name: "background"
      mapping_class: "background"
      label_id: 255
    }
  }
}