Dear @Morganh,
We are trying to train the Mask2Former instance segmentation (mask2former_inst) model, and training crashes automatically after 1 epoch.
Below is the configuration.
results_dir: /results_inst/
dataset:
  contiguous_id: True
  label_map: /specs/labelmap_inst.json
  train:
    type: 'coco'
    name: "coco_2017_train"
    instance_json: "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"
    img_dir: "/data/raw-data/train"
    batch_size: 8
    num_workers: 2
  val:
    type: 'coco'
    name: "coco_2017_val"
    instance_json: "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"
    img_dir: "/data/raw-data/val"
    batch_size: 1
    num_workers: 2
  test:
    img_dir: /data/raw-data/val
    batch_size: 1
  augmentation:
    train_min_size: [640]
    train_max_size: 640
    train_crop_size: [640, 640]
    test_min_size: 640
    test_max_size: 640
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 1
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 80
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1  # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  input_channel: 3
  input_width: 640
  input_height: 640
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
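For reference, a small check like the following (just a sketch; it assumes pycocotools is available on the host where the data lives) can confirm that the two instance_json files above load cleanly and that their category count matches sem_seg_head.num_classes (80 in the spec):

from pycocotools.coco import COCO

# Sanity-check the COCO annotation files referenced in the dataset section above.
for split, path in [
    ("train", "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"),
    ("val", "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"),
]:
    coco = COCO(path)
    print(
        f"{split}: images={len(coco.getImgIds())}, "
        f"annotations={len(coco.getAnnIds())}, "
        f"categories={len(coco.getCatIds())}"
    )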
Training Section:
print("For multi-GPU, set NUM_TRAIN_GPUS based on your machine.")
os.environ["NUM_TRAIN_GPUS"] = "1"
os.environ["HYDRA_FULL_ERROR"] = "1"
!tao model mask2former train -e $SPECS_DIR/spec_inst1.yaml \
    train.num_gpus=$NUM_TRAIN_GPUS \
    results_dir=$RESULTS_DIR
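(Just in case it matters, a quick check like the one below can be run in the notebook before the cell above; it only verifies that the variables the shell escape expands are set. It assumes SPECS_DIR and RESULTS_DIR were exported earlier in the notebook, as in the standard TAO getting-started notebooks.)

import os

# Print the environment variables the !tao command relies on.
for var in ("SPECS_DIR", "RESULTS_DIR", "NUM_TRAIN_GPUS"):
    print(f"{var} = {os.environ.get(var, '<NOT SET>')}")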
Training logs:
/usr/local/lib/python3.6/pty.py:84: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUSH) at 0x782256094648>
pid, fd = os.forkpty()
For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-13 12:04:17,530 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-13 12:04:17,581 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-13 12:04:17,603 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-13 12:04:17,603 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-13 06:34:21,081 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning:
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=5.39s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------------------
0 | model | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion | 0
----------------------------------------------
47.4 M Trainable params
0 Non-trainable params
47.4 M Total params
189.687 Total estimated model params size (MB)
Sanity Checking: |          | 0/? [00:00<?, ?it/s]
loading annotations into memory...
Done (t=0.88s)
creating index...
index created!
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 2.10it/s]
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide
  iou = total_area_intersect / total_area_union
loading annotations into memory...
Done (t=5.51s)
creating index...
index created!
Epoch 0: 100%|██████████| 6250/6250 [1:25:22<00:00, 1.22it/s, v_num=1, train_loss=6.460, lr=0.0001]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0%| | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 1/7927 [00:00<17:36, 7.50it/s]
Validation DataLoader 0: 0%| | 2/7927 [00:00<16:02, 8.23it/s]
Validation DataLoader 0: 0%| | 3/7927 [00:00<15:23, 8.58it/s]
Validation DataLoader 0: 0%| | 4/7927 [00:00<14:18, 9.23it/s]
Validation DataLoader 0: 0%| | 5/7927 [00:00<13:48, 9.56it/s]
Validation DataLoader 0: 0%| | 6/7927 [00:00<12:56, 10.20it/s]
Validation DataLoader 0: 0%| | 7/7927 [00:00<12:38, 10.44it/s]
Validation DataLoader 0: 0%| | 8/7927 [00:00<12:48, 10.31it/s]
Validation DataLoader 0: 0%| | 9/7927 [00:00<12:54, 10.22it/s]
Validation DataLoader 0: 0%| | 10/7927 [00:00<13:02, 10.12it/s]
Validation DataLoader 0: 0%| | 11/7927 [00:01<12:37, 10.45it/s]
Validation DataLoader 0: 0%| | 12/7927 [00:01<12:17, 10.73it/s]
Validation DataLoader 0: 0%| | 13/7927 [00:01<11:59, 10.99it/s]
Validation DataLoader 0: 0%| | 14/7927 [00:01<11:54, 11.08it/s]
Validation DataLoader 0: 0%| | 15/7927 [00:01<12:01, 10.97it/s]
Validation DataLoader 0: 0%| | 16/7927 [00:01<12:10, 10.83it/s]
Validation DataLoader 0: 0%| | 17/7927 [00:01<12:16, 10.74it/s]
Validation DataLoader 0: 0%| | 18/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0: 0%| | 19/7927 [00:01<12:30, 10.54it/s]
Validation DataLoader 0: 0%| | 20/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0: 0%| | 21/7927 [00:01<12:28, 10.56it/s]
Validation DataLoader 0: 0%| | 22/7927 [00:02<12:24, 10.62it/s]
Validation DataLoader 0: 0%| | 23/7927 [00:02<12:28, 10.57it/s]
Validation DataLoader 0: 0%| | 24/7927 [00:02<12:32, 10.50it/s]
Validation DataLoader 0: 0%| | 25/7927 [00:02<12:36, 10.45it/s]
Validation DataLoader 0: 0%| | 26/7927 [00:02<12:39, 10.41it/s]
Validation DataLoader 0: 0%| | 27/7927 [00:02<12:42, 10.36it/s]
Validation DataLoader 0: 0%| | 28/7927 [00:02<12:44, 10.33it/s]
Validation DataLoader 0: 0%| | 29/7927 [00:02<12:40, 10.38it/s]
Validation DataLoader 0: 0%| | 30/7927 [00:02<12:38, 10.41it/s]
Validation DataLoader 0: 0%| | 31/7927 [00:02<12:41, 10.37it/s]
Validation DataLoader 0: 0%| | 32/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0: 0%| | 33/7927 [00:03<12:45, 10.31it/s]
Validation DataLoader 0: 0%| | 34/7927 [00:03<12:47, 10.28it/s]
Validation DataLoader 0: 0%| | 35/7927 [00:03<12:44, 10.32it/s]
Validation DataLoader 0: 0%| | 36/7927 [00:03<12:46, 10.30it/s]
Validation DataLoader 0: 0%| | 37/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0: 0%| | 38/7927 [00:03<12:37, 10.42it/s]
Validation DataLoader 0: 0%| | 39/7927 [00:03<12:35, 10.45it/s]
Validation DataLoader 0: 1%| | 40/7927 [00:03<12:38, 10.40it/s]
Validation DataLoader 0: 1%| | 41/7927 [00:03<12:37, 10.41it/s]
Validation DataLoader 0: 1%| | 42/7927 [00:04<12:36, 10.42it/s]
Validation DataLoader 0: 1%| | 43/7927 [00:04<12:38, 10.40it/s]
...
Validation DataLoader 0: 100%|█████████▉| 7914/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7915/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7916/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7917/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7918/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7919/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7920/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7921/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7922/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7923/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7924/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7925/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7926/7927 [12:51<00:00, 10.28it/s]
Validation DataLoader 0: 100%|██████████| 7927/7927 [12:51<00:00, 10.28it/s]
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide
  iou = total_area_intersect / total_area_union
Epoch 0: 100%|██████████| 6250/6250 [1:38:14<00:00, 1.06it/s, v_num=1, train_loss=6.460, lr=0.0001, val_loss=11.20, mIoU=1.000, all_acc=1.000]
[2025-01-13 08:15:15,069 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-13 08:15:15,082 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-13 08:15:15,085 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'mask2former', 'gpu': ['NVIDIA-RTX-A4000'], 'success': False, 'time_lapsed': 6053} to https://api.tao.ngc.nvidia.com.
[2025-01-13 08:15:16,813 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-13 08:15:16,814 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-13 08:15:16,814 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-01-13 13:45:20,751 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
Where are we making a mistake? Please help.
Thanks.