Dear @Morganh,
We are trying to train the Mask2Former instance segmentation (mask2former_inst) model, and training crashes automatically after 1 epoch.
Below is the configuration.
results_dir: /results_inst/
dataset:
  contiguous_id: True
  label_map: /specs/labelmap_inst.json
  train:
    type: 'coco'
    name: "coco_2017_train"
    instance_json: "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"
    img_dir: "/data/raw-data/train"
    batch_size: 8
    num_workers: 2
  val:
    type: 'coco'
    name: "coco_2017_val"
    instance_json: "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"
    img_dir: "/data/raw-data/val"
    batch_size: 1
    num_workers: 2
  test:
    img_dir: /data/raw-data/val
    batch_size: 1
  augmentation:
    train_min_size: [640]
    train_max_size: 640
    train_crop_size: [640, 640]
    test_min_size: 640
    test_max_size: 640
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 1
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 80
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1  # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  input_channel: 3
  input_width: 640
  input_height: 640
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
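For reference, a small check like the following (just a sketch; it assumes pycocotools is available on the host where the data lives) can confirm that the two instance_json files above load cleanly and that their category count matches sem_seg_head.num_classes (80 in the spec):

from pycocotools.coco import COCO

# Sanity-check the COCO annotation files referenced in the dataset section above.
for split, path in [
    ("train", "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"),
    ("val", "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"),
]:
    coco = COCO(path)
    print(
        f"{split}: images={len(coco.getImgIds())}, "
        f"annotations={len(coco.getAnnIds())}, "
        f"categories={len(coco.getCatIds())}"
    )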
Training Section:
print("For multi-GPU, set NUM_TRAIN_GPUS based on your machine.")
os.environ["NUM_TRAIN_GPUS"] = "1"
os.environ["HYDRA_FULL_ERROR"] = "1"
!tao model mask2former train -e $SPECS_DIR/spec_inst1.yaml \
    train.num_gpus=$NUM_TRAIN_GPUS \
    results_dir=$RESULTS_DIR
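(Just in case it matters, a quick check like the one below can be run in the notebook before the cell above; it only verifies that the variables the shell escape expands are set. It assumes SPECS_DIR and RESULTS_DIR were exported earlier in the notebook, as in the standard TAO getting-started notebooks.)

import os

# Print the environment variables the !tao command relies on.
for var in ("SPECS_DIR", "RESULTS_DIR", "NUM_TRAIN_GPUS"):
    print(f"{var} = {os.environ.get(var, '<NOT SET>')}")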
Training logs:
/usr/local/lib/python3.6/pty.py:84: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUSH) at 0x782256094648>
pid, fd = os.forkpty()
For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-13 12:04:17,530 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-13 12:04:17,581 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-13 12:04:17,603 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-13 12:04:17,603 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-13 06:34:21,081 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning:
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=5.39s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
----------------------------------------------
0 | model | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion | 0
----------------------------------------------
47.4 M Trainable params
0 Non-trainable params
47.4 M Total params
189.687 Total estimated model params size (MB)
Sanity Checking: |          | 0/? [00:00<?, ?it/s]
loading annotations into memory...
Done (t=0.88s)
creating index...
index created!
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 2.10it/s]
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide
  iou = total_area_intersect / total_area_union
loading annotations into memory...
Done (t=5.51s)
creating index...
index created!
Epoch 0: 100%|██████████| 6250/6250 [1:25:22<00:00, 1.22it/s, v_num=1, train_loss=6.460, lr=0.0001]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: 0%| | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 1/7927 [00:00<17:36, 7.50it/s]
Validation DataLoader 0: 0%| | 2/7927 [00:00<16:02, 8.23it/s]
Validation DataLoader 0: 0%| | 3/7927 [00:00<15:23, 8.58it/s]
Validation DataLoader 0: 0%| | 4/7927 [00:00<14:18, 9.23it/s]
Validation DataLoader 0: 0%| | 5/7927 [00:00<13:48, 9.56it/s]
Validation DataLoader 0: 0%| | 6/7927 [00:00<12:56, 10.20it/s]
Validation DataLoader 0: 0%| | 7/7927 [00:00<12:38, 10.44it/s]
Validation DataLoader 0: 0%| | 8/7927 [00:00<12:48, 10.31it/s]
Validation DataLoader 0: 0%| | 9/7927 [00:00<12:54, 10.22it/s]
Validation DataLoader 0: 0%| | 10/7927 [00:00<13:02, 10.12it/s]
Validation DataLoader 0: 0%| | 11/7927 [00:01<12:37, 10.45it/s]
Validation DataLoader 0: 0%| | 12/7927 [00:01<12:17, 10.73it/s]
Validation DataLoader 0: 0%| | 13/7927 [00:01<11:59, 10.99it/s]
Validation DataLoader 0: 0%| | 14/7927 [00:01<11:54, 11.08it/s]
Validation DataLoader 0: 0%| | 15/7927 [00:01<12:01, 10.97it/s]
Validation DataLoader 0: 0%| | 16/7927 [00:01<12:10, 10.83it/s]
Validation DataLoader 0: 0%| | 17/7927 [00:01<12:16, 10.74it/s]
Validation DataLoader 0: 0%| | 18/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0: 0%| | 19/7927 [00:01<12:30, 10.54it/s]
Validation DataLoader 0: 0%| | 20/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0: 0%| | 21/7927 [00:01<12:28, 10.56it/s]
Validation DataLoader 0: 0%| | 22/7927 [00:02<12:24, 10.62it/s]
Validation DataLoader 0: 0%| | 23/7927 [00:02<12:28, 10.57it/s]
Validation DataLoader 0: 0%| | 24/7927 [00:02<12:32, 10.50it/s]
Validation DataLoader 0: 0%| | 25/7927 [00:02<12:36, 10.45it/s]
Validation DataLoader 0: 0%| | 26/7927 [00:02<12:39, 10.41it/s]
Validation DataLoader 0: 0%| | 27/7927 [00:02<12:42, 10.36it/s]
Validation DataLoader 0: 0%| | 28/7927 [00:02<12:44, 10.33it/s]
Validation DataLoader 0: 0%| | 29/7927 [00:02<12:40, 10.38it/s]
Validation DataLoader 0: 0%| | 30/7927 [00:02<12:38, 10.41it/s]
Validation DataLoader 0: 0%| | 31/7927 [00:02<12:41, 10.37it/s]
Validation DataLoader 0: 0%| | 32/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0: 0%| | 33/7927 [00:03<12:45, 10.31it/s]
Validation DataLoader 0: 0%| | 34/7927 [00:03<12:47, 10.28it/s]
Validation DataLoader 0: 0%| | 35/7927 [00:03<12:44, 10.32it/s]
Validation DataLoader 0: 0%| | 36/7927 [00:03<12:46, 10.30it/s]
Validation DataLoader 0: 0%| | 37/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0: 0%| | 38/7927 [00:03<12:37, 10.42it/s]
Validation DataLoader 0: 0%| | 39/7927 [00:03<12:35, 10.45it/s]
Validation DataLoader 0: 1%| | 40/7927 [00:03<12:38, 10.40it/s]
Validation DataLoader 0: 1%| | 41/7927 [00:03<12:37, 10.41it/s]
Validation DataLoader 0: 1%| | 42/7927 [00:04<12:36, 10.42it/s]
Validation DataLoader 0: 1%| | 43/7927 [00:04<12:38, 10.40it/s]
...
Validation DataLoader 0: 100%|█████████▉| 7914/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7915/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7916/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7917/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7918/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7919/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7920/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7921/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7922/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7923/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7924/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7925/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7926/7927 [12:51<00:00, 10.28it/s]
Validation DataLoader 0: 100%|██████████| 7927/7927 [12:51<00:00, 10.28it/s]
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide
  iou = total_area_intersect / total_area_union
Epoch 0: 100%|██████████| 6250/6250 [1:38:14<00:00, 1.06it/s, v_num=1, train_loss=6.460, lr=0.0001, val_loss=11.20, mIoU=1.000, all_acc=1.000]
[2025-01-13 08:15:15,069 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-13 08:15:15,082 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-13 08:15:15,085 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'mask2former', 'gpu': ['NVIDIA-RTX-A4000'], 'success': False, 'time_lapsed': 6053} to https://api.tao.ngc.nvidia.com.
[2025-01-13 08:15:16,813 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-13 08:15:16,814 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-13 08:15:16,814 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-01-13 13:45:20,751 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
Where are we making a mistake? Please help.
Thanks.