Quantcast
Channel: TAO Toolkit - NVIDIA Developer Forums
Viewing all articles
Browse latest Browse all 497

Mask2Former_inst model training crashed after 1 epoch

$
0
0

Dear @Morganh ,

We are trying to train Mask2Former_inst model, after 1 epoch training automatically get crashed.

Below it the configuration.

results_dir: /results_inst/
dataset:
  contiguous_id: True
  label_map: /specs/labelmap_inst.json
  train:
    type: 'coco'
    name: "coco_2017_train"
    instance_json: "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"
    img_dir: "/data/raw-data/train"
    batch_size: 8
    num_workers: 2
  val:
    type: 'coco'
    name: "coco_2017_val"
    instance_json: "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"
    img_dir: "/data/raw-data/val"
    batch_size: 1
    num_workers: 2
  test:
    img_dir: /data/raw-data/val
    batch_size: 1
  augmentation:
    train_min_size: [640]
    train_max_size: 640
    train_crop_size: [640, 640]
    test_min_size: 640
    test_max_size: 640
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 1
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 80
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1  # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  input_channel: 3
  input_width: 640
  input_height: 640
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1

Training Section:

print("For multi-GPU, set NUM_TRAIN_GPUS based on your machine.")
os.environ["NUM_TRAIN_GPUS"] = "1"
os.environ["HYDRA_FULL_ERROR"] = "1"
!tao model mask2former train -e $SPECS_DIR/spec_inst1.yaml \
           train.num_gpus=$NUM_TRAIN_GPUS \
           results_dir=$RESULTS_DIR

Training logs:

/usr/local/lib/python3.6/pty.py:84: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUSH) at 0x782256094648>
  pid, fd = os.forkpty()
For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-13 12:04:17,530 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-13 12:04:17,581 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-13 12:04:17,603 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-13 12:04:17,603 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-13 06:34:21,081 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
  rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=5.39s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type            | Params
----------------------------------------------
0 | model     | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion    | 0     
----------------------------------------------
47.4 M    Trainable params
0         Non-trainable params
47.4 M    Total params
189.687   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]loading annotations into memory...Done (t=0.88s)
creating index...
index created!

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  2.10it/s]/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide  iou = total_area_intersect / total_area_union

                                                                           
loading annotations into memory...
Done (t=5.51s)
creating index...
index created!

Epoch 0: 100%|██████████| 6250/6250 [1:25:22<00:00,  1.22it/s, v_num=1, train_loss=6.460, lr=0.0001]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/7927 [00:00<17:36,  7.50it/s]
Validation DataLoader 0:   0%|          | 2/7927 [00:00<16:02,  8.23it/s]
Validation DataLoader 0:   0%|          | 3/7927 [00:00<15:23,  8.58it/s]
Validation DataLoader 0:   0%|          | 4/7927 [00:00<14:18,  9.23it/s]
Validation DataLoader 0:   0%|          | 5/7927 [00:00<13:48,  9.56it/s]
Validation DataLoader 0:   0%|          | 6/7927 [00:00<12:56, 10.20it/s]
Validation DataLoader 0:   0%|          | 7/7927 [00:00<12:38, 10.44it/s]
Validation DataLoader 0:   0%|          | 8/7927 [00:00<12:48, 10.31it/s]
Validation DataLoader 0:   0%|          | 9/7927 [00:00<12:54, 10.22it/s]
Validation DataLoader 0:   0%|          | 10/7927 [00:00<13:02, 10.12it/s]
Validation DataLoader 0:   0%|          | 11/7927 [00:01<12:37, 10.45it/s]
Validation DataLoader 0:   0%|          | 12/7927 [00:01<12:17, 10.73it/s]
Validation DataLoader 0:   0%|          | 13/7927 [00:01<11:59, 10.99it/s]
Validation DataLoader 0:   0%|          | 14/7927 [00:01<11:54, 11.08it/s]
Validation DataLoader 0:   0%|          | 15/7927 [00:01<12:01, 10.97it/s]
Validation DataLoader 0:   0%|          | 16/7927 [00:01<12:10, 10.83it/s]
Validation DataLoader 0:   0%|          | 17/7927 [00:01<12:16, 10.74it/s]
Validation DataLoader 0:   0%|          | 18/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0:   0%|          | 19/7927 [00:01<12:30, 10.54it/s]
Validation DataLoader 0:   0%|          | 20/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0:   0%|          | 21/7927 [00:01<12:28, 10.56it/s]
Validation DataLoader 0:   0%|          | 22/7927 [00:02<12:24, 10.62it/s]
Validation DataLoader 0:   0%|          | 23/7927 [00:02<12:28, 10.57it/s]
Validation DataLoader 0:   0%|          | 24/7927 [00:02<12:32, 10.50it/s]
Validation DataLoader 0:   0%|          | 25/7927 [00:02<12:36, 10.45it/s]
Validation DataLoader 0:   0%|          | 26/7927 [00:02<12:39, 10.41it/s]
Validation DataLoader 0:   0%|          | 27/7927 [00:02<12:42, 10.36it/s]
Validation DataLoader 0:   0%|          | 28/7927 [00:02<12:44, 10.33it/s]
Validation DataLoader 0:   0%|          | 29/7927 [00:02<12:40, 10.38it/s]
Validation DataLoader 0:   0%|          | 30/7927 [00:02<12:38, 10.41it/s]
Validation DataLoader 0:   0%|          | 31/7927 [00:02<12:41, 10.37it/s]
Validation DataLoader 0:   0%|          | 32/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0:   0%|          | 33/7927 [00:03<12:45, 10.31it/s]
Validation DataLoader 0:   0%|          | 34/7927 [00:03<12:47, 10.28it/s]
Validation DataLoader 0:   0%|          | 35/7927 [00:03<12:44, 10.32it/s]
Validation DataLoader 0:   0%|          | 36/7927 [00:03<12:46, 10.30it/s]
Validation DataLoader 0:   0%|          | 37/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0:   0%|          | 38/7927 [00:03<12:37, 10.42it/s]
Validation DataLoader 0:   0%|          | 39/7927 [00:03<12:35, 10.45it/s]
Validation DataLoader 0:   1%|          | 40/7927 [00:03<12:38, 10.40it/s]
Validation DataLoader 0:   1%|          | 41/7927 [00:03<12:37, 10.41it/s]
Validation DataLoader 0:   1%|          | 42/7927 [00:04<12:36, 10.42it/s]
Validation DataLoader 0:   1%|          | 43/7927 [00:04<12:38, 10.40it/s]
.
.
.
.
.
.
.

Validation DataLoader 0: 100%|█████████▉| 7914/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7915/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7916/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7917/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7918/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7919/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7920/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7921/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7922/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7923/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7924/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7925/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7926/7927 [12:51<00:00, 10.28it/s]
Validation DataLoader 0: 100%|██████████| 7927/7927 [12:51<00:00, 10.28it/s]/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide  iou = total_area_intersect / total_area_union


                                                                            
Epoch 0: 100%|██████████| 6250/6250 [1:38:14<00:00,  1.06it/s, v_num=1, train_loss=6.460, lr=0.0001, val_loss=11.20, mIoU=1.000, all_acc=1.000][2025-01-13 08:15:15,069 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-13 08:15:15,082 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-13 08:15:15,085 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'mask2former', 'gpu': ['NVIDIA-RTX-A4000'], 'success': False, 'time_lapsed': 6053} to https://api.tao.ngc.nvidia.com.
[2025-01-13 08:15:16,813 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-13 08:15:16,814 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-13 08:15:16,814 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-01-13 13:45:20,751 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

where we are making mistake? Please help.

Thanks.

18 posts - 2 participants

Read full topic


Viewing all articles
Browse latest Browse all 497

Trending Articles