Channel: TAO Toolkit - NVIDIA Developer Forums

Classification_pyt error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
classification_pyt
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)

train:
  exp_config:
    manual_seed: 49
  train_config:
    runner:
      max_epochs: 40
    checkpoint_config:
      interval: 1
    logging:
      interval: 500
    validate: True
    evaluation:
      interval: 1
    custom_hooks:
      - type: "EMAHook"
        momentum: 4e-5
        priority: "ABOVE_NORMAL"
dataset:
  data:
    samples_per_gpu: 8
    train:
      data_prefix: /data/cats_dogs_dataset/training_set/training_set/
      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          size: 224
        - type: RandomFlip
          flip_prob: 0.5
          direction: "horizontal"
      classes: /data/cats_dogs_dataset/classes.txt
    val:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
    test:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
model:
  backbone:
    type: "fan_tiny_8_p4_hybrid"
    custom_args:
      drop_path: 0.1
  head:
    type: "FANLinearClsHead"
    custom_args:
      head_init_scale: 1
    num_classes: 2
    loss:
      type: "CrossEntropyLoss"
      loss_weight: 1.0
      use_soft: False

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

env: EPOCHS=5
Train Classification Model
2024-09-12 11:01:56,227 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-09-12 11:01:56,351 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-09-12 11:01:56,494 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-09-12 11:01:56,494 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-09-12 03:02:02,955 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
[overrides ...]train.py: error: unrecognized arguments: -g 1th}]]ydra,all}]
E0912 03:02:15.628000 139649349068608 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 2) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:  time      : 2024-09-12_03:02:15
  host      : a70d5fe5d884
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 541)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 11} to https://api.tao.ngc.nvidia.com.
[2024-09-12 03:02:17,422 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 03:02:17,423 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 03:02:17,423 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 11:02:18,346 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
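The `train.py: error: unrecognized arguments: -g 1` line has the shape of a Python `argparse` usage error, and exit code 2 is the status `argparse` uses when it rejects a command line. The sketch below reproduces that behaviour with a stand-in parser. The option set is hypothetical (only `-e` is registered, purely for illustration) and is not the real TAO entrypoint.

```python
import argparse


def parse_exit_code(argv):
    """Return the exit code argparse produces for argv (0 if parsing succeeds)."""
    # Stand-in parser: the prog name mirrors the log, the options are assumed.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("-e", dest="experiment_spec")  # hypothetical option
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        # argparse prints "unrecognized arguments: ..." to stderr and exits with 2.
        return exc.code
    return 0


if __name__ == "__main__":
    print(parse_exit_code(["-e", "spec.yaml"]))             # accepted
    print(parse_exit_code(["-e", "spec.yaml", "-g", "1"]))  # rejected, like the log
```

If this reading is right, the torchrun traceback only reports that the child process exited; the rejected `-g 1` argument is the part to fix in the launch command.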

This is the error I get when I start training: train.py rejects the arguments "-g 1" and the process exits with code 2.
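Because the torchrun traceback buries the real message, a small helper can pull the relevant facts out of a log like the one above. This is a minimal sketch, not part of TAO: the function name `summarize_failure` and the two regular expressions are assumptions written against the lines shown in this post.

```python
import re

# Patterns written against the log lines in this post (assumed, not a TAO API).
ARG_ERROR = re.compile(r"error: unrecognized arguments: (?P<args>.+)")
EXIT_CODE = re.compile(r"exitcode\s*:\s*(?P<code>\d+)")


def summarize_failure(log_text):
    """Return the first argument error and the first exit code found in a log."""
    summary = {"unrecognized": None, "exitcode": None}
    for line in log_text.splitlines():
        arg_match = ARG_ERROR.search(line)
        if arg_match and summary["unrecognized"] is None:
            summary["unrecognized"] = arg_match.group("args").strip()
        code_match = EXIT_CODE.search(line)
        if code_match and summary["exitcode"] is None:
            summary["exitcode"] = int(code_match.group("code"))
    return summary


if __name__ == "__main__":
    sample = (
        "[overrides ...]train.py: error: unrecognized arguments: -g 1th}]]ydra,all}]\n"
        "E0912 03:02:15.628000 api.py:881] failed (exitcode: 2) local_rank: 0 (pid: 541)\n"
    )
    print(summarize_failure(sample))
```

The captured argument string keeps the trailing characters from the overwritten usage line (`th}]]ydra,all}]`), so only the leading `-g 1` should be read as the rejected arguments.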

13 posts - 2 participants