Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
classification_pyt
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (If you have one, please share it here)
train:
  exp_config:
    manual_seed: 49
  train_config:
    runner:
      max_epochs: 40
    checkpoint_config:
      interval: 1
    logging:
      interval: 500
    validate: True
    evaluation:
      interval: 1
    custom_hooks:
      - type: "EMAHook"
        momentum: 4e-5
        priority: "ABOVE_NORMAL"
dataset:
  data:
    samples_per_gpu: 8
    train:
      data_prefix: /data/cats_dogs_dataset/training_set/training_set/
      pipeline: # Augmentations alone
        - type: RandomResizedCrop
          size: 224
        - type: RandomFlip
          flip_prob: 0.5
          direction: "horizontal"
      classes: /data/cats_dogs_dataset/classes.txt
    val:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
    test:
      data_prefix: /data/cats_dogs_dataset/val_set/val_set
      classes: /data/cats_dogs_dataset/classes.txt
model:
  backbone:
    type: "fan_tiny_8_p4_hybrid"
    custom_args:
      drop_path: 0.1
  head:
    type: "FANLinearClsHead"
    custom_args:
      head_init_scale: 1
    num_classes: 2
    loss:
      type: "CrossEntropyLoss"
      loss_weight: 1.0
      use_soft: False
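For context, the dataset paths in the spec assume the usual one-sub-folder-per-class layout used by the TAO classification examples. The folder and class names in the sketch below are assumptions inferred from the paths above, not details confirmed in this post:

/data/cats_dogs_dataset/
├── classes.txt                      # assumed: one class name per line (e.g. "cats", "dogs")
├── training_set/training_set/
│   ├── cats/  *.jpg
│   └── dogs/  *.jpg
└── val_set/val_set/
    ├── cats/  *.jpg
    └── dogs/  *.jpg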
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
env: EPOCHS=5
Train Classification Model
2024-09-12 11:01:56,227 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-09-12 11:01:56,351 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-09-12 11:01:56,494 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-09-12 11:01:56,494 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-09-12 03:02:02,955 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
train.py: error: unrecognized arguments: -g 1
E0912 03:02:15.628000 139649349068608 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 2) local_rank: 0 (pid: 541) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]: time : 2024-09-12_03:02:15
host : a70d5fe5d884
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 541)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-09-12 03:02:15,868 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A6000'], 'success': False, 'time_lapsed': 11} to https://api.tao.ngc.nvidia.com.
[2024-09-12 03:02:17,422 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-09-12 03:02:17,423 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-09-12 03:02:17,423 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-09-12 11:02:18,346 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
This is the error I am facing.
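The exact command is not shown above, but the argparse message in the log indicates that a "-g 1" flag reached train.py and was rejected. A minimal sketch of an invocation that produces this error shape, assuming the standard TAO 5.5 launcher syntax and a placeholder spec path (not taken from this post):

tao model classification_pyt train \
    -e /workspace/specs/classification_spec.yaml \
    -g 1
# -> train.py: error: unrecognized arguments: -g 1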