I am trying to train FAN from classification_pyt. I have trained it before without any problem, but now I am getting the error below when resuming training.
To resume from a checkpoint, I used the command below, updating the epoch number accordingly.
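This is roughly the command I ran, reconstructed from the arguments visible in the log further down. The spec file name is a placeholder from my setup; the results directory and checkpoint path are the ones reported in the error:

tao model classification_pyt train \
    -e /specs/classification_pyt/train_fan.yaml \
    -r /results/classification_experiment \
    -g 1 \
    train.train_config.resume_training_checkpoint_path=/results/classification_experiment/train/epoch_36.pth

The failure happens at argument parsing, where train.py reports "-g 1" and the resume_training_checkpoint_path override as unrecognized arguments. The full log follows: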
2024-12-23 18:27:26,925 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-12-23 18:27:27,000 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-12-23 18:27:27,079 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/sigmind/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-12-23 18:27:27,079 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-12-23 12:27:35,015 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
[overrides ...]train.py: error: unrecognized arguments: -g 1 train.train_config.resume_training_checkpoint_path=/results/classification_experiment/train/epoch_36.pth
E1223 12:27:48.226000 139660375336064 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 2) local_rank: 0 (pid: 367) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]: time : 2024-12-23_12:27:48
host : 70a62b4c5c31
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 367)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[2024-12-23 12:27:48,490 - TAO Toolkit - root - INFO] Sending telemetry data.
[2024-12-23 12:27:48,490 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2024-12-23 12:27:48,490 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'classification_pyt', 'gpu': ['NVIDIA-RTX-A2000-12GB'], 'success': False, 'time_lapsed': 11} to https://api.tao.ngc.nvidia.com.
[2024-12-23 12:27:50,269 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2024-12-23 12:27:50,269 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2024-12-23 12:27:50,269 - TAO Toolkit - root - WARNING] Execution status: FAIL
2024-12-23 18:27:51,191 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.