Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) 5.3
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
Hi
I created a new Python env and installed the TAO 5.3 launcher. I ran exactly the same model as under 5.2 (same mounts, same dataset, and same spec file), but training fails.
Looking at the errors and tracing them through the mmseg and NVIDIA TAO sources on GitHub, the failure comes from a call to NumPy's concatenate, which usually means the list of arrays passed to it is empty. The failure appears to happen while the datasets are being loaded, but I have not changed the datasets or the spec file. As a check, I repointed my notebook at the TAO 5.2 kernel and reran without any errors.
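For reference, the ValueError can be reproduced in isolation: mmengine's BaseDataset pickles each loaded sample and concatenates the resulting byte buffers, so an empty sample list ends up calling np.concatenate on an empty list. This is only a rough sketch of that step, not the TAO code, and serialize_samples is just an illustrative name:

```python
import pickle

import numpy as np


# Rough sketch of mmengine's BaseDataset._serialize_data step: each loaded
# sample dict is pickled and the per-sample byte buffers are concatenated.
def serialize_samples(data_list):
    buffers = [np.frombuffer(pickle.dumps(d), dtype=np.uint8) for d in data_list]
    return np.concatenate(buffers)


print(serialize_samples([{"img_path": "a.png"}]).shape)  # fine: one sample found

try:
    serialize_samples([])  # what happens when the dataset finds no samples
except ValueError as err:
    print(err)  # -> need at least one array to concatenate
```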
I know 5.3 is new, so I'm wondering whether this has come up yet?
Cheers
Error below.
/usr/local/lib/python3.10/dist-packages/mmseg/engine/hooks/visualization_hook.py:60: UserWarning: The draw is False, it means that the hook for visualization will not take effect. The results will NOT be visualized or stored.
warnings.warn('The draw is False, it means that the '
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/dataloader/loading.py:53: UserWarning: `reduce_zero_label` will be deprecated, if you would like to ignore the zero label, please set `reduce_zero_label=True` when dataset initialized
warnings.warn('`reduce_zero_label` will be deprecated, '
Error executing job with overrides: ['train.num_gpus=2', 'results_dir=/workspace/tao-experiments/results/Ex4']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 123, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py", line 107, in wrapper
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 119, in main
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 106, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 85, in run_experiment
runner.train()
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
self._train_loop = self.build_train_loop(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
loop = LOOPS.build(
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 219, in __init__
super().__init__(runner, dataloader)
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
self.dataloader = runner.build_dataloader(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmseg/datasets/basesegdataset.py", line 142, in __init__
self.full_init()
File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 307, in full_init
self.data_bytes, self.data_address = self._serialize_data()
File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 768, in _serialize_data
data_bytes = np.concatenate(data_list)
File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/dataloader/loading.py:53: UserWarning: `reduce_zero_label` will be deprecated, if you would like to ignore the zero label, please set `reduce_zero_label=True` when dataset initialized
warnings.warn('`reduce_zero_label` will be deprecated, '
need at least one array to concatenate
Error executing job with overrides: ['train.num_gpus=2', 'results_dir=/workspace/tao-experiments/results/Ex4']
An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 123, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py", line 107, in wrapper
_run_hydra(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
raise ex
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 453, in <lambda>
lambda: hydra.run(
File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 119, in main
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 106, in main
run_experiment(experiment_config=cfg,
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 85, in run_experiment
runner.train()
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
self._train_loop = self.build_train_loop(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
loop = LOOPS.build(
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 219, in __init__
super().__init__(runner, dataloader)
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
self.dataloader = runner.build_dataloader(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmseg/datasets/basesegdataset.py", line 142, in __init__
self.full_init()
File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 307, in full_init
self.data_bytes, self.data_address = self._serialize_data()
File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 768, in _serialize_data
data_bytes = np.concatenate(data_list)
File "<__array_function__ internals>", line 200, in concatenate
ValueError: need at least one array to concatenate
[2024-04-03 22:41:25,154] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 404) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
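For completeness, here is the quick sanity check I intend to run inside the 5.3 container to confirm the mounted image and mask folders are visible and non-empty. The paths below are placeholders, not the actual values from my spec file:

```python
import os

# Placeholder paths -- substitute the image/mask directories from the spec file.
img_dir = "/workspace/tao-experiments/data/images/train"
ann_dir = "/workspace/tao-experiments/data/masks/train"

for name, path in (("images", img_dir), ("masks", ann_dir)):
    count = len(os.listdir(path)) if os.path.isdir(path) else 0
    print(f"{name}: {path} -> {count} files")

# If either count is 0 (or a path is missing), the dataset ends up with an
# empty sample list and np.concatenate fails exactly as in the log above.
```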