TAO Toolkit 5.2 (5.2.0.1-pyt1.14.0:Segformer) - OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) Dual A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) 5.2.0.1-pyt1.14.0
• Training spec file(If have, please share here) See below
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.) See below

Hi,

For the first time I'm having issues during Segformer training. It appears that when training reaches an iteration where validation_interval triggers an evaluation pass, the container fails with:

OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'
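From the traceback further down, the failure is in Python's shutil.rmtree: it deletes the directory's contents and then calls os.rmdir, which raises Errno 39 if anything lands in .eval_hook in between (my guess is another rank's result part file, but that's an assumption on my part). A minimal sketch of the same failure mode, independent of TAO:

import os
import shutil
import tempfile

# Minimal sketch of the same error, independent of TAO: shutil.rmtree removes
# a directory's contents and then calls os.rmdir on it; if another process
# drops a new file into the directory in between, os.rmdir raises Errno 39
# (ENOTEMPTY). Here the "stray" file is created by hand to illustrate.
root = tempfile.mkdtemp()
hook_dir = os.path.join(root, ".eval_hook")
os.makedirs(hook_dir)
open(os.path.join(hook_dir, "part_0.pkl"), "w").close()  # simulated stray file

try:
    os.rmdir(hook_dir)  # the final step shutil.rmtree performs
except OSError as err:
    print(err)  # [Errno 39] Directory not empty: '.../.eval_hook'

shutil.rmtree(root)  # clean up the sketch's temp directory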

The container is started with:

!tao model segformer train \
  -e $SPECS_DIR/train.yaml \
  -r $RESULTS_DIR \
  -g $NUM_GPUS

The container-to-host filesystem mapping is working, as *.pth checkpoint files appear on the host as training progresses. Here is the training spec:

train:
  exp_config:
    manual_seed: 49
  checkpoint_interval: 50
  logging_interval: 50
  max_iters: 220
  resume_training_checkpoint_path: null
  validate: True
  validation_interval: 220
  trainer:
    find_unused_parameters: True
    sf_optim:
      lr: 0.00006
model:
  input_height: 800
  input_width: 800
  pretrained_model_path: null
  backbone:
    type: "mit_b5"
dataset:
  data_root: /tlt-pytorch
  input_type: "rgb"
  img_norm_cfg:
    mean:
      - 127.5
      - 127.5
      - 127.5
    std:
      - 127.5
      - 127.5
      - 127.5
    to_rgb: True
  train_dataset:
    img_dir:
      - /data/training/images
    ann_dir:
      - /data/training/masks
    pipeline:
      augmentation_config:
        random_crop:
          crop_size:
            - 700
            - 700
          cat_max_ratio: 0.75
        resize:
          ratio_range:
            - 0.5
            - 2.0
        random_flip:
          prob: 0.5
  val_dataset:
    img_dir: /data/val/images
    ann_dir: /data/val/masks
  palette:
    - seg_class: background
      rgb:
        - 0
        - 0
        - 0
      label_id: 0
      mapping_class: background
    - seg_class: window
      rgb:
        - 255
        - 255
        - 255
      label_id: 1
      mapping_class: foreground
  repeat_data_times: 500
  batch_size: 6
  workers_per_gpu: 24
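Side note on the palette above: a quick host-side check along these lines (my own sketch, not anything from the TAO spec; the path is a placeholder for whatever is mounted as /data/training/masks, and it assumes numpy and Pillow are installed) can confirm the masks contain only the two RGB values defined in the spec:

import glob
import numpy as np
from PIL import Image

# Allowed colours taken from the palette section of the spec.
allowed = {(0, 0, 0), (255, 255, 255)}  # background, window

for path in glob.glob("/local/data/training/masks/*.png"):  # host-side placeholder path
    pixels = np.array(Image.open(path).convert("RGB")).reshape(-1, 3)
    colours = set(map(tuple, pixels.tolist()))
    unexpected = colours - allowed
    if unexpected:
        print(path, "has unexpected colours:", unexpected)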

Here is the detailed log output:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 86/85, 12.9 task/s, elapsed: 7s, ETA: 0s
Error executing job with overrides: ['train.num_gpus=2', 'results_dir=/results']

An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py>", line 3, in
File "", line 176, in
File "", line 107, in wrapper
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
raise ex
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in
lambda: hydra.run(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "", line 172, in main
File "", line 162, in main
File "", line 130, in run_experiment
File "", line 198, in train_segmentor
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 70, in train
self.call_hook('after_train_iter')
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "", line 114, in after_train_iter
File "", line 159, in multi_gpu_test
File "", line 202, in collect_results_cpu
File "/usr/lib/python3.8/shutil.py", line 722, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 341) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py FAILED

I've checked the images and masks; the container reports the correct number for each.

To me it seems like it should be simple and I'm missing something obvious. I notice that when validation is triggered, the /results/train/.eval_hook directory reported in the error is indeed present.
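Purely as a diagnostic idea (my own, not something from the docs), the leftover directory can be inspected from the host side between runs, along these lines (the host path is an assumption based on my -r results mount):

import pathlib

# Diagnostic only: list whatever a failed run left behind in the hidden eval
# directory. The host-side path is a placeholder for wherever $RESULTS_DIR is
# mounted on your machine.
eval_hook = pathlib.Path("/local/results/train/.eval_hook")
if eval_hook.exists():
    for p in sorted(eval_hook.iterdir()):
        print(p.name, p.stat().st_size, "bytes")
else:
    print("no leftover .eval_hook directory")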

Cheers
