TAO Toolkit 5.2 (5.2.0.1-pyt1.14.0:Segformer) - OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) Dual A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) 5.2.0.1-pyt1.14.0
• Training spec file(If have, please share here) See below
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.) See below

Hi,

For the first time I'm having issues during Segformer training. It appears that when training reaches an iteration where validation_interval triggers an evaluation pass, the container fails with:

OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'
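From the traceback further down, the failure is in Python's shutil.rmtree: it deletes the directory's contents and then calls os.rmdir, which raises Errno 39 if anything lands in .eval_hook in between (my guess is another rank's result part file, but that's an assumption on my part). A minimal sketch of the same failure mode, independent of TAO:

import os
import shutil
import tempfile

# Minimal sketch of the same error, independent of TAO: shutil.rmtree removes
# a directory's contents and then calls os.rmdir on it; if another process
# drops a new file into the directory in between, os.rmdir raises Errno 39
# (ENOTEMPTY). Here the "stray" file is created by hand to illustrate.
root = tempfile.mkdtemp()
hook_dir = os.path.join(root, ".eval_hook")
os.makedirs(hook_dir)
open(os.path.join(hook_dir, "part_0.pkl"), "w").close()  # simulated stray file

try:
    os.rmdir(hook_dir)  # the final step shutil.rmtree performs
except OSError as err:
    print(err)  # [Errno 39] Directory not empty: '.../.eval_hook'

shutil.rmtree(root)  # clean up the sketch's temp directory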

The container is started with:

!tao model segformer train \
  -e $SPECS_DIR/train.yaml \
  -r $RESULTS_DIR \
  -g $NUM_GPUS

The container-to-host filesystem mapping is working, as *.pth checkpoint files appear on the host as training progresses. Here is the training spec:

train:
  exp_config:
    manual_seed: 49
  checkpoint_interval: 50
  logging_interval: 50
  max_iters: 220
  resume_training_checkpoint_path: null
  validate: True
  validation_interval: 220
  trainer:
    find_unused_parameters: True
    sf_optim:
      lr: 0.00006
model:
  input_height: 800
  input_width: 800
  pretrained_model_path: null
  backbone:
    type: "mit_b5"
dataset:
  data_root: /tlt-pytorch
  input_type: "rgb"
  img_norm_cfg:
    mean:
      - 127.5
      - 127.5
      - 127.5
    std:
      - 127.5
      - 127.5
      - 127.5
    to_rgb: True
  train_dataset:
    img_dir:
      - /data/training/images
    ann_dir:
      - /data/training/masks
    pipeline:
      augmentation_config:
        random_crop:
          crop_size:
            - 700
            - 700
          cat_max_ratio: 0.75
        resize:
          ratio_range:
            - 0.5
            - 2.0
        random_flip:
          prob: 0.5
  val_dataset:
    img_dir: /data/val/images
    ann_dir: /data/val/masks
  palette:
    - seg_class: background
      rgb:
        - 0
        - 0
        - 0
      label_id: 0
      mapping_class: background
    - seg_class: window
      rgb:
        - 255
        - 255
        - 255
      label_id: 1
      mapping_class: foreground
  repeat_data_times: 500
  batch_size: 6
  workers_per_gpu: 24
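Side note on the palette above: a quick host-side check along these lines (my own sketch, not anything from the TAO spec; the path is a placeholder for whatever is mounted as /data/training/masks, and it assumes numpy and Pillow are installed) can confirm the masks contain only the two RGB values defined in the spec:

import glob
import numpy as np
from PIL import Image

# Allowed colours taken from the palette section of the spec.
allowed = {(0, 0, 0), (255, 255, 255)}  # background, window

for path in glob.glob("/local/data/training/masks/*.png"):  # host-side placeholder path
    pixels = np.array(Image.open(path).convert("RGB")).reshape(-1, 3)
    colours = set(map(tuple, pixels.tolist()))
    unexpected = colours - allowed
    if unexpected:
        print(path, "has unexpected colours:", unexpected)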

Here is the detailed log output:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 86/85, 12.9 task/s, elapsed: 7s, ETA: 0s
Error executing job with overrides: ['train.num_gpus=2', 'results_dir=/results']

An error occurred during Hydra's exception formatting:
AssertionError()
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 254, in run_and_report
assert mdl is not None
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py>", line 3, in
File "", line 176, in
File "", line 107, in wrapper
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 296, in run_and_report
raise ex
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/utils.py", line 453, in
lambda: hydra.run(
File "/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/lib/python3.8/dist-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "", line 172, in main
File "", line 162, in main
File "", line 130, in run_experiment
File "", line 198, in train_segmentor
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/iter_based_runner.py", line 70, in train
self.call_hook('after_train_iter')
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "", line 114, in after_train_iter
File "", line 159, in multi_gpu_test
File "", line 202, in collect_results_cpu
File "/usr/lib/python3.8/shutil.py", line 722, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 341) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py FAILED

I've checked the images and masks; the container reports the correct number for each.

To me it seems like it should be simple and I'm missing something obvious. I notice that when validation is triggered, the /results/train/.eval_hook directory reported in the error is indeed present.
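Purely as a diagnostic idea (my own, not something from the docs), the leftover directory can be inspected from the host side between runs, along these lines (the host path is an assumption based on my -r results mount):

import pathlib

# Diagnostic only: list whatever a failed run left behind in the hidden eval
# directory. The host-side path is a placeholder for wherever $RESULTS_DIR is
# mounted on your machine.
eval_hook = pathlib.Path("/local/results/train/.eval_hook")
if eval_hook.exists():
    for p in sorted(eval_hook.iterdir()):
        print(p.name, p.stat().st_size, "bytes")
else:
    print("no leftover .eval_hook directory")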

Cheers
