Channel: TAO Toolkit - NVIDIA Developer Forums

TAO 5.3 Direct Container usage issues


Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) H100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here) nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi,

I’ve been using TAO 5.3 for some time via the launcher. A specific use case required me to use the containers directly, so I ran the following command:

docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/ubuntu/DBI-9/Transfer/WindowService:/workspace/tao-experiments nvcr.io/nvidia/tao/tao-toolkit:5.3.0-pyt segformer train -e /workspace/tao-experiments/specs/WindowsV2/MLOps/mit_b5/512/train.yaml -r /workspace/tao-experiments/results/WindowsV2/MLOps/mit_b5/51

I should highlight that I’m using the exact same spec file and dataset that run without issues under the launcher, so there are no problems on that side.

When I run the above command I get a very verbose error message (below) that points to an mmengine exception in local_backend.py. The error is:

[Errno 40] Too many levels of symbolic links: '/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/stderr'
Error executing job with overrides: ['results_dir=/workspace/tao-experiments/results/WindowsV2/MLOps/mit_b5/51']
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 119, in main
    raise e
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 106, in main
    run_experiment(experiment_config=cfg,
  File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/segformer/scripts/train.py", line 85, in run_experiment
    runner.train()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
    self._train_loop = self.build_train_loop(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
    loop = LOOPS.build(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 219, in __init__
    super().__init__(runner, dataloader)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
    self.dataloader = runner.build_dataloader(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
    dataset = DATASETS.build(dataset_cfg)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/mmseg/datasets/basesegdataset.py", line 142, in __init__
    self.full_init()
  File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/base_dataset.py", line 298, in full_init
    self.data_list = self.load_data_list()
  File "/usr/local/lib/python3.10/dist-packages/mmseg/datasets/basesegdataset.py", line 256, in load_data_list
    for img in fileio.list_dir_or_file(
  File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/io.py", line 760, in list_dir_or_file
    yield from backend.list_dir_or_file(dir_path, list_dir, list_file, suffix,
  File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 538, in _list_dir_or_file
    yield from _list_dir_or_file(entry.path, list_dir,
  File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 538, in _list_dir_or_file
    yield from _list_dir_or_file(entry.path, list_dir,
  File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 538, in _list_dir_or_file
    yield from _list_dir_or_file(entry.path, list_dir,
  [Previous line repeated 37 more times]
  File "/usr/local/lib/python3.10/dist-packages/mmengine/fileio/backends/local_backend.py", line 528, in _list_dir_or_file
    if not entry.name.startswith('.') and entry.is_file():
OSError: [Errno 40] Too many levels of symbolic links: '/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/fd/52/dev/stderr'

The function that raises the exception is as follows (I shelled into the container to inspect it):

def list_dir_or_file(self,
                     dir_path: Union[str, Path],
                     list_dir: bool = True,
                     list_file: bool = True,
                     suffix: Optional[Union[str, Tuple[str]]] = None,
                     recursive: bool = False) -> Iterator[str]:
    """Scan a directory to find the interested directories or files in
    arbitrary order.

    Note:
        :meth:`list_dir_or_file` returns the path relative to ``dir_path``.

    Args:
        dir_path (str or Path): Path of the directory.
        list_dir (bool): List the directories. Defaults to True.
        list_file (bool): List the path of files. Defaults to True.
        suffix (str or tuple[str], optional): File suffix that we are
            interested in. Defaults to None.
        recursive (bool): If set to True, recursively scan the directory.
            Defaults to False.

    Yields:
        Iterable[str]: A relative path to ``dir_path``.

    Examples:
        >>> backend = LocalBackend()
        >>> dir_path = '/path/of/dir'
        >>> # list those files and directories in current directory
        >>> for file_path in backend.list_dir_or_file(dir_path):
        ...     print(file_path)
        >>> # only list files
        >>> for file_path in backend.list_dir_or_file(dir_path, list_dir=False):
        ...     print(file_path)
        >>> # only list directories
        >>> for file_path in backend.list_dir_or_file(dir_path, list_file=False):
        ...     print(file_path)
        >>> # only list files ending with specified suffixes
        >>> for file_path in backend.list_dir_or_file(dir_path, suffix='.txt'):
        ...     print(file_path)
        >>> # list all files and directory recursively
        >>> for file_path in backend.list_dir_or_file(dir_path, recursive=True):
        ...     print(file_path)
    """  # noqa: E501
    if list_dir and suffix is not None:
        raise TypeError('`suffix` should be None when `list_dir` is True')

    if (suffix is not None) and not isinstance(suffix, (str, tuple)):
        raise TypeError('`suffix` must be a string or tuple of strings')

    root = dir_path

    def _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                          recursive):
        for entry in os.scandir(dir_path):
            if not entry.name.startswith('.') and entry.is_file():
                rel_path = osp.relpath(entry.path, root)
                if (suffix is None
                        or rel_path.endswith(suffix)) and list_file:
                    yield rel_path
            elif osp.isdir(entry.path):
                if list_dir:
                    rel_dir = osp.relpath(entry.path, root)
                    yield rel_dir
                if recursive:
                    yield from _list_dir_or_file(entry.path, list_dir,
                                                 list_file, suffix,
                                                 recursive)

    return _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                             recursive)

I then edited the function as follows (I believe an infinite recursion is occurring):

def list_dir_or_file(self,
                     dir_path: Union[str, Path],
                     list_dir: bool = True,
                     list_file: bool = True,
                     suffix: Optional[Union[str, Tuple[str]]] = None,
                     recursive: bool = False) -> Iterator[str]:
    """Scan a directory to find the interested directories or files in
    arbitrary order.

    Note:
        :meth:`list_dir_or_file` returns the path relative to ``dir_path``.

    Args:
        dir_path (str or Path): Path of the directory.
        list_dir (bool): List the directories. Defaults to True.
        list_file (bool): List the path of files. Defaults to True.
        suffix (str or tuple[str], optional): File suffix that we are
            interested in. Defaults to None.
        recursive (bool): If set to True, recursively scan the directory.
            Defaults to False.

    Yields:
        Iterable[str]: A relative path to ``dir_path``.

    Examples:
        >>> backend = LocalBackend()
        >>> dir_path = '/path/of/dir'
        >>> # list those files and directories in current directory
        >>> for file_path in backend.list_dir_or_file(dir_path):
        ...     print(file_path)
        >>> # only list files
        >>> for file_path in backend.list_dir_or_file(dir_path, list_dir=False):
        ...     print(file_path)
        >>> # only list directories
        >>> for file_path in backend.list_dir_or_file(dir_path, list_file=False):
        ...     print(file_path)
        >>> # only list files ending with specified suffixes
        >>> for file_path in backend.list_dir_or_file(dir_path, suffix='.txt'):
        ...     print(file_path)
        >>> # list all files and directory recursively
        >>> for file_path in backend.list_dir_or_file(dir_path, recursive=True):
        ...     print(file_path)
    """  # noqa: E501
    if list_dir and suffix is not None:
        raise TypeError('`suffix` should be None when `list_dir` is True')

    if (suffix is not None) and not isinstance(suffix, (str, tuple)):
        raise TypeError('`suffix` must be a string or tuple of strings')

    root = dir_path
    visited_paths = set()
    
    def _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                          recursive):
        for entry in os.scandir(dir_path):
            if entry.is_symlink():
                continue  # Skip symbolic links

            if entry.path in visited_paths:
                continue  # Skip already visited paths

            visited_paths.add(entry.path)

            if not entry.name.startswith('.') and entry.is_file():
                rel_path = osp.relpath(entry.path, root)
                if (suffix is None
                        or rel_path.endswith(suffix)) and list_file:
                    yield rel_path
            elif osp.isdir(entry.path): 
                if list_dir:
                    rel_dir = osp.relpath(entry.path, root)
                    yield rel_dir
                if recursive:
                    yield from _list_dir_or_file(entry.path, list_dir,
                                                 list_file, suffix,
                                                 recursive)
        # print('visited_paths:', visited_paths)
    return _list_dir_or_file(dir_path, list_dir, list_file, suffix,
                             recursive)

I then did a docker commit to save those modifications. With the new container I ran the same docker run command with the same arguments. The above error no longer occurred; however, I now receive multiple torch errors related to tensor shape:

UserWarning: Please pay attention your ground truth segmentation map, usually the segmentation map is 2D, but got (456, 512, 4)

My dataset has not changed, and the launcher successfully trains on the exact same data (all images 512x512x3 PNG; all masks 512x512 PNG). The training spec is identical except for pointing to the new dataset locations (to account for running the container directly).
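For what it's worth, the warning reports a (456, 512, 4) ground-truth shape, i.e. the mask is being decoded with four channels (RGBA) rather than as a 2D label map. A quick way to check how a mask PNG is actually decoded, sketched with Pillow and a synthetic stand-in file (the paths and the generated image here are hypothetical, not the real dataset):

```python
import os
import tempfile

import numpy as np
from PIL import Image

# Hypothetical stand-in for one of the dataset's mask PNGs: a 456x512
# RGBA image, matching the shape reported by the warning.
tmpdir = tempfile.mkdtemp()
mask_path = os.path.join(tmpdir, "mask.png")
Image.fromarray(np.zeros((456, 512, 4), dtype=np.uint8), mode="RGBA").save(mask_path)

mask = Image.open(mask_path)
print(mask.mode, np.array(mask).shape)   # RGBA (456, 512, 4): not a 2D label map

# Collapsing to a single channel yields the 2D map mmseg expects; which
# channel actually carries the class ids depends on how the masks were
# exported, so verify the label values after converting.
fixed = mask.convert("L")
print(np.array(fixed).shape)             # (456, 512)
```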

I’m wondering whether the launcher does some additional manipulation of the dataset before handing it to torch. Any ideas?

cheers

4 posts - 2 participants
