In the TAO SegFormer notebook, when running
!tao model segformer train \
-e $SPECS_DIR/train_mit_b5.yaml \
-r $RESULTS_DIR/isbi_experiment \
-g $NUM_GPUS
it first ran out of disk space on the root drive. After cleaning up (including erasing all Docker images with docker container rm), I rebooted and reran the training, and now I get the error
exec failed: unable to start container process: exec: "segformer": executable file not found in $PATH: unknown
Complete output:
Train SegFormer Model
2024-02-10 22:36:57,856 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-02-10 22:36:58,043 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0
2024-02-10 22:36:58,211 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 288:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/david/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2024-02-10 22:36:58,211 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
OCI runtime exec failed: exec failed: unable to start container process: exec: "segformer": executable file not found in $PATH: unknown
2024-02-10 22:36:59,469 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
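The warning in the log above asks for a "user" entry under DockerOptions in ~/.tao_mounts.json. As a minimal sketch, assuming the key layout is exactly as quoted in that warning, the fragment can be generated from the current user's IDs. The helper name tao_user_option is hypothetical, and the output is meant to be merged by hand into the existing file, not to replace it. This warning is probably unrelated to the "segformer" failure itself.

```shell
# Print the DockerOptions fragment described in the TAO warning, filled in
# with the current user's UID and GID. This only prints text; it does not
# read or modify ~/.tao_mounts.json.
# tao_user_option is a hypothetical helper name used for illustration.
tao_user_option() {
    printf '"DockerOptions": {\n    "user": "%s:%s"\n}\n' "$(id -u)" "$(id -g)"
}

tao_user_option
```

Note that the quotes must be plain ASCII double quotes for the file to stay valid JSON.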
To diagnose this, I ran
docker run -it --rm --network=host nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0 /bin/bash
and got the error
chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory
The complete docker run log
docker run.log (244.6 KB)
I found it odd that at the beginning of the docker run output it says
ngccli_reg_linux.zi 100%[===================>] 44.93M 34.8MB/s in 1.3s
2024-02-10 21:51:38 (34.8 MB/s) - ‘/opt/ngccli/ngccli_reg_linux.zip’ saved [47113663/47113663]
But in fact, the directory /opt/ngccli doesn’t exist:
:/opt$ ls
containerd google microsoft nvidia ros
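Since the first run ended with a full root drive, one thing worth ruling out is an incomplete local copy of the image, or a first-start setup step that could not finish. The sketch below is only a suggestion under that assumption, not a confirmed fix. It checks free space on the root drive and defines two hypothetical helpers: inspect_image looks inside the container while bypassing the image's own entrypoint, and repull_image removes the local image and pulls it again. It is also worth confirming that the /opt listing is taken inside the container and not on the host. Note that docker container rm removes containers only; images are removed with docker image rm.

```shell
# Image tag taken from the TAO launcher log in this post.
IMAGE="nvcr.io/nvidia/tao/tao-toolkit:5.2.0-pyt1.14.0"

# List /opt and look for the segformer entry point inside the image,
# bypassing the image's own entrypoint script.
# (inspect_image is a hypothetical helper name.)
inspect_image() {
    docker run --rm --entrypoint /bin/bash "$IMAGE" \
        -c 'ls -la /opt; command -v segformer || echo "segformer not on PATH"'
}

# Remove the local copy of the image and pull it again.
# (repull_image is a hypothetical helper name.)
repull_image() {
    docker image rm "$IMAGE"
    docker pull "$IMAGE"
}

# Show free space on the root drive before pulling a large image.
df -h /
```

Run inspect_image first. If segformer is missing from PATH inside a freshly started container, repull_image followed by another inspect_image shows whether a clean copy of the image behaves differently.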
Thanks for the help