Hi! I have some issues in installing TAO Toolkit API 5.2.0 on bare metal (single machine), using the provided scripts.
Starting from a fresh Ubuntu 20.04.6 these are the steps:
- Define a “ubuntu” user with password “password”
- Update the system with apt upgrade
- Install NVIDIA-NGC:
> wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.39.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
> chmod u+x ngc-cli/ngc
> echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
> ngc config set
- Download TAO Toolkit:
> ngc registry resource download-version "nvidia/tao/tao-getting-started:5.2.0"
> cd tao-getting-started_v5.2.0/setup/quickstart_api_bare_metal
> sudo echo "ubuntu ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
- Install openssh-server and get hostname via:
> hostname -i
-
Generate SSH keys pair, use ssh-copy-id and check that
ssh ubuntu@127.0.1.1 'sudo whoami'
gives “root” as result -
Update hosts file:
-
Set parameters in tao-toolkit-api-ansible-values.yml
Then running bash setup.sh install
this is the result:
first_run_log.txt (2.2 KB)
To solve the issue I’ve changed the value of check gpu per node
to False. Restarting the installation gives this log:
second_run_log.txt (2.2 KB)
My GPUs are seen as VGA controllers and not as 3D adapters, changing the grep condition to “VGA” fixed the problem. The new log is:
third_log_run.txt (15.7 KB)
Site packages.google seems to be down, changed it to google.com and restart the script. The systems reboots when [Waiting for the Cluster to become available]
.
Executing the script after reboot gives:
4_run_log.txt (124.1 KB)
and the installation doesn’t go on.
Command kubectl get pods --all-namespaces
:
Executing command kubectl delete crd clusterpolicies.nvidia.com
gives this log:
output.txt (167.2 KB)
With the TAO api installation stuck. The kubectl describe pod tao
command gives this info:
kubect_describe.txt (4.9 KB)
with errors related to connection refused during the liveness checks.
PC specs:
- CPU: Intel(R) Xeon(R) w5-2445
- RAM: 64 GB
- GPU: 2x RTXA2000 (12 GB)
Thanks!
31 posts - 2 participants