Channel: TAO Toolkit - NVIDIA Developer Forums

Unable to install TAO Toolkit 5.2.0 API on bare metal

Hi! I'm having trouble installing TAO Toolkit API 5.2.0 on bare metal (single machine) using the provided scripts.
Starting from a fresh Ubuntu 20.04.6 install, these are the steps:

  1. Define a “ubuntu” user with password “password”
  2. Update the system with apt upgrade
  3. Install the NVIDIA NGC CLI:
> wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.39.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
> chmod u+x ngc-cli/ngc
> echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
> ngc config set
  4. Download TAO Toolkit:
> ngc registry resource download-version "nvidia/tao/tao-getting-started:5.2.0"
> cd tao-getting-started_v5.2.0/setup/quickstart_api_bare_metal
> echo "ubuntu ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers
  5. Install openssh-server and get the host IP address via:
> hostname -i
  6. Generate an SSH key pair, use ssh-copy-id, and check that ssh ubuntu@127.0.1.1 'sudo whoami' returns “root”

  7. Update the hosts file:

  8. Set parameters in tao-toolkit-api-ansible-values.yml
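
The SSH and sudo check from the steps above can be made non-interactive, so that a missing key or a password prompt shows up as a clear failure before the installer runs. This is only a minimal sketch: the function name check_remote_sudo is made up for illustration and is not part of the TAO scripts.

```shell
# Pre-flight check: key-based SSH plus passwordless sudo on the target host.
# check_remote_sudo is a hypothetical helper, not part of the TAO quickstart.
# Usage: check_remote_sudo USER HOST
check_remote_sudo() {
    user="$1"
    host="$2"
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # sudo -n makes sudo fail instead of prompting for one.
    result="$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" 'sudo -n whoami' 2>/dev/null)"
    if [ "$result" = "root" ]; then
        echo "OK: $user@$host has passwordless sudo"
        return 0
    fi
    echo "FAIL: expected 'root', got '${result:-<no output>}'" >&2
    return 1
}
```

With the setup described in this post, the call would be check_remote_sudo ubuntu 127.0.1.1.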

Then, running bash setup.sh install gives this result:
first_run_log.txt (2.2 KB)
To work around the issue, I changed the value of check gpu per node to False. Restarting the installation gives this log:
second_run_log.txt (2.2 KB)
My GPUs are reported as VGA controllers rather than 3D adapters; changing the grep condition to “VGA” fixed the problem. The new log is:
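
For reference, a GPU count that accepts both device classes could look like the sketch below. It assumes the installer's check is based on lspci output, which the grep change suggests but the logs would have to confirm. The function name count_nvidia_gpus is hypothetical.

```shell
# Count NVIDIA GPUs in lspci-style text read from stdin.
# Accepts both device classes that lspci may report for a GPU.
# count_nvidia_gpus is a hypothetical helper, not part of the TAO scripts.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -cE 'VGA compatible controller|3D controller'
}
```

It would be used as lspci | count_nvidia_gpus. NVIDIA audio functions are skipped because they match neither class name.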
third_log_run.txt (15.7 KB)
The site packages.google seems to be down, so I changed it to google.com and restarted the script. The system reboots at the step [Waiting for the Cluster to become available].
Executing the script after reboot gives:
4_run_log.txt (124.1 KB)
and the installation does not proceed.
Command kubectl get pods --all-namespaces:

Executing the command kubectl delete crd clusterpolicies.nvidia.com gives this log:
output.txt (167.2 KB)
The TAO API installation remains stuck. The kubectl describe pod tao command gives this info:
kubect_describe.txt (4.9 KB)
with “connection refused” errors during the liveness checks.
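
To narrow down which pods are failing, the pod listing can be filtered to the unhealthy entries. The sketch below assumes the default column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, ...). The function name unhealthy_pods is made up for illustration.

```shell
# Filter `kubectl get pods --all-namespaces --no-headers` output (stdin)
# down to pods that are not healthy.
# unhealthy_pods is a hypothetical helper; it assumes the default columns
# NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
    awk '{
        split($3, r, "/")                     # READY column, e.g. "0/1"
        if ($4 == "Completed") next           # finished jobs are fine
        if ($4 != "Running" || r[1] != r[2])  # bad status or containers not ready
            print $1 "/" $2, $4, $3
    }'
}
```

It would be used as kubectl get pods --all-namespaces --no-headers | unhealthy_pods. For each pod it lists, kubectl describe pod NAME -n NAMESPACE and kubectl logs NAME -n NAMESPACE --previous show the probe events and the output of the last crashed container.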

PC specs:

  • CPU: Intel(R) Xeon(R) w5-2445
  • RAM: 64 GB
  • GPU: 2x RTX A2000 (12 GB)

Thanks!

31 posts - 2 participants
