
TAO Toolkit API 5.3.0 - Installed with errors

Hi!

I was able to install TAO Toolkit API 5.3.0, but some problems arose during installation. This is the complete log:

installation_log.txt (225.1 KB)

And these are all the pods:

NAMESPACE             NAME                                                              READY   STATUS     RESTARTS   AGE
default               dgx-job-controller-74f7f5ccb8-7pl4f                               1/1     Running    0          67s
default               ingress-nginx-controller-764899d9c6-q4lwn                         1/1     Running    0          82s
default               nfs-subdir-external-provisioner-8665b7df97-w6sfz                  1/1     Running    0          75s
default               nvidia-smi-gpuaasl40vs                                            1/1     Running    0          70s
default               nvtl-api-app-pod-784b78fb8b-zdpsw                                 1/1     Running    0          67s
default               nvtl-api-jupyterlab-pod-5cbbbb8f8-9sb8n                           1/1     Running    0          67s
default               nvtl-api-workflow-pod-6db787664c-dtfrm                            1/1     Running    0          67s
kube-system           calico-kube-controllers-6ff746f7c5-8zxbl                          1/1     Running    0          4m48s
kube-system           calico-node-z6njm                                                 1/1     Running    0          4m43s
kube-system           coredns-5d78c9869d-b97mb                                          1/1     Running    0          3m29s
kube-system           coredns-5d78c9869d-df2zq                                          1/1     Running    0          3m29s
kube-system           etcd-gpuaasl40vs                                                  1/1     Running    0          5m5s
kube-system           kube-apiserver-gpuaasl40vs                                        1/1     Running    0          5m5s
kube-system           kube-controller-manager-gpuaasl40vs                               1/1     Running    0          5m5s
kube-system           kube-proxy-6sq7g                                                  1/1     Running    0          4m50s
kube-system           kube-scheduler-gpuaasl40vs                                        1/1     Running    0          5m5s
nvidia-gpu-operator   gpu-feature-discovery-h68wf                                       0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   gpu-operator-1713535811-node-feature-discovery-gc-7b46cd672vvxc   1/1     Running    0          3m22s
nvidia-gpu-operator   gpu-operator-1713535811-node-feature-discovery-master-5b9942t5h   1/1     Running    0          4m32s
nvidia-gpu-operator   gpu-operator-1713535811-node-feature-discovery-worker-2l9qk       1/1     Running    0          3m19s
nvidia-gpu-operator   gpu-operator-5587854f69-7pnrs                                     1/1     Running    0          4m32s
nvidia-gpu-operator   nvidia-container-toolkit-daemonset-tbvgv                          0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-dcgm-exporter-qqp2f                                        0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-device-plugin-daemonset-rmwhf                              0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-driver-daemonset-tb74g                                     0/1     Running    0          3m58s
nvidia-gpu-operator   nvidia-operator-validator-vkfx4                                   0/1     Init:0/4   0          3m36s

The NVIDIA GPU Operator pods are stuck in the initialization phase. Could this be a problem?
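In case it is useful, these are the standard kubectl commands I can run to check why the init containers are pending (pod names are taken from the listing above; I have not attached their output here):

# Show which init container is waiting and the associated events
kubectl describe pod nvidia-operator-validator-vkfx4 -n nvidia-gpu-operator

# Check the driver daemonset, since the other pods wait for it
kubectl logs nvidia-driver-daemonset-tb74g -n nvidia-gpu-operator --all-containers

# Recent events in the GPU Operator namespace
kubectl get events -n nvidia-gpu-operator --sort-by=.lastTimestamp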
The result of kubectl logs -f nvtl-api-app-pod-784b78fb8b-zdpsw is:

api-logs.txt (16.8 KB)

I am working with an L40 GPU inside a VM with passthrough. Is there a way to monitor GPU usage inside the Kubernetes cluster?
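For example, would something along these lines be the right way to check utilization once the GPU Operator pods come up? (Pod names are the ones from the listing above; 9400 is the default dcgm-exporter metrics port, so this is just a sketch.)

# Run nvidia-smi inside the driver daemonset pod
kubectl exec -n nvidia-gpu-operator nvidia-driver-daemonset-tb74g -- nvidia-smi

# Or expose the DCGM exporter metrics locally and look at GPU utilization
kubectl port-forward -n nvidia-gpu-operator nvidia-dcgm-exporter-qqp2f 9400:9400
curl http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL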

Thanks!
