I get no allocatable GPUs after provisioning an EKS cluster with GPU instances.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME                                       GPU
ip-10-2-2-71.us-west-2.compute.internal    <none>
ip-10-2-3-146.us-west-2.compute.internal   <none>
ip-10-2-4-217.us-west-2.compute.internal   <none>
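For cross-checking, the same information is visible per node under Allocatable via kubectl describe (node name taken from the listing above; once the device plugin works, an nvidia.com/gpu entry should appear there):

kubectl describe node ip-10-2-2-71.us-west-2.compute.internal | grep -A 8 Allocatable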
Here are the steps I used to build the customized Ubuntu AMI:
Installed Docker: https://docs.docker.com/engine/install/ubuntu/
Installed the NVIDIA driver: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-installation-options
I chose Option 3 (GRID drivers for G5, G4dn, and G3 instances), since I launched a g5.4xlarge EC2 instance to build the AMI. G5 instances require GRID 13.1 or later (or GRID 12.4 or later).
Installed CUDA 11.6. The GRID driver bundle comes from S3:
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
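For completeness, a sketch of the remaining driver install steps from the linked AWS guide (the exact .run filename depends on the driver version pulled from the bucket):

sudo apt-get install -y gcc make linux-headers-$(uname -r)   # build prerequisites
chmod +x NVIDIA-Linux-x86_64*.run
sudo /bin/sh ./NVIDIA-Linux-x86_64*.run
nvidia-smi -q | head                                         # confirm the driver loaded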
Installed the Docker engine utility for NVIDIA GPUs (the NVIDIA Container Toolkit):
https://docs.nvidia.com/ai-enterprise/deployment-guide/dg-docker.html
(section "Enabling the Docker Repository and Installing the NVIDIA Container Toolkit")
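The repository setup and install commands, roughly as given in that guide (nvidia-docker2 was the current package for this toolkit generation; newer docs install nvidia-container-toolkit instead):

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

After that, the CUDA container check below passed: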
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   16C    P8    15W / 300W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
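Note that --gpus all exercises the NVIDIA runtime explicitly, and Kubernetes never passes that flag. Since the nvidia/cuda images set NVIDIA_VISIBLE_DEVICES=all, the same test should also pass without the flag once the default runtime is nvidia (see the daemon.json below), which is closer to how the kubelet launches containers:

sudo docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi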
I provisioned the EKS cluster and used Karpenter to launch the GPU nodes, then installed the NVIDIA k8s-device-plugin:
https://github.com/NVIDIA/k8s-device-plugin
I configured Docker as follows in the AMI (/etc/docker/daemon.json):
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
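A restart is required for daemon.json changes to take effect; whether the setting stuck can be verified with docker info:

sudo systemctl restart docker
docker info | grep -i 'default runtime'   # expected: Default Runtime: nvidia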
Enabling GPU Support in Kubernetes
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
kubectl get daemonset -A
NAMESPACE     NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   aws-node                         3         3         3       3            3           <none>          25h
kube-system   kube-proxy                       3         3         3       3            3           <none>          25h
kube-system   nvidia-device-plugin-daemonset   3         3         3       3            3           <none>          86m
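The daemonset reports READY on all three nodes, so the plugin pods are running; their logs show whether they actually registered any GPUs with the kubelet (the name label below is taken from the v0.12.3 static manifest):

kubectl -n kube-system logs -l name=nvidia-device-plugin-ds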
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME                                       GPU
ip-10-2-2-71.us-west-2.compute.internal    <none>
ip-10-2-3-146.us-west-2.compute.internal   <none>
ip-10-2-4-217.us-west-2.compute.internal   <none>
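One more data point worth checking: which container runtime the kubelet is actually using on these nodes, since the daemon.json above only configures dockerd, not containerd:

kubectl get nodes -o wide   # the CONTAINER-RUNTIME column shows docker://... or containerd://...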