This DRA resource driver is currently under active development and not yet
designed for production use.
We may (at times) decide to force-push over main until we have something more stable.
Use at your own risk.
A document and demo of the DRA support for GPUs provided by this repo can be found below.
This section describes using kind to demo the functionality of the NVIDIA GPU DRA Driver.
First, since we'll launch kind with GPU support, ensure that the following prerequisites are met:

1. kind is installed. See the official documentation here.

2. Ensure that the NVIDIA Container Toolkit is installed on your system. This can be done by following the instructions here.

3. Configure the NVIDIA Container Runtime as the default Docker runtime:

   ```bash
   sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
   ```

4. Restart Docker to apply the changes (an optional sanity check follows this list):

   ```bash
   sudo systemctl restart docker
   ```

5. Set the `accept-nvidia-visible-devices-as-volume-mounts` option to `true` in the `/etc/nvidia-container-runtime/config.toml` file to configure the NVIDIA Container Runtime to use volume mounts to select devices to inject into a container:

   ```bash
   sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true
   ```
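After these prerequisites are in place, you can optionally confirm that Docker now reports nvidia as its default runtime. This check is an addition to the steps above, not part of the original instructions:

```bash
# The "Default Runtime" line should now read "nvidia"
docker info | grep -i "default runtime"
```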
Show the current set of GPUs on the machine:

```bash
nvidia-smi -L
```

We start by cloning this repository and `cd`-ing into it.
All of the scripts and example Pod specs used in this demo are in the demo
subdirectory, so take a moment to browse through the various files and see
what's available:
```bash
git clone https://github.com/NVIDIA/k8s-dra-driver.git
cd k8s-dra-driver
```

Here's a demo showing how to install and configure DRA, and run a pod in a kind cluster on a Linux workstation.
Below are the detailed, step-by-step instructions.
First, create a kind cluster to run the demo:
```bash
./demo/clusters/kind/create-cluster.sh
```
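At this point you can optionally verify that the cluster came up and is reachable; this quick check is an addition to the original steps:

```bash
# Confirm the kind cluster is reachable and its node is Ready
kubectl cluster-info
kubectl get nodes
```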
From here we will build the image for the example resource driver:

```bash
./demo/clusters/kind/build-dra-driver.sh
```

This also makes the built images available to the kind cluster.
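If you want to confirm the images actually landed on the kind node, you can inspect the node's container images. This is an optional addition; the node-name filter and the `dra-driver` image-name substring are assumptions about what the scripts create:

```bash
# Find the kind node container (name depends on the cluster created by the script)
docker ps --filter "name=control-plane" --format "{{.Names}}"

# List images on that node, substituting the name found above
docker exec <node-name> crictl images | grep dra-driver
```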
We now install the NVIDIA GPU DRA driver:
```bash
./demo/clusters/kind/install-dra-driver.sh
```

This should show the driver's kubelet plugin pod running in the `nvidia-dra-driver` namespace:

```bash
kubectl get pods -n nvidia-dra-driver
```

```
NAME                                         READY   STATUS    RESTARTS   AGE
nvidia-k8s-dra-driver-kubelet-plugin-t5qgz   1/1     Running   0          44s
```
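If the pod is still starting, you can wait for it to become ready before proceeding. This is an optional addition to the demo steps:

```bash
# Block until all pods in the driver's namespace are Ready (or time out)
kubectl wait --for=condition=Ready pods --all -n nvidia-dra-driver --timeout=120s
```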
Finally, you can run the various examples contained in the demo/specs/quickstart folder.
With the most recent updates for Kubernetes v1.31, only the first three examples in this folder are currently functional.
You can run them as follows:
```bash
kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
```

Get the pods' statuses. Depending on which GPUs are available, running the first three examples will produce output similar to the following:
Note: there is a known issue with kind. You may see an error while trying to tail the log of a running pod in the kind cluster: `failed to create fsnotify watcher: too many open files`. The issue may be resolved by increasing the value for `fs.inotify.max_user_watches`.
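For example, the limit can be raised with sysctl. This is a sketch, not part of the original instructions; the value and the drop-in file name are illustrative choices:

```bash
# Raise the inotify watch limit for the running system (value is illustrative)
sudo sysctl -w fs.inotify.max_user_watches=524288

# Persist the setting across reboots (file name is an arbitrary choice)
echo "fs.inotify.max_user_watches=524288" | sudo tee /etc/sysctl.d/99-inotify.conf
```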
```bash
kubectl get pod -A -l app=pod
```

```
NAMESPACE   NAME   READY   STATUS    RESTARTS   AGE
gpu-test1   pod1   1/1     Running   0          34s
gpu-test1   pod2   1/1     Running   0          34s
gpu-test2   pod    2/2     Running   0          34s
gpu-test3   pod1   1/1     Running   0          34s
gpu-test3   pod2   1/1     Running   0          34s
```
```bash
kubectl logs -n gpu-test1 -l app=pod
```

```
GPU 0: A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 0: A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)
```

```bash
kubectl logs -n gpu-test2 pod --all-containers
```

```
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
```

```bash
kubectl logs -n gpu-test3 -l app=pod
```

```
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
```
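Beyond pod statuses and logs, you can also inspect the DRA objects behind these examples. This is an optional addition; it assumes the cluster exposes the resource.k8s.io API group used by DRA in Kubernetes v1.31:

```bash
# Claims created for the example pods, and the device classes registered by the driver
kubectl get resourceclaims -A
kubectl get deviceclasses
```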
Remove the cluster created in the preceding steps:
```bash
./demo/clusters/kind/delete-cluster.sh
```
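As an optional final check (an addition to the original steps), confirm that kind no longer lists the demo cluster:

```bash
# The demo cluster should no longer appear in this list
kind get clusters
```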