Volcano Device Plugin
Note: This document is provided for reference purposes; see the official documentation for the authoritative version.
1. Prerequisites
The prerequisites for running the Volcano device plugin are:
- NVIDIA Driver: > 440
- nvidia-docker: Version > 2.0
- Docker Configuration: NVIDIA must be set as the default runtime.
- Kubernetes Version: >= 1.16
- Volcano Version: >= 1.9
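The version constraints above can be checked mechanically. The sketch below defines a `version_ge` helper (it relies on GNU `sort -V`); the version values shown are illustrative placeholders — substitute the real output of `nvidia-smi`, `docker version`, and `kubectl version` on your nodes.

```shell
# version_ge A B: succeeds when version A >= version B (relies on sort -V).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Illustrative values; on a real node obtain them from the tools themselves, e.g.:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
driver="450.80.02"
k8s="1.28"

version_ge "$driver" "440" && echo "driver ok"
version_ge "$k8s" "1.16" && echo "kubernetes ok"
```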
2. Quick Start
2.1 Prepare Your GPU Nodes
The following steps must be executed on all GPU nodes. This guide assumes you have already installed the NVIDIA drivers and nvidia-docker.
Please note that you need to install the nvidia-docker2 package rather than the nvidia-container-toolkit package, as Kubernetes does not yet support the newer --gpus option.
Example for Debian-based systems:
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
Set NVIDIA as the default runtime:
Edit the Docker daemon configuration file (usually located at /etc/docker/daemon.json):
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
If the runtimes entry is not yet present, refer to the nvidia-docker installation page.
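Before restarting Docker, it is worth validating the edited file, since a malformed daemon.json will prevent the daemon from starting. A minimal sketch — it writes to a temp file here so it can be inspected safely; on a real node, point CONF at /etc/docker/daemon.json instead:

```shell
# Validate a daemon.json candidate before restarting Docker.
CONF=$(mktemp)  # on a real node: CONF=/etc/docker/daemon.json
cat > "$CONF" <<'JSON'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
JSON

# json.tool fails (non-zero exit) on malformed JSON.
python3 -m json.tool "$CONF" > /dev/null && echo "valid JSON"
grep -q '"default-runtime": "nvidia"' "$CONF" && echo "nvidia is the default runtime"
```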
2.2 Configuration
You need to enable vgpu in the volcano-scheduler ConfigMap:
kubectl edit cm -n volcano-system volcano-scheduler-configmap
For Volcano v1.9 and later, use the following configuration structure:
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # Enable vGPU
          deviceshare.SchedulePolicy: binpack # binpack / spread
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
2.3 Sharing Modes
Volcano-vGPU supports two device-sharing modes: HAMi-core and Dynamic MIG. Each node uses one mode, and heterogeneous deployments are supported (some nodes using HAMi-core, others Dynamic MIG).
| Mode | Isolation Type | Requires MIG GPU | Core/Memory Control | Recommended For |
|---|---|---|---|---|
| HAMi-core | Software (vCUDA) | No | Yes | General workloads |
| Dynamic MIG | Hardware | Yes | Yes | Performance-critical workloads |
2.4 Enable GPU Support in Kubernetes
Once the options are enabled on all target GPU nodes, deploy the following DaemonSet to enable GPU support in the cluster:
kubectl create -f volcano-vgpu-device-plugin.yml
2.5 Verify Environment Readiness
Check the node status and confirm that volcano.sh/vgpu-number appears under the node's capacity:
$ kubectl get node {node_name} -oyaml
...
capacity:
  volcano.sh/vgpu-memory: "89424"
  volcano.sh/vgpu-number: "10" # vGPU resources detected
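If you capture the node output, the vGPU counters can be extracted with standard tools. The sketch below parses a sample mimicking the output shown above; the sed pattern is an illustration, not an official interface.

```shell
# Sample mimicking the relevant fields of `kubectl get node <name> -oyaml`.
sample='capacity:
  volcano.sh/vgpu-memory: "89424"
  volcano.sh/vgpu-number: "10"'

# Pull out the vgpu-number value.
vgpus=$(printf '%s\n' "$sample" | sed -n 's/.*vgpu-number: "\([0-9]*\)".*/\1/p')
echo "vgpu-number: $vgpus"
```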
2.6 Run a vGPU Job
Request vGPUs by setting volcano.sh/vgpu-number, volcano.sh/vgpu-cores, and volcano.sh/vgpu-memory in the resource limits.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    volcano.sh/vgpu-mode: "hami-core" # Optional: 'hami-core' or 'mig'
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-number: 2 # Requesting 2 GPU cards
          volcano.sh/vgpu-memory: 3000 # Optional: 3000 MiB of device memory per vGPU
          volcano.sh/vgpu-cores: 50 # Optional: 50% core utilization per vGPU
EOF
Warning: If you do not request GPUs when using the device plugin with NVIDIA images, all GPUs on the machine will be exposed inside your container. The number of vGPUs requested by a container cannot exceed the number of physical GPUs on the node.
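For capacity planning it helps to tally what a request consumes. Assuming volcano.sh/vgpu-memory is a per-vGPU figure in MiB, as the comments in the example above indicate, the pod's total device-memory footprint is:

```shell
# Accounting for gpu-pod1: 2 vGPUs at 3000 MiB each (per-vGPU semantics assumed).
vgpu_number=2
vgpu_memory_mib=3000
total_mib=$((vgpu_number * vgpu_memory_mib))
echo "total vgpu-memory requested: ${total_mib} MiB"
```

Against the sample node above, that is 6000 MiB of the 89424 MiB of vgpu-memory the node reports.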
3. Monitoring
The volcano-scheduler metrics record each GPU's usage and limits. Access them via:
curl {volcano_scheduler_cluster_ip}:8080/metrics
For node-specific metrics (GPU utilization, memory usage, pod limits), use:
curl {volcano_device_plugin_pod_ip}:9394/metrics