Volcano Device Plugin
Note: This document is provided for reference purposes; see the official documentation for the authoritative version.
1. Prerequisites
The prerequisites for running the Volcano device plugin are:
- NVIDIA Driver: > 440
- nvidia-docker: Version > 2.0
- Docker Configuration: NVIDIA must be set as the default runtime.
- Kubernetes Version: >= 1.16
- Volcano Version: >= 1.9
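The version constraints above can be checked mechanically. The sketch below defines a `version_ge` helper (it relies on GNU `sort -V`); the version values shown are illustrative placeholders — substitute the real output of `nvidia-smi`, `docker version`, and `kubectl version` on your nodes.

```shell
# version_ge A B: succeeds when version A >= version B (relies on sort -V).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Illustrative values; on a real node obtain them from the tools themselves, e.g.:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
driver="450.80.02"
k8s="1.28"

version_ge "$driver" "440" && echo "driver ok"
version_ge "$k8s" "1.16" && echo "kubernetes ok"
```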
2. Quick Start
2.1 Prepare Your GPU Nodes
The following steps must be executed on all GPU nodes. This guide assumes you have already installed the NVIDIA drivers and nvidia-docker.
Please note that you need to install the nvidia-docker2 package rather than the nvidia-container-toolkit package, as Kubernetes does not yet support the newer --gpus option.
Example for Debian-based systems:
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
Set NVIDIA as the default runtime:
Edit the Docker daemon configuration file (usually located at /etc/docker/daemon.json):
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
If the runtimes entry is not yet present, refer to the nvidia-docker installation page.
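Before restarting Docker, it is worth validating the edited file, since a malformed daemon.json will prevent the daemon from starting. A minimal sketch — it writes to a temp file here so it can be inspected safely; on a real node, point CONF at /etc/docker/daemon.json instead:

```shell
# Validate a daemon.json candidate before restarting Docker.
CONF=$(mktemp)  # on a real node: CONF=/etc/docker/daemon.json
cat > "$CONF" <<'JSON'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
JSON

# json.tool fails (non-zero exit) on malformed JSON.
python3 -m json.tool "$CONF" > /dev/null && echo "valid JSON"
grep -q '"default-runtime": "nvidia"' "$CONF" && echo "nvidia is the default runtime"
```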
2.2 Configuration
You need to enable vgpu in the volcano-scheduler ConfigMap:
kubectl edit cm -n volcano-system volcano-scheduler-configmap
For Volcano v1.9 and later, use the following configuration structure:
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # Enable vGPU
          deviceshare.SchedulePolicy: binpack # binpack / spread
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
2.3 Sharing Modes
Volcano-vGPU supports two device-sharing modes: HAMi-core and Dynamic MIG. Each node uses one mode, and heterogeneous deployments are supported (some nodes using HAMi-core, others Dynamic MIG).
| Mode | Isolation Type | Requires MIG GPU | Core/Memory Control | Recommended For |
|---|---|---|---|---|
| HAMi-core | Software (vCUDA) | No | Yes | General workloads |
| Dynamic MIG | Hardware | Yes | Yes | Performance-critical workloads |
2.4 Enable GPU Support in Kubernetes
Once the options are enabled on all target GPU nodes, deploy the following DaemonSet to enable GPU support in the cluster:
kubectl create -f volcano-vgpu-device-plugin.yml
2.5 Verify Environment Readiness
Check the node status and confirm that volcano.sh/vgpu-number appears under the node's capacity:
$ kubectl get node {node_name} -oyaml
...
capacity:
  volcano.sh/vgpu-memory: "89424"
  volcano.sh/vgpu-number: "10" # vGPU resources detected
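If you capture the node output, the vGPU counters can be extracted with standard tools. The sketch below parses a sample mimicking the output shown above; the sed pattern is an illustration, not an official interface.

```shell
# Sample mimicking the relevant fields of `kubectl get node <name> -oyaml`.
sample='capacity:
  volcano.sh/vgpu-memory: "89424"
  volcano.sh/vgpu-number: "10"'

# Pull out the vgpu-number value.
vgpus=$(printf '%s\n' "$sample" | sed -n 's/.*vgpu-number: "\([0-9]*\)".*/\1/p')
echo "vgpu-number: $vgpus"
```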
2.6 Run a vGPU Job
Request vGPUs by setting volcano.sh/vgpu-number, volcano.sh/vgpu-cores, and volcano.sh/vgpu-memory in the resource limits.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  annotations:
    volcano.sh/vgpu-mode: "hami-core" # Optional: 'hami-core' or 'mig'
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-number: 2 # Requesting 2 GPU cards
          volcano.sh/vgpu-memory: 3000 # Optional: 3000 MiB of device memory per vGPU
          volcano.sh/vgpu-cores: 50 # Optional: 50% core utilization per vGPU
EOF
Warning: If you do not request GPUs when using the device plugin with NVIDIA images, all GPUs on the machine will be exposed inside your container. The number of vGPUs requested by a container cannot exceed the number of physical GPUs on the node.
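For capacity planning it helps to tally what a request consumes. Assuming volcano.sh/vgpu-memory is a per-vGPU figure in MiB, as the comments in the example above indicate, the pod's total device-memory footprint is:

```shell
# Accounting for gpu-pod1: 2 vGPUs at 3000 MiB each (per-vGPU semantics assumed).
vgpu_number=2
vgpu_memory_mib=3000
total_mib=$((vgpu_number * vgpu_memory_mib))
echo "total vgpu-memory requested: ${total_mib} MiB"
```

Against the sample node above, that is 6000 MiB of the 89424 MiB of vgpu-memory the node reports.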
3. Monitoring
The volcano-scheduler metrics record each GPU's usage and limits. Access them via:
curl {volcano_scheduler_cluster_ip}:8080/metrics
For node-specific metrics (GPU utilization, memory usage, pod limits), use:
curl {volcano_device_plugin_pod_ip}:9394/metrics