Auto-installing NVIDIA device plug-in on GPU nodes only

Problem Statement

On a Kubernetes cluster, before the GPUs in the nodes can be used, you need to deploy a DaemonSet for the NVIDIA device plugin. This DaemonSet runs a pod on each node to provide the required driver for the GPUs. For example, on an Azure AKS cluster, the official Azure documentation describes how to install the NVIDIA device plug-in (see References).

The YAML manifest is in the nvidia-device-plugin-ds.yaml file. Once this file is applied, a pod providing the required driver is automatically started whenever a GPU node is added, no matter whether the node pool is scaled manually or by the cluster autoscaler. But it may also run the pod on a node which has no GPUs at all, since that YAML manifest doesn't constrain the pods to run only on GPU nodes. This is a waste of resources on non-GPU nodes.

Solution

This is where "nodeSelector" and "affinity" can help. "nodeSelector" provides a very simple way to constrain pods to nodes with particular labels, while the "affinity/anti-affinity" feature greatly expands the types of constraints you can express.
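As a minimal sketch of the simpler nodeSelector approach (the label value gpupool1 here is only an assumed example), the constraint would be a single fixed key/value pair in the pod template:

spec:
  template:
    spec:
      # Hypothetical example: only schedule the pod on nodes carrying this exact label.
      # A nodeSelector can only match one fixed value per key, so covering several
      # GPU node pools at once calls for the nodeAffinity rule used further below.
      nodeSelector:
        agentpool: gpupool1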

Every node in an AKS node pool is automatically labeled with agentpool=<nodepool name>. You can use the command below to check the values of the 'agentpool' label:

kubectl get nodes -L agentpool
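On a hypothetical cluster with one CPU node pool and two GPU node pools, the output would look roughly like this (node names, ages, and versions are made up for illustration):

NAME                                STATUS   ROLES   AGE   VERSION   AGENTPOOL
aks-nodepool1-12345678-vmss000000   Ready    agent   30d   v1.27.7   nodepool1
aks-gpupool1-12345678-vmss000000    Ready    agent   10d   v1.27.7   gpupool1
aks-gpupool2-12345678-vmss000000    Ready    agent   10d   v1.27.7   gpupool2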

We can leverage this label in a nodeAffinity rule, as shown below. Assuming the AKS cluster has two GPU node pools, gpupool1 and gpupool2, the DaemonSet will then run the NVIDIA device plug-in only on nodes of gpupool1 and gpupool2, and not on nodes of any other node pool.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:1.11
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: agentpool
                  operator: In
                  values:
                    - gpupool1
                    - gpupool2   
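To try this out, a typical workflow (assuming the manifest above is saved as nvidia-device-plugin-ds.yaml and the gpu-resources namespace already exists) is to apply the manifest and then confirm that the pods land only on the GPU node pools:

kubectl apply -f nvidia-device-plugin-ds.yaml

# Every pod of the DaemonSet should be running on a node that belongs to
# gpupool1 or gpupool2; nodes from other node pools should have no such pod.
kubectl get pods -n gpu-resources -o wide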

References

Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)

Assigning Pods to Nodes 
