Showing posts from December, 2020

Auto-installing NVIDIA device plug-in on GPU nodes only

Problem Statement

On a Kubernetes cluster, before the GPUs in the nodes can be used, you need to deploy a DaemonSet for the NVIDIA device plug-in. This DaemonSet runs a pod on each node to provide the required drivers for the GPUs. For example, for an Azure AKS cluster, the official Azure documentation describes how to install the NVIDIA device plug-in. The YAML manifest is in the nvidia-device-plugin-ds.yaml file.

Once this file is applied, a pod is automatically started to provide the required driver whenever a GPU node is added, whether the node pool is scaled manually or auto-scaled. But it may also run the pod on a node that has no GPUs at all, since that YAML manifest doesn't constrain the pods to run only on GPU nodes. This wastes resources on non-GPU nodes.

Solution

This is where "nodeSelector" and "affinity" can help. "nodeSelector" provides a very simple way to constrain pods to nodes with particular labels. The "affinity/anti-affinity" feat