Occasionally, Kubernetes workloads require specialized nodes: for machine learning, to save money with burstable node types, or simply to pin certain Pods to a dedicated subset of nodes. Thankfully, Kubernetes offers a few useful mechanisms for telling the scheduler how we'd like our workloads distributed: node- and pod-based affinity rules, along with taints and tolerations. I use these frequently at work for numerous reasons, and I'll go over them briefly here.

Recently, I realized something interesting about how Persistent Volume Claims (PVCs) work with dynamically provisioned storage like EBS volumes (meaning volumes that Kubernetes creates automatically from a StorageClass, rather than references to existing volumes). By default, a StorageClass provisions a volume as soon as the PVC is created, and that can have consequences when trying to guide how Pods are scheduled.
First, let me explain a bit about the ways we can describe our desired layout to the Kubernetes scheduler. An approach I use a lot is to add what are called taints to a node, which repel Pods unless they can tolerate that taint. Taints have effects of varying strictness: PreferNoSchedule, which, as the name suggests, asks the scheduler to try not to schedule on that node; NoSchedule, which forbids the scheduler from placing new Pods on that node; and NoExecute, which not only blocks new Pods from scheduling but also evicts existing Pods from the node. Here's an example:
# Add a taint to a node
kubectl taint nodes somenode somekey=value:NoSchedule
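For reference, the other two effects use the same syntax, a taint is removed by appending a minus sign, and kubectl describe shows what a node currently carries (these all reuse the somenode example above):

# The softer and stricter variants
kubectl taint nodes somenode somekey=value:PreferNoSchedule
kubectl taint nodes somenode somekey=value:NoExecute

# Remove a taint by appending "-" to the effect
kubectl taint nodes somenode somekey=value:NoSchedule-

# Confirm which taints the node currently has
kubectl describe node somenode | grep -A2 Taints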
With the somekey=value:NoSchedule taint in place, no new Pods will be scheduled on somenode unless they tolerate it. Here's how we'd add the toleration:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: web
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "somekey"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
This allows the Pod to be scheduled on the node with the above taint, though it certainly doesn't guarantee it will be. For that, we'll want to set a node affinity rule based on a label we apply to the node; sadly, the taint itself isn't usable for this purpose. Let's add a label to a node:
# Add a label to the node
kubectl label nodes somenode apptype=web
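To double-check that the label landed, we can ask for only the nodes that match it:

# List the nodes carrying the apptype=web label
kubectl get nodes -l apptype=web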
At this point the node has a taint that keeps ordinary Pods off of it, our mypod Pod has a toleration that lets it land there, and the node carries a label. Let's modify the Pod to require that it run on this labeled node:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apptype
            operator: In
            values:
            - web
  containers:
  - name: web
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "somekey"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
This forces the Pod to run only on nodes with the apptype=web label.
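As an aside, if a hard requirement is stricter than you need, nodeAffinity also supports a softer preferredDuringSchedulingIgnoredDuringExecution variant. Here's a minimal sketch of just the affinity fragment, reusing the same apptype label:

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100  # 1-100; higher weights are favored more strongly
        preference:
          matchExpressions:
          - key: apptype
            operator: In
            values:
            - web

With the preferred form, the scheduler favors matching nodes but will still place the Pod elsewhere if none are available.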
Complexity with PVCs
The above Pod schedules correctly, but it has no volumes. A common use case would be to add a volume to our Pod, perhaps like this:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apptype
            operator: In
            values:
            - web
  containers:
  - name: web
    image: nginx
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: webdata
      mountPath: /usr/local/data
  tolerations:
  - key: "somekey"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  volumes:
  - name: webdata
    persistentVolumeClaim:
      claimName: webdata-pvc
We'll also need to add the webdata-pvc claim that the webdata volume references:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webdata-pvc
spec:
  storageClassName: "gp2"
  accessModes:
  - ReadWriteOnce  # an EBS volume attaches to one node at a time
  resources:
    requests:
      storage: 10Gi  # example size
This uses a StorageClass named gp2:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
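(A quick note: kubernetes.io/aws-ebs is the legacy in-tree provisioner. On clusters running the AWS EBS CSI driver, a roughly equivalent class would look like the sketch below; the gp3-csi name here is just an example.)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-csi  # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3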
Here's the problem: the webdata-pvc PVC above will immediately provision a Persistent Volume. EBS volumes, including the gp2 volumes created by that StorageClass, are locked to a particular Availability Zone. If the node we labeled above sits in a different Availability Zone than the PV that the PVC created, the Pod becomes unschedulable: it requires a specific node and a volume that can't be attached to that node. One way to see the mismatch (topology.kubernetes.io/zone is the standard zone label on cloud-provider nodes):
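# Which zone is each node in?
kubectl get nodes -L topology.kubernetes.io/zone

# Which zone did the dynamically provisioned PV land in?
kubectl describe pv

To solve this, we'll need to adjust the StorageClass: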
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
# This is different from the default of Immediate
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: ext4
Adding volumeBindingMode: WaitForFirstConsumer tells Kubernetes to accept the PVC but to delay binding and provisioning the PV until a Pod actually tries to use the claim. This lazy approach lets the PV be created in a zone that respects the affinity rules (and other scheduling constraints) of the Pod that will use it.
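You can watch this happen: with the new binding mode, a freshly created claim reports a Pending status until a Pod that uses it is scheduled.

# The claim stays Pending until its first consumer is scheduled
kubectl get pvc webdata-pvc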
This is actually a pretty safe setting to use for EBS and other volume types. Another good option for EBS is allowVolumeExpansion: true, which, as the name suggests, allows volumes to be expanded after they're created.
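Putting the two settings together, here's what the full StorageClass might look like:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp2
  fsType: ext4

With expansion enabled, an existing volume can be grown by raising spec.resources.requests.storage on its PVC.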