Occasionally, Kubernetes workloads require specialized nodes: maybe for machine learning, to save money with burstable node types, or simply to lock certain Pods to a dedicated subset of nodes. Thankfully, Kubernetes offers a few useful mechanisms for telling the scheduler how we’d like our workloads distributed: node- or pod-based affinity rules, along with taints and tolerations. I use these frequently at work, so I’ll go over them briefly. Recently, I realized something interesting about how Persistent Volume Claims (PVCs) interact with dynamically provisioned storage such as EBS volumes (that is, volumes Kubernetes creates automatically from a StorageClass, rather than references to existing volumes). By default, a StorageClass provisions a volume as soon as the PVC is created, and that has consequences when you’re also trying to guide where Pods are scheduled.
First, let me explain a bit about the ways we can describe our desired layout to the Kubernetes scheduler. An approach I use a lot is to add what are called taints to a node, which repel Pods unless they can tolerate that taint. Taints have effects of varying strictness: PreferNoSchedule, which, as the name suggests, asks the scheduler to try not to schedule on that node; NoSchedule, which forbids the scheduler from placing new Pods on that node; and NoExecute, which not only blocks new Pods but also evicts any existing Pods on the node. Here’s an example:
# Add a taint to a node
kubectl taint nodes somenode somekey=value:NoSchedule
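For reference, the other two effects use the same syntax, and a trailing dash removes a taint. These are just illustrations against the same hypothetical somenode; the rest of the examples assume only the NoSchedule taint above:

# A softer version: ask the scheduler to avoid the node when possible
kubectl taint nodes somenode somekey=value:PreferNoSchedule

# The strictest version: block new Pods and evict ones already running
kubectl taint nodes somenode somekey=value:NoExecute

# To remove a taint later, append a trailing dash
kubectl taint nodes somenode somekey=value:NoSchedule-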
With the NoSchedule taint in place, no new Pods will be scheduled on the node somenode unless they tolerate it. Here’s how we’d add the toleration:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: web
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "somekey"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
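As an aside, if you only care that the taint’s key exists and not what its value is, a toleration can use the Exists operator instead; a small sketch against the same somekey taint:

tolerations:
- key: "somekey"
  operator: "Exists"
  effect: "NoSchedule"

Note that with Exists, no value is specified.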
Either toleration allows the Pod to be scheduled on the node with the above taint, though it certainly doesn’t guarantee it will land there. For that, we’ll want a node affinity, which is based on a label we apply to the node; sadly, the taint itself isn’t usable for this purpose. Let’s add a label to a node:
# Add a label to the node
kubectl label nodes somenode apptype=web
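To double-check the node’s state at this point, kubectl can show both the label and the taint we added (same hypothetical somenode as before):

# Show the node's labels
kubectl get node somenode --show-labels

# Show any taints on the node
kubectl get node somenode -o jsonpath='{.spec.taints}'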
At this point the node has a taint that keeps untolerating Pods off of it, our mypod Pod tolerates that taint, and the node carries the apptype=web label. Let’s modify the Pod to require that it run on this labeled node:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apptype
            operator: In
            values:
            - web
  containers:
  - name: web
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "somekey"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
This forces the Pod to schedule only onto nodes carrying the apptype=web label.
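If a hard requirement is stronger than you need, the same label can also be expressed as a preference. Here’s a minimal sketch of just the affinity stanza, using preferredDuringSchedulingIgnoredDuringExecution with an arbitrary weight:

  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: apptype
            operator: In
            values:
            - web

With this form, the scheduler favors apptype=web nodes but will still place the Pod elsewhere if none are available.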
Complexity with PVCs
The above Pod schedules correctly, but it has no volumes. A common use case would be to add a volume to our Pod, maybe like this:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: apptype
            operator: In
            values:
            - web
  containers:
  - name: web
    image: nginx
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: webdata
      mountPath: /usr/local/data
  volumes:
  # Backed by the PVC defined below
  - name: webdata
    persistentVolumeClaim:
      claimName: webdata-pvc
  tolerations:
  - key: "somekey"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
We’ll also need to add a PVC for the webdata volume:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webdata-pvc
spec:
  storageClassName: "gp2"
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi  # example size
This claim uses a StorageClass named gp2:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
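You can watch the immediate behavior described earlier: as soon as the claim above is created, a PV appears and is already pinned to a zone. A quick way to check (the zone label key varies by Kubernetes version; it may be topology.kubernetes.io/zone or the older failure-domain.beta.kubernetes.io/zone):

# The claim binds right away, before any Pod references it
kubectl get pvc webdata-pvc

# The dynamically created PV carries a zone label
kubectl get pv --show-labels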
Here’s the problem: the webdata-pvc PVC above will immediately provision a Persistent Volume. EBS volumes, including the gp2 volumes created by that StorageClass, are locked to a particular Availability Zone. If the node we labeled above is in a different Availability Zone than the PV created for the claim, the Pod becomes unschedulable: its affinity requires a specific node, but its volume can’t be attached there. To solve this, we’ll need to adjust the StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
# This is different from the default of Immediate
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: ext4
Adding volumeBindingMode: WaitForFirstConsumer tells Kubernetes to accept the PVC but to hold off on provisioning and binding the PV until a Pod actually tries to use the claim. This lazy approach lets the volume be created somewhere that respects the affinity rules (and other scheduling constraints) of the Pod that will use it.
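One way to see the difference, assuming the Pod manifest above is saved to a hypothetical mypod.yaml file: the claim now sits in Pending until something consumes it.

# With WaitForFirstConsumer, the claim stays Pending on its own
kubectl get pvc webdata-pvc

# Creating the Pod triggers provisioning in the Pod's Availability Zone,
# after which the claim binds
kubectl apply -f mypod.yaml
kubectl get pvc webdata-pvc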
WaitForFirstConsumer is actually a pretty safe setting to add for EBS and other volume types. Another good one for EBS is allowVolumeExpansion: true, which, as the name suggests, allows volumes created from the StorageClass to be expanded after the fact.
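For reference, here’s a sketch of the same StorageClass with both settings applied; nothing here beyond the fields already discussed:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: gp2
  fsType: ext4

With expansion enabled, growing a volume later is just a matter of raising spec.resources.requests.storage on the PVC, and the underlying EBS volume is resized to match.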