Kubernetes Node Affinity and EBS Volumes

Occasionally, Kubernetes workloads require specialized nodes. Sometimes it’s for machine learning, or to save money through burstable node types, or maybe just to lock certain Pods to a dedicated subset of nodes. Thankfully, Kubernetes offers a few useful mechanisms to tell the scheduler how we’d like our workloads distributed: node-based or pod-based affinity rules, along with taints and tolerations. I use these frequently at work, so I’ll briefly go over how they fit together.

Recently, I realized something interesting about how Persistent Volume Claims (PVCs) work with dynamically provisioned storage like EBS volumes (meaning volumes that Kubernetes creates automatically based on a StorageClass, rather than referencing existing volumes). By default, a StorageClass uses the Immediate volume binding mode, so a volume is provisioned as soon as the PVC is created, before any Pod that uses it has been scheduled. This can have some consequences when trying to guide how Pods are scheduled.

First, let me explain a bit about the ways we can describe our desired layout to the Kubernetes scheduler. An approach I use a lot is to add what are called taints to a node, which repel Pods unless they tolerate the taint. Taints have effects of varying strictness: PreferNoSchedule, which, as the name suggests, asks the scheduler to try not to schedule on that node; NoSchedule, which forbids the scheduler from placing new Pods on that node; and NoExecute, which not only blocks new Pods but also evicts Pods already running on the node that don’t tolerate the taint. Here’s an example:
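
A minimal sketch of a NoSchedule taint, shown as the relevant fragment of the Node object. The key and value (dedicated=web) are assumptions chosen for illustration; any key and value would work the same way.

```yaml
# Fragment of the Node object for somenode, showing only the taint.
# The same taint can be applied from the CLI with:
#   kubectl taint nodes somenode dedicated=web:NoSchedule
apiVersion: v1
kind: Node
metadata:
  name: somenode
spec:
  taints:
    - key: dedicated       # assumed key, for illustration only
      value: web           # assumed value, for illustration only
      effect: NoSchedule   # new Pods without a matching toleration are rejected
```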

With the above taint, no new Pods will be scheduled on the Node somenode unless they tolerate it. Here’s how we’d add the toleration:
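
A sketch of what the toleration might look like on the mypod Pod, assuming the taint is dedicated=web with the NoSchedule effect (assumed values, not necessarily the ones on your node). The container name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: web          # placeholder container name
      image: nginx       # placeholder image
  tolerations:
    - key: dedicated     # must match the taint's key (assumed here)
      operator: Equal    # match on both key and value
      value: web         # must match the taint's value (assumed here)
      effect: NoSchedule # must match the taint's effect
```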

This will allow this Pod to be scheduled on the node with the above taint, though it certainly doesn’t guarantee it. For that, we’ll want to set a node affinity rule, which matches on a label we apply to the node. Sadly, the taint isn’t usable for this purpose. Let’s add a label to a node:
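
The label can be sketched as a fragment of the Node object; apptype=web is the label the rest of this post relies on.

```yaml
# Fragment of the Node object for somenode, showing only the label.
# The same label can be applied from the CLI with:
#   kubectl label nodes somenode apptype=web
apiVersion: v1
kind: Node
metadata:
  name: somenode
  labels:
    apptype: web
```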

So far, the node has a taint that keeps other Pods off it, our mypod Pod tolerates that taint, and the node carries a label. Let’s modify the Pod to require it to run on this labeled node:
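
A sketch of the Pod with a required node affinity on the apptype=web label. The toleration values, container name, and image are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: web          # placeholder container name
      image: nginx       # placeholder image
  tolerations:
    - key: dedicated     # assumed taint key
      operator: Equal
      value: web         # assumed taint value
      effect: NoSchedule
  affinity:
    nodeAffinity:
      # "required" makes this a hard rule at scheduling time;
      # "IgnoredDuringExecution" means running Pods are not evicted
      # if the node's labels change later.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: apptype
                operator: In
                values:
                  - web
```

For a single exact-match label like this one, the simpler nodeSelector field (apptype: web) would likely behave the same way; nodeAffinity is the more expressive form.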

This forces the Pod to be scheduled only on nodes with the apptype=web label.

Complexity with PVCs

The above Pod schedules correctly, but it has no volumes. A common use case would be to add a volume to our Pod, maybe like this:
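
A sketch of the Pod with a webdata volume backed by the webdata-pvc claim. The mount path, container name, and image are assumptions, and the scheduling fields (tolerations and affinity) are left out to keep the fragment short.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: web                             # placeholder container name
      image: nginx                          # placeholder image
      volumeMounts:
        - name: webdata                     # must match the volume name below
          mountPath: /usr/share/nginx/html  # assumed mount path
  volumes:
    - name: webdata
      persistentVolumeClaim:
        claimName: webdata-pvc              # the PVC this Pod consumes
```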

We’ll also need to add a PVC for the webdata volume:
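
A minimal sketch of that claim. The requested size and access mode are assumptions; EBS volumes attach to one node at a time, so ReadWriteOnce is the usual choice.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webdata-pvc
spec:
  accessModes:
    - ReadWriteOnce       # assumed; typical for EBS-backed volumes
  storageClassName: gp2   # the StorageClass that provisions the volume
  resources:
    requests:
      storage: 10Gi       # assumed size, for illustration only
```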

This uses a Storage Class named gp2:
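
A sketch of what such a StorageClass might look like. The provisioner is an assumption: ebs.csi.aws.com is the AWS EBS CSI driver, while older clusters used the in-tree kubernetes.io/aws-ebs provisioner.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: ebs.csi.aws.com   # assumed; older clusters used kubernetes.io/aws-ebs
parameters:
  type: gp2                    # EBS volume type
# volumeBindingMode is not set here, so it defaults to Immediate:
# the volume is provisioned as soon as a PVC is created.
```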

Here’s the problem: The above webdata-pvc PVC will immediately provision a Persistent Volume. EBS volumes, including the gp2 volumes created by that storage class, are locked to a particular Availability Zone. If the node we labeled above is in a different Availability Zone than the PV created by the PVC, then the Pod will not be schedulable. It needs both a specific node and a volume that can’t be used on that node. To solve this, we’ll need to adjust the Storage Class:
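
A sketch of the adjusted StorageClass. As before, the provisioner is an assumption; the line that matters is volumeBindingMode.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: ebs.csi.aws.com              # assumed; older clusters used kubernetes.io/aws-ebs
parameters:
  type: gp2                               # EBS volume type
volumeBindingMode: WaitForFirstConsumer   # delay provisioning until a Pod uses the claim
```

One thing to watch for: volumeBindingMode can’t be changed on an existing StorageClass, so you’ll likely need to delete and recreate the class (or create one under a new name) for this to take effect.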

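Adding volumeBindingMode: WaitForFirstConsumer tells Kubernetes to accept the PVC but to delay binding and provisioning the PV until a Pod that uses the claim is scheduled. This lazy approach lets the PV respect the scheduling constraints of that Pod, such as its node affinity and tolerations, so the volume is created in the same Availability Zone as the chosen node.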

This is actually a pretty safe setting to add for EBS and other zone-bound volume types. Another good one to add for EBS is allowVolumeExpansion: true, which lets you grow a volume later by raising the storage request on its PVC.
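
A sketch of both pieces together: the StorageClass with expansion enabled, and the PVC with a larger request. The provisioner and the sizes are assumptions for illustration.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: ebs.csi.aws.com              # assumed provisioner
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true                # permits growing PVCs that use this class
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: webdata-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 20Gi   # raised from an assumed original 10Gi; requests can grow but not shrink
```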
