Scheduling

At the heart of it, Kubernetes is a scheduler. But how does scheduling work? The scheduler tries to put pods in the "right" place to the best of its ability, but sometimes we need a bit of extra control.

How Does the Scheduler Work?

The Kubernetes scheduler is an amazing piece of software. Explaining how it works in detail would take far too long, and I don't think I am even remotely qualified to do the explanation justice.

Instead, you can read this article

Node Selector

A node selector constrains a pod to nodes that carry specific labels.

To deploy the deployment with the nodeSelector enabled, run

kubectl apply -f k8s/scheduling/node-selector.yaml

You will see the pod remains pending.

This is expected. Let's look at our YAML file.

    spec:
      containers:
      - image: moficodes/lifecycle:v0.0.1
        name: lifecycle
      nodeSelector:
        gpu: available
      restartPolicy: Always

We have a nodeSelector of gpu: available set. The scheduler will not place the pod anywhere unless a node with the label gpu set to the value available exists.
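
To see why the pod is stuck, we can describe it (the pod name below is a placeholder; use the name shown by kubectl get po). The Events section should show a FailedScheduling event explaining that no node matched the pod's node affinity/selector.

kubectl get po
kubectl describe po <pod-name>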

We can fix this easily by running

kubectl label node <node-name> gpu=available

To find the node names, we can run

kubectl get nodes

Once the node is labeled, the scheduler will automatically schedule the pod on that node.
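
To confirm the label is in place and watch the pod move out of pending, we can run:

kubectl get nodes -l gpu=available
kubectl get po -o wide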

Node Affinity

Required

Pods will remain pending until a suitable node is found.

Preferred

Pods will start even if nothing matches. But if something matches, that node will be given priority. If multiple nodes match partially, the node with the highest total weight wins.

Let's look at the example we are running.

cat k8s/scheduling/node-affinity.yaml

The part we care about is this

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: nodetype
                  operator: In
                  values: 
                    - "dev"
                    - "test"
          preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                  - key: numCores
                    operator: Gt
                    values: 
                      - "3"
              weight: 1
            - preference:
                matchExpressions:
                  - key: location
                    operator: In
                    values: 
                      - "us-east"
                      - "us-south"
              weight: 5

Let's break it down. We have two kinds of node affinity set. We require the nodetype label to be either dev or test in order to schedule at all. Our pod will remain pending until that is true.

The other affinities are preferred, so pods will schedule even if they are not met. But each preference carries a weight, and when matching nodes exist, the node with the highest total weight gets priority. For example, a node that only matches numCores greater than 3 scores 1, while a node that only matches location in us-east/us-south scores 5, so the second node wins.

Let's test it out.

kubectl apply -f k8s/scheduling/node-affinity.yaml 

If you run kubectl get po you will see the pod is pending.

First, let's label our nodes with the preferred labels.

kubectl label node <node1> numCores="4"

Then label the second node with location us-east

kubectl label node <node2> location="us-east"

A quick kubectl get po will show it's still pending.

Let's add the required label to all nodes.

kubectl label node -l arch=amd64 nodetype=dev

And just with that our pod will get scheduled.

Although we put our weights on the preferred affinities, at this point the scheduler may have placed the pod on any node that would take it, since the required label appeared on all nodes at once. But if we restart the pod, there is a good chance it will land on the node we prefer.

Also, after a pod is scheduled, changing the labels on the node changes nothing for that pod. Labels are ignored during execution (that is the IgnoredDuringExecution part of the rule names).
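
To check which node the pod landed on, and to force a reschedule so the preferred weights get a fresh chance, we can run something like the following (assuming the example is a Deployment; the deployment name is a placeholder, use the name from kubectl get deploy):

kubectl get po -o wide
kubectl rollout restart deployment <deployment-name>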

Pod Affinity and Anti Affinity

In Kubernetes, for most cases we don't care where an application runs. But there are performance benefits to having certain pods run close to, or away from, certain other pods. For example, if you have 3 nodes in 3 locations and an application with 3 pods, and all of those pods run on the same node, you get no benefit from having a node close to your users. In a case like this you want your pods to be spread out. There is also the case where an application uses another application heavily (like a datastore or cache); having them close to one another helps performance.

Let's see an example of this.

We have 3 nodes and 3 replicas of our pod. We also have 3 replicas of the redis cache.

Just like nodeAffinity, pod affinity and anti-affinity are either required or preferred. If required, the pod won't schedule until the condition is met. If preferred, it will schedule regardless, but the scheduler will first try the node with the highest weight.

Let's look at the affinity rules for the web-store found in k8s/scheduling/pod-affinity-web-server.yaml

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"

The anti-affinity requires the scheduler to place each pod away from any pod with the label app: web-store, which happens to be the label of the web-store pods themselves. This spreads the replicas across nodes.

It also has affinity towards pods with the label app: store and won't be scheduled onto a node until a pod with that label is running there. That is the label we use for our redis-cache, which means our pods won't schedule unless a redis-cache pod is also available.
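
For reference, the redis-cache pods carry the app: store label and spread themselves out with an anti-affinity rule of their own. A minimal sketch of what k8s/scheduling/pod-affinity-redis-cache.yaml likely contains (the actual file in the repo may differ, including the image) is:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: redis-cache
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: store
      template:
        metadata:
          labels:
            app: store
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - store
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: redis-server
            image: redis:alpine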

Let's see this in action.

kubectl apply -f k8s/scheduling/pod-affinity-web-server.yaml

If we check with kubectl get po we will see the pods are pending.

Let's run the redis cache.

kubectl apply -f k8s/scheduling/pod-affinity-redis-cache.yaml

And with that we will see that our web-store pods also get to running.
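
To confirm the placement, a wide listing should show one web-store pod and one redis-cache pod on each node:

kubectl get po -o wide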

Taints and Tolerations

A taint marks a node so that pods are not scheduled onto it unless they explicitly tolerate the taint. Let's taint all of our nodes:

kubectl taint nodes -l arch=amd64 special=true:NoSchedule

Notice the -l arch=amd64. All our nodes have that label. We could technically use any label selector, but this one happens to be set on every node.
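
To double-check which nodes the selector matches (and will therefore be tainted), we can run:

kubectl get nodes -l arch=amd64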

Once the command runs, all our nodes will have the taint:

kubectl describe nodes | grep Taint
Taints:             special=true:NoSchedule
Taints:             special=true:NoSchedule
Taints:             special=true:NoSchedule

If we still want to schedule something on these nodes, we can use a toleration. A toleration is a way to tell the scheduler that our pod tolerates the taint on the node.

Let's try deploying something.

kubectl apply -f k8s/scheduling/taint-toleration.yaml

If we check the status, we will see the pod is pending.

kubectl get po

That's because we have the toleration commented out.

Let's fix that.

nano k8s/scheduling/taint-toleration.yaml

Uncomment these 5 lines.

      containers:
      - image:  moficodes/os-signal:v0.0.1
        name:  os-signal
        imagePullPolicy: Always
      # tolerations:
      #   - key: "special"
      #     operator: "Equal"
      #     value: "true"
      #     effect: "NoSchedule"
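
After uncommenting, that part of the pod spec should look like this:

      containers:
      - image:  moficodes/os-signal:v0.0.1
        name:  os-signal
        imagePullPolicy: Always
      tolerations:
        - key: "special"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"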

If we apply this new change, we will see the deployment succeed.
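
Re-apply the manifest and check that the pod reaches running:

kubectl apply -f k8s/scheduling/taint-toleration.yaml
kubectl get po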

Let's clean up the taint.

kubectl taint nodes -l arch=amd64 special-
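
To verify the taint is gone, the same grep we used earlier should now show <none> for every node:

kubectl describe nodes | grep Taint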
