Description
Background
EvenPodsSpread is a feature that makes pod placement decisions based on the distribution of existing pods. The concept of `maxSkew` raises a question: how do we define the scope (i.e., which nodes) over which we count matching pods in order to calculate the skew?
Here are the rules implemented in the alpha (a rough sketch of this filtering follows the list):
- Nodes with all `spec.topologySpreadConstraints[*].topologyKey` present are in the scope.
- Nodes matching the incoming pod's `nodeSelector`/`nodeAffinity` (if defined) are in the scope.
- Other nodes are NOT in the scope.
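To make the rules concrete, here is a rough Go sketch of that filtering. The types are simplified stand-ins, not the actual scheduler structs:

```go
// A minimal sketch of the alpha "scope" rules above: count a node only if it
// carries every constraint's topologyKey and matches the incoming pod's
// nodeSelector. Not the real scheduler code.
package main

import "fmt"

type node struct{ labels map[string]string }

type constraint struct{ topologyKey string }

func nodeInScope(n node, constraints []constraint, nodeSelector map[string]string) bool {
	// Rule 1: the node must carry every topologyKey referenced by the constraints.
	for _, c := range constraints {
		if _, ok := n.labels[c.topologyKey]; !ok {
			return false
		}
	}
	// Rule 2: the node must match the incoming pod's nodeSelector, if defined.
	for k, v := range nodeSelector {
		if n.labels[k] != v {
			return false
		}
	}
	// Rule 3 (implicit): any other node is NOT in the scope.
	return true
}

func main() {
	n := node{labels: map[string]string{"zone": "zone1", "disktype": "ssd"}}
	fmt.Println(nodeInScope(n, []constraint{{topologyKey: "zone"}}, map[string]string{"disktype": "ssd"})) // true
}
```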
The above rules have been discussed and agreed upon, so they are not likely to change in future releases. However, within the scope, there is one special case I want to bring up to get more feedback.
Core Question
Here is the core question: what if a node has taints applied? Should we count the matching pods on that node? (The alpha answer is yes.)
NOTE: to keep the wording simple, in this post terms like "tainted nodes" and "applied taints" technically mean "the node is tainted AND the incoming pod doesn't have corresponding tolerations". If the incoming pod's tolerations match the taints, that node is definitely considered.
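For reference, this is roughly what "corresponding tolerations" means, written as a stand-alone sketch with simplified types (the real logic lives in Kubernetes' toleration helpers):

```go
// A simplified sketch of taint/toleration matching: a node is "tainted" in the
// sense used in this post when at least one of its taints is not matched by
// any of the incoming pod's tolerations. Not the real Kubernetes helpers.
package main

import "fmt"

type taint struct{ key, value, effect string }

type toleration struct{ key, operator, value, effect string } // operator: "Equal" or "Exists"

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol toleration, t taint) bool {
	if tol.effect != "" && tol.effect != t.effect {
		return false
	}
	if tol.key != "" && tol.key != t.key {
		return false
	}
	if tol.operator == "Exists" {
		return true
	}
	return tol.value == t.value // "Equal" (the default) also compares values
}

// tolerated reports whether every taint on the node is matched by some toleration.
func tolerated(taints []taint, tols []toleration) bool {
	for _, t := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	nodeTaints := []taint{{key: "dedicated", value: "gpu", effect: "NoSchedule"}}
	fmt.Println(tolerated(nodeTaints, nil)) // false: the pod has no corresponding toleration
}
```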
Case 1: spreading on zones
We have a cluster with 2 zones and 4 nodes, and NodeB is tainted. (P = Pod)
+-------------------+---------------+
| Zone1 | Zone2 |
+-------+-----------+-------+-------+
| NodeA | NodeB | NodeC | NodeD |
| | (tainted) | | |
+-------+-----------+-------+-------+
| P | P P P | P | P P |
+-------+-----------+-------+-------+
Suppose an incoming pod wants to be scheduled evenly across zones with maxSkew=1. In the current alpha implementation, NodeB is still considered a valid node, so its 3 pods contribute to the matching number for Zone1, which sums up to 4 (versus 3 in Zone2). So the scheduling result is that the incoming pod goes to NodeC.
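To spell out the arithmetic, here is a small sketch of the counting and the maxSkew feasibility check for this case. The numbers come from the diagram above; this is not the actual scheduler code:

```go
// A rough sketch of the per-zone counting behind case 1, with NodeB's pods
// included (the alpha behavior).
package main

import "fmt"

func main() {
	// Matching pods per node, grouped by zone (numbers from the diagram above).
	podsPerZone := map[string]map[string]int{
		"zone1": {"NodeA": 1, "NodeB": 3}, // NodeB is tainted but still counted in alpha
		"zone2": {"NodeC": 1, "NodeD": 2},
	}
	const maxSkew = 1

	// Sum matching pods per zone and track the global minimum.
	counts := map[string]int{}
	minCount := -1
	for zone, nodes := range podsPerZone {
		for _, n := range nodes {
			counts[zone] += n
		}
		if minCount == -1 || counts[zone] < minCount {
			minCount = counts[zone]
		}
	}
	fmt.Println(counts) // map[zone1:4 zone2:3]

	// A zone is feasible for the incoming pod if placing the pod there keeps
	// (count+1) - minCount within maxSkew.
	for zone, c := range counts {
		fmt.Printf("%s feasible: %v\n", zone, (c+1)-minCount <= maxSkew) // zone1: false, zone2: true
	}
}
```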
Sounds perfectly reasonable. But hold on; let's take a look at another case.
Case 2: spreading on nodes
+-------------------+---------------+
| Zone1 | Zone2 |
+-------+-----------+-------+-------+
| NodeA | NodeB | NodeC | NodeD |
| | (tainted) | | |
+-------+-----------+-------+-------+
| P | | P | P P |
+-------+-----------+-------+-------+
In this case, an incoming pod wants to be scheduled evenly across nodes, with maxSkew=1. As NodeB has 0 pods, to keep the spread even we should put the incoming pod on NodeB, where it will stay pending... (remember, we've assumed the incoming pod doesn't have corresponding tolerations defined). This is also the behavior of the current alpha implementation.
What if we exclude tainted nodes?
Things are a little different. Here is the comparison:
| | include tainted nodes (Alpha) | exclude tainted nodes (to be discussed) |
|---|---|---|
| case1 | pod goes to NodeC | pod goes to NodeA |
| case2 | pod goes to NodeB (pending) | pod goes to NodeA or NodeC |
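To show where the case 1 difference comes from, here is a small sketch that counts matching pods per zone with and without tainted nodes. The `podTolerates` field is a hypothetical stand-in for the real taint/toleration check:

```go
// A sketch of the alternative being discussed: when counting matching pods,
// optionally skip nodes whose taints the incoming pod does not tolerate.
package main

import "fmt"

type nodeInfo struct {
	name         string
	zone         string
	matchingPods int
	podTolerates bool // would the incoming pod tolerate this node's taints?
}

// countPerZone sums matching pods per zone, optionally excluding nodes the
// incoming pod cannot tolerate.
func countPerZone(nodes []nodeInfo, excludeTainted bool) map[string]int {
	counts := map[string]int{}
	for _, n := range nodes {
		if excludeTainted && !n.podTolerates {
			continue
		}
		counts[n.zone] += n.matchingPods
	}
	return counts
}

func main() {
	// Case 1 from above: NodeB is tainted and the incoming pod has no toleration.
	nodes := []nodeInfo{
		{"NodeA", "zone1", 1, true},
		{"NodeB", "zone1", 3, false},
		{"NodeC", "zone2", 1, true},
		{"NodeD", "zone2", 2, true},
	}
	fmt.Println(countPerZone(nodes, false)) // alpha:     map[zone1:4 zone2:3] -> pod goes to zone2
	fmt.Println(countPerZone(nodes, true))  // discussed: map[zone1:1 zone2:3] -> pod goes to zone1
}
```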
Maybe more options
Maybe it's not a yes-or-no answer. It's possible to provide more options to decide which kinds of taints should be included/excluded. Here are the kinds of taints I can think of:
- The taint is applied by users (an admin or operator).
- The taint is applied by Kubernetes:
  - The node becomes not-ready or unreachable. This can be caused by a network glitch, a CNI plugin error, or a node reboot.
  - The node becomes unschedulable. This is usually controlled by users via:
    - `kubectl cordon` - the node is marked as unschedulable, and existing pods are kept.
    - `kubectl drain` - similar to `cordon`, but existing pods are evicted.
  - The node comes under resource pressure.
Anything else?
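If we did go down the "more options" route, one purely illustrative way to bucket taints could key off the well-known `node.kubernetes.io/*` taint keys that Kubernetes applies itself, treating everything else as user-applied. The option names below are made up:

```go
// A sketch of how a "which taints count" policy could look. The bucket names
// are hypothetical; only the node.kubernetes.io/* key prefix comes from the
// taints Kubernetes itself applies.
package main

import (
	"fmt"
	"strings"
)

type taintClass string

const (
	userApplied       taintClass = "user-applied"
	nodeCondition     taintClass = "node-condition"     // not-ready, unreachable, pressure, ...
	nodeUnschedulable taintClass = "node-unschedulable" // e.g. kubectl cordon/drain
)

// classify buckets a taint key into one of the categories discussed above.
func classify(key string) taintClass {
	switch {
	case key == "node.kubernetes.io/unschedulable":
		return nodeUnschedulable
	case strings.HasPrefix(key, "node.kubernetes.io/"):
		return nodeCondition
	default:
		return userApplied
	}
}

func main() {
	for _, k := range []string{
		"node.kubernetes.io/not-ready",
		"node.kubernetes.io/unschedulable",
		"dedicated", // user-applied
	} {
		fmt.Printf("%-35s -> %s\n", k, classify(k))
	}
}
```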
Things will be more complicated if the `TaintNodesByCondition` and `TaintBasedEvictions` features are disabled. In those cases the behavior is controlled internally by node conditions, and no taints are applied. But these two features will eventually GA, and adopting taints/tolerations is the direction, so I guess we don't need to consider those combinations.
/sig scheduling
cc/ @bsalamat @k82cn @ahg-g @ravisantoshgudimetla @liu-cong @alculquicondor @draveness @krmayankk @resouer