Description
Background
EvenPodsSpread is a feature that makes pod placement decisions based on the distribution of existing pods. The concept of `maxSkew` raises a question: how do we define the scope (i.e., which nodes) over which we count matching pods in order to calculate the skew?
Here are the rules implemented in the alpha (a rough sketch of this filtering follows the list):
- Nodes with all `spec.topologySpreadConstraints[*].topologyKey` present are in the scope.
- Nodes matching the incoming pod's `nodeSelector`/`nodeAffinity` (if defined) are in the scope.
- Other nodes are NOT in the scope.
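To make the rules concrete, here is a rough Go sketch of that filtering. The types are simplified stand-ins, not the actual scheduler structs:

```go
// A minimal sketch of the alpha "scope" rules above: count a node only if it
// carries every constraint's topologyKey and matches the incoming pod's
// nodeSelector. Not the real scheduler code.
package main

import "fmt"

type node struct{ labels map[string]string }

type constraint struct{ topologyKey string }

func nodeInScope(n node, constraints []constraint, nodeSelector map[string]string) bool {
	// Rule 1: the node must carry every topologyKey referenced by the constraints.
	for _, c := range constraints {
		if _, ok := n.labels[c.topologyKey]; !ok {
			return false
		}
	}
	// Rule 2: the node must match the incoming pod's nodeSelector, if defined.
	for k, v := range nodeSelector {
		if n.labels[k] != v {
			return false
		}
	}
	// Rule 3 (implicit): any other node is NOT in the scope.
	return true
}

func main() {
	n := node{labels: map[string]string{"zone": "zone1", "disktype": "ssd"}}
	fmt.Println(nodeInScope(n, []constraint{{topologyKey: "zone"}}, map[string]string{"disktype": "ssd"})) // true
}
```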
The above rules have been discussed and agreed upon, so they are not likely to change in future releases. However, within the scope, there is one special case I want to bring up to get more feedback.
Core Question
Here is the core question: what if a node has taints applied? Should we count the matching pods on that node? (The alpha answer is yes.)
NOTE: to keep the wording simple, in this post terms like "tainted nodes" and "applied taints" technically mean "the node is tainted AND the incoming pod doesn't have corresponding tolerations". If the incoming pod's tolerations match the taints, that node is definitely considered.
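For reference, this is roughly what "corresponding tolerations" means, written as a stand-alone sketch with simplified types (the real logic lives in Kubernetes' toleration helpers):

```go
// A simplified sketch of taint/toleration matching: a node is "tainted" in the
// sense used in this post when at least one of its taints is not matched by
// any of the incoming pod's tolerations. Not the real Kubernetes helpers.
package main

import "fmt"

type taint struct{ key, value, effect string }

type toleration struct{ key, operator, value, effect string } // operator: "Equal" or "Exists"

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol toleration, t taint) bool {
	if tol.effect != "" && tol.effect != t.effect {
		return false
	}
	if tol.key != "" && tol.key != t.key {
		return false
	}
	if tol.operator == "Exists" {
		return true
	}
	return tol.value == t.value // "Equal" (the default) also compares values
}

// tolerated reports whether every taint on the node is matched by some toleration.
func tolerated(taints []taint, tols []toleration) bool {
	for _, t := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	nodeTaints := []taint{{key: "dedicated", value: "gpu", effect: "NoSchedule"}}
	fmt.Println(tolerated(nodeTaints, nil)) // false: the pod has no corresponding toleration
}
```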
Case 1: spreading on zones
We have a cluster with 2 zones and 4 nodes, and NodeB is tainted. (P = Pod)
+-------------------+---------------+
| Zone1 | Zone2 |
+-------+-----------+-------+-------+
| NodeA | NodeB | NodeC | NodeD |
| | (tainted) | | |
+-------+-----------+-------+-------+
| P | P P P | P | P P |
+-------+-----------+-------+-------+
Suppose an incoming pod wants to be scheduled evenly across zones with maxSkew=1. In the current alpha implementation, NodeB is still considered a valid node, so its 3 pods contribute to the matching number for Zone1, which sums up to 4 (versus 3 in Zone2). So the scheduling result is that the incoming pod goes to NodeC.
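To spell out the arithmetic, here is a small sketch of the counting and the maxSkew feasibility check for this case. The numbers come from the diagram above; this is not the actual scheduler code:

```go
// A rough sketch of the per-zone counting behind case 1, with NodeB's pods
// included (the alpha behavior).
package main

import "fmt"

func main() {
	// Matching pods per node, grouped by zone (numbers from the diagram above).
	podsPerZone := map[string]map[string]int{
		"zone1": {"NodeA": 1, "NodeB": 3}, // NodeB is tainted but still counted in alpha
		"zone2": {"NodeC": 1, "NodeD": 2},
	}
	const maxSkew = 1

	// Sum matching pods per zone and track the global minimum.
	counts := map[string]int{}
	minCount := -1
	for zone, nodes := range podsPerZone {
		for _, n := range nodes {
			counts[zone] += n
		}
		if minCount == -1 || counts[zone] < minCount {
			minCount = counts[zone]
		}
	}
	fmt.Println(counts) // map[zone1:4 zone2:3]

	// A zone is feasible for the incoming pod if placing the pod there keeps
	// (count+1) - minCount within maxSkew.
	for zone, c := range counts {
		fmt.Printf("%s feasible: %v\n", zone, (c+1)-minCount <= maxSkew) // zone1: false, zone2: true
	}
}
```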
Sounds perfectly reasonable. But hold on; let's take a look at another case.
Case 2: spreading on nodes
+-------------------+---------------+
| Zone1 | Zone2 |
+-------+-----------+-------+-------+
| NodeA | NodeB | NodeC | NodeD |
| | (tainted) | | |
+-------+-----------+-------+-------+
| P | | P | P P |
+-------+-----------+-------+-------+
In this case, an incoming pod wants to be scheduled evenly across nodes, with maxSkew=1. As NodeB has 0 pods, to keep the spread even we should put the incoming pod on NodeB, where it will stay pending... (remember, we've assumed the incoming pod doesn't have corresponding tolerations defined). This is also the behavior of the current alpha implementation.
What if we exclude tainted nodes?
Things are a little different. Here is the comparison:
| | include tainted nodes (Alpha) | exclude tainted nodes (to be discussed) |
|---|---|---|
| case1 | pod goes to NodeC | pod goes to NodeA |
| case2 | pod goes to NodeB (pending) | pod goes to NodeA or NodeC |
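To show where the case 1 difference comes from, here is a small sketch that counts matching pods per zone with and without tainted nodes. The `podTolerates` field is a hypothetical stand-in for the real taint/toleration check:

```go
// A sketch of the alternative being discussed: when counting matching pods,
// optionally skip nodes whose taints the incoming pod does not tolerate.
package main

import "fmt"

type nodeInfo struct {
	name         string
	zone         string
	matchingPods int
	podTolerates bool // would the incoming pod tolerate this node's taints?
}

// countPerZone sums matching pods per zone, optionally excluding nodes the
// incoming pod cannot tolerate.
func countPerZone(nodes []nodeInfo, excludeTainted bool) map[string]int {
	counts := map[string]int{}
	for _, n := range nodes {
		if excludeTainted && !n.podTolerates {
			continue
		}
		counts[n.zone] += n.matchingPods
	}
	return counts
}

func main() {
	// Case 1 from above: NodeB is tainted and the incoming pod has no toleration.
	nodes := []nodeInfo{
		{"NodeA", "zone1", 1, true},
		{"NodeB", "zone1", 3, false},
		{"NodeC", "zone2", 1, true},
		{"NodeD", "zone2", 2, true},
	}
	fmt.Println(countPerZone(nodes, false)) // alpha:     map[zone1:4 zone2:3] -> pod goes to zone2
	fmt.Println(countPerZone(nodes, true))  // discussed: map[zone1:1 zone2:3] -> pod goes to zone1
}
```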
Maybe more options
Maybe it's not a yes-or-no answer. It's possible to provide more options to decide which kinds of taints should be included/excluded. Here are the kinds of taints I can think of:
- The taint is applied by users (an admin or operator).
- The taint is applied by Kubernetes:
  - The node becomes not-ready or unreachable. This can be caused by a network glitch, a CNI plugin error, or a node reboot.
  - The node becomes unschedulable. This is usually controlled by users via:
    - `kubectl cordon` - the node is marked as unschedulable, and existing pods are kept.
    - `kubectl drain` - similar to `cordon`, but existing pods are evicted.
  - The node comes under resource pressure.
Anything else?
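If we did go down the "more options" route, one purely illustrative way to bucket taints could key off the well-known `node.kubernetes.io/*` taint keys that Kubernetes applies itself, treating everything else as user-applied. The option names below are made up:

```go
// A sketch of how a "which taints count" policy could look. The bucket names
// are hypothetical; only the node.kubernetes.io/* key prefix comes from the
// taints Kubernetes itself applies.
package main

import (
	"fmt"
	"strings"
)

type taintClass string

const (
	userApplied       taintClass = "user-applied"
	nodeCondition     taintClass = "node-condition"     // not-ready, unreachable, pressure, ...
	nodeUnschedulable taintClass = "node-unschedulable" // e.g. kubectl cordon/drain
)

// classify buckets a taint key into one of the categories discussed above.
func classify(key string) taintClass {
	switch {
	case key == "node.kubernetes.io/unschedulable":
		return nodeUnschedulable
	case strings.HasPrefix(key, "node.kubernetes.io/"):
		return nodeCondition
	default:
		return userApplied
	}
}

func main() {
	for _, k := range []string{
		"node.kubernetes.io/not-ready",
		"node.kubernetes.io/unschedulable",
		"dedicated", // user-applied
	} {
		fmt.Printf("%-35s -> %s\n", k, classify(k))
	}
}
```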
Things will be more complicated if the `TaintNodesByCondition` and `TaintBasedEvictions` features are disabled. In those cases the behavior is controlled internally by node conditions, and no taints are applied. But these two features will eventually GA, and adopting taints/tolerations is the direction, so I guess we don't need to consider those combinations.
/sig scheduling
cc/ @bsalamat @k82cn @ahg-g @ravisantoshgudimetla @liu-cong @alculquicondor @draveness @krmayankk @resouer