Description
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
active_file
inactive_file
working_set
WorkingSet
cAdvisor
memory.available
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
We'll say BUG REPORT (though this is arguable)
Kubernetes version (use kubectl version):
1.5.3
Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
  NAME="Ubuntu"
  VERSION="14.04.5 LTS, Trusty Tahr"
  ID=ubuntu
  ID_LIKE=debian
  PRETTY_NAME="Ubuntu 14.04.5 LTS"
  VERSION_ID="14.04"
- Kernel (e.g. uname -a):
  Linux HOSTNAME_REDACTED 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others:
What happened:
A pod was evicted due to memory pressure on the node, even though it appeared to me that there should not have been enough memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.
What you expected to happen:
memory.available would not have active page cache counted against it, since the kernel can reclaim it. Counting it also seems to greatly complicate configuring memory eviction policies in the general case, since it's effectively impossible to know how much page cache will be active at any given time on any given node, or how long it will stay active (which matters for eviction grace periods).
How to reproduce it (as minimally and precisely as possible):
Cause a node to chew up enough active page cache that the existing calculation for memory.available trips a memory eviction threshold, even though the threshold would not be tripped if the page cache (active and inactive) were freed to make room for anonymous memory.
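For illustration, here is a minimal sketch of one way to generate that kind of load. This is not the workload that actually triggered my eviction; the file path and size are made up, and you would need to scale them against the node's capacity and thresholds. The idea is just that re-reading the same large file keeps its pages on the kernel's active LRU list while the process's own RSS stays small.

```go
// pagecache_pressure.go: a minimal sketch of a page-cache-heavy workload.
// The path and size are made up; this is NOT the workload that triggered my
// eviction. Pages that are read more than once get promoted to the kernel's
// active LRU list, so repeated passes keep total_active_file high while the
// process's own RSS stays small.
package main

import (
	"fmt"
	"io"
	"os"
)

const (
	path     = "/var/tmp/pagecache-filler" // hypothetical file path
	fileSize = 4 << 30                     // 4 GiB; tune relative to the eviction threshold
)

func main() {
	// Create the file once if it does not already exist.
	if _, err := os.Stat(path); os.IsNotExist(err) {
		f, err := os.Create(path)
		if err != nil {
			panic(err)
		}
		buf := make([]byte, 1<<20)
		for written := int64(0); written < fileSize; written += int64(len(buf)) {
			if _, err := f.Write(buf); err != nil {
				panic(err)
			}
		}
		f.Close()
	}

	// Re-read the file forever; second and later passes keep the cached pages
	// on the active list rather than the inactive list.
	buf := make([]byte, 1<<20)
	for pass := 1; ; pass++ {
		f, err := os.Open(path)
		if err != nil {
			panic(err)
		}
		for {
			if _, err := f.Read(buf); err == io.EOF {
				break
			} else if err != nil {
				panic(err)
			}
		}
		f.Close()
		fmt.Printf("completed read pass %d\n", pass)
	}
}
```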
Anything else we need to know:
I discussed this with @derekwaynecarr in #sig-node and am opening this issue at his request (conversation starts here).
Before poking around on Slack or opening this issue, I did my best to read through the 1.5.3 release code, Kubernetes documentation, and cgroup kernel documentation to make sure I understood what was going on here. The short of it is that I believe this calculation:
memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
Is using cAdvisor's value for the working set, which, if I traced the code correctly, amounts to:
$cgroupfs/memory.usage_in_bytes - total_inactive_file
Where, according to my interpretation of the kernel documentation, usage_in_bytes includes all page cache:
$kernel/Documentation/cgroups/memory.txt
The core of the design is a counter called the res_counter. The res_counter
tracks the current memory usage and limit of the group of processes associated
with the controller.
...
2.2.1 Accounting details
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
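To sanity-check my reading, here is a hedged sketch of that arithmetic done directly against the cgroup filesystem. This is my reconstruction, not cAdvisor's actual code, and it assumes cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory. The point it illustrates is that total_inactive_file is subtracted from usage_in_bytes but total_active_file is not, so active page cache stays in the working set that gets charged against memory.available.

```go
// workingset_sketch.go: my reconstruction of the working-set arithmetic, NOT
// cAdvisor's actual code. Assumes cgroup v1 with the memory controller
// mounted at /sys/fs/cgroup/memory.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

const cgroupRoot = "/sys/fs/cgroup/memory" // assumption about the mount point

func readUint(path string) uint64 {
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	v, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		panic(err)
	}
	return v
}

// memoryStat returns a named counter (e.g. "total_inactive_file") from memory.stat.
func memoryStat(name string) uint64 {
	f, err := os.Open(cgroupRoot + "/memory.stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == name {
			v, _ := strconv.ParseUint(fields[1], 10, 64)
			return v
		}
	}
	return 0
}

func main() {
	usage := readUint(cgroupRoot + "/memory.usage_in_bytes") // anon RSS + all page cache
	inactiveFile := memoryStat("total_inactive_file")
	activeFile := memoryStat("total_active_file")

	// Mirror of the usage_in_bytes - total_inactive_file calculation.
	workingSet := uint64(0)
	if usage > inactiveFile {
		workingSet = usage - inactiveFile
	}

	fmt.Printf("usage_in_bytes:      %d\n", usage)
	fmt.Printf("total_inactive_file: %d (subtracted)\n", inactiveFile)
	fmt.Printf("total_active_file:   %d (still counted in the working set)\n", activeFile)
	fmt.Printf("working set:         %d\n", workingSet)
}
```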
Ultimately my issue is how I can set generally applicable memory eviction thresholds if active page cache counts against them, since there is no way to know (1) how much page cache will generally be active across a cluster's nodes, to use as part of threshold calculations, or (2) how long active page cache will stay active, to use as part of eviction grace period calculations.
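To make that concrete, here is a worked example with entirely made-up numbers (not measurements from my node) showing how a node with plenty of reclaimable file cache still appears to be below an eviction threshold.

```go
// eviction_math_sketch.go: a worked example with entirely made-up numbers
// (not measurements from my node) showing how active file cache pushes
// memory.available under a threshold even though that cache is reclaimable.
package main

import "fmt"

func main() {
	const GiB = uint64(1 << 30)

	capacity := 16 * GiB         // node.status.capacity[memory]
	usageInBytes := 12 * GiB     // memory.usage_in_bytes: anon RSS + all page cache
	totalInactiveFile := 1 * GiB // subtracted by the working-set calculation
	totalActiveFile := 8 * GiB   // NOT subtracted, even though it is reclaimable

	workingSet := usageInBytes - totalInactiveFile
	available := capacity - workingSet // memory.available as the kubelet computes it
	threshold := 6 * GiB               // e.g. an eviction threshold of memory.available<6Gi

	fmt.Printf("memory.available as computed:                 %d GiB\n", available/GiB)
	fmt.Printf("memory.available if active file were dropped: %d GiB\n", (available+totalActiveFile)/GiB)
	fmt.Printf("trips the %d GiB threshold? %v\n", threshold/GiB, available < threshold)
}
```

With these numbers the kubelet sees 5Gi available and evicts against a 6Gi threshold, even though dropping the active file cache would leave roughly 13Gi actually available; only about 3Gi of the usage is anonymous memory.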
I understand that there are many layers here and that this is not a simple problem to solve correctly in the general case, or even to understand top to bottom. So I apologize up front if any of my conclusions are incorrect or I'm missing anything major, and I appreciate any feedback you all can provide.
As requested by @derekwaynecarr: cc @sjenning @derekwaynecarr