Problem
After restarting kubelet, one of the Kubernetes nodes started behaving unpredictably.
The symptoms were unusual and did not point to an obvious root cause.
Symptoms
The following issues were observed:
- some kubectl operations stopped working;
- system pods started failing unexpectedly;
- kubelet became unstable;
- the logs did not reveal an obvious root cause.
Investigation
The most likely causes were checked first:
- network issues;
- disk pressure;
- CPU exhaustion;
- memory exhaustion;
- overall node health.
None of these hypotheses were confirmed.
Root cause
The issue was caused by an exhausted inotify watcher limit on the node.
inotify is a Linux subsystem used to monitor file and directory changes.
Many infrastructure components depend on it:
- Kubernetes components;
- container runtimes;
- monitoring systems;
- log collectors;
- various agents.
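To see which of these components actually hold inotify watches on a node, the per-process counts can be read from /proc. The following is a minimal Python sketch, not the tooling used during the incident. It assumes a Linux /proc in which inotify descriptors show up as anon_inode:inotify links under /proc/<pid>/fd and each watch is one "inotify wd:" line in the matching fdinfo file. The names count_watches and watches_per_process are made up for this example.

```python
import os
from collections import Counter


def count_watches(fdinfo_text: str) -> int:
    """Count watches in the text of one /proc/<pid>/fdinfo/<fd> file.

    For an inotify descriptor the kernel prints one line per watch,
    each beginning with "inotify wd:".
    """
    return sum(
        1 for line in fdinfo_text.splitlines() if line.startswith("inotify wd:")
    )


def watches_per_process(proc_root: str = "/proc") -> Counter:
    """Return {(pid, command name): watch count} for readable processes."""
    totals: Counter = Counter()
    try:
        pids = [p for p in os.listdir(proc_root) if p.isdigit()]
    except OSError:
        return totals  # no /proc here (not Linux, or wrong path)
    for pid in pids:
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            with open(os.path.join(proc_root, pid, "comm")) as f:
                name = f.read().strip()
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) != "anon_inode:inotify":
                    continue
                with open(os.path.join(proc_root, pid, "fdinfo", fd)) as f:
                    totals[(int(pid), name)] += count_watches(f.read())
            except OSError:
                continue  # descriptor was closed while scanning
    return totals


if __name__ == "__main__":
    # Print the ten largest consumers: watch count, PID, command name.
    for (pid, name), n in watches_per_process().most_common(10):
        print(f"{n:>8}  {pid:>7}  {name}")
```

Without root privileges the script only sees processes owned by the current user, so on a node it would normally be run as root.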
When the limit is reached, unexpected symptoms can appear:
- applications stop receiving file change events;
- individual components become unstable;
- failures appear unrelated to the actual cause.
These issues are difficult to diagnose because they can remain invisible for a long time.
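Part of the reason the failures look unrelated is the way the kernel reports them. According to inotify(7), inotify_add_watch fails with ENOSPC when the per-user watch limit is reached, and on Linux that error is rendered as "No space left on device", which points at the disk rather than at inotify. The sketch below, in Python, reads the current limits from /proc/sys/fs/inotify and maps the two relevant error codes to a hint. The source does not say which of the inotify limits was exhausted in this incident, and the helper names (read_inotify_limits, explain) are hypothetical.

```python
import errno
import os

# The three tunables documented in inotify(7).
INOTIFY_LIMITS = ("max_user_watches", "max_user_instances", "max_queued_events")


def read_inotify_limits(sysctl_dir: str = "/proc/sys/fs/inotify") -> dict:
    """Read the current inotify limits; unreadable files are skipped."""
    limits = {}
    for name in INOTIFY_LIMITS:
        try:
            with open(os.path.join(sysctl_dir, name)) as f:
                limits[name] = int(f.read())
        except (OSError, ValueError):
            continue
    return limits


# How exhaustion surfaces to a caller, per inotify(7). Both codes can also
# have other causes, so these are hints rather than a diagnosis.
ERRNO_HINTS = {
    errno.ENOSPC: "inotify_add_watch: per-user watch limit reached "
                  "(max_user_watches)",
    errno.EMFILE: "inotify_init: per-user instance limit reached "
                  "(max_user_instances), or the process is out of descriptors",
}


def explain(err: int) -> str:
    """Return a hint for an errno seen around inotify calls."""
    return ERRNO_HINTS.get(err, "not an inotify limit error")


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"{name} = {value}")
    # Show the misleading generic message next to the inotify-specific hint.
    for code in (errno.ENOSPC, errno.EMFILE):
        print(f"{os.strerror(code)!r}: {explain(code)}")
```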
Most teams actively monitor:
- CPU;
- memory;
- disks;
- network.
inotify watcher consumption rarely makes it onto that list of metrics.
Takeaways
Not every Kubernetes issue originates inside Kubernetes itself.
Sometimes the root cause is hidden several layers lower in the stack.
A forgotten Linux system limit can cause more operational pain than a lack of compute resources.
As infrastructure grows, Linux system limits should be reviewed periodically alongside cluster resources.
After this incident, inotify watcher consumption was added to the list of metrics that are monitored proactively.
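One way such a proactive check could look is sketched below in Python. It is an illustration under stated assumptions, not the monitoring that was actually deployed: it sums watches per real UID (the watch limit is accounted per user, not per node), compares the totals with fs.inotify.max_user_watches, and prints a warning above a threshold. The 80% threshold and the function names are assumptions, and the count can be too high when several processes share one inherited inotify descriptor.

```python
import os
from collections import Counter


def real_uid(status_text: str):
    """Extract the real UID from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            return int(line.split()[1])  # fields: real, effective, saved, fs
    return None


def watches_by_uid(proc_root: str = "/proc") -> Counter:
    """Sum inotify watches per real UID by scanning /proc/<pid>/fdinfo."""
    totals: Counter = Counter()
    try:
        pids = [p for p in os.listdir(proc_root) if p.isdigit()]
    except OSError:
        return totals  # no /proc here (not Linux, or wrong path)
    for pid in pids:
        fdinfo_dir = os.path.join(proc_root, pid, "fdinfo")
        try:
            with open(os.path.join(proc_root, pid, "status")) as f:
                uid = real_uid(f.read())
            entries = os.listdir(fdinfo_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in entries:
            try:
                with open(os.path.join(fdinfo_dir, fd)) as f:
                    # One "inotify wd:" line per watch; other fds add zero.
                    totals[uid] += sum(l.startswith("inotify wd:") for l in f)
            except OSError:
                continue  # descriptor was closed while scanning
    return totals


def usage_report(used_by_uid, limit: int, warn_ratio: float = 0.8) -> list:
    """Return [(uid, used, ratio)] for UIDs at or above warn_ratio of limit."""
    if limit <= 0:
        return []
    rows = [(uid, used, used / limit) for uid, used in used_by_uid.items()]
    return sorted(
        (r for r in rows if r[2] >= warn_ratio), key=lambda r: r[2], reverse=True
    )


if __name__ == "__main__":
    try:
        with open("/proc/sys/fs/inotify/max_user_watches") as f:
            limit = int(f.read())
    except (OSError, ValueError):
        limit = 0  # limit unknown, so nothing is reported
    for uid, used, ratio in usage_report(watches_by_uid(), limit):
        print(f"WARNING uid={uid} inotify watches {used}/{limit} ({ratio:.0%})")
```

In practice the same numbers would more likely be exported as a metric and alerted on by the existing monitoring system than printed by a standalone script.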