Problem

After restarting kubelet, one of the Kubernetes nodes started behaving unpredictably.

The symptoms were unusual and did not point to an obvious root cause.

Symptoms

The following issues were observed:

  • some kubectl operations stopped working;
  • system pods started failing unexpectedly;
  • kubelet became unstable;
  • the logs did not reveal an obvious root cause.

Investigation

The most likely causes were checked first:

  • network issues;
  • disk pressure;
  • CPU exhaustion;
  • memory exhaustion;
  • overall node health.

None of these checks turned up a problem.

Root cause

The node had exhausted its inotify watch limit.

inotify is a Linux kernel subsystem that lets processes watch files and directories for changes. The kernel caps how many watches and inotify instances each user may hold.
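The configured limits can be read directly from procfs. The snippet below is a minimal sketch, assuming a Linux node with the standard /proc/sys/fs/inotify directory; the function name read_inotify_limits is illustrative and not part of any tool used during the incident.

```python
from pathlib import Path

# Standard procfs location of the inotify sysctls on Linux.
INOTIFY_SYSCTL_DIR = Path("/proc/sys/fs/inotify")


def read_inotify_limits(base: Path = INOTIFY_SYSCTL_DIR) -> dict:
    """Return the inotify limits found under `base` as {name: value}."""
    limits = {}
    for name in ("max_user_watches", "max_user_instances", "max_queued_events"):
        path = base / name
        if path.is_file():  # skip anything the running kernel does not expose
            limits[name] = int(path.read_text().strip())
    return limits


if __name__ == "__main__":
    for name, value in read_inotify_limits().items():
        print(f"fs.inotify.{name} = {value}")
```

The same values are available through sysctl under the fs.inotify prefix. Reading the files directly keeps the check free of external commands.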

Many infrastructure components depend on it:

  • Kubernetes components;
  • container runtimes;
  • monitoring systems;
  • log collectors;
  • various agents.

When the limit is reached, unexpected symptoms can appear:

  • applications stop receiving file change events;
  • individual components become unstable;
  • failures appear unrelated to the actual cause.
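One reason the failures can look unrelated is how the kernel reports them. Per the inotify man pages, an exhausted watch limit makes inotify_add_watch fail with ENOSPC, whose generic message refers to disk space, and an exhausted instance limit makes inotify_init fail with EMFILE. The sketch below maps those errors back to the limit worth checking. The helper name suspect_limit is hypothetical, and how a given component words these errors in its logs is not covered here.

```python
import errno
import os

# (syscall, errno) pairs that the inotify man pages associate with per-user limits.
# EMFILE can also mean the process hit its own file descriptor limit,
# so a match is a hint, not proof.
LIMIT_ERRORS = {
    ("inotify_add_watch", errno.ENOSPC): "fs.inotify.max_user_watches",
    ("inotify_init", errno.EMFILE): "fs.inotify.max_user_instances",
    ("inotify_init1", errno.EMFILE): "fs.inotify.max_user_instances",
}


def suspect_limit(syscall: str, err: int):
    """Return the sysctl that may be exhausted, or None if the pair is unrelated."""
    return LIMIT_ERRORS.get((syscall, err))


if __name__ == "__main__":
    for (syscall, err), sysctl in LIMIT_ERRORS.items():
        # os.strerror returns the generic text that programs usually log.
        print(f"{syscall}: {errno.errorcode[err]} ({os.strerror(err)}) -> check {sysctl}")
```

An error about device space or open files on a node with healthy disks and file descriptor headroom is therefore a hint to look at the inotify limits.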

These issues are difficult to diagnose because they can remain invisible for a long time.

Most teams actively monitor:

  • CPU;
  • memory;
  • disks;
  • network.

inotify watch consumption rarely makes it onto that list.
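Consumption can be measured from procfs. Each watch held on an inotify file descriptor appears as one line starting with "inotify" in /proc/<pid>/fdinfo/<fd>. The following is a rough sketch of a per-process count, assuming it runs with enough privileges to read other processes' fdinfo; the function name count_inotify_watches is illustrative.

```python
from pathlib import Path


def count_inotify_watches(proc_root: Path = Path("/proc")) -> dict:
    """Map PID -> number of inotify watches held by that process."""
    counts = {}
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are processes
        total = 0
        try:
            for fdinfo in (entry / "fdinfo").iterdir():
                text = fdinfo.read_text(errors="replace")
                # One "inotify wd:..." line per watch on this descriptor.
                total += sum(line.startswith("inotify ") for line in text.splitlines())
        except OSError:
            continue  # process exited or is not readable; skip it
        if total:
            counts[int(entry.name)] = total
    return counts


if __name__ == "__main__":
    top = sorted(count_inotify_watches().items(), key=lambda kv: kv[1], reverse=True)
    for pid, watches in top[:10]:
        print(f"pid {pid}: {watches} watches")
```

Because the watch limit applies per user, the per-process counts need to be summed by process owner before they are compared with the limit.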

Takeaways

Not every Kubernetes issue originates inside Kubernetes itself.

Sometimes the root cause is hidden several layers lower in the stack.

A forgotten Linux system limit can cause more operational pain than a lack of compute resources.

As infrastructure grows, Linux system limits should be reviewed periodically alongside cluster resources.

After this incident, inotify watch consumption was added to the list of metrics that are monitored proactively.
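A proactive check can be as simple as comparing per-user watch counts with the configured limit. The sketch below is illustrative only: the function name users_near_limit, the 80% threshold, and the sample numbers are assumptions, not the alerting rule that was actually deployed.

```python
def users_near_limit(watches_by_uid: dict, max_user_watches: int,
                     threshold: float = 0.8) -> dict:
    """Return {uid: utilization} for users at or above `threshold` of the limit."""
    if max_user_watches <= 0:
        raise ValueError("max_user_watches must be positive")
    return {
        uid: used / max_user_watches
        for uid, used in watches_by_uid.items()
        if used / max_user_watches >= threshold
    }


if __name__ == "__main__":
    sample = {0: 7000, 1000: 1200}  # made-up per-user watch counts
    for uid, ratio in users_near_limit(sample, max_user_watches=8192).items():
        print(f"uid {uid}: {ratio:.0%} of the watch limit in use")
```

Alerting on a ratio instead of an absolute count keeps the rule valid if the limit is changed later.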