How I Accidentally DDoSed My Own Kubernetes Cluster
Introduction#
For the past few weeks, my production Kubernetes cluster has been fighting a ghost. Every few days (or sometimes hours), the entire Control Plane would commit suicide. Critical components like the scheduler and controller manager would crash simultaneously, claiming they had "lost leadership."
Usually, this happens when the network fails. But my network was fine. The nodes were up. The CPUs were idling.
After digging through weeks of logs and graphs, I discovered the culprit wasn't a bug or a hardware failure; it was a "Death Spiral" of my own making. My storage (ZFS on SATA SSDs) was being suffocated by a few noisy applications. When my database wanted to write data, it saturated the SATA link so badly that etcd (the cluster's brain) couldn't save its state in time. When etcd timed out, the cluster panicked, crashed, and wrote massive error logs to the disk... which saturated the disk even more, ensuring the cluster could never recover.
Here is the detective story of how I isolated the IOPS vampire and my plan to fix it with physics rather than tuning.
1. The Symptoms#
- The Crash: Random, cluster-wide restarts of `kube-controller-manager`, `cilium-operator`, and `democratic-csi`.
- The Logs: The "Smoking Gun" error in `etcd`: `Health check failed: context deadline exceeded` (latency > 15s).
- The Correlation: Crises occurred during high-write activities (large file uploads, database checkpoints).
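For anyone chasing the same ghost, etcd exports its disk sync latency as Prometheus histograms, which is the most direct way to confirm this failure mode. A minimal sketch, assuming a Prometheus instance already scrapes the etcd members (the URL below is a placeholder):

```bash
#!/usr/bin/env bash
# Spot-check etcd's p99 WAL fsync latency over the last 5 minutes.
# PROM_URL is a placeholder; point it at whatever Prometheus scrapes your etcd members.
PROM_URL="http://prometheus.monitoring.svc:9090"
QUERY='histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'

# etcd's own guidance is to keep this well under ~10ms; during the crashes above,
# the health checks were timing out after 15 seconds.
curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq '.data.result[] | {instance: .metric.instance, p99_seconds: .value[1]}'
```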
2. The Investigation (RCA)#
I traced the issue to Storage I/O Saturation on the Proxmox host's rpool (SATA SSDs).
- The Bottleneck: My Intel S3500 SSDs top out at around 100 MB/s (mixed write workload).
- The "Noisy Neighbor": Using
iotopon the hypervisor, I caught VM 311 (a worker node) spiking to 17 MB/s writes. - The Victim: At the exact same second, the Control Plane VM tried to write 7 MB/s of logs.
- The Result: 17 + 7 = 24 MB/s of random I/O. The drives choked, latency spiked to 500ms+, and
etcd(which demands <10ms) failed.
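For reference, this is roughly the kind of tooling involved in catching the per-VM writers on the hypervisor. Each KVM guest shows up in `iotop` as a single QEMU process with its VMID on the command line, so attribution is straightforward; the `zpool iostat` cross-check is my addition here:

```bash
# Show only processes actually doing I/O, accumulate totals, refresh every 5 seconds.
# Proxmox starts each guest as a "kvm" process with "-id <VMID>" in its arguments.
iotop -oPa -d 5

# Cross-check whether the ZFS pool itself is saturated (bandwidth and ops per vdev).
zpool iostat -v rpool 5
```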
Identified Heavy Hitters#
- VM 311 (`talos-bay-a2s`): Hosting multiple databases (`pg-vchord`, `home-assistant`, `trilium`).
- Ingress Nginx: Spiking to 8 MB/s (potentially due to buffering large uploads).
- Frigate: Constant 4 MB/s "hum" (verified as memory usage, but it still contributes to system load).
3. Resolution Plan#
Phase 1: The "Circuit Breaker" (Done)#
I cannot risk a single worker node taking down the Control Plane while I wait for parts. I have applied a hard I/O limit at the Hypervisor level to the noisy worker node.
- Action: Throttled VM 311 to 30 MB/s sustained writes (with short bursts up to 50 MB/s allowed). This leaves roughly 70% of the SATA bus's ~100 MB/s budget available for the Control Plane.
Command (Proxmox):
qm set 311 --scsi2 local-zfs:vm-311-disk-0,mbps_wr=30,mbps_wr_max=50,iothread=1
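A quick sanity check that the cap actually took effect (exact output formatting varies between Proxmox versions):

```bash
# The scsi2 line should now include the mbps_wr / mbps_wr_max caps.
qm config 311 | grep scsi2

# Watch the pool: the worker's sustained write rate should now flatten out below ~30 MB/s.
zpool iostat rpool 5
```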
Phase 2: The Hardware Fix (Immediate)#
I am bypassing the SATA bottleneck entirely for the cluster's brain. I will move the Control Plane VM disks to dedicated Intel Optane (PCIe) drives.
- Why: Optane write latency is consistent (~10 µs) and virtually immune to the saturation issues affecting NAND SSDs. Even if the SATA pool is 100% busy, `etcd` will have its own dedicated lane.
- Strategy: Move 2 out of 3 Control Plane nodes to Optane and keep the 3rd on SATA for redundancy. With a 3-node etcd cluster, quorum only needs 2 members, so the slower SATA node can lag without blocking writes (see the command sketch after this list).
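The migration itself should be a single Proxmox command per VM. A rough sketch, where the VMID (`201`) and the target storage ID (`optane-zfs`) are placeholders for the actual control-plane VM and the new Optane-backed pool:

```bash
# Move the control-plane VM's system disk onto the Optane-backed storage and
# delete the old copy on the SATA pool once the transfer completes.
# (Newer Proxmox releases also accept the spelling "qm disk move".)
qm move_disk 201 scsi0 optane-zfs --delete 1
```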
Analysis: Etcd Consensus in a Hybrid Optane/SATA Cluster
Phase 3: Monitor & Verify#
After the Optane upgrade, I will monitor the cluster for stability. The theory is that even if the SATA drives get crushed by the database, etcd will survive because it lives on Optane.
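Beyond simply waiting for crashes that never come, the cleanest verification is to compare write latency on the two pools side by side while the databases are busy. A small sketch from the hypervisor, with `optane-pool` standing in for whatever the new pool ends up being called:

```bash
# -l adds latency columns (total / disk / sync-queue / async-queue wait) per pool.
# If the theory holds, rpool's write latency will still spike under database load
# while the Optane pool stays flat in the microsecond-to-low-millisecond range.
zpool iostat -l rpool optane-pool 5
```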
Phase 4: Conditional Diagnostics (Deferred)#
If instability persists, I will investigate the Ingress Nginx spikes.
- Hypothesis: Nginx buffers large uploads to disk, causing massive I/O.
- Action: If proven, I will tune `client-body-buffer-size` to keep uploads in RAM, or disable buffering for specific services (a sketch of both knobs is below).
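If the hypothesis is confirmed, ingress-nginx exposes both knobs as per-Ingress annotations, so the fix can be scoped to the upload-heavy services only. A hedged sketch, with the Ingress name and buffer size as placeholders to be tuned against real upload patterns:

```bash
# Keep request bodies in memory (larger in-RAM buffer) and/or skip request
# buffering entirely for one upload-heavy Ingress, so big uploads stop
# being spooled to the SATA-backed node disk.
kubectl annotate ingress my-upload-service \
  nginx.ingress.kubernetes.io/client-body-buffer-size=64m \
  nginx.ingress.kubernetes.io/proxy-request-buffering=off \
  --overwrite
```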