Disaster Recovery: Talos etcd Corruption & Kyverno CNI Deadlock

Incident Summary: Talos Control Plane & CNI Deadlock

Overview

During a physical disk migration for the TrueNAS VM (running on the Proxmox bare-metal host), severe I/O latency spikes caused a cascading failure across the Kubernetes production cluster. The incident unfolded in two distinct, critical phases: a complete loss of the Talos etcd control plane, followed by a severe CNI routing deadlock caused by admission webhooks.

Recovery required halting etcd replication on every control plane node at once, then deleting the Kyverno admission webhook configurations so the cluster network could rebuild.

Phase 1: The etcd Brain Death

The storage I/O drop caused the 3-node Talos control plane to lose consensus, leaving etcd in a hard "etcdserver: corrupt cluster" state. Standard snapshots and graceful node reboots failed because the nodes continuously re-synced the corrupted state to one another.

Recovery required extracting the raw db files, identifying the highest revision, simultaneously wiping the EPHEMERAL partitions across the control plane to freeze replication, and forcing a bootstrap. Because the raw database file was extracted "dirty," an etcd corruption alarm had to be manually disarmed before the API would accept traffic.
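The "identify the highest revision" step can be sketched as follows. This is a minimal Python illustration, assuming the revision of each extracted db copy has already been read out by some other means. The function name, node names, and revision numbers are hypothetical and are not values from the incident.

```python
from typing import Mapping


def pick_recovery_source(revisions: Mapping[str, int]) -> str:
    """Return the node whose extracted etcd db copy has the highest revision.

    `revisions` maps a node name to the revision read from its db copy.
    Ties are broken by node name so the choice is deterministic.
    """
    if not revisions:
        raise ValueError("no db copies to choose from")
    # Highest revision wins; on a tie, the lexicographically smallest name wins.
    return min(revisions, key=lambda node: (-revisions[node], node))


# Hypothetical values for a 3-node control plane (not from the incident).
example = {"cp-1": 48210, "cp-2": 48233, "cp-3": 48233}
chosen = pick_recovery_source(example)
print(chosen)  # prints cp-2
```

Choosing the copy with the highest revision keeps the most recent writes that any node had persisted. The deterministic tie-break only matters when two copies report the same revision.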

Full technical breakdown and recovery steps: Post-Mortem: Recovering a Corrupted Talos etcd Cluster

Phase 2: The Admission Webhook CNI Deadlock

Once the control plane was resurrected, the cluster network remained completely offline. Cilium (operating in strict kube-proxy replacement mode) had empty eBPF service maps for critical ClusterIPs such as the Kubernetes API and CoreDNS. This caused infrastructure pods like talos-ccm and hubble-relay to crash with "operation not permitted" errors.

The root cause was a Catch-22 involving Kyverno:

  1. The API server needed to create EndpointSlices to tell Cilium where services were located.
  2. The API server was forced to validate these objects against the Kyverno validating webhook (failurePolicy: Fail).
  3. Because the CNI maps were empty, the API server couldn't reach the Kyverno pod, causing the webhook to fail and the EndpointSlice creation to be rejected.
  4. Without the EndpointSlices, Cilium's maps remained permanently empty.

The deadlock was broken by surgically deleting the Kyverno webhook configurations, which allowed the API server to bypass admission control, commit the network state, and trigger Cilium to rebuild the host eBPF routing maps.
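The loop described above, and the effect of removing the webhook, can be modelled as a small dependency graph. This is a toy Python sketch for reasoning about the failure, not a tool used during the recovery. The component labels are illustrative names for the four steps and do not correspond to real Kubernetes object names.

```python
from typing import Dict, List, Optional, Set

# Each key depends on every item in its list (illustrative labels only).
DEPENDS_ON: Dict[str, List[str]] = {
    # Cilium can only fill its service maps from EndpointSlices.
    "cilium_service_maps": ["endpointslices"],
    # EndpointSlice writes must pass the webhook (failurePolicy: Fail).
    "endpointslices": ["kyverno_webhook_call"],
    # The API server can only reach the Kyverno pod if the maps are populated.
    "kyverno_webhook_call": ["cilium_service_maps"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Depth-first search; return one dependency cycle, or None if acyclic."""
    visiting: Set[str] = set()
    done: Set[str] = set()
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in visiting:
            # Back edge: the cycle is the path from the first visit to here.
            return path[path.index(node):] + [node]
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# With the webhook in place, the three components wait on each other.
deadlock = find_cycle(DEPENDS_ON)

# Deleting the webhook configuration removes that dependency entirely.
without_webhook = {
    node: [dep for dep in deps if dep != "kyverno_webhook_call"]
    for node, deps in DEPENDS_ON.items()
    if node != "kyverno_webhook_call"
}
resolved = find_cycle(without_webhook)

print(deadlock)
print(resolved)
```

In this model, removing the single webhook edge is enough to make the graph acyclic, which mirrors why deleting the webhook configurations let EndpointSlice creation and the Cilium map rebuild proceed.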

Full technical breakdown and recovery steps: Post-Mortem: The Admission Webhook CNI Deadlock