Post-Mortem: The Admission Webhook CNI Deadlock

The Incident

Following a successful bare-metal restore of a Talos etcd cluster, the control plane was healthy, but the entire pod network remained offline. Cilium (running in strict kube-proxy replacement mode) was completely failing to route traffic.

Critical infrastructure pods like talos-ccm, coredns, and hubble-relay were constantly crashing. Inspecting their logs revealed that the host was rejecting their connections to the Kubernetes API and CoreDNS:

dial tcp 10.96.0.1:443: connect: operation not permitted

Diagnosis: The Catch-22

In a kube-proxy-less Cilium environment, the eBPF maps dictate all routing. Checking the Cilium service maps revealed that the backends for 10.96.0.1 (API Server) and 10.96.0.10 (CoreDNS) were completely empty:

kubectl exec -it -n kube-system daemonset/cilium -- cilium service list | grep -E '10.96.0.1:|10.96.0.10:'

Kubernetes relies on its endpoint controllers to generate EndpointSlices that map these ClusterIPs to their actual backend nodes and pods. However, checking the cluster revealed that the EndpointSlices for core services were missing or stuck as <unset>.
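You can surface the stuck state directly rather than eyeballing the table. A small filter does the job (a sketch: the column layout is assumed from current kubectl table output, NAMESPACE NAME ADDRESSTYPE PORTS ENDPOINTS AGE, and is not a stable interface):

```shell
# find_unset_slices: filter `kubectl get endpointslices -A --no-headers`
# output (read on stdin) down to slices whose ENDPOINTS column is still
# <unset>, printing them as namespace/name.
find_unset_slices() {
  awk '$(NF-1) == "<unset>" { print $1 "/" $2 }'
}

# Against a live cluster:
#   kubectl get endpointslices -A --no-headers | find_unset_slices
```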

The root cause was found in the cilium-operator logs:

failed calling webhook "validate.kyverno.svc-fail": failed to call webhook: Post "https://kyverno-svc.kyverno.svc:443/validate/fail?timeout=10s": dial tcp 10.109.202.249:443: connect: operation not permitted

The Deadlock Chain:

  1. The API server attempts to create/update EndpointSlices to tell the cluster where the API and CoreDNS live.
  2. Before committing to etcd, the API server must pass the object through all registered Admission Webhooks (in this case, Kyverno).
  3. The API server attempts to reach Kyverno via its internal ClusterIP (10.109.202.249).
  4. Because Cilium's eBPF maps are empty, the CNI aggressively drops the packet (operation not permitted).
  5. The connection to the webhook times out. Because the webhook's failurePolicy is set to Fail, the API server rejects the EndpointSlice creation.
  6. Because the EndpointSlice is never created, Cilium's maps remain empty forever.

The cluster was holding itself hostage.
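The failurePolicy at the heart of step 5 lives on the webhook registration object itself. For reference, this is roughly the shape of the offending entry (admissionregistration.k8s.io/v1, abridged; the values below are reconstructed from the error log, not dumped from the cluster):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kyverno-resource-validating-webhook-cfg   # illustrative name
webhooks:
  - name: validate.kyverno.svc-fail
    failurePolicy: Fail        # on timeout, the API server rejects the write
    timeoutSeconds: 10
    clientConfig:
      service:
        name: kyverno-svc
        namespace: kyverno
        port: 443
        path: /validate/fail
```

A failurePolicy of Ignore on cluster-critical resources, or a namespaceSelector that exempts kube-system, would have let the EndpointSlices through and avoided this class of deadlock entirely.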

The Resolution

To break the deadlock, the offending webhooks must be surgically severed from the API server. By deleting the webhook configurations, the API server is free to bypass the admission controllers, commit the EndpointSlices to etcd, and allow the CNI to finally program the network fabric.

Step 1: Nuke the Webhooks

Identify and forcefully delete the ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects belonging to the blocker (Kyverno, OPA Gatekeeper, etc.):

#!/usr/bin/env bash
set -euo pipefail

# awk filters (and exits 0 even with no match, keeping pipefail happy);
# xargs -r skips the delete entirely when nothing matched.
kubectl get validatingwebhookconfigurations --no-headers | awk '/kyverno/ {print $1}' | xargs -r kubectl delete validatingwebhookconfiguration
kubectl get mutatingwebhookconfigurations --no-headers | awk '/kyverno/ {print $1}' | xargs -r kubectl delete mutatingwebhookconfiguration

(Note: These configurations will be automatically recreated by the webhook pods or your GitOps controller once the cluster stabilizes.)

Step 2: Verify Endpoint Population

The instant the webhooks are removed, the endpoint controllers' retry loops will succeed. Verify the slices have populated with real IPs:

kubectl get endpointslices -n default
kubectl get endpointslices -n kube-system -l kubernetes.io/service-name=kube-dns

Step 3: Verify CNI Recovery

Once the endpoints populate, the Cilium operator will ingest them and distribute them to the agents, which program the eBPF maps on each host. Check the service map to confirm the backends show as (active):

kubectl exec -it -n kube-system daemonset/cilium -- cilium service list | grep '10.96.0.1:'

With the API server reachable again, the previously crash-looping infrastructure pods will recover on their next restart and the network will stabilize.
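If you want to script the final check rather than read the table by hand, a tiny filter over the cilium service list output works (the "(active)" marker is assumed from Cilium 1.x output formatting; treat this as a sketch, not a stable interface):

```shell
# count_active_backends: given `cilium service list` output on stdin,
# count backend entries marked (active). A count of zero means the eBPF
# service map still has no live backends.
count_active_backends() {
  # grep -c exits non-zero on no match; || true keeps callers under
  # `set -e` from aborting on an empty map.
  grep -c '(active)' || true
}

# Against a live cluster:
#   kubectl exec -n kube-system daemonset/cilium -- cilium service list \
#     | count_active_backends
```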