Post-Mortem: Recovering a Corrupted Talos etcd Cluster

The Incident#

During a recent physical disk migration for a TrueNAS VM running on Proxmox, the resulting I/O latency spikes and storage disruption caused a prolonged, ungraceful interruption of etcd's storage. This led to a complete loss of consensus across the 3-node Talos Linux control plane.

The Kubernetes API became completely unreachable, and the Virtual IP (VIP) dropped. Checking the control plane nodes revealed a relentless stream of errors:

[talos] controller failed {"component": "controller-runtime", "controller": "kubeaccess.CRDController", "error": "error from crd controller: error creating etcd session: etcdserver: corrupt cluster"}
[talos] campaign failure {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "error": "failed to create concurrency session: etcdserver: corrupt cluster"}

The `etcdserver: corrupt cluster` error is a hard stop. It means the database has lost integrity and standard snapshot commands will fail. The cluster must be rebuilt from the raw database files.

Diagnosis & Expectations#

When etcd becomes corrupted in a high-availability setup, you cannot rely on graceful recovery. If you wipe and reboot nodes sequentially, the surviving nodes will simply re-sync the corrupted state back to the freshly wiped node as soon as it rejoins (a frustrating game of Whack-a-Mole).

To recover, the cluster replication loop must be completely broken, the state wiped concurrently, and the cluster forcefully bootstrapped from the highest-revision raw database file. Furthermore, because extracting a raw database file from a running (but corrupted) cluster leaves a "dirty" flag on the file, etcd will immediately trigger a corruption alarm upon bootstrap, which must be manually disarmed.

The Recovery Procedure#

Step 1: Extract the Raw Database Files#

Since standard snapshots are broken, pull the raw db files directly from all three control plane nodes to your local machine:

#!/usr/bin/env bash
set -euo pipefail

# Pull the raw bbolt database file from each control plane node.
mkdir -p db.192.168.53.10 db.192.168.53.11 db.192.168.53.12

talosctl -n 192.168.53.10 cp /var/lib/etcd/member/snap/db ./db.192.168.53.10/db
talosctl -n 192.168.53.11 cp /var/lib/etcd/member/snap/db ./db.192.168.53.11/db
talosctl -n 192.168.53.12 cp /var/lib/etcd/member/snap/db ./db.192.168.53.12/db

Step 2: Identify the Most Recent State#

You must find the database with the highest REVISION number to minimize data loss. Using etcdutl (easily invoked via Nix), inspect the status of each file:

nix shell nixpkgs#etcd -c etcdutl snapshot status ./db.192.168.53.10/db -w table

Output example:

+----------+-----------+------------+------------+---------+
|   HASH   | REVISION  | TOTAL KEYS | TOTAL SIZE | VERSION |
+----------+-----------+------------+------------+---------+
| fa4075aa | 396477919 |      12122 |     263 MB |   3.6.0 |
+----------+-----------+------------+------------+---------+

Compare the outputs. Select the file with the highest revision number to act as the seed for the new cluster.
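If you'd rather script the comparison than eyeball three tables, a small pipeline can pick the winner. The revision numbers below are illustrative placeholders, not real output:

```shell
# Pick the db file with the highest revision from "path revision" pairs.
# Paths match Step 1; the revision values here are made up for the example.
winner=$(printf '%s\n' \
  'db.192.168.53.10/db 396477919' \
  'db.192.168.53.11/db 396477801' \
  'db.192.168.53.12/db 396476990' \
  | sort -k2,2n | tail -n1 | cut -d' ' -f1)
echo "$winner"
```

In a real run, substitute the revision reported by `etcdutl snapshot status` for each file; its `-w json` output mode makes the numbers easy to extract programmatically.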

Step 3: Stop etcd and Concurrently Wipe State#

To prevent the corrupted nodes from re-infecting each other, first stop the etcd service on every node to freeze the cluster:

talosctl -n 192.168.53.10 service etcd stop
talosctl -n 192.168.53.11 service etcd stop
talosctl -n 192.168.53.12 service etcd stop

(Note: Talos may throw an error when asked to stop etcd via the API. That is acceptable, because the simultaneous wipe below breaks the replication loop either way.)

Wipe the EPHEMERAL partition on all three nodes concurrently. This destroys the corrupted data directory while preserving the node configuration (STATE).

#!/usr/bin/env bash
set -euo pipefail

talosctl -n 192.168.53.10 reset --system-labels-to-wipe EPHEMERAL --graceful=false --reboot &
talosctl -n 192.168.53.11 reset --system-labels-to-wipe EPHEMERAL --graceful=false --reboot &
talosctl -n 192.168.53.12 reset --system-labels-to-wipe EPHEMERAL --graceful=false --reboot &
wait

Step 4: Verify "Preparing" State#

Wait for the nodes to reboot. Because their data directories are now empty, they will boot into a "learner" state and throw an `rpc not supported for learner` error. This is expected.

Ensure your primary target node (e.g., .10) is ready:

talosctl -n 192.168.53.10 service etcd

Look for STATE: Preparing or STATE: Running with learner errors in the events.
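To script this readiness check, you can extract the STATE field from the service summary. The sample text below is an illustrative, trimmed stand-in for real `talosctl service etcd` output (exact columns vary by Talos version):

```shell
# Illustrative service summary (a stand-in for live talosctl output).
sample='NODE     192.168.53.10
ID       etcd
STATE    Preparing
HEALTH   ?'

# Pull out the STATE value; "Preparing" (or "Running") means proceed.
state=$(printf '%s\n' "$sample" | awk '$1 == "STATE" { print $2 }')
echo "$state"
```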

Step 5: Bootstrap the Cluster#

Inject the winning database back into the first node. Because it is a raw file and not a hashed snapshot, the `--recover-skip-hash-check` flag is mandatory.

talosctl -n 192.168.53.10 bootstrap --recover-from=./db.192.168.53.10/db --recover-skip-hash-check

Step 6: Disarm the etcd Corruption Alarm#

Because the raw file was copied while dirty, etcd will successfully boot but immediately lock down the API to protect itself, raising an alarm. The Talos controllers will still complain about a corrupt cluster until this is cleared.

Check for alarms:

talosctl -n 192.168.53.10 etcd alarm list

Disarm the alarm to unlock the API:

talosctl -n 192.168.53.10 etcd alarm disarm

Step 7: Rolling Reboot and Verification#

With the alarm cleared, the restored node is fully operational. To ensure the remaining two nodes properly sync the clean state and drop any cached garbage, perform a rolling reboot.

talosctl -n 192.168.53.11 reboot
# Wait for it to return and show HEALTH: OK via `talosctl -n 192.168.53.11 service etcd`
talosctl -n 192.168.53.12 reboot
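The "wait for it to return" step is easy to script with a small polling helper. `wait_for` below is my own sketch, not a talosctl feature, and the grep pattern is an assumption about the service output format:

```shell
# Retry a command until it succeeds, up to N attempts (sketch helper).
wait_for() {
  tries=$1; shift
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0          # success: stop polling
    tries=$((tries - 1))
    sleep "${WAIT_INTERVAL:-5}"
  done
  return 1                    # gave up after N attempts
}

# Intended usage between reboots (not run here):
# wait_for 60 sh -c 'talosctl -n 192.168.53.11 service etcd | grep -q "HEALTH.*OK"'
```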

Finally, verify the VIP has bound correctly by checking the active addresses on your control plane nodes. The VIP will appear in the address list of whichever node won the leader election:

talosctl -n 192.168.53.10 get addresses | grep 192.168.53.2

Once you confirm the VIP is routing, test your standard cluster access:

kubectl get nodes -o wide

At this point, the control plane is fully restored. Any workload pods relying on storage that got stuck during the I/O freeze can now be safely restarted to re-attach their volumes.