Operations & Resilience

Tuning gets you speed; operations keep you alive. This section is the runbook — what breaks, what it looks like, and exactly what to do about it.

In this section

Cluster Standby (Premium) — HA without the latency penalty: fast snapshots, in-place identity switch, cross-AZ hot standby.
Disaster recovery — restoring from snapshots, log replay, and the 4th-node async backup.
Backup strategies — archiving to durable object storage and backing up the backup.
Failure-mode runbook — node, network, disk, snapshot, determinism, client, human-error, resource-exhaustion, and multi-AZ scenarios.

Scenario	Data loss	Availability impact	Action
Single follower crash	None	None (quorum held)	Restart; auto-catchup
Single leader crash	Uncommitted msgs	Brief (election)	Wait for auto-election; restart node
Minority crash	None	None	Restart ASAP to restore fault tolerance
Majority / quorum loss	Uncommitted msgs	Total	Manual: restore quorum, then recover from snapshot + log
All nodes crash	Possible	Total	Cold-start; replay snapshot + log; else external backup
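
The rows above depend on whether the surviving nodes still form a quorum. The following is a minimal sketch of that arithmetic, assuming a majority quorum of floor(n/2) + 1 voting nodes. The names quorum_size, classify, and Outcome are hypothetical and are not part of any product API. The table does not list a multi-node minority crash that includes the leader, so the sketch maps that case to the "Minority crash" row; that mapping is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    scenario: str
    data_loss: str
    availability_impact: str
    action: str


def quorum_size(cluster_size: int) -> int:
    """Smallest majority of the voting nodes (assumed quorum rule)."""
    return cluster_size // 2 + 1


def classify(cluster_size: int, failed: int, leader_failed: bool = False) -> Outcome:
    """Map a crash to a row of the scenario table (illustrative only)."""
    if not 0 < failed <= cluster_size:
        raise ValueError("failed must be between 1 and cluster_size")

    surviving = cluster_size - failed
    if surviving == 0:
        return Outcome("All nodes crash", "Possible", "Total",
                       "Cold-start; replay snapshot + log; else external backup")
    if surviving < quorum_size(cluster_size):
        return Outcome("Majority / quorum loss", "Uncommitted msgs", "Total",
                       "Manual: restore quorum, then snapshot + log")
    # From here on, the survivors still hold quorum.
    if failed == 1 and leader_failed:
        return Outcome("Single leader crash", "Uncommitted msgs", "Brief (election)",
                       "Wait for auto-election; restart node")
    if failed == 1:
        return Outcome("Single follower crash", "None", "None (quorum held)",
                       "Restart; auto-catchup")
    # Assumption: a multi-node minority crash uses this row even if the
    # leader is among the failed nodes.
    return Outcome("Minority crash", "None", "None",
                   "Restart ASAP to restore fault tolerance")


if __name__ == "__main__":
    for size, down, leader in [(3, 1, False), (3, 1, True), (5, 2, False), (3, 2, False), (3, 3, False)]:
        row = classify(size, down, leader)
        print(f"{size} nodes, {down} down: {row.scenario} -> {row.action}")
```

Under this assumption, a 3-node cluster tolerates one crash and a 5-node cluster tolerates two. Losing one more node than that moves the cluster from the "Minority crash" row to "Majority / quorum loss".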