Skip to content

Operations & Resilience

Tuning gets you speed; operations keep you alive. This section is the runbook — what breaks, what it looks like, and exactly what to do about it.

  • Cluster Standby (Premium) — HA without the latency penalty: fast snapshots, in-place identity switch, cross-AZ hot standby.
  • Disaster recovery — restoring from snapshots, log replay, and the 4th-node async backup.
  • Backup strategies — archiving to durable object storage and backing up the backup.
  • Failure-mode runbook — node, network, disk, snapshot, determinism, client, human-error, resource-exhaustion, and multi-AZ scenarios.
ScenarioData lossAvailabilityAction
Single follower crashNoneNone (quorum held)Restart; auto-catchup
Single leader crashUncommitted msgsBrief (election)Wait for auto-election; restart node
Minority crashNoneNoneRestart ASAP to restore fault tolerance
Majority / quorum lossUncommitted msgsTotalManual: restore quorum, then snapshot + log
All nodes crashPossibleTotalCold-start; replay snapshot + log; else external backup