Operations & Resilience
Tuning gets you speed; operations keep you alive. This section is the runbook — what breaks, what it looks like, and exactly what to do about it.
In this section
Section titled “In this section”- Cluster Standby (Premium) — HA without the latency penalty: fast snapshots, in-place identity switch, cross-AZ hot standby.
- Disaster recovery — restoring from snapshots, log replay, and the 4th-node async backup.
- Backup strategies — archiving to durable object storage and backing up the backup.
- Failure-mode runbook — node, network, disk, snapshot, determinism, client, human-error, resource-exhaustion, and multi-AZ scenarios.
Failure-mode quick reference
Section titled “Failure-mode quick reference”| Scenario | Data loss | Availability | Action |
|---|---|---|---|
| Single follower crash | None | None (quorum held) | Restart; auto-catchup |
| Single leader crash | Uncommitted msgs | Brief (election) | Wait for auto-election; restart node |
| Minority crash | None | None | Restart ASAP to restore fault tolerance |
| Majority / quorum loss | Uncommitted msgs | Total | Manual: restore quorum, then snapshot + log |
| All nodes crash | Possible | Total | Cold-start; replay snapshot + log; else external backup |