Backups and Disaster Recovery
Disaster recovery is a spectrum, not a single plan. The blast radius dictates the remedy: lose one follower and the cluster shrugs it off automatically; lose every node and you cold-start from an external backup. This page maps each failure mode to its remediation, then compares the two backup strategies — OSS Cluster Backup versus Premium Cluster Standby — so you can pick the right one for your RTO and RPO.
For how recordings, snapshots, and the recording log actually work under the hood, see The Aeron Files. Here we focus on operational recovery.
DR scenarios, ranked by severity
Four scenarios cover the realistic failure space, in order of severity:
| Scenario | Severity |
|---|---|
| Follower Node loss | Low — cluster continues, auto-recovery |
| Leader Node loss | Medium — brief election, auto-recovery |
| Most Recent Snapshot loss | Medium — falls back to older snapshot + longer replay |
| All Node Loss (only DB survives) | Critical — full rebuild required |
Recovery approach by scenario
Match the remedy to the severity. The further down this list you go, the more you rely on tooling and external backups instead of Aeron’s built-in mechanisms.
- Follower loss → Restart the node. It auto-catches up from the leader via snapshot + log replay.
- Leader loss → Auto-election picks a new leader. The old leader restarts as a follower and catches up.
- Snapshot loss → Use ClusterTool to inspect the recording log, invalidate the corrupt snapshot, and fall back to an older snapshot with longer log replay.
- All node loss → Cold-start from an external backup (snapshots + log recordings from Archive or Cluster Backup/Standby).
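To see why losing the most recent snapshot means a longer replay, here is a minimal toy model of the fallback. It is not Aeron's recording log API: the `Snapshot` class, its fields, and the positions used are all hypothetical, chosen only to illustrate the trade-off.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of snapshot fallback. This is NOT Aeron's recording log API;
// the Snapshot class and its field names are hypothetical.
final class SnapshotFallback {

    /** A snapshot taken at a given log position; 'valid' turns false once invalidated. */
    static final class Snapshot {
        final long logPosition;
        boolean valid = true;

        Snapshot(long logPosition) {
            this.logPosition = logPosition;
        }
    }

    /** Returns the newest valid snapshot, or null if none is usable. */
    static Snapshot latestValid(List<Snapshot> snapshots) {
        Snapshot best = null;
        for (Snapshot s : snapshots) {
            if (s.valid && (best == null || s.logPosition > best.logPosition)) {
                best = s;
            }
        }
        return best;
    }

    /** Amount of log to replay after loading the newest valid snapshot. */
    static long replayLength(List<Snapshot> snapshots, long appendedLogPosition) {
        Snapshot s = latestValid(snapshots);
        long start = (s == null) ? 0L : s.logPosition; // no snapshot: replay the whole log
        return appendedLogPosition - start;
    }

    public static void main(String[] args) {
        List<Snapshot> snapshots = new ArrayList<>();
        snapshots.add(new Snapshot(1_000L));
        snapshots.add(new Snapshot(5_000L));
        long appended = 6_000L;

        // Replay starts from the newest snapshot.
        System.out.println(replayLength(snapshots, appended));

        // Mark the newest snapshot as corrupt, as an operator would after inspection.
        latestValid(snapshots).valid = false;

        // Replay now starts from the older snapshot, so it covers more of the log.
        System.out.println(replayLength(snapshots, appended));
    }
}
```

With these made-up positions the replay grows from 1000 to 5000 once the newest snapshot is invalidated. That growth is the "longer log replay" cost of falling back.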
Cluster Backup (OSS) vs Cluster Standby (Premium)
For the critical “all node loss” case, you protect yourself with an external backup. There are two options. The choice comes down to one distinction: warm versus cold.
| Feature | Cluster Backup (OSS) | Cluster Standby (Premium) |
|---|---|---|
| Availability | Open Source (Apache 2.0) | Premium (commercial) |
| Purpose | Replicate log to remote location for DR | Extend deployment with live standby nodes |
| Node State | Passive — does not process messages | Active — processes every message in real-time |
| Recovery Speed | Slower — depends on snapshot age + log replay duration | Faster — near-instant switchover, no replay needed |
| Data Loss on Failover | Higher — depends on last backup sync point | Minimal — only in-transit uncommitted messages |
| Recovery Mechanism | Rebuild cluster state via log replay | Flip a flag to switch standby to active |
| Bandwidth Flexibility | Not configurable | Configurable — single node or all nodes routing |
| Back Pressure on Leader | May cause back pressure if remote node added to cluster | No back pressure on leader |
| Use Case | Protection against complete DC loss with acceptable recovery time | Rapid failover with minimal downtime and data loss |
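The trade-off in the table can be sketched as a small selection function. This is an illustration only: the enum, method, and parameter names are hypothetical, and the 60-second cut-off is this page's rule of thumb for "fast failover", not a product limit.

```java
// Sketch of the warm-versus-cold choice. All names here are hypothetical, and the
// 60-second threshold is a rule of thumb, not a limit enforced by either product.
final class DrStrategyChooser {

    enum Strategy { CLUSTER_BACKUP_OSS, CLUSTER_STANDBY_PREMIUM }

    /**
     * @param rtoTargetSeconds how quickly service must be restored
     * @param strictRpo        true if only in-transit messages may be lost
     * @param premiumAvailable true if the budget covers a commercial licence
     */
    static Strategy choose(long rtoTargetSeconds, boolean strictRpo, boolean premiumAvailable) {
        boolean needsWarmStandby = rtoTargetSeconds < 60 || strictRpo;
        if (needsWarmStandby && premiumAvailable) {
            return Strategy.CLUSTER_STANDBY_PREMIUM;
        }
        // Budget constrained, or recovery in minutes is acceptable: cold backup.
        return Strategy.CLUSTER_BACKUP_OSS;
    }

    public static void main(String[] args) {
        System.out.println(choose(30, true, true));    // tight objectives, budget available
        System.out.println(choose(600, false, false)); // relaxed objectives, no budget
    }
}
```

Note that when the budget rules out Premium, the function still returns Cluster Backup even if the objectives are tight. In that case you get some DR, but not the recovery objectives you asked for.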
Decision framework
Pick based on your recovery objectives and budget:
- Need fast failover (<60s) + minimal data loss? → Cluster Standby (Premium)
- Need basic DR with acceptable recovery time? → Cluster Backup (OSS)
- Budget constrained but need some DR? → Cluster Backup (OSS)
- Running cross-region with strict RPO/RTO? → Cluster Standby (Premium)
The detailed comparison
Drilling deeper, the warm-vs-cold distinction shows up in license, replication, recovery, and operations. ClusterStandby ships in the premium jar; ClusterBackup / ClusterBackupMediaDriver are OSS.
License & API
| | Cluster Standby (Premium) | ClusterBackup (OSS) |
|---|---|---|
| License | Aeron Premium (commercial) | OSS (Apache 2.0) |
| API | ClusterStandby (premium jar) | ClusterBackup / ClusterBackupMediaDriver |
| Standby type | Warm — actively processes every message | Cold — replicates recordings, replays on restore |
Replication & state
| | Cluster Standby | ClusterBackup |
|---|---|---|
| Replicates snapshot recordings | Yes (as part of message processing) | Yes (automatic) |
| Replicates log recordings | Yes (as part of message processing) | Yes (automatic) |
| Runs ClusteredService on standby | Yes — processes every message | No — archive-level replication only |
| Standby state matches primary | Yes — near real-time | No — state in archive only, not in memory |
| Continuous sync (live log tailing) | Yes | Yes — stays in BACKING_UP state |
| Leader-aware | Yes — follows leader changes | Yes — queries cluster consensus |
Recovery characteristics
This is where the latency story lives. Cluster Standby’s RTO is near-zero because state is already in memory, so there is no snapshot load and no log replay. ClusterBackup pays minutes because it must copy data and start a fresh cluster.
| | Cluster Standby | ClusterBackup |
|---|---|---|
| RTO | Near-zero — “set a flag” to activate | Minutes — copy data + start cluster |
| Failover mechanism | Operator sets flag → standby becomes active | Copy recording.log + archive → start new cluster |
| State reconstruction | Already in memory — no replay needed | Cluster loads snapshot + replays log |
| Data loss window | Only in-transit uncommitted messages | Up to last backup sync point |
| Back pressure on leader | No | Minimal |
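To get a feel for where ClusterBackup's minutes go, a back-of-envelope estimate adds up the copy, cluster start-up, and log replay phases. Every name and figure below is a hypothetical placeholder, and the model is deliberately simplistic (constant throughput, sequential phases), so measure your own rates before relying on it.

```java
// Back-of-envelope RTO estimate for a cold restore. All names and input figures
// are hypothetical; measure your own copy, start-up, and replay rates.
final class ColdRestoreEstimate {

    /** Seconds needed to process 'bytes' at 'bytesPerSecond', rounded up. */
    static long secondsFor(long bytes, long bytesPerSecond) {
        return (bytes + bytesPerSecond - 1) / bytesPerSecond;
    }

    /**
     * Sums the three sequential phases of a cold restore:
     * copy the backup data, start the cluster (including snapshot load), replay the log.
     */
    static long estimateRtoSeconds(long snapshotBytes, long logBytesSinceSnapshot,
                                   long copyBytesPerSecond, long replayBytesPerSecond,
                                   long clusterStartSeconds) {
        long copy = secondsFor(snapshotBytes + logBytesSinceSnapshot, copyBytesPerSecond);
        long replay = secondsFor(logBytesSinceSnapshot, replayBytesPerSecond);
        return copy + clusterStartSeconds + replay;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        long mib = 1L << 20;
        // 2 GiB snapshot, 8 GiB of log since it, 200 MiB/s copy, 100 MiB/s replay, 30 s start-up.
        System.out.println(estimateRtoSeconds(2 * gib, 8 * gib, 200 * mib, 100 * mib, 30));
    }
}
```

With these placeholder inputs the estimate comes to 164 seconds: 52 to copy, 30 to start, 82 to replay. The replay term grows with the age of the snapshot, which is why snapshot frequency directly affects cold-restore RTO.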
Operational
| | Cluster Standby | ClusterBackup |
|---|---|---|
| Requires custom code | No (config only) | Minimal — ClusterBackupMediaDriver.launch() |
| Snapshot + log replay | Not needed — state already live | Automatic on recovery |