Backups and Disaster Recovery

Disaster recovery is a spectrum, not a single plan. The blast radius dictates the remedy: lose one follower and the cluster shrugs it off automatically; lose every node and you cold-start from an external backup. This page maps each failure mode to its remediation, then compares the two backup strategies — OSS Cluster Backup versus Premium Cluster Standby — so you can pick the right one for your RTO and RPO.

For how recordings, snapshots, and the recording log actually work under the hood, see The Aeron Files. Here we focus on operational recovery.

DR scenarios, ranked by severity

Four scenarios cover the realistic failure space, listed from least to most severe:

Scenario	Severity
Follower node loss	Low: cluster continues, node recovers automatically
Leader node loss	Medium: brief election, then automatic recovery
Most recent snapshot loss	Medium: falls back to an older snapshot plus a longer log replay
All node loss (only DB survives)	Critical: full rebuild required

Recovery approach by scenario

Match the remedy to the severity. The further down this list you go, the more you rely on tooling and external backups instead of Aeron’s built-in mechanisms.

Follower loss → Restart the node. It auto-catches up from the leader via snapshot + log replay.
Leader loss → Auto-election picks a new leader. The old leader restarts as a follower and catches up.
Snapshot loss → Use ClusterTool to inspect the recording log, invalidate the corrupt snapshot, and fall back to an older snapshot with longer log replay.
All node loss → Cold-start from an external backup (snapshots + log recordings from Archive or Cluster Backup/Standby).
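
The snapshot-loss remedy trades a newer snapshot for a longer replay. The sketch below models that trade-off only; it is not Aeron's recording log format or the ClusterTool API. `SnapshotEntry`, `pick_recovery_point`, and the positions used are hypothetical names and numbers chosen for illustration, and the rule that replay starts from position 0 when no valid snapshot remains is an assumption of the model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SnapshotEntry:
    """Hypothetical, simplified stand-in for a snapshot entry in a recording log."""
    log_position: int   # log position the snapshot covers up to
    valid: bool = True  # set to False once the snapshot has been invalidated


def pick_recovery_point(
    snapshots: List[SnapshotEntry], log_end: int
) -> Tuple[Optional[SnapshotEntry], int]:
    """Return the newest valid snapshot and how much log remains to replay.

    Model assumption: with no valid snapshot, replay covers the whole log.
    """
    usable = [s for s in snapshots if s.valid and s.log_position <= log_end]
    if not usable:
        return None, log_end
    newest = max(usable, key=lambda s: s.log_position)
    return newest, log_end - newest.log_position


# Illustrative positions only.
snapshots = [SnapshotEntry(1_000), SnapshotEntry(5_000), SnapshotEntry(9_000)]
log_end = 10_000

before = pick_recovery_point(snapshots, log_end)

# The most recent snapshot turns out to be corrupt and is invalidated.
snapshots[-1].valid = False
after = pick_recovery_point(snapshots, log_end)

print("replay before invalidation:", before[1])
print("replay after invalidation:", after[1])
```

With these numbers, invalidating the newest snapshot moves the recovery point back from position 9,000 to 5,000, so the replay grows from 1,000 to 5,000 log units.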

Cluster Backup (OSS) vs Cluster Standby (Premium)

For the critical “all node loss” case, you protect yourself with an external backup. There are two options, and the choice comes down to one distinction: warm versus cold.

Feature	Cluster Backup (OSS)	Cluster Standby (Premium)
Availability	Open Source (Apache 2.0)	Premium (commercial)
Purpose	Replicate log to remote location for DR	Extend deployment with live standby nodes
Node State	Passive — does not process messages	Active — processes every message in real-time
Recovery Speed	Slower — depends on snapshot age + log replay duration	Faster — near-instant switchover, no replay needed
Data Loss on Failover	Higher — depends on last backup sync point	Minimal — only in-transit uncommitted messages
Recovery Mechanism	Rebuild cluster state via log replay	Flip a flag to switch standby to active
Bandwidth Flexibility	Not configurable	Configurable — single node or all nodes routing
Back Pressure on Leader	May cause back pressure if remote node added to cluster	No back pressure on leader
Use Case	Protection against complete DC loss with acceptable recovery time	Rapid failover with minimal downtime and data loss

Decision framework

Pick based on your recovery objectives and budget:

Need fast failover (<60s) + minimal data loss?  → Cluster Standby (Premium)
Need basic DR with acceptable recovery time?    → Cluster Backup (OSS)
Budget constrained but need some DR?            → Cluster Backup (OSS)
Running cross-region with strict RPO/RTO?       → Cluster Standby (Premium)
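
The four rules above can be encoded as a small function, which is handy if you want to document the choice per environment. This is a sketch, not an official selection tool: `Requirements` and `choose_strategy` are hypothetical names, and letting the budget constraint win when rules conflict is an assumption made here, since the rules themselves do not specify a precedence.

```python
from dataclasses import dataclass

STANDBY = "Cluster Standby (Premium)"
BACKUP = "Cluster Backup (OSS)"


@dataclass
class Requirements:
    """Hypothetical container for the inputs to the decision rules."""
    target_rto_seconds: float
    minimal_data_loss: bool = False
    cross_region_strict: bool = False
    budget_constrained: bool = False


def choose_strategy(req: Requirements) -> str:
    # Assumption: a hard budget constraint overrides every other rule.
    if req.budget_constrained:
        return BACKUP
    # Cross-region deployments with strict RPO/RTO call for a warm standby.
    if req.cross_region_strict:
        return STANDBY
    # Fast failover (< 60 s) combined with minimal data loss.
    if req.target_rto_seconds < 60 and req.minimal_data_loss:
        return STANDBY
    # Basic DR with an acceptable recovery time.
    return BACKUP


print(choose_strategy(Requirements(target_rto_seconds=30, minimal_data_loss=True)))
print(choose_strategy(Requirements(target_rto_seconds=600)))
```

The first call returns the Standby option and the second returns the Backup option. Reorder the checks if your organization resolves conflicts differently.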

The detailed comparison

Drilling deeper, the warm-vs-cold distinction shows up in license, replication, recovery, and operations. `ClusterStandby` ships in the premium jar; `ClusterBackup` and `ClusterBackupMediaDriver` are OSS.

License & API

	Cluster Standby (Premium)	ClusterBackup (OSS)
License	Aeron Premium (commercial)	OSS (Apache 2.0)
API	`ClusterStandby` (premium jar)	`ClusterBackup` / `ClusterBackupMediaDriver`
Standby type	Warm — actively processes every message	Cold — replicates recordings, replays on restore

Replication & state

	Cluster Standby	ClusterBackup
Replicates snapshot recordings	Yes (as part of message processing)	Yes (automatic)
Replicates log recordings	Yes (as part of message processing)	Yes (automatic)
Runs ClusteredService on standby	Yes — processes every message	No — archive-level replication only
Standby state matches primary	Yes — near real-time	No — state in archive only, not in memory
Continuous sync (live log tailing)	Yes	Yes — stays in BACKING_UP state
Leader-aware	Yes — follows leader changes	Yes — queries cluster consensus

Recovery characteristics

This is where the latency story lives. Cluster Standby’s RTO is near-zero because state is already in memory, so there is no snapshot load and no log replay. ClusterBackup takes minutes because it must copy the data, start a fresh cluster, load a snapshot, and replay the log.

	Cluster Standby	ClusterBackup
RTO	Near-zero — “set a flag” to activate	Minutes — copy data + start cluster
Failover mechanism	Operator sets flag → standby becomes active	Copy recording.log + archive → start new cluster
State reconstruction	Already in memory — no replay needed	Cluster loads snapshot + replays log
Data loss window	Only in-transit uncommitted messages	Up to last backup sync point
Back pressure on leader	No	Minimal, though it can occur if the remote node is added to the cluster
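
To see why the cold path lands in minutes, it helps to add up its stages. The function below is a back-of-the-envelope sketch, assuming the stages run one after another; `cold_restore_rto_seconds` is a hypothetical helper, and every size and throughput figure is an illustrative placeholder rather than a measured Aeron number.

```python
def cold_restore_rto_seconds(
    backup_bytes: int,
    copy_bytes_per_second: float,
    cluster_start_seconds: float,
    snapshot_load_seconds: float,
    log_bytes_to_replay: int,
    replay_bytes_per_second: float,
) -> float:
    """Estimate a cold-restore RTO as the sum of its sequential stages."""
    copy_seconds = backup_bytes / copy_bytes_per_second
    replay_seconds = log_bytes_to_replay / replay_bytes_per_second
    return (
        copy_seconds
        + cluster_start_seconds
        + snapshot_load_seconds
        + replay_seconds
    )


# Illustrative inputs: 20 GB of backup data copied at 200 MB/s,
# 10 s to start the cluster, 30 s to load the snapshot,
# and log replay at 50 MB/s.
recent = cold_restore_rto_seconds(
    20_000_000_000, 200_000_000, 10, 30, 3_000_000_000, 50_000_000
)
# Same restore, but from an older snapshot with four times as much log to replay.
older = cold_restore_rto_seconds(
    20_000_000_000, 200_000_000, 10, 30, 12_000_000_000, 50_000_000
)

print(f"recent snapshot: {recent:.0f} s, older snapshot: {older:.0f} s")
```

With these placeholder inputs the estimate is 200 seconds from a recent snapshot and 380 seconds from the older one; the replay term is the only one that changes. A warm standby skips the copy, load, and replay terms entirely, which is why its RTO is dominated by how quickly an operator sets the flag.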

Operational

	Cluster Standby	ClusterBackup
Requires custom code	No (config only)	Minimal — `ClusterBackupMediaDriver.launch()`
Snapshot + log replay	Not needed — state already live	Automatic on recovery