Backups and Disaster Recovery

Disaster recovery is a spectrum, not a single plan. The blast radius dictates the remedy: lose one follower and the cluster shrugs it off automatically; lose every node and you cold-start from an external backup. This page maps each failure mode to its remediation, then compares the two backup strategies — OSS Cluster Backup versus Premium Cluster Standby — so you can pick the right one for your RTO and RPO.

For how recordings, snapshots, and the recording log actually work under the hood, see The Aeron Files. Here we focus on operational recovery.

Four scenarios cover the realistic failure space, in order of severity:

| Scenario | Severity |
| --- | --- |
| Follower Node loss | Low: cluster continues, auto-recovery |
| Leader Node loss | Medium: brief election, auto-recovery |
| Most Recent Snapshot loss | Medium: falls back to older snapshot + longer replay |
| All Node loss (only DB survives) | Critical: full rebuild required |

Match the remedy to the severity. The further down this list you go, the more you rely on tooling and external backups instead of Aeron’s built-in mechanisms.

  1. Follower loss → Restart the node. It auto-catches up from the leader via snapshot + log replay.
  2. Leader loss → Auto-election picks a new leader. The old leader restarts as a follower and catches up.
  3. Snapshot loss → Use ClusterTool to inspect the recording log, invalidate the corrupt snapshot, and fall back to an older snapshot with longer log replay.
  4. All node loss → Cold-start from an external backup (snapshots + log recordings from Archive or Cluster Backup/Standby).
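The snapshot-loss remedy trades a newer snapshot for a longer replay, and a toy model makes that trade-off concrete. The sketch below is not Aeron's recording log API: `SnapshotFallback`, `Snapshot`, `latestValid`, and `replayLength` are hypothetical names, and the model assumes that a cluster with no valid snapshot replays the log from position 0.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Toy model only: a hypothetical stand-in for snapshot entries in a recording log.
final class SnapshotFallback {

    /** A snapshot taken at a given log position; valid == false models an invalidated entry. */
    record Snapshot(long logPosition, boolean valid) {}

    /** Returns the valid snapshot with the highest log position, if any remains. */
    static Optional<Snapshot> latestValid(List<Snapshot> snapshots) {
        return snapshots.stream()
            .filter(Snapshot::valid)
            .max(Comparator.comparingLong(Snapshot::logPosition));
    }

    /**
     * Length of log that must be replayed after loading the chosen snapshot.
     * Assumption: with no valid snapshot, replay starts at position 0.
     */
    static long replayLength(List<Snapshot> snapshots, long logEndPosition) {
        long start = latestValid(snapshots).map(Snapshot::logPosition).orElse(0L);
        return logEndPosition - start;
    }
}
```

With snapshots at positions 1,000 and 5,000 and a log that ends at 6,000, the replay covers 1,000 positions. Once the newer snapshot is marked invalid, the same call falls back to the snapshot at 1,000 and the replay grows to 5,000.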

Cluster Backup (OSS) vs Cluster Standby (Premium)

For the critical “all node loss” case, you protect yourself with an external backup. There are two options, and the choice comes down to one distinction: warm versus cold.

| Feature | Cluster Backup (OSS) | Cluster Standby (Premium) |
| --- | --- | --- |
| Availability | Open Source (Apache 2.0) | Premium (commercial) |
| Purpose | Replicate log to remote location for DR | Extend deployment with live standby nodes |
| Node State | Passive: does not process messages | Active: processes every message in real-time |
| Recovery Speed | Slower: depends on snapshot age + log replay duration | Faster: near-instant switchover, no replay needed |
| Data Loss on Failover | Higher: depends on last backup sync point | Minimal: only in-transit uncommitted messages |
| Recovery Mechanism | Rebuild cluster state via log replay | Flip a flag to switch standby to active |
| Bandwidth Flexibility | Not configurable | Configurable: single node or all nodes routing |
| Back Pressure on Leader | May cause back pressure if remote node added to cluster | No back pressure on leader |
| Use Case | Protection against complete DC loss with acceptable recovery time | Rapid failover with minimal downtime and data loss |

Pick based on your recovery objectives and budget:

Need fast failover (<60s) + minimal data loss? → Cluster Standby (Premium)
Need basic DR with acceptable recovery time? → Cluster Backup (OSS)
Budget constrained but need some DR? → Cluster Backup (OSS)
Running cross-region with strict RPO/RTO? → Cluster Standby (Premium)
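The guide above can be read as a small decision rule. The sketch below encodes one possible reading, with hypothetical names (`DrStrategyChooser`, `Strategy`, `choose`). The 60-second threshold comes from the guide. Treating either a sub-60-second RTO or a strict RPO as sufficient reason to pick Cluster Standby is an assumption, and budget is deliberately left out because the guide does not say how to rank it against the recovery objectives.

```java
import java.time.Duration;

// Hypothetical helper: one possible encoding of the decision guide, not an Aeron API.
final class DrStrategyChooser {

    enum Strategy { CLUSTER_BACKUP_OSS, CLUSTER_STANDBY_PREMIUM }

    // Threshold taken from the "<60s" line of the guide.
    static final Duration FAST_FAILOVER = Duration.ofSeconds(60);

    /**
     * Assumption: a target RTO under 60 seconds OR a strict RPO is enough
     * to require a warm standby; everything else is served by a cold backup.
     */
    static Strategy choose(Duration targetRto, boolean strictRpo) {
        boolean needsFastFailover = targetRto.compareTo(FAST_FAILOVER) < 0;
        return (needsFastFailover || strictRpo)
            ? Strategy.CLUSTER_STANDBY_PREMIUM
            : Strategy.CLUSTER_BACKUP_OSS;
    }
}
```

For example, `choose(Duration.ofSeconds(30), false)` selects Cluster Standby, while `choose(Duration.ofMinutes(10), false)` selects Cluster Backup.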

Drilling deeper, the warm-vs-cold distinction shows up in license, replication, recovery, and operations. ClusterStandby ships in the premium jar; ClusterBackup / ClusterBackupMediaDriver are OSS.

| Feature | Cluster Standby (Premium) | ClusterBackup (OSS) |
| --- | --- | --- |
| License | Aeron Premium (commercial) | OSS (Apache 2.0) |
| API | ClusterStandby (premium jar) | ClusterBackup / ClusterBackupMediaDriver |
| Standby type | Warm: actively processes every message | Cold: replicates recordings, replays on restore |

| Feature | Cluster Standby | ClusterBackup |
| --- | --- | --- |
| Replicates snapshot recordings | Yes (as part of message processing) | Yes (automatic) |
| Replicates log recordings | Yes (as part of message processing) | Yes (automatic) |
| Runs ClusteredService on standby | Yes: processes every message | No: archive-level replication only |
| Standby state matches primary | Yes: near real-time | No: state in archive only, not in memory |
| Continuous sync (live log tailing) | Yes | Yes: stays in BACKING_UP state |
| Leader-aware | Yes: follows leader changes | Yes: queries cluster consensus |

This is where the latency story lives. Cluster Standby’s RTO is near-zero because state is already in memory — no snapshot load, no log replay. ClusterBackup pays minutes because it must copy data and start a fresh cluster.

| Feature | Cluster Standby | ClusterBackup |
| --- | --- | --- |
| RTO | Near-zero: “set a flag” to activate | Minutes: copy data + start cluster |
| Failover mechanism | Operator sets flag → standby becomes active | Copy recording.log + archive → start new cluster |
| State reconstruction | Already in memory, no replay needed | Cluster loads snapshot + replays log |
| Data loss window | Only in-transit uncommitted messages | Up to last backup sync point |
| Back pressure on leader | No | Minimal |

| Feature | Cluster Standby | ClusterBackup |
| --- | --- | --- |
| Requires custom code | No (config only) | Minimal: ClusterBackupMediaDriver.launch() |
| Snapshot + log replay | Not needed, state already live | Automatic on recovery |
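To see why a cold restore lands in the range of minutes, it can help to add up its phases: copying the backup, starting the cluster, and replaying the log. The sketch below is a back-of-the-envelope estimate, not a measurement. `ColdRestoreEstimate` and all of its parameters are hypothetical, and the throughput figures in the usage note are illustrative placeholders to replace with numbers from your own environment.

```java
// Hypothetical estimator: sums the phases of a cold restore from a cluster backup.
final class ColdRestoreEstimate {

    /**
     * Estimated recovery time in seconds.
     *
     * @param bytesToCopy         size of recording.log + archive data copied to the new nodes
     * @param copyBytesPerSec     assumed copy throughput
     * @param logBytesToReplay    log length between the loaded snapshot and the end of the log
     * @param replayBytesPerSec   assumed replay throughput of the clustered service
     * @param clusterStartSeconds assumed fixed cost to start the nodes and load the snapshot
     */
    static double seconds(
        long bytesToCopy,
        double copyBytesPerSec,
        long logBytesToReplay,
        double replayBytesPerSec,
        double clusterStartSeconds) {
        double copySeconds = bytesToCopy / copyBytesPerSec;
        double replaySeconds = logBytesToReplay / replayBytesPerSec;
        return copySeconds + clusterStartSeconds + replaySeconds;
    }
}
```

As an illustration, copying 12 GB at 200 MB/s takes 60 seconds, replaying 3 GB of log at 50 MB/s takes another 60 seconds, and a 15-second start brings the total to 135 seconds. A warm standby skips the copy and replay phases, which is where its near-zero RTO comes from.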