
Failure-Mode Runbook: Nodes, Network, Disk

When something breaks at 3 a.m., you want a table, not a textbook. This page is that table. It maps every infrastructure failure mode — node crashes, network partitions, disk problems — to its data-loss risk, availability impact, recovery complexity, and the exact remediation steps. Three rules run through all of it: minority failures self-heal, majority failures need you, and quorum is the line between the two.

For the byte-level internals (log layout, term files, recording segments), see The Aeron Files. This page is the operator’s runbook, not the wire format.

Start with the cluster itself. The pattern below holds across every row: lose the minority and Aeron heals on its own; lose the majority and you must restore quorum by hand.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Single follower crash | None | None (quorum held) | Low (auto catchup) | Restart node; auto-catches up via snapshot + log replay from leader |
| Single leader crash | Uncommitted msgs | Brief (election timeout) | Low (auto election) | Wait for auto-election; restart crashed node to rejoin as follower; clients reconnect to new leader |
| Multiple node crash (minority) | None | None (zero tolerance left) | Low (auto catchup) | Restart crashed nodes ASAP to restore fault tolerance; monitor quorum health |
| Multiple node crash (majority / quorum loss) | Uncommitted msgs | Total until restored | High (manual intervention) | Restart enough nodes to restore quorum; if data lost, restore from latest snapshot + log backup |
| All nodes crash simultaneously | Possible | Total | High (snapshot + log restore) | Cold-start all nodes; each replays from local snapshot + log recordings; if corrupt, restore from external backup |

  • Minority failures are self-healing — Aeron handles them automatically.
  • Majority failures require manual intervention — you must restore quorum.
  • “Uncommitted msgs” means only messages the leader accepted but hadn’t replicated to a majority yet — typically a very small window.
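
The quorum rule behind these rows is plain arithmetic. The sketch below assumes a simple majority quorum of floor(n/2) + 1 voting members; the helper names (`quorum_size`, `failures_tolerated`, `has_quorum`) are hypothetical and not part of Aeron.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating majority-quorum arithmetic (not an Aeron API).

# Smallest number of members that forms a majority of $1 voting nodes.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# How many nodes can fail while a majority still remains.
failures_tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Exit status 0 if $2 live nodes out of $1 voting nodes still form a quorum.
has_quorum() {
    [ "$2" -ge "$(quorum_size "$1")" ]
}
```

For a 3-node cluster, `has_quorum 3 2` succeeds (the minority-failure rows) and `has_quorum 3 1` fails (the quorum-loss rows).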

Restore from external backup (4th node async replica)


When the cluster uses Aeron Cluster Backup to asynchronously replicate data to a 4th (non-voting) node, use the following procedure to restore from that external backup.

  • The 4th node runs the ClusterBackup agent, continuously receiving snapshots and log segments from the active cluster.
  • Backup data location is configured via ClusterBackup.Context (e.g. backupResponseChannel, backupDir).
  • The backup node stores: latest snapshot + subsequent log recordings.
  1. Stop all surviving cluster nodes (if any are still running) to prevent split-brain.

  2. Identify the backup data on the 4th node.

    • Locate the backup directory (configured in ClusterBackup.Context.backupDir()).
    • Verify the latest snapshot and recording log files are intact:
      # List snapshot and log recordings in the backup dir
      ls -lah /path/to/backup-dir/
      # Print the recording log entries with ClusterTool to inspect recording state
      java -cp aeron-all.jar io.aeron.cluster.ClusterTool /path/to/backup-dir/ recording-log
  3. Copy backup data to each cluster node.

    • For each node in the cluster, replace its corrupt/lost data with the backup:
      # On each cluster node, clear the old cluster dir
      rm -rf /path/to/node-X/cluster-dir/*
      # Copy snapshot + recording log from the 4th node backup
      scp -r backup-node:/path/to/backup-dir/* /path/to/node-X/cluster-dir/
    • Ensure file ownership and permissions are correct on each node.
  4. Update cluster mark file (if needed).

    • Each node’s cluster-mark.dat must reflect the correct memberId.
    • If restoring to the same nodes with the same member IDs, no change is needed.
    • If node identity changed, update the cluster configuration accordingly.
  5. Seed the leader node first.

    • Pick one node to start first — it will replay the snapshot + logs and become the initial leader.
    • Start it with:
      # Start the first node; it will replay snapshot and logs
      java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-0/cluster-dir/
    • Wait until it has fully replayed and is in LEADER state (check via AeronStat or application logs).
  6. Start remaining follower nodes.

    • Start the other cluster nodes one by one.
    • Each will catch up from the leader via snapshot install + log replay:
      java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-1/cluster-dir/
      java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-2/cluster-dir/
  7. Verify cluster health.

    • Confirm all nodes have joined and quorum is established.
    • Check consensus module counters via AeronStat:
      java -cp aeron-all.jar io.aeron.samples.AeronStat
    • Verify LEADER and FOLLOWER roles are correctly assigned.
    • Run application-level sanity checks (e.g. query state, check sequence numbers).
  8. Restart the 4th backup node.

    • Once the cluster is healthy, restart ClusterBackup on the 4th node so it resumes async replication from the new leader.
  • Data gap awareness: The async backup may lag behind the live cluster. Any messages accepted by the cluster but not yet replicated to the backup node will be lost. The gap depends on replication frequency and network latency.
  • Do NOT start cluster nodes before copying backup data to all of them — starting with mismatched state across nodes can cause further corruption.
  • Snapshot consistency: Always use a complete snapshot + its corresponding log recordings together. Never mix snapshots and logs from different points in time.
  • Test this procedure regularly in non-production environments to validate RTO/RPO targets.
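
The warning about mismatched state can be checked mechanically before any node is started. A minimal sketch, assuming GNU coreutils and findutils (`sha256sum`, `sort -z`, `xargs -0 -r`); `dir_fingerprint` is a hypothetical helper name, not an Aeron tool.

```shell
# Hypothetical helper: print one digest covering every file under a directory.
# Assumes GNU coreutils/findutils (sha256sum, sort -z, xargs -0 -r).
dir_fingerprint() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        # Hash files in a stable order. Relative paths are part of the
        # sha256sum output, so a renamed or missing file changes the result.
        find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum | sha256sum | cut -d' ' -f1
    )
}
```

Run it against the backup directory on the 4th node and against each node's cluster directory right after the copy in step 3. Any differing digest means that node holds a different or incomplete copy and should not be started.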

The 4th backup node itself is a single point of failure for disaster recovery. If the backup node’s disk dies while the cluster is also down, you lose your last resort. The suggested approach is to layer external, durable storage on top of the Aeron ClusterBackup agent.

Suggested approach: periodic snapshot archival to object storage


The recommended strategy is to periodically archive the backup node’s snapshot + recording log files to durable external storage (e.g. AWS S3, GCS, Azure Blob Storage, or a remote NFS/SAN).

  1. Aeron ClusterBackup continuously replicates the latest snapshot and log segments from the cluster to the 4th node’s local disk — this is real-time and automatic.
  2. A scheduled job on the backup node (cron, systemd timer, or application-level scheduler) periodically copies a consistent point-in-time snapshot bundle to object storage.
  3. Retention policy keeps N recent archives, pruning older ones to manage storage costs.
#!/bin/bash
# backup-to-s3.sh: runs periodically via cron on the 4th backup node
set -euo pipefail   # abort on any failed step so a partial archive is never uploaded

BACKUP_DIR="/path/to/backup-dir"
S3_BUCKET="s3://your-bucket/aeron-cluster-backups"
TIMESTAMP=$(date +%Y%m%dT%H%M%S)
ARCHIVE_NAME="cluster-backup-${TIMESTAMP}.tar.gz"
STAGING_DIR="/tmp/aeron-backup-staging"
KEEP_ARCHIVES=28    # 7 days of history at one archive every 6 hours

# 1. Pause ClusterBackup agent (optional but recommended for consistency)
#    Alternatively, take a filesystem-level snapshot (LVM, ZFS, EBS snapshot)
#    if pausing is not acceptable

# 2. Create a consistent copy of the backup directory
#    (clear leftovers from a failed run first, otherwise cp nests the new copy)
rm -rf "$STAGING_DIR"
mkdir -p "$STAGING_DIR"
cp -a "$BACKUP_DIR" "$STAGING_DIR/backup-snapshot"

# 3. Resume ClusterBackup agent (if paused in step 1)

# 4. Compress and upload to S3
tar -czf "/tmp/${ARCHIVE_NAME}" -C "$STAGING_DIR" backup-snapshot
aws s3 cp "/tmp/${ARCHIVE_NAME}" "${S3_BUCKET}/${ARCHIVE_NAME}"

# 5. Clean up staging
rm -rf "$STAGING_DIR" "/tmp/${ARCHIVE_NAME}"

# 6. Prune old backups: keep the newest KEEP_ARCHIVES archives (about 7 days)
#    head -n -N and xargs -r are GNU extensions; NF >= 4 skips "PRE" prefix rows
aws s3 ls "${S3_BUCKET}/" | awk 'NF >= 4 {print $4}' | sort | head -n "-${KEEP_ARCHIVES}" | \
    xargs -r -I {} aws s3 rm "${S3_BUCKET}/{}"

echo "Backup archived: ${S3_BUCKET}/${ARCHIVE_NAME}"
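
The pruning step works only because the timestamped archive names sort chronologically. The selection can be sketched as a pure function that is easy to dry-run before wiring it to `aws s3 rm`; `archives_to_prune` is a hypothetical helper name, and it relies only on POSIX `sort` and `awk` instead of the GNU-only `head -n -N`.

```shell
# Hypothetical helper: read archive names on stdin (one per line) and print
# the ones to delete, keeping the newest $1. Relies on names that embed a
# sortable timestamp such as cluster-backup-20240101T000000.tar.gz.
archives_to_prune() {
    local keep="$1"
    awk 'NF' | LC_ALL=C sort | awk -v keep="$keep" '
        { names[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print names[i] }
    '
}
```

For example, `aws s3 ls "${S3_BUCKET}/" | awk 'NF >= 4 {print $4}' | archives_to_prune 28` lists what would be removed without deleting anything.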

Schedule via cron (e.g. every 6 hours):

0 */6 * * * /opt/scripts/backup-to-s3.sh >> /var/log/aeron-backup-archive.log 2>&1

| Approach | Pros | Cons |
|---|---|---|
| Object storage archival (recommended) | Durable, versioned, cheap, cross-region replication built-in | Slight RPO gap (time between archives) |
| EBS/disk-level snapshots | Filesystem-consistent, fast, no application-level scripting | Cloud-provider specific; snapshot restore takes time |
| ZFS/LVM snapshots + replication | Near-instant consistent snapshots; can replicate to remote | Requires ZFS/LVM setup; more operational complexity |
| Rsync to a remote host | Simple, uses standard tooling | Not atomic: can copy partial state if backup is being written to |
| Second backup node (5th node) | Full redundancy at Aeron level, no scripting needed | Doubles backup infra cost; still need off-site copy for DR |

  • Best option for a consistent copy: take a filesystem-level snapshot (EBS, ZFS, or LVM) while ClusterBackup is running. This gives a crash-consistent point-in-time copy without pausing replication.
  • Good option: briefly pause the ClusterBackup agent, cp -a the directory, then resume. The pause lasts seconds; the backup node falls slightly behind the cluster and catches up automatically on resume.
  • Avoid: copying the backup directory while ClusterBackup is actively writing. You risk archiving a half-written snapshot that is unusable.
  • Retention policy: Keep at least 3-7 days of archives. For compliance-heavy environments, keep 30+ days or per regulatory requirements.
  • Monitor the archival job: Alert if the cron job fails or if the most recent archive is older than 2x the expected interval.
  • Monitor the backup node lag: Track how far behind the backup node is from the cluster leader (via ClusterBackup counters). If lag grows, the backup is stale.
  • Validate restores periodically: Download an archive from object storage, restore it to a test cluster, and verify state integrity. An untested backup is not a backup.
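
The "older than 2x the expected interval" rule for the archival job reduces to one comparison. A minimal sketch; `archive_is_stale` is a hypothetical helper, and how the timestamp of the newest archive is obtained (object metadata, a marker file, the job log) is left as an assumption.

```shell
# Hypothetical helper: exit status 0 (stale) when the newest archive is older
# than twice the expected archival interval. All arguments are in seconds.
# Usage: archive_is_stale LAST_ARCHIVE_EPOCH NOW_EPOCH INTERVAL_SECONDS
archive_is_stale() {
    local last="$1" now="$2" interval="$3"
    [ $(( now - last )) -gt $(( 2 * interval )) ]
}
```

With the 6-hour schedule above the interval is 21600 seconds, so `archive_is_stale "$last_epoch" "$(date +%s)" 21600 && echo "ALERT: archive is stale"` fires once the newest archive is more than 12 hours old.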

Network failures look scary but mostly self-correct. Raft’s whole point is that committed data survives any partition. The one to fear is not a clean break — it is flapping.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Partition: leader isolated | Uncommitted msgs | Brief (election) | Low (automatic) | Wait for auto-election in majority partition; old leader steps down automatically; fix network |
| Partition: follower isolated | None | None (quorum held) | Low (auto catchup on reconnect) | Fix network; follower auto-catches up from leader on reconnect |
| Partition: no majority (symmetric split) | None (committed safe) | Total until reconnect | Medium (wait or manual) | Restore network connectivity; cluster auto-resumes; no manual data recovery needed |
| Intermittent flapping | None | Degraded (election storms) | Medium (tune timeouts) | Increase election timeout and heartbeat interval; fix underlying network instability; consider dedicated cluster network |
| High latency between nodes | None | Degraded (slow commits) | Low (tune timeouts) | Increase heartbeat/election timeouts to tolerate latency; optimize network path; consider closer node placement |

Scenario: Leader isolated (minority partition)

  [AZ-1: Leader]  ──✕──  [AZ-2: Follower A, Follower B]
         ↓                            ↓
     Steps down           New election → new leader
 (can't reach quorum)         (has quorum: 2/3)

Raft’s partition handling is elegant: the isolated leader can’t commit anything (no quorum), so it steps down. The majority partition elects a new leader and continues. No data is lost because uncommitted entries on the old leader were never acknowledged to clients.
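
The same reasoning applies to any split: at most one side can hold a majority, and only that side keeps committing. A small sketch under the usual majority-quorum assumption; `partition_outcome` is a hypothetical helper, not an Aeron tool.

```shell
# Hypothetical helper: report which side of a partition can elect a leader.
# Usage: partition_outcome CLUSTER_SIZE SIDE_SIZE...
partition_outcome() {
    local cluster_size="$1"
    shift
    local quorum=$(( cluster_size / 2 + 1 ))
    local index=0 side
    for side in "$@"; do
        index=$(( index + 1 ))
        if [ "$side" -ge "$quorum" ]; then
            echo "side ${index} has quorum (${side}/${cluster_size})"
            return 0
        fi
    done
    # No side reached a majority, so nothing commits until the network heals.
    echo "no side has quorum"
    return 1
}
```

`partition_outcome 3 1 2` matches the diagram above (the two followers elect a new leader), while `partition_outcome 4 2 2` reports the symmetric-split row where the cluster stalls until connectivity returns.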

Two of these rows are tuning problems, not failures. High latency between nodes slows commits because every commit waits on a cross-node round trip — your p99 climbs with the network path. Flapping triggers election storms that stall the cluster repeatedly. Both are addressed by widening the heartbeat and election timeouts so the cluster tolerates brief blips without reacting — at the cost of slower failure detection. The timeout itself does not touch p50/p99 on the hot path; it governs how long an outage stalls throughput before the cluster recovers.
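
The trade-off can be made concrete with two numbers: how often the leader sends heartbeats and how long followers wait before giving up on it. The sketch below uses hypothetical helper names and millisecond values chosen only for illustration. In Aeron Cluster the corresponding settings should be `aeron.cluster.leader.heartbeat.interval` and `aeron.cluster.leader.heartbeat.timeout`, but check the names and defaults against your Aeron version.

```shell
# Hypothetical helpers; all values are in milliseconds.

# Consecutive heartbeats that may be missed before followers give up on the leader.
heartbeats_tolerated() {
    local heartbeat_interval_ms="$1" heartbeat_timeout_ms="$2"
    echo $(( heartbeat_timeout_ms / heartbeat_interval_ms ))
}

# Exit status 0 if a network blip of the given length ends before the
# heartbeat timeout fires, meaning no election is triggered.
blip_tolerated() {
    local blip_ms="$1" heartbeat_timeout_ms="$2"
    [ "$blip_ms" -lt "$heartbeat_timeout_ms" ]
}
```

Raising the timeout from 2000 to 10000 makes `blip_tolerated 3000 ...` flip from failure to success, so a 3-second flap no longer starts an election. The cost is that a leader that has really crashed can go undetected for up to 10 seconds.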

Disk is where a healthy cluster quietly stalls. Two directories matter, and “disk full on leader” is the surprise that takes down the whole cluster.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Disk full on leader | None | Total (cannot commit) | Medium | Free disk space or expand volume; trigger snapshot to compact log; set up disk usage alerts |
| Disk full on follower | None | None (quorum held) | Low | Free disk space on follower; node auto-catches up once writes resume |
| Disk corruption (single node) | None (if detected) | None (quorum held) | Medium (rebuild node) | Stop node; wipe corrupted clusterDir + archiveDir; restart; node rebuilds from leader via snapshot replication + log catchup |
| Disk corruption (all nodes) | High | Total | Critical (external backup needed) | Restore all nodes from external backup; if no backup exists, data is lost |
| Loss of cluster directory (single node) | None | None (quorum held) | Medium | Wipe node completely; restart; node re-provisions from leader snapshot + log |
| Loss of cluster directory (all nodes) | High | Total | Critical | Restore cluster metadata from external backup; rebuild from backed-up snapshots + logs |
| Loss of archive directory (single node) | None | None (quorum held) | Medium | Wipe node; restart; node catches up from leader via snapshot replication |
| Loss of archive directory (all nodes) | Total | Total | Critical | Restore archive recordings from external backup; no backup = total data loss |

  • clusterDir — Contains cluster metadata, mark files, consensus state.
  • archiveDir — Contains recorded streams (log entries, snapshots).

Both are critical. Losing either on a single node is recoverable from peers. Losing either on all nodes simultaneously requires external backup.
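
Since a full disk on the leader stops commits for the whole cluster, both directories are worth a usage alert. A minimal sketch using POSIX `df -P`; the helper names and the 80 percent threshold in the usage note are assumptions, and in practice this check normally lives in your monitoring system instead of a script.

```shell
# Hypothetical helpers for a disk-usage alert on clusterDir and archiveDir.

# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given path, parsed from POSIX `df -P` output.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Exit status 0 when a usage percentage is at or above the threshold.
usage_over_threshold() {
    local pct="$1" threshold="$2"
    [ "$pct" -ge "$threshold" ]
}
```

A cron-style check could then read `usage_over_threshold "$(disk_usage_pct /path/to/archive-dir)" 80 && echo "ALERT: archive volume above 80%"`, run once for each of the two directories.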