Failure-Mode Runbook: Nodes, Network, Disk

When something breaks at 3 a.m., you want a table, not a textbook. This page is that table. It maps every infrastructure failure mode — node crashes, network partitions, disk problems — to its data-loss risk, availability impact, recovery complexity, and the exact remediation steps. Three rules run through all of it: minority failures self-heal, majority failures need you, and quorum is the line between the two.

For the byte-level internals — log layout, term files, recording segments — defer to The Aeron Files. This page is the operator’s runbook, not the wire format.

Start with the cluster itself. The pattern below holds across every row: lose the minority and Aeron heals on its own; lose the majority and you must restore quorum by hand.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
| --- | --- | --- | --- | --- |
| Single follower crash | None | None (quorum held) | Low (auto catchup) | Restart node; auto-catches up via snapshot + log replay from leader |
| Single leader crash | Uncommitted msgs | Brief (election timeout) | Low (auto election) | Wait for auto-election; restart crashed node to rejoin as follower; clients reconnect to new leader |
| Multiple node crash (minority) | None | None (zero tolerance left) | Low (auto catchup) | Restart crashed nodes ASAP to restore fault tolerance; monitor quorum health |
| Multiple node crash (majority / quorum loss) | Uncommitted msgs | Total until restored | High (manual intervention) | Restart enough nodes to restore quorum; if data lost, restore from latest snapshot + log backup |
| All nodes crash simultaneously | Possible | Total | High (snapshot + log restore) | Cold-start all nodes; each replays from local snapshot + log recordings; if corrupt, restore from external backup |

Key patterns

- Minority failures are self-healing: Aeron handles them automatically.
- Majority failures require manual intervention: you must restore quorum.
- “Uncommitted msgs” means only messages the leader accepted but hadn’t replicated to a majority yet, typically a very small window.
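The line between minority and majority is plain majority arithmetic. The sketch below makes it concrete; the function names (`quorum_size`, `fault_tolerance`, `classify_failure`) are hypothetical helpers for illustration, not part of Aeron.

```shell
# Majority quorum for a cluster of n voting members: floor(n/2) + 1
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum still holds
fault_tolerance() {
  echo $(( $1 - $(quorum_size "$1") ))
}

# Print "minority" if the survivors still form a quorum, else "majority"
# $1 = cluster size, $2 = number of crashed members
classify_failure() {
  if [ $(( $1 - $2 )) -ge "$(quorum_size "$1")" ]; then
    echo "minority"
  else
    echo "majority"
  fi
}
```

For a 3-node cluster, `classify_failure 3 1` prints `minority` and `classify_failure 3 2` prints `majority`; `fault_tolerance 5` prints `2`, which is why two crashes in a 5-node cluster still count as a minority failure.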

Restore from external backup (4th node async replica)

When the cluster uses Aeron Cluster Backup to asynchronously replicate data to a 4th (non-voting) node, use the following procedure to restore from that external backup.

Prerequisites

- The 4th node runs the ClusterBackup agent, continuously receiving snapshots and log segments from the active cluster.
- Backup data location is configured via ClusterBackup.Context (e.g. backupResponseChannel, backupDir).
- The backup node stores: latest snapshot + subsequent log recordings.

Restore procedure

Stop all surviving cluster nodes (if any are still running) to prevent split-brain.

Identify the backup data on the 4th node.

Locate the backup directory (configured in ClusterBackup.Context.backupDir()).

Verify the latest snapshot and recording log files are intact:

```
# List snapshot and log recordings in the backup dir
ls -lah /path/to/backup-dir/
# Print the recording log (terms and snapshots) with ClusterTool.
# Check ClusterTool's usage output for the commands your Aeron version supports.
java -cp aeron-all.jar io.aeron.cluster.ClusterTool /path/to/backup-dir/ recording-log
```

Copy backup data to each cluster node.

For each node in the cluster, replace its corrupt/lost data with the backup:

```
# On each cluster node, clear the old cluster dir
rm -rf /path/to/node-X/cluster-dir/*

# Copy snapshot + recording log from the 4th node backup
scp -r backup-node:/path/to/backup-dir/* /path/to/node-X/cluster-dir/
```

Ensure file ownership and permissions are correct on each node.
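Before starting any node, it can help to confirm that the copy matches the backup byte for byte. The sketch below is one way to do that, assuming POSIX `find` and `cksum` are available on both hosts; `dir_manifest` and `same_content` are hypothetical helper names, and a stronger hash such as `sha256sum` could be swapped in where available.

```shell
# Print a sorted "checksum size path" manifest for every file under a directory.
# Runs in a subshell so the cd does not leak into the caller.
dir_manifest() (
  cd "$1" || exit 1
  find . -type f -exec cksum {} + | LC_ALL=C sort
)

# Succeed only if both directories exist and hold identical files
same_content() {
  [ -d "$1" ] && [ -d "$2" ] &&
    [ "$(dir_manifest "$1")" = "$(dir_manifest "$2")" ]
}
```

Across hosts, the same idea works by running `dir_manifest` on the backup node and on each cluster node (for example over ssh) and comparing the two outputs with `diff`.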

Update cluster mark file (if needed).
- Each node’s cluster-mark.dat must reflect the correct memberId.
- If restoring to the same nodes with the same member IDs, no change is needed.
- If node identity changed, update the cluster configuration accordingly.

Seed the leader node first.
- Pick one node to start first — it will replay the snapshot + logs and become the initial leader.
- Start it with:
```
# Start the first node; it will replay snapshot and logs
java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-0/cluster-dir/
```
- Wait until it has fully replayed and is in LEADER state (check via AeronStat or application logs).

Start remaining follower nodes.

Start the other cluster nodes one by one.

Each will catch up from the leader via snapshot install + log replay:

```
java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-1/cluster-dir/
java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-2/cluster-dir/
```

Verify cluster health.
- Confirm all nodes have joined and quorum is established.
- Check consensus module counters via AeronStat:
```
java -cp aeron-all.jar io.aeron.samples.AeronStat
```
- Verify LEADER and FOLLOWER roles are correctly assigned.
- Run application-level sanity checks (e.g. query state, check sequence numbers).

Restart the 4th backup node.
- Once the cluster is healthy, restart ClusterBackup on the 4th node so it resumes async replication from the new leader.

Important notes

- Data gap awareness: The async backup may lag behind the live cluster. Any messages accepted by the cluster but not yet replicated to the backup node will be lost. The gap depends on replication frequency and network latency.
- Do NOT start cluster nodes before copying backup data to all of them: starting with mismatched state across nodes can cause further corruption.
- Snapshot consistency: Always use a complete snapshot + its corresponding log recordings together. Never mix snapshots and logs from different points in time.
- Test this procedure regularly in non-production environments to validate RTO/RPO targets.

Backing up the backup node

The 4th backup node itself is a single point of failure for disaster recovery. If the backup node’s disk dies while the cluster is also down, you lose your last resort. The suggested approach is to layer external, durable storage on top of the Aeron ClusterBackup agent.

Suggested approach: periodic snapshot archival to object storage

The recommended strategy is to periodically archive the backup node’s snapshot + recording log files to durable external storage (e.g. AWS S3, GCS, Azure Blob Storage, or a remote NFS/SAN).

How it works

- Aeron ClusterBackup continuously replicates the latest snapshot and log segments from the cluster to the 4th node’s local disk. This is real-time and automatic.
- A scheduled job on the backup node (cron, systemd timer, or application-level scheduler) periodically copies a consistent point-in-time snapshot bundle to object storage.
- Retention policy keeps N recent archives, pruning older ones to manage storage costs.
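The "keep N recent archives" rule can be sketched without touching object storage at all. The helper below is a minimal illustration, assuming archive names embed a sortable timestamp so that lexical order equals chronological order; `archives_to_prune` is a hypothetical name.

```shell
# Read archive names on stdin (one per line) and print the ones to delete,
# keeping the newest $1. Assumes names sort chronologically.
archives_to_prune() {
  LC_ALL=C sort | awk -v keep="$1" '
    { name[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print name[i] }
  '
}
```

Feeding it a listing of archive names and piping the result to the delete command keeps the selection logic separate from the storage CLI, which makes the retention rule easy to dry-run before it deletes anything.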

Implementation

```
#!/bin/bash
# backup-to-s3.sh: runs periodically via cron on the 4th backup node
set -euo pipefail

BACKUP_DIR="/path/to/backup-dir"
S3_BUCKET="s3://your-bucket/aeron-cluster-backups"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%dT%H%M%S)
ARCHIVE_NAME="cluster-backup-${TIMESTAMP}.tar.gz"

# Fresh staging directory per run; removed on exit, including on failure
STAGING_DIR=$(mktemp -d /tmp/aeron-backup-staging.XXXXXX)
trap 'rm -rf "$STAGING_DIR"' EXIT

# 1. Pause ClusterBackup agent (optional but recommended for consistency)
#    Alternatively, take a filesystem-level snapshot (LVM, ZFS, EBS snapshot)
#    if pausing is not acceptable

# 2. Create a consistent copy of the backup directory
cp -a "$BACKUP_DIR" "$STAGING_DIR/backup-snapshot"

# 3. Resume ClusterBackup agent (if paused in step 1)

# 4. Compress and upload to S3
tar -czf "$STAGING_DIR/$ARCHIVE_NAME" -C "$STAGING_DIR" backup-snapshot
aws s3 cp "$STAGING_DIR/$ARCHIVE_NAME" "${S3_BUCKET}/${ARCHIVE_NAME}"

# 5. Prune archives older than RETENTION_DAYS. Archive names embed a sortable
#    timestamp, so a string comparison against the cutoff name is enough.
#    (date -d and xargs -r are GNU extensions.)
CUTOFF="cluster-backup-$(date -d "${RETENTION_DAYS} days ago" +%Y%m%dT%H%M%S).tar.gz"
aws s3 ls "${S3_BUCKET}/" | \
  awk -v cutoff="$CUTOFF" '$4 ~ /^cluster-backup-/ && $4 < cutoff {print $4}' | \
  xargs -r -I {} aws s3 rm "${S3_BUCKET}/{}"

echo "Backup archived: ${S3_BUCKET}/${ARCHIVE_NAME}"
```

Schedule via cron (e.g. every 6 hours):

```
0 */6 * * * /opt/scripts/backup-to-s3.sh >> /var/log/aeron-backup-archive.log 2>&1
```

Alternative approaches

| Approach | Pros | Cons |
| --- | --- | --- |
| Object storage archival (recommended) | Durable, versioned, cheap, cross-region replication built-in | Slight RPO gap (time between archives) |
| EBS/disk-level snapshots | Filesystem-consistent, fast, no application-level scripting | Cloud-provider specific; snapshot restore takes time |
| ZFS/LVM snapshots + replication | Near-instant consistent snapshots; can replicate to remote | Requires ZFS/LVM setup; more operational complexity |
| Rsync to a remote host | Simple, uses standard tooling | Not atomic: can copy partial state if backup is being written to |
| Second backup node (5th node) | Full redundancy at Aeron level, no scripting needed | Doubles backup infra cost; still need off-site copy for DR |

Consistency considerations

- Best option: Use filesystem- or block-level snapshots (EBS snapshot, ZFS snapshot, LVM snapshot) while ClusterBackup is running. This gives you a crash-consistent point-in-time copy without pausing replication, provided the whole backup directory lives on the volume being snapshotted.
- Good option: Briefly pause the ClusterBackup agent, cp -a the directory, then resume. The pause window is short (seconds); during it the backup node falls slightly behind the cluster and catches up automatically on resume.
- Avoid: Copying the backup directory while ClusterBackup is actively writing. You risk archiving a half-written snapshot that is unusable.

Retention & monitoring

- Retention policy: Keep at least 3-7 days of archives. For compliance-heavy environments, keep 30+ days or per regulatory requirements.
- Monitor the archival job: Alert if the cron job fails or if the most recent archive is older than 2x the expected interval.
- Monitor the backup node lag: Track how far behind the backup node is from the cluster leader (via ClusterBackup counters). If lag grows, the backup is stale.
- Validate restores periodically: Download an archive from object storage, restore it to a test cluster, and verify state integrity. An untested backup is not a backup.
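The "older than 2x the expected interval" rule reduces to one comparison on epoch seconds. A minimal sketch follows; `archive_is_stale` is a hypothetical helper, and how the newest archive's timestamp is obtained (object-store listing, local marker file) is left to the monitoring setup.

```shell
# Succeed (exit 0) when the newest archive is older than twice the interval.
# $1 = newest archive time (epoch seconds)
# $2 = current time (epoch seconds)
# $3 = expected archive interval (seconds)
archive_is_stale() {
  [ $(( $2 - $1 )) -gt $(( 2 * $3 )) ]
}
```

With the 6-hour schedule (21600 seconds), a check such as `archive_is_stale "$last_archive_epoch" "$(date +%s)" 21600 && echo "ALERT: archive is stale"` would fire once more than 12 hours have passed since the last successful archive; `last_archive_epoch` is a placeholder for whatever your monitoring records.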

Network failures

Network failures look scary but mostly self-correct. Raft’s whole point is that committed data survives any partition. The one to fear is not a clean break — it is flapping.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
| --- | --- | --- | --- | --- |
| Partition: leader isolated | Uncommitted msgs | Brief (election) | Low (automatic) | Wait for auto-election in majority partition; old leader steps down automatically; fix network |
| Partition: follower isolated | None | None (quorum held) | Low (auto catchup on reconnect) | Fix network; follower auto-catches up from leader on reconnect |
| Partition: no majority (symmetric split) | None (committed safe) | Total until reconnect | Medium (wait or manual) | Restore network connectivity; cluster auto-resumes; no manual data recovery needed |
| Intermittent flapping | None | Degraded (election storms) | Medium (tune timeouts) | Increase election timeout and heartbeat interval; fix underlying network instability; consider dedicated cluster network |
| High latency between nodes | None | Degraded (slow commits) | Low (tune timeouts) | Increase heartbeat/election timeouts to tolerate latency; optimize network path; consider closer node placement |

Network partition behavior

Scenario: Leader isolated (minority partition)

```
  [AZ-1: Leader]  ──✕──  [AZ-2: Follower A, Follower B]
       ↓                         ↓
  Steps down                New election → new leader
  (can't reach quorum)      (has quorum: 2/3)
```

Raft’s partition handling is elegant: the isolated leader can’t commit anything (no quorum), so it steps down. The majority partition elects a new leader and continues. No committed data is lost: the only entries discarded are uncommitted ones on the old leader, and those were never acknowledged to clients.

Two of these rows are tuning problems, not failures. High latency between nodes slows commits because every commit waits on a cross-node round trip — your p99 climbs with the network path. Flapping triggers election storms that stall the cluster repeatedly. Both are addressed by widening the heartbeat and election timeouts so the cluster tolerates brief blips without reacting — at the cost of slower failure detection. The timeout itself does not touch p50/p99 on the hot path; it governs how long an outage stalls throughput before the cluster recovers.
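The trade-off can be put in numbers with a deliberately simplified model: followers call an election only when heartbeats have been missing for the full timeout. The sketch below uses illustrative millisecond values, not Aeron defaults, and the helper names are hypothetical.

```shell
# Consecutive heartbeats that can be missed before the timeout expires
# $1 = heartbeat timeout (ms), $2 = heartbeat interval (ms)
missed_heartbeats_tolerated() {
  echo $(( $1 / $2 ))
}

# Succeed when an outage outlasts the heartbeat timeout, i.e. when followers
# would start an election in this simplified model
# $1 = outage duration (ms), $2 = heartbeat timeout (ms)
blip_triggers_election() {
  [ "$1" -ge "$2" ]
}
```

With a 2000 ms timeout and a 200 ms interval, `missed_heartbeats_tolerated 2000 200` prints `10`, and a 500 ms blip passes without an election. Raising the timeout to 10000 ms rides out blips up to ten seconds, but a genuinely dead leader then also goes undetected for about ten seconds.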

Disk failures

Disk is where a healthy cluster quietly stalls. Two directories matter, and “disk full on leader” is the surprise that takes down the whole cluster.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
| --- | --- | --- | --- | --- |
| Disk full on leader | None | Total (cannot commit) | Medium | Free disk space or expand volume; trigger snapshot to compact log; set up disk usage alerts |
| Disk full on follower | None | None (quorum held) | Low | Free disk space on follower; node auto-catches up once writes resume |
| Disk corruption (single node) | None (if detected) | None (quorum held) | Medium (rebuild node) | Stop node; wipe corrupted clusterDir + archiveDir; restart; rebuilds from leader via snapshot replication + log catchup |
| Disk corruption (all nodes) | High | Total | Critical (external backup needed) | Restore all nodes from external backup; if no backup exists, data is lost |
| Loss of cluster directory (single node) | None | None (quorum held) | Medium | Wipe node completely; restart; re-provisions from leader snapshot + log |
| Loss of cluster directory (all nodes) | High | Total | Critical | Restore cluster metadata from external backup; rebuild from backed-up snapshots + logs |
| Loss of archive directory (single node) | None | None (quorum held) | Medium | Wipe node; restart; catches up from leader via snapshot replication |
| Loss of archive directory (all nodes) | Total | Total | Critical | Restore archive recordings from external backup; no backup = total data loss |

Key directories

- clusterDir: contains cluster metadata, mark files, consensus state.
- archiveDir: contains recorded streams (log entries, snapshots).

Both are critical. Losing either on a single node is recoverable from peers. Losing either on all nodes simultaneously requires external backup.
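Since a full disk on the leader stalls the whole cluster, a usage check on the volumes holding both directories is a cheap guard. The sketch below is one possible shape for it, assuming POSIX `df -P` output and filesystem names without spaces; the helper names and the 80 percent threshold are illustrative.

```shell
# Print the used-space percentage (no % sign) of the filesystem holding a path
disk_used_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold
# $1 = path, $2 = threshold percentage
disk_over_threshold() {
  [ "$(disk_used_percent "$1")" -ge "$2" ]
}
```

Run from cron or a monitoring agent, a call such as `disk_over_threshold /path/to/archive-dir 80 && echo "ALERT: archive volume above 80%"` covers one directory; repeat it for the cluster directory if the two live on different volumes.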