Failure-Mode Runbook: Nodes, Network, Disk
When something breaks at 3 a.m., you want a table, not a textbook. This page is that table. It maps every infrastructure failure mode — node crashes, network partitions, disk problems — to its data-loss risk, availability impact, recovery complexity, and the exact remediation steps. Three rules run through all of it: minority failures self-heal, majority failures need you, and quorum is the line between the two.
For the byte-level internals — log layout, term files, recording segments — defer to The Aeron Files. This page is the operator’s runbook, not the wire format.
Node-related failures
Start with the cluster itself. The pattern below holds across every row: lose the minority and Aeron heals on its own; lose the majority and you must restore quorum by hand.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Single follower crash | None | None (quorum held) | Low — auto catchup | Restart node; auto-catches up via snapshot + log replay from leader |
| Single leader crash | Uncommitted msgs | Brief (election timeout) | Low — auto election | Wait for auto-election; restart crashed node to rejoin as follower; clients reconnect to new leader |
| Multiple node crash (minority) | None | None (zero tolerance left) | Low — auto catchup | Restart crashed nodes ASAP to restore fault tolerance; monitor quorum health |
| Multiple node crash (majority / quorum loss) | Uncommitted msgs | Total until restored | High — manual intervention | Restart enough nodes to restore quorum; if data lost, restore from latest snapshot + log backup |
| All nodes crash simultaneously | Possible | Total | High — snapshot + log restore | Cold-start all nodes; each replays from local snapshot + log recordings; if corrupt, restore from external backup |
Key patterns
- Minority failures are self-healing: Aeron handles them automatically.
- Majority failures require manual intervention: you must restore quorum.
- “Uncommitted msgs” means only messages the leader accepted but had not yet replicated to a majority, typically a very small window.
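The quorum rule behind these patterns is plain majority arithmetic. The sketch below is illustrative only: the function names are hypothetical and it models nothing beyond member counts.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def classify_failure(cluster_size: int, failed: int) -> str:
    """Map a count of failed members to the runbook's two outcomes."""
    if not 0 <= failed <= cluster_size:
        raise ValueError("failed must be between 0 and cluster_size")
    surviving = cluster_size - failed
    if surviving >= quorum(cluster_size):
        return "self-healing"        # quorum held: restart nodes, they catch up
    return "manual intervention"     # quorum lost: an operator must restore it


def remaining_tolerance(cluster_size: int, failed: int) -> int:
    """How many more members can fail before quorum is lost."""
    return max(0, cluster_size - failed - quorum(cluster_size))
```

For a 5-node cluster with 2 nodes down, `classify_failure(5, 2)` is still `"self-healing"` but `remaining_tolerance(5, 2)` is `0`, which is the "zero tolerance left" case in the table.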
Restore from external backup (4th node async replica)
When the cluster uses Aeron Cluster Backup to asynchronously replicate data to a 4th (non-voting) node, use the following procedure to restore from that external backup.
Prerequisites
- The 4th node runs the `ClusterBackup` agent, continuously receiving snapshots and log segments from the active cluster.
- Backup data location is configured via `ClusterBackup.Context` (e.g. `backupResponseChannel`, `backupDir`).
- The backup node stores: latest snapshot + subsequent log recordings.
Restore procedure
1. Stop all surviving cluster nodes (if any are still running) to prevent split-brain.
2. Identify the backup data on the 4th node.
   - Locate the backup directory (configured in `ClusterBackup.Context.backupDir()`).
   - Verify the latest snapshot and recording log files are intact:

   ```bash
   # List snapshot and log recordings in the backup dir
   ls -lah /path/to/backup-dir/
   # Print the recording log with ClusterTool to inspect recording state
   java -cp aeron-all.jar io.aeron.cluster.ClusterTool /path/to/backup-dir/ recording-log
   ```
3. Copy backup data to each cluster node.
   - For each node in the cluster, replace its corrupt/lost data with the backup:

   ```bash
   # On each cluster node, clear the old cluster dir
   rm -rf /path/to/node-X/cluster-dir/*
   # Copy snapshot + recording log from the 4th node backup
   scp -r backup-node:/path/to/backup-dir/* /path/to/node-X/cluster-dir/
   ```

   - Ensure file ownership and permissions are correct on each node.
4. Update cluster mark file (if needed).
   - Each node’s `cluster-mark.dat` must reflect the correct `memberId`.
   - If restoring to the same nodes with the same member IDs, no change is needed.
   - If node identity changed, update the cluster configuration accordingly.
5. Seed the leader node first.
   - Pick one node to start first; it replays the snapshot + logs and is the candidate for initial leader.
   - Start it with:

   ```bash
   # Start the first node; it will replay snapshot and logs
   java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-0/cluster-dir/
   ```

   - Wait until it has fully replayed (check via `AeronStat` or application logs). In a multi-member cluster a node needs votes from a majority to win an election, so expect it to report `LEADER` state only once a second member is running.
6. Start remaining follower nodes.
   - Start the other cluster nodes one by one.
   - Each will catch up from the leader via snapshot install + log replay:

   ```bash
   java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-1/cluster-dir/
   java -cp your-app.jar <MainClass> --cluster-dir /path/to/node-2/cluster-dir/
   ```
7. Verify cluster health.
   - Confirm all nodes have joined and quorum is established.
   - Check consensus module counters via `AeronStat`:

   ```bash
   java -cp aeron-all.jar io.aeron.samples.AeronStat
   ```

   - Verify `LEADER` and `FOLLOWER` roles are correctly assigned.
   - Run application-level sanity checks (e.g. query state, check sequence numbers).
8. Restart the 4th backup node.
   - Once the cluster is healthy, restart `ClusterBackup` on the 4th node so it resumes async replication from the new leader.
Important notes
- Data gap awareness: The async backup may lag behind the live cluster. Any messages accepted by the cluster but not yet replicated to the backup node will be lost. The gap depends on replication frequency and network latency.
- Do NOT start cluster nodes before copying backup data to all of them — starting with mismatched state across nodes can cause further corruption.
- Snapshot consistency: Always use a complete snapshot + its corresponding log recordings together. Never mix snapshots and logs from different points in time.
- Test this procedure regularly in non-production environments to validate RTO/RPO targets.
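One way to confirm that every node received the same backup data before any node is started is to compare per-file checksums. The following is a minimal sketch, not part of Aeron: the helper names are hypothetical, and any file that legitimately differs per node would need to be excluded from the comparison.

```python
import hashlib
from pathlib import Path


def directory_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    base = Path(root)
    manifest = {}
    for path in sorted(base.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large recordings are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(base).as_posix()] = digest.hexdigest()
    return manifest


def manifest_differences(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return relative paths that are missing, extra, or differ in content."""
    return sorted(
        name
        for name in expected.keys() | actual.keys()
        if expected.get(name) != actual.get(name)
    )
```

Run `directory_manifest` on the backup directory and on each restored node directory; an empty list from `manifest_differences` suggests the copy completed intact.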
Backing up the backup node
The 4th backup node itself is a single point of failure for disaster recovery. If the backup node’s disk dies while the cluster is also down, you lose your last resort. The suggested approach is to layer external, durable storage on top of the Aeron `ClusterBackup` agent.
Suggested approach: periodic snapshot archival to object storage
The recommended strategy is to periodically archive the backup node’s snapshot + recording log files to durable external storage (e.g. AWS S3, GCS, Azure Blob Storage, or a remote NFS/SAN).
How it works
- Aeron `ClusterBackup` continuously replicates the latest snapshot and log segments from the cluster to the 4th node’s local disk. This is real-time and automatic.
- A scheduled job on the backup node (cron, systemd timer, or application-level scheduler) periodically copies a consistent point-in-time snapshot bundle to object storage.
- Retention policy keeps N recent archives, pruning older ones to manage storage costs.
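The retention step can be sketched as a pure function that decides which archives to delete. This is a hypothetical helper, assuming archive names carry a timestamp in the form `cluster-backup-YYYYmmddTHHMMSS.tar.gz`; `keep_days` and `keep_min` are illustrative parameters, not settings of any tool.

```python
from datetime import datetime, timedelta

PREFIX = "cluster-backup-"
SUFFIX = ".tar.gz"


def archive_time(name: str) -> datetime:
    """Parse the timestamp embedded in an archive name."""
    stamp = name[len(PREFIX):-len(SUFFIX)]
    return datetime.strptime(stamp, "%Y%m%dT%H%M%S")


def archives_to_prune(names, now, keep_days=7, keep_min=3):
    """Return archives older than keep_days, always sparing the keep_min newest."""
    matching = [n for n in names if n.startswith(PREFIX) and n.endswith(SUFFIX)]
    newest_first = sorted(matching, key=archive_time, reverse=True)
    cutoff = now - timedelta(days=keep_days)
    protected = set(newest_first[:keep_min])   # never prune below this floor
    return sorted(
        n for n in newest_first
        if archive_time(n) < cutoff and n not in protected
    )
```

Pruning by age with a floor on the number of archives avoids deleting the last good copies if the archival job has been failing for longer than the retention window.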
Implementation
```bash
#!/bin/bash
# backup-to-s3.sh: runs periodically via cron on the 4th backup node
# Abort on the first error so a failed upload never leads to pruning old archives
set -euo pipefail

BACKUP_DIR="/path/to/backup-dir"
S3_BUCKET="s3://your-bucket/aeron-cluster-backups"
TIMESTAMP=$(date +%Y%m%dT%H%M%S)
ARCHIVE_NAME="cluster-backup-${TIMESTAMP}.tar.gz"
STAGING_DIR="/tmp/aeron-backup-staging"

# 1. Pause ClusterBackup agent (optional but recommended for consistency)
#    Alternatively, take a filesystem-level snapshot (LVM, ZFS, EBS snapshot)
#    if pausing is not acceptable

# 2. Create a consistent copy of the backup directory
#    (clear staging left by a failed run so cp does not nest into it)
rm -rf "$STAGING_DIR"
mkdir -p "$STAGING_DIR"
cp -a "$BACKUP_DIR" "$STAGING_DIR/backup-snapshot"

# 3. Resume ClusterBackup agent (if paused in step 1)

# 4. Compress and upload to S3
tar -czf "/tmp/${ARCHIVE_NAME}" -C "$STAGING_DIR" backup-snapshot
aws s3 cp "/tmp/${ARCHIVE_NAME}" "${S3_BUCKET}/${ARCHIVE_NAME}"

# 5. Clean up staging
rm -rf "$STAGING_DIR" "/tmp/${ARCHIVE_NAME}"

# 6. Prune old backups: keep the 7 most recent archives
#    (at one archive every 6 hours that covers under 2 days; raise the count
#    for longer retention). head -n -7 requires GNU coreutils.
aws s3 ls "${S3_BUCKET}/" | awk '{print $4}' | sort | head -n -7 | \
  xargs -I {} aws s3 rm "${S3_BUCKET}/{}"

echo "Backup archived: ${S3_BUCKET}/${ARCHIVE_NAME}"
```

Schedule via cron (e.g. every 6 hours):

```
0 */6 * * * /opt/scripts/backup-to-s3.sh >> /var/log/aeron-backup-archive.log 2>&1
```

Alternative approaches
| Approach | Pros | Cons |
|---|---|---|
| Object storage archival (recommended) | Durable, versioned, cheap, cross-region replication built-in | Slight RPO gap (time between archives) |
| EBS/disk-level snapshots | Filesystem-consistent, fast, no application-level scripting | Cloud-provider specific; snapshot restore takes time |
| ZFS/LVM snapshots + replication | Near-instant consistent snapshots; can replicate to remote | Requires ZFS/LVM setup; more operational complexity |
| Rsync to a remote host | Simple, uses standard tooling | Not atomic — can copy partial state if backup is being written to |
| Second backup node (5th node) | Full redundancy at Aeron level, no scripting needed | Doubles backup infra cost; still need off-site copy for DR |
Consistency considerations
- Best option: Use filesystem-level snapshots (EBS snapshot, ZFS snapshot, LVM snapshot) while `ClusterBackup` is running. This gives you a crash-consistent point-in-time copy without pausing replication.
- Good option: Briefly pause the `ClusterBackup` agent, `cp -a` the directory, then resume. The pause window is short (seconds), and during this time the backup node simply falls slightly behind the cluster; it catches up automatically on resume.
- Avoid: Copying the backup directory while `ClusterBackup` is actively writing. You risk archiving a half-written snapshot that is unusable.
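Where neither a filesystem snapshot nor a pause is available, a copy can at least detect that the source changed underneath it. The sketch below is a heuristic under stated assumptions, not a substitute for the options above: the helper names are hypothetical, and files written through memory mapping may not update their modification time promptly, so an unchanged fingerprint is not proof of a consistent copy.

```python
import os
import shutil


def tree_fingerprint(root: str) -> dict[str, tuple[int, int]]:
    """Record (size, mtime_ns) for every file under root, keyed by relative path."""
    fingerprint = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            stat = os.stat(full)
            fingerprint[os.path.relpath(full, root)] = (stat.st_size, stat.st_mtime_ns)
    return fingerprint


def copy_if_quiescent(source: str, destination: str, attempts: int = 3) -> bool:
    """Copy source to destination; report success only if source did not change meanwhile."""
    for _ in range(attempts):
        before = tree_fingerprint(source)
        shutil.rmtree(destination, ignore_errors=True)  # copytree needs a fresh target
        shutil.copytree(source, destination)
        if tree_fingerprint(source) == before:
            return True    # no size or mtime change observed during the copy
    shutil.rmtree(destination, ignore_errors=True)      # discard the suspect copy
    return False           # source kept changing: pause the agent or use a snapshot
```

A `False` result is the signal to fall back to pausing the agent or taking a filesystem-level snapshot rather than archiving the copy.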
Retention & monitoring
- Retention policy: Keep at least 3-7 days of archives. For compliance-heavy environments, keep 30+ days or per regulatory requirements.
- Monitor the archival job: Alert if the cron job fails or if the most recent archive is older than 2x the expected interval.
- Monitor the backup node lag: Track how far behind the backup node is from the cluster leader (via `ClusterBackup` counters). If lag grows, the backup is stale.
- Validate restores periodically: Download an archive from object storage, restore it to a test cluster, and verify state integrity. An untested backup is not a backup.
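The "older than 2x the expected interval" rule for the archival job reduces to a single comparison. A minimal sketch, with a hypothetical function name and the 2x factor exposed as a parameter:

```python
from datetime import datetime, timedelta


def archive_is_stale(latest_archive_at: datetime, now: datetime,
                     expected_interval: timedelta, factor: float = 2.0) -> bool:
    """True when the newest archive is older than factor x the expected interval."""
    return now - latest_archive_at > expected_interval * factor
```

With a 6-hour schedule, an alert would fire once the newest archive is more than 12 hours old, which tolerates one missed run but not two.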
Network failures
Network failures look scary but mostly self-correct. Raft’s whole point is that committed data survives any partition. The one to fear is not a clean break; it is flapping.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Partition — leader isolated | Uncommitted msgs | Brief (election) | Low — automatic | Wait for auto-election in majority partition; old leader steps down automatically; fix network |
| Partition — follower isolated | None | None (quorum held) | Low — auto catchup on reconnect | Fix network; follower auto-catches up from leader on reconnect |
| Partition — no majority (symmetric split) | None (committed safe) | Total until reconnect | Medium — wait or manual | Restore network connectivity; cluster auto-resumes; no manual data recovery needed |
| Intermittent flapping | None | Degraded (election storms) | Medium — tune timeouts | Increase election timeout and heartbeat interval; fix underlying network instability; consider dedicated cluster network |
| High latency between nodes | None | Degraded (slow commits) | Low — tune timeouts | Increase heartbeat/election timeouts to tolerate latency; optimize network path; consider closer node placement |
Network partition behavior
Scenario: Leader isolated (minority partition)

```
[AZ-1: Leader] ──✕── [AZ-2: Follower A, Follower B]
       ↓                         ↓
   Steps down           New election → new leader
(can't reach quorum)       (has quorum: 2/3)
```

Raft’s partition handling is elegant: the isolated leader can’t commit anything (no quorum), so it steps down. The majority partition elects a new leader and continues. No acknowledged data is lost because uncommitted entries on the old leader were never acknowledged to clients.
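The three partition rows in the table above follow from one question: which side, if any, still holds a majority? The sketch below is illustrative only; the function name and the member labels are hypothetical.

```python
def partition_outcome(partitions: list[set[str]], leader: str, cluster_size: int) -> str:
    """Classify a network partition by where the majority and the leader ended up."""
    majority = cluster_size // 2 + 1
    # At most one side can hold a majority, so the first match is the only one.
    quorum_side = next((side for side in partitions if len(side) >= majority), None)
    if quorum_side is None:
        return "no majority: commits stall until connectivity returns"
    if leader in quorum_side:
        return "leader keeps quorum: isolated followers catch up on reconnect"
    return "leader isolated: majority side elects a new leader"
```

For the scenario in the diagram, `partition_outcome([{"leader"}, {"a", "b"}], "leader", 3)` reports the leader-isolated case.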
Two of these rows are tuning problems, not failures. High latency between nodes slows commits because every commit waits on a cross-node round trip — your p99 climbs with the network path. Flapping triggers election storms that stall the cluster repeatedly. Both are addressed by widening the heartbeat and election timeouts so the cluster tolerates brief blips without reacting — at the cost of slower failure detection. The timeout itself does not touch p50/p99 on the hot path; it governs how long an outage stalls throughput before the cluster recovers.
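Why latency shows up in commit times can be seen with a simplified model: the leader needs acknowledgements from enough followers to form a majority together with its own copy, so the commit waits on the slowest follower within that fastest majority. This is a sketch that ignores disk and processing time; the function name is hypothetical.

```python
def commit_round_trip_ms(follower_rtts_ms: list[float], cluster_size: int) -> float:
    """Network round trip the leader waits on before an entry can commit."""
    if len(follower_rtts_ms) != cluster_size - 1:
        raise ValueError("expected one round-trip time per follower")
    acks_needed = cluster_size // 2      # majority minus the leader's own copy
    if acks_needed == 0:
        return 0.0                       # single-node cluster: no network wait
    # The commit is gated by the slowest follower among the fastest acks_needed.
    return sorted(follower_rtts_ms)[acks_needed - 1]
```

In a 3-node cluster only the faster follower matters, so one distant node does not slow commits; in a 5-node cluster the second-fastest follower sets the pace.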
Disk failures
Disk is where a healthy cluster quietly stalls. Two directories matter, and “disk full on leader” is the surprise that takes down the whole cluster.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Disk full on leader | None | Total (cannot commit) | Medium | Free disk space or expand volume; trigger snapshot to compact log; set up disk usage alerts |
| Disk full on follower | None | None (quorum held) | Low | Free disk space on follower; node auto-catches up once writes resume |
| Disk corruption (single node) | None (if detected) | None (quorum held) | Medium — rebuild node | Stop node; wipe corrupted clusterDir + archiveDir; restart — rebuilds from leader via snapshot replication + log catchup |
| Disk corruption (all nodes) | High | Total | Critical — external backup needed | Restore all nodes from external backup; if no backup exists, data is lost |
| Loss of cluster directory (single node) | None | None (quorum held) | Medium | Wipe node completely; restart — re-provisions from leader snapshot + log |
| Loss of cluster directory (all nodes) | High | Total | Critical | Restore cluster metadata from external backup; rebuild from backed-up snapshots + logs |
| Loss of archive directory (single node) | None | None (quorum held) | Medium | Wipe node; restart — catches up from leader via snapshot replication |
| Loss of archive directory (all nodes) | Total | Total | Critical | Restore archive recordings from external backup; no backup = total data loss |
Key directories
- clusterDir: Contains cluster metadata, mark files, consensus state.
- archiveDir: Contains recorded streams (log entries, snapshots).
Both are critical. Losing either on a single node is recoverable from peers. Losing either on all nodes simultaneously requires external backup.
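Since a full disk on either directory stalls a node, a disk usage alert on both is a cheap safeguard. The following is a minimal sketch; the function names and the 75% / 90% thresholds are assumptions to adjust, not recommendations from Aeron.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert_level(fraction_used: float, warn_at: float = 0.75,
                     critical_at: float = 0.90) -> str:
    """Translate a usage fraction into an alert level."""
    if fraction_used >= critical_at:
        return "critical"
    if fraction_used >= warn_at:
        return "warning"
    return "ok"
```

A monitoring job could call `disk_alert_level(usage_fraction(path))` for the cluster directory and the archive directory on every node, and page on `"critical"` before the leader reaches the point where it cannot commit.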