Failure-Mode Runbook: Multi-AZ and Aeron Edge Cases

When things break at 3am, you want a table, not a theory. This runbook covers the two failure classes that page on-call most often: infrastructure-level outages (AZ loss, cross-AZ latency, region-wide events) and Aeron-specific edge cases (media driver crashes, archive I/O errors, component desync, mark file corruption, log buffer overflow). For each, you get the data-loss risk, the availability impact, the recovery complexity, and the remediation.

Infrastructure failures are mostly about quorum. Lose a minority of nodes and you’re fine. Lose a majority and you’re down. The table below is the fast lookup.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
| --- | --- | --- | --- | --- |
| AZ failure (minority nodes) | None | None (quorum held) | Low (auto catch-up on AZ recovery) | Wait for AZ recovery; nodes auto-rejoin and catch up. Ensure nodes are spread across AZs (1 per AZ for a 3-node cluster). |
| AZ failure (majority nodes) | Uncommitted msgs | Total | High (manual intervention) | Restore failed nodes or provision new ones in healthy AZs; rebalance node placement across AZs. |
| Cross-AZ latency spike | None | Degraded (slow commits, false elections) | Low (tune timeouts) | Increase election/heartbeat timeouts for cross-AZ deployment; use placement groups to minimize latency. |
| Region-wide outage | Possible | Total | Critical (cross-region restore) | Activate standby cluster in DR region; restore from cross-region replicated backups (snapshots + logs). |

Node placement is the single biggest lever on the infrastructure side. It is a latency-versus-resilience tradeoff, and you must choose it deliberately.

3-node cluster in 3 AZs (one node per AZ)

  • Any single AZ failure → quorum maintained (2/3).
  • Tradeoff: cross-AZ latency on every commit (leader must replicate to at least one other AZ).

3-node cluster in 2 AZs (latency optimized)

  • AZ-b failure → quorum maintained (2/3 in AZ-a).
  • AZ-a failure → quorum lost (only 1/3 in AZ-b).
  • Tradeoff: lower latency (leader + one follower same AZ) but asymmetric failure tolerance.
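
The placement tradeoffs above can be checked mechanically before a deployment. The following is a minimal sketch, not part of any Aeron tooling: the node and AZ names are hypothetical, and it assumes a simple majority quorum. For each AZ, it reports whether the cluster keeps quorum if that whole AZ is lost.

```python
from collections import Counter


def quorum(cluster_size: int) -> int:
    """Smallest majority of the cluster (2 of 3, 3 of 5, ...)."""
    return cluster_size // 2 + 1


def surviving_quorum(placement: dict[str, str]) -> dict[str, bool]:
    """Map each AZ to True if quorum survives losing that entire AZ.

    `placement` maps node name -> AZ name (hypothetical names).
    """
    needed = quorum(len(placement))
    nodes_per_az = Counter(placement.values())
    return {
        az: len(placement) - lost >= needed
        for az, lost in nodes_per_az.items()
    }


three_az = {"node0": "az-a", "node1": "az-b", "node2": "az-c"}
two_az = {"node0": "az-a", "node1": "az-a", "node2": "az-b"}

print(surviving_quorum(three_az))  # {'az-a': True, 'az-b': True, 'az-c': True}
print(surviving_quorum(two_az))    # {'az-a': False, 'az-b': True}
```

Any AZ that maps to `False` is a single point of failure for the cluster. In the 2-AZ layout that is AZ-a, which matches the asymmetric failure tolerance described above.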

These are the failures that don’t show up in a generic distributed-systems playbook. They come from Aeron’s component model: the Media Driver, Consensus Module, and Service Container are separate pieces that can fail independently.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
| --- | --- | --- | --- | --- |
| Media Driver crash | None | Single node down | Low (restart node) | Restart the entire node (Media Driver + Consensus Module + Service Container); investigate the root cause (driver bug, resource issue). |
| Archive recording failure (I/O error) | None | Degraded to Total (if leader) | Medium (take node out of service) | Remove the node from service; fix the I/O issue (replace disk, fix permissions); wipe and restart to rebuild from peers. |
| Consensus Module & Service Container desync | None (but stale) | Degraded (not processing msgs) | Medium (health monitoring needed) | Implement liveness checks for all components; restart the full node if the Service Container is down. |
| Mark file corruption | None | Single node cannot start | Low (clean up mark files) | Delete stale mark files from clusterDir and aeronDir; restart the node. Ensure no duplicate processes are running. |
| Log buffer overflow (termLength too small) | None | Degraded (back-pressure) | Low (increase termLength) | Increase termLength in the MediaDriver and ConsensusModule config; apply it with a coordinated rolling restart of the cluster. |
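
The desync row calls for liveness checks across all three components. As a rough sketch of that decision logic only: the component names, the heartbeat timestamps, and the 10-second timeout below are all assumptions, and how heartbeats are collected (counters, application-level pings, process checks) is left to your monitoring stack.

```python
# Hypothetical component names; one heartbeat timestamp is expected per component.
COMPONENTS = ("media_driver", "consensus_module", "service_container")


def stale_components(last_heartbeat_ms: dict[str, int], now_ms: int,
                     timeout_ms: int = 10_000) -> list[str]:
    """Return components with no heartbeat, or none within timeout_ms."""
    return [
        name for name in COMPONENTS
        if name not in last_heartbeat_ms
        or now_ms - last_heartbeat_ms[name] > timeout_ms
    ]


def node_action(last_heartbeat_ms: dict[str, int], now_ms: int,
                timeout_ms: int = 10_000) -> str:
    """Restart the whole node if any single component has gone quiet."""
    stale = stale_components(last_heartbeat_ms, now_ms, timeout_ms)
    return "restart_node" if stale else "healthy"


beats = {
    "media_driver": 99_000,
    "consensus_module": 98_500,
    "service_container": 60_000,  # silent for 40 s
}
print(stale_components(beats, now_ms=100_000))  # ['service_container']
print(node_action(beats, now_ms=100_000))       # restart_node
```

The sketch treats the node as one unit: a single stale component triggers a full-node restart, mirroring the remediation in the table rather than restarting components individually.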

Mark files are sentinel files that Aeron uses to detect if another instance is already running in the same directory. Corruption typically happens when:

  • A process is killed without cleanup (kill -9).
  • Two processes accidentally share the same aeron directory.
  • Filesystem corruption.

The fix is simple: delete the mark files and restart. But always verify no other process is using the same directory first.
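
The "verify first, then delete" order can be enforced in a small cleanup helper. This is a sketch under stated assumptions: the `*mark*.dat` glob is a guess at the file naming and should be adjusted to the files actually present in your directories, and `dir_in_use` is a hypothetical callback you supply (for example, a check of the process list). The helper refuses to delete anything if any directory is still in use, and defaults to a dry run.

```python
from pathlib import Path
from typing import Callable, Iterable


def remove_stale_mark_files(dirs: Iterable[Path],
                            dir_in_use: Callable[[Path], bool],
                            pattern: str = "*mark*.dat",  # assumed naming
                            dry_run: bool = True) -> list[Path]:
    """Delete mark files, but only if no directory is still in use.

    Returns the files that were (or, in a dry run, would be) removed.
    """
    directories = [Path(d) for d in dirs]

    # Check every directory before touching any file.
    busy = [d for d in directories if dir_in_use(d)]
    if busy:
        raise RuntimeError(f"directories still in use, nothing deleted: {busy}")

    removed = []
    for directory in directories:
        for mark_file in sorted(directory.glob(pattern)):
            if not dry_run:
                mark_file.unlink()
            removed.append(mark_file)
    return removed
```

Run it once with `dry_run=True` against both clusterDir and aeronDir, review the returned list, then repeat with `dry_run=False` before restarting the node.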

When messages are larger than expected or burst rates exceed the term buffer capacity, publishers hit back-pressure, and sustained back-pressure cascades into an availability problem.

The fix is to increase termLength, but remember the L3 cache constraint from the term buffer sizing guidance: term buffer should be < 1/3 of L3 cache size.
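
To make the sizing rule concrete, here is a small sketch that picks the largest termLength satisfying the one-third-of-L3 guideline. The constants reflect Aeron's term length limits as I understand them (a power of two between 64 KiB and 1 GiB) and its maximum message length rule (the smaller of termLength / 8 and 16 MiB); verify them against the Aeron version you run. The 32 MiB L3 size is only an example input.

```python
TERM_MIN_LENGTH = 64 * 1024            # assumed Aeron minimum term length
TERM_MAX_LENGTH = 1024 * 1024 * 1024   # assumed Aeron maximum term length
MAX_MESSAGE_CAP = 16 * 1024 * 1024     # assumed cap on max message length


def largest_term_length(l3_cache_bytes: int) -> int:
    """Largest power-of-two term length strictly below 1/3 of the L3 cache."""
    if TERM_MIN_LENGTH * 3 >= l3_cache_bytes:
        raise ValueError("L3 cache too small for the minimum term length")
    term = TERM_MIN_LENGTH
    # Keep doubling while the next size still fits both limits.
    while term * 2 <= TERM_MAX_LENGTH and term * 2 * 3 < l3_cache_bytes:
        term *= 2
    return term


def max_message_length(term_length: int) -> int:
    """Largest single message that fits a given term length."""
    return min(term_length // 8, MAX_MESSAGE_CAP)


term = largest_term_length(32 * 1024 * 1024)    # example: 32 MiB L3 cache
print(term // (1024 * 1024), "MiB")             # 8 MiB
print(max_message_length(term) // 1024, "KiB")  # 1024 KiB
```

If the expected peak message size is close to the computed maximum message length, the cache guideline and the message size requirement are in tension, and it may be better to shrink or split the messages than to grow termLength further.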