Failure-Mode Runbook: Multi-AZ and Aeron® Edge Cases

When things break at 3am, you want a table, not a theory. This runbook covers the two failure classes that page on-call most often: infrastructure-level outages (AZ loss, cross-AZ latency, region-wide events) and Aeron-specific edge cases (media driver crashes, archive I/O errors, component desync, mark file corruption, log buffer overflow). For each, you get the data-loss risk, the availability impact, the recovery complexity, and the remediation.

Multi-AZ and cross-region failures

Infrastructure failures are mostly about quorum. Lose a minority of nodes and you’re fine. Lose a majority and you’re down. The table below is the fast lookup.

Case	Data Loss Risk	Availability Impact	Recovery Complexity	Remediation
AZ failure (minority nodes)	None	None (quorum held)	Low — auto catchup on AZ recovery	Wait for AZ recovery; nodes auto-rejoin and catch up. Ensure nodes spread across AZs (1 per AZ for 3-node cluster)
AZ failure (majority nodes)	Uncommitted msgs	Total	High — manual intervention	Rebalance node placement across AZs; restore failed nodes or provision new ones in healthy AZs
Cross-AZ latency spike	None	Degraded (slow commits, false elections)	Low — tune timeouts	Increase election/heartbeat timeouts for cross-AZ deployment; use placement groups to minimize latency
Region-wide outage	Possible	Total	Critical — cross-region restore	Activate standby cluster in DR region; restore from cross-region replicated backups (snapshots + logs)

Node placement strategy

Node placement is the single biggest lever on the infrastructure side. It is a latency-versus-resilience tradeoff, and you must choose it deliberately.

3-node cluster across 3 AZs (recommended)

Any single AZ failure → quorum maintained (2/3).
Tradeoff: cross-AZ latency on every commit (leader must replicate to at least one other AZ).

3-node cluster in 2 AZs (latency optimized)

Placement: two nodes in AZ-a, one node in AZ-b.
AZ-b failure → quorum maintained (2/3 remain in AZ-a).
AZ-a failure → quorum lost (only 1/3 remains in AZ-b).
Tradeoff: lower latency (leader + one follower same AZ) but asymmetric failure tolerance.
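
The quorum arithmetic behind both placements can be checked with a few lines of Python. This is a minimal sketch, not part of any Aeron tooling: the function names and the AZ labels (`az-a`, `az-b`, `az-c`) are illustrative assumptions, and the only rule encoded is that a cluster needs a strict majority of its nodes.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def az_failure_outcomes(placement: dict[str, int]) -> dict[str, bool]:
    """Map each AZ to True if quorum survives losing that AZ alone.

    `placement` maps an AZ label to the number of cluster nodes in it.
    """
    total = sum(placement.values())
    needed = quorum_size(total)
    return {az: (total - count) >= needed for az, count in placement.items()}


# The two placements discussed above (AZ labels are hypothetical).
three_az = az_failure_outcomes({"az-a": 1, "az-b": 1, "az-c": 1})
two_az = az_failure_outcomes({"az-a": 2, "az-b": 1})

print("3 AZs:", three_az)
print("2 AZs:", two_az)
```

With one node per AZ, every entry is True. With the 2+1 split, only the loss of the single-node AZ is survivable, which is the asymmetry described above.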

Aeron-specific edge cases

These are the failures that don’t show up in a generic distributed-systems playbook. They come from Aeron’s component model: the Media Driver, Consensus Module, and Service Container are separate pieces that can fail independently.

Case	Data Loss Risk	Availability Impact	Recovery Complexity	Remediation
Media Driver crash	None	Single node down	Low — restart node	Restart entire node (Media Driver + Consensus Module + Service Container); investigate root cause (driver bug, resource issue)
Archive recording failure (I/O error)	None	Degraded to Total (if leader)	Medium — take node out of service	Remove node from service; fix I/O issue (replace disk, fix permissions); wipe and restart to rebuild from peers
Consensus Module & Service Container desync	None (but stale)	Degraded (not processing msgs)	Medium — health monitoring needed	Implement liveness checks for all components; restart full node if service container is down
Mark file corruption	None	Single node cannot start	Low — clean up mark files	Delete stale mark files from clusterDir and aeronDir; restart node. Ensure no duplicate processes running
Log buffer overflow (termLength too small)	None	Degraded (back-pressure)	Low — increase termLength	Increase termLength in MediaDriver and ConsensusModule config; restart cluster with coordinated rolling restart
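
The "liveness checks for all components" remediation can be sketched as a staleness test over per-component heartbeats. Everything here is an assumption for illustration: the component names, the 10-second default timeout, and the idea that your monitoring already collects a last-seen timestamp per component. It does not call any Aeron API.

```python
# Hypothetical component labels; use whatever names your monitoring reports.
COMPONENTS = ("media_driver", "consensus_module", "service_container")


def stale_components(last_seen_ms: dict[str, int], now_ms: int, timeout_ms: int) -> list[str]:
    """Return components whose last heartbeat is missing or older than timeout_ms."""
    stale = []
    for name in COMPONENTS:
        seen = last_seen_ms.get(name)
        if seen is None or now_ms - seen > timeout_ms:
            stale.append(name)
    return stale


def node_action(last_seen_ms: dict[str, int], now_ms: int, timeout_ms: int = 10_000) -> str:
    """Any stale component means the whole node is restarted, as in the table above."""
    return "restart-full-node" if stale_components(last_seen_ms, now_ms, timeout_ms) else "ok"


# Example: the service container stopped reporting 20 seconds ago.
now = 100_000
heartbeats = {"media_driver": 99_500, "consensus_module": 99_800, "service_container": 80_000}
stale = stale_components(heartbeats, now, 10_000)
action = node_action(heartbeats, now)
print(stale, action)
```

Checking all three components matters because a node whose Consensus Module is alive but whose Service Container is down still looks healthy to a single-process check while it processes nothing.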

Understanding mark files

Mark files are sentinel files that Aeron uses to detect whether another instance is already running in the same directory. Corruption typically happens when:

A process is killed without cleanup (kill -9).
Two processes accidentally share the same aeron directory.
The filesystem itself is corrupted.

The fix is simple: delete the mark files and restart. But always verify no other process is using the same directory first.
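
A cleanup helper that defaults to a dry run makes the "verify first" step harder to skip. The sketch below is illustrative only: the glob pattern `*mark*.dat` is an assumption about how mark files are named in your deployment, and the `confirmed_unused` flag stands in for whatever process check your on-call procedure uses. Adjust both before relying on it.

```python
import tempfile
from pathlib import Path

# Assumption: mark files match this pattern; confirm against your own directories.
MARK_PATTERN = "*mark*.dat"


def clean_mark_files(dirs, confirmed_unused=False, pattern=MARK_PATTERN):
    """List mark-file candidates in `dirs`; delete them only when confirmed_unused is True.

    Returns the candidate paths either way, so a dry run shows what would be removed.
    """
    candidates = [p for d in dirs for p in sorted(Path(d).glob(pattern))]
    if confirmed_unused:
        for path in candidates:
            path.unlink()
    return candidates


# Demonstration against a throwaway directory (file names are examples only).
with tempfile.TemporaryDirectory() as tmp:
    mark = Path(tmp) / "cluster-mark.dat"
    other = Path(tmp) / "recording.log"
    mark.write_bytes(b"")
    other.write_bytes(b"")

    dry_run = clean_mark_files([tmp])            # nothing deleted yet
    still_there = mark.exists()

    clean_mark_files([tmp], confirmed_unused=True)
    removed = not mark.exists()
    other_kept = other.exists()

print([p.name for p in dry_run], still_there, removed, other_kept)
```

Run it once without the flag, read the list, confirm that no other process is using those directories, and only then run it with `confirmed_unused=True`.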

Log buffer overflow

When messages are larger than expected or burst rates exceed the term buffer capacity, publishers hit back-pressure, which can cascade into an availability problem.

The fix is to increase termLength, but remember the L3 cache constraint from the term buffer sizing guidance: the term buffer should be smaller than one third of the L3 cache size.
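
The sizing rule can be turned into a quick calculation. This is a minimal sketch under stated assumptions: it treats the term length as a power of two between 64 KiB and 1 GiB (Aeron's documented bounds as far as I know, so verify them against your version), and it reads "less than one third" as a strict inequality. The function name is hypothetical.

```python
TERM_MIN_LENGTH = 64 * 1024            # assumed lower bound (64 KiB)
TERM_MAX_LENGTH = 1024 * 1024 * 1024   # assumed upper bound (1 GiB)


def max_term_length_for_l3(l3_cache_bytes: int) -> int:
    """Largest power-of-two term length strictly below one third of the L3 cache."""
    if TERM_MIN_LENGTH * 3 >= l3_cache_bytes:
        raise ValueError("L3 cache too small for even the minimum term length")
    term = TERM_MIN_LENGTH
    # Double while the next size still fits under L3/3 and under the upper bound.
    while term * 2 * 3 < l3_cache_bytes and term * 2 <= TERM_MAX_LENGTH:
        term *= 2
    return term


MIB = 1024 * 1024
for l3_mib in (32, 48, 64):
    term = max_term_length_for_l3(l3_mib * MIB)
    print(f"L3 = {l3_mib} MiB -> termLength <= {term // MIB} MiB")
```

A 48 MiB cache yields 8 MiB rather than 16 MiB, because 16 MiB is exactly one third and the guidance asks for strictly less. If the workload needs a larger term than this ceiling allows, that is a signal to revisit message sizes or burst rates rather than to keep raising termLength.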