Failure-Mode Runbook: Multi-AZ and Aeron Edge Cases
When things break at 3am, you want a table, not a theory. This runbook covers the two failure classes that page on-call most often: infrastructure-level outages (AZ loss, cross-AZ latency, region-wide events) and Aeron-specific edge cases (media driver crashes, archive I/O errors, component desync, mark file corruption, log buffer overflow). For each, you get the data-loss risk, the availability impact, the recovery complexity, and the remediation.
Multi-AZ and cross-region failures
Infrastructure failures are mostly about quorum. Lose a minority of nodes and you’re fine. Lose a majority and you’re down. The table below is the fast lookup.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| AZ failure (minority nodes) | None | None (quorum held) | Low — auto catchup on AZ recovery | Wait for AZ recovery; nodes auto-rejoin and catch up. Ensure nodes spread across AZs (1 per AZ for 3-node cluster) |
| AZ failure (majority nodes) | Uncommitted msgs | Total | High — manual intervention | Rebalance node placement across AZs; restore failed nodes or provision new ones in healthy AZs |
| Cross-AZ latency spike | None | Degraded (slow commits, false elections) | Low — tune timeouts | Increase election/heartbeat timeouts for cross-AZ deployment; use placement groups to minimize latency |
| Region-wide outage | Possible | Total | Critical — cross-region restore | Activate standby cluster in DR region; restore from cross-region replicated backups (snapshots + logs) |
Node placement strategy
Node placement is the single biggest lever on the infrastructure side. It is a latency-versus-resilience tradeoff, and you must choose it deliberately.
3-node cluster across 3 AZs (recommended)
- Any single AZ failure → quorum maintained (2/3).
- Tradeoff: cross-AZ latency on every commit (leader must replicate to at least one other AZ).
3-node cluster in 2 AZs (latency optimized)
- AZ-b failure → quorum maintained (2/3 in AZ-a).
- AZ-a failure → quorum lost (only 1/3 in AZ-b).
- Tradeoff: lower latency (leader + one follower same AZ) but asymmetric failure tolerance.
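The two layouts above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `survives_az_loss` and the AZ labels are assumptions, and it models nothing beyond the majority rule (quorum = more than half of all nodes).

```python
def survives_az_loss(placement):
    """Map each AZ to True if the cluster keeps quorum when that AZ fails.

    placement: dict of AZ name -> number of cluster nodes in that AZ
    (hypothetical input format, chosen for this sketch).
    """
    total = sum(placement.values())
    quorum = total // 2 + 1  # strict majority of all nodes
    return {az: (total - count) >= quorum for az, count in placement.items()}


if __name__ == "__main__":
    # 3 nodes across 3 AZs: any single AZ can fail.
    print(survives_az_loss({"az-a": 1, "az-b": 1, "az-c": 1}))
    # 3 nodes in 2 AZs: losing the AZ with two nodes loses quorum.
    print(survives_az_loss({"az-a": 2, "az-b": 1}))
```

Running the same check against a proposed layout before deployment makes the asymmetric failure tolerance of the 2-AZ option explicit.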
Aeron-specific edge cases
These are the failures that don’t show up in a generic distributed-systems playbook. They come from Aeron’s component model: the Media Driver, Consensus Module, and Service Container are separate pieces that can fail independently.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Media Driver crash | None | Single node down | Low — restart node | Restart entire node (Media Driver + Consensus Module + Service Container); investigate root cause (driver bug, resource issue) |
| Archive recording failure (I/O error) | None | Degraded to Total (if leader) | Medium — take node out of service | Remove node from service; fix I/O issue (replace disk, fix permissions); wipe and restart to rebuild from peers |
| Consensus Module & Service Container desync | None (but stale) | Degraded (not processing msgs) | Medium — health monitoring needed | Implement liveness checks for all components; restart full node if service container is down |
| Mark file corruption | None | Single node cannot start | Low — clean up mark files | Delete stale mark files from clusterDir and aeronDir; restart node. Ensure no duplicate processes running |
| Log buffer overflow (termLength too small) | None | Degraded (back-pressure) | Low — increase termLength | Increase termLength in MediaDriver and ConsensusModule config; restart cluster with coordinated rolling restart |
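The desync row calls for liveness checks on all components. One way to frame that is a per-node decision function fed by whatever heartbeat signal you already collect. This is a minimal sketch, not an Aeron API: the component keys, the `node_action` name, and the 10-second timeout are all assumptions made for illustration.

```python
# Hypothetical component labels; map them to your own monitoring signals.
COMPONENTS = ("media_driver", "consensus_module", "service_container")


def node_action(last_heartbeat_s, now_s, timeout_s=10.0):
    """Return ("ok", []) if every component is fresh, else ("restart_node", stale).

    last_heartbeat_s: dict of component name -> last heartbeat time in seconds.
    A component with no recorded heartbeat is treated as stale.
    """
    stale = [
        name
        for name in COMPONENTS
        if now_s - last_heartbeat_s.get(name, float("-inf")) > timeout_s
    ]
    # Per the table, a down component means restarting the full node,
    # not just the piece that stopped.
    return ("restart_node", stale) if stale else ("ok", [])
```

Returning the list of stale components alongside the action keeps the root-cause trail intact even though the remediation is the same full-node restart.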
Understanding mark files
Mark files are sentinel files that Aeron uses to detect whether another instance is already running in the same directory. Corruption typically happens when:
- A process is killed without cleanup (kill -9).
- Two processes accidentally share the same aeron directory.
- Filesystem corruption.
The fix is simple: delete the mark files and restart. But always verify no other process is using the same directory first.
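The verify-then-delete order can be enforced in a small helper. The sketch below is an assumption-heavy illustration: the glob pattern `*mark*.dat` is a guess at how mark files are named in your deployment, and the "in use" test is a crude heuristic that only looks for the directory path in a list of command lines you supply (for example from ps output). It will miss processes started with relative paths or default directories, so treat it as a guard rail, not proof.

```python
from pathlib import Path


def directory_in_use(directory, running_cmdlines):
    """Heuristic: True if any supplied command line mentions the directory."""
    target = str(Path(directory).resolve())
    return any(target in cmd for cmd in running_cmdlines)


def delete_mark_files(directory, running_cmdlines, pattern="*mark*.dat"):
    """Delete files matching pattern, refusing if the directory looks in use.

    pattern is an assumed naming convention; adjust it to the files
    actually present in your clusterDir and aeronDir.
    Returns the names of the files removed.
    """
    if directory_in_use(directory, running_cmdlines):
        raise RuntimeError(f"a running process references {directory}")
    removed = []
    for path in sorted(Path(directory).glob(pattern)):
        path.unlink()
        removed.append(path.name)
    return removed
```

Run it once per directory (clusterDir and aeronDir), then restart the node.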
Log buffer overflow
When messages are larger than expected or burst rates exceed the term buffer capacity, back-pressure cascades into an availability problem.
The fix is to increase termLength, but remember the L3 cache constraint from the term buffer sizing guidance: the term buffer should be smaller than one third of the L3 cache size.
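To turn that constraint into a number, you can compute the largest valid term length that stays under one third of the L3 cache. The sketch below assumes Aeron's usual rule that term lengths are a power of two between 64 KiB and 1 GiB (check this against your Aeron version). The function name `max_term_length` is hypothetical.

```python
TERM_MIN_LENGTH = 64 * 1024            # assumed Aeron minimum term length
TERM_MAX_LENGTH = 1024 * 1024 * 1024   # assumed Aeron maximum term length


def max_term_length(l3_cache_bytes):
    """Largest power-of-two term length strictly below l3_cache_bytes / 3."""
    if TERM_MIN_LENGTH * 3 >= l3_cache_bytes:
        raise ValueError("L3 cache too small for even the minimum term length")
    term = TERM_MIN_LENGTH
    # Keep doubling while the next size is still valid and under 1/3 of L3.
    while term * 2 <= TERM_MAX_LENGTH and term * 2 * 3 < l3_cache_bytes:
        term *= 2
    return term


if __name__ == "__main__":
    l3 = 32 * 1024 * 1024  # example: 32 MiB L3 cache
    print(max_term_length(l3) // (1024 * 1024), "MiB")
```

Treat the result as an upper bound: if back-pressure disappears at a smaller termLength, prefer the smaller value.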