Cluster Standby and Multi-AZ HA Design
Standby is where HA stops fighting latency. Aeron Premium’s Cluster Standby gives you cross-AZ resilience, sub-minute recovery, and snapshots that never stall the order book — all without dragging Raft consensus across availability zones. This page synthesizes the design and shows how each property maps to your latency and throughput numbers.
Three design pillars
Cluster Standby rests on three properties, each solving a specific production pain point. They work together: fast snapshots keep the primary unblocked, in-place identity switch gives sub-minute RTO, and cross-AZ placement gives real HA.
- Fast: Snapshots happen on standby nodes, so the primary cluster never stops processing.
- Identity switch: A standby converts in-place to a full Raft member, cutting recovery time from more than 5 minutes to under 1 minute.
- Stable: Standby nodes can run in a different Availability Zone, providing genuine cross-AZ resilience without the latency penalty of cross-AZ Raft replication on the hot path.
The standby is warm, not cold. It processes every message to maintain state — it does not merely store data.
Fast: snapshots only on standby nodes
Snapshots are the silent killer of hot-path latency in OSS Aeron. The leader must pause to serialize its entire in-memory state, and during that pause order processing speed drops to zero: a full stop-the-world event.
Without Cluster Standby (OSS)
Order processing speed:
████████████████░░░░░░████████████████
                ↑ Snapshot window: processing speed drops to zero while the leader takes a snapshot
- The leader node must pause to serialize its entire in-memory state.
- During the snapshot, order processing speed = 0 — a full stop-the-world event.
- The cluster runs 3 nodes, each with its own archive, all participating in Raft.
With Cluster Standby (Premium)
Order processing speed:
████████████████████████████████████████
(no interruption: the snapshot happens on the standby)
- The snapshot is taken on the standby node, not the leader.
- The primary cluster never pauses for snapshots.
- The standby processes every message to maintain state, then snapshots independently.
- Snapshots taken on standby nodes are shipped back to the active cluster.
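The difference between the two timelines can be reproduced with a toy model. The sketch below is not Aeron code: the function names, the tick granularity, and the snapshot schedule (every 8 ticks, lasting 2 ticks) are assumptions chosen only to illustrate why a leader-side snapshot shows up as a gap in processing while a standby-side snapshot does not.

```python
def leader_timeline(ticks, snapshot_every, snapshot_ticks, snapshot_on_leader):
    """Toy model: 1 if the leader processes orders during a tick, else 0."""
    timeline = []
    for t in range(ticks):
        # A snapshot starts every `snapshot_every` ticks (never at t = 0)
        # and lasts `snapshot_ticks` ticks.
        in_window = t >= snapshot_every and t % snapshot_every < snapshot_ticks
        # Only a snapshot taken on the leader stalls order processing.
        stalled = snapshot_on_leader and in_window
        timeline.append(0 if stalled else 1)
    return timeline


def render(timeline):
    """Draw the timeline with full blocks for work and light blocks for stalls."""
    return "".join("█" if busy else "░" for busy in timeline)


# Hypothetical schedule: 20 ticks, a snapshot every 8 ticks lasting 2 ticks.
oss = leader_timeline(20, snapshot_every=8, snapshot_ticks=2, snapshot_on_leader=True)
premium = leader_timeline(20, snapshot_every=8, snapshot_ticks=2, snapshot_on_leader=False)
print(render(oss))
print(render(premium))
```

In this model the leader loses every tick that falls inside a snapshot window when it takes the snapshot itself, and loses none when the standby does the work.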
Identity switch: standby converts to a Raft member in-place
When a node fails, the standby does not spin up a new instance. It converts in-place from standby to a full Raft consensus member. Because it has been processing every message, the state is already in memory, so the switch is essentially flipping a flag, not rebuilding state.
Without Cluster Standby — recovery > 5 minutes
When a node fails in OSS Aeron:
- Launch a new EC2 instance (depends on AZ inventory).
- Fetch the latest snapshot from the new leader.
- Replay logs to catch up.
- Join as a cluster member.
Recovery time: > 5 minutes, and it depends on whether the AZ has instances available.
With Cluster Standby — recovery < 1 minute
When a node fails with Premium Standby:
- The standby node receives a command to become a member.
- Catch up on the latest logs (minimal, since the standby is already warm).
- Join as a member.
- The standby takes a snapshot (background, non-blocking).
Recovery time: < 1 minute.
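The gap between the two recovery paths comes down to how much log has to be replayed before the node can join. The sketch below models the promotion steps as a small state machine. It is an illustration only: `Role`, `Node`, `replay_gap`, and all log positions are hypothetical names and numbers, not part of the Aeron API or its command-line tooling.

```python
from enum import Enum


class Role(Enum):
    STANDBY = "standby"          # warm: applies every message, not in Raft
    CATCHING_UP = "catching_up"  # replaying the gap to the leader's log
    MEMBER = "member"            # full Raft consensus member


def replay_gap(applied_position, leader_position):
    """Number of log entries a node must replay before it can join."""
    return leader_position - applied_position


class Node:
    def __init__(self, applied_position):
        self.role = Role.STANDBY
        self.applied_position = applied_position  # last log entry applied

    def promote(self, leader_position):
        """Convert in place; return how many log entries were replayed."""
        if self.role is not Role.STANDBY:
            raise RuntimeError(f"cannot promote from {self.role}")
        # Step 1: the promotion command arrives.
        self.role = Role.CATCHING_UP
        # Step 2: replay only the entries this node has not applied yet.
        replayed = replay_gap(self.applied_position, leader_position)
        self.applied_position = leader_position
        # Step 3: join the cluster as a member.
        self.role = Role.MEMBER
        return replayed


# Hypothetical positions: the warm standby trails the leader by 10 entries,
# while a cold replacement starts from a snapshot taken at entry 4_000.
warm = Node(applied_position=9_990)
print("warm standby replays:", warm.promote(leader_position=10_000))
print("cold replacement replays:", replay_gap(4_000, 10_000))
```

The cold path additionally pays for instance launch and snapshot transfer before any replay begins, which this model leaves out.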
Stable: cross-AZ hot standby
Here is the part that protects latency. Standby nodes form a separate backup cluster in a different Availability Zone, connected to the primary cluster via log replication. You get cross-AZ HA without paying the cross-AZ round-trip on every Raft commit, because the Raft quorum stays inside one AZ.
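To see why quorum placement matters, consider a simplified model of Raft commit latency: the leader counts as one vote and waits only for enough follower acknowledgements to reach a majority. The sketch below assumes this simplification, ignores disk and processing time, and uses made-up round-trip times (0.1 ms in-AZ, about 1 ms cross-AZ). The function name is hypothetical.

```python
def quorum_commit_rtt_ms(follower_rtts_ms):
    """Network round-trip the leader waits for before a commit (simplified).

    The leader counts as one vote, so it needs acknowledgements from
    (majority - 1) followers and waits for the slowest of the fastest ones.
    """
    cluster_size = len(follower_rtts_ms) + 1
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0  # single-node cluster: no network wait
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Hypothetical round-trip times, for illustration only.
in_az_quorum = quorum_commit_rtt_ms([0.1, 0.1])     # all 3 members in one AZ
cross_az_quorum = quorum_commit_rtt_ms([1.0, 1.2])  # one member per AZ
print(in_az_quorum, cross_az_quorum)
```

The standby's round-trip time never appears in the input list, because it is not a quorum member. That is the whole point of the design: the cross-AZ link carries replication traffic but is not on the commit path.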
Recovery targets
| Metric | Value |
|---|---|
| RTO | < 60 seconds |
| RPO | < cross-AZ/region latency (typically single-digit ms) |
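One way to read the RPO row is as an exposure window: whatever the primary committed during the replication lag has not yet reached the standby. The arithmetic below is a rough sketch under that assumption. The function name, the ingest rate, and the lag value are hypothetical.

```python
def messages_at_risk(ingest_rate_per_sec, replication_lag_us):
    """Approximate count of committed messages not yet on the standby."""
    # rate (messages/s) x lag (microseconds), converted back to seconds
    return ingest_rate_per_sec * replication_lag_us // 1_000_000


# Hypothetical load: 100,000 messages/s with a 2 ms cross-AZ replication lag.
print(messages_at_risk(100_000, 2_000))
```

With these illustrative inputs, roughly 200 messages would be in flight at the moment of a failure.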
Capabilities in detail
Cluster Standby is more than a failover trick. Its core properties open up several distinct production use cases.
Core properties
- Warm standby nodes: they process every message but do not participate in Raft consensus.
- Asynchronous replication: standby nodes are not required to replicate synchronously, so they apply no back-pressure on the leader. This is what keeps throughput and tail latency on the primary unaffected by the standby.
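The absence of back-pressure can be illustrated with a minimal model in which the leader's commit path never calls into the standby. The sketch below assumes a pull-based standby for simplicity. The source only states that replication is asynchronous, so the pull mechanism, the class names, and the methods are all hypothetical.

```python
class Leader:
    def __init__(self):
        self.log = []

    def commit(self, message):
        # Appends and returns immediately: there is no call into the standby,
        # so a slow or distant standby cannot delay this path.
        self.log.append(message)
        return len(self.log)  # log position after this entry


class Standby:
    def __init__(self, leader):
        self.leader = leader
        self.applied = []

    def poll(self, max_entries):
        """Pull up to max_entries new log entries at the standby's own pace."""
        start = len(self.applied)
        batch = self.leader.log[start:start + max_entries]
        self.applied.extend(batch)
        return len(batch)

    def lag(self):
        """Entries committed on the leader but not yet applied here."""
        return len(self.leader.log) - len(self.applied)


leader = Leader()
standby = Standby(leader)
for order_id in range(5):
    leader.commit(f"order-{order_id}")  # completes even though standby is idle
print("lag before polling:", standby.lag())
standby.poll(max_entries=10)
print("lag after polling:", standby.lag())
```

A stalled standby only increases its own lag in this model. The leader's commit loop runs to completion regardless.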
Cross-datacenter replication
- Disaster recovery: run a standby cluster in another DC or region.
- Daisy chainable: a standby can replicate to further standby clusters, for example DC1 → DC2 → DC3.
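In a daisy chain, each downstream cluster sees the lag of every link before it. The sketch below assumes link lags simply add up, which is a simplification, and uses hypothetical lag values. The function name is also an assumption.

```python
from itertools import accumulate


def chain_lag_us(link_lags_us):
    """Cumulative replication lag at each downstream cluster in the chain."""
    return list(accumulate(link_lags_us))


# Hypothetical links: DC1 -> DC2 is 2 ms away, DC2 -> DC3 is 30 ms away.
lags = chain_lag_us([2_000, 30_000])
print(dict(zip(["DC2", "DC3"], lags)))
```

Under these assumptions DC3 trails the primary by the sum of both links, while the primary in DC1 is unaffected because it waits on neither hop.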
Background snapshots
- Snapshots are taken on standby nodes and shipped back to the active cluster.
- The active cluster never pauses for snapshots.
Queries
- Standby nodes can serve read queries without loading the leader.
- State is near-real-time, bounded by replication lag.
Node replacement
- Join an existing cluster to replace a failed node: a command-line tool converts a Standby Node to a Member Node.
- Rolling upgrades: join a standby running the new software version, then decommission the old nodes.
How these knobs move your numbers
Cluster Standby is fundamentally a latency-preservation feature. The wins map directly:
- p99 / max latency: Moving snapshots off the leader removes the periodic stop-the-world pause that otherwise spikes the tail. Keeping the Raft quorum inside one AZ avoids adding cross-AZ round-trips to every commit.
- Throughput: Standby replication is asynchronous, so it applies no back-pressure on the leader. The primary commits as fast as its in-AZ quorum allows, regardless of how far away the standby is or how many hops the daisy chain has.
- p50: Stays tight for the same reasons. No quorum member sits across an AZ boundary, so the median commit path is the fast in-AZ path.
In short: you get cross-AZ HA, < 60s RTO, and single-digit-millisecond RPO while your hot path keeps running as if there were no standby at all.