Cluster Standby and Multi-AZ HA Design

Standby is where HA stops fighting latency. Aeron Premium’s Cluster Standby gives you cross-AZ resilience, sub-minute recovery, and snapshots that never stall the order book, all without stretching Raft consensus across availability zones. This page summarizes the design and shows how each property affects your latency and throughput numbers.

Three design pillars

Cluster Standby rests on three properties, each solving a specific production pain point. They work together: taking snapshots on the standby keeps the primary unblocked, the in-place identity switch gives sub-minute RTO, and cross-AZ placement gives real HA.

Fast: Snapshots happen on standby nodes, so the primary cluster never stops processing.
Identity switch: A standby converts in place to a full Raft member, cutting recovery time from more than 5 minutes to under 1 minute.
Stable: Standby nodes can run in a different Availability Zone, providing genuine cross-AZ resilience without the latency penalty of cross-AZ Raft replication on the hot path.

The standby is warm, not cold. It processes every message to keep its state current; it does not merely store data.

Fast: snapshots only on standby nodes

Snapshots are the silent killer of hot-path latency in OSS Aeron, because the leader itself has to take them.

Without Cluster Standby (OSS)

Order Processing Speed
  ████████████████░░░░░░████████████████
                  ↑
          Snapshot window:
     Processing speed drops to ZERO
     while leader takes snapshot

The leader node must pause to serialize its entire in-memory state.
During the snapshot, order processing speed = 0 — a full stop-the-world event.
The cluster runs 3 nodes, each with its own archive, all participating in Raft.

With Cluster Standby (Premium)

Order Processing Speed
  ████████████████████████████████████████
     (no interruption — snapshot happens on standby)

The snapshot is taken on the standby node, not the leader.
The primary cluster never pauses for snapshots.
The standby processes every message to maintain state, then snapshots independently.
Snapshots taken on standby nodes are shipped back to the active cluster.
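
To get a feel for what a stop-the-world snapshot costs, the sketch below models a leader with a fixed arrival rate and service rate and computes the backlog that builds during the pause and how long it takes to drain. It is a back-of-the-envelope model only: the function name, rates, and pause length are illustrative assumptions, not Aeron measurements.

```python
def snapshot_stall(arrival_rate, service_rate, pause_s):
    """Backlog and drain time caused by one stop-the-world pause.

    arrival_rate, service_rate: messages per second (hypothetical values).
    pause_s: seconds the leader stops processing while it snapshots.
    """
    if service_rate <= arrival_rate:
        raise ValueError("leader must outpace arrivals or the queue never drains")
    backlog = arrival_rate * pause_s                    # messages queued during the pause
    drain_s = backlog / (service_rate - arrival_rate)   # time to work the queue back to empty
    worst_delay_s = pause_s                             # a message arriving as the pause starts waits all of it
    return backlog, drain_s, worst_delay_s


# Leader takes the snapshot: 2 s pause, 50k msg/s arriving, 200k msg/s capacity.
backlog, drain_s, worst = snapshot_stall(50_000, 200_000, 2.0)
print(f"backlog={backlog:.0f} msgs, drain={drain_s:.2f}s, worst added delay={worst:.1f}s")

# Snapshot on the standby: the leader's pause is zero, so nothing queues.
no_pause = snapshot_stall(50_000, 200_000, 0.0)
print(no_pause)
```

The point of the model is that the cost of a leader-side snapshot is not only the pause itself but also the drain period after it, during which every message still sees elevated latency.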

Identity switch: standby converts to a Raft member in-place

When a node fails, no new instance has to be launched. The standby converts in place from standby to full Raft consensus member. Because it has been processing every message, its state is already in memory, so the switch needs only a short log catch-up rather than a state rebuild.

Without Cluster Standby — recovery > 5 minutes

When a node fails in OSS Aeron:

Launch a new EC2 instance, which depends on available capacity in the AZ.
Fetch the latest snapshot from the current leader.
Replay the log to catch up.
Join as a cluster member.

Recovery time: > 5 minutes, and it depends on the AZ having instances available.

With Cluster Standby — recovery < 1 minute

When a node fails with Premium Standby:

The standby node receives a command to become a member.
It catches up on the latest log entries, which is quick because the standby is already warm.
It joins the cluster as a member.
The standby takes a snapshot in the background, without blocking the cluster.

Recovery time: < 1 minute.
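
The promotion steps above can be pictured as a small state machine. The sketch below is a conceptual model only: `Role`, `StandbyNode`, and `promote` are hypothetical names invented for illustration and do not correspond to Aeron's actual API or tooling.

```python
from enum import Enum, auto


class Role(Enum):
    STANDBY = auto()      # processes every message, outside the Raft quorum
    CATCHING_UP = auto()  # promotion requested, applying the tail of the log
    MEMBER = auto()       # full Raft consensus member


class StandbyNode:
    def __init__(self, applied_position):
        self.role = Role.STANDBY
        self.applied_position = applied_position  # last log position applied to in-memory state

    def promote(self, leader_position):
        """Convert in place: catch up on the log tail, then become a member."""
        if self.role is not Role.STANDBY:
            raise RuntimeError(f"cannot promote from {self.role.name}")
        if leader_position < self.applied_position:
            raise ValueError("standby cannot be ahead of the leader")
        self.role = Role.CATCHING_UP
        replayed = leader_position - self.applied_position  # only the tail; state is already warm
        self.applied_position = leader_position
        self.role = Role.MEMBER
        return replayed


node = StandbyNode(applied_position=9_990)
replayed = node.promote(leader_position=10_000)
print(node.role.name, "after replaying", replayed, "entries")
```

Because the node's state is already in memory, the only work proportional to data volume is replaying the short tail of the log, which is why this path is so much shorter than launching an instance and loading a snapshot.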

Stable: cross-AZ hot standby

This is the property that protects latency. Standby nodes form a separate backup cluster in a different Availability Zone, connected to the primary cluster by log replication. Because the Raft quorum stays inside one AZ, you get cross-AZ HA without paying a cross-AZ round trip on every Raft commit.
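
Why quorum placement matters can be shown with a simple model of majority commit: the leader counts as one vote and must wait for enough follower acknowledgements to reach a majority. The function below is a generic sketch of that rule, not Aeron code, and the round-trip times are hypothetical.

```python
def commit_latency_ms(follower_rtts_ms):
    """Commit latency when a majority of the cluster must acknowledge.

    follower_rtts_ms: round-trip time from the leader to each follower.
    The leader is one vote, so it needs (majority - 1) follower acks and
    commits when the slowest of those fastest acks arrives.
    """
    cluster_size = len(follower_rtts_ms) + 1
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Assumed RTTs: 0.1 ms inside an AZ, 1.5 ms between AZs.
same_az = commit_latency_ms([0.1, 0.1])   # 3-node quorum inside one AZ
spread = commit_latency_ms([1.5, 1.5])    # 3-node quorum with one node per AZ
print(same_az, spread)
```

The standby does not appear in the follower list at all: it replicates asynchronously, so its distance from the leader never enters the commit path.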

Recovery targets

Metric	Value
RTO	< 60 seconds
RPO	< cross-AZ or cross-region latency (typically single-digit ms across AZs)
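
An RPO expressed as time can be translated into messages at risk by multiplying it by the message rate. The helper below is a hypothetical illustration of that arithmetic; the rate and lag values are assumptions chosen for round numbers.

```python
def messages_at_risk(msg_rate_per_s, replication_lag_ms):
    """Messages committed on the primary but not yet received by the standby."""
    return msg_rate_per_s * replication_lag_ms / 1000.0


# Assumed workload: 100,000 msg/s with 2 ms of replication lag.
exposure = messages_at_risk(100_000, 2)
print(exposure)
```

This is the window of messages that would need to be recovered by other means if the primary AZ were lost at that instant.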

Capabilities in detail

Cluster Standby is more than a failover trick. Its core properties open up several distinct production use cases.

Core properties

Warm standby nodes — they process every message but do not participate in Raft consensus.
Standby nodes are not required to replicate synchronously, so they apply no back-pressure on the leader. This is what keeps the primary’s throughput and tail latency unaffected by the standby.

Cross-datacenter replication

Disaster Recovery — run a standby cluster in another DC or region.
Daisy chainable — a standby can replicate to further standby clusters: DC1 → DC2 → DC3.
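
In a daisy chain, each site receives the log from the one before it, so a reasonable first approximation is that lag behind the primary adds up hop by hop. The sketch below assumes exactly that additive model; the site names and per-hop lags are made up for illustration.

```python
from itertools import accumulate


def cumulative_lag_ms(hop_lags_ms):
    """Lag behind the primary at each site, assuming per-hop lags simply add."""
    return list(accumulate(hop_lags_ms))


sites = ["DC2", "DC3"]     # chain: DC1 -> DC2 -> DC3
hop_lags = [2.0, 40.0]     # assumed lag of each hop in ms
lags = cumulative_lag_ms(hop_lags)
for site, lag in zip(sites, lags):
    print(f"{site}: {lag} ms behind DC1")
```

The RPO of the last site in the chain is therefore governed by the whole path, while the primary's own commit latency is unaffected by any of it.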

Background snapshots

Snapshots are taken on standby nodes and shipped back to the active cluster.
The active cluster never pauses for snapshots.

Queries

Standby nodes can serve read queries without loading the leader.
State is near-real-time, bounded by replication lag.
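
Since standby state trails the leader by the replication lag, a query layer may want to refuse reads that are too stale. The sketch below shows one way to express that bound. Everything in it is hypothetical: the function name, the use of log positions as the lag measure, and the assumption that the standby knows the leader's position.

```python
def read_from_standby(standby_position, leader_position, max_lag, read_fn):
    """Serve a read from standby state only if it is fresh enough.

    Positions are log positions; how the standby learns the leader's
    position is outside this sketch. Returns (served, value).
    """
    lag = leader_position - standby_position
    if lag > max_lag:
        return False, None   # too stale: caller retries or routes elsewhere
    return True, read_fn()


book = {"best_bid": 101.25}  # stand-in for state the standby has built up
fresh = read_from_standby(9_995, 10_000, 10, lambda: book["best_bid"])
stale = read_from_standby(9_000, 10_000, 10, lambda: book["best_bid"])
print(fresh, stale)
```

Either way the leader is never consulted for the read itself, which is what keeps query load off the hot path.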

Node replacement

Join an existing cluster to replace a failed node — a command-line tool converts a Standby Node to a Member Node.
Rolling upgrades — join a standby running the new software version, then decommission the old nodes.
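
A rolling upgrade built on this capability can be written down as an ordered plan. The generator below is a sketch under assumptions: it replaces one node at a time and waits for each standby to catch up before converting it, which the text above does not prescribe. Node names and step wording are invented, and no real command-line syntax is implied.

```python
def rolling_upgrade_plan(old_nodes, new_version):
    """Ordered steps to swap each old member for a standby on the new version."""
    steps = []
    for old in old_nodes:
        new = f"{old}-{new_version}"   # hypothetical naming scheme for the replacement
        steps.append(f"start standby {new} running {new_version}")
        steps.append(f"wait until {new} has caught up")
        steps.append(f"convert {new} from standby to member")
        steps.append(f"decommission {old}")
    return steps


plan = rolling_upgrade_plan(["node0", "node1", "node2"], "v2")
for number, step in enumerate(plan, start=1):
    print(number, step)
```

Replacing one node at a time keeps the quorum size constant throughout, at the cost of a longer overall upgrade.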

How these properties move your numbers

Cluster Standby is fundamentally a latency-preservation feature. Each property maps to a specific metric:

p99 / max latency: Moving snapshots off the leader removes the periodic stop-the-world pause that otherwise spikes the tail. Keeping the Raft quorum inside one AZ avoids adding cross-AZ round trips to every commit.
Throughput: Standby replication is asynchronous, so it applies no back-pressure on the leader. The primary commits as fast as its in-AZ quorum allows, regardless of how far away the standby is or how many hops the daisy chain has.
p50: Stays tight for the same reasons: no quorum member sits across an AZ boundary, so the median commit path is the fast in-AZ path.
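
The difference between median and tail impact is easy to see with synthetic numbers. The sketch below computes nearest-rank percentiles over two made-up latency samples, one steady and one where 2% of requests land in a 2 s snapshot pause. The data is fabricated purely to illustrate the shape of the effect and is not a benchmark.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # ceil(pct / 100 * n) in integer math
    return ordered[rank - 1]


# Synthetic commit latencies in ms: 1000 requests at a steady 0.1 ms.
steady = [0.1] * 1000

# Same workload, but 20 requests arrive during a 2 s leader pause
# and each waits for whatever remains of it.
stalled = [2000.0 - 100.0 * i for i in range(20)]
with_pause = [0.1] * 980 + stalled

for name, data in (("snapshot on standby", steady), ("snapshot on leader", with_pause)):
    print(name, "p50:", percentile(data, 50), "p99:", percentile(data, 99), "max:", max(data))
```

In this toy data the median is identical in both cases, while p99 and max differ by several orders of magnitude, which is why snapshot pauses show up in tail metrics long before they show up in averages.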

In short: you get cross-AZ HA, < 60s RTO, and single-digit-millisecond RPO while your hot path keeps running as if there were no standby at all.