Leader Placement and Preferred-Leader Control

In a Raft cluster the leader carries every write, so where the leader sits determines latency. When nodes span availability zones, a leader in a remote AZ adds a cross-AZ round trip to the client-facing hot path — client ingress → consensus → egress all run through the leader. This page is about keeping the leader in the zone you want, and moving it back there after a failover, without giving up Aeron’s real failover behaviour.

Aeron Cluster ships no transferLeadershipTo and no election priority — but a production-safe preferred-leader pattern is achievable by biasing the election and triggering a graceful step-down. That pattern, and its safety argument, is the subject of this page.

For the byte-level election internals — terms, canvass, vote records, the end-of-stream frame flag on the wire — see The Aeron Files. This page is the operational view: what control you have, how to build the missing piece safely, and what it costs.

Why leader placement matters

A real deployment usually runs two clustered services on the hot path — an OMS cluster and a matching-engine (ME) cluster — and an order flows OMS leader → ME leader. All the inter-cluster traffic goes leader-to-leader, so the latency of that hop is set by where the two leaders sit relative to each other. In the lowest-latency configuration both leaders sit in the same AZ, so the OMS→ME hop stays intra-AZ and p50/p99 stay tight.

Now the ME leader fails. Raft elects a new ME leader from the surviving ME members — an election decided by the protocol (timeouts and votes), not by zone. The new ME leader can land in a different AZ from the OMS leader, and from that point every order crosses an AZ boundary on the OMS→ME hop — even though the OMS leader never moved.

This is the trap: a single ME failover, with the OMS leader sitting still, silently turns the busiest inter-service call into a cross-AZ round trip. Nothing is broken — the cluster is healthy and losing no data — but every order now pays the cross-AZ tax until you move the ME leader back.

Leader assignment: Aeron Cluster vs SOFAJRaft

The two implementations differ on how much native control an operator has over leadership.

Capability	SOFAJRaft	Aeron Cluster
Explicit leadership transfer	Yes — `Node.transferLeadershipTo(PeerId)` hands leadership to a chosen peer via a `TimeoutNowRequest`	No native transfer API
Election priority / preference	Yes — per-node `ElectionPriority`; higher wins, `0` = “never leader”	No native priority — but achievable by injecting a biased `Context.random()` (see below)
Pin leadership to an AZ/node	Priority + transfer	Biased election + a step-down trigger (this page)
Trigger an election on demand	`transferLeadershipTo` triggers a targeted election	Gracefully closing (or resigning) the leader triggers an immediate election the old leader can’t win

The preferred-leader pattern, in two halves

“Preferred leader with real failover” — bias leadership toward a chosen node, but fail over normally if that node is truly dead, and be able to move leadership back once it recovers — decomposes into two independent problems:

Half A — bias who wins an election. Solvable with zero core change.
Half B — trigger the handoff while a healthy leader is sitting there. No pure-injection primitive exists; you choose between two operational options.

Half A — bias the winner with an injected `Context.random()`

When an election runs, every caught-up follower becomes a candidate and waits a random nomination delay drawn uniformly from [0, election-timeout ÷ 2); with the default aeron.cluster.election.timeout of 1 s, that is a [0, 500 ms) window. The first candidate to fire claims the new term and wins, provided its log is current (voters reject anyone whose log is behind). That per-node nomination delay is the only consumer of the cluster’s Random, and Context.random(Random) is a stock public setter, so you can inject a Random whose nextDouble() returns 0 on the preferred node and behaves like the normal random elsewhere:

// Same code on every node; each computes its own bias from its member id.
final class PreferredLeaderRandom extends java.util.Random {
    private final boolean preferred;
    PreferredLeaderRandom(final boolean preferred) { this.preferred = preferred; }
    @Override public double nextDouble() {
        return preferred ? 0.0 : super.nextDouble();   // preferred → shortest nomination delay → fires first
    }
}
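
// ctx is this node's ConsensusModule.Context; PREFERRED_ID is a deployment-specific
// constant holding the member id of the node you want to lead.
ctx.random(new PreferredLeaderRandom(ctx.clusterMemberId() == PREFERRED_ID));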
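
To see why a zero draw wins the race, the nomination step can be modelled without any Aeron dependency. The sketch below is a simplified model, not Aeron code: it assumes the delay is nextDouble() scaled into [0, election-timeout ÷ 2) as described above, and ZeroBiasRandom and NominationRace are hypothetical names introduced only for illustration.

```java
import java.util.Random;

// Hypothetical stand-in for PreferredLeaderRandom, repeated so this sketch compiles on its own.
final class ZeroBiasRandom extends Random {
    private final boolean preferred;

    ZeroBiasRandom(final boolean preferred, final long seed) {
        super(seed);
        this.preferred = preferred;
    }

    @Override public double nextDouble() {
        return preferred ? 0.0 : super.nextDouble();
    }
}

final class NominationRace {
    // Assumed model of the delay: a uniform draw scaled into [0, electionTimeout / 2).
    static long nominationDelayNs(final Random random, final long electionTimeoutNs) {
        return (long) (random.nextDouble() * (electionTimeoutNs >> 1));
    }

    // Index of the candidate with the shortest delay (ties go to the lowest index).
    static int firstToFire(final Random[] candidates, final long electionTimeoutNs) {
        int winner = 0;
        long best = Long.MAX_VALUE;
        for (int i = 0; i < candidates.length; i++) {
            final long delay = nominationDelayNs(candidates[i], electionTimeoutNs);
            if (delay < best) {
                best = delay;
                winner = i;
            }
        }
        return winner;
    }
}
```

The model covers only the timing race. Whether the first candidate to fire actually wins still depends on the log checks applied by the voters, which this sketch leaves out.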

Half B — trigger the handoff

Half A decides who wins once an election runs. To move leadership off a healthy current leader (e.g. to reclaim the preferred AZ after it recovers), something must start the election. There is no supported injection-only “resign but keep serving” call, so you pick one of two options.

Option 1 — External bounce (zero core change, recommended default)

Cleanly restart the current leader’s consensus-module process (SIGTERM / close the embedding app). A clean close disconnects the leader’s log publication, which raises end-of-stream on the followers → immediate graceful election (no waiting out the 10 s heartbeat timeout), and the departed leader is excluded from the fast-path unanimity count so survivors decide quickly. Half A’s biased random then steers the win to the preferred node if it is caught up. Restart the bounced node; it rejoins as a follower.

Pros: not one line of Aeron changed; uses the well-tested graceful path; safety guaranteed by the existing gates; nothing to maintain across upgrades.
Cons: it’s a full leader restart, not an in-place resign — heavier, with a brief unavailability window while the old leader restarts and rejoins. Needs an external orchestrator to decide when to bounce and which node.
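
Option 1 depends on SIGTERM producing a clean close rather than an abrupt exit. One way the embedding app might arrange that is a JVM shutdown hook that closes whatever it opened, in reverse order. This is a minimal sketch using plain AutoCloseable so it stays library-free; OrderedShutdown is a hypothetical helper, and which Aeron components you register, and in what order, is left to your application.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Closes registered components in reverse order of registration (last opened, first closed).
final class OrderedShutdown implements AutoCloseable {
    private final Deque<AutoCloseable> components = new ArrayDeque<>();

    synchronized void register(final AutoCloseable component) {
        components.push(component);
    }

    // Installs a JVM shutdown hook so SIGTERM runs the same orderly close as an explicit close().
    void installHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::close, "ordered-shutdown"));
    }

    // Idempotent: a second call finds nothing left to close.
    @Override public synchronized void close() {
        while (!components.isEmpty()) {
            try {
                components.pop().close();
            } catch (final Exception ex) {
                System.err.println("close failed: " + ex);
            }
        }
    }
}
```

A SIGKILL bypasses shutdown hooks entirely, so an orchestrator that escalates to SIGKILL too quickly turns the planned bounce back into an ordinary failure detected by heartbeat timeout.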

Option 2 — In-place `RESIGN` (a minimal Aeron core fork)

Aeron has no built-in voluntary step-down that keeps the node alive, but one can be added by reusing the existing graceful machinery: a new RESIGN control-toggle that makes the leader enter a graceful election without dying (same end-of-stream path, same gracefulClosedLeaderId exclusion, same safety gates). It’s a small, localized change — roughly one new toggle value, one leader-only resign method, and one handler case — but it is a fork of Aeron core.

Pros: true in-place resign — no process restart, minimal unavailability; driven by the existing control-toggle counter (consistent with SUSPEND/SNAPSHOT); still graceful and still safety-checked.
Cons: you now maintain a fork — an upgrade burden and possible conflicts with upstream toggle changes, and you own the correctness of the new path. Not upstream-supported.
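
The shape of such a toggle can be illustrated in isolation. The sketch below is not Aeron code and does not reproduce its control-toggle values or internals: ToggleSketch, its numeric codes, and the poll method are all hypothetical, and only show the general pattern of an operator-set request that the leader consumes once and then resets.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: codes and names are hypothetical and do not mirror Aeron's toggle values.
final class ToggleSketch {
    static final long NEUTRAL = 0;
    static final long RESIGN = 1;

    private final AtomicLong toggle = new AtomicLong(NEUTRAL);

    // Operator side: request a resign; succeeds only if no other request is pending.
    boolean requestResign() {
        return toggle.compareAndSet(NEUTRAL, RESIGN);
    }

    // Node side, polled on its duty cycle: only a leader acts on the request, then it is reset.
    // Returns true if a graceful election was started.
    boolean poll(final boolean isLeader, final Runnable enterGracefulElection) {
        if (toggle.get() != RESIGN) {
            return false;
        }
        if (isLeader) {
            enterGracefulElection.run();
        }
        toggle.set(NEUTRAL);
        return isLeader;
    }
}
```

The compare-and-set on the request side is what keeps two operators, or a retrying script, from stacking up multiple step-downs.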

If you need…	Choose
No fork; ops can tolerate a leader restart	Option 1 (external bounce + `random` bias)
In-place resign, no restart; willing to maintain a tiny fork	Option 2 (`RESIGN` toggle)

In both options Half A supplies the winner bias, and the Raft safety gates guarantee a stale preferred node can never be elected. The only real Half-B decision is restart vs. small fork.

Measured: in-place RESIGN blocks clients for only single-digit milliseconds

On a single-machine embedded 3-node test cluster, an in-place RESIGN step-down produced a client-visible outage of ~1–2 ms, versus ~40 ms for stopping the leader (Option 1): roughly a 20–40× difference. RESIGN wins because it leaves every node and media driver alive and connected, so the client’s ingress simply republishes to the new leader instead of re-discovering a torn-down driver. Both are far below the 10 s heartbeat timeout because a graceful step-down is bounded by election.timeout (default 1 s), not by failure detection.

Treat the numbers as order-of-magnitude: this was a single-machine harness, not a production SLA, and a real multi-AZ deployment adds cross-AZ election round-trips. The headline still holds: a planned reclaim is a single-digit-millisecond blip, not a failover event. See Tuning Cluster Failover Time for the detection-vs-election budget and why graceful paths skip the heartbeat timeout entirely.

Putting it together: multi-AZ hot-path pinning

The concrete driver for all of this is a multi-AZ deployment where the leader is the hot path. Two operational requirements fall out:

Pin — under normal conditions the leader should be the node in the selected AZ, so the client-facing path stays intra-AZ.
Reclaim after recovery — if the selected AZ fails, the cluster must fail over to a surviving AZ (availability first); when the selected-AZ node comes back healthy, leadership should migrate back to it. Reclaim is a steady-state move against a perfectly healthy leader — which is exactly why a trigger (Half B) is required and biased random alone can’t do it.

A suggested control loop (works with either Half-B option):

Only trigger when the preferred node is caught up. Triggering while it’s behind wastes a term (the safety gates reject it and a different survivor wins the interim). Pre-check the target’s logPosition against the leader’s commitPosition first.
Add hysteresis / backoff. A flapping AZ must not cause repeated leadership churn — each handoff is a brief hot-path interruption.
For the reclaim use case, Option 2 fits better. Reclaim is routine and possibly frequent; doing it as an in-place resign avoids restarting a healthy leader every time. Option 1 still works with no fork, but “restart a healthy leader to perform a routine op” is heavier and each bounce is a full rejoin/catch-up cycle.
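
The first two rules of the loop can be sketched as a small decision function for the orchestrator. Everything here is an assumption for illustration: ReclaimPolicy and its parameters are hypothetical names, and the health flag and positions are expected to come from your own monitoring rather than from any specific Aeron call.

```java
// Hypothetical orchestrator-side policy; health and positions come from your own monitoring.
final class ReclaimPolicy {
    private final int preferredId;
    private final int requiredHealthyChecks; // hysteresis: consecutive good observations needed
    private final long minIntervalMs;        // backoff: minimum gap between triggered handoffs
    private int healthyStreak;
    private long lastTriggerMs;
    private boolean hasTriggered;

    ReclaimPolicy(final int preferredId, final int requiredHealthyChecks, final long minIntervalMs) {
        this.preferredId = preferredId;
        this.requiredHealthyChecks = requiredHealthyChecks;
        this.minIntervalMs = minIntervalMs;
    }

    // Returns true when the current leader should be asked to step down (Half B).
    boolean shouldTrigger(
        final int leaderId,
        final boolean preferredHealthy,
        final long preferredLogPosition,
        final long leaderCommitPosition,
        final long nowMs) {
        if (leaderId == preferredId) {
            healthyStreak = 0;
            return false; // leadership is already where we want it
        }
        final boolean caughtUp = preferredLogPosition >= leaderCommitPosition;
        healthyStreak = (preferredHealthy && caughtUp) ? healthyStreak + 1 : 0;
        if (healthyStreak < requiredHealthyChecks) {
            return false; // not stable for long enough, or a trigger would waste a term
        }
        if (hasTriggered && nowMs - lastTriggerMs < minIntervalMs) {
            return false; // too soon after the previous handoff
        }
        hasTriggered = true;
        lastTriggerMs = nowMs;
        healthyStreak = 0;
        return true;
    }
}
```

A caller would evaluate shouldTrigger on each monitoring tick and, when it returns true, perform whichever Half-B action it has chosen (bounce or RESIGN). The streak and interval values are tuning knobs, not recommendations.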

How deterministic is the outcome?

With the Half-A bias in place, the “random race” framing changes:

Preferred node up and caught up → it wins deterministically. It draws nomination delay 0, fires first, and passes every safety gate. This is the normal, intended case.
Preferred node down or behind → normal Raft takes over. It can’t pass the candidacy gate, so a surviving, current member wins on the usual [0, 500 ms) race — i.e. real failover, unchanged.
Targeting an AZ is easier than a specific node. In the common 2+1 layout (both hot-path-AZ candidates present, misplaced leader in the remote zone), any winner among the two restores placement even without the bias — the bias just makes the specific node deterministic.

Without the bias, and relying only on repeated graceful step-downs, a specific target wins with probability ≈ 1/k per round among k equally-caught-up followers (a coin flip in a 3-node cluster) — so the biased random is what turns “retry until it lands” into “it lands the first time.”
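
The cost of the unbiased approach can be put in numbers. Assuming each step-down round is an independent, uniform race among k equally-caught-up followers (a simplification), the number of rounds until the target wins follows a geometric distribution. UnbiasedReclaimOdds is a hypothetical helper for that arithmetic:

```java
final class UnbiasedReclaimOdds {
    // Probability that a specific target has won at least once within `rounds` step-downs,
    // assuming each round is an independent uniform race among k caught-up followers.
    static double probabilityWithin(final int k, final int rounds) {
        return 1.0 - Math.pow(1.0 - 1.0 / k, rounds);
    }

    // Expected number of step-downs until the target wins (mean of a geometric distribution).
    static double expectedRounds(final int k) {
        return k;
    }
}
```

For a 3-node cluster (k = 2) this gives 0.5 after one round and 0.875 after three, with two rounds expected on average. Each extra round is another hot-path interruption, which is the cost the bias removes.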

Does step-down lose data?

Committed data: no. An entry is committed only once a quorum (⌊n/2⌋+1) has appended it, and voters refuse any candidate whose log is behind their own, so any new leader holds every committed entry (the standard Raft safety argument). A graceful close (or RESIGN) performs an orderly teardown, setting the commit position and draining the archive recording before the log publication closes, so committed entries are durably recorded before the handoff. See Cluster and Raft Overview.
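
The quorum arithmetic behind that argument is small enough to write out. This is a generic Raft-style sketch, not Aeron's implementation; QuorumMath and its method names are hypothetical.

```java
import java.util.Arrays;

final class QuorumMath {
    // Smallest majority of n members: floor(n / 2) + 1.
    static int quorumThreshold(final int memberCount) {
        return (memberCount / 2) + 1;
    }

    // Highest log position that at least a quorum of members has appended.
    static long quorumPosition(final long[] appendedPositions) {
        final long[] sorted = appendedPositions.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - quorumThreshold(sorted.length)];
    }
}
```

With three members at positions 100, 80 and 120, the quorum position is 100: two of the three hold everything up to 100. Any majority of voters must include at least one of those two, and that voter will refuse a candidate whose log stops short of its own.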

In-flight, uncommitted data: it can be dropped, and the client must handle it. Entries the old leader appended but hadn’t committed may be truncated in the new term. During the election, client sessions are disconnected; an AeronCluster client receives a new-leader event and reconnects with the same session id, but it does not resubmit anything automatically. Treat un-acked requests as unknown-outcome and resubmit idempotently. Operationally: schedule reclaims in a quiet window and quiesce/drain ingress first if the workload allows; otherwise ensure clients reconcile un-acked orders once the new leader is ready.
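
"Resubmit idempotently" has a client half and a service half. The sketch below shows one possible shape for each, assuming every request carries a client-chosen unique id. PendingRequests and DedupingService are hypothetical names, and nothing here uses the AeronCluster client API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Client side: remembers each request until it is acknowledged.
final class PendingRequests {
    private final Map<Long, String> unacked = new LinkedHashMap<>();

    void onSend(final long requestId, final String payload) {
        unacked.put(requestId, payload);
    }

    void onAck(final long requestId) {
        unacked.remove(requestId);
    }

    // Call once the new leader is ready: everything still un-acked has an unknown outcome.
    List<Map.Entry<Long, String>> toResubmit() {
        return new ArrayList<>(unacked.entrySet());
    }
}

// Service side: applies each request id at most once, so a resubmitted duplicate is harmless.
final class DedupingService {
    private final Set<Long> applied = new HashSet<>();

    // Returns true if the request was applied, false if it was a duplicate.
    boolean apply(final long requestId, final String payload) {
        return applied.add(requestId);
    }
}
```

In a real clustered service the set of applied ids would need to be part of the replicated, snapshotted state and bounded in size; both are omitted here to keep the sketch short.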

See also: Tuning Cluster Failover Time (how fast the election completes), Cluster Standby and Multi-AZ HA Design, and Client-Cluster Communication (the new-leader event and client reconnect).