Leader Placement and Graceful Step-Down
In a Raft cluster the leader carries every write, so where the leader sits determines latency. When nodes span availability zones, a leader in a remote AZ adds a cross-AZ round trip to every committed operation. This page compares how Aeron Cluster and SOFAJRaft handle leader assignment, and describes the operational pattern for steering Aeron’s leader back to the low-latency zone after a failover.
For the byte-level election internals — terms, canvass, vote records — see The Aeron Files. This page is the operational view: what leader placement control each implementation offers, and what to do when it isn’t enough.
Why leader placement matters
In the lowest-latency configuration, the leader sits in the same AZ as the workload that drives it. Consensus traffic and the hot path stay within one zone, so p50 and p99 stay tight.
After a leader failure, Raft elects a new leader from the surviving members. That election is decided by the protocol — timeouts and votes — not by zone. The new leader can land in a different AZ, and from that point consensus crosses an AZ boundary on every order.
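The cost of a misplaced leader can be estimated with a rough model: a write waits for the round trip from the workload to the leader, plus the round trip to the slowest follower the leader still needs for a majority. The sketch below is only that model, not a measurement. The class name, method names, and the 100 µs / 1,000 µs round-trip figures are illustrative assumptions, and the layout (two nodes in the workload's AZ, one in a remote AZ) is a hypothetical three-node cluster.

```java
import java.util.Arrays;

final class CommitLatencyModel {
    /**
     * Time until the leader has heard from enough followers to form a majority.
     * followerRttMicros holds the leader's round-trip time to each follower.
     */
    static long quorumAckMicros(long[] followerRttMicros) {
        int clusterSize = followerRttMicros.length + 1; // followers plus the leader
        int acksNeeded = clusterSize / 2;               // majority minus the leader itself
        if (acksNeeded == 0) {
            return 0;                                   // single-node cluster, nothing to wait for
        }
        long[] sorted = followerRttMicros.clone();
        Arrays.sort(sorted);
        return sorted[acksNeeded - 1];                  // slowest ack the majority depends on
    }

    /** Rough commit latency: workload-to-leader round trip plus the quorum wait. */
    static long commitLatencyMicros(long workloadToLeaderRttMicros, long[] followerRttMicros) {
        return workloadToLeaderRttMicros + quorumAckMicros(followerRttMicros);
    }

    public static void main(String[] args) {
        long local = 100;    // assumed intra-AZ round trip, illustrative only
        long remote = 1_000; // assumed cross-AZ round trip, illustrative only

        // Leader in the workload's AZ: one follower is local, one is remote.
        long localLeader = commitLatencyMicros(local, new long[] {local, remote});
        // Leader in the remote AZ: the workload and both followers are a cross-AZ hop away.
        long remoteLeader = commitLatencyMicros(remote, new long[] {remote, remote});

        System.out.println("local leader: " + localLeader + " us");   // prints "local leader: 200 us"
        System.out.println("remote leader: " + remoteLeader + " us"); // prints "remote leader: 2000 us"
    }
}
```

Under these assumed numbers the remote leader pays the cross-AZ hop twice, once to reach the workload and once to reach any follower, which is why the penalty shows up on every operation rather than only at election time.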
Leader assignment: Aeron Cluster vs SOFAJRaft
The two implementations take different positions on how much control an operator has over leadership.
| Capability | SOFAJRaft | Aeron Cluster |
|---|---|---|
| Explicit leadership transfer | Yes — Node.transferLeadershipTo(PeerId) hands leadership to a chosen peer by sending it a TimeoutNowRequest to start an immediate election | No native API to transfer leadership to a specific node |
| Election priority / preference | Yes — per-node ElectionPriority (NodeOptions.setElectionPriority); higher-priority nodes are preferred, 0 means “never leader” | No priority mechanism; all voting members are equal candidates |
| Pin leadership to an AZ/node | Achievable via priority + transfer | Not natively supported |
| Trigger / influence an election | transferLeadershipTo triggers a targeted election | No command to trigger an election or move leadership to a chosen node |
What this means after a failover
Because Aeron has no leader-designation API, the intuitive fix of killing the badly-placed leader and hoping the right node wins is unreliable: the next election can hand leadership to another remote node. The ClusterTool commands (is-leader, list-members, suspend/resume, shutdown) let you inspect membership and gracefully shut a node down, but none of them request an election toward a specific node.
Operational pattern: controlled step-down
Without a native transfer API, the most reliable approach is validate, then re-elect: pre-check the desired target node, then step the current leader down to force a fresh election. This is more dependable than force-killing a node, though the election outcome is still not guaranteed to land on the target.
Phase 1 — pre-check the target
Before changing anything, confirm the intended node is healthy and fully caught up on the log. A target that is behind on replication may lose the election or, if it wins, stall while it catches up. Validating first is what makes the re-election predictable enough to be useful.
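The pre-check can be reduced to a small predicate. The following is a minimal sketch, assuming your own tooling can already tell you whether the target is reachable and what log position it and the leader have reached. `MemberState`, `isReadyForLeadership`, and `maxLagBytes` are hypothetical names introduced for illustration, not part of Aeron or ClusterTool.

```java
final class TargetPreCheck {
    /** Hypothetical snapshot of one member, filled in by the operator's own tooling. */
    static final class MemberState {
        final int memberId;
        final boolean reachable;
        final long logPosition;

        MemberState(int memberId, boolean reachable, long logPosition) {
            this.memberId = memberId;
            this.reachable = reachable;
            this.logPosition = logPosition;
        }
    }

    /**
     * True when the target is reachable and its log trails the leader by at most maxLagBytes.
     * Pass maxLagBytes = 0 to require the target to be fully caught up.
     */
    static boolean isReadyForLeadership(MemberState target, long leaderLogPosition, long maxLagBytes) {
        if (!target.reachable) {
            return false;                                // an unreachable node cannot win or serve
        }
        long lag = leaderLogPosition - target.logPosition;
        return lag <= maxLagBytes;                       // behind by more than the allowance: not ready
    }

    public static void main(String[] args) {
        MemberState caughtUp = new MemberState(0, true, 5_000);
        MemberState behind = new MemberState(1, true, 4_000);

        System.out.println(isReadyForLeadership(caughtUp, 5_000, 0)); // prints "true"
        System.out.println(isReadyForLeadership(behind, 5_000, 0));   // prints "false"
    }
}
```

A non-zero `maxLagBytes` is a judgment call: it tolerates the small lag a healthy follower always has under load, at the cost of a weaker guarantee than "fully caught up".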
Phase 2 — step the leader down
Gracefully step down the current leader to trigger a new election. Combined with the pre-check, this biases the outcome toward the validated node. That is far more reliable than force-killing and hoping, but still not deterministic the way SOFAJRaft’s transferLeadershipTo is.
Phase 3 — confirm placement
Verify the new leader landed in the intended AZ (is-leader / list-members). If it didn't, repeat the step-down. Once the leader is back in the hot-path zone, consensus and the workload share one AZ again and p50/p99 return to their single-AZ baseline.
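Put together, the three phases form a bounded retry loop. The sketch below shows one way to structure it. `ClusterOps` is a hypothetical interface standing in for whatever your tooling uses to inspect the cluster and stop the leader; it is not an Aeron API, and `SimulatedCluster` exists only so the loop can be run without a real cluster. The sketch targets a single node for simplicity, although the same loop works with "any node in the intended AZ" as the success condition.

```java
final class ControlledStepDown {
    /** Hypothetical operations supplied by the operator's tooling; not an Aeron API. */
    interface ClusterOps {
        boolean targetIsHealthyAndCaughtUp(int targetMemberId);
        int currentLeaderId();
        void stepDownLeader(int leaderMemberId);
        void awaitNewLeader();
    }

    /** Returns true once the target leads, false if the pre-check fails or attempts run out. */
    static boolean steerLeaderTo(int targetMemberId, ClusterOps ops, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int leader = ops.currentLeaderId();
            if (leader == targetMemberId) {
                return true;                               // phase 3: placement confirmed
            }
            if (!ops.targetIsHealthyAndCaughtUp(targetMemberId)) {
                return false;                              // phase 1 failed: change nothing
            }
            ops.stepDownLeader(leader);                    // phase 2: force a fresh election
            ops.awaitNewLeader();
        }
        return ops.currentLeaderId() == targetMemberId;    // final check after the last attempt
    }

    /** Stand-in for a real cluster: electionWinners[i] is the leader after i step-downs. */
    static final class SimulatedCluster implements ClusterOps {
        private final int[] electionWinners;
        private final boolean targetReady;
        int stepDowns = 0;

        SimulatedCluster(int[] electionWinners, boolean targetReady) {
            this.electionWinners = electionWinners;
            this.targetReady = targetReady;
        }

        public boolean targetIsHealthyAndCaughtUp(int targetMemberId) { return targetReady; }
        public int currentLeaderId() {
            return electionWinners[Math.min(stepDowns, electionWinners.length - 1)];
        }
        public void stepDownLeader(int leaderMemberId) { stepDowns++; }
        public void awaitNewLeader() { /* a real implementation would poll until a leader exists */ }
    }

    public static void main(String[] args) {
        // Leadership moves 2 -> 1 -> 0 in this simulation; node 0 is the target.
        SimulatedCluster cluster = new SimulatedCluster(new int[] {2, 1, 0}, true);
        boolean placed = steerLeaderTo(0, cluster, 3);
        System.out.println("placed=" + placed + " stepDowns=" + cluster.stepDowns);
        // prints "placed=true stepDowns=2"
    }
}
```

Capping the attempts matters because every step-down is itself an election with a brief loss of leadership, so an unbounded loop could trade a latency problem for an availability one.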