Aeron Cluster and Raft Consensus
Aeron Cluster is a fault-tolerant platform for stateful, event-driven applications. It keeps state in memory for microsecond-scale latency, with a Raft-replicated log underneath so that a committed event survives the loss of a minority of nodes. This page covers the design principles, the two RPCs that make Raft tick, the improvements Aeron layers on top, and the five safety guarantees that make it suitable for a matching engine.
For the byte-level internals — log layout, term files, archive recordings — defer to The Aeron Files. This page is the operator’s mental model, not the wire format.
Design principles
Aeron Cluster is built on Aeron Transport, not TCP. That single choice drives the rest of the design.
- Very high throughput and very low latency — built on Aeron Transport, not TCP.
- State held entirely in memory — no database on the hot path.
- Input events written to persistent storage — via Aeron Archive, for node recovery.
- Replication to other nodes and sites — Raft-based consensus for consistency.
- Multiple services on a single cluster — share the consensus infrastructure across services.
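The layers stack like this: your clustered services run on top of the Consensus Module, which uses Aeron Archive for the persistent log and Aeron Transport for communication between nodes.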
All state lives in memory. The persistent log (Archive) exists only for recovery. That is what enables microsecond-level processing: no disk I/O on the hot path, no database queries, no network calls to external systems. The result shows up directly in your latency profile — p50 and p99 stay low because the request never touches a disk or a remote service to produce a response.
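The split between in-memory state and a recovery-only log can be sketched in a few lines. This is a toy model, not Aeron's API: `Ledger`, `replay`, and the plain Python list standing in for the Archive are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory state: account name -> balance."""
    balances: dict = field(default_factory=dict)

    def apply(self, event):
        # Pure in-memory update: no disk or network access on this path.
        account, delta = event
        self.balances[account] = self.balances.get(account, 0) + delta


def replay(log):
    """Rebuild state from scratch by applying every logged event in order."""
    state = Ledger()
    for event in log:
        state.apply(event)
    return state


log = []          # hypothetical stand-in for the persistent log
live = Ledger()
for event in [("alice", 100), ("bob", 50), ("alice", -30)]:
    log.append(event)   # the input event is recorded ...
    live.apply(event)   # ... and applied to in-memory state

recovered = replay(log)  # what a restarted node would do in this model
```

In this model the log is only read during `replay`; normal processing touches nothing but the dictionary.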
The two Raft RPCs
The basic Raft consensus algorithm uses only two RPCs. That is the whole protocol surface for elections and replication.
RequestVote RPC
- Invoked by candidates to gather votes during leader election.
- A node transitions to candidate state, increments its term, votes for itself, and sends RequestVote to all other nodes.
- Other nodes grant their vote if they have not already voted for a different candidate in that term and the candidate’s log is at least as up-to-date as theirs.
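The vote-granting rule can be sketched as follows. It follows the rule as described in the Raft paper rather than Aeron's actual message handling, and the `Node` class and `handle_request_vote` function are hypothetical names. "At least as up-to-date" is taken to mean: a higher last log term wins, and with equal terms the longer log wins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0
    last_log_index: int = 0


def handle_request_vote(node, term, candidate_id, last_log_index, last_log_term):
    """Return True if this node grants its vote to the candidate."""
    if term < node.current_term:
        return False                      # stale candidate
    if term > node.current_term:
        node.current_term = term          # newer term: forget any earlier vote
        node.voted_for = None
    already_voted = node.voted_for not in (None, candidate_id)
    # Tuple comparison: last term first, then last index.
    log_ok = (last_log_term, last_log_index) >= (node.last_log_term, node.last_log_index)
    if already_voted or not log_ok:
        return False
    node.voted_for = candidate_id
    return True
```

Comparing `(term, index)` tuples keeps the up-to-date check to one line and makes the ordering explicit.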
AppendEntries RPC
- Invoked by the leader for two purposes:
- Log replication — sending new log entries to followers.
- Heartbeat — empty AppendEntries to maintain leadership (prevents followers from starting elections).
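A follower's view of the heartbeat can be sketched like this. The `FollowerTimer` class and its millisecond timestamps are assumptions made for illustration; Aeron's real timeouts are configured and tracked differently.

```python
class FollowerTimer:
    """Tracks leader contact so a follower knows when to consider an election."""

    def __init__(self, election_timeout_ms):
        self.election_timeout_ms = election_timeout_ms
        self.last_leader_contact_ms = 0

    def on_append_entries(self, now_ms, entries):
        # An empty AppendEntries (heartbeat) and one carrying entries
        # both count as contact from the leader.
        self.last_leader_contact_ms = now_ms
        return len(entries)

    def should_start_election(self, now_ms):
        # Only silence from the leader for a full timeout triggers an election.
        return now_ms - self.last_leader_contact_ms >= self.election_timeout_ms
```

Because the same message type serves both purposes, a busy leader that is constantly replicating needs no separate heartbeat traffic in this model.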
Why only two? The simplicity is intentional. Fewer message types means fewer edge cases, fewer bugs, and easier formal verification. Everything in Raft is either “who should be leader?” (RequestVote) or “here’s what the leader says” (AppendEntries).
Aeron’s consensus implementation
Aeron Cluster implements Raft using three building blocks, with several improvements over the original paper.
Built on:
- Aeron Transport — for inter-node communication (UDP, not TCP).
- Aeron Archive — for persistent log storage.
- Consensus Module — Aeron’s Raft implementation.
Improvements upon the Raft paper
1. A canvass phase before elections. Before a node becomes a candidate and starts a formal election, it first canvasses other nodes to check if an election would succeed. This avoids unnecessary elections that would disrupt the cluster: a node won’t start an election it expects to lose.
2. Parallel replication between nodes. The leader sends AppendEntries to all followers simultaneously, not sequentially. Commit latency is therefore bounded by the slowest node in the fastest majority, not by the slowest node in the cluster or the sum of all round trips.
3. Natural batching during replication. Multiple log entries are batched into a single AppendEntries message when they’re available. This reduces network round-trips and improves throughput without adding artificial batching delay — a throughput win that costs you nothing on latency.
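The effect of parallel replication on commit latency can be worked through with a small sketch. Both functions are hypothetical helpers, not Aeron APIs, and the input lists are assumed to include the leader itself.

```python
def quorum_position(positions):
    """Highest log position that a majority of nodes has reached."""
    ranked = sorted(positions, reverse=True)
    majority = len(positions) // 2 + 1
    return ranked[majority - 1]


def commit_latency(ack_latencies_us):
    """With parallel sends, commit waits only for the majority-th fastest ack.

    The leader counts as one member with latency 0.
    """
    majority = len(ack_latencies_us) // 2 + 1
    return sorted(ack_latencies_us)[majority - 1]
```

For a five-node cluster with positions `[100, 100, 90, 60, 40]`, `quorum_position` returns 90: three nodes have reached at least that point. With ack latencies of `[0, 40, 55, 300, 900]` microseconds, `commit_latency` returns 55, so the two slow nodes do not hold up the commit.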
The five safety guarantees
Raft’s correctness rests on five guarantees. Together they ensure the cluster can never diverge.
1. Election Safety
At most one leader can be elected in a given term.
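This prevents split-brain within a term: two nodes can never both be the legitimate leader for the same term number. It is enforced by requiring a majority vote and allowing each node to vote at most once per term, so two candidates cannot both collect a majority.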
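The reason a majority vote is enough comes down to counting: any two majorities of the same cluster share at least one node, and that node votes only once per term. A brute-force check for small clusters, using hypothetical helper names, might look like this:

```python
from itertools import combinations


def majorities(nodes):
    """All smallest-possible majorities of the given node list."""
    need = len(nodes) // 2 + 1
    return [set(group) for group in combinations(nodes, need)]


def any_two_majorities_overlap(nodes):
    """True if every pair of minimal majorities shares at least one node."""
    groups = majorities(nodes)
    return all(a & b for a, b in combinations(groups, 2))
```

Only minimal majorities are enumerated, since any larger majority contains one of them and therefore overlaps as well.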
2. Leader Append-Only
A leader never overwrites or deletes existing entries in its log.
The leader only appends new entries; it never modifies history. This makes the leader’s log a monotonically growing, immutable sequence — critical for consistency.
3. Consistent Log Entries (Log Matching)
If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
This is the induction property that makes Raft work: if two nodes agree on entry (index=42, term=5), they must also agree on entries 1–41. Enforced by the leader including the previous entry’s index and term in every AppendEntries — followers reject entries that don’t match.
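The follower-side consistency check can be sketched as below. It models the rule from the Raft paper with a hypothetical `append_entries` function over a list of `(term, payload)` tuples; it is not Aeron's log format, which is position-based.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Apply a leader's AppendEntries to a follower log.

    log is a list of (term, payload); the entry with index i is log[i - 1].
    Returns False if the previous-entry check fails.
    """
    if prev_index > 0:
        # Reject if we have no entry at prev_index or its term differs.
        if prev_index > len(log) or log[prev_index - 1][0] != prev_term:
            return False
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1][0] != entry[0]:
                # Conflict: drop this entry and everything after it.
                del log[index - 1:]
                log.append(entry)
        else:
            log.append(entry)
    return True
```

A rejected call tells the leader to retry from an earlier index, which is how a lagging or divergent follower gets walked back to the last point of agreement.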
4. Leader Completeness
If a log entry is committed in a given term, then that entry will be present in the logs for leaders of all higher-numbered terms.
Once an entry is committed (replicated to a majority), it can never be lost — every future leader must have it. Enforced by the voting rule: a node won’t vote for a candidate whose log is less complete than its own.
5. State Machine Safety
If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
This is the ultimate consistency guarantee: all nodes apply the same sequence of operations to their state machines. Combined with deterministic execution, every node converges to identical state.
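Deterministic execution is the part the application has to supply. A minimal sketch, assuming a hypothetical event shape of `(timestamp_ms, order_id, ttl_ms)`, shows the usual discipline: every input to the state transition comes from the log entry itself.

```python
def apply_all(log):
    """Apply (timestamp_ms, order_id, ttl_ms) events and return expiry times."""
    expiry = {}
    for timestamp_ms, order_id, ttl_ms in log:
        # Deterministic: uses the timestamp stored in the entry,
        # never the local wall clock, a random number, or thread timing.
        expiry[order_id] = timestamp_ms + ttl_ms
    return expiry


log = [(1_000, "o1", 500), (1_200, "o2", 250)]
replica_a = apply_all(log)
replica_b = apply_all(list(log))  # a second node applying the same entries
```

If `apply_all` read the local clock instead, two replicas processing the same log at different moments would compute different expiry times and diverge.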
How they all fit together
| Guarantee | What it ensures |
|---|---|
| Election Safety | One leader per term |
| Leader Append-Only | Leader never rewrites history |
| Log Matching | Agreement on one entry = agreement on all prior |
| Leader Completeness | Committed entries survive leader changes |
| State Machine Safety | All nodes apply same operations in same order |