The Four Receive-Path Buffers

A packet on the wire traverses four nested buffers before an Aeron subscriber reads it. Each must be large enough to absorb the layer above it stalling — otherwise you drop, gap, and NAK. This page maps the stack from outermost (wire) to innermost (app), the enforced constraints between the layers, and the order to size them in. Every constraint below is verified against the Aeron driver source and the ENA driver’s best-practices guide — see Sources.

Read it alongside the parameter reference and BDP — this page is where those knobs physically live.

| Buffer | Property | Default | Role |
| --- | --- | --- | --- |
| ENA RX ring | ethtool -G rx | 1024 (256 up to 16K, instance-dependent; check ethtool -g) | NIC descriptors; how many packets the NIC can hold before the host drains |
| OS recv buf | aeron.socket.so_rcvbuf, capped by net.core.rmem_max | 128KB | kernel socket queue between NAPI and Aeron’s socket read (recvmmsg in the C driver) |
| Term buffer | aeron.term.buffer.length | 16MB (min 64KB, max 1GB, power of 2) | the actual mmap’d log the subscriber reads from |
| Initial window | aeron.rcv.initial.window.length | 128KB | flow-control credit the receiver grants the sender |

The two 128KB defaults are no coincidence. The Aeron source derives the window default as a textbook BDP — 10 Gbps × 100 µs LAN RTT = 125,000 bytes, rounded to 128KB — and sets the SO_RCVBUF default to exactly the same value: the minimum that passes its own startup validation (Constraint C).
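The arithmetic behind that default can be reproduced in a few lines. This is a minimal sketch: the helper names are illustrative, and rounding up to the next power of two is an assumption about how 125,000 bytes becomes 128KB.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product in bytes: bits in flight during one RTT, divided by 8."""
    return bandwidth_bps * rtt_seconds / 8


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


lan_bdp = bdp_bytes(10e9, 100e-6)               # 10 Gbps x 100 us LAN RTT
rounded = next_power_of_two(round(lan_bdp))     # assumed rounding rule
print(round(lan_bdp), rounded, rounded // 1024)  # bytes, rounded bytes, KB
```

Swapping in your own link speed and measured RTT gives the BDP target that the rest of this page anchors on.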

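The three constraints below are not advice: the driver applies them at startup. Constraint A is applied as a cap on the window; violating B or C stops the driver from starting.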

Constraint A — window is capped by term length

receiverWindow = min(initialWindowLength, termBufferLength / 2) (Configuration.java, receiverWindowLength())

The flow-control window can never exceed half the term buffer. Raising the initial window past termLength / 2 does nothing — you must grow the term buffer first. (e.g. a 16MB term caps the window at 8MB regardless of what you set.)

Constraint B — initial window must be ≥ MTU

MTU ≤ initialWindowLength (Configuration.java, validateInitialWindowLength() — else the driver refuses to start)

With an MTU of 8192, the 128KB default window is fine — just never set the window below the MTU.

Constraint C — SO_RCVBUF must hold the window (BDP) — also enforced

initialWindowLength ≤ SO_RCVBUF (Configuration.java, validateSocketBufferLengths() — else the driver refuses to start)

The source marks both so_rcvbuf and initial.window as “must be sufficient for Bandwidth Delay Product”, and the driver enforces the pairing: at startup it throws initialWindowLength > SO_RCVBUF — increase aeron.socket.so_rcvbuf if the window exceeds the configured (or OS-default) socket buffer. Otherwise the kernel would drop packets the window said were allowed in flight.

The one gap the driver cannot enforce is net.core.rmem_max: if the kernel ceiling is lower than the requested so_rcvbuf, the kernel silently caps the real buffer and the driver only prints a warning — it starts anyway, with an effective socket buffer smaller than the window. That is the back door through which the “kernel drops in-flight packets” failure mode survives.

ENA ring (max) ─ drains fast enough to feed ─►
SO_RCVBUF ≥ receiverWindow = min(initialWindow, termLength/2)
▲ requires net.core.rmem_max ≥ SO_RCVBUF (or the kernel silently caps it)
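The chain above can be expressed as a pre-flight check. The sketch below is not the driver's code: the class and function names are hypothetical, the kernel cap is modelled simply as min(requested, rmem_max), and the 212992-byte rmem_max in the usage note is an assumed value of the kind often seen on stock Linux hosts.

```python
from dataclasses import dataclass


@dataclass
class ReceiveConfig:
    mtu: int
    initial_window: int
    term_length: int
    so_rcvbuf: int
    rmem_max: int


def check(cfg: ReceiveConfig) -> list[str]:
    """Return human-readable findings; an empty list means the chain is consistent."""
    findings = []
    # Constraint A: the window is capped at half the term buffer.
    effective_window = min(cfg.initial_window, cfg.term_length // 2)
    if effective_window < cfg.initial_window:
        findings.append("A: window capped at term_length / 2")
    # Constraint B: the initial window must be at least one MTU.
    if cfg.mtu > cfg.initial_window:
        findings.append("B: MTU > initial window, driver refuses to start")
    # Constraint C: the socket buffer must hold the window.
    if cfg.initial_window > cfg.so_rcvbuf:
        findings.append("C: initial window > SO_RCVBUF, driver refuses to start")
    # The gap the driver cannot enforce: the kernel ceiling on SO_RCVBUF.
    effective_rcvbuf = min(cfg.so_rcvbuf, cfg.rmem_max)
    if effective_rcvbuf < effective_window:
        findings.append("rmem_max: kernel caps SO_RCVBUF below the window")
    return findings
```

For example, `check(ReceiveConfig(8192, 256 * 1024, 16 * 1024**2, 2 * 1024**2, 212992))` passes A, B and C but still reports the rmem_max finding, which is the silent failure described above.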

Are the four values correlated, or independent?

Both — and this is the crux. Three of the four are correlated (one byte-denominated chain anchored to the window); the ENA ring sits on its own axis.

Three byte-buffers, anchored to the window

The flow-control window is the anchor — it’s the intent (how many bytes you allow in flight = your BDP target). The other two byte-buffers are sized relative to it by the constraints above:

Solid arrows are constraints the driver enforces at startup; the dashed arrow is the one it cannot enforce — the kernel applies it silently. The right-hand axis never touches the left: the ring is sized off stall time, not in-flight bytes.

So they are not independent. Pick the window first; the term buffer then has a floor (≥ 2 × window, or Constraint A caps the window) and the socket buffer has a hard floor (≥ window, checked at startup). rmem_max is the silent ceiling that decides whether your SO_RCVBUF request is actually honored.

Worked example — target a 256KB window (2× the same-AZ default — covers RTT up to ~200 µs at 10 Gbps; genuine cross-AZ at 10 Gbps needs a ~2MB+ window):

| Buffer | Value | Why |
| --- | --- | --- |
| Initial window | 256KB | the anchor: your BDP target |
| Term buffer | ≥ 512KB | Constraint A: term ≥ 2 × window (the 16MB default already clears this) |
| SO_RCVBUF | ≥ 256KB (set ~2MB) | Constraint C: ≥ window or no startup, plus stall headroom; raise rmem_max to match |
| ENA RX ring | max (ethtool -g) | a time budget, not a byte budget (see below) |

The ENA ring is on a different axis: time, not bytes

The ring is not byte-denominated — it’s a packet descriptor count (256–16K entries depending on instance type), and it guards a different failure mode: surviving a drain stall, where the vCPU feeding the receive path is unavailable so NAPI can’t empty the ring. Its headroom is measured in time:

stall survival ≈ ring entries ÷ inbound packet rate

At 10 Gbps of ~1500-byte frames (~820k pps), the 1024-entry default survives only ≈ 1.2 ms of stall; 8192 entries buy ≈ 10 ms (jumbo frames stretch each entry further). So the ring doesn’t scale off the window — it scales off how long a stall you must ride out. The ENA guide’s guidance for spiky CPU load is exactly this: grow the RX ring to compensate for temporary vCPU unavailability. The practical rule: set it to max and forget it — at deploy time, since resizing the ring causes a brief traffic interruption.
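The stall arithmetic above is easy to rerun for your own packet rate. A minimal sketch with illustrative function names, using the ~820k pps figure from the paragraph above as the sample input:

```python
def stall_survival_ms(ring_entries: int, pps: int) -> float:
    """How long the ring can absorb packets while nothing drains it, in milliseconds."""
    return ring_entries / pps * 1000


def entries_for_stall(stall_us: int, pps: int) -> int:
    """Ring entries needed to ride out a stall of stall_us microseconds (ceiling division)."""
    return -(-(stall_us * pps) // 1_000_000)


PPS_10G = 820_000  # ~10 Gbps of ~1500-byte frames, as quoted above

for entries in (1024, 8192, 16384):
    print(entries, round(stall_survival_ms(entries, PPS_10G), 2), "ms")

print(entries_for_stall(5_000, PPS_10G), "entries to survive a 5 ms stall")
```

Running the inverse function against a measured worst-case stall shows quickly whether the instance's maximum ring size is enough or whether the packet rate itself has to come down.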

The window tells the sender “you may have N bytes in flight.” Those N bytes must fit through every layer below it, or you lose them and NAK:

  1. SO_RCVBUF effectively smaller than the window (the driver blocks the configured case at startup, but rmem_max can silently cap it) → kernel socket overflows → RcvbufErrors → gap → NAK.
  2. SO_RCVBUF fine but the receive path’s vCPU stalls (GC, CPU steal, or scheduler jitter — e.g. CFS wakeup-preemption holding the woken receiver thread off-CPU for ~10 ms) → NAPI can’t move ring→socket fast enough → ENA RX ring overruns (rx_overruns in ethtool -S) → NAK. This is the classic production-overrun chart: the socket was big enough, but a stall let the NIC ring fill.
  3. Window too big for termLength/2 → silently capped (Constraint A), so you think you raised throughput headroom but didn’t.
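Failure mode 1 can be watched from the host: on Linux, the UDP RcvbufErrors counter is exposed in /proc/net/snmp as a header line followed by a value line. The parser below is a small sketch; the function name and the sample text are illustrative, and the column set in the sample is an assumption that varies by kernel version.

```python
def udp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp-style text."""
    rows = [line.split() for line in snmp_text.splitlines() if line.startswith("Udp:")]
    if len(rows) < 2:
        return {}
    header, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


# Illustrative sample; on a real host read the text from /proc/net/snmp instead.
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
Udp: 9000 0 12 8000 12 0 0 0 0
UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
UdpLite: 0 0 0 0 0 0 0 0 0
"""

counters = udp_counters(SAMPLE)
print("RcvbufErrors:", counters["RcvbufErrors"])
```

The counter is cumulative and host-wide, so the useful signal is the difference between two samples taken around a NAK burst, not the absolute value.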

Practical sizing rule — set them bottom-up

To increase in-flight headroom (absorb bursts/stalls → fewer overruns → fewer NAKs), raise the layers bottom-up and consistently:

  1. ENA ring → max. Run ethtool -g <dev> to see the limit (up to 16K on larger instances), then ethtool -G <dev> rx <max>. Do it at deploy time: resizing causes a brief traffic interruption.
  2. rmem_max ≥ desired SO_RCVBUF, then set so_rcvbuf to the BDP (e.g. 2–8MB at high rate). rmem_max is just the ceiling — raising it alone does nothing until so_rcvbuf asks for it, and skipping it silently caps what so_rcvbuf asked for.
  3. Term buffer ≥ 2 × desired window — required by Constraint A before the window can grow.
  4. Initial window = your BDP target (≥ MTU, ≤ termLength/2, ≤ SO_RCVBUF). This is the actual flow-control credit.
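The four steps can be turned into an ordered list of settings. The sketch below only builds strings and applies nothing. The function name, the `eth0` device and the sample numbers are placeholders; the Aeron values are written as plain property assignments, so how you pass them to your driver (system properties, environment, config file) is left open.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


def sizing_plan(window: int, so_rcvbuf: int, ring_max: int, dev: str = "eth0") -> list[str]:
    """Settings in the bottom-up order above, for a chosen window and socket buffer."""
    if so_rcvbuf < window:
        raise ValueError("SO_RCVBUF must be >= window (Constraint C)")
    # Smallest legal term buffer for this window; the 16MB default is usually larger.
    term_floor = max(64 * 1024, next_power_of_two(2 * window))
    if term_floor > 1024**3:
        raise ValueError("window needs a term buffer above the 1GB maximum")
    return [
        f"ethtool -G {dev} rx {ring_max}",                  # step 1: ring to max
        f"sysctl -w net.core.rmem_max={so_rcvbuf}",         # step 2: raise the ceiling
        f"aeron.socket.so_rcvbuf={so_rcvbuf}",              # step 2: ask for the buffer
        f"aeron.term.buffer.length>={term_floor}",          # step 3: floor, not a target
        f"aeron.rcv.initial.window.length={window}",        # step 4: the credit itself
    ]


for line in sizing_plan(window=256 * 1024, so_rcvbuf=2 * 1024**2, ring_max=8192):
    print(line)
```

Keeping the order explicit matters because each later setting depends on an earlier one having taken effect.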

For the BDP math behind steps 2 and 4, see Bandwidth Delay Product. For the protocol-level mechanics of flow control and NAK, see The Aeron Files.

Plug in your SBE message size, target TPS, the RTT between your nodes, and the instance family — the calculator chains every constraint on this page (window ← BDP, term ≥ 2 × window, SO_RCVBUF ≥ window, rmem_max ceiling, ring stall budget) and projects min/recommended values, plus whether the working set fits the chosen CPU’s L3 sharing domain.

Receive-path sizing calculator: projected values

| Output | How it is derived |
| --- | --- |
| On-wire bytes/msg | 32B frame header, 32B aligned |
| Bandwidth λ | TPS × bytes/msg |
| BDP | λ × RTT |
| Initial window (aeron.rcv.initial.window.length) | 2–4 × BDP, floored at the 128KB default and at MTU |
| Term buffer (aeron.term.buffer.length) | ≥ 2 × window (Constraint A) AND ≥ 2 × λ × stall headroom (publisher can only race term/2 ahead), power of 2, 64KB–1GB |
| OS socket receive buffer SO_RCVBUF (aeron.socket.so_rcvbuf) | ≥ window (Constraint C: refuses to start otherwise); 2× for stall headroom |
| net.core.rmem_max | kernel ceiling for SO_RCVBUF (sysctl); if lower, the kernel silently caps the request |
| PPS (packets/sec) | |
| ENA RX ring stall survival | at worst-case PPS |
| ENA RX queues (ethtool -L) | RSS hashes each flow (5-tuple) to ONE queue, so a single Aeron stream lands on one queue/ring/IRQ vCPU |
| NAK timers (Lever B: loss recovery) | derived from RTT: backoff ≥ 2×RTT (≈ ½ per-loss budget); linger ≈ 3×(backoff+RTT); group.size = real receiver count. Full derivation + measured proof on the Lever B page below |
| MTU recommendation | |
| L3 cache fit | |
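The byte-denominated part of that chain is simple enough to sketch by hand. The code below is an illustration of the rules listed above, not the calculator's implementation. The names are hypothetical, the BDP multiple defaults to 2 (the low end of the 2–4× range), the socket buffer is set to 2× the window, and the stall headroom is an input you must measure.

```python
from dataclasses import dataclass

KB, GB = 1024, 1024**3
FRAME_HEADER = 32       # bytes of frame header per message
FRAME_ALIGNMENT = 32    # frames are padded to a 32-byte boundary
DEFAULT_WINDOW = 128 * KB


def next_power_of_two(n: int) -> int:
    return 1 << (n - 1).bit_length()


@dataclass
class Projection:
    on_wire_bytes: int
    bandwidth_bytes_per_s: int
    bdp_bytes: int
    window: int
    term_length: int
    so_rcvbuf: int
    rmem_max: int


def project(msg_bytes: int, tps: int, rtt_us: int, stall_ms: int,
            mtu: int = 1408, bdp_multiple: int = 2) -> Projection:
    # Header plus payload, rounded up to the frame alignment.
    on_wire = (msg_bytes + FRAME_HEADER + FRAME_ALIGNMENT - 1) & ~(FRAME_ALIGNMENT - 1)
    bandwidth = tps * on_wire                              # lambda, bytes per second
    bdp = bandwidth * rtt_us // 1_000_000                  # lambda x RTT
    window = max(DEFAULT_WINDOW, mtu, bdp_multiple * bdp)  # floored at default and MTU
    stall_bytes = bandwidth * stall_ms // 1_000            # bytes arriving during a stall
    term = next_power_of_two(max(2 * window, 2 * stall_bytes))
    term = min(max(term, 64 * KB), GB)                     # legal range 64KB to 1GB
    so_rcvbuf = 2 * window                                 # >= window, 2x for headroom
    return Projection(on_wire, bandwidth, bdp, window, term, so_rcvbuf, so_rcvbuf)


p = project(msg_bytes=200, tps=1_000_000, rtt_us=200, stall_ms=10)
print(p)
```

With these sample inputs the window stays at the 128KB floor because 2 × BDP is smaller, while the term buffer is driven by the stall headroom term, not by Constraint A.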

Scope: this calculator covers loss prevention only (sizing the chain so a stall doesn’t drop). Recovery after a loss is governed by the NAK/retransmit timers — the ~10ms default backoff costs 10–20ms of tail per loss; consider tuning them for small-fan-out streams (see the “Lever B” section below). Also: outputs scale directly with inputs — use measured RTT and stall values, not guesses.

Two more numbers fall straight out of the same inputs:

MTU. Aeron’s default aeron.mtu.length is 1408 — one Aeron frame fits a standard 1500-byte Ethernet MTU with headroom for IP/UDP headers. Raise it to 8192 (with VPC jumbo frames, MTU 9001) when bandwidth is high or a single message doesn’t fit 1408: fewer, larger packets cut per-packet overhead and PPS. The trade-off is loss amplification — one lost 8KB datagram NAKs more data than one lost 1.4KB datagram — and remember Constraint B: the window must stay ≥ MTU, and SO_SNDBUF ≥ MTU is enforced at startup. Cross-region or internet paths usually can’t carry jumbo frames; stay at 1408 there.

PPS. EC2 enforces per-instance PPS allowances alongside bandwidth ones. Your packet rate is TPS ÷ messages-per-datagram (batched) up to TPS × datagrams-per-message (unbatched worst case) — the calculator shows both. Two reasons to keep PPS down: the ENA ring drains in packets, so halving PPS doubles the stall the same ring survives; and exceeding the instance allowance shows up as pps_allowance_exceeded drops in ethtool -S — invisible to the kernel, visible as gaps and NAKs. Smart batching and a bigger MTU are the two levers.
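The two PPS bounds can be computed directly from the message frame size and the MTU. This is a rough sketch with a hypothetical function name: it works in on-wire frame bytes and ignores the extra header carried by each fragment of a message larger than the MTU, so treat the large-message case as a lower bound.

```python
def pps_range(tps: int, frame_bytes: int, mtu: int) -> tuple[int, int]:
    """Return (best, worst) packets per second for a message rate.

    best  = fully batched: as many frames as fit in each datagram
    worst = unbatched: every message sent in its own datagram(s)
    """
    messages_per_datagram = max(1, mtu // frame_bytes)
    datagrams_per_message = -(-frame_bytes // mtu)        # ceiling division
    worst = tps * datagrams_per_message
    best = -(-worst // messages_per_datagram)             # ceiling division
    return best, worst


for mtu in (1408, 8192):
    best, worst = pps_range(tps=1_000_000, frame_bytes=256, mtu=mtu)
    print(f"MTU {mtu}: {best:,} to {worst:,} pps")
```

The gap between the two bounds is what batching buys, and the drop in the best case when the MTU grows is what jumbo frames buy.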

Everything above is Lever A: prevention — size the chain so a stall never drops a packet. But when a drop happens anyway, the tail cost is set by a different knob set entirely: the NAK/retransmit timers. You can pass every check on this page and still see a ~10–20 ms p99.9 from a single lost packet, because recovery time — not buffer size — is the floor on loss-event latency.

The defaults (Configuration.java):

| Parameter | Default | Role |
| --- | --- | --- |
| aeron.nak.unicast.delay | 1 µs | unicast: receiver NAKs almost immediately |
| aeron.nak.unicast.retry.delay.ratio | 100 | unicast: re-NAK after delay × 100 if no retransmit arrives |
| aeron.nak.multicast.max.backoff | 10 ms | multicast: receiver waits a random 0–10 ms before NAKing, so a large group doesn’t NAK-storm the sender |
| aeron.nak.multicast.group.size | 10 | the group-size estimate that backoff randomisation assumes |
| aeron.retransmit.unicast.delay | 0 | sender retransmits immediately on NAK |
| aeron.retransmit.unicast.linger | 10 ms | sender ignores duplicate NAKs for the same range while lingering |

The trap for cluster/market-data deployments: the multicast backoff is tuned for fan-out of ~10+, but a 3-node cluster or a 2-subscriber MDC channel pays the same random 0–10 ms wait — pure tail latency with no storm to prevent. For small, known fan-out, shrink aeron.nak.multicast.max.backoff (e.g. to 100 µs – 1 ms) and the per-loss tail drops by roughly the same amount. Unicast streams are already near-immediate and rarely need touching.
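The calculator's rule of thumb for these timers (backoff ≥ 2 × RTT, linger ≈ 3 × (backoff + RTT), group size = real receiver count) can be sketched as a starting point. The function name is hypothetical, the values are returned in microseconds for readability instead of in the driver's own value format, and the multipliers are the rules of thumb quoted on this page, not measured optima.

```python
def nak_timers(rtt_us: int, receivers: int) -> dict[str, int]:
    """Starting-point NAK timer values for a small, known fan-out.

    Durations are in microseconds; the group size is a plain count.
    """
    backoff_us = 2 * rtt_us                    # backoff >= 2 x RTT
    linger_us = 3 * (backoff_us + rtt_us)      # linger ~ 3 x (backoff + RTT)
    return {
        "aeron.nak.multicast.max.backoff": backoff_us,
        "aeron.retransmit.unicast.linger": linger_us,
        "aeron.nak.multicast.group.size": receivers,
    }


for name, value in nak_timers(rtt_us=200, receivers=3).items():
    print(name, value)
```

For a 200 µs RTT this gives a 400 µs backoff, which sits inside the 100 µs – 1 ms range suggested above; validate any such change against measured loss-recovery latency before deploying it.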

Recovery also re-reads the term buffer, because the retransmit is served from the log: the term buffer doubles as your retransmission history. That is also why catch-up reads falling out of L3 (the calculator’s L3 cache fit, above) compound a loss event.

For the derivation formulas (backoff from RTT, linger from double-loss budget), the validity checks, and the measured 79× tail collapse, see NAK Timer Tuning. For the full NAK protocol mechanics, defer to The Aeron Files.