NAK Timer Tuning (Lever B)
Buffers (Lever A) prevent loss; they cannot make a loss cheap. Once a packet does drop, the tail cost is set by the NAK/retransmit timers — and at their defaults, every loss on a multicast-semantics stream costs 10–20 ms, regardless of how well-sized your buffers are. This page gives the three knobs, the formulas to derive them from your RTT and receiver count, and the measured proof.
The three knobs and the proven profile
Media-driver properties, which must reach the driver process (see pitfall below):

    # default 10ms: THE dominant knob
    aeron.nak.multicast.max.backoff=500us
    # default 10: set to your REAL receiver count
    aeron.nak.multicast.group.size=3
    # default 10ms: re-retransmit suppression window
    aeron.retransmit.unicast.linger=2ms

Measured (bare-metal, 1 publisher → 3 MDC receivers, 224B @ 50k msg/s, identical seeded 0.01% loss):

| | default timers | tuned profile |
|---|---|---|
| p99.9 | 8,286 µs | 105 µs (~79× better) |
| Max | 17.4 ms | 2.99 ms (= the no-loss floor) |
| p50 | 37 µs | 37 µs (unchanged — only lossy packets pay) |
Two control arms confirmed the framing: with zero loss, calculator-minimum buffers and 16×-larger buffers were percentile-identical — bigger buffers buy nothing once they cover BDP (Lever A is about sufficiency, not size). And with loss + default timers, the tail landed at 17.4 ms — inside the 10–20 ms band the timer math predicts.
What each knob does
- aeron.nak.multicast.max.backoff: on detecting a gap, a receiver waits a randomised delay in [0, ~backoff] before NAKing, so a large group doesn’t NAK-storm the sender. At the 10 ms default, recovery hasn’t even started for up to 10 ms after a loss, and that’s most of the 20 ms tail. The randomisation is lambda = log(groupSize) + 1 over the backoff window (OptimalMulticastDelayGenerator), whose own javadoc sizes the window as maxBackoffT = K × GRTT, i.e. in units of the group RTT, not an absolute constant. The 10 ms default is K≈10 at internet-scale 1 ms RTT; on a 35 µs LAN it is K≈300, absurdly oversized.
- aeron.nak.multicast.group.size: the receiver count the randomisation assumes. Not a tunable: an input. Set it to the truth.
- aeron.retransmit.unicast.linger: after retransmitting a range, the sender ignores further NAKs for it for this long (duplicate suppression). It also blocks re-recovery if the retransmit itself is lost, so the default 10 ms makes a double-loss a 20 ms+ event.
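To make the effect of the backoff window and group size concrete, here is a minimal sketch of a truncated-exponential delay sampler built around lambda = log(groupSize) + 1. It is modelled on the general form used for NAK suppression in reliable multicast, not copied from OptimalMulticastDelayGenerator; the class name, method names, and the exact sampling expression are assumptions for illustration.

```java
import java.util.Random;

/** Illustrative truncated-exponential NAK delay sampler; NOT Aeron's source code. */
final class NakDelaySketch
{
    /** Draws one delay in [0, maxBackoffNs); the density rises toward the long end. */
    static double sampleDelayNs(final double maxBackoffNs, final int groupSize, final Random rnd)
    {
        final double lambda = Math.log(groupSize) + 1.0;
        final double expm1 = Math.exp(lambda) - 1.0;
        final double u = rnd.nextDouble(); // uniform in [0, 1)
        // log1p argument runs from 0 to e^lambda - 1, so the result runs from 0 to maxBackoffNs.
        return (maxBackoffNs / lambda) * Math.log1p(u * expm1);
    }

    /** Average of many draws from a seeded generator, for comparing settings. */
    static double meanDelayNs(final double maxBackoffNs, final int groupSize, final int samples, final long seed)
    {
        final Random rnd = new Random(seed);
        double sum = 0.0;
        for (int i = 0; i < samples; i++)
        {
            sum += sampleDelayNs(maxBackoffNs, groupSize, rnd);
        }
        return sum / samples;
    }

    public static void main(final String[] args)
    {
        // Same 500 us window, two assumed group sizes: the larger one waits longer on average.
        System.out.printf("mean delay, N=3:  %.0f ns%n", meanDelayNs(500_000, 3, 100_000, 42L));
        System.out.printf("mean delay, N=10: %.0f ns%n", meanDelayNs(500_000, 10, 100_000, 42L));
    }
}
```

Under this assumed form, most receivers draw delays near the top of the window and only a few draw short ones, which is what lets an early NAK suppress the rest. Overstating the group size pushes every receiver's expected delay later.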
The formulas
max.backoff: bounded by RTT below, your SLO above

    floor:   backoff ≥ ~2 × RTT                  (suppression + reordering safety)
    ceiling: backoff ≤ T_loss_budget − 2 × RTT   (per-loss tail SLO)
    choose:  backoff ≈ ½ × T_loss_budget, clamped to ≥ 2 × RTT

The floor exists because suppression is a round trip: the first receiver’s NAK → retransmit must reach the others before their timers fire. The ceiling: worst-case single-loss recovery ≈ backoff + 2 × RTT.
Worked: CPG RTT 35 µs, budget 1 ms → floor 70 µs → 500 µs (the measured profile). Cross-AZ RTT 1 ms, budget 5 ms → floor 2 ms → 2–3 ms. The honest consequence: backoff scales with RTT — cross-AZ links cannot have sub-ms recovery with safe suppression.
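The floor, ceiling, and choice rules can be sketched as a small calculator. The class and method names are hypothetical, and rejecting a budget smaller than 4 × RTT is an assumption that follows from requiring floor ≤ ceiling.

```java
/** Illustrative calculator for the max.backoff rules; names are hypothetical. */
final class NakBackoffSketch
{
    /** floor: backoff >= 2 x RTT. */
    static long floorNs(final long rttNs)
    {
        return 2 * rttNs;
    }

    /** ceiling: backoff <= T_loss_budget - 2 x RTT. */
    static long ceilingNs(final long rttNs, final long lossBudgetNs)
    {
        return lossBudgetNs - 2 * rttNs;
    }

    /** choose: half the loss budget, clamped up to the floor. */
    static long chooseNs(final long rttNs, final long lossBudgetNs)
    {
        final long floor = floorNs(rttNs);
        if (ceilingNs(rttNs, lossBudgetNs) < floor)
        {
            // Assumed handling: no backoff satisfies both bounds when budget < 4 x RTT.
            throw new IllegalArgumentException("loss budget is below 4 x RTT");
        }
        return Math.max(floor, lossBudgetNs / 2);
    }

    /** Worst-case single-loss recovery: backoff + 2 x RTT. */
    static long worstCaseSingleLossNs(final long backoffNs, final long rttNs)
    {
        return backoffNs + 2 * rttNs;
    }

    public static void main(final String[] args)
    {
        // The two worked cases from the text: 35 us RTT / 1 ms budget, 1 ms RTT / 5 ms budget.
        System.out.println("CPG backoff ns:      " + chooseNs(35_000L, 1_000_000L));
        System.out.println("cross-AZ backoff ns: " + chooseNs(1_000_000L, 5_000_000L));
    }
}
```

Keeping everything in nanoseconds as long values matches the units of the programmatic MediaDriver.Context setters, so a derived value can be passed through without conversion.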
group.size — just tell it the truth

    group.size = N_actual    (real receiver count; max-expected if it varies)

Too high → the distribution skews long. Below the real N → receivers draw similar delays → duplicate NAKs (the implosion the mechanism prevents). 2 followers for a 3-node cluster’s log channel; 3 for a 3-receiver MDC.
linger — one recovery round below, double-loss budget above

    floor:  linger ≥ backoff + RTT             (absorb late NAKs from the SAME loss)
    choose: linger ≈ 2–4 × (backoff + RTT)
    double-loss worst case ≈ linger + backoff + 2 × RTT

Worked: backoff 500 µs + RTT 100 µs → floor 600 µs → 2 ms → double-loss ≈ 2 ms + 500 µs + 200 µs = 2.7 ms (vs 20 ms+ at default).
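The linger bounds above can be sketched the same way. The names are hypothetical, and the multiple is restricted to the 2–4 range given in the formula.

```java
/** Illustrative calculator for the retransmit linger rules; names are hypothetical. */
final class RetransmitLingerSketch
{
    /** floor: linger >= backoff + RTT. */
    static long floorNs(final long backoffNs, final long rttNs)
    {
        return backoffNs + rttNs;
    }

    /** choose: linger = multiple x (backoff + RTT), with multiple in 2..4. */
    static long chooseNs(final long backoffNs, final long rttNs, final int multiple)
    {
        if (multiple < 2 || multiple > 4)
        {
            throw new IllegalArgumentException("multiple must be between 2 and 4");
        }
        return multiple * floorNs(backoffNs, rttNs);
    }

    /** Double-loss worst case: linger + backoff + 2 x RTT. */
    static long doubleLossWorstCaseNs(final long lingerNs, final long backoffNs, final long rttNs)
    {
        return lingerNs + backoffNs + 2 * rttNs;
    }

    public static void main(final String[] args)
    {
        final long backoffNs = 500_000L;
        final long rttNs = 100_000L;
        System.out.println("linger floor ns:     " + floorNs(backoffNs, rttNs));
        System.out.println("linger range ns:     " + chooseNs(backoffNs, rttNs, 2) + " to " + chooseNs(backoffNs, rttNs, 4));
        System.out.println("double loss, tuned:  " + doubleLossWorstCaseNs(2_000_000L, backoffNs, rttNs));
        System.out.println("double loss, 10 ms defaults: " + doubleLossWorstCaseNs(10_000_000L, 10_000_000L, rttNs));
    }
}
```

A 2 ms linger sits inside the 1.2–2.4 ms range this produces for a 500 µs backoff and 100 µs RTT.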
Three validity checks — before trusting any of this
- Isolated-recovery regime. Timers only matter if recoveries don’t overlap: loss_rate × msg_rate × T_recovery < 0.1. Measured failure mode: at 2% loss × 100k msg/s the stream collapsed to a 444 ms mean no matter the timers, because overlapping recoveries queue. Timers are for sparse residual loss; frequent loss is a Lever A problem.
- Retransmit-store coverage. Retransmits come from the term buffer: λ × (slowest_receiver_lag + T_recovery) ≤ termLength / 2, or the gap is overwritten and the image breaks. For the units to match, λ here must be the publish rate in bytes per second; it is not the randomisation lambda above.
- Implosion guard. The 10 ms default protects 1000-receiver groups. The aggressive profile is proven at N ≤ ~10; for N ≫ 10, scale group.size to the real N and keep the backoff window wide enough to spread N simultaneous NAKs. Re-derive, don’t copy.
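The first two checks reduce to arithmetic, sketched below. The names are hypothetical; T_recovery = 570 µs (500 µs backoff + 2 × 35 µs RTT), the 1 ms receiver lag, the 64 KiB term length, and the payload-only byte rate (224 B × 50k msg/s, ignoring frame headers) are illustrative assumptions, not measured values.

```java
/** Illustrative validity checks for the timer profile; names and inputs are hypothetical. */
final class TimerValiditySketch
{
    /** loss_rate x msg_rate x T_recovery: roughly the number of recoveries in flight. */
    static double recoveryOverlap(final double lossRate, final double msgPerSec, final double recoverySec)
    {
        return lossRate * msgPerSec * recoverySec;
    }

    /** Isolated-recovery regime holds while the overlap stays below 0.1. */
    static boolean isIsolatedRecovery(final double lossRate, final double msgPerSec, final double recoverySec)
    {
        return recoveryOverlap(lossRate, msgPerSec, recoverySec) < 0.1;
    }

    /** byteRate x (slowest lag + T_recovery) must fit within half a term. */
    static boolean retransmitStoreCovers(
        final double bytesPerSec, final double slowestLagSec, final double recoverySec, final long termLengthBytes)
    {
        return bytesPerSec * (slowestLagSec + recoverySec) <= termLengthBytes / 2.0;
    }

    public static void main(final String[] args)
    {
        final double recoverySec = 570e-6;        // assumed: 500 us backoff + 2 x 35 us RTT
        final double bytesPerSec = 224.0 * 50_000; // assumed: payload only, no frame headers

        System.out.println("0.01% loss @ 50k msg/s isolated: " + isIsolatedRecovery(0.0001, 50_000, recoverySec));
        System.out.println("2% loss @ 100k msg/s isolated:   " + isIsolatedRecovery(0.02, 100_000, recoverySec));
        System.out.println("64 KiB term covers 1 ms lag:     " + retransmitStoreCovers(bytesPerSec, 1e-3, recoverySec, 64 * 1024));
    }
}
```

Under these assumed inputs the 2% × 100k msg/s case has more than one recovery in flight at any time, which is consistent with the collapse described in the first check.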
How to apply
The operational pitfall: these are media-driver properties. The driver is a separate process, so it
does not inherit your application JVM’s -D flags.

    # Java driver: flags on the DRIVER's JVM (or a properties file argument):
    java -Daeron.nak.multicast.max.backoff=500us \
         -Daeron.nak.multicast.group.size=3 \
         -Daeron.retransmit.unicast.linger=2ms \
         ... io.aeron.driver.MediaDriver

    # C driver (aeronmd): env vars:
    AERON_NAK_MULTICAST_MAX_BACKOFF=500us
    AERON_NAK_MULTICAST_GROUP_SIZE=3
    AERON_RETRANSMIT_UNICAST_LINGER=2ms

Time values accept us/ms suffixes; restart the driver to take effect.
Embedded driver (MediaDriver.launchEmbedded)
If your app embeds the driver, there is no separate process, so the pitfall above doesn’t apply, and you have two options:

    // Option 1: -D flags on the APP JVM (they reach the embedded driver):
    // java -Daeron.nak.multicast.max.backoff=500us ... MyTradingApp

    // Option 2: programmatic, explicit and grep-able (setter names from MediaDriver.java):
    import io.aeron.driver.MediaDriver;
    import java.util.concurrent.TimeUnit;

    final MediaDriver.Context ctx = new MediaDriver.Context()
        .nakMulticastMaxBackoffNs(TimeUnit.MICROSECONDS.toNanos(500))
        .nakMulticastGroupSize(3)
        .retransmitUnicastLingerNs(TimeUnit.MILLISECONDS.toNanos(2));
    final MediaDriver driver = MediaDriver.launchEmbedded(ctx);

One timing trap with Option 1’s programmatic cousin: the Context captures its defaults from system properties at construction time, so System.setProperty(...) works only if it runs before new MediaDriver.Context(). Set the flags on the command line (or use the explicit setters) and the question never arises.
Set the profile on every host — the receiver side of each lossy hop is where NAK delay accrues.
Scope: which streams these govern
The multicast knobs govern multicast-semantics streams: IP multicast and MDC (dynamic and
manual control modes) — which includes Aeron Cluster’s log-replication channel. Plain unicast
streams already NAK near-immediately (aeron.nak.unicast.delay = 1 µs) — but their retry delay is
delay × aeron.nak.unicast.retry.delay.ratio (default ×100), so keep retransmit.unicast.linger low
on unicast too if double-loss matters.
Order of operations
Tune timers after loss prevention, not instead of it: first stop the loss (RX ring → max, CPU
isolation, the receive-path stack); then this profile caps the
cost of what remains. Verify with the one-line check: measured per-loss cost should be ≈
backoff + 2 × RTT (single loss) — if it’s far above that, something else is in the path.
Sources
- Aeron driver source: Configuration.java (NAK/retransmit property names and defaults) and OptimalMulticastDelayGenerator.java (the lambda = log(groupSize) + 1 randomisation and maxBackoffT = K × GRTT sizing).
- Measured results: bare-metal 4-arm A/B (calculator-minimum vs large buffers; seeded 0.01% loss with default vs derived timers), 1→3 MDC, fixed loss seed for valid A/B.