NAK Timer Tuning (Lever B)

Buffers (Lever A) prevent loss; they cannot make a loss cheap. Once a packet does drop, the tail cost is set by the NAK/retransmit timers — and at their defaults, every loss on a multicast-semantics stream costs 10–20 ms, regardless of how well-sized your buffers are. This page gives the three knobs, the formulas to derive them from your RTT and receiver count, and the measured proof.

The three knobs and the proven profile

Media-driver properties — they must reach the driver process (see pitfall below):

aeron.nak.multicast.max.backoff=500us    # default 10ms — THE dominant knob
aeron.nak.multicast.group.size=3         # default 10   — set to your REAL receiver count
aeron.retransmit.unicast.linger=2ms      # default 10ms — re-retransmit suppression window

Measured (bare-metal, 1 publisher → 3 MDC receivers, 224 B messages @ 50k msg/s, identical seeded 0.01% loss in both arms):

Metric	Default timers	Tuned profile
p99.9	8,286 µs	105 µs (~79× better)
Max	17.4 ms	2.99 ms (= the no-loss floor)
p50	37 µs	37 µs (unchanged; only lossy packets pay)

Two control arms confirmed the framing: with zero loss, calculator-minimum buffers and 16×-larger buffers were percentile-identical — bigger buffers buy nothing once they cover BDP (Lever A is about sufficiency, not size). And with loss + default timers, the tail landed at 17.4 ms — inside the 10–20 ms band the timer math predicts.

What each knob does

aeron.nak.multicast.max.backoff — on detecting a gap, a receiver waits a randomised delay in [0, ~backoff] before NAKing, so a large group doesn’t NAK-storm the sender. At the 10 ms default, recovery hasn’t even started for up to 10 ms after a loss — that’s most of the 20 ms tail. The randomisation is lambda = log(groupSize) + 1 over the backoff window (OptimalMulticastDelayGenerator), whose own javadoc sizes the window as maxBackoffT = K × GRTT — i.e. in units of the group RTT, not an absolute constant. The 10 ms default is K≈10 at internet-scale 1 ms RTT; on a 35 µs LAN it is K≈300, absurdly oversized.
aeron.nak.multicast.group.size — the receiver count the randomisation assumes. Not a tunable: an input. Set it to the truth.
aeron.retransmit.unicast.linger — after retransmitting a range, the sender ignores further NAKs for it for this long (duplicate suppression). It also blocks re-recovery if the retransmit itself is lost — so the default 10 ms makes a double-loss a 20 ms+ event.
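
To make the randomised delay concrete, here is a minimal sketch of a truncated-exponential draw built from the lambda = log(groupSize) + 1 term described above. The exact mapping from a uniform sample to a delay is an assumption for illustration, not a copy of OptimalMulticastDelayGenerator; the class and method names (NakDelaySketch, delayUs) are hypothetical. Check the driver source before relying on the numbers.

```java
import java.util.Random;

/** Illustrative truncated-exponential NAK delay (assumed form, not Aeron's code). */
final class NakDelaySketch {

    /**
     * Maps a uniform sample u in [0, 1] to a delay in [0, maxBackoffUs].
     * Most of the probability mass sits near the top of the window, so only
     * a few receivers fire early and the rest can be suppressed.
     */
    static double delayUs(double u, double maxBackoffUs, int groupSize) {
        double lambda = Math.log(groupSize) + 1.0;
        // log1p(x) = ln(1 + x), expm1(x) = e^x - 1
        return (maxBackoffUs / lambda) * Math.log1p(u * Math.expm1(lambda));
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are repeatable
        double maxBackoffUs = 500.0;
        int groupSize = 3;
        double earliest = Double.MAX_VALUE;
        for (int receiver = 0; receiver < groupSize; receiver++) {
            double d = delayUs(rng.nextDouble(), maxBackoffUs, groupSize);
            earliest = Math.min(earliest, d);
            System.out.printf("receiver %d waits %.1f us%n", receiver, d);
        }
        System.out.printf("first NAK fires after %.1f us%n", earliest);
    }
}
```

Under this assumed form the median draw for a 500 µs window and group size 3 lands at roughly 360 µs, which is why the window size, not the RTT, dominates the per-loss cost.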

The formulas

`max.backoff` — bounded by RTT below, your SLO above

floor:    backoff ≥ ~2 × RTT                       (suppression + reordering safety)
ceiling:  backoff ≤ T_loss_budget − 2 × RTT        (per-loss tail SLO)
choose:   backoff ≈ ½ × T_loss_budget, clamped to ≥ 2 × RTT

The floor exists because suppression is a round trip: the first receiver’s NAK → retransmit must reach the others before their timers fire. The ceiling follows from the worst-case single-loss recovery, ≈ backoff + 2 × RTT, which must fit inside T_loss_budget.

Worked: CPG RTT 35 µs, budget 1 ms → floor 70 µs → 500 µs (the measured profile). Cross-AZ RTT 1 ms, budget 5 ms → floor 2 ms → 2–3 ms. The honest consequence: backoff scales with RTT — cross-AZ links cannot have sub-ms recovery with safe suppression.
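
The floor/ceiling/choose rules above can be sketched as a small calculator. This is a minimal illustration, assuming microsecond inputs; the class and method names (BackoffCalc, chooseUs) are hypothetical and not part of Aeron.

```java
/** Derives max.backoff from RTT and a per-loss budget (illustrative helper). */
final class BackoffCalc {

    /** floor: backoff >= 2 x RTT */
    static long floorUs(long rttUs) {
        return 2 * rttUs;
    }

    /** ceiling: backoff <= T_loss_budget - 2 x RTT */
    static long ceilingUs(long rttUs, long budgetUs) {
        return budgetUs - 2 * rttUs;
    }

    /**
     * choose: half the budget, clamped to the floor.
     * Returns -1 when the floor exceeds the ceiling, i.e. the budget
     * cannot be met at this RTT with safe suppression.
     */
    static long chooseUs(long rttUs, long budgetUs) {
        long floor = floorUs(rttUs);
        if (floor > ceilingUs(rttUs, budgetUs)) {
            return -1;
        }
        return Math.max(floor, budgetUs / 2);
    }

    public static void main(String[] args) {
        System.out.println("RTT 35 us, budget 1 ms  -> " + chooseUs(35, 1_000) + " us");
        System.out.println("RTT 1 ms,  budget 5 ms  -> " + chooseUs(1_000, 5_000) + " us");
        System.out.println("RTT 1 ms,  budget 1 ms  -> " + chooseUs(1_000, 1_000) + " (infeasible)");
    }
}
```

The two feasible cases return 500 µs and 2,500 µs, matching the worked numbers; the third shows the sub-millisecond budget at a 1 ms RTT being rejected rather than silently clamped.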

`group.size` — just tell it the truth

group.size = N_actual    (real receiver count; max-expected if it varies)

Too high → the distribution skews long. Below the real N → receivers draw similar delays → duplicate NAKs (the implosion the mechanism prevents). Examples: 2 for a 3-node cluster’s log channel (the receivers are the 2 followers); 3 for a 3-receiver MDC.

`linger` — one recovery round below, double-loss budget above

floor:    linger ≥ backoff + RTT                   (absorb late NAKs from the SAME loss)
choose:   linger ≈ 2–4 × (backoff + RTT)
double-loss worst case ≈ linger + backoff + 2 × RTT

Worked: backoff 500 µs + RTT 100 µs → floor 600 µs → 2 ms → double-loss ≈ 2.7 ms (2 ms + 500 µs + 2 × 100 µs; vs 20 ms+ at default).
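
The same arithmetic for linger, as a minimal sketch in microseconds. The names (LingerCalc, chooseUs, doubleLossWorstCaseUs) are hypothetical, and the multiple is simply the 2 to 4 range from the choose rule above.

```java
/** Derives retransmit linger and the double-loss worst case (illustrative helper). */
final class LingerCalc {

    /** floor: linger >= backoff + RTT */
    static long floorUs(long backoffUs, long rttUs) {
        return backoffUs + rttUs;
    }

    /** choose: linger = multiple x (backoff + RTT), with multiple in 2..4 */
    static long chooseUs(long backoffUs, long rttUs, int multiple) {
        if (multiple < 2 || multiple > 4) {
            throw new IllegalArgumentException("multiple must be between 2 and 4");
        }
        return multiple * (backoffUs + rttUs);
    }

    /** double-loss worst case = linger + backoff + 2 x RTT */
    static long doubleLossWorstCaseUs(long lingerUs, long backoffUs, long rttUs) {
        return lingerUs + backoffUs + 2 * rttUs;
    }

    public static void main(String[] args) {
        long backoffUs = 500;
        long rttUs = 100;
        System.out.println("floor:        " + floorUs(backoffUs, rttUs) + " us");
        System.out.println("choose range: " + chooseUs(backoffUs, rttUs, 2)
            + " to " + chooseUs(backoffUs, rttUs, 4) + " us");
        System.out.println("double-loss with 2 ms linger: "
            + doubleLossWorstCaseUs(2_000, backoffUs, rttUs) + " us");
    }
}
```

A 2 ms linger sits inside the 1,200 to 2,400 µs range the choose rule gives for these inputs.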

Three validity checks — before trusting any of this

Isolated-recovery regime. Timers only matter if recoveries don’t overlap: loss_rate × msg_rate × T_recovery < 0.1. Measured failure mode: at 2% loss × 100k msg/s the stream collapsed to a 444 ms mean no matter the timers — overlapping recoveries queue. Timers are for sparse residual loss; frequent loss is a Lever A problem.
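Retransmit-store coverage. Retransmits come from the term buffer: λ × (slowest_receiver_lag + T_recovery) ≤ termLength / 2, where λ here is the publication’s byte rate in bytes/s (not the lambda of the delay generator). Otherwise the gap is overwritten and the image breaks.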
Implosion guard. The 10 ms default protects 1000-receiver groups. The aggressive profile is proven at N ≤ ~10; for N ≫ 10, scale group.size to the real N and keep the backoff window wide enough to spread N simultaneous NAKs — re-derive, don’t copy.
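
The first two checks are simple inequalities and can be scripted before a profile is rolled out. The sketch below is illustrative only: the names (ValidityChecks, isIsolatedRecovery, termCoversRecovery) are hypothetical, the byte rate ignores frame headers and alignment, and the 570 µs recovery time and 10 ms receiver lag in main are assumed example inputs, not measurements.

```java
/** Pre-flight checks for the timer profile (illustrative helper). */
final class ValidityChecks {

    /** loss_rate x msg_rate x T_recovery: expected number of overlapping recoveries. */
    static double overlapFactor(double lossRate, double msgPerSec, double recoverySec) {
        return lossRate * msgPerSec * recoverySec;
    }

    /** Isolated-recovery regime: the overlap factor must stay below 0.1. */
    static boolean isIsolatedRecovery(double lossRate, double msgPerSec, double recoverySec) {
        return overlapFactor(lossRate, msgPerSec, recoverySec) < 0.1;
    }

    /** Retransmit-store coverage: byteRate x (lag + T_recovery) <= termLength / 2. */
    static boolean termCoversRecovery(
        double bytesPerSec, double lagSec, double recoverySec, long termLengthBytes) {
        return bytesPerSec * (lagSec + recoverySec) <= termLengthBytes / 2.0;
    }

    public static void main(String[] args) {
        double recoverySec = 570e-6; // assumed: 500 us backoff + 2 x 35 us RTT

        // Sparse loss, as in the measured profile: 0.01% of 50k msg/s.
        System.out.println("sparse loss isolated: "
            + isIsolatedRecovery(0.0001, 50_000, recoverySec));

        // The documented failure mode: 2% of 100k msg/s.
        System.out.println("heavy loss isolated:  "
            + isIsolatedRecovery(0.02, 100_000, recoverySec));

        // Payload-only byte rate for 224 B messages at 50k msg/s.
        double bytesPerSec = 224.0 * 50_000;
        System.out.println("16 MiB term covers:   "
            + termCoversRecovery(bytesPerSec, 0.010, recoverySec, 16L * 1024 * 1024));
    }
}
```

The implosion guard is deliberately left out: it depends on how wide the backoff window must be for a given N, which has to be re-derived rather than checked against a fixed threshold.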

How to apply

The operational pitfall: these are media-driver properties. The driver is a separate process — it does not inherit your application JVM’s -D flags.

# Java driver — flags on the DRIVER's JVM (or a properties file argument):
java -Daeron.nak.multicast.max.backoff=500us \
     -Daeron.nak.multicast.group.size=3 \
     -Daeron.retransmit.unicast.linger=2ms \
     ... io.aeron.driver.MediaDriver

# C driver (aeronmd) — env vars:
AERON_NAK_MULTICAST_MAX_BACKOFF=500us
AERON_NAK_MULTICAST_GROUP_SIZE=3
AERON_RETRANSMIT_UNICAST_LINGER=2ms

Time values accept unit suffixes such as us and ms. Restart the driver for the changes to take effect.

Embedded driver (`MediaDriver.launchEmbedded`)

If your app embeds the driver, there is no separate process — the pitfall above doesn’t apply, and you have two options:

// Option 1 — -D flags on the APP JVM (they reach the embedded driver):
//   java -Daeron.nak.multicast.max.backoff=500us ... MyTradingApp

// Option 2 — programmatic, explicit and grep-able (setter names from MediaDriver.java):
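import io.aeron.driver.MediaDriver;
import java.util.concurrent.TimeUnit;

final MediaDriver.Context ctx = new MediaDriver.Context()
    .nakMulticastMaxBackoffNs(TimeUnit.MICROSECONDS.toNanos(500))  // aeron.nak.multicast.max.backoff
    .nakMulticastGroupSize(3)                                       // aeron.nak.multicast.group.size
    .retransmitUnicastLingerNs(TimeUnit.MILLISECONDS.toNanos(2));   // aeron.retransmit.unicast.linger
final MediaDriver driver = MediaDriver.launchEmbedded(ctx);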

One timing trap if you set the properties in code rather than on the command line: the Context captures its defaults from system properties at construction time, so System.setProperty(...) works only if it runs before new MediaDriver.Context(). Set the flags on the command line (or use the explicit setters) and the question never arises.

Set the profile on every host — the receiver side of each lossy hop is where NAK delay accrues.

Scope: which streams these govern

The multicast knobs govern multicast-semantics streams: IP multicast and MDC (dynamic and manual control modes) — which includes Aeron Cluster’s log-replication channel. Plain unicast streams already NAK near-immediately (aeron.nak.unicast.delay = 1 µs) — but their retry delay is delay × aeron.nak.unicast.retry.delay.ratio (default ×100), so keep retransmit.unicast.linger low on unicast too if double-loss matters.

Order of operations

Tune timers after loss prevention, not instead of it: first stop the loss (RX ring → max, CPU isolation, the receive-path stack); then this profile caps the cost of what remains. Verify with the one-line check: measured per-loss cost should be ≈ backoff + 2 × RTT (single loss) — if it’s far above that, something else is in the path.

Sources

Aeron driver source — Configuration.java (NAK/retransmit property names and defaults) and OptimalMulticastDelayGenerator.java (the lambda = log(groupSize) + 1 randomisation and maxBackoffT = K × GRTT sizing).
Measured results: bare-metal 4-arm A/B (calculator-minimum vs large buffers; seeded 0.01% loss with default vs derived timers), 1→3 MDC, fixed loss seed for valid A/B.