Transport Performance Factors and Their Impact

Ten factors move Aeron Transport performance. They form a stack — from the protocol on top to the hardware at the bottom — and you should tune them roughly in that order. This page maps each factor to the number it actually moves: throughput, p50 (median), or p99 (tail).

The ten factors, by layer

The factors stack from protocol-level down to hardware-level. Get the foundation right before reaching for the knobs above it.

Protocol & buffer tuning

Initial Window Size — controls how much data can be in-flight before requiring ACKs.
NAK / Retransmit Timers — how long receivers wait before NAKing a gap (and senders before re-retransmitting). On multicast-semantics streams (IP multicast, MDC, the Cluster log channel) the default 10 ms backoff makes every residual loss a 10–20 ms tail event — see NAK Timer Tuning.
Term Buffer Size — size of the log buffer per term rotation.

OS & driver level

OS Network Send/Receive Buffer Size — the SO_SNDBUF / SO_RCVBUF kernel socket buffers.
ENA Send/Receive Buffer Size — AWS Elastic Network Adapter ring buffer depth.
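
To check what the kernel actually grants for SO_SNDBUF / SO_RCVBUF, a minimal Python sketch follows. It is not Aeron code: the function name `effective_udp_buffers` and the 2 MiB request are illustrative assumptions. On Linux the kernel caps the request at net.core.wmem_max / net.core.rmem_max and reports roughly double the accepted value to cover bookkeeping overhead, so the number read back rarely equals the number requested.

```python
import socket


def effective_udp_buffers(requested_bytes):
    """Request SO_SNDBUF/SO_RCVBUF on a UDP socket and return (send, receive)
    sizes as reported back by the kernel."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested_bytes)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        # Read back what was granted; it may be capped or doubled by the OS.
        granted_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        granted_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return granted_send, granted_recv


if __name__ == "__main__":
    requested = 2 * 1024 * 1024  # hypothetical 2 MiB request
    send_size, recv_size = effective_udp_buffers(requested)
    print(f"requested={requested} granted send={send_size} recv={recv_size}")
```

If the granted size is far below the request, the system-wide maximum is probably the limiting factor rather than the application setting.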

Hardware affinity

CPU Near the NIC Card — NUMA locality between processing cores and the network interface.
Term Buffer Fits into the CPU’s L3 Cache — keeping hot data in cache vs. spilling to DRAM.
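
As a starting point for checking NUMA locality, the sketch below reads the NUMA node that Linux reports for a network interface from sysfs. It assumes a Linux host and a PCI-attached NIC; the function name `nic_numa_node` and the interface name `eth0` are hypothetical, and the `sysfs_root` parameter exists only to make the function testable.

```python
from pathlib import Path
from typing import Optional


def nic_numa_node(interface, sysfs_root="/sys") -> Optional[int]:
    """Return the NUMA node Linux reports for a NIC, or None if unknown."""
    node_file = (
        Path(sysfs_root) / "class" / "net" / interface / "device" / "numa_node"
    )
    try:
        node = int(node_file.read_text().strip())
    except (OSError, ValueError):
        # Missing file (virtual device, non-Linux host) or unreadable content.
        return None
    # The kernel writes -1 when it has no NUMA information for the device.
    return node if node >= 0 else None


if __name__ == "__main__":
    print(nic_numa_node("eth0"))  # "eth0" is an illustrative interface name
```

The result can then be compared with the node that owns the cores running the Aeron threads, for example by reading /sys/devices/system/node/node<N>/cpulist.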

System level

Kernel Overhead / Bypass — eliminating the kernel networking stack (premium: ef_vi / DPDK / VMA).
Thread Pinning / Core Isolation — dedicating CPU cores to Aeron threads, preventing scheduler interference — see Core Isolation & Thread Pinning.
Resource Utilization (CPU / Bandwidth) — keeping utilization below saturation to handle jitter.

Utilization matters because of the shape of the response-time curve. For a single-server queue, response time grows roughly as 1 / (1 − ρ), where ρ is utilization: gentle at low utilization, then steepening into a knee at about 70% (analytically ≈71.5%). There is no magic “safe” number: latency is about 2× the unloaded baseline at 50% utilization, 3.3× at 70%, 5× at 80%, 10× at 90%, and 20× at 95%. A latency-sensitive system should run below the knee, typically at 50–70%, so that it keeps headroom to absorb bursts; run it “hot” near saturation and the tail explodes.
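
The multipliers quoted above follow directly from the 1 / (1 − ρ) approximation. A minimal sketch, assuming that simple single-server model holds (the function name `response_time_multiplier` is illustrative):

```python
def response_time_multiplier(utilization):
    """Response time relative to the unloaded baseline, using the
    single-server approximation 1 / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


if __name__ == "__main__":
    for rho in (0.50, 0.70, 0.80, 0.90, 0.95):
        print(f"{rho:.0%} utilization -> {response_time_multiplier(rho):.1f}x baseline")
```

Evaluating it at 50%, 70%, 80%, 90%, and 95% reproduces the 2×, 3.3×, 5×, 10×, and 20× figures.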

This is a rule of thumb, not a constant. It applies to a single queue (a per-core or per-flow hot path) with bursty arrivals. Two things raise the safe ceiling. The first is a multi-core or multi-server pool: with more parallel servers, queues form much later, so a fleet can run at 80–90%. The second is deterministic or batched service: fixed-cost work incurs half the queuing penalty of random-duration work, which is exactly why Aeron’s smart batching lets it sustain high utilization without inflating the tail. The real fix for “high utilization and low tail” is removing arrival randomness via batching, not chasing a single percentage.
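
The "half the queuing penalty" claim can be illustrated with the Pollaczek–Khinchine formula for mean queue wait. This is a sketch under textbook assumptions (Poisson arrivals, one server), not a model of Aeron itself; the function name and the example numbers are illustrative. Exponential service times have a squared coefficient of variation of 1, fixed service times have 0.

```python
def mean_queue_wait(utilization, mean_service_time, service_cv_squared):
    """Mean time spent waiting in queue for an M/G/1 queue
    (Pollaczek-Khinchine): rho / (1 - rho) * (1 + Cs^2) / 2 * S."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    load_factor = utilization / (1.0 - utilization)
    variability_factor = (1.0 + service_cv_squared) / 2.0
    return load_factor * variability_factor * mean_service_time


if __name__ == "__main__":
    rho, service_time = 0.8, 1.0  # illustrative values
    random_service = mean_queue_wait(rho, service_time, service_cv_squared=1.0)
    fixed_service = mean_queue_wait(rho, service_time, service_cv_squared=0.0)
    print(f"random-duration work: {random_service:.2f}")
    print(f"fixed-cost work:      {fixed_service:.2f}")
    print(f"ratio: {fixed_service / random_service:.2f}")
```

At any utilization the ratio is 0.5, which is where the factor of two comes from.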

Figure: Response time vs. utilization for a single-server queue (∝ 1/(1−ρ)), showing a knee region around 70–85% and a recommended low-latency operating zone below it. At 80% utilization, latency is already ~5× the baseline.

How each factor moves the numbers

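This star-rating matrix shows the relative impact of each tuning parameter on throughput, p50 latency, and p99 latency. More stars mean a bigger lever. Resource utilization is not rated here; JVM prewarm (JIT warmup), which is not among the ten factors listed above, takes the tenth row.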

Parameter	Throughput	p50 Latency	p99 / Tail Latency
Initial Window Size	⭐⭐	⭐	⭐⭐
NAK / Retransmit Timers	⭐	⭐	⭐⭐⭐ †
Term Buffer Size	⭐⭐	⭐	⭐⭐
OS Network Send/Receive Buffer Size	⭐⭐	⭐	⭐⭐
ENA Send/Receive Buffer Size	⭐⭐	⭐	⭐⭐
CPU Near the NIC Card (NUMA-local)	⭐⭐	⭐⭐	⭐⭐⭐
Term Buffer Fits into L3 Cache	⭐⭐	⭐⭐	⭐⭐⭐
Kernel Bypass	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐
Thread Pinning / Core Isolation	⭐⭐	⭐⭐	⭐⭐⭐
JVM Prewarm (JIT Warmup)	⭐	⭐⭐	⭐⭐⭐

† On multicast-semantics streams (MDC, IP multicast, the Cluster log channel) with residual loss. Measured with an identical seeded 0.01% loss: p99.9 was 8,286 µs with default timers and 105 µs with derived timers (~79× lower). With zero loss the timers do nothing: they cap the cost of a loss, not its probability.
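
Encoding the matrix as data makes it easy to query. The sketch below transcribes the star counts from the table above as (throughput, p50, p99) tuples; the helper names are illustrative.

```python
COLUMNS = ("throughput", "p50", "p99")

# Star counts transcribed from the matrix: (throughput, p50, p99).
RATINGS = {
    "Initial Window Size": (2, 1, 2),
    "NAK / Retransmit Timers": (1, 1, 3),
    "Term Buffer Size": (2, 1, 2),
    "OS Network Send/Receive Buffer Size": (2, 1, 2),
    "ENA Send/Receive Buffer Size": (2, 1, 2),
    "CPU Near the NIC Card (NUMA-local)": (2, 2, 3),
    "Term Buffer Fits into L3 Cache": (2, 2, 3),
    "Kernel Bypass": (3, 3, 3),
    "Thread Pinning / Core Isolation": (2, 2, 3),
    "JVM Prewarm (JIT Warmup)": (1, 2, 3),
}


def three_star_everywhere(ratings):
    """Parameters rated three stars in every column."""
    return [name for name, stars in ratings.items() if all(s == 3 for s in stars)]


def three_star_counts(ratings):
    """Number of three-star ratings per column."""
    return {
        column: sum(1 for stars in ratings.values() if stars[index] == 3)
        for index, column in enumerate(COLUMNS)
    }


if __name__ == "__main__":
    print(three_star_everywhere(RATINGS))
    print(three_star_counts(RATINGS))
```

Only Kernel Bypass is returned by the first query, and six of the eight three-star ratings fall in the p99 column.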

What the matrix tells you

Five patterns fall out of the ratings:

Kernel bypass is the only ⭐⭐⭐ across all three columns. It is the single biggest lever for overall performance.
Hardware affinity disproportionately affects p99. NUMA locality, L3 residency, and thread pinning all earn ⭐⭐⭐ on the tail — because tail latency is dominated by cache misses, context switches, and NUMA penalties.
Protocol parameters mainly affect throughput and p99. Window, NAK, and buffers have minimal p50 impact; they matter most under load. They also split into two levers with different jobs: Lever A (buffers) prevents loss; Lever B (NAK timers) caps the cost of a loss. In a measured A/B comparison, once buffers covered the bandwidth-delay product (BDP), 16×-larger buffers changed nothing percentile-for-percentile, whereas on a lossy stream the NAK profile moved p99.9 by ~79×. Don’t try to buy a better tail with bigger buffers; buy it with timer tuning after sizing the chain.
p50 is relatively easy to optimize; p99 is where the real work is. Most ⭐⭐⭐ ratings sit in the p99 column.
JVM prewarm is a p99 problem, not a throughput one. A cold JIT means the first requests hit interpreted bytecode and uncompiled paths, producing severe tail spikes until hot methods are compiled.
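
The "buffers cover BDP" condition in the patterns above can be checked with simple arithmetic: the bandwidth-delay product is link bandwidth multiplied by round-trip time. A minimal sketch follows; the 10 Gbit/s link, 100 µs round trip, and buffer sizes are illustrative assumptions, not measured values, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_us):
    """Bandwidth-delay product in bytes, rounded up.
    Integer arithmetic avoids floating-point rounding surprises."""
    bits_times_microseconds = bandwidth_bits_per_s * rtt_us
    # Divide by 1e6 (microseconds to seconds) and by 8 (bits to bytes).
    return -(-bits_times_microseconds // 8_000_000)


def covers_bdp(buffer_bytes, bandwidth_bits_per_s, rtt_us):
    """True when a buffer can hold at least one BDP of in-flight data."""
    return buffer_bytes >= bdp_bytes(bandwidth_bits_per_s, rtt_us)


if __name__ == "__main__":
    bandwidth = 10_000_000_000  # assumed 10 Gbit/s link
    rtt = 100                   # assumed 100 microsecond round trip
    print(bdp_bytes(bandwidth, rtt))                     # bytes in flight
    print(covers_bdp(64 * 1024, bandwidth, rtt))         # 64 KiB buffer
    print(covers_bdp(2 * 1024 * 1024, bandwidth, rtt))   # 2 MiB buffer
```

Under these assumed numbers the BDP is 125,000 bytes, so a 64 KiB buffer falls short while a 2 MiB buffer covers it with room to spare.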

Why the tail dominates

The pattern repeats across every parameter:

p99 is always about avoiding variance — cache misses, queue depth, context switches, NUMA penalties.
p50 is about reducing the baseline cost of each operation.
Throughput is about keeping the pipe full.
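
The difference between p50 and p99 is easy to see on synthetic data. The sketch below uses a nearest-rank percentile and a made-up sample in which 2% of operations hit a slow path; the sample values and the function name are illustrative, not measurements.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Round before ceil to absorb floating-point noise in pct * n / 100.
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic latencies in microseconds: 98% fast path, 2% slow path.
    latencies_us = [10] * 980 + [1000] * 20
    print("p50:", percentile_nearest_rank(latencies_us, 50))
    print("p99:", percentile_nearest_rank(latencies_us, 99))
    print("mean:", sum(latencies_us) / len(latencies_us))
```

The median stays at the fast-path cost while p99 is set entirely by the rare slow operations, which is why removing variance moves the tail far more than it moves the median.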

Different parameters pull different levers, but the hardware-level ones — NUMA, cache, kernel bypass, pinning — have the most dramatic effect on tail latency.

For the parameter-by-parameter breakdown of why each knob moves throughput, p50, and p99, and how to size it, continue to the Parameter reference.