Transport Performance Factors and Their Impact
Ten factors move Aeron Transport performance. They form a stack — from the protocol on top to the hardware at the bottom — and you should tune them roughly in that order. This page maps each factor to the number it actually moves: throughput, p50 (median), or p99 (tail).
The ten factors, by layer
The factors stack from protocol-level down to hardware-level. Get the foundation right before reaching for the knobs above it.
Protocol & buffer tuning
- Initial Window Size — controls how much data can be in flight before requiring ACKs.
- NAK / Retransmit Timers — how long receivers wait before NAKing a gap (and senders before re-retransmitting). On multicast-semantics streams (IP multicast, MDC, the Cluster log channel) the default 10 ms backoff makes every residual loss a 10–20 ms tail event — see NAK Timer Tuning.
- Term Buffer Size — size of the log buffer per term rotation.
OS & driver level
- OS Network Send/Receive Buffer Size — the SO_SNDBUF/SO_RCVBUF kernel socket buffers.
- ENA Send/Receive Buffer Size — AWS Elastic Network Adapter ring buffer depth.
Hardware affinity
- CPU Near the NIC Card — NUMA locality between processing cores and the network interface.
- Term Buffer Fits into the CPU’s L3 Cache — keeping hot data in cache vs. spilling to DRAM.
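As a rough sketch of the L3-residency check, the arithmetic below compares a stream's log-buffer footprint with the cache size. It assumes the whole log is counted (Aeron splits each log buffer into three terms); the function name, the 32 MiB L3 figure, and the term lengths are illustrative assumptions, not values from this page.

```python
MIB = 1024 * 1024


def fits_in_l3(term_length: int, l3_bytes: int, streams: int = 1,
               terms_per_log: int = 3) -> bool:
    """Return True if the log buffers of `streams` streams fit in L3.

    Assumption: all `terms_per_log` terms of every stream count toward the
    working set. Counting only the active term would be a looser check.
    """
    footprint = streams * terms_per_log * term_length
    return footprint <= l3_bytes


# Hypothetical machine with a 32 MiB L3 cache.
L3_BYTES = 32 * MIB

large_term_fits = fits_in_l3(16 * MIB, L3_BYTES)  # 48 MiB footprint
small_term_fits = fits_in_l3(4 * MIB, L3_BYTES)   # 12 MiB footprint

print("16 MiB terms fit:", large_term_fits)
print(" 4 MiB terms fit:", small_term_fits)
```

Other data competes for the same cache, so treat a result close to the limit as a miss.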
System level
- Kernel Overhead / Bypass — eliminating the kernel networking stack (premium: ef_vi / DPDK / VMA).
- Thread Pinning / Core Isolation — dedicating CPU cores to Aeron threads, preventing scheduler interference — see Core Isolation & Thread Pinning.
- Resource Utilization (CPU / Bandwidth) — keeping utilization below saturation to handle jitter.
The reason utilization matters so much is the shape of the response-time curve. For a single-server
queue, response time grows as roughly 1 / (1 − ρ): gentle at low utilization, then steepening into a
knee around 70% (analytically ≈71.5%). There is no magic “safe” number: by 80% latency is already
about 5× the unloaded baseline (50% → 2×, 70% → 3.3×, 90% → 10×, 95% → 20×). A latency-sensitive
system should run below the knee — typically 50–70% — so it keeps headroom to absorb bursts; run it
“hot” near saturation and the tail explodes.
This is a rule of thumb, not a constant. It applies to a single queue / per-core / per-flow hot path with bursty arrivals. Two things raise the safe ceiling: multi-core / multi-server pools (more parallel servers form queues much later, so a fleet can run at 80–90%), and deterministic or batched service (fixed-cost work has half the queuing penalty of random-duration work — which is exactly why Aeron’s smart batching lets it sustain high utilization without inflating the tail). The real fix for “high utilization and low tail” is removing arrival randomness via batching, not chasing a single percentage.
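The two claims above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the textbook single-server model: the first function is the 1 / (1 − ρ) multiplier quoted above, and the second is the Pollaczek-Khinchine mean queueing delay, which is swapped in here to compare random-duration service (M/M/1) with fixed-cost service (M/D/1). The function names are illustrative.

```python
def response_time_multiplier(rho: float) -> float:
    """Response time relative to the unloaded baseline: 1 / (1 - rho)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - rho)


def mean_queue_wait(rho: float, service_time: float, service_cv2: float) -> float:
    """Pollaczek-Khinchine mean wait in queue for a single server.

    service_cv2 is the squared coefficient of variation of the service time:
    1.0 for exponential service (M/M/1), 0.0 for fixed-cost service (M/D/1).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return (rho / (1.0 - rho)) * ((1.0 + service_cv2) / 2.0) * service_time


# Multipliers at the utilization levels quoted in the text.
for rho in (0.50, 0.70, 0.80, 0.90, 0.95):
    print(f"{rho:.0%} -> {response_time_multiplier(rho):.1f}x")

# Same load, same mean service time (1 unit): random vs fixed-cost service.
random_service = mean_queue_wait(0.8, 1.0, service_cv2=1.0)
fixed_service = mean_queue_wait(0.8, 1.0, service_cv2=0.0)
print("fixed / random queueing delay:", fixed_service / random_service)
```

The ratio on the last line is 0.5 at every utilization, which is the "half the queuing penalty" figure: the variance term, not the load, is what batching removes.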
How each factor moves the numbers
This star-rating matrix shows the relative impact of each tuning parameter on throughput, p50, and p99 latency. More stars means a bigger lever.
| Parameter | Throughput | p50 Latency | p99 / Tail Latency |
|---|---|---|---|
| Initial Window Size | ⭐⭐ | ⭐ | ⭐⭐ |
| NAK / Retransmit Timers | ⭐ | ⭐ | ⭐⭐⭐ † |
| Term Buffer Size | ⭐⭐ | ⭐ | ⭐⭐ |
| OS Network Send/Receive Buffer Size | ⭐⭐ | ⭐ | ⭐⭐ |
| ENA Send/Receive Buffer Size | ⭐⭐ | ⭐ | ⭐⭐ |
| CPU Near the NIC Card (NUMA-local) | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Term Buffer Fits into L3 Cache | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Kernel Bypass | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Thread Pinning / Core Isolation | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| JVM Prewarm (JIT Warmup) | ⭐ | ⭐⭐ | ⭐⭐⭐ |
† On multicast-semantics streams (MDC, IP multicast, the Cluster log channel) with residual loss. Measured: identical seeded 0.01% loss, default timers → p99.9 8,286 µs; derived timers → 105 µs (~79×). With zero loss the timers do nothing — they cap the cost of a loss, not its probability.
What the matrix tells you
Five patterns fall out of the ratings:
- Kernel bypass is the only row rated ⭐⭐⭐ in all three columns. It is the single biggest lever for overall performance.
- Hardware affinity disproportionately affects p99. NUMA locality, L3 residency, and thread pinning all earn ⭐⭐⭐ on the tail — because tail latency is dominated by cache misses, context switches, and NUMA penalties.
- Protocol parameters mainly affect throughput and p99. Window, NAK, and buffers have minimal p50 impact; they matter most under load. They also split into two levers with different jobs: Lever A (buffers) prevents loss; Lever B (NAK timers) caps the cost of a loss. Measured A/B: once buffers cover the bandwidth-delay product (BDP), 16×-larger buffers changed nothing percentile-for-percentile, but on a lossy stream the NAK profile moved p99.9 by ~79×. Don’t try to buy a better tail with bigger buffers; buy it with timer tuning once the buffer chain is sized.
- p50 is relatively easy to optimize; p99 is where the real work is. Most ⭐⭐⭐ ratings sit in the p99 column.
- JVM prewarm is a p99 problem, not a throughput one. A cold JIT means the first requests hit interpreted bytecode and uncompiled paths, producing severe tail spikes until hot methods are compiled.
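The "buffers cover BDP" test in the third pattern is plain arithmetic: the bandwidth-delay product is link bandwidth times round-trip time. The sketch below is one way to run that check over a buffer chain. The link speed, RTT, buffer names, and sizes are all hypothetical placeholders, not measured values or Aeron configuration keys.

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes, using integer math to avoid rounding."""
    return bandwidth_bits_per_s * rtt_us // (8 * 1_000_000)


def undersized(buffers: dict[str, int], bdp: int) -> list[str]:
    """Names of buffers in the chain that are smaller than the BDP."""
    return [name for name, size in buffers.items() if size < bdp]


# Hypothetical path: 10 Gbit/s link with a 200 microsecond round trip.
bdp = bdp_bytes(10_000_000_000, 200)

# Hypothetical buffer chain, sizes in bytes.
chain = {
    "initial_window": 128 * 1024,
    "so_sndbuf": 2 * 1024 * 1024,
    "so_rcvbuf": 2 * 1024 * 1024,
}

print("BDP bytes:", bdp)
print("below BDP:", undersized(chain, bdp))
```

Once the list of undersized buffers is empty, the measurement above suggests further growth buys nothing, and the remaining tail work belongs to the NAK timers.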
Why the tail dominates
The pattern repeats across every parameter:
- p99 is always about avoiding variance — cache misses, queue depth, context switches, NUMA penalties.
- p50 is about reducing the baseline cost of each operation.
- Throughput is about keeping the pipe full.
Different parameters pull different levers, but the hardware-level ones — NUMA, cache, kernel bypass, pinning — have the most dramatic effect on tail latency.
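A small synthetic example shows why variance lands on the tail rather than the median. This is an illustrative sketch with made-up latencies, not measured data: 2% of operations hit a stall (standing in for a cache miss or context switch), and the percentile function uses the nearest-rank definition.

```python
import math


def percentile(samples: list[int], p: float) -> int:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in microseconds (hypothetical values).
baseline = [10] * 1000               # every operation costs 10 us
stalled = [10] * 980 + [1000] * 20   # 2% of operations hit a 1000 us stall

for name, samples in (("baseline", baseline), ("stalled", stalled)):
    print(name, "p50:", percentile(samples, 50), "p99:", percentile(samples, 99))
```

The median is 10 µs in both runs, while p99 moves from 10 µs to 1000 µs. Removing the rare stall changes only the tail; lowering the per-operation cost would shift both.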
For the parameter-by-parameter breakdown of why each knob moves throughput, p50, and p99, and how to size it, continue to the Parameter reference.