Step-by-Step Tuning Methodology
Don’t tune everything at once. The fastest path to a well-tuned Aeron Transport is incremental: start from sensible defaults, add load, watch which metric degrades first, then turn the one knob that addresses it.
This page is the workflow. For the why behind each knob — the internals of windows, terms, and NAKs — defer to The Aeron Files.
The five steps at a glance
- Start with sensible defaults.
- Validate your message size.
- Load test incrementally.
- Tune based on symptoms.
- Apply the L3 cache sizing rule.
Step 1: Start with sensible defaults
Begin with stock settings. Resist the urge to pre-optimize.
- 128 KB initial window size.
- Default term buffer size.
- OS / ENA driver send and receive buffers sized to match Aeron's buffers.
A mismatch between Aeron’s buffers and the underlying OS or ENA driver buffers is a common source of silent throughput loss. Align them before you change anything else.
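As a rough illustration of that alignment check, the sketch below compares a planned set of buffer sizes against OS limits. The `BufferPlan` type and `buffer_mismatches` function are hypothetical names introduced here, and the rule that the initial window should fit inside the receive socket buffer is an assumption based on the open-source Aeron media driver, so verify it against your own deployment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BufferPlan:
    """Hypothetical container for the sizes you intend to configure (bytes)."""
    initial_window: int  # e.g. 128 * 1024
    so_rcvbuf: int       # receive socket buffer the transport will request
    so_sndbuf: int       # send socket buffer the transport will request


def buffer_mismatches(plan: BufferPlan, os_rmem_max: int, os_wmem_max: int) -> List[str]:
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    if plan.so_rcvbuf > os_rmem_max:
        problems.append(
            f"SO_RCVBUF {plan.so_rcvbuf} exceeds OS receive limit {os_rmem_max}"
        )
    if plan.so_sndbuf > os_wmem_max:
        problems.append(
            f"SO_SNDBUF {plan.so_sndbuf} exceeds OS send limit {os_wmem_max}"
        )
    # ASSUMPTION: the receiver window should not be larger than the socket
    # buffer that has to absorb it.
    if plan.initial_window > plan.so_rcvbuf:
        problems.append(
            f"initial window {plan.initial_window} exceeds SO_RCVBUF {plan.so_rcvbuf}"
        )
    return problems
```

On Linux, the two OS limits can be read from `/proc/sys/net/core/rmem_max` and `/proc/sys/net/core/wmem_max` and passed in as integers.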
Step 2: Validate message size
Keep each message smaller than the MTU (typically 1500 bytes).
You do not need application-level batching. Aeron Transport handles smart batching for you — adding your own batching layer on top usually hurts more than it helps.
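A minimal sketch of the size check, assuming IPv4 without options (20 bytes), a UDP header (8 bytes), and a 32-byte data frame header as in open-source Aeron. The header constants and function names are illustrative assumptions, so adjust them for your network path and transport version.

```python
IPV4_HEADER = 20        # bytes, assuming no IP options
UDP_HEADER = 8          # bytes
AERON_DATA_HEADER = 32  # bytes; ASSUMPTION: open-source Aeron data frame header


def max_single_frame_payload(link_mtu: int = 1500) -> int:
    """Largest application payload that fits in one frame on this link."""
    return link_mtu - IPV4_HEADER - UDP_HEADER - AERON_DATA_HEADER


def fits_in_one_frame(message_length: int, link_mtu: int = 1500) -> bool:
    """True if the message can be sent without spanning multiple frames."""
    return 0 <= message_length <= max_single_frame_payload(link_mtu)
```

Under these assumptions a 1500-byte link MTU leaves 1440 bytes for the message itself, so "smaller than the MTU" in practice means leaving headroom for headers.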
Step 3: Load test incrementally
Start with a small load and keep increasing it.
Monitor as you ramp until you see the first sign of stress:
- End-to-end latency climbing.
- p99 latency climbing.
- NAKs (negative acknowledgments) appearing.
The metric that degrades first tells you which knob to reach for next. That’s the whole point of ramping slowly — it isolates the bottleneck.
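The ramp can be sketched as a loop that stops at the first rate where any metric crosses a threshold. Everything here is hypothetical scaffolding: `measure` stands in for whatever your load generator reports, the median (p50) is used as a proxy for end-to-end latency, and the 1.5x tolerance is an arbitrary placeholder rather than a recommended value.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional, Tuple


class Sample(NamedTuple):
    p50_us: float  # median end-to-end latency, microseconds
    p99_us: float  # 99th percentile latency, microseconds
    naks: int      # NAKs observed during the run


def ramp_until_stress(
    measure: Callable[[int], Sample],
    rates: Iterable[int],
    baseline: Sample,
    tolerance: float = 1.5,
) -> Optional[Tuple[int, List[str]]]:
    """Run each rate in order; return (rate, symptoms) at the first sign of stress."""
    for rate in rates:
        sample = measure(rate)
        symptoms = []
        if sample.p50_us > baseline.p50_us * tolerance:
            symptoms.append("p50")
        if sample.p99_us > baseline.p99_us * tolerance:
            symptoms.append("p99")
        if sample.naks > baseline.naks:
            symptoms.append("naks")
        if symptoms:
            return rate, symptoms
    return None  # no stress seen across the whole ramp
```

Smaller steps between rates make it more likely that only one symptom appears at the stopping point, which is what makes the result useful for choosing a knob.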
Step 4: Tune based on symptoms
Match the symptom to the action. Change one thing, then re-test.
| Symptom | Action |
|---|---|
| p50 latency increases | Tune initial window size and send/recv buffers |
| p99 latency increases | Tune NAK delay and term buffer size |
Read this table as a diagnostic. A rising p50 points at steady-state flow control — the window and OS buffers. A rising p99 points at tail events — recovery behavior (NAK delay) and how much in-flight data a term buffer holds. Throughput follows: once p50 and p99 are stable under load, push the load higher and repeat.
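The table can also be kept as a small lookup next to your test harness. The symptom-to-knob mapping below comes straight from the table; the property names are assumptions taken from the open-source Aeron media driver configuration and may differ in your version, so check them before use.

```python
from typing import Tuple

# Knobs to reach for per symptom, as listed in the table.
KNOBS_BY_SYMPTOM = {
    "p50": ("initial window size", "send/recv socket buffers"),
    "p99": ("NAK delay", "term buffer size"),
}

# ASSUMPTION: property names from the open-source Aeron media driver.
# Verify each one against the documentation for your driver version.
ASSUMED_PROPERTIES = {
    "initial window size": ("aeron.rcv.initial.window.length",),
    "send/recv socket buffers": ("aeron.socket.so_sndbuf", "aeron.socket.so_rcvbuf"),
    "NAK delay": ("aeron.nak.unicast.delay",),
    "term buffer size": ("aeron.term.buffer.length",),
}


def knobs_for(symptom: str) -> Tuple[str, ...]:
    """Return the knobs the table associates with a degraded metric."""
    if symptom not in KNOBS_BY_SYMPTOM:
        raise ValueError(f"no tuning guidance for symptom {symptom!r}")
    return KNOBS_BY_SYMPTOM[symptom]
```

Changing only one of the returned knobs per test run keeps the cause of any improvement attributable.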
Step 5: The L3 cache sizing rule
This is the critical guardrail: keep term buffer size at or below one third of your L3 cache size.
This ensures the active term plus other working-set data all fit in L3, avoiding DRAM spills on the hot path. DRAM spills are exactly what wrecks p99.
Worked example: if your L3 is 36 MB, keep term buffers ≤ 12 MB. The remaining cache leaves room for other hot data — connection state, application objects — to stay resident.
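The arithmetic is simple enough to script. The sketch below computes the one-third ceiling and then rounds down to a power of two, on the assumption (true of open-source Aeron) that term buffer lengths must be powers of two. Function names are illustrative.

```python
MIB = 1024 * 1024


def max_term_buffer_bytes(l3_bytes: int) -> int:
    """Ceiling from the 1/3 rule: term buffers should not exceed this."""
    return l3_bytes // 3


def largest_power_of_two_at_most(n: int) -> int:
    """Round down to a power of two (ASSUMPTION: required for term lengths)."""
    if n < 1:
        raise ValueError("n must be positive")
    return 1 << (n.bit_length() - 1)


ceiling = max_term_buffer_bytes(36 * MIB)          # 12 MiB for a 36 MiB L3
term_length = largest_power_of_two_at_most(ceiling)  # 8 MiB
```

If the power-of-two constraint applies to your setup, a 36 MB L3 therefore points to an 8 MiB term buffer rather than exactly 12 MB.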
Putting it together
The methodology is deliberately incremental. Start with defaults, add load, observe which metric degrades first, then tune the corresponding parameter, and never violate the 1/3 L3 rule while you do it.
For deeper mechanics of any individual knob, see The Aeron Files.