NUMA, Memory Hierarchy, and Cache Locality
Latency lives in the memory hierarchy. On the hot path, where your data sits matters more than how fast your code runs. This page maps the layout of a modern dual-socket server and shows why NUMA placement and cache locality dominate Aeron’s tail latency.
Read it alongside the parameter reference — the two NUMA and L3 knobs there are explained from first principles here.
The dual-socket NUMA layout
A dual-socket server has two CPU sockets, each with its own cores, its own memory controller (MC), its own DRAM, and its own PCI-e lanes to a NIC. The two sockets talk over an interconnect (QPI on older Xeons, UPI on newer ones; this page says QPI throughout). Crossing that link is expensive.
The key structure: L1 and L2 are per-core, L3 is shared across all cores on a socket, and DRAM hangs off each socket’s own memory controller. A NIC is physically wired to one socket’s PCI-e lanes — not both.
Latency by level
Each step down the hierarchy costs several times more than the level above it. The numbers below are the working figures for a typical Xeon-class server.
| Level | Latency | Notes |
|---|---|---|
| Registers/Buffers | < 1ns | |
| L1 Cache | ~4 cycles, ~1ns | Per-core |
| L2 Cache | ~12 cycles, ~3ns | Per-core |
| L3 Cache | ~40 cycles, ~12ns | Shared per socket |
| L3 (dirty hit) | ~75 cycles, ~25ns | Cross-core within socket |
| DRAM (local) | ~70ns | Local to socket’s memory controller |
| QPI (cross-socket) | > 40ns additional | Penalty for accessing remote socket’s memory |
The headline: L1 is ~1ns, DRAM is ~70ns. That is a roughly 70x gap, and it sits squarely on your hot path.
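To make the gaps concrete, the table's working figures can be turned into ratios and a rough average access time. This is a minimal sketch: the helper names are hypothetical, the hit-rate mix is an illustrative assumption rather than a measurement, and the model simply treats each access as served by exactly one level.

```python
# Working figures from the table above, in nanoseconds.
LATENCY_NS = {
    "L1": 1.0,
    "L2": 3.0,
    "L3": 12.0,
    "DRAM_local": 70.0,
    "DRAM_remote": 70.0 + 40.0,  # local DRAM plus the >40ns cross-socket penalty
}


def ratio(slower: str, faster: str) -> float:
    """How many times slower one level is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


def average_access_ns(hit_rates: dict[str, float], backing: str) -> float:
    """Expected latency when each access is served by exactly one level.

    hit_rates maps a cache level to the fraction of all accesses it serves;
    whatever is left over falls through to the `backing` memory level.
    """
    served = sum(hit_rates.values())
    if served < 0.0 or served > 1.0 + 1e-9:
        raise ValueError("hit rates must sum to a value between 0 and 1")
    cached = sum(LATENCY_NS[level] * share for level, share in hit_rates.items())
    return cached + max(0.0, 1.0 - served) * LATENCY_NS[backing]


if __name__ == "__main__":
    print(f"DRAM vs L1: {ratio('DRAM_local', 'L1'):.0f}x")
    # Assumed mix: 90% L1, 5% L2, 4% L3, the remaining 1% goes to memory.
    mix = {"L1": 0.90, "L2": 0.05, "L3": 0.04}
    print(f"local memory:  {average_access_ns(mix, 'DRAM_local'):.2f} ns average")
    print(f"remote memory: {average_access_ns(mix, 'DRAM_remote'):.2f} ns average")
```

Even with only 1% of accesses falling through to memory in this assumed mix, that 1% contributes a large share of the average, which is why the rarely taken slow path shapes the tail.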
Why this matters for Aeron
Three placement decisions move latency the most. Each maps cleanly to a tuning knob.
- L3 cache residency is the sweet spot. 12ns versus 70ns+ for DRAM. If your term buffer fits in L3, you avoid DRAM latency on every message — cutting both p50 and, more importantly, p99 variance from misses.
- NUMA locality. Accessing memory on the remote socket adds 40ns+ per access via QPI. If your NIC is on Socket 0 but Aeron runs on Socket 1, every packet crosses QPI twice. That penalty lands directly on the tail.
- PCI-e to NIC. The NIC is physically attached to one socket. Run Aeron’s media driver on cores local to that socket and you eliminate cross-socket PCI-e traversal entirely.
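Before pinning anything, it helps to confirm which socket the NIC actually hangs off. On Linux, the kernel exposes this through sysfs. The sketch below reads those files; the function names are hypothetical, the `sys_root` parameter exists only so the code can be pointed at a test directory, and virtual interfaces (which have no PCI `device` entry) are not handled.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist string such as '0-3,8-11' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def nic_numa_node(iface: str, sys_root: Path = Path("/sys")) -> int:
    """NUMA node reported for the NIC's PCI device (-1 means unknown)."""
    path = sys_root / "class/net" / iface / "device/numa_node"
    return int(path.read_text())


def cpus_on_node(node: int, sys_root: Path = Path("/sys")) -> list[int]:
    """CPU ids that belong to the given NUMA node."""
    path = sys_root / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())


def local_cpus_for(iface: str, sys_root: Path = Path("/sys")) -> list[int]:
    """CPUs on the NIC's node, or an empty list if the node is unknown."""
    node = nic_numa_node(iface, sys_root)
    return cpus_on_node(node, sys_root) if node >= 0 else []
```

Calling `local_cpus_for("eth0")` (the interface name is an assumption; substitute your own) returns the candidate cores for the media driver. An empty list means the kernel reported no NUMA affinity for the device, which is common on single-node machines.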
How this maps to tuning knobs
Two parameters in the reference come straight out of the hierarchy above:
- NUMA locality (CPU near the NIC) — pin IRQs, driver threads, and app threads to cores on the NIC’s NUMA node. This removes the QPI penalty and stabilizes p99.
- Term buffer fits in L3 — size the active term footprint to stay within effective L3. A resident working set turns DRAM-rate misses into cache-rate hits, lowering p50 and tightening the tail.
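The L3 sizing rule can be turned into a small budget calculation. Aeron term lengths must be a power of two between 64 KiB and 1 GiB, and each log buffer holds three terms. Everything else in this sketch is an assumption made for illustration: the function name, the `usable_fraction` of L3 you expect to keep for term data (0.5 here), and whether you budget for one active term per stream or all three (`terms_counted`).

```python
from typing import Optional

TERM_MIN = 64 * 1024            # smallest term length Aeron accepts
TERM_MAX = 1024 * 1024 * 1024   # largest term length Aeron accepts


def largest_term_length(l3_bytes: int, streams: int,
                        usable_fraction: float = 0.5,
                        terms_counted: int = 1) -> Optional[int]:
    """Largest power-of-two term length whose footprint fits the L3 budget.

    usable_fraction and terms_counted are illustrative assumptions:
    terms_counted=1 budgets only the active term of each stream,
    terms_counted=3 budgets the whole three-term log buffer.
    Returns None when even the minimum term length does not fit.
    """
    if streams < 1 or terms_counted < 1:
        raise ValueError("streams and terms_counted must be at least 1")
    budget = int(l3_bytes * usable_fraction) // (streams * terms_counted)
    if budget < TERM_MIN:
        return None
    length = TERM_MIN
    while length * 2 <= min(budget, TERM_MAX):
        length *= 2
    return length


if __name__ == "__main__":
    mib = 1024 * 1024
    choice = largest_term_length(32 * mib, streams=2)
    print("no term length fits" if choice is None else f"{choice // 1024} KiB")
```

The result would then feed whatever term-length setting you use, for example `aeron.term.buffer.length`. Treat it as a starting point to validate with latency measurements, not as a guarantee of residency, since other data competes for the same cache.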
CPU architecture supplement: AMD CCD vs Intel die vs Graviton
Everything above assumes the classic Xeon picture: one socket = one big shared L3. That assumption quietly breaks on AMD EPYC. And since EPYC, Intel Xeon, and AWS Graviton power different EC2 families, “pin to the NUMA node” means something different on each. This section maps the three topologies so the pinning advice above lands on the right cores.
AMD EPYC: the socket is not the cache domain
AMD builds EPYC from chiplets. Each CCD (Core Complex Die) carries one CCX of up to 8 cores sharing a 32 MB L3 slice, on both Zen 3 (Milan, EPYC 7R13 in c6a) and Zen 4 (Genoa, EPYC 9R14 in c7a/m7a/r7a, up to 12 CCDs per socket). That L3 is private to the CCD: a core never allocates into another CCD’s L3, and any cross-CCD traffic detours through the I/O die over Infinity Fabric.
The measured numbers make the cliff visible. On the EPYC 7R13 (c6a), core-to-core latency is ~23ns within a CCD but jumps to ~90–110ns across CCDs — a worse penalty than the cross-socket QPI hop on the Xeon diagram above, inside a single NUMA node. A thread that the scheduler migrates from CCD 0 to CCD 1 wakes up with a cold L3: its entire working set replays as misses, and that burst lands straight on p99.
Two consequences for the knobs above:
- “Term buffer fits in L3” means 32 MB, not the socket total. A Genoa socket advertises 384 MB of L3, but a pinned thread can only ever hit its own CCD’s 32 MB. Budget against 32 MB.
- Pinning to the NUMA node is not enough. Keep the media driver’s conductor/sender/receiver threads and the app threads that touch the same term buffers inside one CCD (8 cores). `lscpu -e` (the L3 column) and `lstopo` both show CCD boundaries as L3 groups; on c7a every vCPU is a physical core (SMT is off), so consecutive vCPUs group naturally into CCDs.
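The same L3 grouping that `lscpu -e` and `lstopo` report can also be read from sysfs, which makes it scriptable. The sketch below groups CPUs by the level-3 cache they share and pins a process to one group using Python's `os.sched_setaffinity` (Linux only). The function names are hypothetical, `sys_root` exists only for testing, and some virtualized or ARM guests may not expose cache topology in sysfs at all, in which case the result is an empty list.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist string such as '0-7,64-71' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def l3_domains(sys_root: Path = Path("/sys")) -> list[list[int]]:
    """Group CPUs by the L3 cache they share (one group per CCD on EPYC)."""
    domains: set[tuple[int, ...]] = set()
    cpu_dir = sys_root / "devices/system/cpu"
    for index in cpu_dir.glob("cpu[0-9]*/cache/index*"):
        # Each index directory describes one cache; keep only level-3 caches.
        if (index / "level").read_text().strip() == "3":
            shared = (index / "shared_cpu_list").read_text()
            domains.add(tuple(parse_cpulist(shared)))
    return [list(domain) for domain in sorted(domains)]


def pin_to_domain(cpus: list[int], pid: int = 0) -> None:
    """Restrict a process (0 = the caller) to the CPUs of one L3 domain."""
    os.sched_setaffinity(pid, set(cpus))
```

For example, `pin_to_domain(l3_domains()[0])` would confine the calling process to the first L3 domain. This pins a whole process; placing individual media driver threads on specific cores needs a per-thread mechanism that this sketch does not cover.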
On bare metal, the BIOS adds NPS (NUMA-per-socket: NPS1/NPS2/NPS4) to present 1, 2, or 4 NUMA nodes per socket, and an “ACPI SRAT L3 Cache as NUMA Domain” option that exposes each CCX as its own node, which is handy for making schedulers CCD-aware. On EC2 you don’t control the BIOS; verify what you actually got with `lscpu` and `numactl --hardware`.
Intel Xeon: one big (but slower) L3
Ice Lake (Xeon 8375C in c6i: 32 cores, 54 MB L3) hangs all cores and L3 slices off one mesh on a monolithic die, so every core sees the whole L3 at roughly uniform latency. Sapphire Rapids (custom Xeon 8488C in c7i: 48 cores, 105 MB) is physically four tiles stitched with EMIB, but the mesh spans the tile boundaries and presents one logical L3, so it behaves quasi-monolithically.
The trade: capacity and uniformity instead of speed. Measured Sapphire Rapids L3 latency is ~33ns, more than 3× AMD’s per-CCD L3 (~9ns on Zen 4), because every access hashes across all the L3 slices on a very large mesh. Intel offers sub-NUMA clustering (SNC) in BIOS to split a socket’s cores, L3, and memory controllers into 2 or 4 NUMA domains for tighter locality, but again: not a knob you can reach on EC2.
Pinning is correspondingly forgiving: any core on the socket sees the same L3, so staying on the NIC’s socket/NUMA node — the rule from earlier in this page — is the whole job.
AWS Graviton: the simplest topology of the three
Graviton3 (c7g: 64 cores, 32 MB L3) and Graviton4 (c8g/r8g: 96 cores per socket, 36 MB L3, 2 MB L2 per core) put all cores on a single coherent mesh exposed as one NUMA node, and there is no SMT: every vCPU is a physical core (threads per core = 1 in AWS’s spec tables). No CCD boundaries, no hyperthread siblings polluting your L1/L2, no cross-die surprises.
Two caveats: the shared L3 is small (32–36 MB for the whole chip — about one AMD CCD’s worth, shared
by 64–96 cores), so the term-buffer-in-L3 budget is tight and contended; and the 192-vCPU 48xlarge
sizes (e.g. r8g.48xlarge) are dual-socket with 2 NUMA nodes, where all the cross-socket rules from
the top of this page apply again.
Cheat sheet
| EC2 family | CPU | L3 per sharing domain | Cores per L3 domain | Pinning rule of thumb |
|---|---|---|---|---|
| c6a / m6a | AMD EPYC 7R13 (Milan, Zen 3) | 32 MB per CCD | 8 | Driver + hot app threads inside one CCD |
| c7a / m7a / r7a | AMD EPYC 9R14 (Genoa, Zen 4) | 32 MB per CCD | 8 (1 vCPU = 1 core) | Same — never let hot threads straddle CCDs |
| c6i / m6i | Intel Xeon 8375C (Ice Lake) | 54 MB per socket | 32 | Stay on the NIC’s socket / NUMA node |
| c7i | Intel Xeon 8488C (Sapphire Rapids) | 105 MB per socket (4 tiles, EMIB, one logical L3) | 48 | Stay on the socket; L3 is uniform but slow (~33ns) |
| c7g | Graviton3 | 32 MB per chip | 64 | Any core; single NUMA node, no SMT |
| c8g / r8g | Graviton4 | 36 MB per socket | 96 | Any core on the socket; 48xlarge = 2 sockets |
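One way to compare the rows is L3 per core within a sharing domain. The sketch below computes it from the cheat sheet's own figures. The per-core share is only a rough indicator of contention, since a shared L3 is not partitioned per core, and the names used are hypothetical.

```python
# (EC2 family, L3 in MB per sharing domain, cores per L3 domain),
# copied from the cheat sheet above.
CHEAT_SHEET = [
    ("c6a / m6a", 32, 8),
    ("c7a / m7a / r7a", 32, 8),
    ("c6i / m6i", 54, 32),
    ("c7i", 105, 48),
    ("c7g", 32, 64),
    ("c8g / r8g", 36, 96),
]


def l3_mb_per_core(l3_mb: float, cores: int) -> float:
    """Average L3 share per core if every core in the domain is busy."""
    return l3_mb / cores


if __name__ == "__main__":
    for family, l3_mb, cores in CHEAT_SHEET:
        share = l3_mb_per_core(l3_mb, cores)
        print(f"{family:<16} {share:.2f} MB of L3 per core")
```

Read this way, an EPYC CCD offers the most L3 per core and Graviton the least, which matches the caveat above that the Graviton term-buffer budget is tight and contended.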
Sources
- AWS Graviton Getting Started: processor spec table (Graviton2/3/4 cores, L2/L3 sizes, NUMA nodes, interconnect)
- AWS EC2 instance types: general purpose specs (vCPUs = cores, threads per core = 1 on Graviton and m7a)
- Wikipedia: Zen 3 (CCD = one 8-core CCX with 32 MB shared L3)
- Wikipedia: Zen 4 (32 MB L3 per CCD; Genoa up to 12 CCDs)
- Wikipedia: Sapphire Rapids (four tiles, EMIB, up to 112.5 MB L3)
- Chips and Cheese: A Peek at Sapphire Rapids (quasi-monolithic L3 across tiles, ~33ns L3 latency, mesh slice hashing, clustered mode)
- Chips and Cheese: Zen 4 memory subsystem (Zen 4 L3 latency ~9ns)
- nviennot/core-to-core-latency (EPYC 7R13: ~23ns intra-CCD vs ~90–110ns cross-CCD; Xeon 8375C: ~51ns uniform)
- Broadcom KB: AMD EPYC BIOS and NUMA guidance (NPS presents 1/2/4 NUMA nodes per socket; CCX-as-NUMA option)
- Phoronix: Intel sub-NUMA clustering (SNC splits cores, cache, and memory into NUMA domains)
- AWS: c6i, c7i, c7g, c8g instance pages (CPU generations per family)
Going deeper
This page covers the hardware context. For how Aeron’s term buffers, log structures, and counters are laid out to exploit cache locality, see The Aeron Files, which goes far deeper into the internals than this page does.