NUMA, Memory Hierarchy, and Cache Locality

Latency lives in the memory hierarchy. On the hot path, where your data sits matters more than how fast your code runs. This page maps the layout of a modern dual-socket server and shows why NUMA placement and cache locality dominate Aeron’s tail latency.

Read it alongside the parameter reference: its NUMA-locality and L3-residency knobs are explained here from first principles.

The dual-socket NUMA layout

A modern dual-socket server has two CPU sockets, each with its own cores, its own memory controller (MC), its own DRAM, and its own PCI-e lanes to a NIC. The two sockets talk over an interconnect (QPI on older Xeons, UPI on Skylake-SP and later; this page says QPI for either). Crossing that link is expensive.

The key structure: L1 and L2 are per-core, L3 is shared across all cores on a socket, and DRAM hangs off each socket’s own memory controller. A NIC is physically wired to one socket’s PCI-e lanes — not both.

Latency by level

Each step down the hierarchy costs roughly 3x to 6x more than the level above it, and the steps compound to nearly two orders of magnitude from L1 to DRAM. The numbers below are the working figures for a typical Xeon-class server.

Level	Latency	Notes
Registers/Buffers	< 1ns
L1 Cache	~4 cycles, ~1ns	Per-core
L2 Cache	~12 cycles, ~3ns	Per-core
L3 Cache	~40 cycles, ~12ns	Shared per socket
L3 (dirty hit)	~75 cycles, ~25ns	Cross-core within socket
DRAM (local)	~70ns	Local to socket’s memory controller
QPI (cross-socket)	> 40ns additional	Penalty for accessing remote socket’s memory

The headline: L1 is ~1ns, DRAM is ~70ns. That is a 70x gap, and it sits squarely on your hot path.
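Dividing the table's figures through makes the gaps explicit. The sketch below is only arithmetic on the working numbers above; the dictionary and function names are illustrative, and the remote-DRAM entry treats the ">40ns additional" penalty as exactly 40ns, so it is a lower bound.

```python
# Working latency figures from the table above, in nanoseconds.
LATENCY_NS = {
    "L1": 1.0,
    "L2": 3.0,
    "L3": 12.0,
    "L3 (dirty hit)": 25.0,
    "DRAM (local)": 70.0,
    # Local DRAM plus the cross-socket penalty; ">40ns" is taken as 40ns,
    # so this entry is a lower bound.
    "DRAM (remote)": 70.0 + 40.0,
}


def ratio_to_l1(level: str) -> float:
    """How many L1 hits fit in the time of one access at `level`."""
    return LATENCY_NS[level] / LATENCY_NS["L1"]


if __name__ == "__main__":
    for level, ns in LATENCY_NS.items():
        print(f"{level:<16} {ns:>6.0f} ns  {ratio_to_l1(level):>5.0f}x L1")
```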

Why this matters for Aeron

Three placement decisions move latency the most. Each maps cleanly to a tuning knob.

L3 cache residency is the sweet spot: an L3 hit costs ~12ns versus 70ns+ for DRAM. If your term buffer fits in L3, you avoid DRAM latency on every message, which lowers p50 and, more importantly, trims the p99 variance that misses cause.
NUMA locality. Accessing memory on the remote socket adds 40ns+ per access via QPI. If your NIC is on Socket 0 but Aeron runs on Socket 1, every packet crosses QPI twice. That penalty lands directly on the tail.
PCI-e to NIC. The NIC is physically attached to one socket. Run Aeron’s media driver on cores local to that socket and you eliminate cross-socket PCI-e traversal entirely.
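To see how L3 residency and NUMA placement combine, a back-of-envelope model helps. The sketch below is a deliberately simple average-cost model built on the table's figures; the function name and both parameters are assumptions for illustration, and it ignores prefetching, overlapping misses, and queueing, so read it as a way to compare placements rather than as a prediction.

```python
L3_HIT_NS = 12.0        # L3 hit, from the latency table
LOCAL_DRAM_NS = 70.0    # local DRAM access, from the latency table
REMOTE_EXTRA_NS = 40.0  # cross-socket penalty; the table says ">40ns", so a lower bound


def expected_access_ns(l3_hit_rate: float, remote_fraction: float = 0.0) -> float:
    """Average cost of an access that has already missed L1 and L2.

    l3_hit_rate:     fraction of those accesses served from L3 (0..1)
    remote_fraction: fraction of DRAM accesses that go to the other socket (0..1)
    """
    if not 0.0 <= l3_hit_rate <= 1.0:
        raise ValueError("l3_hit_rate must be between 0 and 1")
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    dram_ns = LOCAL_DRAM_NS + remote_fraction * REMOTE_EXTRA_NS
    return l3_hit_rate * L3_HIT_NS + (1.0 - l3_hit_rate) * dram_ns


if __name__ == "__main__":
    for hit_rate in (1.0, 0.9, 0.5, 0.0):
        local = expected_access_ns(hit_rate, remote_fraction=0.0)
        remote = expected_access_ns(hit_rate, remote_fraction=1.0)
        print(f"L3 hit rate {hit_rate:.0%}: local {local:.1f} ns, all-remote {remote:.1f} ns")
```

At a 90% L3 hit rate the model gives about 17.8ns with local memory and about 21.8ns when every miss crosses the interconnect. The average hides the point that matters for p99: each individual miss still pays the full 70ns or 110ns+.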

How this maps to tuning knobs

Two parameters in the reference come straight out of the hierarchy above:

NUMA locality (CPU near the NIC) — pin IRQs, driver threads, and app threads to cores on the NIC’s NUMA node. This removes the QPI penalty and stabilizes p99.
Term buffer fits in L3 — size the active term footprint to stay within effective L3. A resident working set turns DRAM-rate misses into cache-rate hits, lowering p50 and tightening the tail.
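The NUMA-locality rule can be sketched in a few lines on Linux. This is a minimal illustration, not Aeron's own mechanism: it reads the NIC's NUMA node and that node's CPU list from sysfs and restricts the calling Python process to those CPUs with os.sched_setaffinity. The helper names and the interface name "eth0" are hypothetical, and IRQ affinity is not covered.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a kernel cpulist string such as '0-3,8-11' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def nic_numa_node(iface: str) -> int:
    """NUMA node of the NIC's PCI-e device; the kernel reports -1 if unknown."""
    return int(Path(f"/sys/class/net/{iface}/device/numa_node").read_text())


def cpus_on_node(node: int) -> list[int]:
    """CPU ids that belong to the given NUMA node."""
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text()
    return parse_cpulist(cpulist)


def pin_to_nic_node(iface: str) -> list[int]:
    """Restrict the caller to CPUs on the NIC's NUMA node and return them."""
    node = nic_numa_node(iface)
    if node < 0:
        raise RuntimeError(f"kernel reports no NUMA node for {iface}")
    cpus = cpus_on_node(node)
    os.sched_setaffinity(0, cpus)  # pid 0 means the caller
    return cpus
```

Calling pin_to_nic_node("eth0") returns the CPU list it applied. For a process you do not launch from Python, the same list can be handed to a tool such as taskset or numactl.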

CPU architecture supplement: AMD CCD vs Intel die vs Graviton

Everything above assumes the classic Xeon picture: one socket = one big shared L3. That assumption quietly breaks on AMD EPYC — and since EPYC, Intel Xeon, and AWS Graviton power different EC2 families, “pin to the NUMA node” means something different on each. This section maps the three topologies so the pinning advice above lands on the right cores.

AMD EPYC: the socket is not the cache domain

AMD builds EPYC from chiplets. Each CCD (Core Complex Die) carries one CCX of up to 8 cores sharing a 32 MB L3 slice — on both Zen 3 (Milan, EPYC 7R13 in c6a) and Zen 4 (Genoa, EPYC 9R14 in c7a/m7a/r7a, up to 12 CCDs per socket). That L3 is private to the CCD: a core never allocates into another CCD’s L3, and any cross-CCD traffic detours through the I/O die over Infinity Fabric.

The measured numbers make the cliff visible. On the EPYC 7R13 (c6a), core-to-core latency is ~23ns within a CCD but jumps to ~90–110ns across CCDs. That is a worse penalty than the >40ns cross-socket QPI hop in the latency table above, and it occurs inside a single NUMA node. A thread that the scheduler migrates from CCD 0 to CCD 1 wakes up with a cold L3: its entire working set replays as misses, and that burst lands straight on p99.

Two consequences for the knobs above:

“Term buffer fits in L3” means 32 MB, not the socket total. A Genoa socket advertises 384 MB of L3, but a pinned thread can only ever hit its own CCD’s 32 MB. Budget against 32 MB.
Pinning to the NUMA node is not enough. Keep the media driver’s conductor/sender/receiver threads and the app threads that touch the same term buffers inside one CCD (8 cores). lscpu -e (the L3 column) and lstopo both show CCD boundaries as L3 groups; on c7a every vCPU is a physical core (SMT is off), so consecutive vCPUs group naturally into CCDs.
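The same L3 grouping that lscpu and lstopo display can be read programmatically. The sketch below assumes a Linux host that exposes cache topology under /sys/devices/system/cpu (some virtualized guests may not); the function names are illustrative. It groups CPUs by the shared_cpu_list of their level-3 cache, so each group is one L3 domain, which on EPYC corresponds to one CCD.

```python
from collections import defaultdict
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_l3_sharing(cpu_root: Path = CPU_ROOT) -> dict[int, str]:
    """Map each CPU id to the shared_cpu_list string of its level-3 cache."""
    sharing: dict[int, str] = {}
    for cpu_dir in cpu_root.glob("cpu[0-9]*"):
        for index_dir in (cpu_dir / "cache").glob("index*"):
            level_file = index_dir / "level"
            # Check the reported level instead of assuming index3 is the L3.
            if level_file.is_file() and level_file.read_text().strip() == "3":
                shared = (index_dir / "shared_cpu_list").read_text().strip()
                sharing[int(cpu_dir.name[3:])] = shared
    return sharing


def group_by_l3(sharing: dict[int, str]) -> list[list[int]]:
    """Group CPUs with an identical L3 shared_cpu_list: one group per L3 domain."""
    domains: dict[str, list[int]] = defaultdict(list)
    for cpu, shared in sharing.items():
        domains[shared].append(cpu)
    return sorted(sorted(cpus) for cpus in domains.values())


if __name__ == "__main__":
    for i, cpus in enumerate(group_by_l3(read_l3_sharing())):
        print(f"L3 domain {i}: {len(cpus)} CPUs -> {cpus}")
```

Picking one group and pinning every hot thread to CPUs inside it keeps them in a single L3 domain. Where SMT is enabled, sibling hardware threads show up in the same group, so a group can list more CPU ids than it has physical cores.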

On bare metal, the BIOS adds two knobs: NPS (NUMA nodes per socket: NPS1/NPS2/NPS4), which presents 1, 2, or 4 NUMA nodes per socket, and the “ACPI SRAT L3 Cache as NUMA Domain” option, which exposes each CCX as its own node and is handy for making schedulers CCD-aware. On EC2 you don’t control the BIOS; verify what you actually got with lscpu and numactl --hardware.

Intel Xeon: one big (but slower) L3

Ice Lake (Xeon 8375C in c6i: 32 cores, 54 MB L3) hangs all cores and L3 slices off one mesh on a monolithic die — every core sees the whole L3 at roughly uniform latency. Sapphire Rapids (custom Xeon 8488C in c7i: 48 cores, 105 MB) is physically four tiles stitched with EMIB, but the mesh spans the tile boundaries and presents one logical L3, so it behaves quasi-monolithically.

The trade: capacity and uniformity instead of speed. Measured Sapphire Rapids L3 latency is ~33ns, roughly 3.5× AMD’s per-CCD L3 (~9ns on Zen 4), because every access hashes across all the L3 slices on a very large mesh. Intel offers sub-NUMA clustering (SNC) in BIOS to split a socket’s cores, L3, and memory controllers into 2 or 4 NUMA domains for tighter locality, but again, that is not a knob you can reach on EC2.

Pinning is correspondingly forgiving: any core on the socket sees the same L3, so staying on the NIC’s socket/NUMA node — the rule from earlier in this page — is the whole job.

AWS Graviton: the simplest topology of the three

Graviton3 (c7g: 64 cores, 32 MB L3) and Graviton4 (c8g/r8g: 96 cores per socket, 36 MB L3, 2 MB L2 per core) put all cores on a single coherent mesh exposed as one NUMA node — and there is no SMT: every vCPU is a physical core (threads per core = 1 in AWS’s spec tables). No CCD boundaries, no hyperthread siblings polluting your L1/L2, no cross-die surprises.

Two caveats: the shared L3 is small (32–36 MB for the whole chip — about one AMD CCD’s worth, shared by 64–96 cores), so the term-buffer-in-L3 budget is tight and contended; and the 192-vCPU 48xlarge sizes (e.g. r8g.48xlarge) are dual-socket with 2 NUMA nodes, where all the cross-socket rules from the top of this page apply again.

Cheat sheet

EC2 family	CPU	L3 per sharing domain	Cores per L3 domain	Pinning rule of thumb
c6a / m6a	AMD EPYC 7R13 (Milan, Zen 3)	32 MB per CCD	8	Driver + hot app threads inside one CCD
c7a / m7a / r7a	AMD EPYC 9R14 (Genoa, Zen 4)	32 MB per CCD	8 (1 vCPU = 1 core)	Same — never let hot threads straddle CCDs
c6i / m6i	Intel Xeon 8375C (Ice Lake)	54 MB per socket	32	Stay on the NIC’s socket / NUMA node
c7i	Intel Xeon 8488C (Sapphire Rapids)	105 MB per socket (4 tiles, EMIB, one logical L3)	48	Stay on the socket; L3 is uniform but slow (~33ns)
c7g	Graviton3	32 MB per chip	64	Any core; single NUMA node, no SMT
c8g / r8g	Graviton4	36 MB per socket	96	Any core on the socket; 48xlarge = 2 sockets
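The cheat sheet can be turned into a quick budget check. The sketch below copies the L3 and core counts from the table and computes the L3 share per core plus a rough fit test. Several things here are assumptions for illustration only: the function names, treating MB as MiB, counting one active term per hot stream, the 16 MiB term length used in the example, and the 50% headroom factor left for everything else that competes for the cache.

```python
MIB = 1024 * 1024

# (L3 per sharing domain in MiB, cores per domain), copied from the cheat sheet.
L3_DOMAINS = {
    "c6a/m6a (EPYC 7R13)": (32, 8),
    "c7a/m7a/r7a (EPYC 9R14)": (32, 8),
    "c6i/m6i (Xeon 8375C)": (54, 32),
    "c7i (Xeon 8488C)": (105, 48),
    "c7g (Graviton3)": (32, 64),
    "c8g/r8g (Graviton4)": (36, 96),
}


def l3_per_core_mib(family: str) -> float:
    """L3 capacity divided evenly across the cores that share it."""
    l3_mib, cores = L3_DOMAINS[family]
    return l3_mib / cores


def fits_in_l3(family: str, term_length_bytes: int, hot_streams: int,
               headroom: float = 0.5) -> bool:
    """True if one active term per hot stream fits within `headroom` of the L3.

    `headroom` is an assumed fraction of the L3 reserved for term buffers;
    the remainder is left for code, counters, and other threads.
    """
    l3_mib, _ = L3_DOMAINS[family]
    return hot_streams * term_length_bytes <= headroom * l3_mib * MIB


if __name__ == "__main__":
    for family in L3_DOMAINS:
        ok = fits_in_l3(family, term_length_bytes=16 * MIB, hot_streams=1)
        print(f"{family:<26} {l3_per_core_mib(family):.3f} MiB/core  "
              f"one 16 MiB term fits: {ok}")
```

The per-core share shows why the Graviton budget is tight: 4 MiB per core on an EPYC CCD against 0.5 MiB on Graviton3 and 0.375 MiB on Graviton4, even though the domain sizes look similar.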

Sources

AWS Graviton Getting Started — processor spec table (Graviton2/3/4 cores, L2/L3 sizes, NUMA nodes, interconnect)
AWS EC2 instance types — general purpose specs (vCPUs = cores, threads per core = 1 on Graviton and m7a)
Wikipedia — Zen 3 (CCD = one 8-core CCX with 32 MB shared L3)
Wikipedia — Zen 4 (32 MB L3 per CCD; Genoa up to 12 CCDs)
Wikipedia — Sapphire Rapids (four tiles, EMIB, up to 112.5 MB L3)
Chips and Cheese — A Peek at Sapphire Rapids (quasi-monolithic L3 across tiles, ~33ns L3 latency, mesh slice hashing, clustered mode)
Chips and Cheese — Zen 4 memory subsystem (Zen 4 L3 latency ~9ns)
nviennot/core-to-core-latency (EPYC 7R13: ~23ns intra-CCD vs ~90–110ns cross-CCD; Xeon 8375C: ~51ns uniform)
Broadcom KB — AMD EPYC BIOS and NUMA guidance (NPS presents 1/2/4 NUMA nodes per socket; CCX-as-NUMA option)
Phoronix — Intel sub-NUMA clustering (SNC splits cores, cache, and memory into NUMA domains)
AWS — c6i, c7i, c7g, c8g instance pages (CPU generations per family)

Going deeper

This page covers the hardware context. For how Aeron’s term buffers, log structures, and counters are laid out to exploit cache locality, defer to The Aeron Files — it goes far deeper into the internals than we duplicate here.

The Aeron Files