NUMA, Memory Hierarchy, and Cache Locality

Latency lives in the memory hierarchy. On the hot path, where your data sits matters more than how fast your code runs. This page maps the layout of a modern dual-socket server and shows why NUMA placement and cache locality dominate Aeron’s tail latency.

Read it alongside the parameter reference — the two NUMA and L3 knobs there are explained from first principles here.

A modern dual-socket server has two CPU sockets, each with its own cores, its own memory controller (MC), its own DRAM, and its own PCI-e lanes to a NIC. The two sockets talk over an interconnect (QPI on older Xeons, UPI on Skylake and later; this page says QPI throughout). Crossing that link is expensive.

The key structure: L1 and L2 are per-core, L3 is shared across all cores on a socket, and DRAM hangs off each socket’s own memory controller. A NIC is physically wired to one socket’s PCI-e lanes — not both.

Each step down the hierarchy costs several times more than the one above it, roughly 3x to 6x per level. The numbers below are the working figures for a typical Xeon-class server.

| Level | Latency | Notes |
| --- | --- | --- |
| Registers/Buffers | < 1ns | |
| L1 Cache | ~4 cycles, ~1ns | Per-core |
| L2 Cache | ~12 cycles, ~3ns | Per-core |
| L3 Cache | ~40 cycles, ~12ns | Shared per socket |
| L3 (dirty hit) | ~75 cycles, ~25ns | Cross-core within socket |
| DRAM (local) | ~70ns | Local to socket’s memory controller |
| QPI (cross-socket) | > 40ns additional | Penalty for accessing remote socket’s memory |

The headline: L1 is ~1ns, DRAM is ~70ns. That is a ~70x gap, and it sits squarely on your hot path.
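To see how those figures combine, here is a minimal sketch that weights the table's working numbers by a hit mix. The function name `mean_access_ns` and the example mixes are illustrative assumptions, not measurements, and the remote-DRAM entry takes the table's "> 40ns additional" at its lower bound, so it understates the real cross-socket cost.

```python
# Working figures from the latency table, in nanoseconds.
LATENCY_NS = {
    "L1": 1.0,
    "L2": 3.0,
    "L3": 12.0,
    "DRAM_local": 70.0,
    # Assumption: local DRAM plus the 40ns lower bound for the QPI hop.
    "DRAM_remote": 70.0 + 40.0,
}


def mean_access_ns(mix):
    """Return the average latency for a mix of {level: fraction of accesses}."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"fractions must sum to 1.0, got {total}")
    return sum(LATENCY_NS[level] * share for level, share in mix.items())


if __name__ == "__main__":
    # Hypothetical hit mixes, chosen only to illustrate the arithmetic.
    l3_resident = {"L1": 0.90, "L3": 0.10}
    some_dram = {"L1": 0.90, "L3": 0.05, "DRAM_local": 0.05}
    print(f"L3-resident mix: {mean_access_ns(l3_resident):.2f} ns")
    print(f"5% DRAM misses:  {mean_access_ns(some_dram):.2f} ns")
```

With these made-up mixes, moving just 5% of accesses from L3 to local DRAM takes the mean from about 2.1ns to about 5ns. The mean hides the worst case: each individual miss still pays the full DRAM latency, which is what shows up in the tail.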

Three placement decisions move latency the most. Each maps cleanly to a tuning knob.

  • L3 cache residency is the sweet spot. 12ns versus 70ns+ for DRAM. If your term buffer fits in L3, you avoid DRAM latency on every message — cutting both p50 and, more importantly, p99 variance from misses.
  • NUMA locality. Accessing memory on the remote socket adds 40ns+ per access via QPI. If your NIC is on Socket 0 but Aeron runs on Socket 1, every packet crosses QPI twice. That penalty lands directly on the tail.
  • PCI-e to NIC. The NIC is physically attached to one socket. Run Aeron’s media driver on cores local to that socket and you eliminate cross-socket PCI-e traversal entirely.
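Before pinning anything, you need to know which socket the NIC hangs off. The sketch below reads that from Linux sysfs, assuming the interface is backed by a PCI device (virtual interfaces have no `device/numa_node` file and will raise `FileNotFoundError`). The helper names and the `eth0` interface name are hypothetical; the `sysfs` parameter exists only so the functions can be pointed at a test directory.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def nic_numa_node(iface, sysfs=Path("/sys")):
    """Return the NUMA node the NIC's PCI device reports, or None if unknown."""
    node = int((sysfs / "class/net" / iface / "device/numa_node").read_text())
    # The kernel writes -1 when the platform reports no affinity.
    return None if node < 0 else node


def cpus_on_node(node, sysfs=Path("/sys")):
    """Return the set of CPU ids that belong to the given NUMA node."""
    path = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())


if __name__ == "__main__":
    iface = "eth0"  # hypothetical interface name; substitute your own
    node = nic_numa_node(iface)
    if node is None:
        print(f"{iface}: kernel reports no NUMA affinity")
    else:
        print(f"{iface}: NUMA node {node}, local CPUs {sorted(cpus_on_node(node))}")
```

The CPU set it returns is the candidate pool for the media driver and the hot application threads. A result of `None` is common on single-node instances, where there is no remote socket to avoid.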

Two parameters in the reference come straight out of the hierarchy above:

  • NUMA locality (CPU near the NIC) — pin IRQs, driver threads, and app threads to cores on the NIC’s NUMA node. This removes the QPI penalty and stabilizes p99.
  • Term buffer fits in L3 — size the active term footprint to stay within effective L3. A resident working set turns DRAM-rate misses into cache-rate hits, lowering p50 and tightening the tail.
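The second knob reduces to a budget check. The sketch below is one way to frame it, under two loudly stated assumptions: the footprint counts one actively written term per stream, and only a fraction of the L3 is treated as "effective" because other data competes for it. `usable_fraction` and its 0.5 default are hypothetical placeholders, not Aeron parameters or recommended values.

```python
MIB = 1024 * 1024


def term_footprint_bytes(term_length, active_streams):
    """Bytes of term buffer being written, assuming one active term per stream."""
    return term_length * active_streams


def fits_in_l3(term_length, active_streams, l3_bytes, usable_fraction=0.5):
    """Check the active term footprint against an assumed effective L3 budget.

    usable_fraction is a hypothetical headroom factor: the share of L3 you
    expect to keep for term buffers once everything else has taken its cut.
    """
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    budget = l3_bytes * usable_fraction
    return term_footprint_bytes(term_length, active_streams) <= budget


if __name__ == "__main__":
    l3 = 32 * MIB  # example L3 size; read the real one from lscpu
    for streams in (1, 2, 4):
        verdict = "fits" if fits_in_l3(16 * MIB, streams, l3) else "spills to DRAM"
        print(f"{streams} stream(s) x 16 MiB terms in {l3 // MIB} MiB L3: {verdict}")
```

The check is deliberately crude. It tells you when a configuration cannot possibly stay resident, not that a passing configuration will; confirm with a latency histogram under real load.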

CPU architecture supplement: AMD CCD vs Intel die vs Graviton

Everything above assumes the classic Xeon picture: one socket = one big shared L3. That assumption quietly breaks on AMD EPYC — and since EPYC, Intel Xeon, and AWS Graviton power different EC2 families, “pin to the NUMA node” means something different on each. This section maps the three topologies so the pinning advice above lands on the right cores.

AMD EPYC: the socket is not the cache domain

AMD builds EPYC from chiplets. Each CCD (Core Complex Die) carries one CCX of up to 8 cores sharing a 32 MB L3 slice — on both Zen 3 (Milan, EPYC 7R13 in c6a) and Zen 4 (Genoa, EPYC 9R14 in c7a/m7a/r7a, up to 12 CCDs per socket). That L3 is private to the CCD: a core never allocates into another CCD’s L3, and any cross-CCD traffic detours through the I/O die over Infinity Fabric.

The measured numbers make the cliff visible. On the EPYC 7R13 (c6a), core-to-core latency is ~23ns within a CCD but jumps to ~90–110ns across CCDs — a worse penalty than the cross-socket QPI hop on the Xeon diagram above, inside a single NUMA node. A thread that the scheduler migrates from CCD 0 to CCD 1 wakes up with a cold L3: its entire working set replays as misses, and that burst lands straight on p99.

Two consequences for the knobs above:

  • “Term buffer fits in L3” means 32 MB, not the socket total. A Genoa socket advertises 384 MB of L3, but a pinned thread can only ever hit its own CCD’s 32 MB. Budget against 32 MB.
  • Pinning to the NUMA node is not enough. Keep the media driver’s conductor/sender/receiver threads and the app threads that touch the same term buffers inside one CCD (8 cores). lscpu -e (the last field of the L1d:L1i:L2:L3 column) and lstopo both show CCD boundaries as L3 groups; on c7a every vCPU is a physical core (SMT is off), so consecutive vCPUs group naturally into CCDs.
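The same L3 grouping can be read programmatically. This is a minimal sketch, assuming a Linux host that exposes cache topology under `/sys/devices/system/cpu/cpuN/cache/`; the function names are hypothetical, and `pin_to_domain` relies on `os.sched_setaffinity`, which Python provides on Linux only.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def l3_domains(sysfs=Path("/sys")):
    """Return the distinct groups of CPUs sharing an L3, sorted by lowest CPU."""
    domains = set()
    for cpu_dir in (sysfs / "devices/system/cpu").glob("cpu[0-9]*"):
        for index in (cpu_dir / "cache").glob("index*"):
            # Pick the cache by its reported level rather than assuming index3.
            if (index / "level").read_text().strip() != "3":
                continue
            shared = (index / "shared_cpu_list").read_text()
            domains.add(frozenset(parse_cpulist(shared)))
    return sorted((sorted(d) for d in domains), key=lambda cpus: cpus[0])


def pin_to_domain(cpus, pid=0):
    """Restrict pid (0 = the caller) to the given CPUs; returns the set applied."""
    allowed = os.sched_getaffinity(pid) & set(cpus)
    if not allowed:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(pid, allowed)
    return allowed


if __name__ == "__main__":
    for number, cpus in enumerate(l3_domains()):
        print(f"L3 domain {number}: CPUs {cpus}")
```

On an EPYC instance you would expect several domains of 8 cores each, and on the Intel and Graviton parts described below a single domain per socket. Calling `pin_to_domain(l3_domains()[0])` early in a process keeps it inside one domain; it does not move IRQs or threads owned by other processes.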

On bare metal, the BIOS adds two more controls: NPS (NUMA nodes per socket: NPS1/NPS2/NPS4), which presents 1, 2, or 4 NUMA nodes per socket, and the “ACPI SRAT L3 Cache as NUMA Domain” option, which exposes each CCX as its own node and is handy for making schedulers CCD-aware. On EC2 you don’t control the BIOS; verify what you actually got with lscpu and numactl --hardware.

On the Intel side, Ice Lake (Xeon 8375C in c6i: 32 cores, 54 MB L3) hangs all cores and L3 slices off one mesh on a monolithic die, so every core sees the whole L3 at roughly uniform latency. Sapphire Rapids (custom Xeon 8488C in c7i: 48 cores, 105 MB L3) is physically four tiles stitched with EMIB, but the mesh spans the tile boundaries and presents one logical L3, so it behaves quasi-monolithically.

The trade: capacity and uniformity instead of speed. Measured Sapphire Rapids L3 latency is ~33ns, between 3× and 4× AMD’s per-CCD L3 (~9ns on Zen 4), because every access hashes across all the L3 slices on a very large mesh. Intel offers sub-NUMA clustering (SNC) in BIOS to split a socket’s cores, L3, and memory controllers into 2 or 4 NUMA domains for tighter locality, but again: not a knob you can reach on EC2.

Pinning is correspondingly forgiving: any core on the socket sees the same L3, so staying on the NIC’s socket/NUMA node — the rule from earlier in this page — is the whole job.

AWS Graviton: the simplest topology of the three

Graviton3 (c7g: 64 cores, 32 MB L3) and Graviton4 (c8g/r8g: 96 cores per socket, 36 MB L3, 2 MB L2 per core) put all cores on a single coherent mesh exposed as one NUMA node — and there is no SMT: every vCPU is a physical core (threads per core = 1 in AWS’s spec tables). No CCD boundaries, no hyperthread siblings polluting your L1/L2, no cross-die surprises.

Two caveats: the shared L3 is small (32–36 MB for the whole chip — about one AMD CCD’s worth, shared by 64–96 cores), so the term-buffer-in-L3 budget is tight and contended; and the 192-vCPU 48xlarge sizes (e.g. r8g.48xlarge) are dual-socket with 2 NUMA nodes, where all the cross-socket rules from the top of this page apply again.

| EC2 family | CPU | L3 per sharing domain | Cores per L3 domain | Pinning rule of thumb |
| --- | --- | --- | --- | --- |
| c6a / m6a | AMD EPYC 7R13 (Milan, Zen 3) | 32 MB per CCD | 8 | Driver + hot app threads inside one CCD |
| c7a / m7a / r7a | AMD EPYC 9R14 (Genoa, Zen 4) | 32 MB per CCD | 8 (1 vCPU = 1 core) | Same; never let hot threads straddle CCDs |
| c6i / m6i | Intel Xeon 8375C (Ice Lake) | 54 MB per socket | 32 | Stay on the NIC’s socket / NUMA node |
| c7i | Intel Xeon 8488C (Sapphire Rapids) | 105 MB per socket (4 tiles, EMIB, one logical L3) | 48 | Stay on the socket; L3 is uniform but slow (~33ns) |
| c7g | Graviton3 | 32 MB per chip | 64 | Any core; single NUMA node, no SMT |
| c8g / r8g | Graviton4 | 36 MB per socket | 96 | Any core on the socket; 48xlarge = 2 sockets |
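One way to compare the rows is L3 per core within a sharing domain, which hints at how contended the cache is when every core is busy. The sketch below encodes the table's figures as data; the `L3Domain` class and the per-core metric are illustrative assumptions, and the metric ignores L3 latency, so treat it as a rough lens rather than a ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L3Domain:
    """One row of the summary table: an L3 sharing domain on an EC2 family."""
    family: str
    cpu: str
    l3_mb: int   # L3 capacity of one sharing domain, in MB
    cores: int   # cores sharing that L3

    @property
    def l3_mb_per_core(self):
        return self.l3_mb / self.cores


# Figures copied from the table above.
DOMAINS = [
    L3Domain("c6a / m6a", "AMD EPYC 7R13", 32, 8),
    L3Domain("c7a / m7a / r7a", "AMD EPYC 9R14", 32, 8),
    L3Domain("c6i / m6i", "Intel Xeon 8375C", 54, 32),
    L3Domain("c7i", "Intel Xeon 8488C", 105, 48),
    L3Domain("c7g", "Graviton3", 32, 64),
    L3Domain("c8g / r8g", "Graviton4", 36, 96),
]

if __name__ == "__main__":
    for d in sorted(DOMAINS, key=lambda d: d.l3_mb_per_core, reverse=True):
        print(f"{d.family:<16} {d.cpu:<18} {d.l3_mb_per_core:.2f} MB of L3 per core")
```

The AMD rows come out at 4 MB per core against well under 1 MB on Graviton, which is the arithmetic behind calling the Graviton L3 budget tight and contended.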

This page covers the hardware context. For how Aeron’s term buffers, log structures, and counters are laid out to exploit cache locality, see The Aeron Files, which goes far deeper into the internals than this page does.