Core Isolation & Thread Pinning
“Pin your threads” is one sentence that hides three different mechanisms operating at three different
layers, and none of them delivers an isolated, pinned thread on its own. This page explains what
GRUB_CMDLINE_LINUX boot parameters, taskset, and numactl each control, how they compose into a working
recipe, and exactly how to pin an Aeron media driver. Where to pin (which CCD/NUMA node) is covered by
NUMA and cache locality; this page is the how.
The one-line model
```
boot params  carve cores OUT of the scheduler  → the cores sit empty
taskset      places your threads ONTO them     → nothing else can follow
numactl      adds the memory dimension         → pages live next to the NIC
```

They are three layers of one recipe, not alternatives.
Layer 1 — boot parameters: carve out and de-noise
Set in GRUB_CMDLINE_LINUX; whole-machine scope; changed only by reboot.
| Parameter | What it does |
|---|---|
| `isolcpus=domain,<list>` | Removes the CPUs from “the general SMP balancing and scheduling algorithms”; the scheduler will never place anything there. Irreversible at runtime. |
| `isolcpus=managed_irq,…` | Additionally keeps managed device interrupts off the cores (best effort). |
| `nohz_full=<list>` | Stops the scheduling-clock tick on those CPUs, but only while a CPU has a single runnable task, and a residual ~1 Hz tick remains. Implies RCU offload. The boot CPU is forcibly excluded. |
| `rcu_nocbs=<list>` | Moves RCU callback execution off the cores into `rcuox/N` kthreads (redundant with `nohz_full` but conventionally listed; tuned’s `cpu-partitioning` profile sets both). |
| `irqaffinity=<list>` | Default affinity mask for newly allocated IRQs; keeps unmanaged IRQs on housekeeping cores. |
Three facts that get missed:
- `isolcpus` is officially deprecated in kernel-parameters.txt (“use cpusets instead”) yet remains the simplest bulletproof option: Red Hat’s tuned `cpu-partitioning` profile still emits it when you opt into `no_balance_cores=` (by default it relies on softer affinity sweeps; see the tuned section). The modern equivalent is the cgroup v2 isolated partition (`cpuset.cpus.partition = isolated`): same effect (no load balancing, excluded from unbound workqueues) but reversible at runtime.
- Isolation ≠ speed. `isolcpus` only empties cores. If you never place anything on them, you’ve just shrunk your machine and squeezed every unpinned thread onto fewer cores.
- systemd `CPUAffinity=` (in `/etc/systemd/system.conf`) is the soft variant: every service inherits a default mask that excludes your quiet cores, but the scheduler/tick are untouched. Belt-and-braces: use it with the boot params, as tuned does.
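After a reboot it is worth confirming that the parameters actually reached the kernel. The sketch below is one way to do that; `cmdline_value` is a hypothetical helper name (not part of any tool), and it assumes the parameter has the `name=value` form.

```shell
# cmdline_value NAME CMDLINE
# Print the value of NAME=... found in a kernel command-line string
# (for example the contents of /proc/cmdline). Returns 1 if NAME is absent.
# Hypothetical helper for illustration; a bare flag without "=" prints its own name.
cmdline_value() {
    printf '%s\n' "$2" | tr ' ' '\n' | awk -F= -v n="$1" '
        $1 == n { sub(/^[^=]*=/, ""); print; found = 1; exit }
        END     { exit !found }'
}
```

On a live machine, `cmdline_value isolcpus "$(cat /proc/cmdline)"` can then be compared with `/sys/devices/system/cpu/isolated`, and `cmdline_value nohz_full "$(cat /proc/cmdline)"` with `/sys/devices/system/cpu/nohz_full`.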
Layer 2 — taskset: place threads onto the carved-out cores
Runtime, per-process or per-thread (-p <tid>), wraps sched_setaffinity(2).
The load-bearing fact, straight from the man page: with isolcpus, “the only way to schedule
processes onto the isolated CPUs is via sched_setaffinity() or the cpuset mechanism.” An isolated
core is empty by construction — taskset is how your thread gets there, and it’s the only thing
that gets there.
Two properties that shape the recipe:
- Affinity is inherited across `fork`/`exec`, so `taskset -c 8-15 java …` constrains every JVM thread, including GC and JIT. That’s a feature: confine the whole process to housekeeping cores, then move only the named hot threads out, one TID at a time.
- Affinity is restriction, not reservation. Without `isolcpus`/cpusets, other processes can still be scheduled on “your” core. `taskset` alone does not isolate anything.
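Inheritance is easy to observe, because the kernel reports each task's mask in the `Cpus_allowed_list` field of `/proc/<pid>/status` (and per thread under `/proc/<pid>/task/<tid>/status`). A minimal sketch, assuming Linux procfs; the helper names `allowed_list` and `thread_masks` are hypothetical.

```shell
# allowed_list: read a /proc/<pid>/status document on stdin and print
# the CPU list the task is allowed to run on.
allowed_list() {
    awk '$1 == "Cpus_allowed_list:" { print $2; exit }'
}

# thread_masks PID: print "tid comm cpus" for every thread of PID,
# which shows at a glance which threads were moved off the inherited mask.
thread_masks() {
    for dir in /proc/"$1"/task/*; do
        [ -r "$dir/status" ] || continue
        printf '%s %s %s\n' "${dir##*/}" "$(cat "$dir/comm")" \
            "$(allowed_list < "$dir/status")"
    done
}
```

Running `thread_masks <java_pid>` right after a `taskset -c 8-15 java …` launch should show `8-15` on every line; after the hot threads are retargeted, only those lines change.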
Layer 3 — numactl: the memory dimension
taskset moves a thread; it says nothing about where the thread’s pages live. numactl adds the
NUMA policy:
| Option | Meaning |
|---|---|
| `--membind=<node>` (`-m`) | Hard-bind allocations to the node (fails rather than falls back) |
| `--preferred=<node>` | Soft preference with fallback |
| `--cpunodebind=<node>` (`-N`) | Run only on that node’s CPUs (node-granular) |
| `--physcpubind=<cpus>` (`-C`) | Run only on those CPUs (core-granular, taskset-equivalent) |
| `--localalloc` (`-l`) | Allocate on whichever node the thread runs on (the default) |
The trap: --localalloc only helps if the thread is also pinned — first-touch from the wrong node
permanently strands pages there. And numactl can’t re-policy a running process (taskset -p can
retarget a live TID; memory needs migratepages or a restart).
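Whether pages were stranded on the wrong node can be checked after the fact: each mapping in `/proc/<pid>/numa_maps` carries `N<node>=<pages>` tokens. The sketch below sums them per node; `pages_per_node` is a hypothetical helper name, and the counts are in units of each mapping's own page size, so treat the totals as indicative rather than exact bytes.

```shell
# pages_per_node: read /proc/<pid>/numa_maps text on stdin and print
# the total number of resident pages per NUMA node.
pages_per_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[substr(kv[1], 2)] += kv[2]
            }
    }
    END { for (n in pages) print "node" n, pages[n] }' | sort
}
```

Typical use is `pages_per_node < /proc/$DRIVER_PID/numa_maps`: a driver launched with `--membind=0` should report pages on node 0 only.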
Side-by-side
| | Boot params | taskset | numactl |
|---|---|---|---|
| When applied | Boot (reboot to change) | Launch or live (-p, per-TID) | Launch only |
| Controls | What the scheduler balances over; tick/RCU; default IRQ targets | Which CPUs one process/thread may use | CPUs (node- or core-granular) and memory placement |
| Does NOT | Place anything; speed anything up alone | Reserve the core; control memory | Exclude others; re-policy live processes |
| Aeron use | Carve cores for driver sender/receiver + hot app threads | Pin Java driver threads by TID | Launch driver/app with --membind to the NIC’s node |
Misconceptions that cost real debugging time
- “taskset isolates a core.” No, it only restricts that process. Exclusion needs `isolcpus`, an isolated cgroup partition, or a system-wide systemd `CPUAffinity=`.
- “isolcpus alone makes things faster.” No, it only empties cores; the win comes from the combination. Measured (Mark Price): affinity + isolcpus took inter-thread max latency from 11.5 ms to 14.8 µs; IRQ steering and `nosoftlockup` then cut residual jitter 15 µs → 2.5 µs.
- “numactl replaces taskset.” Partially: `-C` is core-granular at launch, but `-N` is a whole node, and only `taskset -p` can retarget a running thread.
- “nohz_full removes all ticks.” Only with exactly one runnable task, and a ~1 Hz residual remains.
- “isolcpus protects against interrupts.” Only `managed_irq`-class, best-effort. ENA’s queue IRQs follow `/proc/irq/N/smp_affinity_list`, so steer them yourself. (irqbalance auto-bans isolated and nohz_full CPUs by default: keep the daemon, don’t disable it, per the ENA guide.)
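The interrupt point can be audited rather than assumed. The following sketch lists IRQs whose configured affinity still overlaps the isolated set; the function names are hypothetical, and it assumes the kernel exposes `/sys/devices/system/cpu/isolated` and `/proc/irq/<N>/smp_affinity_list`.

```shell
# expand_cpulist LIST: turn a kernel cpulist such as "2-4,8" into "2 3 4 8".
expand_cpulist() {
    printf '%s\n' "$1" | awk -F, '{
        for (i = 1; i <= NF; i++) {
            n = split($i, r, "-")
            lo = r[1] + 0
            hi = (n == 2 ? r[2] : r[1]) + 0
            for (c = lo; c <= hi; c++) printf "%s%d", (out++ ? " " : ""), c
        }
        print ""
    }'
}

# cpulists_overlap A B: succeed if the two cpulists share at least one CPU.
cpulists_overlap() {
    right=$(expand_cpulist "$2")
    for a in $(expand_cpulist "$1"); do
        for b in $right; do
            [ "$a" -eq "$b" ] && return 0
        done
    done
    return 1
}

# irqs_on_isolated: print every IRQ whose affinity list touches an isolated CPU.
irqs_on_isolated() {
    isolated=$(cat /sys/devices/system/cpu/isolated 2>/dev/null)
    [ -n "$isolated" ] || return 0
    for f in /proc/irq/[0-9]*/smp_affinity_list; do
        mask=$(cat "$f" 2>/dev/null) || continue
        if cpulists_overlap "$mask" "$isolated"; then
            irq=${f%/smp_affinity_list}
            printf 'IRQ %s -> %s\n' "${irq##*/}" "$mask"
        fi
    done
}
```

An empty result from `irqs_on_isolated` means no IRQ is configured to target the quiet cores; any line it prints is a candidate for the `smp_affinity_list` steering described above.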
What Aeron itself provides
- C driver (`aeronmd`): built-in per-agent pinning via `AERON_CONDUCTOR_CPU_AFFINITY`, `AERON_SENDER_CPU_AFFINITY`, `AERON_RECEIVER_CPU_AFFINITY` (default −1 = unpinned), applied via `sched_setaffinity` with a single-CPU mask.
- Java driver: no affinity properties. Aeron’s own benchmark harness pins it externally: it resolves the TIDs of the threads named `driver-conducto`, `sender`, `receiver` and runs `taskset -p -c <core> <tid>`. (Thread names are truncated to 15 chars by the kernel’s `comm`, hence `driver-conducto`.)
- The official harness launch line is exactly the layered recipe: `numactl --membind=$NODE --cpunodebind=$NODE --physcpubind=<housekeeping cores> <driver>`, then hot threads pinned out to dedicated isolated cores.
- Threading guidance from the Aeron wiki: use `DEDICATED` mode when busy threads ≤ spare cores; the conductor “can be run on a dirty CPU”, so spend your isolated cores on sender/receiver and your app’s duty-cycle threads.
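The 15-character truncation is the detail that most often breaks hand-written TID lookups. A small sketch of the lookup, with hypothetical helper names (`comm_name`, `tid_for_thread`) and assuming `ps` output in `tid comm` form on stdin:

```shell
# comm_name NAME: what the kernel's comm field shows for a thread called NAME
# (at most 15 visible characters).
comm_name() {
    printf '%.15s\n' "$1"
}

# tid_for_thread NAME: read "tid comm" lines on stdin and print the TID of the
# first thread whose comm equals the truncated NAME. Returns 1 if none matches.
tid_for_thread() {
    awk -v n="$(comm_name "$1")" '
        $2 == n { print $1; found = 1; exit }
        END     { exit !found }'
}
```

With procps, `ps -L -o tid=,comm= -p "$DRIVER_PID" | tid_for_thread driver-conductor` yields the conductor's TID, which can be passed to `taskset -p -c <core>`. The non-zero exit status when nothing matches is useful because the JVM may not have started its agent threads yet.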
Worked example — c7i.metal box
96 vCPUs = 48 cores × 2 hyperthreads (sibling of CPU N is N+48; verify with lscpu -e, and confirm NUMA
layout with numactl --hardware). Plan: CPUs 2–7 isolated for Aeron + hot app threads, siblings
50–55 isolated-and-left-empty so busy-spin threads own their physical cores; everything else is
housekeeping.
```shell
# 1 · /etc/default/grub (then grub2-mkconfig -o /boot/grub2/grub.cfg && reboot)
GRUB_CMDLINE_LINUX="... isolcpus=managed_irq,domain,2-7,50-55 \
  nohz_full=2-7,50-55 rcu_nocbs=2-7,50-55 irqaffinity=0-1,8-49,56-95 \
  intel_idle.max_cstate=1 processor.max_cstate=1 nosoftlockup"
cat /sys/devices/system/cpu/isolated                    # verify after boot: 2-7,50-55

# 2 · steer ENA queue IRQs to housekeeping cores near (not on) the driver cores
grep Tx-Rx /proc/interrupts                             # find ENA IRQs
echo 8-15 | sudo tee /proc/irq/<N>/smp_affinity_list    # per queue IRQ

# 3a · C media driver: built-in pinning onto the isolated cores
AERON_THREADING_MODE=DEDICATED \
AERON_CONDUCTOR_CPU_AFFINITY=2 AERON_SENDER_CPU_AFFINITY=3 AERON_RECEIVER_CPU_AFFINITY=4 \
numactl --membind=0 --cpunodebind=0 aeronmd &

# 3b · Java media driver: confine to housekeeping, then move hot threads out by TID
numactl --membind=0 --physcpubind=8-15 java io.aeron.driver.MediaDriver &
DRIVER_PID=$!
for t in "driver-conducto:2" "sender:3" "receiver:4"; do
  tid=$(ps Ho tid,comm -p $DRIVER_PID | awk -v n="${t%:*}" '$2~n{print $1; exit}')
  taskset -p -c "${t#*:}" "$tid"
done

# 4 · app JVM: heap on the NIC's node, aux threads on housekeeping,
#     hot duty-cycle threads pinned to the remaining isolated cores
numactl --membind=0 --physcpubind=8-23 java -jar app.jar &
taskset -p -c 5 <producer_tid>    # or OpenHFT Java-Thread-Affinity in-process
taskset -p -c 6 <consumer_tid>
```

Placement rules while assigning cores (the where; see NUMA and cache locality):
- keep the producer thread in the same L3 domain as the driver sender (they share term-buffer cache lines: same socket on Intel, same CCD on AMD);
- never put two busy-spin threads on HT siblings of the same physical core;
- on dual-socket boxes, find the NIC’s node first (`cat /sys/class/net/eth0/device/numa_node`) and put everything (isolated cores, IRQs, `--membind`) on that node.
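The CPU lists in step 1 do not have to be written by hand. The sketch below derives the sibling-extended isolated list and its housekeeping complement (the value used for `irqaffinity=`). Both function names are hypothetical, and the fixed sibling offset is the N+48 assumption stated above, which should be verified with `lscpu -e` on the actual machine.

```shell
# with_siblings LIST OFFSET: append each range of LIST shifted by OFFSET,
# e.g. "2-7" with offset 48 becomes "2-7,50-55".
with_siblings() {
    shifted=$(printf '%s\n' "$1" | awk -F, -v off="$2" '{
        for (i = 1; i <= NF; i++) {
            n = split($i, r, "-")
            s = r[1] + off
            if (n == 2) s = s "-" (r[2] + off)
            printf "%s%s", (i > 1 ? "," : ""), s
        }
        print ""
    }')
    printf '%s,%s\n' "$1" "$shifted"
}

# housekeeping_cpus TOTAL ISOLATED: print the CPUs in 0..TOTAL-1 that are not
# in ISOLATED, as a compact cpulist.
housekeeping_cpus() {
    awk -v total="$1" -v isolated="$2" 'BEGIN {
        n = split(isolated, parts, ",")
        for (i = 1; i <= n; i++) {
            m = split(parts[i], r, "-")
            hi = (m == 2 ? r[2] : r[1]) + 0
            for (c = r[1] + 0; c <= hi; c++) skip[c] = 1
        }
        out = ""; start = -1
        for (c = 0; c <= total; c++) {
            avail = (c < total && !(c in skip))
            if (avail && start < 0) start = c
            if (!avail && start >= 0) {
                range = (start == c - 1) ? start : start "-" (c - 1)
                out = out (out == "" ? "" : ",") range
                start = -1
            }
        }
        print out
    }'
}
```

For the plan above, `with_siblings 2-7 48` gives the isolated list and `housekeeping_cpus 96 2-7,50-55` gives the housekeeping mask, so the two lists stay complementary if the core plan changes.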
Runtime isolation without a reboot
Can’t (or don’t want to) touch GRUB? You can get most of Layer 1 at runtime. Three options, strongest first:
Option 1 — cgroup v2 isolated partition (the isolcpus equivalent, reversible)
```shell
# carve cores 2-4 into an isolated partition at runtime (no reboot)
sudo mkdir -p /sys/fs/cgroup/aeron
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
echo 2-4 | sudo tee /sys/fs/cgroup/aeron/cpuset.cpus
echo isolated | sudo tee /sys/fs/cgroup/aeron/cpuset.cpus.partition
cat /sys/fs/cgroup/aeron/cpuset.cpus.partition   # MUST read "isolated"; "isolated invalid" means the
                                                 # partition rules weren't met (cores busy in siblings)

# put the media driver in it, then pin per-thread as usual
echo <driver_pid> | sudo tee /sys/fs/cgroup/aeron/cgroup.procs
```

Per the kernel docs, an isolated partition’s CPUs get no scheduler load balancing and are excluded
from unbound workqueues — functionally isolcpus=domain, but reversible (echo member to undo) and
officially the blessed path: isolcpus is deprecated in favor of cpuset partitions. On systemd
machines, prefer expressing it through systemd (next option) so the two don’t fight over the cgroup
tree.
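Because a rejected partition still echoes back a string that starts with `isolated`, a script should compare the whole value before moving the driver in. A minimal sketch; `partition_is_isolated` is a hypothetical helper, and the `aeron` cgroup path is the one assumed in the commands above.

```shell
# partition_is_isolated FILE: succeed only if FILE (a cpuset.cpus.partition
# file) contains exactly "isolated"; "isolated invalid ..." fails the check.
partition_is_isolated() {
    state=$(cat "$1" 2>/dev/null) || return 1
    [ "$state" = "isolated" ]
}
```

A launch script can then gate on it, for example `partition_is_isolated /sys/fs/cgroup/aeron/cpuset.cpus.partition || exit 1` before writing the driver PID to `cgroup.procs`.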
Option 2 — systemd CPUAffinity (soft exclusion, two config lines)
```ini
# /etc/systemd/system.conf: every service/login inherits a mask EXCLUDING the quiet cores
CPUAffinity=0-1,5-95

# the aeron service's unit file gets its cores back:
[Service]
AllowedCPUs=2-4
```

This is what tuned cpu-partitioning does by default: nothing systemd spawns lands on the quiet cores,
but the cores stay in the scheduler’s balancing domains and an explicit sched_setaffinity call can
still intrude. Good enough on a single-purpose box; survives reboots.
Option 3 — affinity sweep (brute force, erodes)
```shell
for pid in $(ps -eo pid --no-headers); do
  sudo taskset -a -p -c 0-1,5-95 "$pid" 2>/dev/null
done
```

This is what tuned’s scheduler plugin does internally, but hand-rolled it’s racy: later-spawned processes inherit their parent’s mask and escape. Pair with Option 2 or it erodes over time.
What no runtime option gives you
The tick and RCU silencing (nohz_full, rcu_nocbs) are boot-only: runtime isolation leaves the
~250–1000 Hz scheduler tick firing on your cores. And in every variant, per-CPU kernel threads remain
and device IRQs follow /proc/irq/N/smp_affinity_list — steer those separately (step 2 of the worked
example) regardless of which isolation mechanism you chose.
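Whether the tick is actually firing on a core can be measured from the local timer interrupt counters. This sketch assumes an x86 machine, where `/proc/interrupts` has a `LOC:` row with one column per CPU; the helper names are hypothetical.

```shell
# loc_ticks CPU: read /proc/interrupts text on stdin and print the local
# timer interrupt count for CPU (column CPU+2 of the "LOC:" row).
loc_ticks() {
    awk -v cpu="$1" '$1 == "LOC:" { print $(cpu + 2); exit }'
}

# tick_rate CPU SECONDS: sample the counter twice and print interrupts
# per second (integer division) for that CPU.
tick_rate() {
    before=$(loc_ticks "$1" < /proc/interrupts)
    sleep "$2"
    after=$(loc_ticks "$1" < /proc/interrupts)
    echo $(( (after - before) / $2 ))
}
```

Run `tick_rate 3 10` while the pinned thread is busy-spinning on CPU 3: an idle core stops its tick anyway, so only a measurement under load distinguishes a `nohz_full` core (a few interrupts per second at most) from a runtime-isolated one (roughly the kernel's configured tick rate).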
The packaged alternative: tuned cpu-partitioning
On RHEL-family systems (incl. Amazon Linux), the whole manual recipe above ships as one reversible profile. You set a single variable and reboot:
```shell
dnf install tuned tuned-profiles-cpu-partitioning
echo "isolated_cores=2-7,50-55" >> /etc/tuned/cpu-partitioning-variables.conf
tuned-adm profile cpu-partitioning && reboot
```

Verified against the profile source, it executes the same steps: appends nohz_full=/rcu_nocbs=/
nosoftlockup to the kernel cmdline, steers unbound workqueues (with an early-boot dracut hook),
writes systemd CPUAffinity=, sets IRQBALANCE_BANNED_CPUS, sweeps existing processes and movable
IRQs off the isolated cores, and (via its included network-latency/latency-performance layers)
caps C-states at C1, sets busy_poll=50, disables THP and kernel.numa_balancing.
Two things to know before preferring it:
- By default it does not use `isolcpus`: exclusion is “soft” (affinity sweep + systemd default mask), so the cores stay in the scheduler’s balancing domains and the profile is fully reversible at runtime (`tuned-adm profile <other>`). For the hard kernel-level guarantee, set `no_balance_cores=` too; that adds real `isolcpus=` and the until-reboot irreversibility.
- It still places nothing. After the reboot you must do Layer-2/3 yourself: pin the driver threads (`AERON_*_CPU_AFFINITY` / `taskset -p`) and `numactl --membind` exactly as above. tuned empties and de-noises; it never positions your workload.
EC2 caveat: the profile’s cmdline includes intel_pstate=disable; teams that want intel_pstate keep
it by overriding cmdline_cpu_part in a child profile (include=cpu-partitioning).
Sources
- kernel-parameters.txt: `isolcpus` (flags + deprecation note), `nohz_full`, `rcu_nocbs`, `irqaffinity`.
- NO_HZ / adaptive ticks · per-CPU kthreads · cgroup v2 isolated partitions.
- Man pages: taskset(1) · sched_setaffinity(2) (inheritance, the isolcpus note) · numactl(8) · cpuset(7) · systemd-system.conf(5).
- tuned: cpu-partitioning profile · its man page · network-latency / latency-performance layers.
- ENA Best Practices: IRQ banning, taskset/numactl steering, C-state GRUB lines.
- Mark Price, Reducing System Jitter part 1 / part 2: the measured numbers.
- Aeron: Best Practices wiki · benchmarks harness (the `pin_thread()`/numactl pattern) · C driver affinity env vars in `aeronmd.h` / `aeron_thread.c`.