Core Isolation & Thread Pinning

“Pin your threads” is one sentence that hides three different mechanisms operating at three different layers, and none of them delivers an isolated, pinned thread on its own. This page explains what GRUB_CMDLINE_LINUX boot parameters, taskset, and numactl each control, how they compose into a working recipe, and exactly how to pin an Aeron media driver. Where to pin (which CCD/NUMA node) is covered by NUMA and cache locality; this page is the how.

The one-line model

boot params carve cores OUT of the scheduler  →  the cores sit empty
taskset places your threads ONTO them          →  nothing else can follow
numactl adds the memory dimension              →  pages live next to the NIC

They are three layers of one recipe, not alternatives.

Layer 1 — boot parameters: carve out and de-noise

Set in GRUB_CMDLINE_LINUX; whole-machine scope; changed only by reboot.

Parameter	What it does
`isolcpus=domain,<list>`	Removes the CPUs from “the general SMP balancing and scheduling algorithms” — the scheduler will never place anything there. Irreversible at runtime.
`isolcpus=managed_irq,…`	Additionally keeps managed device interrupts off the cores (best effort).
`nohz_full=<list>`	Stops the scheduling-clock tick on those CPUs — but only while a CPU has a single runnable task, and a residual ~1 Hz tick remains. Implies RCU offload. The boot CPU is forcibly excluded.
`rcu_nocbs=<list>`	Moves RCU callback execution off the cores into `rcuox/N` kthreads (redundant with `nohz_full` but conventionally listed — tuned’s cpu-partitioning profile sets both).
`irqaffinity=<list>`	Default affinity mask for newly allocated IRQs — keeps unmanaged IRQs on housekeeping cores.

Three facts that get missed:

isolcpus is officially deprecated in kernel-parameters.txt (“use cpusets instead”) yet remains the simplest bulletproof option — Red Hat’s tuned cpu-partitioning profile still emits it when you opt into no_balance_cores= (by default it relies on softer affinity sweeps — see the tuned section). The modern equivalent is the cgroup v2 isolated partition (cpuset.cpus.partition = isolated): same effect — no load balancing, excluded from unbound workqueues — but reversible at runtime.
Isolation ≠ speed. isolcpus only empties cores. If you never place anything on them, you’ve just shrunk your machine — and squeezed every unpinned thread onto fewer cores.
systemd CPUAffinity= (in /etc/systemd/system.conf) is the soft variant: every service inherits a default mask that excludes your quiet cores, but the scheduler/tick are untouched. Belt-and-braces: use it with the boot params, as tuned does.
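
The irqaffinity= list and the systemd CPUAffinity= mask are both the complement of the isolated list, and working that complement out by hand is an easy place to slip. A minimal sketch, assuming a POSIX shell with awk; housekeeping_list is a hypothetical helper name, not part of any tool mentioned on this page:

```shell
# housekeeping_list NCPUS ISOLATED
# Prints the CPUs in 0..NCPUS-1 that are NOT in ISOLATED, as a compact range list.
# (Hypothetical helper for illustration; NCPUS would come from `nproc --all` or lscpu.)
housekeeping_list() {
  awk -v n="$1" -v iso="$2" 'BEGIN {
    # Mark every CPU named in the isolated list (e.g. "2-7,50-55").
    cnt = split(iso, parts, ",")
    for (i = 1; i <= cnt; i++) {
      m = split(parts[i], r, "-")
      lo = r[1] + 0
      hi = (m > 1 ? r[2] : r[1]) + 0
      for (c = lo; c <= hi; c++) isolated[c] = 1
    }
    # Walk 0..n and close a run whenever we hit an isolated CPU or the end.
    out = ""; start = -1
    for (c = 0; c <= n; c++) {
      free = (c < n && !(c in isolated))
      if (free && start < 0) start = c
      if (!free && start >= 0) {
        seg = (start == c - 1) ? start : start "-" (c - 1)
        out = out (out == "" ? "" : ",") seg
        start = -1
      }
    }
    print out
  }'
}
```

For a 96-CPU machine with 2-7,50-55 isolated, housekeeping_list 96 2-7,50-55 prints 0-1,8-49,56-95, which is the value to use for both irqaffinity= and CPUAffinity=.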

Layer 2 — `taskset`: place threads onto the carved-out cores

Runtime, per-process or per-thread (-p <tid>), wraps sched_setaffinity(2).

The load-bearing fact, straight from the man page: with isolcpus, “the only way to schedule processes onto the isolated CPUs is via sched_setaffinity() or the cpuset mechanism.” An isolated core is empty by construction: taskset is how your thread gets there, and no other user task will follow it (per-CPU kernel threads and unsteered IRQs still can).

Two properties that shape the recipe:

Affinity is inherited across fork/exec — so taskset -c 8-15 java … constrains every JVM thread, including GC and JIT. That’s a feature: confine the whole process to housekeeping cores, then move only the named hot threads out, one TID at a time.
Affinity is restriction, not reservation. Without isolcpus/cpusets, other processes can still be scheduled on “your” core. taskset alone does not isolate anything.
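
Both properties are easy to check from procfs: every thread of a process exposes its own allowed-CPU list. A minimal sketch, assuming Linux procfs and a POSIX shell; thread_affinity is a hypothetical helper name, and the optional second argument exists only so the function can be pointed at a different procfs root:

```shell
# thread_affinity PID [PROC_ROOT]
# Prints "TID NAME ALLOWED_CPUS" for each thread of PID.
# (Hypothetical helper for illustration; reads only, changes nothing.)
thread_affinity() (
  pid=$1
  root=${2:-/proc}
  for task in "$root/$pid"/task/*; do
    [ -r "$task/status" ] || continue          # thread exited, or glob did not match
    tid=${task##*/}
    name=$(cat "$task/comm" 2>/dev/null)       # kernel comm, truncated to 15 chars
    cpus=$(awk '$1 == "Cpus_allowed_list:" { print $2 }' "$task/status")
    printf '%s %s %s\n' "$tid" "$name" "$cpus"
  done
)

# Usage:
#   thread_affinity "$DRIVER_PID"
# Right after `taskset -c 8-15 java ...` every line should end in 8-15 (inheritance);
# after `taskset -p -c 3 <tid>` only that TID's line changes to 3.
```

Because affinity is restriction and not reservation, a clean listing here says nothing about who else is allowed on the same core; run the same helper against a suspected intruder's PID to find out.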

Layer 3 — `numactl`: the memory dimension

taskset moves a thread; it says nothing about where the thread’s pages live. numactl adds the NUMA policy:

Option	Meaning
`--membind=<node>` (`-m`)	Hard-bind allocations to the node (fails rather than falls back)
`--preferred=<node>`	Soft preference with fallback
`--cpunodebind=<node>` (`-N`)	Run only on that node’s CPUs — node-granular
`--physcpubind=<cpus>` (`-C`)	Run only on those CPUs — core-granular (taskset-equivalent)
`--localalloc` (`-l`)	Allocate on whichever node the thread runs on (the default)

The trap: --localalloc only helps if the thread is also pinned — first-touch from the wrong node permanently strands pages there. And numactl can’t re-policy a running process (taskset -p can retarget a live TID; memory needs migratepages or a restart).
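
To see whether pages have been stranded, sum the per-node page counts the kernel reports in /proc/<pid>/numa_maps. A minimal sketch, assuming the usual numa_maps layout in which each mapping carries fields such as N0=512; numa_pages is a hypothetical helper name, and the counts are in units of each mapping's own page size, so treat the totals as a rough guide when huge pages are in play:

```shell
# numa_pages [NUMA_MAPS_FILE]
# Prints "N<node> <pages>" per NUMA node, summed over all mappings.
# (Hypothetical helper for illustration; defaults to the calling shell's own map.)
numa_pages() {
  awk '{
    # Fields after the address and policy look like anon=10, N0=8, N1=2, ...
    for (i = 2; i <= NF; i++)
      if ($i ~ /^N[0-9]+=[0-9]+$/) {
        split($i, kv, "=")
        pages[kv[1]] += kv[2]
      }
  }
  END { for (node in pages) print node, pages[node] }' "${1:-/proc/self/numa_maps}" | sort
}

# Usage (same user as the target, or root):
#   numa_pages /proc/"$DRIVER_PID"/numa_maps
# A driver launched with --membind=0 should show N0 only. If pages sit on the
# wrong node, `migratepages <pid> <from-node> <to-node>` moves them without a restart.
```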

Side-by-side

	Boot params	`taskset`	`numactl`
When applied	Boot (reboot to change)	Launch or live (`-p`, per-TID)	Launch only
Controls	What the scheduler balances over; tick/RCU; default IRQ targets	Which CPUs one process/thread may use	CPUs (node- or core-granular) and memory placement
Does NOT	Place anything; speed anything up alone	Reserve the core; control memory	Exclude others; re-policy live processes
Aeron use	Carve cores for driver sender/receiver + hot app threads	Pin Java driver threads by TID	Launch driver/app with `--membind` to the NIC’s node

Misconceptions that cost real debugging time

“taskset isolates a core.” No: it only restricts that one process. Exclusion needs isolcpus, an isolated cgroup partition, or a machine-wide systemd CPUAffinity= default.
“isolcpus alone makes things faster.” No — it only empties cores; the win comes from the combination. Measured (Mark Price): affinity + isolcpus took inter-thread max latency from 11.5 ms to 14.8 µs; IRQ steering and nosoftlockup then cut residual jitter 15 µs → 2.5 µs.
“numactl replaces taskset.” Partially: -C is core-granular at launch, but -N is a whole node, and only taskset -p can retarget a running thread.
“nohz_full removes all ticks.” Only with exactly one runnable task, and a ~1 Hz residual remains.
“isolcpus protects against interrupts.” Only managed_irq-class, best-effort. ENA’s queue IRQs follow /proc/irq/N/smp_affinity_list — steer them yourself. (irqbalance auto-bans isolated and nohz_full CPUs by default — keep the daemon, don’t disable it, per the ENA guide.)
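
The interrupt misconception is the easiest one to audit: list every IRQ whose affinity still overlaps the isolated cores. A minimal sketch, assuming the /proc/irq/N/smp_affinity_list layout named above and a POSIX shell with awk; irqs_touching is a hypothetical helper name:

```shell
# irqs_touching CPULIST [IRQ_ROOT]
# Prints "IRQ AFFINITY" for each IRQ whose smp_affinity_list overlaps CPULIST.
# (Hypothetical helper for illustration; reads only, steers nothing.)
irqs_touching() (
  cpus=$1
  root=${2:-/proc/irq}
  for f in "$root"/*/smp_affinity_list; do
    [ -r "$f" ] || continue
    irq=${f%/smp_affinity_list}; irq=${irq##*/}
    awk -v want="$cpus" -v irq="$irq" '
      BEGIN {
        # Expand the CPUs we care about, e.g. "2-7,50-55".
        n = split(want, parts, ",")
        for (i = 1; i <= n; i++) {
          m = split(parts[i], r, "-")
          hi = (m > 1 ? r[2] : r[1]) + 0
          for (c = r[1] + 0; c <= hi; c++) wanted[c] = 1
        }
      }
      {
        # Expand this IRQ affinity list and stop at the first overlap.
        n = split($1, parts, ",")
        for (i = 1; i <= n; i++) {
          m = split(parts[i], r, "-")
          hi = (m > 1 ? r[2] : r[1]) + 0
          for (c = r[1] + 0; c <= hi; c++)
            if (c in wanted) { print irq, $1; exit }
        }
      }' "$f"
  done
)

# Usage:
#   irqs_touching "$(cat /sys/devices/system/cpu/isolated)"
# Every line printed is an IRQ that can still fire on an isolated core.
```

An empty result is the goal. Anything listed is a candidate for the echo ... | sudo tee /proc/irq/<N>/smp_affinity_list treatment; some IRQs are kernel-managed and will refuse the write.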

What Aeron itself provides

C driver (aeronmd): built-in per-agent pinning — AERON_CONDUCTOR_CPU_AFFINITY, AERON_SENDER_CPU_AFFINITY, AERON_RECEIVER_CPU_AFFINITY (default −1 = unpinned), applied via sched_setaffinity with a single-CPU mask.
Java driver: no affinity properties. Aeron’s own benchmark harness pins it externally — it resolves the TIDs of the threads named driver-conducto, sender, receiver and runs taskset -p -c <core> <tid>. (Thread names are truncated to 15 chars by the kernel’s comm — hence driver-conducto.)
The official harness launch line is exactly the layered recipe: numactl --membind=$NODE --cpunodebind=$NODE --physcpubind=<housekeeping cores> <driver>, then hot threads pinned out to dedicated isolated cores.
Threading guidance from the Aeron wiki: use DEDICATED mode when busy threads ≤ spare cores; the conductor “can be run on a dirty CPU” — spend your isolated cores on sender/receiver and your app’s duty-cycle threads.

Worked example — c7i.metal box

96 vCPUs = 48 cores × 2 hyperthreads (sibling of CPU N is N+48 — verify with lscpu -e; confirm NUMA layout with numactl --hardware). Plan: CPUs 2–7 isolated for Aeron + hot app threads, siblings 50–55 isolated-and-left-empty so busy-spin threads own their physical cores; everything else is housekeeping.

# 1 · /etc/default/grub  (then grub2-mkconfig -o /boot/grub2/grub.cfg && reboot)
GRUB_CMDLINE_LINUX="... isolcpus=managed_irq,domain,2-7,50-55 \
  nohz_full=2-7,50-55 rcu_nocbs=2-7,50-55 irqaffinity=0-1,8-49,56-95 \
  intel_idle.max_cstate=1 processor.max_cstate=1 nosoftlockup"
cat /sys/devices/system/cpu/isolated      # verify after boot: 2-7,50-55

# 2 · steer ENA queue IRQs to housekeeping cores near (not on) the driver cores
grep Tx-Rx /proc/interrupts                               # find ENA IRQs
echo 8-15 | sudo tee /proc/irq/<N>/smp_affinity_list      # per queue IRQ

# 3a · C media driver — built-in pinning onto the isolated cores
AERON_THREADING_MODE=DEDICATED \
AERON_CONDUCTOR_CPU_AFFINITY=2 AERON_SENDER_CPU_AFFINITY=3 AERON_RECEIVER_CPU_AFFINITY=4 \
numactl --membind=0 --cpunodebind=0 aeronmd &

# 3b · Java media driver — confine to housekeeping, then move hot threads out by TID
numactl --membind=0 --physcpubind=8-15 java io.aeron.driver.MediaDriver &
DRIVER_PID=$!
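for t in "driver-conducto:2" "sender:3" "receiver:4"; do
  name=${t%:*}; tid=""
  for _ in $(seq 1 50); do    # the JVM starts its agent threads asynchronously: poll for ~10 s
    tid=$(ps -L -o tid=,comm= -p "$DRIVER_PID" | awk -v n="$name" '$2 == n {print $1; exit}')
    [ -n "$tid" ] && break; sleep 0.2
  done
  if [ -n "$tid" ]; then taskset -p -c "${t#*:}" "$tid"; else echo "thread $name not found" >&2; fi
done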

# 4 · app JVM: heap on the NIC's node, aux threads on housekeeping,
#     hot duty-cycle threads pinned to the remaining isolated cores
numactl --membind=0 --physcpubind=8-23 java -jar app.jar &
taskset -p -c 5 <producer_tid>      # or OpenHFT Java-Thread-Affinity in-process
taskset -p -c 6 <consumer_tid>

Placement rules while assigning cores (the where — see NUMA and cache locality): keep the producer thread in the same L3 domain as the driver sender (they share term-buffer cache lines — same socket on Intel, same CCD on AMD); never put two busy-spin threads on HT siblings of the same physical core; on dual-socket boxes, find the NIC’s node first (cat /sys/class/net/eth0/device/numa_node) and put everything — isolated cores, IRQs, --membind — on that node.
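
The hyperthread rule can be checked mechanically before anything is pinned. A minimal sketch, assuming the sysfs topology files under /sys/devices/system/cpu and a POSIX shell with awk; ht_conflicts and the SYSFS_CPU_ROOT override are hypothetical names introduced for illustration:

```shell
# ht_conflicts CPU...
# Exits non-zero and prints a line for every pair of the given CPUs that are
# hyperthread siblings of the same physical core.
# (Hypothetical helper; SYSFS_CPU_ROOT only exists to point it at another tree.)
ht_conflicts() (
  root=${SYSFS_CPU_ROOT:-/sys/devices/system/cpu}
  for cpu in "$@"; do
    # thread_siblings_list is identical for all siblings of one core, e.g. "2,50".
    printf '%s %s\n' "$cpu" "$(cat "$root/cpu$cpu/topology/thread_siblings_list")"
  done | awk '
    $2 in owner {
      printf "CPUs %s and %s share a physical core (%s)\n", owner[$2], $1, $2
      bad = 1
    }
    { owner[$2] = $1 }
    END { exit bad }'
)

# Usage: pass the CPUs that will carry busy-spin threads.
#   ht_conflicts 3 4 5 6 || echo "re-plan the pinning"
```

For the plan above, ht_conflicts 3 4 5 6 should pass silently, because the siblings 51-54 are isolated but left empty.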

Runtime isolation without a reboot

Can’t (or don’t want to) touch GRUB? You can get most of Layer 1 at runtime. Three options, strongest first:

Option 1 — cgroup v2 isolated partition (the isolcpus equivalent, reversible)

# carve cores 2-4 into an isolated partition at runtime — no reboot
sudo mkdir -p /sys/fs/cgroup/aeron
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
echo 2-4      | sudo tee /sys/fs/cgroup/aeron/cpuset.cpus
echo isolated | sudo tee /sys/fs/cgroup/aeron/cpuset.cpus.partition
cat /sys/fs/cgroup/aeron/cpuset.cpus.partition   # MUST read "isolated" — "isolated invalid" means the
                                                 # partition rules weren't met (cores busy in siblings)

# put the media driver in it, then pin per-thread as usual
echo <driver_pid> | sudo tee /sys/fs/cgroup/aeron/cgroup.procs

Per the kernel docs, an isolated partition’s CPUs get no scheduler load balancing and are excluded from unbound workqueues — functionally isolcpus=domain, but reversible (echo member to undo) and officially the blessed path: isolcpus is deprecated in favor of cpuset partitions. On systemd machines, prefer expressing it through systemd (next option) so the two don’t fight over the cgroup tree.
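
Because a rejected partition fails quietly (the write succeeds and only the read-back shows it), it is worth gating the next step on the state file. A minimal sketch, assuming the cgroup path used above; partition_isolated is a hypothetical helper name:

```shell
# partition_isolated [CGROUP_DIR]
# Succeeds only if the cgroup's partition state reads exactly "isolated".
# (Hypothetical helper for illustration.)
partition_isolated() (
  dir=${1:-/sys/fs/cgroup/aeron}
  state=$(cat "$dir/cpuset.cpus.partition" 2>/dev/null)
  if [ "$state" = "isolated" ]; then
    echo "ok: $dir is an isolated partition"
  else
    # Typical failure text starts with "isolated invalid".
    echo "not isolated: ${state:-cannot read $dir/cpuset.cpus.partition}" >&2
    exit 1
  fi
)

# Usage (as root), before moving the driver in:
#   partition_isolated /sys/fs/cgroup/aeron &&
#     echo "$DRIVER_PID" > /sys/fs/cgroup/aeron/cgroup.procs
# Undo later:
#   echo member > /sys/fs/cgroup/aeron/cpuset.cpus.partition
```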

Option 2 — systemd CPUAffinity (soft exclusion, two config lines)

# /etc/systemd/system.conf — every service/login inherits a mask EXCLUDING the quiet cores
CPUAffinity=0-1,5-95

# the aeron service's unit file overrides the inherited default mask to get its cores back:
[Service]
CPUAffinity=2-4
AllowedCPUs=2-4

This is what tuned cpu-partitioning does by default: nothing systemd spawns lands on the quiet cores, but the cores stay in the scheduler’s balancing domains and an explicit sched_setaffinity call can still intrude. Good enough on a single-purpose box; survives reboots.

Option 3 — affinity sweep (brute force, erodes)

for pid in $(ps -eo pid --no-headers); do
  sudo taskset -a -p -c 0-1,5-95 "$pid" 2>/dev/null
done

This is what tuned’s scheduler plugin does internally, but hand-rolled it is racy: a process forked while the loop is running inherits its parent’s not-yet-swept mask and escapes. Pair it with Option 2 or the exclusion erodes over time.

What no runtime option gives you

The tick and RCU silencing (nohz_full, rcu_nocbs) are boot-only — runtime isolation leaves the ~250–1000 Hz scheduler tick firing on your cores. And in every variant, per-CPU kernel threads remain and device IRQs follow /proc/irq/N/smp_affinity_list — steer those separately (step 2 of the worked example) regardless of which isolation mechanism you chose.
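
Whether the tick is actually firing on a core can be measured, not guessed. A minimal sketch, assuming x86 (where the local timer row in /proc/interrupts is labelled LOC) and that all CPUs are online so that column N+2 belongs to CPU N; timer_ticks and tick_rate are hypothetical helper names:

```shell
# timer_ticks CPU [FILE]
# Prints the cumulative local-timer interrupt count for CPU.
# (Hypothetical helper; FILE defaults to /proc/interrupts.)
timer_ticks() {
  awk -v col="$(( $1 + 2 ))" '$1 == "LOC:" { print $col }' "${2:-/proc/interrupts}"
}

# tick_rate CPU [SECONDS]
# Samples the counter twice and prints the average ticks per second.
tick_rate() (
  cpu=$1; secs=${2:-5}
  before=$(timer_ticks "$cpu")
  sleep "$secs"
  after=$(timer_ticks "$cpu")
  echo "CPU $cpu: $(( (after - before) / secs )) ticks/s"
)

# Usage:
#   tick_rate 3 10
```

On a core with nohz_full and exactly one runnable busy-spin thread, the rate should fall to roughly the residual 1 Hz; on a runtime-isolated core it stays at the kernel's configured tick rate while the thread is running.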

The packaged alternative: `tuned cpu-partitioning`

On RHEL-family systems (incl. Amazon Linux), the Layer 1 part of the manual recipe above ships as one reversible profile. You set a single variable and reboot:

dnf install tuned tuned-profiles-cpu-partitioning
echo "isolated_cores=2-7,50-55" >> /etc/tuned/cpu-partitioning-variables.conf
tuned-adm profile cpu-partitioning && reboot

Verified against the profile source, it executes the same steps: appends nohz_full=/rcu_nocbs=/nosoftlockup to the kernel cmdline, steers unbound workqueues (with an early-boot dracut hook), writes systemd CPUAffinity=, sets IRQBALANCE_BANNED_CPUS, sweeps existing processes and movable IRQs off the isolated cores, and (via its included network-latency/latency-performance layers) caps C-states at C1, sets busy_poll=50, and disables THP and kernel.numa_balancing.

Two things to know before preferring it:

By default it does not use isolcpus — exclusion is “soft” (affinity sweep + systemd default mask), so the cores stay in the scheduler’s balancing domains and the profile is fully reversible at runtime (tuned-adm profile <other>). For the hard kernel-level guarantee, set no_balance_cores= too — that adds real isolcpus= and the until-reboot irreversibility.
It still places nothing. After the reboot you must do Layer-2/3 yourself — pin the driver threads (AERON_*_CPU_AFFINITY / taskset -p) and numactl --membind exactly as above. tuned empties and de-noises; it never positions your workload.

EC2 caveat: the profile’s cmdline includes intel_pstate=disable; teams that want intel_pstate keep it by overriding cmdline_cpu_part in a child profile (include=cpu-partitioning).

Sources

kernel-parameters.txt — isolcpus (flags + deprecation note), nohz_full, rcu_nocbs, irqaffinity.
NO_HZ / adaptive ticks · per-CPU kthreads · cgroup v2 isolated partitions.
Man pages: taskset(1) · sched_setaffinity(2) (inheritance, the isolcpus note) · numactl(8) · cpuset(7) · systemd-system.conf(5).
tuned: cpu-partitioning profile · its man page · network-latency / latency-performance layers.
ENA Best Practices — IRQ banning, taskset/numactl steering, C-state GRUB lines.
Mark Price, Reducing System Jitter part 1 / part 2 — the measured numbers.
Aeron: Best Practices wiki · benchmarks harness (the pin_thread()/numactl pattern) · C driver affinity env vars in aeronmd.h / aeron_thread.c.