
Core Isolation & Thread Pinning

“Pin your threads” is one sentence that hides three different mechanisms operating at three different layers, and none of them delivers a quiet, dedicated core on its own. This page explains what the GRUB_CMDLINE_LINUX boot parameters, taskset, and numactl each control, how they compose into a working recipe, and exactly how to pin an Aeron media driver. Where to pin (which CCD/NUMA node) is covered by NUMA and cache locality; this page is the how.

boot params carve cores OUT of the scheduler → the cores sit empty
taskset places your threads ONTO them → nothing else can follow
numactl adds the memory dimension → pages live next to the NIC

They are three layers of one recipe, not alternatives.

Layer 1 — boot parameters: carve out and de-noise

Set in GRUB_CMDLINE_LINUX; whole-machine scope; changed only by reboot.

| Parameter | What it does |
| --- | --- |
| isolcpus=domain,<list> | Removes the CPUs from “the general SMP balancing and scheduling algorithms”: the scheduler will never place anything there. Irreversible at runtime. |
| isolcpus=managed_irq,… | Additionally keeps managed device interrupts off the cores (best effort). |
| nohz_full=<list> | Stops the scheduling-clock tick on those CPUs, but only while a CPU has a single runnable task, and a residual ~1 Hz tick remains. Implies RCU offload. The boot CPU is forcibly excluded. |
| rcu_nocbs=<list> | Moves RCU callback execution off the cores into rcuox/N kthreads (redundant with nohz_full but conventionally listed; tuned’s cpu-partitioning profile sets both). |
| irqaffinity=<list> | Default affinity mask for newly allocated IRQs; keeps unmanaged IRQs on housekeeping cores. |

Three facts that get missed:

  • isolcpus is officially deprecated in kernel-parameters.txt (“use cpusets instead”) yet remains the simplest bulletproof option — Red Hat’s tuned cpu-partitioning profile still emits it when you opt into no_balance_cores= (by default it relies on softer affinity sweeps — see the tuned section). The modern equivalent is the cgroup v2 isolated partition (cpuset.cpus.partition = isolated): same effect — no load balancing, excluded from unbound workqueues — but reversible at runtime.
  • Isolation ≠ speed. isolcpus only empties cores. If you never place anything on them, you’ve just shrunk your machine — and squeezed every unpinned thread onto fewer cores.
  • systemd CPUAffinity= (in /etc/systemd/system.conf) is the soft variant: every service inherits a default mask that excludes your quiet cores, but the scheduler/tick are untouched. Belt-and-braces: use it with the boot params, as tuned does.
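After a reboot it helps to confirm which Layer 1 settings actually took effect before placing anything. The sketch below is one way to do that, assuming a kernel that exposes /sys/devices/system/cpu/isolated and /sys/devices/system/cpu/nohz_full. The function names expand_cpulist and report_isolation are illustrative helpers, not part of any tool.

```sh
# expand_cpulist: turn a kernel CPU list such as "2-7,50-55" into one CPU per line.
expand_cpulist() {
    old_ifs=$IFS
    IFS=,
    set -- $1            # split the list on commas into positional parameters
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*) seq "${part%-*}" "${part#*-}" ;;   # range "a-b"
            '')  ;;                                 # ignore empty fields
            *)   echo "$part" ;;                    # single CPU
        esac
    done
}

# report_isolation: print what the running kernel reports for each mask.
report_isolation() {
    for f in isolated nohz_full; do
        path=/sys/devices/system/cpu/$f
        if [ -r "$path" ]; then
            printf '%s: %s\n' "$f" "$(cat "$path")"
        else
            printf '%s: not exposed by this kernel\n' "$f"
        fi
    done
}
```

Running report_isolation after boot should echo the lists you put on the command line; an empty value for isolated means the parameter was not applied. expand_cpulist is handy when a later step needs to loop over individual CPUs.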

Layer 2 — taskset: place threads onto the carved-out cores

Runtime, per-process or per-thread (-p <tid>), wraps sched_setaffinity(2).

The load-bearing fact, straight from the man page: with isolcpus, “the only way to schedule processes onto the isolated CPUs is via sched_setaffinity() or the cpuset mechanism.” An isolated core is empty by construction; taskset is how your thread gets there, and your thread is the only thing the scheduler puts there.

Two properties that shape the recipe:

  • Affinity is inherited across fork/exec — so taskset -c 8-15 java … constrains every JVM thread, including GC and JIT. That’s a feature: confine the whole process to housekeeping cores, then move only the named hot threads out, one TID at a time.
  • Affinity is restriction, not reservation. Without isolcpus/cpusets, other processes can still be scheduled on “your” core. taskset alone does not isolate anything.
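Because affinity is inherited and then overridden per TID, it is easy to lose track of which thread ended up where. A minimal sketch for auditing that, assuming a Linux /proc that exposes Cpus_allowed_list in each thread's status file; allowed_list and thread_affinity are hypothetical helper names.

```sh
# allowed_list: extract the Cpus_allowed_list value from /proc status text
# (reads the named files, or stdin when no file is given).
allowed_list() {
    awk '/^Cpus_allowed_list:/ { print $2 }' "$@"
}

# thread_affinity: print "<tid> <comm> <allowed CPUs>" for every thread of a PID.
thread_affinity() {
    pid=$1
    for task in /proc/"$pid"/task/*; do
        [ -r "$task/status" ] || continue
        printf '%s %s %s\n' "${task##*/}" "$(cat "$task/comm")" "$(allowed_list "$task/status")"
    done
}
```

For a JVM launched under taskset -c 8-15, thread_affinity <pid> should list 8-15 for GC and JIT threads and a single CPU for each hot thread that was moved out by TID.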

taskset moves a thread; it says nothing about where the thread’s pages live. numactl adds the NUMA policy:

| Option | Meaning |
| --- | --- |
| --membind=<node> (-m) | Hard-bind allocations to the node (fails rather than falls back) |
| --preferred=<node> | Soft preference with fallback |
| --cpunodebind=<node> (-N) | Run only on that node’s CPUs; node-granular |
| --physcpubind=<cpus> (-C) | Run only on those CPUs; core-granular (taskset-equivalent) |
| --localalloc (-l) | Allocate on whichever node the thread runs on (the default) |

The trap: --localalloc only helps if the thread is also pinned — first-touch from the wrong node permanently strands pages there. And numactl can’t re-policy a running process (taskset -p can retarget a live TID; memory needs migratepages or a restart).
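To see whether first-touch has stranded pages on the wrong node, the per-node counts in /proc/<pid>/numa_maps can be summed. The following is a rough sketch under the assumption that each mapping line carries N<node>=<pages> fields; pages_per_node is an illustrative name, and the totals are page counts, not bytes (mappings backed by huge pages count in larger units).

```sh
# pages_per_node: sum the N<node>=<pages> fields of numa_maps text
# (reads the named files, or stdin when no file is given).
pages_per_node() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^N[0-9]+=[0-9]+$/) {
                split($i, kv, "=")
                pages[kv[1]] += kv[2]
            }
    }
    END { for (n in pages) print n, pages[n] }' "$@" | sort
}
```

Usage would look like pages_per_node /proc/<pid>/numa_maps. A process launched with --membind=0 should show essentially everything under N0; a sizeable count on another node suggests the memory policy was not in place when the pages were first touched.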

| | Boot params | taskset | numactl |
| --- | --- | --- | --- |
| When applied | Boot (reboot to change) | Launch or live (-p, per-TID) | Launch only |
| Controls | What the scheduler balances over; tick/RCU; default IRQ targets | Which CPUs one process/thread may use | CPUs (node- or core-granular) and memory placement |
| Does NOT | Place anything; speed anything up alone | Reserve the core; control memory | Exclude others; re-policy live processes |
| Aeron use | Carve cores for driver sender/receiver + hot app threads | Pin Java driver threads by TID | Launch driver/app with --membind to the NIC’s node |

Misconceptions that cost real debugging time

  1. “taskset isolates a core.” No — it restricts that process. Exclusion needs isolcpus, an isolated cgroup partition, or fleet-wide systemd CPUAffinity=.
  2. “isolcpus alone makes things faster.” No — it only empties cores; the win comes from the combination. Measured (Mark Price): affinity + isolcpus took inter-thread max latency from 11.5 ms to 14.8 µs; IRQ steering and nosoftlockup then cut residual jitter 15 µs → 2.5 µs.
  3. “numactl replaces taskset.” Partially: -C is core-granular at launch, but -N is a whole node, and only taskset -p can retarget a running thread.
  4. “nohz_full removes all ticks.” Only with exactly one runnable task, and a ~1 Hz residual remains.
  5. “isolcpus protects against interrupts.” Only managed_irq-class, best-effort. ENA’s queue IRQs follow /proc/irq/N/smp_affinity_list — steer them yourself. (irqbalance auto-bans isolated and nohz_full CPUs by default — keep the daemon, don’t disable it, per the ENA guide.)

Pinning the Aeron media driver itself depends on which driver you run:

  • C driver (aeronmd): built-in per-agent pinning via AERON_CONDUCTOR_CPU_AFFINITY, AERON_SENDER_CPU_AFFINITY, AERON_RECEIVER_CPU_AFFINITY (default −1 = unpinned), applied via sched_setaffinity with a single-CPU mask.
  • Java driver: no affinity properties. Aeron’s own benchmark harness pins it externally — it resolves the TIDs of the threads named driver-conducto, sender, receiver and runs taskset -p -c <core> <tid>. (Thread names are truncated to 15 chars by the kernel’s comm — hence driver-conducto.)
  • The official harness launch line is exactly the layered recipe: numactl --membind=$NODE --cpunodebind=$NODE --physcpubind=<housekeeping cores> <driver>, then hot threads pinned out to dedicated isolated cores.
  • Threading guidance from the Aeron wiki: use DEDICATED mode when busy threads ≤ spare cores; the conductor “can be run on a dirty CPU” — spend your isolated cores on sender/receiver and your app’s duty-cycle threads.
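The comm truncation matters whenever a script looks up a driver thread by name. As a small sketch (assuming Linux /proc; comm_name and tid_by_name are hypothetical helpers, not part of Aeron or its harness), the lookup can truncate the requested name to 15 characters itself and compare exactly:

```sh
# comm_name: truncate a thread name the way the kernel's comm field does (15 chars).
comm_name() {
    printf '%.15s\n' "$1"
}

# tid_by_name: print the TID of the first thread of <pid> whose comm matches <name>.
tid_by_name() {
    pid=$1
    want=$(comm_name "$2")
    for task in /proc/"$pid"/task/*; do
        [ -r "$task/comm" ] || continue
        if [ "$(cat "$task/comm")" = "$want" ]; then
            echo "${task##*/}"
            return 0
        fi
    done
    return 1   # no thread with that name (yet)
}
```

With this, tid_by_name "$DRIVER_PID" driver-conductor finds the thread even though the kernel stores it as driver-conducto, and the result can be handed to taskset -p -c <core>.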

96 vCPUs = 48 cores × 2 hyperthreads (sibling of CPU N is N+48 — verify with lscpu -e; confirm NUMA layout with numactl --hardware). Plan: CPUs 2–7 isolated for Aeron + hot app threads, siblings 50–55 isolated-and-left-empty so busy-spin threads own their physical cores; everything else is housekeeping.
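The housekeeping set is simply the complement of the isolated set, and getting it wrong silently leaves IRQs on the quiet cores. A minimal sketch for deriving it, assuming CPUs are numbered 0 to total-1 with none offline; housekeeping_cpus is an illustrative name.

```sh
# housekeeping_cpus: print the CPU list that remains after removing the
# isolated list from 0..total-1, compressed back into ranges.
#   $1 = total CPU count, $2 = isolated list such as "2-7,50-55"
housekeeping_cpus() {
    awk -v total="$1" -v list="$2" 'BEGIN {
        n = split(list, parts, ",")
        for (i = 1; i <= n; i++) {
            m = split(parts[i], r, "-")
            lo = r[1] + 0
            hi = (m == 2) ? r[2] + 0 : lo
            for (c = lo; c <= hi; c++) iso[c] = 1
        }
        out = ""
        start = -1
        # run one step past the last CPU so a trailing range gets flushed
        for (c = 0; c <= total; c++) {
            usable = (c < total && !(c in iso))
            if (usable && start < 0) start = c
            if (!usable && start >= 0) {
                range = (start == c - 1) ? start : (start "-" (c - 1))
                out = out (out == "" ? "" : ",") range
                start = -1
            }
        }
        print out
    }'
}
```

For the plan above, housekeeping_cpus 96 2-7,50-55 prints 0-1,8-49,56-95, which is the list to use wherever housekeeping cores are needed (for example as the default IRQ affinity).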

# 1 · /etc/default/grub (then grub2-mkconfig -o /boot/grub2/grub.cfg && reboot)
GRUB_CMDLINE_LINUX="... isolcpus=managed_irq,domain,2-7,50-55 \
nohz_full=2-7,50-55 rcu_nocbs=2-7,50-55 irqaffinity=0-1,8-49,56-95 \
intel_idle.max_cstate=1 processor.max_cstate=1 nosoftlockup"
cat /sys/devices/system/cpu/isolated # verify after boot: 2-7,50-55
# 2 · steer ENA queue IRQs to housekeeping cores near (not on) the driver cores
grep Tx-Rx /proc/interrupts # find ENA IRQs
echo 8-15 | sudo tee /proc/irq/<N>/smp_affinity_list # per queue IRQ
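With many queues, steering each IRQ by hand gets tedious. The sketch below generates the per-IRQ commands as a dry run instead of executing them, assuming the queue interrupts are the /proc/interrupts rows containing Tx-Rx as in the grep above; queue_irqs and steer_plan are hypothetical names.

```sh
# queue_irqs: print the IRQ numbers of rows mentioning Tx-Rx
# (reads the named files, or stdin when no file is given).
queue_irqs() {
    awk '/Tx-Rx/ { sub(/:$/, "", $1); print $1 }' "$@"
}

# steer_plan: read /proc/interrupts-style text on stdin and print, without
# running them, the commands that would steer each queue IRQ.
#   $1 = housekeeping CPU list, e.g. 8-15
steer_plan() {
    queue_irqs | while read -r irq; do
        printf 'echo %s > /proc/irq/%s/smp_affinity_list\n' "$1" "$irq"
    done
}
```

Review the output of steer_plan 8-15 < /proc/interrupts, then run the printed lines as root. Keeping it a dry run avoids writing to the wrong IRQ if the name pattern matches more than expected.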
# 3a · C media driver — built-in pinning onto the isolated cores
AERON_THREADING_MODE=DEDICATED \
AERON_CONDUCTOR_CPU_AFFINITY=2 AERON_SENDER_CPU_AFFINITY=3 AERON_RECEIVER_CPU_AFFINITY=4 \
numactl --membind=0 --cpunodebind=0 aeronmd &
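# 3b · Java media driver: confine to housekeeping, then move hot threads out by TID
numactl --membind=0 --physcpubind=8-15 java io.aeron.driver.MediaDriver &
DRIVER_PID=$!
for t in "driver-conducto:2" "sender:3" "receiver:4"; do
    name=${t%:*}; core=${t#*:}; tid=
    # the JVM starts its agent threads some time after launch, so poll for up to ~10 s
    for i in 1 2 3 4 5 6 7 8 9 10; do
        tid=$(ps Ho tid,comm -p "$DRIVER_PID" | awk -v n="$name" '$2~n{print $1; exit}')
        [ -n "$tid" ] && break
        sleep 1
    done
    if [ -n "$tid" ]; then taskset -p -c "$core" "$tid"; else echo "no thread named $name" >&2; fi
done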
# 4 · app JVM: heap on the NIC's node, aux threads on housekeeping,
# hot duty-cycle threads pinned to the remaining isolated cores
numactl --membind=0 --physcpubind=8-23 java -jar app.jar &
taskset -p -c 5 <producer_tid> # or OpenHFT Java-Thread-Affinity in-process
taskset -p -c 6 <consumer_tid>

Placement rules while assigning cores (the where — see NUMA and cache locality): keep the producer thread in the same L3 domain as the driver sender (they share term-buffer cache lines — same socket on Intel, same CCD on AMD); never put two busy-spin threads on HT siblings of the same physical core; on dual-socket boxes, find the NIC’s node first (cat /sys/class/net/eth0/device/numa_node) and put everything — isolated cores, IRQs, --membind — on that node.
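Finding the NIC's node and the CPUs that belong to it can be scripted. This is a sketch under two assumptions that are mine, not the source's: a numa_node value of -1 (no affinity reported, typical on single-node hosts) is treated as node 0, and SYSFS_ROOT is a hypothetical override that exists only so the helpers can be exercised against a fake tree.

```sh
# nic_numa_node: print the NUMA node of a network interface.
# Assumption: -1 ("no NUMA affinity reported") is mapped to node 0.
nic_numa_node() {
    root=${SYSFS_ROOT:-/sys}
    node=$(cat "$root/class/net/$1/device/numa_node" 2>/dev/null) || return 1
    if [ "$node" -lt 0 ]; then
        node=0
    fi
    echo "$node"
}

# node_cpulist: print the CPU list that belongs to a NUMA node.
node_cpulist() {
    cat "${SYSFS_ROOT:-/sys}/devices/system/node/node$1/cpulist"
}
```

A typical use is NODE=$(nic_numa_node eth0) followed by node_cpulist "$NODE", which gives the candidate pool from which isolated cores, IRQ targets, and the --membind node should all be drawn.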

Can’t (or don’t want to) touch GRUB? You can get most of Layer 1 at runtime. Three options, strongest first:

Option 1 — cgroup v2 isolated partition (the isolcpus equivalent, reversible)

# carve cores 2-4 into an isolated partition at runtime — no reboot
sudo mkdir -p /sys/fs/cgroup/aeron
echo "+cpuset" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
echo 2-4 | sudo tee /sys/fs/cgroup/aeron/cpuset.cpus
echo isolated | sudo tee /sys/fs/cgroup/aeron/cpuset.cpus.partition
cat /sys/fs/cgroup/aeron/cpuset.cpus.partition # MUST read "isolated" — "isolated invalid" means the
# partition rules weren't met (cores busy in siblings)
# put the media driver in it, then pin per-thread as usual
echo <driver_pid> | sudo tee /sys/fs/cgroup/aeron/cgroup.procs

Per the kernel docs, an isolated partition’s CPUs get no scheduler load balancing and are excluded from unbound workqueues — functionally isolcpus=domain, but reversible (echo member to undo) and officially the blessed path: isolcpus is deprecated in favor of cpuset partitions. On systemd machines, prefer expressing it through systemd (next option) so the two don’t fight over the cgroup tree.
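Because a failed partition request is reported only through the file's contents, a launch script should check it rather than assume success. A minimal sketch, assuming the cgroup v2 hierarchy is mounted at /sys/fs/cgroup; check_partition is an illustrative helper and CGROUP_ROOT is a hypothetical override used only for testing.

```sh
# check_partition: succeed only if the named cgroup reports exactly "isolated".
# Anything else (for example "isolated invalid ..." or "member") is a failure.
check_partition() {
    file=${CGROUP_ROOT:-/sys/fs/cgroup}/$1/cpuset.cpus.partition
    state=$(cat "$file" 2>/dev/null) || { echo "cannot read $file" >&2; return 1; }
    if [ "$state" = "isolated" ]; then
        echo "ok: $1 is an isolated partition"
    else
        echo "not isolated: $1 reports '$state'" >&2
        return 1
    fi
}
```

Calling check_partition aeron || exit 1 before starting the driver keeps a misconfigured box from running the hot threads on cores that are still being load-balanced.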

Option 2 — systemd CPUAffinity (soft exclusion, two config lines)

# /etc/systemd/system.conf: every service/login inherits a mask EXCLUDING the quiet cores
[Manager]
CPUAffinity=0-1,5-95
# the aeron service's unit file gets its cores back (a per-unit CPUAffinity= overrides the inherited default):
[Service]
AllowedCPUs=2-4
CPUAffinity=2-4

This is what tuned cpu-partitioning does by default: nothing systemd spawns lands on the quiet cores, but the cores stay in the scheduler’s balancing domains and an explicit sched_setaffinity call can still intrude. Good enough on a single-purpose box; survives reboots.

Option 3 — affinity sweep (brute force, erodes)

for pid in $(ps -eo pid --no-headers); do
    sudo taskset -a -p -c 0-1,5-95 "$pid" 2>/dev/null
done

This is what tuned’s scheduler plugin does internally, but hand-rolled it is racy: a process forked while the sweep is running inherits its parent’s pre-sweep mask and escapes. Pair it with Option 2 or it erodes over time.
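Most of the errors the sweep hides behind 2>/dev/null come from kernel threads, many of which are per-CPU and cannot be moved. One way to make failures meaningful is to skip them up front. The sketch assumes kernel threads are kthreadd (PID 2) and its children, which holds on an ordinary host but not inside a PID namespace; userspace_pids and sweep are hypothetical names.

```sh
# userspace_pids: read "pid ppid" pairs on stdin and drop kthreadd (PID 2)
# and its children, leaving only user-space processes.
userspace_pids() {
    awk '$1 != 2 && $2 != 2 { print $1 }'
}

# sweep: move every user-space process (all threads) onto the given CPU list.
#   $1 = housekeeping list, e.g. 0-1,5-95; run as root.
sweep() {
    ps -e -o pid= -o ppid= | userspace_pids | while read -r pid; do
        taskset -a -p -c "$1" "$pid" >/dev/null 2>&1 \
            || echo "could not move pid $pid (exited or refused)" >&2
    done
}
```

With kernel threads filtered out, any remaining "could not move" line is either a process that exited mid-sweep or something worth investigating. The race described above still applies, so this is a complement to the systemd default mask, not a replacement.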

The tick and RCU silencing (nohz_full, rcu_nocbs) are boot-only — runtime isolation leaves the ~250–1000 Hz scheduler tick firing on your cores. And in every variant, per-CPU kernel threads remain and device IRQs follow /proc/irq/N/smp_affinity_list — steer those separately (step 2 of the worked example) regardless of which isolation mechanism you chose.
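Whether the tick is still firing on a core can be estimated from the local timer interrupt counter. This is a rough sketch assuming an x86-style /proc/interrupts with a LOC: row and all CPUs online, so that column N+2 belongs to CPU N; loc_count and tick_rate are illustrative names, and LOC counts every local timer interrupt, not only the scheduler tick.

```sh
# loc_count: print the local-timer interrupt count for one CPU.
#   $1 = CPU number; remaining args = files to read (stdin when none given)
loc_count() {
    cpu=$1
    shift
    awk -v col="$((cpu + 2))" '$1 == "LOC:" { print $col }' "$@"
}

# tick_rate: sample the counter twice and report interrupts per second.
#   $1 = CPU number, $2 = sampling window in whole seconds (default 5)
tick_rate() {
    secs=${2:-5}
    before=$(loc_count "$1" /proc/interrupts)
    sleep "$secs"
    after=$(loc_count "$1" /proc/interrupts)
    echo "cpu$1: $(( (after - before) / secs )) local timer interrupts/s"
}
```

On a core with nohz_full and a single busy thread, tick_rate 3 should report a value near the residual ~1 Hz; on a core isolated only at runtime it should sit near the kernel's configured tick frequency.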

The packaged alternative: tuned cpu-partitioning

On RHEL-family systems (incl. Amazon Linux), the whole manual recipe above ships as one reversible profile. You set a single variable and reboot:

sudo dnf install tuned tuned-profiles-cpu-partitioning
echo "isolated_cores=2-7,50-55" | sudo tee -a /etc/tuned/cpu-partitioning-variables.conf
sudo tuned-adm profile cpu-partitioning && sudo reboot

Checked against the profile source, the profile performs the same steps: it appends nohz_full=, rcu_nocbs=, and nosoftlockup to the kernel cmdline, steers unbound workqueues (with an early-boot dracut hook), writes systemd CPUAffinity=, sets IRQBALANCE_BANNED_CPUS, sweeps existing processes and movable IRQs off the isolated cores, and (via its included network-latency/latency-performance layers) caps C-states at C1, sets busy_poll=50, and disables THP and kernel.numa_balancing.

Two things to know before preferring it:

  • By default it does not use isolcpus — exclusion is “soft” (affinity sweep + systemd default mask), so the cores stay in the scheduler’s balancing domains and the profile is fully reversible at runtime (tuned-adm profile <other>). For the hard kernel-level guarantee, set no_balance_cores= too — that adds real isolcpus= and the until-reboot irreversibility.
  • It still places nothing. After the reboot you must do Layer-2/3 yourself — pin the driver threads (AERON_*_CPU_AFFINITY / taskset -p) and numactl --membind exactly as above. tuned empties and de-noises; it never positions your workload.
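After the reboot, the kernel command line shows what the profile actually applied. A small sketch for pulling individual parameters out of it; cmdline_value is a hypothetical helper that reads /proc/cmdline-style text on stdin.

```sh
# cmdline_value: print the value of a kernel command-line parameter.
# Exit status is 0 if the parameter is present (bare flags print an empty
# line) and non-zero if it is absent.
#   $1 = parameter name, e.g. nohz_full
cmdline_value() {
    tr ' ' '\n' | awk -v k="$1" '
        index($0, k "=") == 1 { print substr($0, length(k) + 2); found = 1 }
        $0 == k               { print ""; found = 1 }
        END { exit !found }'
}
```

For example, cmdline_value nohz_full < /proc/cmdline should print the isolated_cores list, while cmdline_value isolcpus < /proc/cmdline is expected to fail unless no_balance_cores= was set.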

EC2 caveat: the profile’s cmdline includes intel_pstate=disable; teams that want intel_pstate keep it by overriding cmdline_cpu_part in a child profile (include=cpu-partitioning).