OMS / Matching Engine Best Practices

Building an OMS or matching engine on Aeron Cluster is less about the framework and more about the operational patterns around it: how you deploy, upgrade, snapshot, and recover without ever telling the market “we’re closed.” This section collects those patterns.

Planned topics

Rolling upgrades of cluster member nodes — upgrading a three- or five-node Aeron Cluster one member at a time, without stopping the world.
How the OMS and ME fit together: two shard axes — the whole-system map: a user-sharded OMS account layer in front of a symbol-sharded matching engine, how an order flows through both, and why a fill crosses the two axes. Start here.
Sharding an OMS by the actor pattern — the symbol axis: scaling the matching engine by partitioning the instrument universe into single-threaded, single-writer actors.
Sharding the OMS by user: hot-user actors with eviction — the user axis: the per-user account/risk/session layer in front of the matching engine, as evictable write-back-cache actors: hydrate on first order, dehydrate when cold, kept deterministic and single-writer.
Deterministic state machines — keeping the replicated service replayable (no wall clocks, no randomness, no external I/O in the business logic).
Snapshot discipline — when to snapshot, sizing the recovery window, testing restores.
Session and duty-cycle design — backpressure handling on ingress/egress.
Failover drills — leader loss, follower loss, AZ loss, and what the runbook says for each.
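
To make the deterministic state machines topic above more concrete, here is a minimal sketch of business logic that stays replayable. It is plain Java and does not use Aeron's API. The class name ReplayableOrderState, the TTL-based expiry rule, and the long[] log layout are all hypothetical, chosen only for illustration. The idea it tries to show is that time enters the logic as an argument taken from the log (in a clustered service this would be the timestamp delivered with each message, not a local clock read), and that iteration order is fixed, so applying the same log twice should yield the same state.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of replayable business logic; not Aeron API. */
public final class ReplayableOrderState {
    // TreeMap iterates in key order, so any scan over open orders is identical on every node.
    private final TreeMap<Long, long[]> openOrders = new TreeMap<>(); // orderId -> {quantity, expiryMs}

    /** Applies one logged command. Time arrives as an argument from the log, never from a wall clock. */
    public void apply(long logTimestampMs, long orderId, long quantity, long ttlMs) {
        // Expire resting orders using log time only, so a replay expires exactly the same orders.
        openOrders.values().removeIf(order -> order[1] <= logTimestampMs);
        openOrders.put(orderId, new long[] {quantity, logTimestampMs + ttlMs});
    }

    /** Stable text rendering of the state, used here to compare two replays. */
    public String digest() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Long, long[]> e : openOrders.entrySet()) {
            sb.append(e.getKey()).append(':').append(e.getValue()[0])
              .append('@').append(e.getValue()[1]).append(';');
        }
        return sb.toString();
    }

    /** Rebuilds state from scratch by applying every command in log order. */
    public static ReplayableOrderState replay(long[][] log) {
        ReplayableOrderState state = new ReplayableOrderState();
        for (long[] c : log) {
            state.apply(c[0], c[1], c[2], c[3]);
        }
        return state;
    }

    public static void main(String[] args) {
        // Each row: {logTimestampMs, orderId, quantity, ttlMs}. Values are made up.
        long[][] log = {
            {1000, 1, 10, 500},
            {1200, 2, 5, 500},
            {1600, 3, 7, 500},
        };
        String first = replay(log).digest();
        String second = replay(log).digest();
        System.out.println(first);
        System.out.println("replays match: " + first.equals(second));
    }
}
```

In this sketch, order 1 expires when the third command arrives, because its expiry (1500) is at or before that command's log timestamp (1600). Had apply() called System.currentTimeMillis() instead, the set of expired orders would depend on when the replay ran, and a restarted node could diverge from its peers.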

Related foundations elsewhere on this site: performance tuning, operations & resilience.