OMS / Matching Engine Best Practices
Building an OMS or matching engine on Aeron Cluster is less about the framework and more about the operational patterns around it: how you deploy, upgrade, snapshot, and recover without ever telling the market “we’re closed.” This section collects those patterns.
Planned topics
- Rolling upgrades of cluster member nodes: upgrading a 3- or 5-node Aeron Cluster one member at a time, without stopping the world.
- Deterministic state machines: keeping the replicated service replayable (no wall clocks, no randomness, no external I/O in the business logic).
- Snapshot discipline: when to snapshot, sizing the recovery window, and testing restores.
- Session and duty-cycle design: handling backpressure on ingress (client to cluster) and egress (cluster to client).
- Failover drills: leader loss, follower loss, availability-zone (AZ) loss, and what the runbook says for each.
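The determinism rule in the list above can be illustrated with a small, dependency-free Java sketch. It deliberately does not use the Aeron Cluster API. `Command`, `MatchingState`, and `replay` are hypothetical names, and the toy price-level book stands in for real matching logic. The assumption it models is that the consensus log supplies every input the business logic sees, including the timestamp. Aeron Cluster hands the service a log timestamp with each message, and the sketch mimics that by carrying the timestamp inside each log entry. Applying the same log twice, as a follower or a restarting node would, must then produce identical state.

```java
import java.util.List;
import java.util.TreeMap;

/** Sketch only: hypothetical types, not the Aeron Cluster ClusteredService API. */
public final class DeterministicReplay {

    /** One log entry. The timestamp is assigned by the log, never read from a wall clock. */
    record Command(long logTimestamp, boolean buy, long price, long qty) {}

    static final class MatchingState {
        // price -> resting quantity; TreeMap gives a defined iteration order on every member
        private final TreeMap<Long, Long> bids = new TreeMap<>();
        private final TreeMap<Long, Long> asks = new TreeMap<>();
        private long lastTimestamp;
        private long tradedQty;

        void apply(Command c) {
            // Time comes from the log entry, not System.currentTimeMillis(),
            // so a replay sees exactly the same value the leader saw.
            lastTimestamp = c.logTimestamp();
            long remaining = c.qty();
            TreeMap<Long, Long> opposite = c.buy() ? asks : bids;
            while (remaining > 0 && !opposite.isEmpty()) {
                // Best opposite price: lowest ask for a buy, highest bid for a sell
                long best = c.buy() ? opposite.firstKey() : opposite.lastKey();
                boolean crosses = c.buy() ? best <= c.price() : best >= c.price();
                if (!crosses) {
                    break;
                }
                long available = opposite.get(best);
                long fill = Math.min(available, remaining);
                remaining -= fill;
                tradedQty += fill;
                if (fill == available) {
                    opposite.remove(best);
                } else {
                    opposite.put(best, available - fill);
                }
            }
            // Any unfilled remainder rests on its own side of the book
            if (remaining > 0) {
                (c.buy() ? bids : asks).merge(c.price(), remaining, Long::sum);
            }
        }

        /** Canonical rendering of the state, used to compare two replays. */
        String fingerprint() {
            return "bids=" + bids + " asks=" + asks + " traded=" + tradedQty + " t=" + lastTimestamp;
        }
    }

    /** Rebuilds state from scratch by applying every log entry in order. */
    static String replay(List<Command> log) {
        MatchingState state = new MatchingState();
        for (Command c : log) {
            state.apply(c);
        }
        return state.fingerprint();
    }

    public static void main(String[] args) {
        List<Command> log = List.of(
                new Command(1_000, false, 101, 5),  // sell 5 @ 101
                new Command(1_001, false, 102, 5),  // sell 5 @ 102
                new Command(1_002, true, 102, 7));  // buy 7 @ 102: fills 5 @ 101, 2 @ 102
        String first = replay(log);
        String second = replay(log);
        System.out.println(first);                                   // bids={} asks={102=3} traded=7 t=1002
        System.out.println("replay identical: " + first.equals(second)); // replay identical: true
    }
}
```

If `apply` read `System.currentTimeMillis()` or drew from an unseeded `java.util.Random`, the two fingerprints could differ, which is exactly the leader/follower divergence the rule guards against. Comparing a canonical fingerprint after replay is also a cheap building block for the restore tests mentioned under snapshot discipline.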
Related foundations elsewhere on this site: performance tuning, operations & resilience.