Failure-Mode Runbook: Clients, Ops, and Resources
Tuning makes the cluster fast. This runbook keeps it correct when something breaks. It covers three failure families that share one trait: they rarely look like infrastructure faults. They look like client retries, fat-fingered config, and silent resource pressure — and they are the ones teams forget to plan for.
Read the three sections as a triage guide. Each leads with what goes wrong, what it costs you, and exactly what to do about it.
Client-side failures
Section titled “Client-side failures”The client-cluster boundary is where reliability is most often misunderstood. Three cases dominate.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Client disconnection during leader election | Uncommitted msgs | Brief (reconnect needed) | Low — client retry logic | Implement client-side leader discovery and auto-reconnect; track correlation IDs; retry unacknowledged messages after reconnect |
| Duplicate message delivery (client retry) | None (but double processing) | None | Low — implement idempotency | Implement idempotency in clustered service using correlation ID / sequence number deduplication |
| Message loss on ingress (leader crash before commit) | Uncommitted msgs | Brief | Low — client timeout + retry | Client detects missing egress response via timeout; retries the message to new leader |
Client-side best practices
Section titled “Client-side best practices”Three patterns turn these cases from incidents into non-events.
1. Correlation ID tracking. Every message sent to the cluster should include a unique correlation ID. The client tracks which IDs have been acknowledged. After a reconnect, retry all unacknowledged IDs.
2. Idempotency in service code. The clustered service must handle duplicate messages gracefully:
// In onSessionMessage():if (processedCorrelationIds.contains(correlationId)) { // Already processed — send cached response return;}// Process and recordprocessedCorrelationIds.add(correlationId);3. Leader discovery. Clients should implement automatic leader discovery — when redirected by a follower, reconnect to the leader without manual intervention.
Full E2E order flow — reliability at every layer
Section titled “Full E2E order flow — reliability at every layer”An order travels through multiple layers before it is matched. Each layer boundary is a potential message loss point. The key question at each hop: who is responsible for retry — the sender, or can we rely on the upstream source to resend?
Flow overview
Section titled “Flow overview”Layer-by-layer breakdown
Section titled “Layer-by-layer breakdown”① Customer App → API Gateway
Section titled “① Customer App → API Gateway”| Aspect | Detail |
|---|---|
| Protocol | HTTPS / WebSocket (external-facing) |
| Retry responsibility | Customer (sender) |
| Ack mechanism | HTTP response (sync) or WebSocket ack frame |
| Timeout & retry | Customer sets a request timeout; if no HTTP 200 / ack received, customer must resend |
| Idempotency key | Customer must include a client-generated clientOrderId (idempotency key) in every request |
| Why sender retries | The API Gateway is stateless — it has no knowledge of what the customer intended to send. Only the customer knows the original request and can resend it |
Design notes:
- The API Gateway should reject duplicates early using the
clientOrderId(lookup in a short-lived dedup cache or forward to OMS for dedup). - Rate limiting and authentication happen here — retries from the customer are subject to the same throttling rules.
- WebSocket connections should implement heartbeat/ping-pong; if the connection drops, the customer reconnects and resends any unacknowledged orders.
② API Gateway → OMS (Aeron Cluster)
Section titled “② API Gateway → OMS (Aeron Cluster)”| Aspect | Detail |
|---|---|
| Protocol | Aeron Cluster ingress (the API Gateway acts as an Aeron client to the OMS cluster) |
| Retry responsibility | API Gateway (sender) — standard Aeron client-cluster contract |
| Ack mechanism | Egress response from OMS cluster back to API Gateway |
| Timeout & retry | API Gateway tracks each request by clientOrderId. If no egress ack within timeout, the Gateway resends to the OMS cluster (handling leader discovery/reconnection) |
| Idempotency key | clientOrderId propagated from customer |
| Why sender retries | Same Aeron client-cluster contract as any other hop. The OMS cluster does not track what the Gateway intended to send. During OMS leader elections or network issues, messages can be lost before consensus commit. Only the Gateway knows which requests are outstanding |
Design notes:
- The API Gateway is an Aeron client connecting to the OMS cluster’s ingress — it follows the same leader discovery, correlation ID tracking, and retry patterns described in the best practices above.
- OMS cluster’s
onSessionMessage()deduplicates onclientOrderIdbefore processing. - Since OMS is a cluster, it gives you durable, replicated state for the order lifecycle — if an OMS node crashes, the order state is not lost (it’s in the Raft log).
③ OMS (Aeron Cluster) → Matching Engine (Aeron Cluster) — Cluster-to-Cluster
Section titled “③ OMS (Aeron Cluster) → Matching Engine (Aeron Cluster) — Cluster-to-Cluster”| Aspect | Detail |
|---|---|
| Protocol | Aeron Cluster ingress (the OMS cluster’s state machine acts as an Aeron client to the ME cluster) |
| Retry responsibility | OMS cluster leader (sender) — but with an important nuance (see below) |
| Ack mechanism | Egress response from the ME cluster back to the OMS leader |
| Timeout & retry | OMS state machine tracks each forwarded order by correlationId. If no egress ack within timeout, OMS resends to the ME cluster |
| Idempotency key | correlationId (mapped from clientOrderId) — the Matching Engine deduplicates on this |
| Why sender retries | Same Aeron client-cluster contract. The ME cluster does not know what OMS intended to send. Only the OMS cluster (as the Aeron client) knows which orders are in-flight toward ME |
Design notes — cluster-to-cluster communication pattern:
-
Only the OMS leader sends to ME. The OMS leader’s state machine creates an Aeron client session to the ME cluster. Followers do NOT independently send to ME — this would cause duplicates.
-
OMS leader election → reconnect to ME. When the OMS cluster elects a new leader, the new leader must establish a new Aeron client session to the ME cluster and resend any orders that were in-flight (committed in OMS Raft log but not yet acked by ME). This is safe because:
- The OMS Raft log is the source of truth for what was sent.
- On replay, the new OMS leader knows exactly which orders need to be forwarded.
- The ME cluster deduplicates on
correlationId.
-
Deterministic replay is critical. The OMS state machine’s decision to forward an order to ME must be deterministic. The act of sending is a side effect — during replay, the OMS replays the decision but must be careful about re-triggering the actual send only when it is the live leader (not during log replay on followers). A common pattern:
// In OMS ClusteredService.onSessionMessage():void onNewOrder(long correlationId, Order order) {// Deterministic state update (replayed on all nodes)orderBook.trackOrder(correlationId, order);pendingForward.add(correlationId, order);// Side effect: only the live leader actually sends to MEif (role == Cluster.Role.LEADER) {meClient.offer(encodeOrder(correlationId, order));}} -
Pending order tracking survives failover. Because
pendingForwardis part of the OMS cluster’s replicated state (snapshotted + replayed), a new leader knows exactly which orders were accepted but not yet acked by ME — and can resend them. -
The Matching Engine’s
onSessionMessage()must dedup oncorrelationIdbefore applying the order to the order book.
④ Matching Engine (Internal)
Section titled “④ Matching Engine (Internal)”| Aspect | Detail |
|---|---|
| Protocol | Raft consensus (internal to Aeron Cluster) |
| Retry responsibility | Aeron Cluster framework handles this automatically |
| Ack mechanism | Raft log replication — leader replicates to followers, committed once majority ack |
| Why no manual retry | Once a message is committed to the Raft log, the framework guarantees it is replicated and will be replayed on all nodes. This is the exactly-once boundary |
Retry responsibility summary
Section titled “Retry responsibility summary”| Hop | Retry Owner | Can Rely on Source to Resend? | Why / Why Not |
|---|---|---|---|
| Customer → API GW | Customer | No — customer is the origin | No upstream source exists. The customer is the source of truth for their intent. |
| API GW → OMS Cluster | API Gateway | Partially yes — if the GW crashes, the customer will timeout and resend to a different GW instance, which effectively retries the whole flow | But the GW should still retry itself first. Relying solely on customer resend adds latency (customer timeout is typically longer). The GW has the context in memory and can retry faster. |
| OMS Cluster → ME Cluster | OMS leader | Yes — OMS Raft log is the safety net. If the OMS leader crashes, the new OMS leader replays from the Raft log, discovers pending (unacked) orders, and resends them to ME. No upstream help needed for this hop. | This is the key advantage of OMS being a cluster: the in-flight order state is replicated and durable in the OMS Raft log. Unlike a stateless OMS that loses in-memory state on crash, a clustered OMS recovers its pending forward queue via log replay. The upstream (GW) only needs to resend if the order was never committed to the OMS Raft log in the first place. |
| OMS internal (Raft) | Aeron framework | N/A — automatic | Raft consensus protocol handles replication. No application-level retry needed. |
| ME internal (Raft) | Aeron framework | N/A — automatic | Raft consensus protocol handles replication. No application-level retry needed. |
Key takeaways
Section titled “Key takeaways”- Every sender is responsible for retry at its own layer. This is the safest default — each layer owns reliability for its outbound hop.
- OMS as an Aeron cluster is the critical design choice. It gives you durable, replicated state for the order lifecycle. A stateless OMS loses in-flight orders on crash and must rely on upstream resend. A clustered OMS recovers its own pending state via Raft log replay — it is self-healing for the OMS→ME hop.
- Cluster-to-cluster communication requires care. Only the OMS leader sends to ME. On leader election, the new leader must reconstruct in-flight state from the Raft log and re-establish the Aeron client session to ME. Followers must never independently send to ME.
- Upstream resend is a fallback, not a primary mechanism. If the API Gateway crashes, the customer eventually resends. But within the OMS→ME boundary, the OMS cluster handles its own retry — no upstream help needed (unless the order never reached OMS in the first place).
- Idempotency must propagate end-to-end. The
clientOrderIdoriginates at the customer and flows through every layer. Each cluster deduplicates using this key (or a derived key likecorrelationId). Without this, retries at any layer cause duplicate order execution. - There are two exactly-once boundaries. The OMS Raft commit guarantees the order is durably accepted. The ME Raft commit guarantees the order is durably matched. Between these two boundaries (OMS→ME), the contract is at-least-once with idempotency.
Operational / human errors
Section titled “Operational / human errors”The cluster survives node and network faults on its own. It does not survive a wrong clusterMembers string or a kill -9. These are the failures you cause, and the ones a runbook exists to prevent.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Incorrect cluster configuration | None | Total (cannot form cluster) | Low — fix config and restart | Validate clusterMembers string, endpoints, and member IDs before deployment; fix config; restart all nodes |
| Accidental deletion of persistent data | Possible to Total | Depends on scope | Medium to Critical | Single node: wipe and rebuild from peers. All nodes: restore from external backup. Implement filesystem-level protection (snapshots, immutable backups) |
| Improper shutdown (kill -9 all nodes) | Possible | Total | Medium — check recording integrity | Restart all nodes; Aeron Archive validates recordings on startup; if corrupt, restore from backup. Use graceful shutdown (SIGTERM) in production |
| Failed rolling upgrade | Silent divergence | Possible | High — rollback may need old snapshots | Stop upgraded nodes; restore pre-upgrade snapshot; restart with old version; fix compatibility before re-attempting |
| Clock skew between nodes | None | Degraded (false elections) | Low — sync clocks (NTP) | Configure NTP/chrony on all nodes; increase election timeout to tolerate minor skew |
Operational safeguards
Section titled “Operational safeguards”- Config validation — Validate cluster configuration before deployment (member IDs, endpoints, port conflicts).
- Graceful shutdown — Always use SIGTERM, never SIGKILL. Graceful shutdown allows the node to complete in-flight operations and close recordings cleanly.
- Pre-upgrade snapshots — Always take a snapshot before any upgrade. This gives you a clean rollback point.
- Filesystem protection — Use immutable backups or filesystem snapshots for the archive and cluster directories.
- Clock sync — NTP/chrony is mandatory. Clock skew doesn’t cause data loss but can trigger unnecessary elections.
Resource exhaustion
Section titled “Resource exhaustion”These failures share a signature: nothing crashes, but the leader misses a heartbeat and the cluster panics. Resource pressure manifests as false elections, which is why it is so easy to misdiagnose.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| JVM OOM | None | Single node down | Low — restart with tuning | Restart node with increased heap (-Xmx); profile memory usage; fix memory leaks; implement state eviction in service |
| File descriptor exhaustion | None | Single node down | Low — increase ulimit, restart | Increase ulimit -n; restart node; review for FD leaks in application code |
| CPU starvation / GC pauses | None | Degraded (false elections) | Medium — tune GC / threading | Switch to low-latency GC (ZGC/Shenandoah); pin Aeron threads to dedicated CPU cores; reduce allocation rate in service |
| Thread starvation (SHARED mode + slow handler) | None | Degraded (perceived dead) | Medium — optimize handler or use DEDICATED | Optimize onSessionMessage handler; offload heavy work; switch to DEDICATED threading mode; increase heartbeat timeout |
Resource exhaustion → election cascade
Section titled “Resource exhaustion → election cascade”The most common pattern is a feedback loop. Resource pressure causes an election, the election causes reconnect traffic, and that traffic adds more pressure.
Note how this connects back to the client section above: well-behaved client reconnect and idempotency logic keeps the reconnection burst from turning a single GC pause into a sustained cascade.
Prevention strategies
Section titled “Prevention strategies”| Resource | Monitoring | Prevention |
|---|---|---|
| Heap | -XX:+HeapDumpOnOutOfMemoryError, JMX metrics | Right-size heap, profile allocation rate, fix leaks |
| File descriptors | lsof count, /proc/pid/fd | Set ulimit -n 65536+, close resources properly |
| CPU | perf, async-profiler, cycle time counters | Pin threads, isolate cores, use appropriate idle strategy |
| Threads | Thread dumps, cycle time counters | Use DEDICATED mode for production, avoid blocking in handlers |
What this means for tuning
Section titled “What this means for tuning”Map these failure modes back to your latency targets and the priorities fall out:
- p50 lives in
onSessionMessage(). A slow or allocating handler doesn’t just raise the median — in SHARED mode it starves the conductor and triggers false elections. Keep the handler tight, and run DEDICATED mode in production. - p99 is wrecked by GC pauses and false elections. A single long stop-the-world pause both spikes p99 and can drop the leader. Low-latency GC (ZGC/Shenandoah), pinned Aeron threads, and isolated cores protect the tail.
- Throughput collapses during cascades. An election storm fed by a reconnect burst takes the whole cluster offline for the duration. Idempotent clients with correlation-ID retry and sane backoff are what keep a transient pause from becoming sustained downtime.
The cheapest reliability win is almost always discipline at the boundaries — idempotent retries, graceful shutdown, right-sized limits — not heroics during the incident.