
Failure-Mode Runbook: Clients, Ops, and Resources

Tuning makes the cluster fast. This runbook keeps it correct when something breaks. It covers three failure families that share one trait: they rarely look like infrastructure faults. They look like client retries, fat-fingered config, and silent resource pressure — and they are the ones teams forget to plan for.

Read the three sections as a triage guide. Each leads with what goes wrong, what it costs you, and exactly what to do about it.

The client-cluster boundary is where reliability is most often misunderstood. Three cases dominate.

| Case | Data loss risk | Availability impact | Recovery complexity | Remediation |
|---|---|---|---|---|
| Client disconnection during leader election | Uncommitted messages | Brief (reconnect needed) | Low: client retry logic | Implement client-side leader discovery and auto-reconnect; track correlation IDs; retry unacknowledged messages after reconnect |
| Duplicate message delivery (client retry) | None (but double processing) | None | Low: implement idempotency | Implement idempotency in the clustered service using correlation ID / sequence number deduplication |
| Message loss on ingress (leader crash before commit) | Uncommitted messages | Brief | Low: client timeout + retry | Client detects the missing egress response via timeout; retries the message to the new leader |

Three patterns turn these cases from incidents into non-events.

1. Correlation ID tracking. Every message sent to the cluster should include a unique correlation ID. The client tracks which IDs have been acknowledged. After a reconnect, retry all unacknowledged IDs.

2. Idempotency in service code. The clustered service must handle duplicate messages gracefully:

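// In onSessionMessage():
if (processedCorrelationIds.contains(correlationId)) {
    // Already processed: resend the cached response instead of re-applying the message
    return;
}
// First delivery: process the message, then record its ID.
// processedCorrelationIds is service state, so it must be included in snapshots.
processedCorrelationIds.add(correlationId);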

3. Leader discovery. Clients should implement automatic leader discovery — when redirected by a follower, reconnect to the leader without manual intervention.
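The tracking half of these patterns can be sketched in plain Java. This is a minimal illustration, not an Aeron API: `PendingRequests`, its method names, and the simple in-memory counter are all assumptions made for the example, and a real client would need correlation IDs that stay unique across client restarts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: remembers every message sent until its egress ack arrives. */
final class PendingRequests {
    // Insertion-ordered, so resends keep the original send order.
    private final Map<Long, byte[]> unacked = new LinkedHashMap<>();
    private long nextCorrelationId = 1;

    /** Registers a payload before it is offered to the cluster and returns its correlation ID. */
    long track(byte[] payload) {
        long correlationId = nextCorrelationId++;
        unacked.put(correlationId, payload);
        return correlationId;
    }

    /** Called when the egress response arrives. Returns false for unknown or repeated acks. */
    boolean acknowledge(long correlationId) {
        return unacked.remove(correlationId) != null;
    }

    /** IDs still waiting for an ack, oldest first. Resend these after a reconnect. */
    List<Long> unackedIds() {
        return new ArrayList<>(unacked.keySet());
    }

    /** Payload to resend for an unacknowledged ID, or null if it was already acked. */
    byte[] payload(long correlationId) {
        return unacked.get(correlationId);
    }
}
```

After a reconnect, the client would walk `unackedIds()` and offer each `payload(id)` again under the same correlation ID, relying on the service-side deduplication to absorb any message that had in fact been committed.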

Full E2E order flow — reliability at every layer

An order travels through multiple layers before it is matched. Each layer boundary is a potential message loss point. The key question at each hop: who is responsible for retry — the sender, or can we rely on the upstream source to resend?

① Customer → API Gateway

| Aspect | Detail |
|---|---|
| Protocol | HTTPS / WebSocket (external-facing) |
| Retry responsibility | Customer (sender) |
| Ack mechanism | HTTP response (sync) or WebSocket ack frame |
| Timeout & retry | Customer sets a request timeout; if no HTTP 200 / ack is received, the customer must resend |
| Idempotency key | Customer must include a client-generated clientOrderId (idempotency key) in every request |
| Why sender retries | The API Gateway is stateless: it has no knowledge of what the customer intended to send. Only the customer knows the original request and can resend it |
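The customer-side contract in this table comes down to one rule: generate the clientOrderId once and reuse it on every resend. A minimal sketch, assuming a hypothetical transport callback that returns true when an ack arrives before the request timeout (`OrderSubmitter` and `sendAndAwaitAck` are illustrative names, not part of any real API):

```java
import java.util.UUID;
import java.util.function.Predicate;

/** Hypothetical customer-side submitter: one logical order, one idempotency key. */
final class OrderSubmitter {
    // Assumed transport hook: sends the order under the given key, true if acked in time.
    private final Predicate<String> sendAndAwaitAck;
    private final int maxAttempts;

    OrderSubmitter(Predicate<String> sendAndAwaitAck, int maxAttempts) {
        this.sendAndAwaitAck = sendAndAwaitAck;
        this.maxAttempts = maxAttempts;
    }

    /** Returns the clientOrderId once acked, or throws when the retry budget is spent. */
    String submit() {
        // Generated once per logical order and reused for every retry.
        String clientOrderId = UUID.randomUUID().toString();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (sendAndAwaitAck.test(clientOrderId)) {
                return clientOrderId;
            }
        }
        throw new IllegalStateException(
            "order " + clientOrderId + " not acknowledged after " + maxAttempts + " attempts");
    }
}
```

An exhausted retry budget means the outcome is unknown, not that the order failed. The safe follow-up is to look the order up by its clientOrderId rather than submit a fresh one.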

Design notes:

  • The API Gateway should reject duplicates early using the clientOrderId (lookup in a short-lived dedup cache or forward to OMS for dedup).
  • Rate limiting and authentication happen here — retries from the customer are subject to the same throttling rules.
  • WebSocket connections should implement heartbeat/ping-pong; if the connection drops, the customer reconnects and resends any unacknowledged orders.

② API Gateway → OMS (Aeron Cluster)

| Aspect | Detail |
|---|---|
| Protocol | Aeron Cluster ingress (the API Gateway acts as an Aeron client to the OMS cluster) |
| Retry responsibility | API Gateway (sender), per the standard Aeron client-cluster contract |
| Ack mechanism | Egress response from the OMS cluster back to the API Gateway |
| Timeout & retry | API Gateway tracks each request by clientOrderId. If no egress ack arrives within the timeout, the Gateway resends to the OMS cluster (handling leader discovery/reconnection) |
| Idempotency key | clientOrderId propagated from the customer |
| Why sender retries | Same Aeron client-cluster contract as any other hop. The OMS cluster does not track what the Gateway intended to send. During OMS leader elections or network issues, messages can be lost before consensus commit. Only the Gateway knows which requests are outstanding |

Design notes:

  • The API Gateway is an Aeron client connecting to the OMS cluster’s ingress — it follows the same leader discovery, correlation ID tracking, and retry patterns described in the best practices above.
  • OMS cluster’s onSessionMessage() deduplicates on clientOrderId before processing.
  • Since OMS is a cluster, it gives you durable, replicated state for the order lifecycle — if an OMS node crashes, the order state is not lost (it’s in the Raft log).
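Deduplication on clientOrderId needs two things the bare set lookup does not give you: a cached response to return to the retrying sender, and a bound on memory. One possible shape, sketched with the standard library only (`DedupCache` and `processOnce` are hypothetical names, and the handler is assumed to return a non-null response):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded cache: remembers the response for the most recent N keys. */
final class DedupCache<K, R> {
    private final Map<K, R> responses;

    DedupCache(int capacity) {
        // Insertion-ordered map that drops its oldest entry once capacity is exceeded.
        this.responses = new LinkedHashMap<K, R>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, R> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Runs the handler once per key; duplicates get the cached response back. */
    R processOnce(K key, Function<K, R> handler) {
        R cached = responses.get(key);
        if (cached != null) {
            return cached; // duplicate: no re-processing
        }
        R response = handler.apply(key);
        responses.put(key, response);
        return response;
    }
}
```

Count-based eviction is used here because it behaves identically on every node and on replay. A time-based window inside a clustered service would have to be driven by the timestamp the cluster passes to the service, not by the wall clock, for the same reason. The capacity has to cover the longest period over which a sender might still retry.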

③ OMS (Aeron Cluster) → Matching Engine (Aeron Cluster) — Cluster-to-Cluster

| Aspect | Detail |
|---|---|
| Protocol | Aeron Cluster ingress (the OMS cluster’s state machine acts as an Aeron client to the ME cluster) |
| Retry responsibility | OMS cluster leader (sender), with an important nuance (see the design notes below) |
| Ack mechanism | Egress response from the ME cluster back to the OMS leader |
| Timeout & retry | OMS state machine tracks each forwarded order by correlationId. If no egress ack arrives within the timeout, OMS resends to the ME cluster |
| Idempotency key | correlationId (mapped from clientOrderId); the Matching Engine deduplicates on this |
| Why sender retries | Same Aeron client-cluster contract. The ME cluster does not know what OMS intended to send. Only the OMS cluster (as the Aeron client) knows which orders are in-flight toward ME |

Design notes — cluster-to-cluster communication pattern:

  • Only the OMS leader sends to ME. The OMS leader’s state machine creates an Aeron client session to the ME cluster. Followers do NOT independently send to ME — this would cause duplicates.

  • OMS leader election → reconnect to ME. When the OMS cluster elects a new leader, the new leader must establish a new Aeron client session to the ME cluster and resend any orders that were in-flight (committed in OMS Raft log but not yet acked by ME). This is safe because:

    • The OMS Raft log is the source of truth for what was sent.
    • On replay, the new OMS leader knows exactly which orders need to be forwarded.
    • The ME cluster deduplicates on correlationId.
  • Deterministic replay is critical. The OMS state machine’s decision to forward an order to ME must be deterministic. The act of sending is a side effect: every node re-applies the decision when it processes the log, but the actual send must fire only on the live leader, never on followers or during log replay. A common pattern:

    // In the OMS ClusteredService, called from onSessionMessage():
    void onNewOrder(long correlationId, Order order) {
        // Deterministic state update (applied on all nodes, live or during replay)
        orderBook.trackOrder(correlationId, order);
        // Assumes pendingForward is a map keyed by correlationId; entries are removed once ME acks
        pendingForward.put(correlationId, order);

        // Side effect: only the live leader actually sends to ME.
        // cluster is the Cluster handle the service received in onStart().
        if (cluster.role() == Cluster.Role.LEADER) {
            meClient.offer(encodeOrder(correlationId, order));
        }
    }

  • Pending order tracking survives failover. Because pendingForward is part of the OMS cluster’s replicated state (snapshotted + replayed), a new leader knows exactly which orders were accepted but not yet acked by ME, and can resend them.

  • The Matching Engine’s onSessionMessage() must dedup on correlationId before applying the order to the order book.
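The failover path in these notes can be exercised without a cluster. The sketch below is a plain-Java model, not Aeron code: `OmsNode`, `MatchingEngineStub`, and the boolean role flag are stand-ins invented for the example, and it assumes the ME ack is itself sequenced through the OMS log so that every OMS node removes the pending entry at the same point.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

/** Stand-in for the ME service: applies each correlationId at most once. */
final class MatchingEngineStub {
    private final Set<Long> seen = new HashSet<>();
    final List<String> applied = new ArrayList<>();
    int received;

    void onOrder(long correlationId, String order) {
        received++;
        if (seen.add(correlationId)) { // false means duplicate: skip it
            applied.add(order);
        }
    }
}

/** Replicated OMS state plus the leader-only send side effect. */
final class OmsNode {
    private final Map<Long, String> pendingForward = new LinkedHashMap<>();
    private final BiConsumer<Long, String> sendToMe; // hypothetical ME client
    private boolean leader;

    OmsNode(BiConsumer<Long, String> sendToMe) {
        this.sendToMe = sendToMe;
    }

    /** Applied on every node for each committed order. */
    void onNewOrder(long correlationId, String order) {
        pendingForward.put(correlationId, order);
        if (leader) {
            sendToMe.accept(correlationId, order);
        }
    }

    /** Applied on every node once the ME ack is committed to the OMS log (assumption). */
    void onMeAck(long correlationId) {
        pendingForward.remove(correlationId);
    }

    /** On promotion, resend everything accepted but not yet acked. */
    void onRoleChange(boolean isLeader) {
        leader = isLeader;
        if (leader) {
            pendingForward.forEach(sendToMe);
        }
    }
}
```

If the old leader dies after forwarding an order but before its ack is committed, the promoted node resends that order, ME receives it twice, and the dedup check keeps it from being applied twice.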

④ Cluster internal (Raft)

| Aspect | Detail |
|---|---|
| Protocol | Raft consensus (internal to Aeron Cluster) |
| Retry responsibility | Aeron Cluster framework handles this automatically |
| Ack mechanism | Raft log replication: the leader replicates to followers, and an entry is committed once a majority has acked |
| Why no manual retry | Once a message is committed to the Raft log, the framework guarantees it is replicated and will be replayed on all nodes. This is the exactly-once boundary |

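The "committed once a majority has acked" rule is simple arithmetic. The sketch below shows only that rule: it is not Aeron's internal code, `Quorum` is a name made up for the example, and it ignores the leadership-term checks a full Raft implementation applies.

```java
import java.util.Arrays;

/** Illustrative majority arithmetic for a replicated log. */
final class Quorum {
    /** Smallest number of members that forms a majority. */
    static int threshold(int memberCount) {
        return memberCount / 2 + 1;
    }

    /** Highest log position that at least a majority of members have appended. */
    static long commitPosition(long[] appendedPositions) {
        long[] sorted = appendedPositions.clone();
        Arrays.sort(sorted); // ascending
        // The member at this index, plus every member after it, has reached this position.
        return sorted[sorted.length - threshold(sorted.length)];
    }
}
```

With three members at positions 100, 80, and 60, the commit position is 80. With positions 100, 0, and 0, nothing past 0 is committed, which is the window in which a leader crash loses an ingress message and the sender has to retry.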
| Hop | Retry owner | Can rely on source to resend? | Why / why not |
|---|---|---|---|
| Customer → API GW | Customer | No (the customer is the origin) | No upstream source exists. The customer is the source of truth for their intent. |
| API GW → OMS Cluster | API Gateway | Partially: if the GW crashes, the customer will time out and resend to a different GW instance, which effectively retries the whole flow | But the GW should still retry itself first. Relying solely on customer resend adds latency (customer timeout is typically longer). The GW has the context in memory and can retry faster. |
| OMS Cluster → ME Cluster | OMS leader | Yes: the OMS Raft log is the safety net. If the OMS leader crashes, the new OMS leader replays from the Raft log, discovers pending (unacked) orders, and resends them to ME. No upstream help is needed for this hop. | This is the key advantage of OMS being a cluster: the in-flight order state is replicated and durable in the OMS Raft log. Unlike a stateless OMS that loses in-memory state on crash, a clustered OMS recovers its pending forward queue via log replay. The upstream (GW) only needs to resend if the order was never committed to the OMS Raft log in the first place. |
| OMS internal (Raft) | Aeron framework | N/A (automatic) | Raft consensus protocol handles replication. No application-level retry needed. |
| ME internal (Raft) | Aeron framework | N/A (automatic) | Raft consensus protocol handles replication. No application-level retry needed. |

  • Every sender is responsible for retry at its own layer. This is the safest default — each layer owns reliability for its outbound hop.
  • OMS as an Aeron cluster is the critical design choice. It gives you durable, replicated state for the order lifecycle. A stateless OMS loses in-flight orders on crash and must rely on upstream resend. A clustered OMS recovers its own pending state via Raft log replay — it is self-healing for the OMS→ME hop.
  • Cluster-to-cluster communication requires care. Only the OMS leader sends to ME. On leader election, the new leader must reconstruct in-flight state from the Raft log and re-establish the Aeron client session to ME. Followers must never independently send to ME.
  • Upstream resend is a fallback, not a primary mechanism. If the API Gateway crashes, the customer eventually resends. But within the OMS→ME boundary, the OMS cluster handles its own retry — no upstream help needed (unless the order never reached OMS in the first place).
  • Idempotency must propagate end-to-end. The clientOrderId originates at the customer and flows through every layer. Each cluster deduplicates using this key (or a derived key like correlationId). Without this, retries at any layer cause duplicate order execution.
  • There are two exactly-once boundaries. The OMS Raft commit guarantees the order is durably accepted. The ME Raft commit guarantees the order is durably matched. Between these two boundaries (OMS→ME), the contract is at-least-once with idempotency.

The cluster survives node and network faults on its own. It does not survive a wrong clusterMembers string or a kill -9. These are the failures you cause, and the ones a runbook exists to prevent.

| Case | Data loss risk | Availability impact | Recovery complexity | Remediation |
|---|---|---|---|---|
| Incorrect cluster configuration | None | Total (cannot form cluster) | Low: fix config and restart | Validate clusterMembers string, endpoints, and member IDs before deployment; fix config; restart all nodes |
| Accidental deletion of persistent data | Possible to Total | Depends on scope | Medium to Critical | Single node: wipe and rebuild from peers. All nodes: restore from external backup. Implement filesystem-level protection (snapshots, immutable backups) |
| Improper shutdown (kill -9 all nodes) | Possible | Total | Medium: check recording integrity | Restart all nodes; Aeron Archive validates recordings on startup; if corrupt, restore from backup. Use graceful shutdown (SIGTERM) in production |
| Failed rolling upgrade | Silent divergence | Possible | High: rollback may need old snapshots | Stop upgraded nodes; restore pre-upgrade snapshot; restart with old version; fix compatibility before re-attempting |
| Clock skew between nodes | None | Degraded (false elections) | Low: sync clocks (NTP) | Configure NTP/chrony on all nodes; increase election timeout to tolerate minor skew |

1. Config validation. Validate cluster configuration before deployment (member IDs, endpoints, port conflicts).
2. Graceful shutdown. Always use SIGTERM, never SIGKILL. Graceful shutdown allows the node to complete in-flight operations and close recordings cleanly.
3. Pre-upgrade snapshots. Always take a snapshot before any upgrade. This gives you a clean rollback point.
4. Filesystem protection. Use immutable backups or filesystem snapshots for the archive and cluster directories.
5. Clock sync. NTP/chrony is mandatory. Clock skew doesn’t cause data loss but can trigger unnecessary elections.
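Config validation is the easiest of these to automate as a pre-deployment check. The sketch below assumes the clusterMembers layout of one `|`-separated entry per member, each being a numeric ID followed by five comma-separated `host:port` endpoints; confirm that layout against the Aeron version you run before relying on it. `ClusterMembersValidator` is a hypothetical helper, not part of Aeron.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical pre-deployment check for a clusterMembers string. */
final class ClusterMembersValidator {
    // Assumption: id plus five endpoints per member.
    private static final int ENDPOINTS_PER_MEMBER = 5;

    /** Returns human-readable problems; an empty list means these checks passed. */
    static List<String> validate(String clusterMembers) {
        List<String> problems = new ArrayList<>();
        Set<Integer> ids = new HashSet<>();
        Set<String> endpoints = new HashSet<>();
        for (String member : clusterMembers.split("\\|")) {
            String[] fields = member.split(",");
            if (fields.length != ENDPOINTS_PER_MEMBER + 1) {
                problems.add("expected id plus " + ENDPOINTS_PER_MEMBER + " endpoints: " + member);
                continue;
            }
            try {
                int id = Integer.parseInt(fields[0].trim());
                if (!ids.add(id)) {
                    problems.add("duplicate member id: " + id);
                }
            } catch (NumberFormatException e) {
                problems.add("member id is not an integer: " + fields[0]);
            }
            for (int i = 1; i < fields.length; i++) {
                String endpoint = fields[i].trim();
                if (!isHostPort(endpoint)) {
                    problems.add("bad endpoint: " + endpoint);
                } else if (!endpoints.add(endpoint)) {
                    problems.add("endpoint used twice: " + endpoint); // port conflict
                }
            }
        }
        return problems;
    }

    private static boolean isHostPort(String endpoint) {
        int colon = endpoint.lastIndexOf(':');
        if (colon <= 0 || colon == endpoint.length() - 1) {
            return false;
        }
        try {
            int port = Integer.parseInt(endpoint.substring(colon + 1));
            return port > 0 && port <= 65535;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

A check like this catches malformed entries, duplicate IDs, and port clashes. It does not catch a string that is well-formed but differs between nodes, so it is worth also comparing the rendered value across all members before a restart.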

These failures share a signature: nothing crashes, but the leader misses a heartbeat and the cluster panics. Resource pressure manifests as false elections, which is why it is so easy to misdiagnose.

| Case | Data loss risk | Availability impact | Recovery complexity | Remediation |
|---|---|---|---|---|
| JVM OOM | None | Single node down | Low: restart with tuning | Restart node with increased heap (-Xmx); profile memory usage; fix memory leaks; implement state eviction in service |
| File descriptor exhaustion | None | Single node down | Low: increase ulimit, restart | Increase ulimit -n; restart node; review for FD leaks in application code |
| CPU starvation / GC pauses | None | Degraded (false elections) | Medium: tune GC / threading | Switch to low-latency GC (ZGC/Shenandoah); pin Aeron threads to dedicated CPU cores; reduce allocation rate in service |
| Thread starvation (SHARED mode + slow handler) | None | Degraded (perceived dead) | Medium: optimize handler or use DEDICATED | Optimize onSessionMessage handler; offload heavy work; switch to DEDICATED threading mode; increase heartbeat timeout |

The most common pattern is a feedback loop. Resource pressure causes an election, the election causes reconnect traffic, and that traffic adds more pressure.

Note how this connects back to the client section above: well-behaved client reconnect and idempotency logic keeps the reconnection burst from turning a single GC pause into a sustained cascade.
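One way to keep reconnect traffic from arriving as a synchronized burst is capped exponential backoff with jitter on the client. The sketch below is a generic illustration; `ReconnectBackoff` and its parameters are assumptions for the example, not settings of any Aeron client.

```java
import java.util.Random;

/** Hypothetical reconnect delay policy: capped exponential backoff with full jitter. */
final class ReconnectBackoff {
    private final long baseMillis;
    private final long capMillis;
    private final Random random;

    ReconnectBackoff(long baseMillis, long capMillis, Random random) {
        this.baseMillis = baseMillis;
        this.capMillis = capMillis;
        this.random = random;
    }

    /** Delay before the given attempt (0-based): uniform in [0, min(cap, base * 2^attempt)]. */
    long delayMillis(int attempt) {
        long ceiling = baseMillis;
        // Double once per prior attempt, stopping as soon as the cap is reached.
        for (int i = 0; i < attempt && ceiling < capMillis; i++) {
            ceiling *= 2;
        }
        ceiling = Math.min(ceiling, capMillis);
        // Randomizing over the whole range spreads clients apart instead of
        // having them all retry at the same instant.
        return (long) (random.nextDouble() * (ceiling + 1));
    }
}
```

The cap should stay well below the client's own request timeout budget, so that backoff spreads the load without turning a short election into a long client-visible outage.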

| Resource | Monitoring | Prevention |
|---|---|---|
| Heap | -XX:+HeapDumpOnOutOfMemoryError, JMX metrics | Right-size heap, profile allocation rate, fix leaks |
| File descriptors | lsof count, /proc/<pid>/fd | Set ulimit -n 65536+, close resources properly |
| CPU | perf, async-profiler, cycle time counters | Pin threads, isolate cores, use appropriate idle strategy |
| Threads | Thread dumps, cycle time counters | Use DEDICATED mode for production, avoid blocking in handlers |
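The JMX metrics in the heap row are available in-process through the standard `java.lang.management` beans. A minimal probe, with `ResourceProbe` as a hypothetical name and any alert thresholds left to you:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical in-process probe for heap pressure and GC activity. */
final class ResourceProbe {
    /** Fraction of the maximum heap in use, or -1 if the JVM reports no maximum. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getMax() <= 0 ? -1.0 : (double) heap.getUsed() / heap.getMax();
    }

    /** Cumulative milliseconds of collection time reported across all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }
}
```

Sampling `totalGcMillis()` on a fixed interval and watching the delta gives an early signal of rising GC activity. For concurrent collectors the reported collection time is not the same thing as stop-the-world pause time, so treat it as a trend indicator rather than a direct measure of missed heartbeats.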

Map these failure modes back to your latency targets and the priorities fall out:

  • p50 lives in onSessionMessage(). A slow or allocating handler doesn’t just raise the median — in SHARED mode it starves the conductor and triggers false elections. Keep the handler tight, and run DEDICATED mode in production.
  • p99 is wrecked by GC pauses and false elections. A single long stop-the-world pause both spikes p99 and can drop the leader. Low-latency GC (ZGC/Shenandoah), pinned Aeron threads, and isolated cores protect the tail.
  • Throughput collapses during cascades. An election storm fed by a reconnect burst takes the whole cluster offline for the duration. Idempotent clients with correlation-ID retry and sane backoff are what keep a transient pause from becoming sustained downtime.

The cheapest reliability win is almost always discipline at the boundaries — idempotent retries, graceful shutdown, right-sized limits — not heroics during the incident.