Failure-Mode Runbook: Clients, Ops, and Resources

Tuning makes the cluster fast. This runbook keeps it correct when something breaks. It covers three failure families that share one trait: they rarely look like infrastructure faults. They look like client retries, fat-fingered config, and silent resource pressure — and they are the ones teams forget to plan for.

Read the three sections as a triage guide. Each leads with what goes wrong, what it costs you, and exactly what to do about it.

Client-side failures

The client-cluster boundary is where reliability is most often misunderstood. Three cases dominate.

Case	Data Loss Risk	Availability Impact	Recovery Complexity	Remediation
Client disconnection during leader election	Uncommitted msgs	Brief (reconnect needed)	Low — client retry logic	Implement client-side leader discovery and auto-reconnect; track correlation IDs; retry unacknowledged messages after reconnect
Duplicate message delivery (client retry)	None (but double processing)	None	Low — implement idempotency	Implement idempotency in clustered service using correlation ID / sequence number deduplication
Message loss on ingress (leader crash before commit)	Uncommitted msgs	Brief	Low — client timeout + retry	Client detects missing egress response via timeout; retries the message to new leader

Client-side best practices

Three patterns turn these cases from incidents into non-events.

1. Correlation ID tracking. Every message sent to the cluster should include a unique correlation ID. The client tracks which IDs have been acknowledged. After a reconnect, retry all unacknowledged IDs.

2. Idempotency in service code. The clustered service must handle duplicate messages gracefully:

// In onSessionMessage():
if (processedCorrelationIds.contains(correlationId)) {
    // Already processed — send cached response
    return;
}
// Process and record
processedCorrelationIds.add(correlationId);

3. Leader discovery. Clients should implement automatic leader discovery — when redirected by a follower, reconnect to the leader without manual intervention.

Full E2E order flow — reliability at every layer

An order travels through multiple layers before it is matched. Each layer boundary is a potential message loss point. The key question at each hop: who is responsible for retry — the sender, or can we rely on the upstream source to resend?

Flow overview

Layer-by-layer breakdown

① Customer App → API Gateway

Aspect	Detail
Protocol	HTTPS / WebSocket (external-facing)
Retry responsibility	Customer (sender)
Ack mechanism	HTTP response (sync) or WebSocket ack frame
Timeout & retry	Customer sets a request timeout; if no HTTP 200 / ack received, customer must resend
Idempotency key	Customer must include a client-generated `clientOrderId` (idempotency key) in every request
Why sender retries	The API Gateway is stateless — it has no knowledge of what the customer intended to send. Only the customer knows the original request and can resend it

Design notes:

The API Gateway should reject duplicates early using the clientOrderId (lookup in a short-lived dedup cache or forward to OMS for dedup).
Rate limiting and authentication happen here — retries from the customer are subject to the same throttling rules.
WebSocket connections should implement heartbeat/ping-pong; if the connection drops, the customer reconnects and resends any unacknowledged orders.

② API Gateway → OMS (Aeron Cluster)

Aspect	Detail
Protocol	Aeron Cluster ingress (the API Gateway acts as an Aeron client to the OMS cluster)
Retry responsibility	API Gateway (sender) — standard Aeron client-cluster contract
Ack mechanism	Egress response from OMS cluster back to API Gateway
Timeout & retry	API Gateway tracks each request by `clientOrderId`. If no egress ack within timeout, the Gateway resends to the OMS cluster (handling leader discovery/reconnection)
Idempotency key	`clientOrderId` propagated from customer
Why sender retries	Same Aeron client-cluster contract as any other hop. The OMS cluster does not track what the Gateway intended to send. During OMS leader elections or network issues, messages can be lost before consensus commit. Only the Gateway knows which requests are outstanding

Design notes:

The API Gateway is an Aeron client connecting to the OMS cluster’s ingress — it follows the same leader discovery, correlation ID tracking, and retry patterns described in the best practices above.
OMS cluster’s onSessionMessage() deduplicates on clientOrderId before processing.
Since OMS is a cluster, it gives you durable, replicated state for the order lifecycle — if an OMS node crashes, the order state is not lost (it’s in the Raft log).

③ OMS (Aeron Cluster) → Matching Engine (Aeron Cluster) — Cluster-to-Cluster

Aspect	Detail
Protocol	Aeron Cluster ingress (the OMS cluster’s state machine acts as an Aeron client to the ME cluster)
Retry responsibility	OMS cluster leader (sender) — but with an important nuance (see below)
Ack mechanism	Egress response from the ME cluster back to the OMS leader
Timeout & retry	OMS state machine tracks each forwarded order by `correlationId`. If no egress ack within timeout, OMS resends to the ME cluster
Idempotency key	`correlationId` (mapped from `clientOrderId`) — the Matching Engine deduplicates on this
Why sender retries	Same Aeron client-cluster contract. The ME cluster does not know what OMS intended to send. Only the OMS cluster (as the Aeron client) knows which orders are in-flight toward ME

Design notes — cluster-to-cluster communication pattern:

Only the OMS leader sends to ME. The OMS leader’s state machine creates an Aeron client session to the ME cluster. Followers do NOT independently send to ME — this would cause duplicates.
OMS leader election → reconnect to ME. When the OMS cluster elects a new leader, the new leader must establish a new Aeron client session to the ME cluster and resend any orders that were in-flight (committed in OMS Raft log but not yet acked by ME). This is safe because:
- The OMS Raft log is the source of truth for what was sent.
- On replay, the new OMS leader knows exactly which orders need to be forwarded.
- The ME cluster deduplicates on correlationId.

Deterministic replay is critical. The OMS state machine’s decision to forward an order to ME must be deterministic. The act of sending is a side effect — during replay, the OMS replays the decision but must be careful about re-triggering the actual send only when it is the live leader (not during log replay on followers). A common pattern:

// In OMS ClusteredService.onSessionMessage():
void onNewOrder(long correlationId, Order order) {
    // Deterministic state update (replayed on all nodes)
    orderBook.trackOrder(correlationId, order);
    pendingForward.add(correlationId, order);

    // Side effect: only the live leader actually sends to ME
    if (role == Cluster.Role.LEADER) {
        meClient.offer(encodeOrder(correlationId, order));
    }
}

Pending order tracking survives failover. Because pendingForward is part of the OMS cluster’s replicated state (snapshotted + replayed), a new leader knows exactly which orders were accepted but not yet acked by ME — and can resend them.
The Matching Engine’s onSessionMessage() must dedup on correlationId before applying the order to the order book.

④ Matching Engine (Internal)

Aspect	Detail
Protocol	Raft consensus (internal to Aeron Cluster)
Retry responsibility	Aeron Cluster framework handles this automatically
Ack mechanism	Raft log replication — leader replicates to followers, committed once majority ack
Why no manual retry	Once a message is committed to the Raft log, the framework guarantees it is replicated and will be replayed on all nodes. This is the exactly-once boundary

Retry responsibility summary

Hop	Retry Owner	Can Rely on Source to Resend?	Why / Why Not
Customer → API GW	Customer	No — customer is the origin	No upstream source exists. The customer is the source of truth for their intent.
API GW → OMS Cluster	API Gateway	Partially yes — if the GW crashes, the customer will timeout and resend to a different GW instance, which effectively retries the whole flow	But the GW should still retry itself first. Relying solely on customer resend adds latency (customer timeout is typically longer). The GW has the context in memory and can retry faster.
OMS Cluster → ME Cluster	OMS leader	Yes — OMS Raft log is the safety net. If the OMS leader crashes, the new OMS leader replays from the Raft log, discovers pending (unacked) orders, and resends them to ME. No upstream help needed for this hop.	This is the key advantage of OMS being a cluster: the in-flight order state is replicated and durable in the OMS Raft log. Unlike a stateless OMS that loses in-memory state on crash, a clustered OMS recovers its pending forward queue via log replay. The upstream (GW) only needs to resend if the order was never committed to the OMS Raft log in the first place.
OMS internal (Raft)	Aeron framework	N/A — automatic	Raft consensus protocol handles replication. No application-level retry needed.
ME internal (Raft)	Aeron framework	N/A — automatic	Raft consensus protocol handles replication. No application-level retry needed.

Key takeaways

Every sender is responsible for retry at its own layer. This is the safest default — each layer owns reliability for its outbound hop.
OMS as an Aeron cluster is the critical design choice. It gives you durable, replicated state for the order lifecycle. A stateless OMS loses in-flight orders on crash and must rely on upstream resend. A clustered OMS recovers its own pending state via Raft log replay — it is self-healing for the OMS→ME hop.
Cluster-to-cluster communication requires care. Only the OMS leader sends to ME. On leader election, the new leader must reconstruct in-flight state from the Raft log and re-establish the Aeron client session to ME. Followers must never independently send to ME.
Upstream resend is a fallback, not a primary mechanism. If the API Gateway crashes, the customer eventually resends. But within the OMS→ME boundary, the OMS cluster handles its own retry — no upstream help needed (unless the order never reached OMS in the first place).
Idempotency must propagate end-to-end. The clientOrderId originates at the customer and flows through every layer. Each cluster deduplicates using this key (or a derived key like correlationId). Without this, retries at any layer cause duplicate order execution.
There are two exactly-once boundaries. The OMS Raft commit guarantees the order is durably accepted. The ME Raft commit guarantees the order is durably matched. Between these two boundaries (OMS→ME), the contract is at-least-once with idempotency.

Operational / human errors

The cluster survives node and network faults on its own. It does not survive a wrong clusterMembers string or a kill -9. These are the failures you cause, and the ones a runbook exists to prevent.

Case	Data Loss Risk	Availability Impact	Recovery Complexity	Remediation
Incorrect cluster configuration	None	Total (cannot form cluster)	Low — fix config and restart	Validate `clusterMembers` string, endpoints, and member IDs before deployment; fix config; restart all nodes
Accidental deletion of persistent data	Possible to Total	Depends on scope	Medium to Critical	Single node: wipe and rebuild from peers. All nodes: restore from external backup. Implement filesystem-level protection (snapshots, immutable backups)
Improper shutdown (kill -9 all nodes)	Possible	Total	Medium — check recording integrity	Restart all nodes; Aeron Archive validates recordings on startup; if corrupt, restore from backup. Use graceful shutdown (SIGTERM) in production
Failed rolling upgrade	Silent divergence	Possible	High — rollback may need old snapshots	Stop upgraded nodes; restore pre-upgrade snapshot; restart with old version; fix compatibility before re-attempting
Clock skew between nodes	None	Degraded (false elections)	Low — sync clocks (NTP)	Configure NTP/chrony on all nodes; increase election timeout to tolerate minor skew

Operational safeguards

Config validation — Validate cluster configuration before deployment (member IDs, endpoints, port conflicts).
Graceful shutdown — Always use SIGTERM, never SIGKILL. Graceful shutdown allows the node to complete in-flight operations and close recordings cleanly.
Pre-upgrade snapshots — Always take a snapshot before any upgrade. This gives you a clean rollback point.
Filesystem protection — Use immutable backups or filesystem snapshots for the archive and cluster directories.
Clock sync — NTP/chrony is mandatory. Clock skew doesn’t cause data loss but can trigger unnecessary elections.

Resource exhaustion

These failures share a signature: nothing crashes, but the leader misses a heartbeat and the cluster panics. Resource pressure manifests as false elections, which is why it is so easy to misdiagnose.

Case	Data Loss Risk	Availability Impact	Recovery Complexity	Remediation
JVM OOM	None	Single node down	Low — restart with tuning	Restart node with increased heap (`-Xmx`); profile memory usage; fix memory leaks; implement state eviction in service
File descriptor exhaustion	None	Single node down	Low — increase ulimit, restart	Increase `ulimit -n`; restart node; review for FD leaks in application code
CPU starvation / GC pauses	None	Degraded (false elections)	Medium — tune GC / threading	Switch to low-latency GC (ZGC/Shenandoah); pin Aeron threads to dedicated CPU cores; reduce allocation rate in service
Thread starvation (SHARED mode + slow handler)	None	Degraded (perceived dead)	Medium — optimize handler or use DEDICATED	Optimize `onSessionMessage` handler; offload heavy work; switch to DEDICATED threading mode; increase heartbeat timeout

Resource exhaustion → election cascade

The most common pattern is a feedback loop. Resource pressure causes an election, the election causes reconnect traffic, and that traffic adds more pressure.

Note how this connects back to the client section above: well-behaved client reconnect and idempotency logic keeps the reconnection burst from turning a single GC pause into a sustained cascade.

Prevention strategies

Resource	Monitoring	Prevention
Heap	`-XX:+HeapDumpOnOutOfMemoryError`, JMX metrics	Right-size heap, profile allocation rate, fix leaks
File descriptors	`lsof` count, `/proc/pid/fd`	Set `ulimit -n 65536`+, close resources properly
CPU	`perf`, `async-profiler`, cycle time counters	Pin threads, isolate cores, use appropriate idle strategy
Threads	Thread dumps, cycle time counters	Use DEDICATED mode for production, avoid blocking in handlers

What this means for tuning

Map these failure modes back to your latency targets and the priorities fall out:

p50 lives in onSessionMessage(). A slow or allocating handler doesn’t just raise the median — in SHARED mode it starves the conductor and triggers false elections. Keep the handler tight, and run DEDICATED mode in production.
p99 is wrecked by GC pauses and false elections. A single long stop-the-world pause both spikes p99 and can drop the leader. Low-latency GC (ZGC/Shenandoah), pinned Aeron threads, and isolated cores protect the tail.
Throughput collapses during cascades. An election storm fed by a reconnect burst takes the whole cluster offline for the duration. Idempotent clients with correlation-ID retry and sane backoff are what keep a transient pause from becoming sustained downtime.

The cheapest reliability win is almost always discipline at the boundaries — idempotent retries, graceful shutdown, right-sized limits — not heroics during the incident.