Failure-Mode Runbook: Snapshots and Determinism
The scariest failures do not crash anything. The cluster keeps running, the dashboards stay green, and nodes silently disagree about reality. This page is the runbook for that class of bug — snapshot corruption, snapshot inconsistency, and the determinism violations that quietly poison your state machine.
Two kinds of pain: loud and silent
Failures split cleanly into two buckets. Loud failures hurt availability: a node fails to start, recovery is delayed, and you can see something is wrong. Silent failures are worse: the cluster appears healthy while nodes diverge. The silent ones are marked Critical below because by the time you notice, the divergence may already be baked into every snapshot you have.
Snapshot failures
Snapshots are how nodes rebuild state without replaying the entire log. When a snapshot goes wrong, you either lose availability or, far worse, keep running on corrupted state.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Snapshot corruption | Possible | Delayed recovery | Medium — fallback to older snapshot | Use ClusterTool to inspect recording-log; invalidate corrupt snapshot; node falls back to older snapshot + longer log replay |
| Snapshot inconsistency (app bug in onTakeSnapshot/loadSnapshot) | Silent divergence | None (appears healthy) | Critical — hard to detect | Add snapshot validation checksums; compare state across nodes periodically; fix serialization bug; take new clean snapshot |
| Snapshot taken during inconsistent state | Silent divergence | None (appears healthy) | Critical — fix app logic | Ensure atomic state capture in onTakeSnapshot; fix app logic; invalidate bad snapshot; take new snapshot |
| Missing snapshot with truncated log | Data gap | Delayed recovery | High — node cannot rebuild | Restore snapshot from external backup or from another node; if unavailable, rebuild node from a healthy peer |
| Snapshot version mismatch (rolling upgrade) | Possible | Node fails to start | Medium — migration strategy | Implement snapshot versioning and migration logic; take snapshot before upgrade; ensure backward compatibility |
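Two remediations in the table, validation checksums and snapshot versioning, can share one small envelope around the serialized state. The sketch below is an illustration rather than anything the cluster provides: the `SnapshotEnvelope` class, its header layout (version, length, CRC32), and the method names are all assumptions. The application would call `wrap` when writing a snapshot and `unwrap` when loading one.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical snapshot framing: [version:int][length:int][crc32:long][payload]. */
final class SnapshotEnvelope {
    static final int HEADER_BYTES = Integer.BYTES + Integer.BYTES + Long.BYTES;

    private SnapshotEnvelope() {}

    /** Prefixes the serialized state with a version, its length, and a CRC32 of the payload. */
    static byte[] wrap(int version, byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + payload.length);
        buf.putInt(version).putInt(payload.length).putLong(crc.getValue()).put(payload);
        return buf.array();
    }

    /** Returns the payload, or throws so that a bad snapshot fails loudly at load time. */
    static byte[] unwrap(int maxSupportedVersion, byte[] framed) {
        if (framed.length < HEADER_BYTES) {
            throw new IllegalStateException("snapshot truncated: incomplete header");
        }
        ByteBuffer buf = ByteBuffer.wrap(framed);
        int version = buf.getInt();
        int length = buf.getInt();
        long expectedCrc = buf.getLong();
        if (version < 1 || version > maxSupportedVersion) {
            throw new IllegalStateException("unsupported snapshot version " + version);
        }
        if (length != buf.remaining()) {
            throw new IllegalStateException("snapshot length mismatch");
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, length);
        if (crc.getValue() != expectedCrc) {
            throw new IllegalStateException("snapshot checksum mismatch");
        }
        return payload;
    }
}
```

A checksum like this catches bytes damaged in storage or transfer. It does not catch a serialization bug that writes the wrong state faithfully, which is why the table also calls for comparing state across nodes. A real loader would dispatch on the version to migration code instead of simply rejecting older formats.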
The silent divergence problem
The most dangerous failures here are the ones marked “appears healthy”: the cluster keeps running, but nodes have different state. This only manifests when:
- A failover happens and the new leader has different state.
- A client gets different results from different nodes (if reading from followers).
- A snapshot comparison reveals mismatches.
Notice the trigger conditions. Divergence is invisible until a failover or a follower read drags it into the light — and a failover is exactly the moment you cannot afford a surprise. That delay is the whole problem: the bug ships, runs clean for weeks, then detonates during an incident.
Determinism violations
This is the most insidious category of all: everything appears healthy, but nodes silently diverge. A clustered state machine must produce bit-identical output on every node from the same input. The instant your service logic depends on something that differs between nodes, the snapshots diverge and you are back in silent-corruption territory.
| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
|---|---|---|---|---|
| Non-deterministic app logic (time, random, HashMap, etc.) | Silent divergence | None (appears healthy) | Critical — hard to detect | Audit code: use cluster.time(), seeded Random, ordered collections (TreeMap/LinkedHashMap); add cross-node state hash comparison |
| Floating point non-determinism across JVM/CPU | Silent divergence | None (appears healthy) | Critical | Replace Math with StrictMath; use fixed-point (long cents) instead of double; standardize JVM version across nodes |
| External side effects in service logic (HTTP, DB, file I/O) | Silent divergence | None (appears healthy) | Critical — redesign service | Remove ALL external calls from onSessionMessage; push external data into cluster via ingress messages; side effects only on leader egress |
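The fixed-point remediation in the table can be sketched with plain `long` arithmetic. The class and method names below are hypothetical, and the half-up rounding rule is an assumption chosen for illustration. The point is that the rounding rule is written out explicitly, so every node applies exactly the same integer operations.

```java
/** Hypothetical fixed-point helpers: amounts are whole cents held in a long. */
final class FixedPointMoney {
    private static final long BASIS_POINTS_PER_UNIT = 10_000; // 1 bp = 0.01%

    private FixedPointMoney() {}

    /** Unit price times quantity, failing loudly on overflow instead of wrapping. */
    static long lineTotalCents(long unitPriceCents, long quantity) {
        // Integer-only call: the Math vs StrictMath distinction concerns floating point.
        return Math.multiplyExact(unitPriceCents, quantity);
    }

    /** Fee for a rate in basis points, rounded half up. Inputs must be non-negative. */
    static long feeCents(long amountCents, long rateBasisPoints) {
        if (amountCents < 0 || rateBasisPoints < 0) {
            throw new IllegalArgumentException("negative input");
        }
        long scaled = Math.multiplyExact(amountCents, rateBasisPoints);
        return Math.addExact(scaled, BASIS_POINTS_PER_UNIT / 2) / BASIS_POINTS_PER_UNIT;
    }
}
```

For example, three units at 19.99 give `lineTotalCents(1_999, 3) == 5_997`, and a 25 bp fee on that total is `feeCents(5_997, 25) == 15`, because 14.9925 cents rounds up to 15.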
Common determinism traps
```java
// ❌ WRONG: non-deterministic
long now = System.currentTimeMillis();          // Different on each node
double result = Math.sin(x);                    // May differ across JVMs/CPUs
Map<String, Order> orders = new HashMap<>();    // Iteration order undefined
UUID id = UUID.randomUUID();                    // Different on each node
httpClient.get("https://api.price.com");        // External call in service
```

```java
// ✅ CORRECT: deterministic
long now = cluster.time();                      // Cluster-provided time
double result = StrictMath.sin(x);              // Guaranteed identical
Map<String, Order> orders = new TreeMap<>();    // Deterministic iteration
// Use cluster-provided sequence numbers instead of UUID
// Push external data into cluster via ingress, not pull from service
```

The rule of thumb: time, randomness, iteration order, math, and the outside world are all forbidden inside service logic. Pull none of them; have the cluster hand them to you instead.
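The deterministic variants above can be combined into a small state machine that is easy to replay. This is a minimal sketch with assumptions labeled: `DeterministicOrders`, its handler names, and the sample log are hypothetical, and the `clusterTimeMs` parameter stands in for whatever timestamp the cluster hands to the service. Prices arrive as messages rather than being fetched, ids come from a counter held in state, and all collections are ordered.

```java
import java.util.TreeMap;

/** Hypothetical service state whose every input arrives through the replicated log. */
final class DeterministicOrders {
    private final TreeMap<String, Long> priceCentsBySymbol = new TreeMap<>();
    private final TreeMap<Long, String> ordersById = new TreeMap<>();
    private long nextOrderId = 1; // replicated state: must be written to snapshots too

    /** External price data is pushed in as a message; the service never fetches it. */
    void onPriceUpdate(String symbol, long priceCents) {
        priceCentsBySymbol.put(symbol, priceCents);
    }

    /** Returns the new order id, or -1 if no price has been ingested for the symbol. */
    long onPlaceOrder(long clusterTimeMs, String symbol, long quantity) {
        Long priceCents = priceCentsBySymbol.get(symbol);
        if (priceCents == null) {
            return -1; // deterministic rejection: every node sees the same missing price
        }
        long orderId = nextOrderId++; // sequence number instead of UUID.randomUUID()
        ordersById.put(orderId, symbol + " x" + quantity + " @" + priceCents + " t=" + clusterTimeMs);
        return orderId;
    }

    /** Ordered maps make this rendering identical on every node with the same state. */
    String describe() {
        return priceCentsBySymbol + " " + ordersById + " next=" + nextOrderId;
    }

    /** Applies one fixed sample log to a fresh instance, as a restarting node would. */
    static DeterministicOrders replaySampleLog() {
        DeterministicOrders node = new DeterministicOrders();
        node.onPlaceOrder(1_000L, "ACME", 1); // rejected: no price yet
        node.onPriceUpdate("ACME", 1_999L);
        node.onPriceUpdate("ZETA", 250L);
        node.onPlaceOrder(1_005L, "ACME", 3);
        node.onPlaceOrder(1_009L, "ZETA", 10);
        return node;
    }
}
```

Calling `replaySampleLog()` twice and comparing `describe()` on the results gives equal strings, which is the property a replay test should assert for real service logic.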
Detection and recovery playbook
You cannot rely on noticing divergence by accident. Build detection in, because the failure mode is silence.
- Validate on load. Append a checksum/hash to every snapshot and verify it in loadSnapshot. A failed check turns a silent corruption into a loud, recoverable one.
- Compare across nodes. Periodically hash the in-memory state on every node at the same log position and compare. Catch divergence in hours, not during a failover.
- Inspect with ClusterTool. For snapshot corruption, use ClusterTool to inspect the recording-log, invalidate the corrupt snapshot, and let the node fall back to an older snapshot plus longer log replay.
- Rebuild from a healthy peer. When a snapshot is missing and the log is truncated, restore from external backup or from another node; if neither is available, rebuild the node from a healthy peer.
- Version before you upgrade. Take a snapshot before a rolling upgrade, implement snapshot versioning and migration logic, and ensure backward compatibility so a node never fails to start on a version mismatch.
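The cross-node comparison in the playbook above needs a hash that depends only on the state, not on insertion order or memory layout. The sketch below is one way to do that under stated assumptions: `StateHash` is a hypothetical helper, the state is simplified to a sorted map of non-null account balances, and each node is assumed to compute its digest at the same log position, since digests taken at different positions differ legitimately.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.SortedMap;

/** Hypothetical helper: order-stable SHA-256 digest of replicated state. */
final class StateHash {
    private StateHash() {}

    /** Hashes entries in key order, length-prefixing keys so boundaries are unambiguous. */
    static String digest(SortedMap<String, Long> state) {
        MessageDigest sha256;
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
        for (Map.Entry<String, Long> entry : state.entrySet()) {
            byte[] key = entry.getKey().getBytes(StandardCharsets.UTF_8);
            sha256.update(ByteBuffer.allocate(Integer.BYTES).putInt(key.length).array());
            sha256.update(key);
            sha256.update(ByteBuffer.allocate(Long.BYTES).putLong(entry.getValue()).array());
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True when every node reported the same digest. */
    static boolean allAgree(Collection<String> digestsFromNodes) {
        return new HashSet<>(digestsFromNodes).size() <= 1;
    }
}
```

How digests are collected from each node is left open here. A disagreement is the signal to stop and investigate before the next failover, not something to resolve automatically.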
How these failures move your numbers
These bugs do not show up as a latency regression, and that is what makes them dangerous. They cost you in two places instead.
- Recovery time, not steady-state p99. Snapshot corruption forces a fallback to an older snapshot plus a longer log replay. That replay is what stretches your recovery window when a node restarts, even though steady-state p50/p99 looked perfect right up until the incident.
- Correctness, not latency. Determinism violations and silent snapshot inconsistencies do not touch p50, p99, or throughput at all while the cluster runs. They are pure correctness failures. The cost lands as wrong state surfacing at failover, the single worst moment to discover it.
In short: the loud snapshot failures tax your recovery time; the silent ones tax your trust. Build the checksums and cross-node hash comparisons now, so the silent failures become loud ones you can actually fix.