
Failure-Mode Runbook: Snapshots and Determinism

The scariest failures do not crash anything. The cluster keeps running, the dashboards stay green, and nodes silently disagree about reality. This page is the runbook for that class of bug — snapshot corruption, snapshot inconsistency, and the determinism violations that quietly poison your state machine.

Failures split cleanly into two buckets. Loud failures hurt availability — a node fails to start, recovery is delayed, you can see something is wrong. Silent failures are worse: the cluster appears healthy while nodes diverge. The silent ones are marked Critical below because by the time you notice, the divergence may already be baked into every snapshot you have.

Snapshots are how nodes rebuild state without replaying the entire log. When a snapshot goes wrong, you either lose availability or — far worse — keep running on corrupted state.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
| --- | --- | --- | --- | --- |
| Snapshot corruption | Possible | Delayed recovery | Medium: fall back to older snapshot | Use ClusterTool to inspect recording-log; invalidate corrupt snapshot; node falls back to older snapshot + longer log replay |
| Snapshot inconsistency (app bug in onTakeSnapshot/loadSnapshot) | Silent divergence | None (appears healthy) | Critical: hard to detect | Add snapshot validation checksums; compare state across nodes periodically; fix serialization bug; take new clean snapshot |
| Snapshot taken during inconsistent state | Silent divergence | None (appears healthy) | Critical: fix app logic | Ensure atomic state capture in onTakeSnapshot; fix app logic; invalidate bad snapshot; take new snapshot |
| Missing snapshot with truncated log | Data gap | Delayed recovery | High: node cannot rebuild | Restore snapshot from external backup or from another node; if unavailable, rebuild node from a healthy peer |
| Snapshot version mismatch (rolling upgrade) | Possible | Node fails to start | Medium: migration strategy | Implement snapshot versioning and migration logic; take snapshot before upgrade; ensure backward compatibility |

The most dangerous failures here are the ones marked “appears healthy”: the cluster keeps running, but nodes hold different state. The divergence only surfaces when:

  • A failover happens and the new leader has different state.
  • A client gets different results from different nodes (if reading from followers).
  • A snapshot comparison reveals mismatches.

Notice the trigger conditions. Divergence is invisible until a failover or a follower read drags it into the light — and a failover is exactly the moment you cannot afford a surprise. That delay is the whole problem: the bug ships, runs clean for weeks, then detonates during an incident.

Determinism violations are the most insidious category of all: everything appears healthy but nodes silently diverge. A clustered state machine must produce bit-identical output on every node from the same input. The instant your service logic depends on something that differs between nodes, the snapshots diverge and you are back in silent-corruption territory.

| Case | Data Loss Risk | Availability Impact | Recovery Complexity | Remediation |
| --- | --- | --- | --- | --- |
| Non-deterministic app logic (time, random, HashMap, etc.) | Silent divergence | None (appears healthy) | Critical: hard to detect | Audit code: use cluster.time(), seeded Random, ordered collections (TreeMap/LinkedHashMap); add cross-node state hash comparison |
| Floating point non-determinism across JVM/CPU | Silent divergence | None (appears healthy) | Critical | Replace Math with StrictMath; use fixed-point (long cents) instead of double; standardize JVM version across nodes |
| External side effects in service logic (HTTP, DB, file I/O) | Silent divergence | None (appears healthy) | Critical: redesign service | Remove ALL external calls from onSessionMessage; push external data into cluster via ingress messages; side effects only on leader egress |


```java
// ❌ WRONG — non-deterministic
long now = System.currentTimeMillis();        // Different on each node
double result = Math.sin(x);                  // May differ across JVMs/CPUs
Map<String, Order> orders = new HashMap<>();  // Iteration order undefined
UUID id = UUID.randomUUID();                  // Different on each node
httpClient.get("https://api.price.com");      // External call in service

// ✅ CORRECT — deterministic
long now = cluster.time();                    // Cluster-provided time
double result = StrictMath.sin(x);            // Guaranteed identical
Map<String, Order> orders = new TreeMap<>();  // Deterministic iteration
// Use cluster-provided sequence numbers instead of UUID
// Push external data into cluster via ingress, not pull from service
```

The rule of thumb: time, randomness, iteration order, math, and the outside world are all forbidden inside service logic. Pull none of them; have the cluster hand them to you instead.
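To make that rule concrete, here is a minimal sketch of service state that takes everything from its inputs. The class and method names (`DeterministicOrderBook`, `placeOrder`, `feeCents`, `nextJitter`) and the `replicatedSeed` parameter are hypothetical, invented for illustration. The sketch assumes the caller passes in the value of `cluster.time()` and a seed that itself arrived through the replicated log.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical service state: every value it depends on arrives as an argument
// that was carried through the replicated log.
final class DeterministicOrderBook {
    private final Map<Long, Long> amountCentsByOrderId = new TreeMap<>(); // key-ordered iteration
    private final Random random;   // seeded from replicated input, never from the clock
    private long nextOrderId = 1;  // sequence counter instead of UUID.randomUUID()
    private long lastUpdateMs;

    DeterministicOrderBook(long replicatedSeed) {
        this.random = new Random(replicatedSeed);
    }

    // clusterTimeMs is assumed to be the value of cluster.time(), passed in by the caller.
    long placeOrder(long clusterTimeMs, long amountCents) {
        long orderId = nextOrderId++;
        amountCentsByOrderId.put(orderId, amountCents);
        lastUpdateMs = clusterTimeMs;
        return orderId;
    }

    // Fixed-point fee in cents: integer basis points, truncated toward zero, no double involved.
    static long feeCents(long amountCents, long feeBasisPoints) {
        return Math.multiplyExact(amountCents, feeBasisPoints) / 10_000;
    }

    // Same seed and same call sequence give the same values on every node.
    int nextJitter(int bound) {
        return random.nextInt(bound);
    }

    // Sums in key order; the result does not depend on insertion history.
    long totalCents() {
        long total = 0;
        for (long cents : amountCentsByOrderId.values()) {
            total = Math.addExact(total, cents);
        }
        return total;
    }

    long lastUpdateMs() {
        return lastUpdateMs;
    }
}
```

One caveat with the seeded generator: `java.util.Random` does not expose its internal position, so a node restored from a snapshot would need some other way to resume the same sequence, for example reseeding from a value stored in the snapshot.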

You cannot rely on noticing divergence by accident. Build detection in, because the failure mode is silence.

  • Validate on load. Append a checksum/hash to every snapshot and verify it in loadSnapshot. A failed check turns a silent corruption into a loud, recoverable one.
  • Compare across nodes. Periodically hash the in-memory state on every node and compare. Catch divergence in hours, not during a failover.
  • Inspect with ClusterTool. For snapshot corruption, use ClusterTool to inspect the recording-log, invalidate the corrupt snapshot, and let the node fall back to an older snapshot plus longer log replay.
  • Rebuild from a healthy peer. When a snapshot is missing and the log is truncated, restore from external backup or from another node; if neither is available, rebuild the node from a healthy peer.
  • Version before you upgrade. Take a snapshot before a rolling upgrade, implement snapshot versioning and migration logic, and ensure backward compatibility so a node never fails to start on a version mismatch.
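The "validate on load" and "version before you upgrade" items can be sketched together as a small envelope around the serialized state. This is an illustration under stated assumptions, not Aeron's own snapshot format: `SnapshotEnvelope`, its byte layout, and `CURRENT_VERSION` are hypothetical, and the sketch stores the checksum in a header rather than appending it. How the framed bytes are written in onTakeSnapshot and read back in loadSnapshot is left to the application.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical envelope: [int version][int payloadLength][long crc32][payload bytes].
final class SnapshotEnvelope {
    static final int CURRENT_VERSION = 2; // assumed schema version, for illustration only
    private static final int HEADER_BYTES = Integer.BYTES + Integer.BYTES + Long.BYTES;

    // Frames the serialized state with a version, a length, and a CRC32 of the payload.
    static byte[] wrap(int version, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + payload.length);
        buf.putInt(version).putInt(payload.length).putLong(crcOf(payload)).put(payload);
        return buf.array();
    }

    // Verifies the frame and returns the payload, or throws so the failure is loud.
    static byte[] unwrap(byte[] framed) {
        if (framed.length < HEADER_BYTES) {
            throw new IllegalStateException("snapshot truncated: header missing");
        }
        ByteBuffer buf = ByteBuffer.wrap(framed);
        int version = buf.getInt();
        int length = buf.getInt();
        long expectedCrc = buf.getLong();
        if (version < 1 || version > CURRENT_VERSION) {
            throw new IllegalStateException("unsupported snapshot version " + version);
        }
        if (length != buf.remaining()) {
            throw new IllegalStateException("snapshot length mismatch");
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        if (crcOf(payload) != expectedCrc) {
            throw new IllegalStateException("snapshot checksum mismatch");
        }
        return payload; // a real loader would migrate older versions at this point
    }

    // The checksum covers the payload only; header fields are range-checked instead.
    private static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }
}
```

A CRC32 catches accidental corruption such as truncation or flipped bits. It does not catch a serialization bug that writes the wrong state consistently, which is why the cross-node comparison is still needed.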

These bugs do not show up as a latency regression — that is what makes them dangerous. They cost you in two places instead.

  • Recovery time, not steady-state p99: Snapshot corruption forces a fallback to an older snapshot plus a longer log replay. That replay is what stretches your recovery window when a node restarts, even though steady-state p50/p99 looked perfect right up until the incident.
  • Correctness, not latency: Determinism violations and silent snapshot inconsistencies do not touch p50, p99, or throughput at all while the cluster runs. They are pure correctness failures. The cost lands as wrong state surfacing at failover, the single worst moment to discover it.

In short: the loud snapshot failures tax your recovery time; the silent ones tax your trust. Build the checksums and cross-node hash comparisons now, so the silent failures become loud ones you can actually fix.
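As a starting point for the cross-node comparison, the following is a minimal sketch of a state hash. `StateHasher` and the account-to-balance map are hypothetical stand-ins for real service state. The sketch assumes the state lives in a sorted map and that each node computes the hash at the same log position, for example in response to a message that went through the log.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical helper: digests state in key order so equal state gives an equal hash.
final class StateHasher {
    static String hashState(SortedMap<Long, Long> balanceCentsByAccount) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            ByteBuffer entry = ByteBuffer.allocate(2 * Long.BYTES); // big-endian by default
            for (Map.Entry<Long, Long> e : balanceCentsByAccount.entrySet()) {
                entry.clear();
                entry.putLong(e.getKey()).putLong(e.getValue());
                digest.update(entry.array());
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("SHA-256 unavailable", ex);
        }
    }
}
```

Comparing hashes taken at different log positions would report divergence that is not there, so the position should travel with the hash when nodes publish it for comparison.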