Failure-Mode Runbook: Snapshots and Determinism

The scariest failures do not crash anything. The cluster keeps running, the dashboards stay green, and nodes silently disagree about reality. This page is the runbook for that class of bug — snapshot corruption, snapshot inconsistency, and the determinism violations that quietly poison your state machine.

Two kinds of pain: loud and silent

Failures split cleanly into two buckets. Loud failures hurt availability — a node fails to start, recovery is delayed, you can see something is wrong. Silent failures are worse: the cluster appears healthy while nodes diverge. The silent ones are marked Critical below because by the time you notice, the divergence may already be baked into every snapshot you have.

Snapshot failures

Snapshots are how nodes rebuild state without replaying the entire log. When a snapshot goes wrong, you either lose availability or — far worse — keep running on corrupted state.

Case	Data Loss Risk	Availability Impact	Recovery Complexity	Remediation
Snapshot corruption	Possible	Delayed recovery	Medium — fallback to older snapshot	Use ClusterTool to inspect recording-log; invalidate corrupt snapshot; node falls back to older snapshot + longer log replay
Snapshot inconsistency (app bug in onTakeSnapshot/loadSnapshot)	Silent divergence	None (appears healthy)	Critical — hard to detect	Add snapshot validation checksums; compare state across nodes periodically; fix serialization bug; take new clean snapshot
Snapshot taken during inconsistent state	Silent divergence	None (appears healthy)	Critical — fix app logic	Ensure atomic state capture in onTakeSnapshot; fix app logic; invalidate bad snapshot; take new snapshot
Missing snapshot with truncated log	Data gap	Delayed recovery	High — node cannot rebuild	Restore snapshot from external backup or from another node; if unavailable, rebuild node from a healthy peer
Snapshot version mismatch (rolling upgrade)	Possible	Node fails to start	Medium — migration strategy	Implement snapshot versioning and migration logic; take snapshot before upgrade; ensure backward compatibility
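
Two of the remediations in the table, validation checksums and snapshot versioning, can share one small envelope around the serialized state. The following is a minimal sketch under stated assumptions, not a cluster API: SnapshotFrame, its byte layout, and CURRENT_VERSION are hypothetical names invented for illustration, and CRC32 stands in for whatever checksum you prefer.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical snapshot envelope: [version:int][length:int][payload][crc32:long]. */
final class SnapshotFrame {
    static final int CURRENT_VERSION = 2; // assumed version number, for illustration only

    /** Wraps serialized state with a version header and a trailing CRC32 over header + payload. */
    static byte[] encode(int version, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * 2 + payload.length + Long.BYTES);
        buf.putInt(version).putInt(payload.length).put(payload);
        buf.putLong(crcOf(buf.array(), buf.position()));
        return buf.array();
    }

    /** Verifies checksum, version, and length, then returns the payload; throws on any mismatch. */
    static byte[] decode(byte[] framed) {
        int headerAndTrailer = Integer.BYTES * 2 + Long.BYTES;
        if (framed.length < headerAndTrailer) {
            throw new IllegalStateException("snapshot truncated");
        }
        ByteBuffer buf = ByteBuffer.wrap(framed);
        int crcOffset = framed.length - Long.BYTES;
        if (buf.getLong(crcOffset) != crcOf(framed, crcOffset)) {
            throw new IllegalStateException("snapshot checksum mismatch");
        }
        int version = buf.getInt();
        int length = buf.getInt();
        if (version > CURRENT_VERSION) {
            throw new IllegalStateException("snapshot written by newer version " + version);
        }
        if (length != framed.length - headerAndTrailer) {
            throw new IllegalStateException("snapshot length field is inconsistent");
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        return payload; // a real loader would run migration here when version < CURRENT_VERSION
    }

    /** CRC32 over the first 'length' bytes of the array. */
    private static long crcOf(byte[] bytes, int length) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, length);
        return crc.getValue();
    }
}
```

The design choice is to fail on load rather than continue: a thrown exception is a loud failure the node can recover from by falling back to an older snapshot, whereas a silently accepted bad payload becomes divergent state.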

The silent divergence problem

The most dangerous failures here are the ones marked “appears healthy”: the cluster keeps running, but nodes hold different state. The divergence surfaces only when one of the following happens:

A failover happens and the new leader has different state.
A client gets different results from different nodes (if reading from followers).
A snapshot comparison reveals mismatches.

Notice the trigger conditions. Divergence is invisible until a failover or a follower read drags it into the light — and a failover is exactly the moment you cannot afford a surprise. That delay is the whole problem: the bug ships, runs clean for weeks, then detonates during an incident.

Determinism violations

This is the most insidious category of all: everything appears healthy, but nodes silently diverge. A clustered state machine must produce bit-identical state and output on every node from the same input log. The instant your service logic depends on something that differs between nodes, the in-memory state diverges, every later snapshot captures that divergence, and you are back in silent-corruption territory.

Case	Data Loss Risk	Availability Impact	Recovery Complexity	Remediation
Non-deterministic app logic (time, random, HashMap, etc.)	Silent divergence	None (appears healthy)	Critical — hard to detect	Audit code: use `cluster.time()`, seeded Random, ordered collections (TreeMap/LinkedHashMap); add cross-node state hash comparison
Floating point non-determinism across JVM/CPU	Silent divergence	None (appears healthy)	Critical	Replace `Math` with `StrictMath`; use fixed-point (long cents) instead of double; standardize JVM version across nodes
External side effects in service logic (HTTP, DB, file I/O)	Silent divergence	None (appears healthy)	Critical — redesign service	Remove ALL external calls from `onSessionMessage`; push external data into cluster via ingress messages; side effects only on leader egress
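
The fixed-point advice in the table can be illustrated with a related hazard that needs no second JVM to reproduce: double addition is not associative, so a total depends on the order in which values are summed. This is a minimal sketch with a hypothetical class name, FixedPointTotals. It demonstrates order sensitivity only; cross-JVM differences in Math results are a separate issue that it does not reproduce.

```java
import java.util.List;

/** Illustrative only: double totals depend on summation order, long cents do not. */
final class FixedPointTotals {
    /** Sums prices as doubles in the order given; rounding makes the result order-dependent. */
    static double sumDoubles(List<Double> prices) {
        double total = 0.0;
        for (double price : prices) {
            total += price;
        }
        return total;
    }

    /** Sums prices held as whole cents; integer addition is exact, barring overflow. */
    static long sumCents(List<Long> cents) {
        long total = 0L;
        for (long c : cents) {
            total += c;
        }
        return total;
    }
}
```

Summing 0.1, 0.2, 0.3 as doubles gives a different result from summing 0.3, 0.2, 0.1, while summing 10, 20, 30 cents gives 60 in either order. If two nodes ever iterate the same values in a different order, the double total diverges and the long total does not.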

Common determinism traps

// ❌ WRONG — non-deterministic
long now = System.currentTimeMillis();        // Different on each node
double result = Math.sin(x);                   // May differ across JVMs/CPUs
Map<String, Order> orders = new HashMap<>();   // Iteration order undefined
UUID id = UUID.randomUUID();                   // Different on each node
httpClient.get("https://api.price.com");       // External call in service

// ✅ CORRECT — deterministic
long now = cluster.time();                     // Cluster-provided time
double result = StrictMath.sin(x);             // Guaranteed identical
Map<String, Order> orders = new TreeMap<>();   // Deterministic iteration
// IDs: derive from replicated state (e.g. a sequence counter kept in the state machine), not UUID.randomUUID()
// External data: fetch it outside the cluster and send it in as an ingress message; never pull from the service

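The rule of thumb: wall-clock time, unseeded randomness, unordered iteration, platform-dependent floating-point math, and calls to the outside world are all forbidden inside service logic. Pull none of them yourself. Take time from the cluster, seed randomness from replicated data, iterate ordered collections, and receive external data as ingress messages.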
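
To make the rule concrete, here is a minimal sketch of a state machine that takes every input from its caller, which stands in for the replicated log. DeterministicBook and all of its members are hypothetical names; in a real service the timestamp would come from the cluster and the seed from replicated data, which is assumed here rather than shown.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical order book whose every input arrives through the replicated log. */
final class DeterministicBook {
    private final Map<String, Long> orderTimestamps = new TreeMap<>(); // sorted keys, stable iteration
    private final Random random;   // seeded from replicated data, never from the local clock
    private long nextOrderId = 1;  // replicated counter in place of UUID.randomUUID()

    DeterministicBook(long seed) {
        this.random = new Random(seed);
    }

    /** Applies one log entry; timestampMs is supplied by the caller, not System.currentTimeMillis(). */
    long onOrder(String symbol, long timestampMs) {
        long orderId = nextOrderId++;
        orderTimestamps.put(symbol + "#" + orderId, timestampMs);
        return orderId;
    }

    /** Same seed and same call sequence give the same value on every node. */
    int nextJitter() {
        return random.nextInt(100);
    }

    /** Folds the state in key order into a fingerprint that can be compared across nodes. */
    long stateHash() {
        long h = 17;
        for (Map.Entry<String, Long> e : orderTimestamps.entrySet()) {
            h = 31 * h + e.getKey().hashCode();
            h = 31 * h + Long.hashCode(e.getValue());
        }
        return 31 * h + nextOrderId;
    }
}
```

Two instances built with the same seed and fed the same sequence of onOrder calls end with the same stateHash() and the same next nextJitter() value, which is the property each node in the cluster needs to hold.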

Detection and recovery playbook

You cannot rely on noticing divergence by accident. Build detection in, because the failure mode is silence.

Validate on load. Append a checksum/hash to every snapshot and verify it in loadSnapshot. A failed check turns a silent corruption into a loud, recoverable one.
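Compare across nodes. Periodically hash the in-memory state on every node at the same log position and compare the hashes. Followers lag the leader, so hashes taken at different positions will differ even on a healthy cluster. Catch divergence in hours, not during a failover.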
Inspect with ClusterTool. For snapshot corruption, use ClusterTool to inspect the recording-log, invalidate the corrupt snapshot, and let the node fall back to an older snapshot plus longer log replay.
Rebuild from a healthy peer. When a snapshot is missing and the log is truncated, restore from external backup or from another node; if neither is available, rebuild the node from a healthy peer.
Version before you upgrade. Take a snapshot before a rolling upgrade, implement snapshot versioning and migration logic, and ensure backward compatibility so a node never fails to start on a version mismatch.
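
The cross-node comparison step can be sketched as a small monitor-side check. This is an illustration under assumptions, not part of any cluster tooling: DivergenceCheck is a hypothetical name, and the input map is assumed to hold one state hash per node id, all taken at the same log position.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical monitor-side check: flag nodes whose state hash disagrees with the most common one. */
final class DivergenceCheck {
    /** Returns the ids of nodes whose hash differs from the most frequently reported hash. */
    static Set<Integer> divergentNodes(Map<Integer, Long> hashByNode) {
        // Count how many nodes report each hash.
        Map<Long, Integer> votes = new TreeMap<>();
        for (long hash : hashByNode.values()) {
            votes.merge(hash, 1, Integer::sum);
        }
        // Pick the hash with the most votes; ties resolve to the smallest hash value.
        long majorityHash = 0L;
        int bestCount = -1;
        for (Map.Entry<Long, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                majorityHash = e.getKey();
            }
        }
        // Every node that disagrees with the majority is reported.
        Set<Integer> divergent = new TreeSet<>();
        for (Map.Entry<Integer, Long> e : hashByNode.entrySet()) {
            if (e.getValue() != majorityHash) {
                divergent.add(e.getKey());
            }
        }
        return divergent;
    }
}
```

A non-empty result is the alert. Note that the majority is only a heuristic for which node to suspect: with a determinism bug, the majority may be the side that is wrong, so treat any disagreement as an incident rather than trusting the vote.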

How these failures move your numbers

These bugs do not show up as a latency regression — that is what makes them dangerous. They cost you in two places instead.

Recovery time, not steady-state latency: Snapshot corruption forces a fallback to an older snapshot plus a longer log replay. That replay is what stretches your recovery window when a node restarts, even though steady-state p50/p99 looked perfect right up until the incident.
Correctness, not latency: Determinism violations and silent snapshot inconsistencies do not touch p50, p99, or throughput at all while the cluster runs. They are pure correctness failures. The cost lands as wrong state surfacing at failover, the single worst moment to discover it.

In short: the loud snapshot failures tax your recovery time; the silent ones tax your trust. Build the checksums and cross-node hash comparisons now, so the silent failures become loud ones you can actually fix.