Rolling Upgrade of Cluster Member Nodes
Outline
- Why rolling, and what “safe” means — quorum math during the upgrade window: a 3-node cluster upgrading one node at a time is running with zero redundancy while each member is down; a 5-node cluster keeps tolerance for one additional failure throughout.
- Pre-flight checks — cluster healthy (all members ACTIVE, no snapshot in progress), recent snapshot taken, archive disk headroom, version compatibility verified (wire protocol + serialized state across old/new code).
- Follower-first ordering — upgrade followers one at a time; the leader goes last, after a controlled leadership handoff rather than a timeout-driven election.
- Per-node procedure — drain, stop, upgrade, restart, wait for catch-up (the member replays the log/snapshot and rejoins; verify with cluster counters before touching the next node).
- Leader handoff — stepping the leader down gracefully vs. letting election timeout fire, and what clients observe during the transition (ingress redirect, session reconnect).
- Application/state-machine versioning — upgrading business logic on a replicated deterministic state machine: the new code must replay the old code’s log entries identically; feature-flagging logic changes on a log position.
- Abort criteria & rollback — what to check before each step, and how to back out if a node fails to rejoin.
- Verification — counters and checks at each step (commit position advancing, member states, NAK/error counters quiet), and the post-upgrade smoke test.
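
The quorum arithmetic and the follower-first ordering from the outline can be sketched in a few lines. This is a minimal illustration, assuming a simple-majority quorum; the function names (`quorum`, `spare_failures`, `upgrade_order`) and the integer member ids are hypothetical and not part of any cluster API.

```python
def quorum(cluster_size: int) -> int:
    """Smallest majority of a cluster with `cluster_size` voting members."""
    return cluster_size // 2 + 1


def spare_failures(cluster_size: int, members_down: int) -> int:
    """Additional unplanned failures survivable while `members_down` are offline.

    A negative result means quorum is already lost.
    """
    return (cluster_size - members_down) - quorum(cluster_size)


def upgrade_order(member_ids: list[int], leader_id: int) -> list[int]:
    """Return followers first (in the order given), with the leader last."""
    if leader_id not in member_ids:
        raise ValueError("leader_id must be one of member_ids")
    followers = [m for m in member_ids if m != leader_id]
    return followers + [leader_id]


if __name__ == "__main__":
    # One member is down at a time during a rolling upgrade.
    for size in (3, 5):
        print(size, quorum(size), spare_failures(size, members_down=1))
    # Hypothetical 3-node cluster where member 1 currently leads.
    print(upgrade_order([0, 1, 2], leader_id=1))
```

With one member down, the 3-node case yields zero spare failures and the 5-node case yields one, which matches the figures in the first bullet. The ordering helper only decides the sequence; the leader should still be handed off in a controlled way before it is stopped.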