How we broke four validators by updating one

This is an incident report. The chain was down for about a day, and it was entirely self-inflicted.

What happened

We updated the explorer and RPC server on the primary validator to add governance views. Standard deployment — SSH in, upload new source, build, restart. The explorer looked great. Governance proposals rendered. Copy-to-clipboard icons worked. Ship it.

What we didn't do: update the other three validators.

The divergence

The primary validator started producing blocks with the new code. The other three validators — still running old code — tried to sync those blocks. At block 83,623, the primary processed a transaction that produced different state under the new code. The state root in block 83,623's header didn't match what the peers computed locally.

The peers rejected the block. Then the next one. Then the next one. They got stuck in a loop: fetch block → apply → state root mismatch → retry. For over 200,000 blocks.
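
To make the rejection concrete, here is a toy model of the check a peer performs. It is a sketch under assumptions: the type names, the FNV-1a hash standing in for a real state root, and the one-unit difference between oldApply and newApply are all invented for illustration. The chain's actual block format and the transaction that diverged are not shown in this report.

```typescript
// Toy model of the check each peer runs. All names and shapes here are
// hypothetical; the chain's real block and state types are not shown.
type State = Map<string, number>;
interface Tx { to: string; amount: number; }
interface Block { height: number; txs: Tx[]; stateRoot: string; }
type ApplyTx = (state: State, tx: Tx) => void;

// Stand-in for a real state root: FNV-1a hash over the sorted entries.
function stateRoot(state: State): string {
  const entries = Array.from(state.entries()).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  let hash = 0x811c9dc5;
  for (const [key, value] of entries) {
    const s = `${key}=${value};`;
    for (let i = 0; i < s.length; i++) {
      hash ^= s.charCodeAt(i);
      hash = Math.imul(hash, 0x01000193) >>> 0;
    }
  }
  return hash.toString(16);
}

// Apply every transaction to a copy of the state and return the result.
function execute(prev: State, txs: Tx[], applyTx: ApplyTx): State {
  const next = new Map(prev);
  for (const tx of txs) applyTx(next, tx);
  return next;
}

// A peer accepts a block only if its own execution reproduces the header's root.
function acceptsBlock(prev: State, block: Block, applyTx: ApplyTx): boolean {
  return stateRoot(execute(prev, block.txs, applyTx)) === block.stateRoot;
}

// Two code versions that disagree on one rule (an invented difference).
const oldApply: ApplyTx = (s, tx) => { s.set(tx.to, (s.get(tx.to) ?? 0) + tx.amount); };
const newApply: ApplyTx = (s, tx) => { s.set(tx.to, (s.get(tx.to) ?? 0) + tx.amount - 1); };

const genesis: State = new Map([["alice", 10]]);
const txs: Tx[] = [{ to: "alice", amount: 5 }];
// The primary builds the block header with the new code.
const block: Block = { height: 83623, txs, stateRoot: stateRoot(execute(genesis, txs, newApply)) };

console.log(acceptsBlock(genesis, block, newApply)); // true
console.log(acceptsBlock(genesis, block, oldApply)); // false
```

Retrying cannot help in this situation: the peer's execution is deterministic, so it computes the same mismatching root every time until its code changes.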

The primary kept producing solo (the liveness-aware committee shrinks to just the live proposers). By the time we noticed, the primary was at block 290,871 and the peers were stuck at 83,622.

The cascade

Fixing this turned out to be harder than expected:

Wipe and re-sync? The peers would need to replay 290K blocks from genesis. At ~30 blocks per minute, that's roughly 160 hours, close to a week. Not viable.
Copy the data directory? The primary's state database was only 33MB. We copied it — but the peers had old code, so they couldn't validate new blocks on top of the copied state. Same mismatch.
Sync code + copy data? Getting closer. But pnpm's symlink-based node_modules didn't survive the copy. Missing dependencies, wrong versions.
Fresh genesis? The nuclear option. Wipe everything, start at block 0. We did this three times before getting it right — each attempt hit a different issue: permission errors, rate-limited faucet, mempool spam, bond contract API mismatch.
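
The replay estimate in the first option is plain arithmetic on the numbers above. The helper name replayHours is ours, for illustration only:

```typescript
// Hours needed to replay a number of blocks at a given sync rate.
function replayHours(blocks: number, blocksPerMinute: number): number {
  return blocks / blocksPerMinute / 60;
}

// 290,871 blocks at ~30 blocks per minute.
const hours = replayHours(290871, 30);
console.log(hours.toFixed(1), "hours,", (hours / 24).toFixed(1), "days"); // 161.6 hours, 6.7 days
```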

The mempool bug

During the recovery, we discovered a critical bug: the mempool had no deduplication and no eviction. When peer nodes received faucet transactions and forwarded them to each other, duplicates accumulated. The block builder tried to include them and they failed with a nonce mismatch, but nothing ever removed them from the mempool. The chain produced empty blocks while thousands of invalid transactions sat in the queue.

We shipped a fix: the mempool now deduplicates by transaction hash, has a 10K cap, and evicts stale transactions (nonce below the sender's current on-chain nonce) after every block.
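
A minimal sketch of those three rules follows. It is not the shipped code: the PoolTx shape, the class and method names, and the choice to reject new transactions when the pool is full are assumptions made for illustration.

```typescript
// Hypothetical transaction shape; field names are assumptions for this sketch.
interface PoolTx { hash: string; sender: string; nonce: number; }

class Mempool {
  // Keyed by transaction hash; a Map keeps insertion order, so iteration is FIFO.
  private readonly txs = new Map<string, PoolTx>();
  private readonly maxSize: number;

  constructor(maxSize: number = 10000) {
    this.maxSize = maxSize;
  }

  // Returns false when the transaction is a duplicate or the pool is full.
  // Rejecting at the cap is an assumption; dropping the oldest entry is another option.
  add(tx: PoolTx): boolean {
    if (this.txs.has(tx.hash)) return false;
    if (this.txs.size >= this.maxSize) return false;
    this.txs.set(tx.hash, tx);
    return true;
  }

  // Call after every block: drop transactions whose nonce is already below
  // the sender's current on-chain nonce, since they can never be included.
  evictStale(onChainNonce: (sender: string) => number): number {
    let evicted = 0;
    for (const tx of Array.from(this.txs.values())) {
      if (tx.nonce < onChainNonce(tx.sender)) {
        this.txs.delete(tx.hash);
        evicted++;
      }
    }
    return evicted;
  }

  get size(): number {
    return this.txs.size;
  }
}

const pool = new Mempool(2);
const first = pool.add({ hash: "0xa", sender: "faucet", nonce: 0 });     // accepted
const duplicate = pool.add({ hash: "0xa", sender: "faucet", nonce: 0 }); // same hash, rejected
const second = pool.add({ hash: "0xb", sender: "faucet", nonce: 1 });    // accepted
const overCap = pool.add({ hash: "0xc", sender: "faucet", nonce: 2 });   // pool full, rejected
// Pretend a block landed and the faucet's on-chain nonce is now 1.
const evicted = pool.evictStale(() => 1);
console.log({ first, duplicate, second, overCap, evicted, size: pool.size });
```

Keying the pool by hash makes forwarded duplicates a no-op, and the post-block sweep is what prevents permanently invalid transactions from blocking the builder.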

The bond API mismatch

The final twist. The testnet's genesis was created months ago with bond(pubKey, amount): two arguments, explicit amount. Our local code had evolved to bond(pubKey) + msg.value, so we were calling the wrong API. The bond transactions executed and consumed gas, but the contract threw Cannot convert undefined to a BigInt: the deployed code read the amount from its second argument, which we never passed, and ignored msg.value.

Once we switched to bond(pubKey, amount) — the OLD API that matches the deployed genesis — all three validators bonded successfully.
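
The failure is easy to reproduce in isolation. The function below is a hypothetical stand-in for the genesis contract, not its actual source. It assumes the contract converts its second argument with BigInt and never looks at msg.value.

```typescript
type Bonds = Map<string, bigint>;

// Stand-in for the contract baked into genesis: bond(pubKey, amount).
// It reads the amount from the second argument and ignores msg.value.
function deployedBond(bonds: Bonds, args: unknown[], _msgValue: bigint): void {
  const [pubKey, amount] = args;
  // BigInt(undefined) throws a TypeError, which is the failure described above.
  bonds.set(String(pubKey), BigInt(amount as string));
}

const bonds: Bonds = new Map();

// New-style call: one argument plus msg.value. The amount argument is undefined.
let newStyleError: unknown = null;
try {
  deployedBond(bonds, ["0xabc"], BigInt(1000));
} catch (err) {
  newStyleError = err;
}
console.log(newStyleError instanceof TypeError); // true

// Old-style call that matches the deployed contract.
deployedBond(bonds, ["0xabc", "1000"], BigInt(0));
console.log(bonds.get("0xabc") === BigInt(1000)); // true
```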

Lessons

Never update one validator without updating all of them. A deploy script that pushes to all nodes atomically is now on the roadmap.
The deployed contract code is the truth, not the repo. The genesis bakes contract source into the chain state. Local modifications don't change what's running. You have to match the deployed API.
Mempools need hygiene. A naive FIFO queue with no eviction turns a spam event into a chain halt. Invalid transactions must be cleaned up, not retried forever.
Snapshot sync beats replay. Copying a 33MB state directory takes seconds, provided the receiving node runs the same code as the source. Replaying 290K blocks takes days.
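
For the first lesson, a cheap guard is to compare build identifiers across the fleet before restarting anything. This is only a sketch of the idea, not the planned deploy script. The NodeBuild shape, the host names, and the build labels are hypothetical, and how each node reports its build (a git commit, a hash of the build output) is left open.

```typescript
// Hypothetical report from each validator.
interface NodeBuild { host: string; build: string; }

// Hosts whose build differs from the one being rolled out.
function findDrift(nodes: NodeBuild[], expectedBuild: string): string[] {
  return nodes.filter((n) => n.build !== expectedBuild).map((n) => n.host);
}

// The situation in this incident: one node updated, three left behind.
const fleet: NodeBuild[] = [
  { host: "validator-1", build: "new" },
  { host: "validator-2", build: "old" },
  { host: "validator-3", build: "old" },
  { host: "validator-4", build: "old" },
];

const drifted = findDrift(fleet, "new");
if (drifted.length > 0) {
  console.log("refusing to restart, stale nodes:", drifted.join(", "));
}
```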