Three incidents in two weeks. A 23-hour chain halt from a closed laptop lid. A code-drift divergence that left three validators stuck 200,000 blocks behind. A mempool bug that turned transaction spam into empty blocks.
Every single one of these would have been worse on mainnet. Some of them would have been catastrophic. That's the point.
What we found
The laptop incident exposed that our original consensus design assumed all validators stay online forever. When the Mac validator went offline (lid closed), the committee couldn't reach quorum. Fix: the liveness-aware committee now dynamically shrinks to the set of validators who proposed at least one of the last 30 blocks. Home validators can come and go.
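The committee-shrinking rule above can be sketched in a few lines. This is a hedged illustration, not the project's actual consensus code: the `Block` shape, the `activeCommittee` and `quorumSize` names, and the 2/3-style quorum math are all assumptions; only the 30-block lookback comes from the post.

```typescript
interface Block {
  height: number;
  proposer: string; // validator address that proposed this block
}

const LOOKBACK = 30; // the liveness window described in the post

// The committee shrinks to validators who proposed at least one of the
// last LOOKBACK blocks; everyone else is treated as offline.
function activeCommittee(chain: Block[]): Set<string> {
  const recent = chain.slice(-LOOKBACK);
  return new Set(recent.map((b) => b.proposer));
}

// Quorum is then computed over the live set, so a closed laptop falls
// out of the denominator instead of stalling consensus.
// (BFT-style >2/3 threshold is an assumption here.)
function quorumSize(committee: Set<string>): number {
  return Math.floor((2 * committee.size) / 3) + 1;
}
```

With this rule, a validator that stops proposing silently ages out of the committee after 30 blocks, and quorum is recomputed over whoever remains.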
The code drift incident exposed that deploying code to one validator without updating the others causes state divergence. The primary processed a transaction that produced different state under the new code. The peers computed a different state root and rejected the block. Planned fix: an atomic deploy script that, with one command, pushes identical built artifacts to all validators and restarts them simultaneously.
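The failure mode is easy to reproduce in miniature. This toy sketch (not the project's code; the flat-fee transition rules are invented for illustration) shows the same transaction producing different state roots under two code versions:

```typescript
import { createHash } from "node:crypto";

type State = Record<string, number>;

const stateRoot = (s: State): string =>
  createHash("sha256").update(JSON.stringify(s)).digest("hex");

// "Old" code charges a 1-unit fee; the drifted "new" code charges 2.
function applyTxV1(s: State, from: string, to: string, amount: number): State {
  return { ...s, [from]: s[from] - amount - 1, [to]: (s[to] ?? 0) + amount };
}
function applyTxV2(s: State, from: string, to: string, amount: number): State {
  return { ...s, [from]: s[from] - amount - 2, [to]: (s[to] ?? 0) + amount };
}

const genesis: State = { alice: 100 };
const rootV1 = stateRoot(applyTxV1(genesis, "alice", "bob", 10));
const rootV2 = stateRoot(applyTxV2(genesis, "alice", "bob", 10));
// rootV1 !== rootV2: peers on the old code reject the primary's block.
```

One byte of divergent execution logic is enough to fork the state root, which is why the deploy has to be all-or-nothing.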
The mempool bug exposed that a naive FIFO queue with no deduplication turns any source of duplicate transactions into a chain halt. Peers forwarding faucet transactions to each other created tens of thousands of invalid duplicates. The block builder tried to include them every block, they all failed, and the chain produced empty blocks. Fix: the mempool now deduplicates by hash, has a 10K size cap, and evicts stale transactions after every block.
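The three mempool fixes (dedup by hash, size cap, post-block eviction) fit in a small class. This is a minimal sketch under assumed names; the `Tx` shape, the 60-second staleness window, and the method names are illustrative, while the hash-keyed dedup and 10K cap come from the post:

```typescript
interface Tx {
  hash: string;
  nonce: number;
  receivedAt: number; // ms timestamp
}

class Mempool {
  private txs = new Map<string, Tx>(); // keyed by hash, so duplicates are no-ops
  constructor(private readonly maxSize = 10_000) {}

  add(tx: Tx): boolean {
    if (this.txs.has(tx.hash)) return false;         // dedup by hash
    if (this.txs.size >= this.maxSize) return false; // hard size cap
    this.txs.set(tx.hash, tx);
    return true;
  }

  // After each block: drop everything that was included, plus anything
  // older than `staleMs` (the stale-eviction window is an assumption).
  evictAfterBlock(includedHashes: Set<string>, now: number, staleMs = 60_000): void {
    for (const [hash, tx] of this.txs) {
      if (includedHashes.has(hash) || now - tx.receivedAt > staleMs) {
        this.txs.delete(hash);
      }
    }
  }

  get size(): number {
    return this.txs.size;
  }
}
```

Keying on the transaction hash makes the faucet-forwarding loop harmless: the ten-thousandth copy of a transaction is rejected in O(1) instead of being retried every block.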
The build process
We also learned practical things about running a multi-validator testnet:
- pnpm symlinks don't travel. Copying `node_modules` between machines breaks because pnpm uses a content-addressable store with symlinks. You either build on each machine independently or use `npm install` for flat dependencies.
- Genesis is permanent. The system contracts deployed at genesis are baked into the chain state. If your local code evolves, the deployed contract API doesn't change with it. You have to match the deployed version, not the repo version.
- Snapshot sync is essential. Replaying 290K blocks takes days. Copying a 33MB state directory takes seconds. New validators should snapshot-sync, not replay from genesis.
- Rate limiting the faucet is great for users, annoying for ops. We built a workaround: temporarily disable the rate limit during genesis resets, then re-enable it.
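The snapshot-vs-replay lesson can be captured as a one-line bootstrap decision. A hedged sketch, assuming names and a replay threshold that are not in the original:

```typescript
interface ChainInfo {
  height: number;            // current chain height
  snapshotAvailable: boolean; // a trusted state snapshot exists
}

// Replaying 290K blocks takes days; copying a 33MB state directory
// takes seconds. So: replay only when the chain is short enough.
// The 10K-block threshold is an illustrative assumption.
const REPLAY_THRESHOLD = 10_000;

function bootstrapStrategy(info: ChainInfo): "replay" | "snapshot" {
  if (info.snapshotAvailable && info.height > REPLAY_THRESHOLD) return "snapshot";
  return "replay";
}
```

At 290K blocks the answer is always "snapshot"; replay from genesis only makes sense for verifying the snapshot itself or for very young chains.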
What comes next
- Atomic deploy script — build once, push to all validators, restart simultaneously.
- Code version hash on /metadata — validators compare code hashes with peers and warn if they differ.
- Auto-updater for the Operator app — desktop validators get updates automatically.
- Mempool is now production-grade — dedup, cap, eviction. The empty-block class of bug is eliminated.
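The planned code-version check from the list above might look like this. Everything here is an assumption about a feature that doesn't exist yet: the hashing scheme, the `/metadata` field name, and the warning format are illustrative.

```typescript
import { createHash } from "node:crypto";

// Hash the built artifacts in a stable order so every machine that
// runs the same build computes the same digest.
function codeVersionHash(artifactContents: string[]): string {
  const h = createHash("sha256");
  for (const content of [...artifactContents].sort()) h.update(content);
  return h.digest("hex");
}

interface PeerMetadata {
  peer: string;
  codeHash: string; // as exposed on each validator's /metadata
}

// Compare our hash against what peers report and surface drift early,
// before it becomes a rejected block.
function driftWarnings(localHash: string, peers: PeerMetadata[]): string[] {
  return peers
    .filter((p) => p.codeHash !== localHash)
    .map((p) => `code drift: ${p.peer} is running a different build`);
}
```

This turns the code-drift incident from a consensus failure into a log line: validators would have warned about the mismatched build the moment it was deployed, not 200,000 blocks later.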
The chain is back up. Four validators across three continents, producing 2-second blocks. The testnet did its job — every bug we found and fixed here is a bug that won't exist on mainnet.
