I promise this isn't a post I wanted to write. But it's a useful one, and the whole point of a public build log is that you tell the truth about the parts that didn't work.
Yesterday at 13:22 UTC, the Asentum testnet produced block #4341. At 13:22:40 it tried to produce #4342 and couldn't reach consensus. It kept trying, every eight seconds, for twenty-three hours straight. have 1/4 prevotes. have 1/4. have 1/4.
Here's what happened.
the setup
I'd been running the testnet on four boxes: Hetzner VPSes spread across Germany, Virginia, and Oregon. And one more validator — my personal MacBook — running through Asentum Operator (the desktop app). Five active validators total, the MacBook being a bit of a beta test of what "home validating" actually feels like.
The committee that finalizes each block was four of those five, rotating. My Mac usually made the cut because its stake matched the others'.
Around 13:22 UTC yesterday I closed the MacBook. Not rebooted, not killed — just closed the lid, which puts it to sleep and drops the network connection immediately.
the cascade
Under normal Tendermint-style BFT, losing one of four validators is survivable. A committee of n validators tolerates f faults where n = 3f + 1, and quorum is 2f + 1, so with four validators you need three votes. Without the Mac, three validators remain: exactly enough.
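For the record, the arithmetic (a sketch in Go, not the actual Asentum consensus code):

```go
// Standard BFT sizing: n validators tolerate f faults where
// n = 3f + 1, and a round commits once 2f + 1 of them have voted.
// (Illustrative only; not the Asentum implementation.)
func quorum(n int) int {
	f := (n - 1) / 3 // n = 4  =>  f = 1
	return 2*f + 1   // n = 4  =>  3 votes to commit
}
```

Note the total absence of slack: at n = 4 the chain tolerates exactly one fault, and every one of the three remaining validators has to vote in every round.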
Except another validator briefly hiccuped at the exact moment the Mac went silent. Maybe a dropped socket, maybe a momentary garbage-collection pause. Who knows. Two validators unavailable simultaneously meant the live set was two of four, and two votes can't reach the 2f + 1 threshold of three.
Under normal circumstances the chain would just skip that round and advance. But a second bug was waiting. The primary validator didn't have an ASENTUM_PEER_RPC environment variable set, which meant its HTTP sync loop wasn't running. So when it fell out of step with the other validators' in-memory consensus state, it couldn't pull blocks from any of them to catch back up. It just sat there, proposing block #4342 over and over, getting one prevote (its own), and timing out every eight seconds.
Every eight seconds. For twenty-three hours.
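To make that failure mode concrete, here's roughly the shape of the mechanism. Every name below is illustrative, not the real Asentum code; the load-bearing detail is that an empty peer list means the loop has nothing to iterate over.

```go
// Illustrative stand-ins for the real node's types.
type Block struct{ Height uint64 }

type Peer interface {
	LatestHeight() (uint64, error)   // peer's finalized tip, over HTTP RPC
	BlockAt(h uint64) (Block, error) // fetch one block by height
}

type Chain interface {
	Height() uint64
	Apply(Block) error // verify and apply, advancing local state
}

// catchUp pulls any blocks we're missing from the configured peers.
// With ASENTUM_PEER_RPC unset, peers was empty on the primary, so a
// node that fell behind had no path back to the tip.
func catchUp(c Chain, peers []Peer) {
	for _, p := range peers {
		tip, err := p.LatestHeight()
		if err != nil {
			continue // peer unreachable; try the next one
		}
		for h := c.Height() + 1; h <= tip; h++ {
			b, err := p.BlockAt(h)
			if err != nil {
				break
			}
			if err := c.Apply(b); err != nil {
				break // block failed verification; stop here
			}
		}
	}
}
```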
the investigation
When I opened the explorer this morning, it said "last block produced 23 hours ago." My heart sank exactly as hard as you think it did.
I SSH'd into the primary and pulled the logs. The first thing I saw was a wall of block production failed: timed out waiting for prevote quorum on 0x... after 8000ms (have 1/4) stretching back to the previous afternoon. Identical error every eight seconds, thousands of times.
I checked the peers. The other three machines agreed on their last-finalized block height; the primary was behind them by a few blocks — it had been kicked back to an earlier consensus state and never recovered. The peers had moved on without it, then got stuck too when their own rounds needed the primary's vote.
I added the missing ASENTUM_PEER_RPC so the primary could sync. It pulled the missing blocks and caught up. Still stuck. Every validator was now on the same block, but none were voting on each other's proposals.
the real problem
The real issue wasn't the momentary glitch that started the cascade. It was the structural property that made the cascade unrecoverable:
BFT with a strict four-validator committee and one silent participant has zero fault tolerance. Any flicker during that window halts the chain.
And with a chain that wants home validators on consumer hardware, there will always be someone flickering. People close their laptops. ISPs have maintenance windows. Dogs step on Ethernet cables.
A chain whose decentralization story is "anybody can validate" cannot afford to be one closed laptop away from a 23-hour outage.
the fix
The fix is a liveness-aware committee. Instead of the consensus threshold being "2/3 of everyone who ever bonded," it's "2/3 of everyone who actually proposed a block in the last 30 blocks." If your laptop has been offline for three minutes, you're not in the committee. If you come back and propose your next scheduled slot, you're in it again, immediately. No slash. No ceremony.
The computation is deterministic — every node reads the same block headers and computes the same live set. Nothing can go out of sync.
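Sketched in Go, with hypothetical names (liveSet, Header) rather than the real implementation:

```go
const livenessWindow = 30 // blocks, per the rule above

type Header struct {
	Height   uint64
	Proposer string // address of the validator that proposed this block
}

// liveSet is a pure function of recent headers: every validator that
// proposed at least one of the last livenessWindow blocks is in.
// Every node holds the same headers, so every node computes the same
// set. No gossip, no liveness oracle, nothing to drift out of sync.
func liveSet(headers []Header) map[string]bool {
	start := 0
	if len(headers) > livenessWindow {
		start = len(headers) - livenessWindow
	}
	live := make(map[string]bool)
	for _, h := range headers[start:] {
		live[h.Proposer] = true
	}
	return live
}
```

The consensus threshold is then 2/3 of this set instead of 2/3 of everyone bonded, so a closed laptop shrinks the committee rather than freezing it.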
I wrote it this afternoon and redeployed to all four boxes. The chain produced block #4345 within five seconds of the last restart and hasn't stopped since.
Full technical writeup in the next post. The one after that is the broader staking-contract cleanup that this incident also forced into scope — minimum self-bonds, stake caps, and a proper unbonding window.
what I actually learned
Three things.
First, if your testnet survives because you personally keep four VPS boxes up and running, your testnet hasn't actually survived anything. The moment you test a realistic operator — a home computer with a lid that closes — it falls apart. I'd been comforting myself with the stability of the VPS cluster without realizing it was a lie. The real test is whether the chain tolerates the conditions the thesis claims it tolerates.
Second, operational defaults matter more than I was giving them credit for. ASENTUM_PEER_RPC should not be something you have to remember to set on the primary. If sync is what lets a stuck node recover, sync should be on by default. I fixed the env-var omission and I'm also going to make the sync loop run unconditionally when peers are configured, so nobody else gets bitten by this.
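Something like this, reusing the catchUp sketch from earlier (hypothetical wiring, not the actual startup code):

```go
// If the node knows about any peers at all, the catch-up loop runs
// on a timer. There is no separate knob to forget.
func startNode(c Chain, peers []Peer) {
	if len(peers) > 0 {
		go func() {
			for {
				catchUp(c, peers)           // same sketch as above
				time.Sleep(5 * time.Second) // poll interval, arbitrary here
			}
		}()
	}
	// ... start consensus as before ...
}
```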
Third, 23 hours is a long time. Long enough for the story to have moved through several "the chain is broken, huh, is Asentum just vaporware" stages in the minds of the handful of people watching. I was lucky the testnet isn't carrying any real value yet. When it does, 23 hours of downtime is a chain-killing event. Better to learn this now, on zero-stake test ASE, than to learn it the day someone's stablecoin infrastructure is on top of it.
The chain's back. It's better than it was yesterday. And I have a new, very specific nightmare about MacBook lids.
— milkie
