Fail-Closed Design: When the Check Can’t Run, the Answer Is Always NO

jesse@cvps : ~/blog — zsh
$ grep -r “HALT” src/ | wc -l
// TECHNICALLY SPEAKING

Fail-Closed Design: When the Check Can’t Run, the Answer Is Always NO

The safety pattern that shows up in circuit breakers, courthouse gates, and production code — and why defaulting to “open” will eventually destroy you.

Quick sidebar before we get into the code: a new survey found that 60% of US consumers say the word “AI” in brand messaging is a turnoff. Which tracks. The word stopped meaning anything. The thing that actually matters — and the thing I spent yesterday hardening — is whether the system does what you say it does when something goes wrong. That’s an engineering question, not a marketing one.

## FAIL-CLOSED: THE PRINCIPLE

Fail-closed means: when a safety check cannot complete successfully, the system refuses to proceed. The opposite — fail-open — lets the operation through anyway. Fail-open feels friendlier until the one time it shouldn’t have let anything through. Then it’s a very bad day.

You see this in physical systems all the time. A fire door is fail-closed: cut the power, the door locks. A normal office door is fail-open: cut the power, it swings free. One protects the stairwell. The other doesn’t. The pattern is the same in code.

## WHY IT’S TRUE

Errors in software are not rare edge cases — they’re the normal texture of production. Network calls time out. Data is missing a field you expected. A regex doesn’t match because someone pasted a quoted email thread into a parser. When that happens, your code lands in an except block or a null check. The question is: what does it do there?

If the answer is “continue anyway,” you’ve built a fail-open system. The check existed to catch a problem. If the check can’t run, the problem may still exist — you just stopped looking. Proceeding is a gamble dressed up as a default.

Fail-closed flips the burden of proof. Instead of “proceed unless we find a reason not to,” it’s “stop unless we can confirm it’s safe to go.” One bad assumption and a fail-open system ships garbage quietly. A fail-closed system makes noise, halts, and waits for a human.

## HOW IT SHOWED UP YESTERDAY

Yesterday I shipped a batch of fixes under the label “proof accuracy hardening.” The core of all of them was the same pattern, applied in different places. Two that stand out:

1. Blank-field HALT. A gate in the scheduling layer was supposed to check that all required fields were populated before the workflow advanced. The original code checked for the field, got an exception when the field didn’t exist, and — because the exception handler was written sloppily — kept going. I found this because a downstream step was receiving incomplete data and producing wrong output. No alarm. Just wrong.

– except KeyError:
– log.warning(“field missing, continuing”)
– pass
+ except KeyError:
+ log.error(“required field missing — HALT, refusing to advance”)

Leave a comment