Silent failures: the bug that doesn’t crash, it just quietly does nothing

jesse@cvps : ~/blog — zsh
$ grep -r "silent" ./logs/ | wc -l
// TECHNICALLY SPEAKING

When your code swallows an error and returns success, you have a worse problem than a crash.

Trellis AI (YC W24) is hiring a product lead to build agents for healthcare access, which means they’re about to learn the lesson I spent most of yesterday acting on: agents that fail silently are harder to operate than agents that crash loudly.

## WHAT A SILENT FAILURE ACTUALLY IS

A silent failure is when a function hits an error, catches it, and then returns as if everything worked. No exception bubbles up. No log line. No alert. The caller assumes success and moves on. The real-world effect — a client not getting an email, a case not getting flagged, a payment not going out — just doesn’t happen. And you find out days later, if you find out at all.
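To make the definition concrete, here is a minimal sketch of the pattern. The names `send_email`, `notify_client`, and the in-memory `outbox` are hypothetical stand-ins for illustration, not code from the system described in this post.

```python
outbox = []  # hypothetical stand-in for a real mail service


def send_email(address, body):
    """Pretend mail client: rejects malformed addresses."""
    if "@" not in address:
        raise ValueError("invalid address")
    outbox.append((address, body))


def notify_client(address, body):
    """Silent failure: the error is swallowed and the caller is told it worked."""
    try:
        send_email(address, body)
    except Exception:
        pass  # no log, no re-raise, no failure flag
    return True  # reported as success whether or not anything was sent


result = notify_client("not-an-address", "Your invoice is ready")
print(result, len(outbox))  # the call claims success, but the outbox is empty
```

Nothing in the return value distinguishes this call from one that actually delivered the message, which is why the caller has no reason to look.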

It’s the opposite of a crash. A crash is loud and fixable. A silent failure is polite and invisible. In a system where the outputs are things that matter to real people — invoices, follow-ups, status changes — invisible is the worst failure mode.

## WHY IT HAPPENS (THE THEORY)

The root cause is almost always an overly broad except block (a bare except: or except Exception:) written to “be safe.” You’re handling a third-party call (an API, a local subprocess, a message send) and you don’t want one bad call to blow up the whole pipeline. So you wrap it, catch broad exceptions, and pass or return None. That feels defensive. It is defensive, in the worst sense: you defended the process from knowing something went wrong.

The second cause is missing success-flag checks. A function calls a sub-step, the sub-step fails, but the caller never checks whether it actually succeeded before continuing. It just moves to the next step on a foundation that isn’t there.
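A small sketch of the difference, assuming sub-steps report failure through an `ok` flag in their return value. `geocode`, `flag_case_unchecked`, and `flag_case_checked` are made-up names for illustration.

```python
def geocode(address):
    """Hypothetical sub-step that reports failure in its return value."""
    if not address.strip():
        return {"ok": False, "reason": "empty_address"}
    return {"ok": True, "value": address.upper()}


def flag_case_unchecked(address):
    """Missing check: keeps going even when the sub-step failed."""
    result = geocode(address)
    location = result.get("value")  # None when geocode failed
    return {"ok": True, "flagged_at": location}


def flag_case_checked(address):
    """Checks the flag and passes the failure up instead of building on it."""
    result = geocode(address)
    if not result["ok"]:
        return {"ok": False, "reason": "geocode_" + result["reason"]}
    return {"ok": True, "flagged_at": result["value"]}


unchecked = flag_case_unchecked("   ")
checked = flag_case_checked("   ")
print(unchecked)  # claims success while carrying a None location
print(checked)    # surfaces the sub-step failure to the caller
```

The unchecked version is the "foundation that isn’t there": it returns `ok: True` on top of a value that was never produced.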

## HOW TO FIND AND FIX THEM

Yesterday I ran a sweep across the system and found several of these. The pattern was the same every time. Here’s the shape of the fix:

    -    except Exception:
    -        pass  # don't let one failure kill the run
    +    except Exception as e:
    +        log.error("step_name failed: %s", type(e).__name__)  # type only, no PII in exc message
    +        return {"ok": False, "reason": "step_name_error"}
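Placed inside a complete function, the fixed side of that diff might look like the sketch below. Everything around the except block is an assumption made for illustration: the logger name, the `parse_amount` step, and the in-memory log stream used to show what actually gets written.

```python
import io
import logging

# Capture log output in memory so the example can show what gets logged.
log_stream = io.StringIO()
log = logging.getLogger("pipeline_example")  # hypothetical logger name
log.setLevel(logging.ERROR)
log.addHandler(logging.StreamHandler(log_stream))
log.propagate = False


def parse_amount(raw):
    """Hypothetical step whose exception message echoes the input data."""
    return int(raw)  # the ValueError message includes the raw value


def run_step(raw):
    try:
        return {"ok": True, "amount": parse_amount(raw)}
    except Exception as e:
        # Log the exception type only; the message may contain input data.
        log.error("parse_amount failed: %s", type(e).__name__)
        return {"ok": False, "reason": "parse_amount_error"}


result = run_step("jane.doe@example.com")
logged = log_stream.getvalue()
print(result)
print(logged, end="")  # the log names ValueError but not the input string
```

The caller now receives an explicit failure it can act on, and the log line records what kind of error occurred without repeating the payload.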

Two things happened in that diff. First: log the exception type, not the message. Exception messages sometimes contain PII — an address, a name, a number that came from input data. Logging type(e).__name__ tells you what kind of error happened without leaking what was in the payload. I had to fix that pattern in multiple
