Fail-Silent Bugs: When Logs Say OK While the System Does Nothing

// TECHNICALLY SPEAKING

$ git log --oneline --since=yesterday

Updated 2026-06-17

Today I spent the better part of the day fixing bugs that weren’t crashing anything. The logs looked clean. No errors, no alerts, no stack traces. And a whole tier of automated steps was silently doing nothing, or worse, doing the wrong thing and logging success on the way out.

I run a real California field-services company. The whole operation runs on software I built from scratch — intake, dispatch, invoicing, client comms, all automated. Over 1,000 cases through it. When it works, I’m one person doing the job of four. When it breaks silently, I don’t find out until something downstream doesn’t happen and someone asks why.

Today was that kind of day. No crashes. Just a system confidently marching forward on stale data, mismatched enum values, and wrong routing paths, logging “processed ok” the whole time. I call this a fail-silent failure; you’ll also see it called a silent failure. It’s the hardest class of bug to catch, because nothing tells you it’s happening.

$ cat contents.md

1. What a fail-silent bug actually is
2. Why they’re harder to find than crashes
3. Three real ones I hit today
4. The fix: verify the effect, not the call
5. Three questions to ask about any automated step

1. What a fail-silent bug actually is

// THE PRINCIPLE

A fail-silent failure is when a system encounters a condition it can’t handle — and instead of crashing or alerting, it quietly does nothing and moves on. No error is raised. Nothing gets logged as a failure. The system just skips the work and continues. Everything looks nominal from the outside. And the work doesn’t happen.

This is different from a crash (which tells you exactly where things went wrong) and different from an unhandled exception (which at least leaves a trace). A fail-silent failure is a zero — a complete non-event. The system logs “processed” and the step never ran.

// WHY IT WORKS THIS WAY

Most fail-silent bugs are accidental design. Someone wrote if records: do_the_thing(records) and assumed records would always exist. When the query filter breaks and records comes back empty, there is no exception. The condition is False, the call never runs, and the log line at the end says everything processed fine.

The code handles the normal case. Nobody designed behavior for when the input is wrong.
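
A minimal sketch of that accidental design. The names run_step and do_the_thing are stand-ins for illustration, not functions from the real system. The point is that an empty input takes the same path to the success log line as a full one:

```python
import logging

log = logging.getLogger("pipeline")


def do_the_thing(records):
    """Stand-in for the real work; returns how many records it handled."""
    return len(records)


def run_step(records):
    """Fail-silent version: empty input skips the work but still reports success."""
    handled = 0
    if records:
        handled = do_the_thing(records)
    # This line runs whether or not any work happened.
    log.info("processed ok")
    return handled
```

Calling run_step([]) raises nothing, handles zero records, and emits the same log line as run_step(["a", "b"]).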

2. Why they’re harder to find than crashes

// THE PRINCIPLE

A crash tells you where to look. An error log tells you what broke. A fail-silent bug gives you nothing except an absence. And absences are hard to notice, especially when you’re the one who built the system and you’re also the one running it.

I usually find fail-silent bugs when someone asks why something didn’t happen, not from my own monitoring. A step that was supposed to fire didn’t. A case that should have advanced didn’t. By the time I know about it, the bug might have been silently eating work for hours, sometimes longer.

// WHY THEY COMPOUND

Fail-silent bugs also tend to travel in packs. One module emits a value another module doesn’t recognize. The mismatch propagates. Module A doesn’t produce the right output, module B doesn’t recognize it and skips, module C never gets triggered. Each one looks fine in isolation. The problem lives in the handoff.

And by the time you find the surface symptom, the root is three modules back and covered by logs that all say “ok.”
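
One way to catch a handoff problem before it eats work is to compare what the upstream module can emit with what the downstream module can handle, once, at startup. This is a rough sketch under assumptions: the value sets and function names are hypothetical, and only the COURT_RUN spelling mismatch comes from the session described in this post.

```python
# Hypothetical value sets for two modules that hand work to each other.
EMITTED_BY_A = {"COURT_RUN", "FILING"}
HANDLED_BY_B = {"COURT RUN", "FILING"}


def unhandled_values(emitted, handled):
    """Return the emitted values that no downstream handler recognizes."""
    return sorted(set(emitted) - set(handled))


def check_handoff(emitted, handled):
    """Raise at startup instead of silently skipping work at runtime."""
    missing = unhandled_values(emitted, handled)
    if missing:
        raise RuntimeError(
            f"emitted upstream but not handled downstream: {missing}"
        )
```

With these sets, check_handoff(EMITTED_BY_A, HANDLED_BY_B) raises and names COURT_RUN, so the mismatch surfaces at the handoff rather than three modules later.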

3. Three real ones I hit today

// A REAL EXAMPLE

I’m keeping the business logic vague — client data doesn’t go in blog posts. But the patterns are exact.

// LIVE LOG — TODAY’S SESSION
[09:14] BUG 1 — stale query: filter referenced a dropped property option → returned [] → 14 records silently skipped, logged “processed 0”
[11:32] BUG 2 — enum desync: module A emits “COURT_RUN” → module B expected “COURT RUN” → handler never fired → no action, no alert, logged “ok”
[14:05] BUG 3 — routing mismatch: one intake category routed to the generic handler instead of its own path → wrong draft built → logged success
all three: INFO — processing complete

Bug 1 is a query that trusted its filter was correct. The filter broke when a property option got renamed upstream. The query returned an empty list, the loop ran zero iterations, and nothing got processed. No exception. Zero records handled, zero alerts, and one “processed 0” line logged as if it were a success.

Bug 2 is the canonical form of this class of failure: an enum mismatch. Two modules agreed on what values exist. They were using slightly different spellings. One module’s output was never recognized by the other. The handler for that value never fired. Again, no error anywhere.

Bug 3 is a routing path that correctly categorized the input but sent it to the wrong handler. The handler ran and succeeded — it just ran the wrong logic and produced a wrong artifact. Found out when I looked at the output, not from any alert.
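
A misroute like Bug 3 can be made detectable if each handler stamps its output with the category it was written for, and the router checks that stamp against the input. The sketch below is an assumption about how that could look, not the real routing code; Draft, the handler names, and the category names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    built_for: str  # category the handler was written for
    body: str


def generic_handler(item: str) -> Draft:
    return Draft(built_for="generic", body=f"generic draft for {item}")


def priority_handler(item: str) -> Draft:
    return Draft(built_for="priority", body=f"priority draft for {item}")


ROUTES = {"generic": generic_handler, "priority": priority_handler}


def route(category: str, item: str, routes=ROUTES) -> Draft:
    """Dispatch to a handler, then verify the artifact matches the category."""
    handler = routes.get(category)
    if handler is None:
        raise LookupError(f"no route for category {category!r}")
    draft = handler(item)
    if draft.built_for != category:
        raise RuntimeError(
            f"category {category!r} produced a draft built for {draft.built_for!r}"
        )
    return draft
```

If the table ever maps "priority" to generic_handler, the handler still runs and still succeeds, but route raises instead of logging success on the wrong artifact.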

4. The fix: verify the effect, not the call

// THE PRINCIPLE

Here’s the rule I came out of today with: don’t log that you called a function. Log that the function had an effect.

The difference sounds small. It’s not. Logging a call is easy — you always called the function. Verifying an effect means checking whether the work actually happened: the record was written, the message was delivered, the count changed, the state transitioned. If you can’t verify the effect, you don’t know if the step succeeded.
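
Here is a small sketch of the difference, assuming a store that can be read back after a write. InMemoryStore, DroppingStore, and set_status are hypothetical names used only for illustration. The success line is logged only after the read-back confirms the state changed:

```python
import logging

log = logging.getLogger("pipeline")


class InMemoryStore:
    """Toy store standing in for a real database."""

    def __init__(self):
        self._rows = {}

    def write(self, key, value):
        self._rows[key] = value

    def read(self, key):
        return self._rows.get(key)


class DroppingStore(InMemoryStore):
    """Accepts the call but persists nothing, like a silently failing write."""

    def write(self, key, value):
        pass


def set_status(store, case_id, status):
    """Write a status, then confirm the effect before logging success."""
    store.write(case_id, status)
    actual = store.read(case_id)
    if actual != status:
        raise RuntimeError(
            f"case {case_id}: wrote {status!r} but read back {actual!r}"
        )
    log.info("case %s is now %s (verified by read-back)", case_id, status)
    return actual
```

With InMemoryStore the call returns the new status. With DroppingStore the write call "succeeds" but set_status raises, because the effect never happened.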

$ git diff HEAD~1 HEAD -- core/pipeline_worker.py
- records = query_active_records(filter=action_type)
- for r in records:
-     process(r)
- log.info("processed %d records", len(records))
+ records = query_active_records(filter=action_type)
+ if not records:
+     # Empty is suspicious here: this filter should always match active records.
+     log.warning("EMPTY RESULT for %s: verify filter is still valid", action_type)
+     alert_jesse("Zero records for %s: possible filter desync", action_type)
+     return
+ # Relies on process() returning a truthy value only when the record was handled.
+ processed = [r for r in records if process(r)]
+ if len(processed) != len(records):
+     log.error("processed %d of %d, %d failed", len(processed), len(records), len(records) - len(processed))

The old version trusted that records would be there. If they weren’t, it ran zero times and logged success. The new version treats empty as suspicious — because in a live production system, a query that returns zero records when it should return active ones means the query is broken, not that there’s nothing to do. Those are different things and they need different log entries.

The enum fix was the same pattern: instead of handling known values and silently ignoring unknown ones, I added an explicit else: log.error("unrecognized ACTION value: %s", val) branch. Unknown values are now loudly wrong instead of quietly ignored.
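
A sketch of that branch using Python’s enum module, which raises ValueError when a raw string is not a declared value. The Action class, the FILING member, and the returned strings are hypothetical; only the COURT_RUN value and the log message come from this post.

```python
import logging
from enum import Enum

log = logging.getLogger("pipeline")


class Action(Enum):
    COURT_RUN = "COURT_RUN"
    FILING = "FILING"  # hypothetical second value for illustration


def handle_action(val: str) -> str:
    """Dispatch on a known action; unknown values are logged and re-raised."""
    try:
        action = Action(val)  # raises ValueError for anything undeclared
    except ValueError:
        log.error("unrecognized ACTION value: %s", val)
        raise
    if action is Action.COURT_RUN:
        return "court run scheduled"
    if action is Action.FILING:
        return "filing queued"
    # A declared value with no handler is also a bug, not a no-op.
    raise NotImplementedError(f"no handler written for {action}")
```

Passing "COURT RUN" (with a space) now logs an error and raises instead of falling through with no action.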

// THE DESIGN PRINCIPLE

This is called fail-closed design. When something unexpected happens — a query returns empty, an enum value isn’t recognized, a routing condition doesn’t match — the system moves toward the safe state: alert, halt, surface the gap. It does not move toward the permissive state: assume fine, skip, continue. Fail-open looks resilient. It’s actually just silent.

5. Three questions to ask about any automated step

// HOW TO APPLY THIS

I ask these about every step in my automation pipeline now:

What does success actually look like? Not “did the function run” — what changed in the world? A record was created. A draft was sent. A status transitioned. If you can’t name the effect, you don’t have a success condition. You have a log line.

What does an empty result mean? If a query returns zero records, is that normal or suspicious? “Nothing to process right now” is fine. “Nothing that matches a filter that should always match active records” is a bug. Know the difference and log it differently.

What happens to a value you don’t recognize? If an enum, a status string, or a routing key appears that you didn’t write a handler for, does your code skip it silently, crash loudly, or alert and halt? Crashing loudly and alerting-and-halting are both acceptable. Skipping silently is not an option in a system where real work depends on it.
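
For the second question, one possible approach is to make each step declare whether an empty result is expected, so the two kinds of empty get different log entries. This is a sketch with hypothetical names (classify_result, report, empty_is_expected), not the production code:

```python
import logging

log = logging.getLogger("pipeline")


def classify_result(count: int, empty_is_expected: bool) -> str:
    """Separate 'nothing to do' from 'the filter is probably broken'."""
    if count > 0:
        return "ok"
    return "idle" if empty_is_expected else "suspicious"


def report(step: str, count: int, empty_is_expected: bool) -> str:
    """Log each outcome at a level that matches how worrying it is."""
    verdict = classify_result(count, empty_is_expected)
    if verdict == "ok":
        log.info("%s: processed %d records", step, count)
    elif verdict == "idle":
        log.info("%s: nothing to process right now", step)
    else:
        log.warning("%s: zero records from a filter that should match", step)
    return verdict
```

The flag forces the decision to be made when the step is written, instead of being discovered when someone asks why nothing happened.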

The discipline is just the habit of treating silence as suspicious instead of treating it as fine. Systems that fail loudly are fixable. Systems that fail silently are time bombs — and the fuse is already lit, you just can’t see it.

$ whoami
Jesse Moraga — surgical tech turned self-taught systems builder. I run a real California field-services company on software I built from scratch. No CS degree. A real company runs on it. The system page has what I built and what I offer.
→ github.com/JesseMoraga
