Test Pollution in Production Logs (and the Real Bug It Hid)

Short answer: A routine health check on my one-operator automation platform flagged 816 errors and one pattern that had “fired 624 times today.” When I dug in, 680 of those were fake — synthetic test data my own test suite was writing into the real production error log on every run. The false alarms had quietly trained me to distrust the monitor. And under the noise sat a real bug that was silently aborting work. This is the discovery, the two fixes, and why telling signal from noise is most of the job.

Updated 2026-06-19.

The sweep that wasn’t routine
Discovery 1: my tests were poisoning production logs
Alarm fatigue is a real outage
Discovery 2: the bug the noise was hiding
The two fixes
Why this is the good work
FAQ

The sweep that wasn’t routine

I run the back office of a small service business almost entirely on software I built myself — hundreds of jobs a month, one operator, on well under $300/month of infrastructure. On a busy day I did what every operator should do more often: I stopped shipping features and asked one question — is everything actually flowing?

The health dashboard came back mostly green: services up, queues draining, zero failed background jobs. But the error monitor was loud. It claimed 816 errors in the logs and was paging me about a pattern that had “fired 624 times today — INVESTIGATE.” Two other patterns had supposedly been failing for 7 and 10 days straight.

That is exactly the moment most solo builders glance at the green dashboard, decide the monitor is being dramatic, and move on. I’ve learned the hard way not to.

Discovery 1: my tests were poisoning production logs

I opened the noisiest log and read the actual lines instead of the count. The “errors” looked like this:

no usage found; type=dict keys=['foo', 'no_usage_here']
no usage found; type=dict keys=['mystery', 'no_tokens']
no usage found; type=NoneType keys=['NoneType']

Keys named foo, mystery, no_usage_here. No real API response on earth has those keys — those are test fixtures. A unit test was feeding deliberately-broken payloads into a parser to prove the parser handles junk gracefully. Good test. The problem: the parser legitimately logs every unrecognized shape to a real error-log file, and the test was running against the production log path. Every time my test suite ran — which on a shipping day is many times, because it runs automatically before every push — it dumped a fresh batch of fake errors into the file my monitor watches.

680 of the 816 “errors” were my own tests talking to themselves.

Alarm fatigue is a real outage

Here’s the part that matters more than the bug. A monitor that cries wolf isn’t a cosmetic annoyance — it’s a slow outage of your attention. Every false page lowers the odds you’ll read the next one. Train yourself for a few weeks to swipe away “816 errors,” and the day a real number shows up, it looks identical to the noise. The alert system is only as valuable as your willingness to trust it, and trust is destroyed quietly.

So the noise wasn’t the small problem and the bug the big one. The noise was the big problem, because of what it was hiding.

Discovery 2: the bug the noise was hiding

Once I filtered the fake lines out, a genuine error was sitting underneath, attached to my document-generation step:

generate_document: ABORTED — ['No date — not found in the record']
generate_document: fallback error (non-blocking): cannot import name
    'get_thread_messages' from 'gmail'

The system generates a legal-grade PDF at the end of a job. If a required date isn’t already saved in the structured record, it’s supposed to fall back to reading the email thread and recovering the date from what was actually written there. Smart fallback. Except it imported the helper function from the wrong module, with the wrong keyword argument:

# broken: wrong module, wrong kwarg — raises ImportError every single time
from gmail import get_thread_messages
msgs = get_thread_messages(thread_id, max_results=10)

The real function lived in a different module and took max_messages, not max_results. So the fallback threw an ImportError on every call. And because the failure was caught and logged as “non-blocking,” nobody noticed — the document just quietly aborted instead of recovering. Three other places in the same codebase called the same function correctly. This one line had drifted, and the noise had been burying its complaint for who knows how long.

The two fixes

Fix 1 — stop the tests from touching production. One small automatic test fixture that redirects the log path to a throwaway folder for the duration of every test run. The parser still logs real problems in production; the tests just can’t reach the real file anymore:

@pytest.fixture(autouse=True)
def _isolate_error_log(monkeypatch, tmp_path_factory):
    import ai_usage
    d = tmp_path_factory.mktemp('ai_usage')
    monkeypatch.setattr(ai_usage, '_ERROR_LOG', d / 'errors.log')

Fix 2 — correct the import, then lock it. Point the fallback at the right module and the right keyword. Then write a regression test that fails if the wrong-module/wrong-kwarg form ever returns — because a mock would happily fake this away, so the test binds the real function signature instead:

from gmail_api import get_thread_messages          # right module
msgs = get_thread_messages(thread_id, max_messages=10)  # right kwarg

# regression: the drift can never come back silently
assert 'max_messages' in inspect.signature(get_thread_messages).parameters
assert 'from gmail import get_thread_messages' not in source

Why this is the good work

None of this shows up in a feature demo. There’s no new button. But this is the work that keeps a one-person operation alive, and it’s built on three habits worth more than any framework:

Read the logs, not the count. “816 errors” is a number. The actual lines are the truth. The fix started the second I read one.
Treat your monitoring as a product. A noisy alert isn’t free — it costs you the next real alert. Keeping the signal clean is not housekeeping; it’s how you stay able to react at all. (I wrote about the cousin of this problem in fail-silent bugs.)
Fix the root cause, then make the fix permanent with a test. A patched line drifts back. A patched line plus a regression test stays fixed. I lean on pytest fixtures to make “don’t pollute production” an automatic guarantee instead of a thing I have to remember.

I didn’t go looking for a bug. I went looking for the truth about whether my system was healthy — and the discipline of not trusting a green dashboard, plus the discipline of not trusting a loud one, is what surfaced a real failure and a real fix in the same afternoon. That’s the job. More on the auditing mindset in building an AI agent that audits your own code, and the full archive lives in technology.

FAQ

What is test pollution in production logs? It’s when running your automated tests writes data into a real production file or system — here, a test fed synthetic error payloads to a parser that logged them to the live error log, so the test suite kept manufacturing fake “production errors.”

How do you stop tests from writing to production paths? Redirect the path during tests. An automatic (autouse) pytest fixture that points the log/ledger/file path at a temporary directory for every test is the cleanest fix — production behavior is untouched, tests can’t reach real files.

Why are false alarms in monitoring dangerous? Because they cause alarm fatigue. Once you learn to ignore a noisy alert, you also ignore the real one when it finally fires — the noise hides genuine failures in plain sight.

What’s a silent-abort bug? Code that fails but swallows the error and stops doing its job without crashing or alerting. It looks fine from the outside while quietly producing nothing — which is why you have to read logs and verify output, not assume green means good.