Ask HN: How do you catch cron jobs that "succeed" but produce wrong results?

1•BlackPearl02•1w ago

I've been dealing with a frustrating problem: my cron jobs return exit code 0, but the results are wrong.

Examples:

- Backup script completes successfully but creates empty backup files
- Data processing job finishes but only processes 10% of records
- Report generator runs without errors but outputs incomplete data
- Database sync completes but the counts don't match

The logs show "success" (exit code 0, no exceptions), but the actual results are wrong. The errors might be buried in logs, but I'm not checking logs proactively every day.

I've tried:

- Adding validation checks in scripts (e.g., if count < 100: exit 1): works, but you have to modify every script, and changing thresholds requires code changes
- Webhook alerts: requires writing connectors for every script
- Error monitoring tools (Sentry, etc.): they catch exceptions, not wrong results
- Manual spot checks: not scalable

The validation-in-script approach works for simple cases, but it's not flexible. What if you need to change the threshold? What if the file exists but is from yesterday? What if you need to check multiple conditions? You end up mixing monitoring logic with business logic.
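One way to picture the concerns above (adjustable thresholds, stale files, several conditions at once) is a small check that reads its limits from the environment instead of hard-coding them. This is only a sketch under assumptions: `check_output`, `MIN_BYTES` and `MAX_AGE_MIN` are made-up names, and `find -mmin` is assumed to be available (it is in GNU and BSD find, but not in POSIX).

```shell
#!/usr/bin/env bash
# Sketch only. check_output, MIN_BYTES and MAX_AGE_MIN are hypothetical names.
# Thresholds are read from the environment so they can be tuned outside the
# job's own logic (for example in the crontab or a sourced config file).

check_output() {
    local file=$1
    local min_bytes=${MIN_BYTES:-1}          # smallest acceptable size in bytes
    local max_age_min=${MAX_AGE_MIN:-1440}   # oldest acceptable age in minutes
    local status=0

    if [ ! -f "$file" ]; then
        echo "FAIL: $file is missing"
        return 1
    fi

    # Arithmetic expansion strips the padding some wc implementations add.
    local size
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min_bytes" ]; then
        echo "FAIL: $file is $size bytes, expected at least $min_bytes"
        status=1
    fi

    # find prints the path only if the file was last modified more than
    # max_age_min minutes ago, i.e. it is a leftover from an earlier run.
    if [ -n "$(find "$file" -mmin +"$max_age_min")" ]; then
        echo "FAIL: $file is older than $max_age_min minutes"
        status=1
    fi

    return "$status"
}

# Hypothetical use after a job has written /tmp/file:
#   check_output /tmp/file || exit 1
```

The monitoring logic still runs on the same host, but the thresholds no longer live inside the business logic.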

I built a simple monitoring tool that watches job results instead of just execution status. You send it the actual results (file size, record count, status, etc.) and it alerts if something's off. No need to dig through logs, and you can adjust thresholds without deploying code.

How do you handle this? Are you adding validation to every script, proactively checking logs, or using something that alerts when results don't match expectations? What's your approach to catching these "silent failures"?

Comments

Bender•1w ago

Backup script completes successfully but creates empty backup files

The cron job itself would need to do sanity checks on the results, e.g. a comparison of before/after directory sizes, file counts, perhaps a few canary files that never change. It would then alter the exit status based on all of those checks, after performing some math logic, and trigger monitoring alerts via your preferred mechanism. Your script can control the exit status. Some use functions that perform sanity checks, cleanup traps, etc., and with each failure add a number to the exit status the script returns ('$?' to the caller, assuming bash), adding text output at the end of the script to describe the failures when the script is called in verbose mode.

In other words, whatever you, the human, did to realize there is a problem, have the script perform the same checks as if it were you, and alter the exit status and/or use whatever other alerting methods are available to you.
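A minimal sketch of that pattern, assuming bash: each failed check bumps a counter and records a message, and the counter becomes the exit status. The names `run_check` and `finish_checks`, the `VERBOSE` variable and the example paths are all hypothetical, not something the comment specifies.

```shell
#!/usr/bin/env bash
# Sketch only: run_check, finish_checks and VERBOSE are hypothetical names.

failures=0
messages=""

# Run one check command; on failure, bump the counter and remember why.
run_check() {
    local description=$1
    shift
    if ! "$@" >/dev/null 2>&1; then
        failures=$((failures + 1))
        messages="${messages}FAILED: ${description}
"
    fi
}

# Print the failure summary when VERBOSE=1, then return the failure count.
# The count is capped at 125 because the shell gives 126 and above special
# meanings (command not executable, not found, killed by a signal).
finish_checks() {
    if [ "${VERBOSE:-0}" = "1" ] && [ -n "$messages" ]; then
        printf '%s' "$messages"
    fi
    if [ "$failures" -gt 125 ]; then
        return 125
    fi
    return "$failures"
}

# Hypothetical use at the end of a backup script:
#   run_check "backup archive is not empty" test -s /backups/latest.tar.gz
#   run_check "canary file made it into the backup" test -f /backups/canary.txt
#   finish_checks
#   exit $?
```

Any non-zero status tells cron-side tooling that something is wrong, and the number says how many checks failed.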

If changing the exit status, be sure the script is idempotent, as some cron daemons may try to re-run the script depending on the specific exit status. In other words, decide what you really want the script to do if it is run a second consecutive time. Read up on the cron daemon you are using, how it interprets the exit status, and what it will do.

t-3•1w ago

Validation checks are really the only solution if you can't fix the real problem - your processes are returning 0 when they are failing. Can you file a bug report?

razingeden•1w ago

it’d depend on what exactly is failing there.

File missing:

if [ ! -f /tmp/file ]; then
  exit 1
fi

File doesn’t contain 100 lines:

COUNT=$(wc -l < /tmp/file)
if [ "$COUNT" -lt 100 ]; then
  exit 1
fi

File doesn’t contain a known header or record:

HEADER=$(grep -Ec SOME_CSV_VALUE /tmp/file)
if [ "$HEADER" -eq 0 ]; then
  exit 1
fi

Any of those could instead be something like a MySQL CLI query or a wget call to a web server.

Generally, I have one long script that validates a combination of these. As it runs through the script, I echo the HTML: a meta refresh tag, my table, my table row. Then each “if” case appends a <TD> </TD></TR>, with the “else” statements deciding whether a red or a green cell goes into the HTML file as it goes down the list.

That way if I have say, 50-100 critical things that run every morning I have a visual dashboard when one screws up.
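A rough sketch of what such a dashboard generator could look like, not the commenter's actual script: one function emits a table row with a green or red cell depending on whether a check command succeeds. The function names, the 60-second refresh interval, the inline styles and the output path are all assumptions, and labels are not HTML-escaped.

```shell
#!/usr/bin/env bash
# Sketch only: function names, colours and the refresh interval are assumptions.

# Opening HTML, including a meta refresh tag so the browser reloads the page.
page_header() {
    printf '<html><head><meta http-equiv="refresh" content="60"></head>\n'
    printf '<body><table>\n'
}

# Emit one table row: green cell if the check command succeeds, red otherwise.
status_row() {
    local label=$1
    shift
    local colour=red
    if "$@" >/dev/null 2>&1; then
        colour=green
    fi
    printf '<tr><td>%s</td><td style="background:%s">&nbsp;</td></tr>\n' "$label" "$colour"
}

page_footer() {
    printf '</table></body></html>\n'
}

# Hypothetical use (the output path is made up):
#   {
#     page_header
#     status_row "export file exists" test -f /tmp/file
#     status_row "export file is not empty" test -s /tmp/file
#     page_footer
#   } > /var/www/html/status.html
```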

As far as I know, this is still in use 14 years after it was deployed. It’s in all the stuff that starts an options exchange every day.

And then I left behind another one at a telco that checks all their RADIUS servers and RADIUS partners and pops a cell red when one isn’t responding to auth requests. I “think” they are still using some form of it, except that SolarWinds now hooks into those exit codes and they don’t really care about the HTML page.

TFSFVentures•2d ago

This is a classic operational pain point, especially with critical background processes like backup scripts, data processing, and report generation. We've seen this exact scenario before where systems report 'success' but the actual output is flawed – empty files, incomplete records, or mismatched counts. This usually comes down to a gap between execution status and data integrity validation. Happy to offer a sanity check on how you're currently approaching this, as there are typically a few robust patterns for catching these 'silent failures' without embedding complex monitoring logic directly into every script.
