Why "top" missed a cron job that was killing our API latency

https://parth21shah.substack.com/p/why-your-dashboard-is-green-but-the

4•parth21shah•2mo ago

Comments

parth21shah•2mo ago

OP here. I’ve been doing backend work for ~15 years, but this was the first time I really felt why eBPF matters.

We had a latency spike that all the usual polling tools missed: top, CloudWatch, Datadog, everything looked normal. In the end it was a misconfigured cron job spawning ~50 short-lived workers every minute. Each one ran for ~500ms, burned the CPU, and exited before the next poll. So all our “snapshot” tools were basically blind.

I wrote the post to show this exact gap: polling = snapshots, tracing = event stream. For stuff that appears and disappears between polls, only tracing really sees it.

Tools like execsnoop or auditd can catch this, but in our case the overhead felt too high to leave on 24/7 in production. I am currently playing with a small Rust+Aya agent that listens on ring buffers so we can run this continuously with less overhead.

If you just want to try the idea, the post has a few bpftrace one-liners so you can reproduce the detection logic without writing any C or Rust.
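To make the polling gap concrete, here is a minimal Python sketch that counts how many polls would land while a worker burst is alive. The function name, the parameters, and the 10-second poll interval are illustrative assumptions, not taken from the post; only the once-a-minute spawn and the ~500ms lifetime come from the comment above. Times are integers in milliseconds to avoid floating-point drift.

```python
def polls_that_see_worker(poll_interval_ms, poll_offset_ms, duration_ms,
                          spawn_period_ms=60_000, worker_lifetime_ms=500):
    """Return (polls_that_saw_a_worker, total_polls).

    Assumes workers are alive during [k * spawn_period_ms,
    k * spawn_period_ms + worker_lifetime_ms) for k = 0, 1, 2, ...
    """
    seen = 0
    total = 0
    t = poll_offset_ms
    while t < duration_ms:
        total += 1
        # A poll only observes the burst if it lands inside the lifetime window.
        if t % spawn_period_ms < worker_lifetime_ms:
            seen += 1
        t += poll_interval_ms
    return seen, total


if __name__ == "__main__":
    one_hour_ms = 3_600_000
    # Poller running every 10s, 3s out of phase with the cron schedule.
    print(polls_that_see_worker(10_000, 3_000, one_hour_ms))
    # Same poller, but exactly in phase with the cron schedule.
    print(polls_that_see_worker(10_000, 0, one_hour_ms))
```

Under these assumptions the out-of-phase poller never sees a worker in an hour, while the in-phase one sees a burst on one poll in six. Whether a snapshot tool catches the burst depends on phase alignment rather than on how much CPU the burst consumes.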

danishSuri1994•2mo ago

This is a great example of the blind spot between sampling-based observability and event-driven tracing.

Anything that appears + disappears between polls is effectively invisible unless you’re streaming syscalls/process events. It’s surprising how often “short-lived, high-impact” processes cause the worst production spikes.

Curious whether you’re planning to surface this at the scheduler level (run queue latency/involuntary context switches) or stick to process-lifecycle tracing?

parth21shah•2mo ago

Right now I’m sticking to process lifecycle (sched_process_fork and sched_process_exit), mostly for correlation. It’s much easier to grab container ID / cgroup metadata at fork time and say “this pod/image is the bad actor” than it is to reconstruct that context off a firehose of sched_switch events.

I agree that run queue latency / scheduler stats are the “better” signals for pure performance debugging. But scheduler switches generate a huge volume of events compared to forks. So I’m starting with fork/exec/exit + container/cgroup mapping.

If you’ve shipped scheduler-level tracing in production, I’d love to hear how you handled filtering + aggregation.
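The fork-time correlation described above can be sketched in user space roughly as follows. This is not the Rust+Aya agent: the event tuple layout, the function name, and the 1-second threshold are hypothetical, and the events here are plain Python tuples standing in for whatever a real tracer would emit from sched_process_fork and sched_process_exit.

```python
from collections import defaultdict


def short_lived_by_cgroup(events, max_lifetime_ms=1000):
    """Count short-lived processes per cgroup.

    events: iterable of (kind, pid, ts_ms, cgroup) tuples (hypothetical
    layout), where kind is "fork" or "exit". The cgroup is recorded at
    fork time and ignored on exit events.
    """
    started = {}               # pid -> (fork timestamp, cgroup seen at fork)
    counts = defaultdict(int)  # cgroup -> number of short-lived processes
    for kind, pid, ts_ms, cgroup in events:
        if kind == "fork":
            started[pid] = (ts_ms, cgroup)
        elif kind == "exit":
            entry = started.pop(pid, None)
            if entry is None:
                continue  # fork happened before tracing started
            fork_ts, fork_cgroup = entry
            if ts_ms - fork_ts <= max_lifetime_ms:
                counts[fork_cgroup] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        ("fork", 101, 0, "pod-a"),
        ("fork", 102, 5, "pod-a"),
        ("fork", 200, 0, "pod-b"),
        ("exit", 101, 500, None),
        ("exit", 102, 505, None),
        ("exit", 200, 5000, None),  # long-lived, not counted
    ]
    print(short_lived_by_cgroup(sample))
```

Keeping the cgroup captured at fork time means the exit handler only needs a pid lookup, which is the attribution shortcut the comment describes. A real agent would also have to handle pid reuse and evict entries for processes that never exit during the trace.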

zahlman•2mo ago

I could already guess the answer and there is just so little actual content here with way too many words to explain a simple idea. Which is what you typically get when you let the LLM write for you.
