Hey HN, I'm Oskar. For the past few months I've been building StatusDude - an uptime monitoring tool with private agents that auto-detect your Kubernetes resources.
I run a bunch of stuff across multiple orgs, different clusters, internal networks, self-hosted, GKE, EKS, etc. Monitoring all of it without Datadog money was getting tough, and most tools don't even support internal networks. So, here we are.
A tiny async agent sits inside your network and phones home over HTTPS. No inbound ports, no VPN, no firewall rules. One container, one helm install, done. A single instance handles 10k+ monitors comfortably.
The agent pulls check definitions from the cloud, runs them locally, uploads raw results. All evaluation is server-side - the agent stays dead simple, and the cloud decides what's actually down vs. a blip.
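Roughly, the loop looks like this - a minimal sketch, not the actual agent; the endpoints, auth header, and payload fields are made up for illustration:

    # Sketch of the pull-run-upload loop. API paths and payload shapes are
    # assumptions for illustration, not StatusDude's real API.
    import asyncio, time
    import httpx

    API = "https://api.example.com"   # hypothetical cloud endpoint
    TOKEN = "agent-token"             # hypothetical agent credential

    async def run_check(client, check):
        started = time.monotonic()
        try:
            resp = await client.get(check["url"], timeout=check.get("timeout", 10))
            return {"check_id": check["id"], "status": resp.status_code,
                    "latency_ms": int((time.monotonic() - started) * 1000)}
        except httpx.HTTPError as exc:
            return {"check_id": check["id"], "error": str(exc)}

    async def main():
        headers = {"Authorization": f"Bearer {TOKEN}"}
        async with httpx.AsyncClient(headers=headers) as client:
            while True:
                checks = (await client.get(f"{API}/v1/checks")).json()
                results = await asyncio.gather(*(run_check(client, c) for c in checks))
                # Upload raw results only - the cloud decides up/down server-side.
                await client.post(f"{API}/v1/results", json=results)
                await asyncio.sleep(30)

    asyncio.run(main())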
For Kubernetes, it auto-discovers Ingresses, Services, and HTTPRoutes. Deploy something new, it just gets picked up. Monitors and status pages spin up automatically.
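Discovery is mostly the stock Kubernetes client. A simplified sketch of the Ingress/HTTPRoute side (not the actual agent code):

    # Sketch of Ingress/HTTPRoute discovery with the official kubernetes client.
    from kubernetes import client, config

    config.load_incluster_config()  # the agent runs in-cluster

    def discover_targets():
        targets = []
        # Ingresses: one monitor per host rule
        for ing in client.NetworkingV1Api().list_ingress_for_all_namespaces().items:
            for rule in ing.spec.rules or []:
                if rule.host:
                    targets.append({"namespace": ing.metadata.namespace,
                                    "name": ing.metadata.name,
                                    "url": f"https://{rule.host}"})
        # HTTPRoutes (Gateway API) are CRDs, so they go through CustomObjectsApi
        routes = client.CustomObjectsApi().list_cluster_custom_object(
            "gateway.networking.k8s.io", "v1", "httproutes")
        for route in routes.get("items", []):
            for hostname in route.get("spec", {}).get("hostnames", []):
                targets.append({"namespace": route["metadata"]["namespace"],
                                "name": route["metadata"]["name"],
                                "url": f"https://{hostname}"})
        return targets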
During the development process I found out I don't know how to use Celery properly. Went with ARQ instead - 50k+ jobs/min, no drama. After I modified it a bit, that is ;-)
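For reference, a vanilla ARQ setup looks roughly like this (plain ARQ, none of my tweaks):

    # Plain ARQ: a worker class plus enqueueing via a Redis pool.
    from arq import create_pool
    from arq.connections import RedisSettings

    async def run_check(ctx, check_id: str):
        ...  # execute the check, store the raw result

    class WorkerSettings:            # picked up by `arq module.WorkerSettings`
        functions = [run_check]
        redis_settings = RedisSettings(host="redis")
        max_jobs = 1000              # concurrent jobs per worker process

    async def enqueue(check_id: str):
        pool = await create_pool(RedisSettings(host="redis"))
        await pool.enqueue_job("run_check", check_id)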
Not a full observability platform - no incident management, no on-call. Just monitoring, status pages, and notifications. If you want straightforward uptime monitoring that works behind firewalls, give it a go and please leave feedback in the comments!
New signups currently get the Team plan unlocked for free - I want people to test the full thing. Happy to answer any questions about the architecture.
https://statusdude.com
https://artifacthub.io/packages/helm/statusdude-agent/status...
jamiemallers•41m ago
K8s auto-discovery via Ingresses/Services/HTTPRoutes is clever. One edge case to watch: teams using custom CRDs for routing (Istio VirtualServices, Traefik IngressRoutes). You'll get requests for those pretty fast once people adopt this in real clusters. A plugin/annotation system where users can teach the agent about custom resource types would scale better than hard-coding each one.
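Something as small as a registry keyed by group/version/kind would probably cover it - purely hypothetical sketch:

    # Hypothetical plugin registry: map a CRD's group/version/kind to a
    # function that extracts monitorable hostnames from one object.
    import re

    DISCOVERERS = {}

    def discoverer(group, version, kind):
        def register(fn):
            DISCOVERERS[(group, version, kind)] = fn
            return fn
        return register

    @discoverer("networking.istio.io", "v1", "VirtualService")
    def istio_hosts(obj):
        return obj.get("spec", {}).get("hosts", [])

    @discoverer("traefik.io", "v1alpha1", "IngressRoute")
    def traefik_hosts(obj):
        # Traefik match rules embed the host, e.g. Host(`api.example.com`)
        hosts = []
        for route in obj.get("spec", {}).get("routes", []):
            hosts += re.findall(r"Host\(`([^`]+)`\)", route.get("match", ""))
        return hosts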
The "what's actually down vs a blip" problem is where most monitoring tools quietly fail. Two things that help: (1) requiring N consecutive failures before marking down, with N configurable per-monitor (a database might need N=1, a CDN edge might need N=3), and (2) correlating failures across monitors. If 5 services behind the same ingress controller all fail simultaneously, that's one incident, not five.
Curious about your status page auto-generation. Do you group services by namespace, by cluster, or something else? In our experience the auto-generated grouping is never quite what customers want to show publicly, so having an easy way to override the hierarchy matters a lot.
canto•25m ago
"requiring N consecutive failures before marking down" - I do have the code for it, it's just hidden currently. StatusDude supports 2 types of worker/agents - cloud agents - that will re-verify from multiregion the service status and private agents - the ones we're talking about here - that I might just bring this option back as it makes more sense.
Correlating failures is a bit tricky, as it usually requires some sort of manual dependency mapping, but for k8s Ingresses and similar I should be able to figure it out and at least send alerts with appropriate priorities and ordering.
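Roughly what I have in mind (just a sketch, field names made up):

    # Sketch: collapse simultaneous failures behind one Ingress into one alert.
    from collections import defaultdict

    def group_failures(failures):
        by_parent = defaultdict(list)
        for f in failures:
            # monitors discovered from k8s carry cluster/ingress metadata
            key = (f.get("cluster_id"), f.get("ingress") or f.get("monitor_id"))
            by_parent[key].append(f)
        # one incident per group; the ingress-level alert gets top priority
        return [{"scope": key, "monitors": [f["monitor_id"] for f in group],
                 "priority": "high" if len(group) > 1 else "normal"}
                for key, group in by_parent.items()]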
As for the status page auto-generation - currently it's based on namespace; I didn't want to bloat the user dashboard too much. Each monitor is tagged with cluster id, namespace, and labels, and status pages pick up monitors based on labels. Users are free to modify these and show exactly what they want :)
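The label matching itself is nothing fancy - something like this (simplified, not the exact code):

    # Simplified label-based selection: a status page shows every monitor whose
    # labels contain the page's selector (field names are illustrative).
    def monitors_for_page(page, monitors):
        selector = page["label_selector"]          # e.g. {"team": "payments"}
        return [m for m in monitors
                if all(m.get("labels", {}).get(k) == v for k, v in selector.items())]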