Hey HN, I'm Oskar. For the past few months I've been building StatusDude - an uptime monitoring tool with private agents that auto-detect your Kubernetes resources.
I run a bunch of stuff across multiple orgs, different clusters, internal networks, self-hosted, GKE, EKS, etc. Monitoring all of it without Datadog money was getting tough, and most tools don't even support internal networks. So, here we are.
A tiny async agent sits inside your network and phones home over HTTPS. No inbound ports, no VPN, no firewall rules. One container, one helm install, done. A single instance handles 10k+ monitors comfortably.
The agent pulls check definitions from the cloud, runs them locally, uploads raw results. All evaluation is server-side - the agent stays dead simple, and the cloud decides what's actually down vs. a blip.
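Roughly, the loop looks like this - a minimal sketch, not the actual agent; the endpoints, auth header, and payload fields are made up for illustration:

    # Sketch of the pull-run-upload loop. API paths and payload shapes are
    # assumptions for illustration, not StatusDude's real API.
    import asyncio, time
    import httpx

    API = "https://api.example.com"   # hypothetical cloud endpoint
    TOKEN = "agent-token"             # hypothetical agent credential

    async def run_check(client, check):
        started = time.monotonic()
        try:
            resp = await client.get(check["url"], timeout=check.get("timeout", 10))
            return {"check_id": check["id"], "status": resp.status_code,
                    "latency_ms": int((time.monotonic() - started) * 1000)}
        except httpx.HTTPError as exc:
            return {"check_id": check["id"], "error": str(exc)}

    async def main():
        headers = {"Authorization": f"Bearer {TOKEN}"}
        async with httpx.AsyncClient(headers=headers) as client:
            while True:
                checks = (await client.get(f"{API}/v1/checks")).json()
                results = await asyncio.gather(*(run_check(client, c) for c in checks))
                # Upload raw results only - the cloud decides up/down server-side.
                await client.post(f"{API}/v1/results", json=results)
                await asyncio.sleep(30)

    asyncio.run(main())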
For Kubernetes, it auto-discovers Ingresses, Services, and HTTPRoutes. Deploy something new, it just gets picked up. Monitors and status pages spin up automatically.
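Discovery is mostly the stock Kubernetes client. A simplified sketch of the Ingress/HTTPRoute side (not the actual agent code):

    # Sketch of Ingress/HTTPRoute discovery with the official kubernetes client.
    from kubernetes import client, config

    config.load_incluster_config()  # the agent runs in-cluster

    def discover_targets():
        targets = []
        # Ingresses: one monitor per host rule
        for ing in client.NetworkingV1Api().list_ingress_for_all_namespaces().items:
            for rule in ing.spec.rules or []:
                if rule.host:
                    targets.append({"namespace": ing.metadata.namespace,
                                    "name": ing.metadata.name,
                                    "url": f"https://{rule.host}"})
        # HTTPRoutes (Gateway API) are CRDs, so they go through CustomObjectsApi
        routes = client.CustomObjectsApi().list_cluster_custom_object(
            "gateway.networking.k8s.io", "v1", "httproutes")
        for route in routes.get("items", []):
            for hostname in route.get("spec", {}).get("hostnames", []):
                targets.append({"namespace": route["metadata"]["namespace"],
                                "name": route["metadata"]["name"],
                                "url": f"https://{hostname}"})
        return targets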
During the development process I found out I don't know how to use Celery properly. Went with ARQ instead - 50k+ jobs/min, no drama. After I modified it a bit, that is ;-)
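For reference, a vanilla ARQ setup looks roughly like this (plain ARQ, none of my tweaks):

    # Plain ARQ: a worker class plus enqueueing via a Redis pool.
    from arq import create_pool
    from arq.connections import RedisSettings

    async def run_check(ctx, check_id: str):
        ...  # execute the check, store the raw result

    class WorkerSettings:            # picked up by `arq module.WorkerSettings`
        functions = [run_check]
        redis_settings = RedisSettings(host="redis")
        max_jobs = 1000              # concurrent jobs per worker process

    async def enqueue(check_id: str):
        pool = await create_pool(RedisSettings(host="redis"))
        await pool.enqueue_job("run_check", check_id)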
Not a full observability platform - no incident management, no on-call. Just monitoring, status pages, and notifications. If you want straightforward uptime monitoring that works behind firewalls, give it a go and please leave feedback in the comments!
New signups currently get the Team plan unlocked for free - I want people to test the full thing. Happy to answer any questions about the architecture.
https://statusdude.com
https://artifacthub.io/packages/helm/statusdude-agent/status...
jamiemallers•41m ago
K8s auto-discovery via Ingresses/Services/HTTPRoutes is clever. One edge case to watch: teams using custom CRDs for routing (Istio VirtualServices, Traefik IngressRoutes). You'll get requests for those pretty fast once people adopt this in real clusters. A plugin/annotation system where users can teach the agent about custom resource types would scale better than hard-coding each one.
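Something as small as a registry keyed by group/version/kind would probably cover it - purely hypothetical sketch:

    # Hypothetical plugin registry: map a CRD's group/version/kind to a
    # function that extracts monitorable hostnames from one object.
    import re

    DISCOVERERS = {}

    def discoverer(group, version, kind):
        def register(fn):
            DISCOVERERS[(group, version, kind)] = fn
            return fn
        return register

    @discoverer("networking.istio.io", "v1", "VirtualService")
    def istio_hosts(obj):
        return obj.get("spec", {}).get("hosts", [])

    @discoverer("traefik.io", "v1alpha1", "IngressRoute")
    def traefik_hosts(obj):
        # Traefik match rules embed the host, e.g. Host(`api.example.com`)
        hosts = []
        for route in obj.get("spec", {}).get("routes", []):
            hosts += re.findall(r"Host\(`([^`]+)`\)", route.get("match", ""))
        return hosts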
The "what's actually down vs a blip" problem is where most monitoring tools quietly fail. Two things that help: (1) requiring N consecutive failures before marking down, with N configurable per-monitor (a database might need N=1, a CDN edge might need N=3), and (2) correlating failures across monitors. If 5 services behind the same ingress controller all fail simultaneously, that's one incident, not five.
Curious about your status page auto-generation. Do you group services by namespace, by cluster, or something else? In our experience the auto-generated grouping is never quite what customers want to show publicly, so having an easy way to override the hierarchy matters a lot.
canto•25m ago
"requiring N consecutive failures before marking down" - I do have the code for it, it's just hidden currently. StatusDude supports 2 types of worker/agents - cloud agents - that will re-verify from multiregion the service status and private agents - the ones we're talking about here - that I might just bring this option back as it makes more sense.
Correlating failures is a bit tricky, as it usually requires some sort of manual dependency mapping, but for k8s Ingresses and similar I should be able to figure it out and at least send alerts with appropriate priorities and ordering.
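Roughly what I have in mind (just a sketch, field names made up):

    # Sketch: collapse simultaneous failures behind one Ingress into one alert.
    from collections import defaultdict

    def group_failures(failures):
        by_parent = defaultdict(list)
        for f in failures:
            # monitors discovered from k8s carry cluster/ingress metadata
            key = (f.get("cluster_id"), f.get("ingress") or f.get("monitor_id"))
            by_parent[key].append(f)
        # one incident per group; the ingress-level alert gets top priority
        return [{"scope": key, "monitors": [f["monitor_id"] for f in group],
                 "priority": "high" if len(group) > 1 else "normal"}
                for key, group in by_parent.items()]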
As for the status page auto-generation - currently it's based on namespace; I didn't want to bloat the user dashboard too much. Each monitor is tagged with cluster id, namespace, and labels, and status pages pick up monitors based on labels. Users are free to modify these and show exactly what they want :)
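The label matching itself is nothing fancy - something like this (simplified, not the exact code):

    # Simplified label-based selection: a status page shows every monitor whose
    # labels contain the page's selector (field names are illustrative).
    def monitors_for_page(page, monitors):
        selector = page["label_selector"]          # e.g. {"team": "payments"}
        return [m for m in monitors
                if all(m.get("labels", {}).get(k) == v for k, v in selector.items())]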