Monitoring My Homelab, Simply

https://b.tuxes.uk/simple-homelab-monitoring.html

62•Bogdanp•3d ago

Comments

Tractor8626•6h ago

Even in a homelab you should totally monitor things like

- raid health

- free disk space

- whether backup jobs are running

- ssl certs expiring
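
A minimal sketch of two of the checks in this list (free disk space and certificate expiry), using only the Python standard library. The function names, the 10% threshold, and the idea of running this from cron are assumptions for illustration, not anything the commenter described.

```python
import shutil
import socket
import ssl
import time


def free_disk_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the filesystem at `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_is_low(fraction_free, min_free=0.10):
    """True when less than `min_free` of the disk is free (threshold is an assumption)."""
    return fraction_free < min_free


def days_left(not_after, now=None):
    """Days between `now` (epoch seconds) and a cert 'notAfter' string."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def cert_days_left(host, port=443, timeout=5.0):
    """Open a TLS connection to host:port and return days until its cert expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_left(cert["notAfter"])


if __name__ == "__main__":
    fraction = free_disk_fraction("/")
    print(f"free disk: {fraction:.0%}, low: {disk_is_low(fraction)}")
```

RAID health and backup freshness depend heavily on the tools in use, so they are left out here.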

ahofmann•5h ago

One could also look every Sunday at 5 pm manually through this stuff. In a homelab, this can be enough.

tough•3h ago

One could also just wait for things to stop working before trying to fix them

dewey•3h ago

For backups that's usually not the best strategy.

sthuck•3h ago

Look I agree but one can also manage with an always on pc and an external hard drive instead of a homelab. It's part hobby part learning experience.

Also if you have kids 0-6 you can't schedule anything reliably

KaiserPro•5h ago

I understand your pain.

I used to have sensu, but it was a pain to keep updated (and didn't work that well on old rpis)

But what I did find to be a good alternative was telegraf -> some sort of time series database (I still really like graphite, InfluxQL is utter horse shit, and prometheus's fucking pull model is bollocks)

Then I could create alert conditions on grafana. At least that was simple.

However, the alerting on grafana moved from "move the handle, adjust a threshold, get a configurable alert" to "craft a query, get loads of unfilterable metadata as an alert".

It's still good enough.

cyberpunk•2h ago

Why is the pull model bollocks? I’ve been building monitoring for stuff since nagios and zabbix were the new hot tools; and I can’t really imagine preferring the oldschool ways vs the pretty much industry standard of promstack these days…

jamesholden•4h ago

ok.. so your solution is using at minimum a $5/month service. Yikes, I'd prefer something like pushover before that. :/

tough•3h ago

or a shell script

faster•3h ago

You can self-host ntfy.sh but then you need to find a place outside of your infra to host it.

Scaevolus•3h ago

I use Prometheus + Prometheus Alertmanager + Any Free Tier paging system (currently OpsGenie, might move to AlertOps).

Having a feature-rich TSDB backing alerting minimizes time adding alerts, and the UX of being able to write a potential alert expression and seeing when in the past it would fire is amazing.

Just two processes to run, either bare or containerized, and you can throw in a Grafana instance if you want better graphs.

JZL003•3h ago

I do something kinda similar. I have a node express server which has lots of little async jobs; I throw them all into a Promise.all, and if they're all good, send 200, if not send 500 and the failing jobs. Then free uptime monitors check every few hours and will email me if "the site goes down" (= some error). Kinda like a multiplexer to stay within their free monitoring limit, and easy to add more tests
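
A rough sketch of that multiplexing idea, written in Python with asyncio rather than the commenter's node/express setup. `run_checks` and the check names are hypothetical. Note that `asyncio.gather(..., return_exceptions=True)` behaves more like `Promise.allSettled` than `Promise.all`; it is used here so that every failing job can be reported, not just the first.

```python
import asyncio


async def run_checks(checks):
    """Run all checks concurrently.

    `checks` maps a name to a zero-argument coroutine function that raises
    on failure. Returns (status_code, failures): 200 with an empty dict if
    everything passed, otherwise 500 and a dict of name -> error text.
    """
    names = list(checks)
    results = await asyncio.gather(
        *(checks[name]() for name in names), return_exceptions=True
    )
    failures = {
        name: f"{type(result).__name__}: {result}"
        for name, result in zip(names, results)
        if isinstance(result, BaseException)
    }
    return (500 if failures else 200), failures


if __name__ == "__main__":
    async def disk_ok():          # hypothetical check that passes
        return None

    async def backup_fresh():     # hypothetical check that fails
        raise RuntimeError("last backup is too old")

    status, failures = asyncio.run(
        run_checks({"disk": disk_ok, "backup": backup_fresh})
    )
    print(status, failures)
```

A web handler would return `status` as the HTTP code and `failures` as the body, so a single external uptime monitor covers every job.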

loloquwowndueo•2h ago

Did he reinvent monit?

Even a quick Prometheus + Alertmanager setup with two docker containers is not difficult to manage - mine just works, I seldom have to touch it (mainly when I need to tweak the alert queries).

I use pushover for easy api-driven notifications to my phone, it’s a one-time $7 fee or so and it was money well spent.

frenchtoast8•2h ago

At work I use Datadog, but it's very expensive for a homelab: $15/mo per host (and for cost I prefer using multiple cheap servers over a single large one).

NewRelic and Grafana Cloud have pretty good free plan limits, but I'm paying for that in effort because I don't use either at work so it's not what I'm used to.

SteveNuts•2h ago

The Datadog IoT agents are cheaper, but still probably more than you’d want to spend on a lab.

You also only get system metrics, no integrations - but most metrics and checks can be done remotely with a single dedicated agent

Evidlo•2h ago

My solution is to just be OK with http status checking (run a webserver on important machines), and use a service like updown.io which is so cheap it's almost free.

e.g. For 1 machine, hourly checking is ~$0.25/year

Havoc•2h ago

I personally found uptime kuma to be easiest because it has a python api package to bulk load stuff into it.

Much easier to edit a list in vscode than click around a bunch in an app

bonobocop•1h ago

Quite like Cloudprober for this tbh: https://cloudprober.org/docs/how-to/alerting/

Easy to configure, easy to extend with Go, and slots in to alerting.

jauntywundrkind•1h ago

There's an article-bias towards rejectionism, towards single shot adventures. "I didn't grok so and so and here's the shell scripts I wrote instead".

Especially for home cloud, home ops, home labs: that's great! That's awesome that you did for yourself, that you wrote up your experience.

But in general I feel like there's a huge missing middle of operations & sys-admin-ery that creates a distorted, weird narrative. There are few people out there starting their journey with Prometheus and blogging helpfully through it. There are few people midway through their k8s work talking about their challenges and victories. The tales of just muddling through, of the perseverance, of looking for information, trying to find signal through the noise are few.

What we get a lot of is "this was too much for me so I wrote my own thing instead". Or, "we have been doing such and such for years and found such and such to shave 20% compute" or "we needed this capability so added Z to our k8s cluster like so". The journey is so often missing, we don't have stories of trying & learning. We have stories like this of making.

There's such a background of 'too complex' that I really worry leads us spiritually astray. I'm happy for articles like this, it's awesome to see ingenuity on display, but there's so many good amazing robust tools out there that seem to have lots of people happily or at least adequately using them, but it feels like the stories of turning back from the attempt, stories of eschewing the battle tested widely adopted software drive so much narrative, have so much more ink spilled over them.

Very thankful for Flix language putting Rich Hickey's principle of Simple isn't Easy first, for helping re-orient me by the axis of Hickey's old grand guidance. I feel like there's such a loud clamor generally for easy, for scripts you throw together, for the intimacy of tiny systems. And I admire a lot of these principles! But I also think there's a horrible backwardsness that doesn't help, that drives us away from more comprehensive capable integrative systems that can do amazing things, that are scalable both performance wise (as Prometheus certainly is) and organizationally (that other people and other experts will also lastingly use and build from). The preselection for easy is attainable individually quickly, but real simple requires vastly more, requires so much more thought and planning and structure. https://www.infoq.com/presentations/Simple-Made-Easy/

It's so weird to find myself such a Cathedral-but-open-source fan today. Growing up the Bazaar model made such sense, had such virtue to it. And I still believe in the Bazaar, in the wide world teeming with different softwares. But I worry what lessons are most visible, worry what we pass along, worry about the proliferation of software discontent against the really good open source software that we do collaborate together on en masse. It feels like there's a massive self sabotage going on, that so many people are radicalized and sold a story of discontent against bigger more robust more popular open source software. I'd love to hear that view so much, but I want smaller individuals and voices also making a chorus of happy noise about how far they get, how magical, how powerful it is that we have so many amazing fantastic bigger open source projects that so scalably enable so much. https://en.m.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar

reboot81•1h ago

Just love https://healthchecks.io I set it up on all my boxes with these scripts:

win: https://github.com/reboot81/healthchecks_service_ps/

macos: https://github.com/reboot81/hc_check_maker_macos

linux: https://healthchecks.io/docs/bash/
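
For boxes where Python is handier than the linked scripts, here is a small sketch of the same ping pattern. It assumes the hosted service's `https://hc-ping.com/<uuid>` ping endpoint, with `/fail` appended to signal a failure; `run_and_report`, the `send` parameter, and the placeholder UUID are hypothetical names for illustration and are not taken from the linked repositories.

```python
import urllib.request

PING_BASE = "https://hc-ping.com"  # hosted healthchecks.io ping endpoint


def ping_url(check_uuid, ok=True):
    """Build the success ping URL, or the '/fail' URL when ok is False."""
    suffix = "" if ok else "/fail"
    return f"{PING_BASE}/{check_uuid}{suffix}"


def http_get(url, timeout=10):
    """Send a plain GET request and discard the response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def run_and_report(job, check_uuid, send=http_get):
    """Run job(); ping success when it returns, ping /fail and re-raise if it raises."""
    try:
        job()
    except Exception:
        send(ping_url(check_uuid, ok=False))
        raise
    send(ping_url(check_uuid, ok=True))


if __name__ == "__main__":
    # Placeholder UUID; replace with the one shown for your own check.
    # print is passed as `send` so this demo makes no network request.
    run_and_report(lambda: None, "your-check-uuid-here", send=print)
```

If the job never runs at all, no ping arrives and the service raises the alarm by itself, which is what makes this pattern a good fit for backup jobs.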
