I used Munin a lot as well in the 2005-2010 timeframe. Still do as a backup (for when Prometheus, Grafana, and Influxdb conspire against me) on my home lab.
Usually the 15-minute collection interval is just fine. One time, though, I had an issue with servers that were just fine, then crashed and rebooted, with no useful metrics collected between the last "I'm fine" and the first "I'm fine again".
At that point we started collecting metrics (for only those servers) every 5 seconds, and we figured out that someone had introduced a nasty bug that took a couple of weeks of uptime to run out of memory and crash everything. It was a fun couple of days.
Chances are your volumes are low enough it will be actually cheaper to run with something like New Relic or Datadog. When the monthly bill starts reaching 10% of what a dedicated person would cost, it's time to plan your move to self-hosted.
No, it's always time to plan the move to self-hosted, and just occasionally choose someone else to be the "self." Because once a proprietary vendor gets into the stack, evicting them is going to be a project.
I'm aware that this doesn't split cleanly down the "SaaS-only feature" or the evil "rug pull" axes, but I'd much rather say "I legitimately tried to allow us to eject from the walled garden and the world changed" versus "whaddya mean, non-Datadog?"
They'll gladly pay someone to do it and have a big team of engineers and planners to support the outsourcing.
Efficiency isn't what bigco inc is about.
They know that even if you have the capacity to run something internally today, that is a delicate state of affairs that could easily change tomorrow.
I encourage the author to read the honeycomb blog and try to grok what makes otel different. If I had to sum it up in two points:
- wide rows with high cardinality
- sampling (rough Go sketch of both points below)
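Both points show up directly at the SDK level. Here's a minimal Go sketch, assuming the standard OTel Go SDK; the span name, attribute names, and the 1% ratio are invented for illustration:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Head sampling: keep roughly 1% of traces, and follow the parent's
	// decision so a sampled trace stays complete across services.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))),
	)
	defer func() { _ = tp.Shutdown(context.Background()) }()
	otel.SetTracerProvider(tp)

	// One wide span per unit of work, with the high-cardinality context
	// (IDs, build, counts) attached as attributes on that single row.
	_, span := otel.Tracer("checkout").Start(context.Background(), "HandleOrder")
	span.SetAttributes(
		attribute.String("user.id", "u-12345"),  // invented attribute
		attribute.String("build.sha", "abc123"), // invented attribute
		attribute.Int("cart.items", 7),          // invented attribute
	)
	span.End()
}
```

ParentBased keeps the decision consistent across services, which is what makes the 1% actually usable for following a request end to end.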
Which, sure, if you’re willing to pay for it, I’m happy to let you make your life miserable. But I’m still going to be the Marie Kondo of IT and ask if that specific data point brings you joy. Does having per-second interval data points actually improve response times and diagnostics for your internal tooling, or does it just make you feel big and important while checking off a box somewhere?
Observability is a lot like imaging or patching: a necessary process to be sure, but do you really need a Cadillac Escalade (New Relic/Datadog/etc) to go to the grocery store when a Honda Accord (self-hosted Grafana + OTel) will do the same job more efficiently for less money?
Honestly regret not picking the brain of the Observability head at BigCo when I had the chance. What little he showed me (self-hosted Grafana for $90/mo in AWS ECS for the corporate infrastructure of a Fortune 50? With OTel agents consuming 1/3 to 1/2 the resources of New Relic agents? Man, I wish I had jumped down that specific rabbit hole) was amazingly efficient and informative. Observation done right.
There seems to be a strong "instrument everything" culture that, I think, misses the point. You want simple metrics (machine and service) for everything, but if your service gets an error every million requests or so, it might be overkill to trace every request. And, for the errors, you usually get a nice stack dump telling you where everything went wrong (and giving you a good idea of what was wrong).
At that point - and only at that point - I'd say it's worth TEMPORARILY adding increased logging and tracing. And yes, it's OK to add those and redeploy TO PRODUCTION.
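One cheap way to make that temporary bump painless is to gate verbosity behind an environment variable, so "redeploy with more detail" is a config flip rather than a code change. A rough Go sketch, assuming the OTel Go SDK samplers and the standard library logger; DEBUG_OBSERVABILITY is a hypothetical variable name:

```go
package main

import (
	"log/slog"
	"os"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// samplerAndLogLevel returns the steady-state settings, or "trace everything,
// log everything" when DEBUG_OBSERVABILITY=1 is set on the affected service.
func samplerAndLogLevel() (sdktrace.Sampler, slog.Level) {
	if os.Getenv("DEBUG_OBSERVABILITY") == "1" { // hypothetical env var
		return sdktrace.AlwaysSample(), slog.LevelDebug
	}
	// Normal operation: 1% of traces, Info-level logs.
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01)), slog.LevelInfo
}

func main() {
	sampler, level := samplerAndLogLevel()
	_ = sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler))
	slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level})))
	slog.Info("observability configured", "debug", os.Getenv("DEBUG_OBSERVABILITY") == "1")
}
```

Flip it on for the affected service, capture the failure, flip it back off.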
Depends if your objective is to go to the grocery store or merely to show off going to the grocery store.
During the ZIRP era there was a financial incentive for everyone to over-engineer things to justify VC funding rounds and appear "cool". Business profitability/cost-efficiency was never a concern (a lot of those business were never viable and their only purpose was to grift VC money and enjoy the "startup founder" lifestyle).
Now ZIRP is over, but the people who started their career back then are still here and a lot of them still didn't get the memo.
Yep, and what’s worse is that…
> Now ZIRP is over, but the people who started their career back then are still here and a lot of them still didn't get the memo.
…folks let go from BigTech are filtering into smaller orgs, and the copy-pasters and “startup lyfers” are bringing this attitude with them. I guess I got lucky enough to start my interest in tech before the dotcom crash, my career just before the 2008 crash, and finished my BigTech tenure just after COVID (and before the likely AI crash), and thus am always weighing the costs versus the benefits and trying to be objective.
Problem is, not all of them are even doing this intentionally. A lot actually started their career during that clown show, so for them this is normal and they don't know any other way.
The ideal setup is that you trace as much as you can for some given time frame; if your stack supports compression and tiered storage, it becomes cheaper.
Metrics are the easiest way to simply expose your application's internal state, and then, as a maintainer of that service, you're in nirvana. And even if you don't go that far, you're likely an engineer writing code, and when it comes time to add some metrics, why wouldn't you add more rather than less? And once you have all of them, why not add all possible labels? Meanwhile your Prometheus server is in a crash loop because it ran out of RAM, but that's not a problem visible to you. Unfortunately there's a big gap in understanding between writing instrumentation code in an editor and the effect on resource usage at the other end of your observability pipeline.
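That gap is easiest to see at the instrumentation site itself. A small Go sketch using prometheus/client_golang (metric and label names invented); the thing that kills the server isn't the number of metrics, it's unbounded label values:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Bounded labels: "method" has a handful of values, "status" a few classes,
// so the total number of time series stays small and predictable.
var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "app_http_requests_total", // invented metric name
		Help: "HTTP requests handled, by method and status class.",
	},
	[]string{"method", "status"},
)

func init() {
	prometheus.MustRegister(httpRequests)
}

func recordRequest(method, status string) {
	// Adding user_id, request_id, or a raw URL as a label instead would mint
	// one series per distinct value - that unbounded cardinality is what puts
	// the Prometheus server on the other end into an OOM crash loop.
	httpRequests.WithLabelValues(method, status).Inc()
}

func main() {
	recordRequest("GET", "2xx")
}
```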
As annoying as that may sound, it's a hell of a lot harder to go back in time to observe that bizarre intermittent issue...
The way that I've seen it play out is something like this:
1. We should self host something like Grafana and otel.
2. Oh no, the teams don't want to host individual instances of that, we should centralize it!
(2b - optional but common: a random team gets saddled with this job)
3. Oh no, the centralized team is struggling with scaling issues and the service isn't very reliable. We should outsource it for 10x the cost!
This will happen even if they have a really nice set of deployment infrastructure and patterns that could have allowed them to host observability at the team level. It turns out, most teams really don't need the Escalade, they just need some basic graphs and alerts. Self-hosting needs to be more common within organizations.
So yea, cost of storage and network traffic is only going to balloon.
There is room for improvement, and I can already see new projects that will most likely gain traction in the upcoming years.
You really don't have to.
Throw away traces. Throw away logs. Sample those metrics. The standard gives you capabilities, it doesn't force you to use them. Tune based on your risk appetite, constraints, and needs.
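In practice "throw away" can be a very dumb gate near the point of export. A hypothetical Go sketch, not any particular library's API, just to show the policy is yours to choose:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keepLog decides whether a log record is worth shipping: errors always,
// everything else at a deterministic rate keyed on the trace ID, so a given
// request is either fully kept or fully dropped.
func keepLog(severity, traceID string, keepOneIn uint32) bool {
	if severity == "ERROR" || severity == "FATAL" {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%keepOneIn == 0
}

func main() {
	// Keep all errors and roughly 1 in 100 of everything else.
	fmt.Println(keepLog("INFO", "trace-abc", 100))
	fmt.Println(keepLog("ERROR", "trace-def", 100))
}
```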
My other favourite retort to "look how expensive the observability is" is "have you quantified how expensive not having it is". But I reserve that one for obtuse bean counters :)
Self-hosting metrics at any scale is pretty cost effective.
The main difference I see with otel is the ability to repeatedly aggregate/decimate/discard your data at whatever tier(s) you deem necessary using opentelemetry-collector. The amount of data you end up with is up to you.
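To make "decimate" concrete, here's a hypothetical Go sketch of the kind of downsampling a collector tier can apply before forwarding (the types are invented; the real collector expresses this through processors in its pipeline config):

```go
package main

import (
	"fmt"
	"time"
)

// Sample is a single metric data point. Invented type for illustration.
type Sample struct {
	At    time.Time
	Value float64
}

// decimate collapses raw points into one averaged point per bucket, e.g.
// turning 1-second scrapes into 1-minute aggregates before the expensive tier.
func decimate(points []Sample, bucket time.Duration) []Sample {
	type acc struct {
		sum   float64
		count int
	}
	buckets := map[time.Time]*acc{}
	for _, p := range points {
		key := p.At.Truncate(bucket)
		if buckets[key] == nil {
			buckets[key] = &acc{}
		}
		buckets[key].sum += p.Value
		buckets[key].count++
	}
	out := make([]Sample, 0, len(buckets))
	for at, a := range buckets {
		out = append(out, Sample{At: at, Value: a.sum / float64(a.count)})
	}
	return out
}

func main() {
	now := time.Now()
	raw := []Sample{{now, 1}, {now.Add(time.Second), 3}, {now.Add(time.Minute), 5}}
	fmt.Println(decimate(raw, time.Minute)) // two aggregated buckets instead of three raw points
}
```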
It's then obviously your receiving end's job to take the incoming data and store it efficiently - grouping it by resource attributes, for example (since you probably don't want to store the same metadata 10 times). But especially thanks to the flexibility of adding all the surrounding metadata (rather than just shipping the single log line), you can do magic things like routing metrics to different tenants / storage classes, or dropping them.
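As a purely hypothetical Go sketch of that receiver-side idea (the tenant names and attribute keys are invented; a real deployment would express this as routing/filter rules in the collector pipeline):

```go
package main

import "fmt"

// Batch is an invented type: resource metadata travels once per batch rather
// than once per record, so the receiver can route or drop the batch cheaply.
type Batch struct {
	Resource map[string]string // e.g. service.name, deployment.environment
	Records  []string
}

// routeBatch picks a storage tier per batch based on resource attributes:
// production goes to the hot tenant, staging to cheap cold storage, and
// anything else is dropped outright. All names here are invented.
func routeBatch(b Batch) string {
	switch b.Resource["deployment.environment"] {
	case "production":
		return "tenant-hot"
	case "staging":
		return "tenant-cold"
	default:
		return "drop"
	}
}

func main() {
	b := Batch{
		Resource: map[string]string{"service.name": "checkout", "deployment.environment": "staging"},
		Records:  []string{"line 1", "line 2"},
	}
	fmt.Println(routeBatch(b)) // tenant-cold
}
```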
Having said that, OTEL is both a joy and an immense pain to work with - but I still love the project (and still hate the fact that every release has breaking changes and 4 different version identifiers).
Btw, one of the biggest wins in the otel-collector would be to use the new Protobuf Opaque API, as it will most likely save lots of CPU cycles (see https://github.com/open-telemetry/opentelemetry-collector/is...) - PRs are always welcome, I guess.