So far it's pretty good. We're at least one major version behind, but hey everything still works.
I cannot imagine other products support as many data sources (though I'm starting to think they all suck; I just dump what I can in InfluxDB).
I operate a fairly large custom VictoriaMetrics-based Observability platform and have learned early on to only use Grafana as opposed to other Grafana products. Part of the stack used to use Mimir's frontend as a caching layer, but even that died with Mimir v3.0, now that it can't talk to generic Prometheus APIs anymore (vanilla Prom, VictoriaMetrics, promxy, etc.). I went back to Cortex for caching.
Such a custom stack is obviously not for everyone and takes much more time, knowledge and effort to deploy than some helm chart, but overall I'd say it did save me some headaches, at least when compared to the Google-like deprecation culture Grafana seems to have.
We're using a combination of Zabbix (alerting) and local Grafana/Prometheus/Loki (observability) at this point, but I've been worried about when Grafana will rug-pull for a while now. Hopefully enough people using their cloud offering sates their appetite and they leave the people running locally alone.
I'm out of that game now though so don't have the challenge.
The kicker for me recently was hearing someone say "ally"
Or without numbers,
authC/authN, authZ...
The problem is that authorization also has an "n" in the word.
Enter authC.
It seems unlikely to me that stenography would use this style, since stenographers have better ways of abbreviating long agglutinative words.
ironically that's not very accessible...
Besides that, if you're feeling masochistic you could use Prometheus' console templates or VictoriaMetrics' built-in dashboards.
Though these are all obviously nowhere near as feature-rich and capable as Grafana, and would only be able to display metrics for the single Prom/VM node they're running on. Might be enough for some users.
Disclaimer: I am affiliated with them.
Grafana's dashboarding itself (paired with VictoriaMetrics and occasionally ClickHouse) is one of the most pleasant web apps IMO. Especially when you don't try to push against the constraints of its display model, which are sometimes annoying but understandable.
we're opentelemetry-native, and apart from many out-of-the-box charts for APM, infra monitoring, and logs, you can also build customized dashboards with lots of visualization options.
p.s - i am one of the maintainers
Now it's your turn to disclose your employer name.
Prometheus and Grafana have been progressing in their own ways, each trying to build a full-stack solution, and then the OTEL thingy came along and ruined the party for everyone.
Does OTEL mean we just need to replace all our collectors (like logstash for logs and all the native metrics collectors and pushgateway crap) and then reconfigure Prometheus and OpenSearch?
And we don't really have a simpler alternative in sight... at least in the Java days the disgust produced a reaction in the form of Struts, Spring, EJB3+, and of course other languages and communities.
Not sure how exactly we got into such an over-engineered mono-culture in terms of operations, monitoring and deployment for 80%+ of the industry (k8s + graf/loki/tempo + endless supporting tools or flavors), but it is really a sad state.
Then you have endless implementations handling bits and pieces of various parts of the spec, and of course you have the tools to actually ingest and analyze and report on them.
- all the delta-vs-cumulative counter confusion
- push support for Prometheus, and the resulting out-of-order errors
- the {"metric_name"} syntax changes in PromQL
- resource attributes and the new info() function needed to join them
I just don’t see how any of these OTEL requirements make my day-to-day monitoring tasks easier. Everything has only become more complicated.
And I haven’t even mentioned the cognitive and resource cost everyone pays just to ship metrics in the OTEL format - see https://promlabs.com/blog/2025/07/17/why-i-recommend-native-...
The Pushgateway's documentation itself calls out that there are only very limited circumstances where it makes sense.
I personally only used it in $old_job and only for batch jobs that could not use the node_exporter's textfile collector. I would not use it again and would even advise against it.
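For what it's worth, that narrow batch-job case looks roughly like this with the official client_golang push package (a minimal sketch; the Pushgateway address, job name and metric name are placeholders, not anything from a real setup):

    package main

    import (
        "log"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/push"
    )

    func main() {
        // ... the actual batch work would happen here ...

        // Record when the job finished and push it once, since there is no
        // long-lived process left for Prometheus to scrape afterwards.
        completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "db_backup_last_completion_timestamp_seconds", // placeholder name
            Help: "Timestamp of the last successful DB backup.",
        })
        completionTime.SetToCurrentTime()

        // Placeholder Pushgateway address and job name.
        if err := push.New("http://pushgateway:9091", "db_backup").
            Collector(completionTime).
            Push(); err != nil {
            log.Fatalf("could not push to Pushgateway: %v", err)
        }
    }

Anything long-lived is better off just exposing /metrics and letting Prometheus scrape it.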
The author is 100% correct: Monitoring should be the most boring tool in the stack. Its one and only job is to be more reliable than the thing it's monitoring.
The moment your monitoring stack requires a complex dependency like Kafka, or changes its entire agent flow every 18 months, it has failed its primary purpose. It has become the problem.
This sounds less like a technical evolution and more like the classic VC-funded push to get everyone onto a high-margin cloud product, even at the cost of the open-source soul.
We at VictoriaMetrics try to make a boring monitoring solution which just works out of the box - https://docs.victoriametrics.com/victoriametrics/goals/
This isn't a Grafana problem, this is an industry-wide problem. Resume-driven product design, resume-driven engineering, resume-driven marketing. Do your 2-3 years, pump out something big to inflate your resume. Apply elsewhere to get the pay bump that almost no company is handing out. After the departures there is no one left who knows the system, and the next people in want to replace the things they don't understand to pad their resume for the next job.
Wash, rinse, repeat.
Loyalty simply goes unrewarded in a lot of places in our industry (and at many corporations). And the people who do stay... in many cases they turn into furniture that ends up holding potentially good evolution back. They lose out to the technological magpies that bring shiny things to management because it will "move the needle".
Sadly this is just one facet of the problems we are facing; from how we interview to how we run (or rent) our infrastructure, things have gotten rather silly...
The days where you could devote your career to a firm and retire with a pension are long gone
The author of this article wants a boring tech stack that just works, and honestly after everything we’ve been through in the last five years, I kinda want a boring job I can keep until I retire, too
elastic stack is so heavy it's out of the question for smaller clusters; loki integration with grafana is nice to have, but a separate capable dashboard would also be fine
I think this would not need to be an issue as frequently if Prometheus had a more efficient publish/scraping mechanism. IIRC there was once a protobuf exposition format that was dropped around 2.0, and since then there has mostly been just the text format (protobuf has only partially come back, for native histograms). While it wouldn't handle billions of unique labels like Mimir, a compact binary metric format could certainly allow for millions at reasonable resolution instead of wasting all that scale potential on repeated name strings. I should be able to push or expose a bulk blob all at once, with ordered labels or at least raw int keys.
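Funny enough, the Go client library still models everything as protobuf internally (dto.MetricFamily carries the name, help and type once per family, and each series is just label pairs plus a value); it only gets flattened into the repetitive text format at scrape time. A tiny sketch to see that structure, purely illustrative and not any kind of official binary exposition path:

    package main

    import (
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
    )

    func main() {
        // Two series under one metric family.
        reqs := promauto.NewCounterVec(prometheus.CounterOpts{
            Name: "demo_requests_total", // placeholder metric
            Help: "Demo counter.",
        }, []string{"path"})
        reqs.WithLabelValues("/api").Add(3)
        reqs.WithLabelValues("/health").Add(7)

        // Gather() returns []*dto.MetricFamily - protobuf structs where the
        // metric name is stored once per family, not repeated per sample.
        families, err := prometheus.DefaultGatherer.Gather()
        if err != nil {
            panic(err)
        }
        for _, mf := range families {
            fmt.Printf("%s: %d series in one MetricFamily\n", mf.GetName(), len(mf.GetMetric()))
        }
    }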
This didn't feel right, so we looked around and found GreptimeDB https://github.com/GreptimeTeam/greptimedb, which simplifies the whole stack. It's designed to handle metrics, logs, and traces. We collect metrics and logs via OpenTelemetry, and visualize them with Grafana. It provides endpoints for Postgres, MySQL, PromQL; we're happy to be able to build dashboards using SQL as that's where we have the most knowledge.
The benchmarks look promising, but our k8s clusters aren't huge anyway. As platform engineers, we appreciate the simplicity of our observability stack.
Any other happy greptimedb users around here? Together with OTel, we think we can handle all future obs needs.
Thank you for giving GreptimeDB a shout-out—it means a lot to us. We created GreptimeDB to simplify the observability data stack with an all-in-one database, and we’re glad to hear it’s been helpful.
OpenTelemetry-native is a requirement, not an option, for the new observability data stack. I believe otel-arrow (https://github.com/open-telemetry/otel-arrow) has strong future potential, and we are committed to supporting and improving it.
FYI: I think SQL is great for building everything—dashboards, alerting rules, and complex analytics—but PromQL still has unique value in the Prometheus ecosystem. To be transparent, GreptimeDB still has some performance issues with PromQL, which we’ll address before the 1.0 GA.
Are you saying that you prefer SQL over PromQL for metrics queries? I haven't tried querying metrics via SQL yet, but generally speaking have found PromQL to be one of the easier query languages to learn - more straightforward and concise IME. What advantages does SQL offer here?
PromQL, on the other hand, is purpose-built for observability — it’s optimized for time‑series data, streaming calculations, and real‑time aggregation. It’s definitely easier to learn and more straightforward when your goal is to reason about metrics and alerting.
SQL’s strengths are in relational joins, richer operator sets, and higher‑level abstraction, which make it more powerful for analytical use cases beyond monitoring. PromQL trades that flexibility for simplicity and immediacy — which is exactly what makes it great for monitoring.
Disclosure: I am a maintainer of OpenObserve
I've been doing monitoring since before it was called observability with good old Nagios, and the modern observability stack is insane. I'm glad that tools like OpenObserve and SigNoz exist.
open source and opentelemetry-native. Lots of our users have migrated from grafana to overcome challenges like having to handle multiple backends.
p.s - i am one of the maintainers.
That's beside the point that most customers will never need that level of scale. If you're not running Mimir on a dedicated Kubernetes cluster (or at least a dedicated-to-Grafana / observability cluster) then it's probably over-engineered for your use-case. Just use Prometheus.
(I'm not affiliated, but a very happy user across multiple orgs and personal projects)
All of these systems that store metrics in object storage - you have to remember that object storage is not file storage. Generally speaking (stuff like S3 One Zone being a relatively recent exception) you cannot append to object files. Metrics queries are resolved by querying historical metrics in object storage plus a stateful service hosting the latest 2 hours of data before it can be compressed and uploaded to object storage as a single block. At a certain scale, you simply need to choose which is more important - being able to answer queries or being able to insert more timeseries. And if you don't prioritize insertion, it just results in the backlog getting bigger and bigger, which especially in the eventual case (Murphy's Law guarantees it) of a sudden flood of metrics to ingest will cause several hour ingestion delays during which you are blind. And if you do prioritize insertion, well the component simply won't respond to queries, which makes you blind anyway. Lose-lose.
Mimir built in Kafka because it's quite literally necessary at scale. You need the stateful query component (with the latest 2 hours) to prioritize queries, then pull from the Kafka topic on a lower priority thread, when there's spare time to do so. Kafka soaks up the sudden ingestion floods so that they don't result in the stateful query component getting DoS'd.
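To make the scheduling idea concrete, here's a toy in-process sketch in Go (a buffered channel standing in for the Kafka topic; this is only an analogy, nothing like Mimir's actual code): reads are always served first, and the ingestion backlog is drained only when no query is waiting, so a flood slows ingestion down instead of taking the read path out.

    package main

    import (
        "fmt"
        "time"
    )

    type sample struct {
        ts    time.Time
        value float64
    }

    func main() {
        backlog := make(chan sample, 100000) // stand-in for the Kafka topic
        queries := make(chan chan int)       // each query carries a reply channel

        var head []sample // stand-in for the in-memory "last 2 hours" block

        go func() {
            for {
                select {
                case reply := <-queries: // queries always win...
                    reply <- len(head)
                default:
                    select { // ...the backlog is drained only in spare time
                    case reply := <-queries:
                        reply <- len(head)
                    case s := <-backlog:
                        head = append(head, s)
                    }
                }
            }
        }()

        // Simulate an ingestion flood, then ask a question mid-flood.
        for i := 0; i < 50000; i++ {
            backlog <- sample{ts: time.Now(), value: float64(i)}
        }
        reply := make(chan int)
        queries <- reply
        fmt.Println("series in head block while the flood is still draining:", <-reply)
    }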
I took a quick look at VictoriaMetrics - no Kafka or Kafka-like component to soak up ingestion floods? DOA.
Again, most companies are not BigCos. If you're a startup/scaleup with one VP supervising several development teams, you likely don't need that scale, probably VictoriaMetrics is just fine, you're not the first person I've heard recommend it. But I would say 80% of companies are small enough to be served with a simple Prometheus or Thanos Query over HA Prometheus setup, 17% of companies will get a lot of value out of Victoria Metrics, the last 3% really need Mimir's scalability.
There are multiple ways to deal with ingestion floods. Kafka/distributed log is one of them, but it's not the only one. In cluster mode VM is a distributed set of services that scale out independently and buffer at different levels.
Resource usage for ingestion/storage is much lower than other solutions, and you get more for your money. At $PREVIOUS_JOB, we migrated from a very expensive Thanos to a VM cluster backed by HDDs, and saved a lot. Performance was much better as well. It was a while ago, and I don't remember the exact number of time series, but it was meant to handle 10k+ VMs (and a lot of other resources, multiple k8s clusters) and did it with ease (also for everybody involved).
I don't think you have really looked into VM - you might get pleasantly surprised by what you find :) Check out this benchmark with Mimir[1] (it is a few years old though), and some case studies [2]. Some of the companies in the case studies run at significantly higher volume than your requirements.
[1] https://victoriametrics.com/blog/mimir-benchmark/
[2] https://docs.victoriametrics.com/victoriametrics/casestudies...
> HDD
You're right, I'm misremembering here, that particular complaint about a lack of Kafka was a Thanos issue, not VM.
That said, HDD is a hard sell to management. Seen as "not cloud native". People with old trauma from 100% full disks not expanded in time. Organizational perception that object storage does not need to be backed up (because redundancy is built into the object storage system) but HDD does (and automated backups are a VM Enterprise feature, and even more important if storing long-term metrics in VM).
> In cluster mode VM is a distributed set of services that scale out independently and buffer at different levels
So are Thanos and Mimir, which suffer from ingest floods causing DoS, at least until Kafka was added. vminsert is billed as stateless, same as Thanos Receiver, same as Mimir Distributor. Not convinced.
This is classic FUD. VictoriaMetrics is used as a drop-in replacement for Prometheus, Thanos and Mimir. It works perfectly across all the existing dashboards in Grafana, and across all the existing recording and alerting rules. I'm unaware of VictoriaMetrics users who hit PromQL compatibility issues during the migration from Prometheus, Thanos and Mimir to VictoriaMetrics. There are a few deliberate incompatibilities aimed at improving user experience. See https://medium.com/@romanhavronenko/victoriametrics-promql-c...
> seeing features locked behind the Enterprise version (Mimir Enterprise had features added on top, not features locked away)
All the VictoriaMetrics features, which are useful across the majority of practical use cases, are included in the open-source version. The main Enterprise feature is high-quality technical support by VictoriaMetrics engineers. Other Enterprise features are needed only by large enterprise companies. See https://docs.victoriametrics.com/victoriametrics/enterprise/
I recommend reading real-world case studies from happy users, who migrated from other systems (including Prometheus, Thanos and Mimir) to VictoriaMetrics - https://docs.victoriametrics.com/victoriametrics/casestudies...
I'm looking for something with really compact storage, really simple deployment (preferably a single statically linked binary that does everything), and compatible with OpenTelemetry (including metrics and distributed tracing). If/when I outgrow it, I can switch to another OpenTelemetry provider (but realistically this will not happen)
Isn't OpenTelemetry very slow?
I'm looking at OpenTelemetry because of broad tooling compatibility (both Rust tracing crates, tracing and emit, support it - for logs, tracing and metrics) and it seems like something that will stick around.
Also I'm not sure I will ever need actual performance out of an observability solution; it's a tiny app after all.
Depends entirely on the scale and frequency.
I've had an overall good experience operating it at numerous places w/o hitting bottlenecks.
I was previously leaning towards VictoriaMetrics and VictoriaTraces (I will need both), but I think that OpenObserve is even simpler. Later I found Gigapipe/qryn https://github.com/metrico/gigapipe
Does OpenObserve ship something to view traces and metrics? (It appears that Gigapipe does.) Or am I supposed to just use Grafana? I want to cut down on moving pieces.
ClickStack also looks promising.
OpenTelemetry is, last I looked at it, way too immature, unstable, and resource-hungry to be such a foundational part of infrastructure.
This is a lot of infrastructure, we are talking about a tiny app here. Are you sure this is warranted?
Honestly I would prefer to have observability as a library, that's not feasible because of two factors, a) I really want distributed tracing (no microservices - I just want to combine traces from frontend and backend) so I need a place to join them, and b) it could/would lead to loss of traces when the program crashes.
In any case, it makes sense for me to choose tracing and metrics libraries that can output either OpenTelemetry or Prometheus and Jaeger, in the event that OpenTelemetry is not enough.
> Loki is also just fine and much better than Elastic/OpenSearch.
Wait, there is more?
I'm scratching my head a little bit at what your expectation is here. Traces and real-user monitoring are not the same thing. Distributed tracing is specifically a microservices thing. Maybe all you're looking for is to just attach a UUID to each request and log it? Jaeger and Tempo aren't going to help you with frontend code.
> A lot of infrastructure
> Prometheus
You need something to tell you when your tiny app isn't running, so it can't be a library embedded into the app itself.
> Grafana
You need something with dashboards to help you understand what's going on. If your thing telling you your app has crashed is outside your app, the thing that helps visualize what happened before your app crashed also needs to be outside your app.
> Jaeger
Do you really need traces? Or just to log how long request times took, and have metrics for p50/p95/p99? (There's a rough sketch of the metrics route at the end of this comment.)
> Loki
If you're running only one instance of your app, you don't need it, just read through your logfiles with hl. If you have multiple machines, sending your logs to one place may not be necessary, but it's incredibly helpful; the alternative is basically ssh multiplexing...
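Rough sketch of the metrics-only route mentioned above, using the standard Go client (the metric name and port are made up; the point is that the external Prometheus scrapes /metrics, computes p50/p95/p99 with histogram_quantile() at query time, and alerts on up == 0 when the app stops answering, which the app itself never could):

    package main

    import (
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // Request latency histogram. p50/p95/p99 come out at query time via
    // histogram_quantile() over the exported buckets; nothing is computed in-app.
    var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "tinyapp_http_request_duration_seconds", // made-up name
        Help:    "HTTP request latency.",
        Buckets: prometheus.DefBuckets,
    })

    func handler(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        defer func() { requestDuration.Observe(time.Since(start).Seconds()) }()
        w.Write([]byte("ok"))
    }

    func main() {
        http.HandleFunc("/", handler)
        // Scraped by the external Prometheus, which also gives you the up metric.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }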
I love it when people take a hard stand like this, using the word "period".
BTW, Cortex is used by AWS as Amazon Managed Prometheus, probably at a much larger scale than Mimir.
OpenObserve too, is already being used at a multi-petabyte scale.
> Multi-petabyte
Sheer storage size is a meaningless point here, as longer retention requires more storage. There may or may not be compaction components that help speed up queries over larger windows, but that's irrelevant to the point that the queries will still succeed. I have no doubt that any of the solutions on the table will handle storing that much data.
The real scaling question is how many active timeseries the system can handle, at which resolution (per 15 seconds? per 60 seconds? worse?), and no, "we scale horizontally" doesn't mean much without more serious benchmarks.
> handle
What does it mean? Be able to ingest data? Be able to query?
- Ingest data: using Kafka helps only during ingestion, for handling spikes.
- Query data: Kafka has no role to play there.
Querying performantly at scale is a hard problem. I do not doubt Mimir's capability to query high volumes of data, but other systems can do it too, and OpenObserve's internal benchmarks show that its querying is much faster at scale than Mimir's. We will publish them at the right time (we don't publish benchmarks just to satisfy the idle curiosity of people on the internet), but this is not about OpenObserve, so let's set it aside for a while.
> how many active timeseries
We've built OpenObserve with a fundamentally different architecture. We don't have the "active timeseries" constraint that Prometheus-based systems do. High cardinality isn't an issue by design. It's a topic for another day though.
The primary function of a message broker is to decouple producer and consumer so writes can happen efficiently (consumers do not get bogged down by high incoming volume). Something like Kafka allows that very, very well, and it is one of the best systems designed to do it. It allows massive volumes of ingestion reliably without dropping packets. It's a beast on its own though.
Kafka was also built in an era when autoscaling was not available (it's still very relevant though, and will be for a very long time). Autoscaling can, to a great degree, allow you to handle write spikes (it's not the same thing, but it can attack the same problem from a different angle), and extreme spikes will still require a message broker. Horizontal scaling does cut it to a great degree though.
Having architected massive systems for multiple large companies, I can argue about technology for a long time, but the only point I want to drive is to avoid the use of words like "period". Mimir's architecture makes sense but it's not the only solution that works at scale, and the operational complexity has real costs. There are no absolutes in tech as in life.
If Mimir is the only one, why isn't Roblox, a Grafana Labs customer, using Mimir for monitoring? They're using VictoriaMetrics at a scale of approximately 5 billion active time series. See https://docs.victoriametrics.com/victoriametrics/casestudies....
No solution is perfect. Each one has its own trade-offs. That is why it triggers me when I see statements like this one.
- Thanos
- Mimir
- VictoriaMetrics
All of them provide a way to scale monitoring to insane numbers. The difference is in architecture, maintainability and performance. But make your own choices here.
Before that, I remember there was M3DB from Uber. But the project seems pretty dead now.
And there was the Cortex project, mostly maintained by Grafana Labs. But at some point they forked Cortex and named it Mimir. Cortex is now maintained by Amazon and, as I understand, is powering Amazon Managed Prometheus. However, I would avoid using Cortex exactly because it is now maintained by Amazon.
That is to say I agree with the author.
I used to be a fan of InfluxDB (back in the days of v1.x) then I went off it for exactly this reason.
I’d like to adjust this understanding. Kafka is the big new thing, but it’s optional. The previous way using gRPC still works.
I work on Mimir and other things at Grafana Labs.
"However, this architecture is set to be deprecated in a future release."
So it doesn't stay optional, unfortunately. It's quite a heavy dependency to include...
Having an S3-compatible store was already a fairly heavy dependency in terms of something to run correctly in production, it's just that most people don't even consider running their own object store at any real scale, they just go to cloud. Whereas running your own Kafka is something more platform teams are already attempting.
What's a bigger lock-in for me is metrics and PromQL - you just can't ever rename a poorly named metric or you face a world of pain. Or when Prometheus releases Native Histograms to replace the old ones, and suddenly everything from rules, alerts, ad-hoc queries and dashboards needs updating.
And PromQL is so opaque, it just never gives you an error unless there is a syntax issue. We need tools like https://github.com/cloudflare/pint just to know if my alert description isn't trying to render a label that's just not going to be there, etc.
PromQL should be blamed on prometheus though, not on grafana.
I just want the thing to alert me when something's down, and ideally if the check doesn't change and the datasource and metric don't change, the dashboard definition and the alert definition should be the same for the last and the next 10 years.
The UI used to have the 4-5 most important links in the sidebar; now it's 10 menus with submenus of submenus, and I never know where to find the basics: Dashboards and Alerts. When something goes off I don't have time to re-learn the UI I look at maybe once a month.
I understand updating some front-facing service due to a vulnerability... But for a thing that's internally accessible?
I know there is always the temptation to make it really shiny and nice. But the more moving parts your system has, the likelier it becomes that something will fail eventually. And as it happens, these failures usually occur at the time that is most inconvenient for all people involved.
That doesn't mean that complex software cannot work reliably, but it takes more effort for the developer side to honor that unwritten contract with their users (if they are even aware of it).
This is why sometimes doing it yourself, on your own servers, can be beneficial, because it gives you more control.
I've had a grafana + prometheus setup on my servers since like 2017. It worked then and works today. I log in maybe once every year or two to update to a newer LTS version. Every dashboard is still pristine, and nothing has ever broken.
I don't understand most of the words in the linked post and don't need to. The core package is the boring solution that 99% of people here need, and that works great.
How did you handle the angular deprecation in grafana? Or are you just staying on an older version that still supports it?
Though I won't say I loved doing it.
But it doesn't have a complete dashboard UI like Grafana.
- VictoriaMetrics for metrics. With Prometheus API support, so it integrates with Grafana using Prometheus datasource. It has its own Grafana datasource with extra functionality too.
- VictoriaLogs for logs. Integrates natively with Grafana using VictoriaLogs datasource.
- VictoriaTraces for traces. With Jaeger API support, so it integrates with Grafana using the Jaeger datasource.
All 3 solutions support alerting, are managed by the same team, are Apache2 licensed, and are focused on resource efficiency and simplicity.
I wonder why they think that is the case, as Zabbix has native support for monitoring VMs and Docker containers, including support for discovering newly spawned ones and all that jazz
> career-driven development
we don't have this and promote and reward as frequently for "I've done solid operations" as we do for "I've added this feature" (I'm on promotion committees and can state this confidently).
what we do have is high autonomy for engineers. This autonomy means it's a freedom that engineers have to identify problems they feel are important and to work on them, they do not need permission and leadership do not veto this. Some of the best features in the last few years have been a direct result of this autonomy, it's one of the things that makes working here so attractive to many of the engineers. But, with autonomy comes a little chaos, and not everything that is done is going to satisfy every end user of OSS or paid customer (of which these are a small percent of the whole).
a lot of the innovation speed is just in the DNA of the company, even the creation of Grafana can be traced to a desire to get things done; Torkel wanted Kibana to also work for Prometheus, Kibana declined to add this, Torkel didn't stand still and added things to a fork of Kibana now called Grafana and hasn't stopped adding things since.
> They also deprecated Angular within Grafana and switched to React for dashboards. This broke most existing dashboards.
we did, I think the entire journey was 7 years long, communicated many times, over at least 6 major releases. maintaining dashboards in two languages increased complexity, whilst reducing compatibility, and gave a very large security surface to be worried about. we communicated clearly, provided migration tools, put it in release notes, updated docs, repeated it at conferences and on community calls.
arguably we went too slow, and should've ripped the band-aid off, but we were sensitive to the fact that it was a breaking change and so we proceeded with extreme caution. it's done now, it was finally completed in the last version, only a very small number of users reported impact as a result of the time and care taken on this.
> I just hope OTEL settles, gets stable and boring fast
this is distinct from Grafana, but it's a good point... OTel is the product of virtually every vendor at this point, and a hell of a lot of engineers, it now has a lot of momentum and the pace is unlikely to ease up due to the sheer number of contributions and things that OTel as a community wishes to achieve.
the most likely eventuality is that enough stability emerges to allow vendors (including but not limited to Grafana Labs) to abstract away the pace of innovation occurring underneath, but this is in tension with providing the benefits of the innovation to the people that use it.
what I would say is that for most people the boring and slow path does still exist, and it's still good... just use Prometheus, a logging option of your choice, and simple Grafana dashboards and alerts. that combination hasn't varied in years, and those on it today are still immune from caring about the pace of innovation and change in OTel and across the Observability industry. OTel is being used in production at massive scale by lots of companies, but whether your project or company needs to move to it now reflects your priorities. many are adopting it to gain independence from vendors, or just control over their telemetry, but many customers are also saying they're happy to stay on the slow and boring path and for everything to work predictably with low cost to keep pace... that works too.
This is the worst reason to migrate to the OTEL format for metrics, since every vendor and every solution for metrics has its own set of transformation rules for the ingested OTEL metrics before saving them into the internal storage (this is needed in order to align OTEL metrics with the internal data model unique to each vendor / service). These transformation rules are incompatible among vendors and services. Also, every vendor / service may have its own querying API. This means that users cannot easily migrate from one vendor / service to another one by just switching from the old format to the OTEL format for metrics transfer. Read more about this at https://x.com/valyala/status/1982079042355343400
If this migration turned out to be so painful, why did you decide to finish it (and make users unhappy) instead of cancelling it at an early stage? What are the benefits of this migration?