Observability's past, present, and future

https://blog.sherwoodcallaway.com/observability-s-past-present-and-future/

79•shcallaway•1mo ago

Comments

buchanae•1mo ago

I share a lot of this sentiment, although I struggle more with the setup and maintenance than the diagnosis.

It's baffling to me that it can still take _so_much_work_ to set up a good baseline of observability (not to mention the time we spend on tweaking alerting). I recently spent an inordinate amount of time trying to make sense of our telemetry setup and fill in the gaps. It took weeks. We had data in many systems, many different instrumentation frameworks (all stepping on each other), noisy alerts, etc.

Part of my problem is that the ecosystem is big. There's too much to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I don't know, maybe I'm biased by the JVM-heavy systems I've been working in.

I worked for New Relic for years, and even in an observability company, it was still a lot of work to maintain, and even then traces were not heavily used.

I can definitely imagine having Claude debug an issue faster than I can type and click around dashboards and query UIs. That sounds fun.

pphysch•1mo ago

> Part of my problem is that the ecosystem is big. There's too much to learn: OpenTelemetry, OpenTracing, Zipkin, Micrometer, eBPF, auto-instrumentation, OTel SDK vs Datadog Agent, and on and on. I don't know, maybe I'm biased by the JVM-heavy systems I've been working in.

We've had success keeping things simple with VictoriaMetrics stack, and avoiding what we perceive as unnecessary complexity in some of the fancier tools/standards.

shcallaway•1mo ago

I completely agree w/ your points about why observability sucks: - Too much setup - Too much maintenance - Too steep of a learning curve

This isn't the whole picture, but it's a huge part of the picture. IMO, observability shouldn't be so complex that it warrants specialized experience; it should be something that any junior product engineer can do on their own.

> I can definitely imagine having Claude debug an issue faster than I can type and click around dashboards and query UIs. That sounds fun.

Working on it :)

tech_ken•1mo ago

> Observability made us very good at producing signals, but only slightly better at what comes after: interpreting them, generating insights, and translating those insights into reliability.

I'm a data professional who's kind of SRE adjacent for a big corpo's infra arm and wow does this post ring true for me. I'm tempted to just say "well duh, producing telemetry was always the low hanging fruit, it's the 'generating insights' part that's truly hard", but I think that's too pithy. My more reflective take is that generating reliability from data lives in a weird hybrid space of domain knowledge and data management, and most orgs headcount strategy don't account for this. SWEs pretend that data scientists are just SQL jockeys minutes from being replaced by an LLM agent; data scientists pretend like stats is the only "hard" thing and all domain knowledge can be learned with sufficient motivation and documentation. In reality I think both are equally hard, it's rare that you find someone who can do both, and that doing both is really what's required for true "observability".

At a high level I'd say there are three big areas where orgs (or at least my org) tend to fall short:

* extremely sound data engineering and org-wide normalization (to support correlating diverse signals with highly disparate sources during root-cause)

* telemetry that's truly capable of capturing the problem (ie. it's not helpful to monitor disk usage if CPU is the bottleneck)

* true 'sleuths' who understand how to leverage the first two things to produce insights, and have the org-wide clout to get those insights turned into action

I think most orgs tend to pick two of these, and cheap out on the third, and the result is what you describe in your post. Maybe they have some rockstar engineers who understand how to overcome the data ecosystem shortcomings to produce a root-cause analysis, or maybe they pay through the nose for some telemetry/dashboard platform that they then hand over to contract workers who brute-force reliability through tons of work hours. Even when they do create dedicated reliability teams, it seems like they are more often than not hamstrung by not having any leverage with the people who actually build the product. And when everything is a distributed system it might actually be 5 or 6 teams who you have no leverage with, so even if you win over 1 or 2 critical POCs you're left with an incomplete patchwork of telemetry systems which meet the owning team's (teams') needs and nothing else.

All this to say that I think reliability is still ultimately an incentive problem. You can have the best observability tooling in the world, but if don't have folks at every level of the org who understand (a) what 'reliable' concretely looks like for your product and (b) have the power to effect necessary changes then you're going to get a lot of churn with little benefit.

ghaff•1mo ago

It's a long running topic in a lot of areas. I remember back when data warehousing was the hot thing, collecting and cleaning all this data was supposed to be the key to insights that would unlock juicy profits. Basically didn't happen.

shcallaway•1mo ago

This is a super insightful comment & there is a bunch that I want to respond to but I can't do it all neatly in one comment. Hahaha

I'll choose this point:

> reliability is still ultimately an incentive problem

This is a fascinating argument and it feels true.

Think about it. Why do companies give a shit about reliability at all? They only care b/c it impacts bottom line. If the app is "reliable enough" such that customers aren't complaining and churning, it makes sense that the company would not make further investments in reliability.

This same logic is true at all levels of the organization, but the signal gets weaker as you go down the chain. A department cares about reliability b/c it impacts the bottom line of the org, but that signal (revenue) is not directly and attributable to the department. This is even more true for a team, or an individual.

I think SLOs are, to some extent, a mechanism that is designed to mitigate this problem; they serve as stronger incentive signals for departments and teams.

donavanm•1mo ago

I'd +1 incentives, primarily P&L/revenue/customer acquisition/retention, with a small carve out for "culture." I've worked places, and for people, where the culture was to "do the right thing" or focus on user experience as the objective which influenced decisions like paying more (time and money) for better support. For the SDEs and line teams it wasnt about revenue or someone yelling at them, they just emulated the behavior they saw around them which led to better observability/introspection/reliable/support. Which, of course, we'd like to believe leads to long term to success and $$$$.

I also like the call out of SLOs (or OKR or SMART goals or whatever) as a mechanism to broadcast your priorities and improve visibility. BUT I've also worked places where they didnt work because the ultimate owner with a VP title didnt care or understand to buy in to it.

And of course theres the hazard of principal agent problems between those selling, buying, building, and running are probably different teams and may not have any meaningful overlap in directly responsible individual.

hommes-r•1mo ago

I would add that "extremely sound data engineering" is also necessary to make observability cost-effective. Some of these otel platforms can burn 10%-25% of your cloud budget to show you your logs. That is insane.

vrnvu•1mo ago

First. Love that more tools like Honeycomb (amazing) are popping up in the space. I agree with the post.

But. IMO, statistics and probability can’t be replaced with tooling. As software engineering can’t be replaced with no-code services to build applications…

If you need to profile some bug or troubleshoot complex systems (distributed, dbs). You must do your math homework consistently as part of the job.

If you don’t comprehend the distribution of your data, the seasonality, noise vs signal; how can you measure anything valuable? How can you ask the right questions?

shcallaway•1mo ago

Vibe-coders don't comprehend how the code works, yet they can create impressive apps that are completely functional.

I don't see why the same isn't true for "vibe-fixers" and their data (telemetry).

jbs789•1mo ago

The distinction is originality vs replicating existing.

I believe the author is in the former camp.

Veserv•1mo ago

Of course that sucks. Just enable full time-travel recording in production and then you can use a standard multi-program trace visualizer and time travel debugger to identify the exact execution down to the instruction and precisely identify root causes in the code.

Everything is then instrumented automatically and exhaustively analyzable using standard tools. At most you might need to add in some manual instrumentation to indicate semantic layers, but even that can frequently be done after the fact with automated search and annotation on the full (instruction-level) recording.

shcallaway•1mo ago

You're not the first person I've met that has articulated an idea like this. It sounds amazing. Do you have an idea about why this approach isn't broadly popular?

donavanm•1mo ago

cost and compliance are non-trivial for non-trivial applications. Universal instrumentation and recording creates a meaningful fixed cost for every transaction, and you must record ~every transaction; you can't sample & retain post-hoc. If you're processing many thousands of TPS on many thousands of nodes that quickly adds up to a very significant aggregate cost even if the individual cost is small.

For compliance (or contractual agreement) there are limitations on data collection, retention, transfer, and access. I certainly don't want private keys, credentials, or payment instruments inadvertently retained. I dont want confidential material to be distributed out of band or in an uncontrolled manner (like your dev laptop). I probably don't even want employees to be able to _see_ "customer data." Which runs head long in to a bunch challenges where low level trace/sampling/profiling tools have more less open access to record and retain arbitrary bytes.

Edit: Im a big fan of continuous and pervasive observability and tracing data. Enable and retain that at ~debug level and filter + join post-hoc as needed. My skepticism above is about continuous profiling and recording (ala vtune/perf/ebpf), which is where "you" need to be cognizant of risks & costs.

esafak•1mo ago

We need more automation. Less data, more insight. We're at the firehose stage, and nobody's got time for that. ML-based anomaly detection is not widespread and automated RCA barely exists. We'll have solved the problem when AI detects the problem and submits the bug fix before the engineers wake up.

shcallaway•1mo ago

You're so right.

> We'll have solved the problem when AI detects the problem and submits the bug fix before the engineers wake up.

Working on it :)

antsou•1mo ago

Observability and APM are way, way older than depicted in this simplistic post!

shcallaway•1mo ago

Hello! Yes, you are right - observability and APM have both been around for many decades, but the incarnations that most people are familiar with are the ones that emerged in the 2010s.

My intention wasn't for this post to be a comprehensive historical record. That would have taken many more words & would have put everyone to sleep. My goal was to unpack and analyze _modern observability_ - the version that we are all dealing w/ today.

Good point though!

camel_gopher•1mo ago

2006 - Bryan Cantrill publishes this work on software observability https://queue.acm.org/detail.cfm?id=1117401

2015 - Ben Sigelman (one of the Dapper folks) cofounds Lightstep

shcallaway•1mo ago

These are great. I should have included them in my timeline!

Huge fan of historical artifacts like Cantrill's ACM paper

kazsa•1mo ago

Strong post. The mismatch between observability effort and reliability gains feels real. The bottleneck seems cognitive, not technical.

Kinrany•1mo ago

Log stores should be first-class members, allowing other services to read logs too, not just write.

We need better primitives and protocols for event-based communication. Logging configuration should be mostly about routing and storage.

kenyuz•1mo ago

This doesn't mention a possible future. Honeycomb is a columnar database that ingests wide spans. Add richer context and pass it to an AI SRE tool. This seems like a data engineering issue

ptx•1mo ago

Wait a minute...

> with the rise of [...] microservices, apps were becoming [...] too complex for any individual to fully understand.

But wasn't the idea of microservices that these services would be developed and deployed independently and owned by different teams? Building a single app out of multiple microservices and expecting a single individual to debug it sounds like holding it wrong, which then requires this distributed tracing solution to fix it.

valyala•1mo ago

Coroot is the future of observability

skybrian•1mo ago

Could you say more about your experience and what you like about it?

valyala•1mo ago

It monitors all the applications in your network and automatically detects the most common issues with the applications. It also collects metrics, traces, logs and CPU profiles for the monitored applications, so you could quickly investigate the root cause of various issues if needed.

I like that Coroot works out of the box without the need in complicated configuration.

nijave•1mo ago

Mostly agree although I think ease of instrumentation is getting pretty good. At least in the Python ecosystem, you set some env vars and run opentelemtry-bootstrap and it spits out a list of packages to add. Then you run your code with the otel cli wrapper and it just works.

Datadog is equally as easy.

That alone gets you pretty good canned dashboards on vendors that have built in APM views.

The rest definitely rings true and I suspect some of it has come with the ease of software development. You need to know less about computer fundamentals and debugging with the proliferation of high level frameworks, codegen, and AI.

I also have noticed a trend that brings observability closer to development with IDE integration which I think is a good direction. Having the info "silo'd" in an opaque mega data store isn't useful.

hrimfaxi•1mo ago

The timeline seems contrived for narrative purposes. It mentioned Honeycomb's founding as "one of the first managed tracing platforms". But honeycomb didn't have tracing until 2018 https://www.prnewswire.com/news-releases/honeycomb-launches-...

New wave of GLP-1 drugs is coming–and they're stronger than Wegovy and Zepbound

Convert tempo (BPM) to millisecond durations for musical note subdivisions

Show HN: Tasty A.F.

The Contagious Taste of Cancer

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Bithumb mistakenly hands out $195M in Bitcoin to users in 'Random Box' giveaway

Beyond Agentic Coding

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

OpenBSD Copyright Policy

OpenClaw Creator: Why 80% of Apps Will Disappear

What Happens When Technical Debt Vanishes?

AI Is Finally Eating Software's Total Market: Here's What's Next

Computer Science from the Bottom Up

Show HN: A toy compiler I built in high school (runs in browser)

You don't need Mac mini to run OpenClaw

Learning to Reason in 13 Parameters

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

Ask HN: Will GPU and RAM prices ever go down?

From hunger to luxury: The story behind the most expensive rice (2025)

Substack makes money from hosting Nazi newsletters

A New Crypto Winter Is Here and Even the Biggest Bulls Aren't Certain Why

Moltbook was peak AI theater

Why Claude Cowork is a math problem Indian IT can't solve

Show HN: Built an space travel calculator with vanilla JavaScript v2

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

These White-Collar Workers Actually Made the Switch to a Trade

The Wonder Drug That's Plaguing Sports

Show HN: Which chef knife steels are good? Data from 540 Reddit tread

Federated Credential Management (FedCM)

New wave of GLP-1 drugs is coming–and they're stronger than Wegovy and Zepbound

Convert tempo (BPM) to millisecond durations for musical note subdivisions

Show HN: Tasty A.F.

The Contagious Taste of Cancer

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Bithumb mistakenly hands out $195M in Bitcoin to users in 'Random Box' giveaway

Beyond Agentic Coding

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

OpenBSD Copyright Policy

OpenClaw Creator: Why 80% of Apps Will Disappear

What Happens When Technical Debt Vanishes?

AI Is Finally Eating Software's Total Market: Here's What's Next

Computer Science from the Bottom Up

Show HN: A toy compiler I built in high school (runs in browser)

You don't need Mac mini to run OpenClaw

Learning to Reason in 13 Parameters

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

Ask HN: Will GPU and RAM prices ever go down?

From hunger to luxury: The story behind the most expensive rice (2025)

Substack makes money from hosting Nazi newsletters

A New Crypto Winter Is Here and Even the Biggest Bulls Aren't Certain Why

Moltbook was peak AI theater

Why Claude Cowork is a math problem Indian IT can't solve

Show HN: Built an space travel calculator with vanilla JavaScript v2

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

These White-Collar Workers Actually Made the Switch to a Trade

The Wonder Drug That's Plaguing Sports

Show HN: Which chef knife steels are good? Data from 540 Reddit tread

Federated Credential Management (FedCM)

Observability's past, present, and future

Comments