OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)

https://quesma.com/blog/introducing-otel-bench/
39•stared•1h ago

Comments

whalesalad•1h ago
If everyone else is the problem... maybe you are the problem. To me this says more about OTel than AI.
apercu•55m ago
Can you help me understand where you are coming from? Is it that you think the benchmark is flawed or overly harsh? Or that you interpret the tone as blaming AI for failing a task that is inherently tricky or poorly specified?

My takeaway was more "maybe AI coding assistants today aren’t yet good at this specific, realistic engineering task"....

hobofan•12m ago
In my experience many OTEL libraries are awful to use, and most of the "official" ones are the worst offenders, as they are largely codegened. That typically makes them feel clunky to use, and they exhibit code patterns that are non-native to the language, which would be an explanation of why AI systems struggle with the benchmark.

I think you would see similar results if you tasked an AI to e.g. write gRPC/Protobuf systems using only the built-in/official protobuf codegen for each language.

Where I think the benchmark is quite fair is in the solutions. It looks like for each of the languages (at least the ones I'm familiar with), the "better" options were chosen, e.g. using `tracing-opentelemetry` rather than `opentelemetry-sdk` directly in Rust.

However, the one-shot nature of the benchmark also isn't that reflective of actual utility. In my experience, if you have the initial framework setup done in your repo plus a handful of examples, models do a great job of applying OTEL tracing to the majority of your project.
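
As a concrete illustration of that "handful of examples" point, here is a minimal Go sketch (the `orders` service, `Order` type, and `loadOrder` helper are made up for illustration) of the per-handler instrumentation pattern that, once present a few times in a repo, models tend to be able to imitate across the rest of the codebase:

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

type Order struct {
	ID string `json:"id"`
}

// loadOrder is a stand-in for a real database or downstream call.
func loadOrder(ctx context.Context, id string) (Order, error) {
	if id == "" {
		return Order{}, errors.New("missing id")
	}
	return Order{ID: id}, nil
}

// getOrder shows the per-handler pattern: start a span named after the
// business operation, attach a domain attribute, and record errors on it.
func getOrder(w http.ResponseWriter, r *http.Request) {
	ctx, span := otel.Tracer("orders").Start(r.Context(), "GetOrder")
	defer span.End()

	id := r.URL.Query().Get("id")
	span.SetAttributes(attribute.String("order.id", id))

	order, err := loadOrder(ctx, id)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "load failed")
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	json.NewEncoder(w).Encode(order)
}

func main() {
	http.HandleFunc("/orders", getOrder)
	http.ListenAndServe(":8080", nil)
}
```

The pattern is mostly mechanical: one span per business operation, domain attributes on the span, errors recorded on the span.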

vimda•53m ago
But not everyone else is the problem? OTel works well for humans. Sometimes AIs are just shit
devin•12m ago
It's not a new thing to bring up that OTel is difficult to get correct. This was a criticism levied before the AI era.
jcims•57m ago
I've been building an 'SRE agent' with LangGraph for the past couple of weeks and honestly I've been incredibly impressed with the ability of frontier models, when properly equipped with useful tools and context, to quickly diagnose issues and suggest reasonable steps to remediate. Primary tooling for me is access to source code, the CI/CD environment, and the infrastructure control plane. Some cues in the context to convey basic conventions really help.

Even when it's not particularly effective, the additional information provided tends to be quite useful.

dgxyz•52m ago
Our humans struggle with them too. It's the only domain where you actually need to know everything.

I wouldn’t touch this with a pole if our MTTR was dependent on it being successful though.

vasco•48m ago
I can say, as someone who has been doing this job for a while, that it's starting to be useful in many SRE-related domains, making parts of the job easier.

MCP servers for monitoring tools are making our developers more competent at finding metrics and issues.

It'll get there, but nobody is going to type "fix my incident" in production and have a nice time today, outside of the simplest things, which, if they were possible to fix like this, could have been automated already anyway. But getting from a written runbook to automation sometimes takes time, so those use cases will grow.

another_twist•48m ago
Fuck OTel. Such a stupid protocol. Histograms have delta or cumulative: if delta, do this; if cumulative, do that. Sums are the same. Gauges are kinda okay. But everything else is just oh my ducking god. All collectors have dotted metric names and ducking Prometheus needs underscores. What the actual fuck.
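
For readers who haven't hit this: OTLP sums and histograms carry an aggregation temporality, and a consumer has to branch on whether points are deltas or running totals. A deliberately simplified Go sketch of that branching, using hypothetical types rather than the real OTel/collector data model:

```go
package main

import "fmt"

// Hypothetical, simplified model of an OTLP sum data point, only to
// illustrate the delta-vs-cumulative branching the comment above is
// complaining about. Real pipelines use the collector's pdata types.
type Temporality int

const (
	Delta Temporality = iota
	Cumulative
)

type SumPoint struct {
	Temporality Temporality
	Value       float64
}

// toCumulative folds a stream of points into one running total,
// which is what a Prometheus-style backend ultimately stores.
func toCumulative(points []SumPoint) float64 {
	var total float64
	for _, p := range points {
		switch p.Temporality {
		case Delta:
			// Delta points carry only the change since the last report,
			// so the consumer has to accumulate them.
			total += p.Value
		case Cumulative:
			// Cumulative points already carry the running total,
			// so the latest point replaces the previous state.
			total = p.Value
		}
	}
	return total
}

func main() {
	deltas := []SumPoint{{Delta, 2}, {Delta, 3}}
	cumulative := []SumPoint{{Cumulative, 2}, {Cumulative, 5}}
	fmt.Println(toCumulative(deltas), toCumulative(cumulative)) // 5 5
}
```

Real pipelines additionally deal with start timestamps, counter resets, and metric-name translation, which is where much of the friction in the comment above comes from.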
asyncadventure•48m ago
This aligns with my experience trying to automate observability tasks - AI excels at individual coding patterns but struggles with the holistic understanding needed for distributed tracing. The 29% success rate actually seems optimistic considering how OpenTelemetry requires deep context about service boundaries and business logic, not just syntactic correctness.
jakozaur•16m ago
In this benchmark, the microservices are really small, ~300 lines, and sometimes there are just two of them. More realistic tasks (larger codebases, more microservices) would have a lower success rate.
winton•46m ago
So if I try to do it with Opus three or four times, I'll get it done? And probably in about 10 minutes? Awesome
throwup238•40m ago
That’s only if the failures are truly random and aren’t correlated
stared•23m ago
Nope, these are not random dice rolls. Some tasks are solved on every run, a few only occasionally (for those it would be meaningful to try a few times, and the pass@1 and pass@3 metrics would differ), but most are never solved.

See e.g.: https://quesma.com/benchmarks/otel/models/claude-opus-4.5/

AnotherGoodName•43m ago
This is a little damning of the way Google does things honestly.

>When an app runs on a single machine, you can often trace an error by scrolling through a log file. But when it runs across 50 microservices, that single request gets scattered into a chaotic firehose of disconnected events.

Yep this is about Google. It's painful for humans to debug and it's also an extremely bespoke issue to deal with. No one else has quite the same level of clusterfuck and there's going to be no training for LLMs on this.

youknownothing•40m ago
isn't that what trace IDs are for?
belval•7m ago
Yeah I don't know their stack but I have a service that is a collection of microservices and Opus can debug them fine by aggregating the logs tied to the same faulty request ID.

In general for those tasks though the question is more "How would a human do it". If it's impossible for a human because your tooling is so bad you can't even get the logs across services for a single ID, that seems like a pretty serious design issue.

In general looking at the prompt though, this is also not very representative. You don't have an SOP that you can share with your agent? How do you expect new hires to onboard?
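
A minimal Go sketch of the log-correlation approach described in the comment above, assuming some OTel HTTP instrumentation (e.g. otelhttp) has already started a span for the incoming request; the middleware and log field names are made up for illustration:

```go
package main

import (
	"log/slog"
	"net/http"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// withTraceID pulls the current trace ID out of the request context and
// attaches it to every log line, so logs emitted by different services
// handling the same request can later be aggregated under one ID.
func withTraceID(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		l := logger
		if sc := trace.SpanContextFromContext(r.Context()); sc.HasTraceID() {
			l = logger.With("trace_id", sc.TraceID().String())
		}
		l.Info("handling request", "path", r.URL.Path)
		next.ServeHTTP(w, r)
	})
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", withTraceID(logger, mux))
}
```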

pixl97•6m ago
Much like nested errors, management of trace IDs becomes difficult at scale, as you will start getting multiple correlation references in complex systems.
tayo42•30m ago
It's bespoke to debug across multiple services?

This seems like typical work in any business that isn't trivial.

whynotminot•36m ago
I would wager the main reason for this is the same reason it’s also hard to teach these skills to people: there’s not a lot of high quality training for distributed debugging of complex production issues. Competence comes from years of experience fighting fires.

Very few people start their careers as SREs, it’s generally something they migrate into after enjoying it and showing aptitude for it.

With that said, I wouldn’t expect this wall to hold up for too long. There has been a lot of low hanging fruit teaching models how to code. When that is saturated, the frontier companies will likely turn their attention to honing training environments for SRE style debug.

lysace•29m ago
> With that said, I wouldn’t expect this wall to hold up for too long.

The models are already so good at the traditionally hard stuff: collecting that insane amount of detailed knowledge across so many different domains, languages and software stacks.

raincole•21m ago
Original title: Benchmarking OpenTelemetry: Can AI trace your failed login?

HN Editorialized: OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)

The task:

> Your task is: Add OTEL tracing to all microservices.

> Requirements:

> Instrumentation should match conventions and well-known good practices.

> Instrumentation must match the business domain of the microservices.

> Traces must be sent to the endpoint defined by a standard OTEL environment variable.

> Use the recent version of the OTEL SDK.

I really don't think anything involving multiple microservices can be called 'simple', even for humans. Perhaps it is to an expert who knows the specific business's domain.
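
For a sense of scale, here is roughly what the last two quoted requirements boil down to for a single service, as a hedged Go sketch using the OTLP/gRPC exporter (which reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment); the "checkout" service name and span name are made-up examples:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracing wires up a tracer provider that exports OTLP over gRPC.
// The exporter picks up OTEL_EXPORTER_OTLP_ENDPOINT (and related variables)
// from the environment, i.e. the "standard OTEL environment variable".
func initTracing(ctx context.Context, service string) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, err
	}
	res, err := resource.Merge(resource.Default(),
		resource.NewWithAttributes(semconv.SchemaURL,
			semconv.ServiceNameKey.String(service)))
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracing(ctx, "checkout")
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx)

	// One hand-written span; real instrumentation would also cover
	// HTTP/gRPC handlers and outgoing calls.
	_, span := otel.Tracer("checkout").Start(ctx, "place-order")
	span.End()
}
```

The per-service boilerplate is small; the harder part the benchmark seems to probe is doing this consistently across services and matching the business domain in span names and attributes.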

pixl97•12m ago
As someone whose job is support more than SWE, I agree with this.

I've had to work in systems where events didn't share correlation IDs; I had to go in and filter entries down to the microsecond to get a small enough set of entries that I could trace what actually happened between a set of services.

From what I've seen on the enterprise software side of the world, a lot of companies are particularly bad at SRE, and there isn't a great amount of standardization.

chaps•5m ago
Having done app support across many environments, um - yes, multiple microservices is usually pretty simple. Just look at the open file/network handles and go from there. It's absolutely maddening to watch these models flail in trying to do something as basic as "check if the port is open" or "check if the process is running... and don't kill firefox this time".

These aren't challenging things to do for a human at all. But it's such a huge pain point for these models!

yomismoaqui•14m ago
I'm a human with 20+ years of experience and making OTEL work on Go was painful.

It made me remember when I was working in the J2EE ecosystem (shudder).

the_duke•13m ago
This is very confusingly written.

From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code!

Some of the instructions don't give any guidance on how to do it; some specify which libraries to use.

I'd be very curious HOW exactly the models fail.

Are the test sets just incredibly specific about what output they expect, so you get a lot of failures because of tiny, subtle mismatches? Or do they just get the instrumentation categorically wrong?

Also important: do the models have access to a web search tool to read the library docs? OTel libraries are often complicated to use... without reading the latest docs or source code this would be quite tricky.

Some models have gotten better at adding dependencies, installing them and then reading the code from the respective directory where dependencies get stored, but many don't do well with this.

All in all, I'm not so sure how valuable this benchmark is.

I'd be much more interested in tasks like:

Here are trace/log outputs, here is the source code; find and fix the bug.

pixl97•8m ago
> Some of the instructions don't give any guidance on how to do it; some specify which libraries to use.

In supporting a piece of cloud software with a lot of microservices, I think this is a more generalized problem for humans. The app I work with mandated some logging requirements, like which library to use, but that was it; different parts built by different teams ended up with all kinds of different behaviors.

As for the AI side, this is something where I see our limited context sizes causing issues when developing architecture across multiple products.

linuxftw•7m ago
The prompts for this are pretty sparse. This could 100% be accomplished with better prompting. Even with the current prompts, it's likely I could complete the task with a follow-up request specifying what it did correctly and incorrectly. In fact, this could probably be entirely automated with multiple agents checking each other.
NitpickLawyer•5m ago
I'm always interested in new benchmarks, so this is cool. I only had a brief look at [1] and [2], a few quick things that I noticed:

For [1]: instruction.md is very brief, quite vague and "assumes" a lot of things.

- Your task is: Add OTEL tracing to all microservices. Add OTEL logging to all microservices. (this is good)

- 6.I want to know if the microservice has OTEL instrumentation and where the data is being sent. (??? i have no idea what this means)

- 9.Use the recent version of the OTEL SDK. (yeah, this won't work unless you also use an MCP like context7 or provide local docs)

What's weird here is that instruct.md has 0 content regarding conventions, specifically how to name things. Yet in tests_outputs you have this "expected_patterns = ["order", "stock", "gateway"]" and you assert on it. I guess that makes some sense, but being specific in the task.md is a must. Otherwise you're benching assumptions, and those don't even work with meatbags :)

For [2]: instruction.md is more detailed, but has some weird issues:

- "You should only be very minimal and instrument only the critical calls like request handlers without adding spans for business calls \n The goal is to get business kind of transaction" (??? this is confusing, even skipping over the weird grammar there)

- "Draw ascii trace diagram into /workdir/traces.txt" (????)

- "When modifying Python files, use Python itself to write files or use sed for targeted changes" (? why are you giving it harness-specific instructions in your instruct.md? this is so dependent on the agentic loop used, that it makes no sense here.

- "Success Criteria: Demonstrate proper distributed tracing \n Include essential operations without over-instrumenting (keep it focused) \n Link operations correctly \n Analyze the code to determine which operations are essential to trace and how they relate to each other. (i mean ... yes and no. these are not success criteria IMO. It's like saying "do good on task not do bad". This could definitely be improved.)

----

Also, I noticed that every folder has a summary_claude... that looks like a Claude-written summary of a run. I hope that's not what's actually used in computing the benchmark scores; if it is, you're adding another layer of uncertainty in checking the results...

The idea is nice, but tbf some of the tests seem contrived, your instructions are not that clear, you expect static naming values while not providing any instructions about naming conventions, and so on. It feels like a lot of this was "rushed"? I peeked a bit at the commit history and saw some mentions of vibe-coding a viewer for this. I hope that's the only thing that was vibe-coded :)

[1] - https://github.com/QuesmaOrg/otel-bench/tree/main/datasets/o...

[2] - https://github.com/QuesmaOrg/otel-bench/blob/main/datasets/o...
