
We gave terabytes of CI logs to an LLM

https://www.mendral.com/blog/llms-are-good-at-sql
48•shad42•1h ago

Comments

verdverm•1h ago
This is one of those HN posts you share internally in the hopes you can work this into your sprint
sollewitt•1h ago
But does it work? I’ve used LLMs for log analysis and they have been prone to hallucinating root causes: depending on the logs, the distance between cause and effect can be larger than the context window; when things go badly wrong we’re usually dealing with multiple failures at once; and plenty of benign issues throw scary-sounding errors.
verdverm•1h ago
It can. Like all the other tasks, it's not magic: you need to make the agent's job easier by giving it good instructions, tools, and environments. It's exactly the same thing that makes humans' lives easier too.

This post is a case study that shows one way to do this for a specific task. We found an RCA for a long-standing problem with our dev boxes this week using AI. I fed Gemini Deep Research a few logs and our tech stack, and it came back with an explanation of the underlying interactions, debugging commands, and the most likely fix. It was spot on. GDR is one of the best debugging tools for problems you don't fully understand.

If you are curious, and perhaps as a PSA: the issue was that Docker and Tailscale were competing over iptables updates, and in rare circumstances (one dev, once every few weeks) Docker DNS would get borked. The fix is to have NetworkManager ignore Docker-managed interfaces so Tailscale stops trying to do things with them.
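For anyone hitting the same conflict, a NetworkManager drop-in along these lines should do it (the interface-name globs are assumptions about a typical Docker setup; adjust to your interfaces):

```ini
# /etc/NetworkManager/conf.d/99-unmanage-docker.conf
# Leave Docker-managed interfaces alone so NetworkManager
# (and anything reacting to its events) stops touching them.
[keyfile]
unmanaged-devices=interface-name:docker0;interface-name:veth*;interface-name:br-*
```

Then reload with `sudo systemctl reload NetworkManager` and confirm via `nmcli device` that the Docker interfaces show as unmanaged.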

sollewitt•50m ago
Thanks - that’s the maddening thing with flakes - is it the thing under test or the thing doing the testing? Hermeticity is a lie we tell ourselves :)
shad42•1h ago
Mendral co-founder here. We built this infra to have our agent detect CI issues like flaky tests and fix them. Observing logs is useful for detecting anomalies, but we also use them to confirm a fix after the agent opens a PR (we have long coding sessions that verify a fix and re-run the CI if needed, all in the same agent loop).

So yes it works, we have customers in production.

aluzzardi•1h ago
Post author here.

Yes, it works really well.

1) The latest models are radically better at this. We noticed a massive improvement in quality starting with Sonnet 4.5

2) The context issue is real. We solve this by using sub agents that read through logs and return only relevant bits to the parent agent’s context
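A minimal sketch of that sub-agent pattern: the parent fans a huge log out in chunks, and each sub-agent returns only the relevant bits. The chunk size, the failure patterns, and the regex stand-in for an actual model call are all assumptions, not the post's implementation:

```python
import re

# Stand-in for an LLM sub-agent call: in the real system each chunk
# would go to a model; here a regex filter plays that role (assumption).
FAILURE_PATTERNS = re.compile(r"\b(error|fail(ed|ure)?|panic|timeout)\b", re.I)

def sub_agent_digest(chunk_lines, context=1):
    """Return only the lines around apparent failures in one log chunk."""
    keep = set()
    for i, line in enumerate(chunk_lines):
        if FAILURE_PATTERNS.search(line):
            keep.update(range(max(0, i - context),
                              min(len(chunk_lines), i + context + 1)))
    return [chunk_lines[i] for i in sorted(keep)]

def parent_agent(log_text, chunk_size=1000):
    """Fan the log out to sub-agents; only their digests enter parent context."""
    lines = log_text.splitlines()
    digest = []
    for start in range(0, len(lines), chunk_size):
        digest.extend(sub_agent_digest(lines[start:start + chunk_size]))
    return digest
```

The parent never sees the raw log, only the concatenated digests, which is what keeps a terabyte-scale input inside a fixed context budget.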

sollewitt•55m ago
I would be more interested in reading about this kind of orchestration and filtering than about the data acquisition, if you have the energy for another post :)
shad42•52m ago
We started writing very recently: https://www.mendral.com/blog - there is another post from yesterday about the overall architecture. And we have a long list of things we're planning to write about in more detail.

Taking good note of your comment :)

kburman•7m ago
Honestly, with recent models, these types of tasks are very much possible. Now it mostly depends on whether you are using the model correctly or not.
dbreunig•1h ago
Check out “Recursive Language Models”, or RLMs.

I believe this method works well because it turns a long context problem (hard for LLMs) into a coding and reasoning problem (much better!). You’re leveraging the last 18 months of coding RL by changing your scaffold.

koakuma-chan•1h ago
This seems really weird to me. Isn't that just using LLMs in a specific way? Why come up with a new name "RLM" instead of saying "LLM"? Nothing changes about the model.
vimda•47m ago
RLMs are a new architecture, but you can mimic an RLM by providing the context through a tool, yes
Yizahi•1h ago
We have an ongoing effort to parse logs from our autotests to speed up debugging. It is very hard to do, mainly because there is a metric ton of false positives and plain old noise even in the info logs. Tracing the culprit can also be tricky, since an error in container A can be caused by the actual failure in container B, which may in turn depend on something else entirely, including hardware problems.

Basically, a surefire way to train an LLM to parse logs and detect real issues depends almost entirely on the readability and precision of the logging. And if the logging is good enough, then humans can debug faster and more reliably too :) . Unfortunately, the people reading logs and the people writing them barely intersect in practice, and so the issue remains.

shad42•39m ago
Yeah, that sounds very similar to what we went through while building this agent. We're focused on CI logs for now because we wanted something that works really well for things like flaky tests, but we're planning to expand the context to infrastructure logs very soon.
whoami4041•1h ago
"LLMs are good at SQL" is quite the assertion. My experience with LLM-generated SQL in OLTP and OLAP platforms has been a mixed bag. IMO analytics/SQL will always be a space that needs a significant weight of human input and judgement in generating, due to the critical business decisions that can be made from the insights.
shad42•57m ago
What we learned while building this is that every token in the context matters. We spent a lot of time watching logs of agent sessions and tweaking the tool params, the errors returned by tools, the agent prompts, etc...

We noticed, for example, the importance of letting the model pull from the context instead of pushing lots of data into the prompt. We have "complex" error reporting because we have to differentiate between real, non-retryable errors and errors that teach the model to retry differently. It changes the model's behavior completely.
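To illustrate that retryable vs. non-retryable distinction, a tool result might look something like this. The `run_log_query` wrapper, the `execute` backend, and the specific hints are hypothetical, not Mendral's actual error reporting:

```python
def run_log_query(sql, execute, max_rows=500):
    """Hypothetical tool wrapper: `execute` is whatever backend runs the
    query; errors come back as structured, model-readable hints so the
    agent learns whether - and how - to retry."""
    try:
        rows = execute(sql)
    except TimeoutError:
        # Retryable: nudge the model toward a cheaper query.
        return {"ok": False, "retryable": True,
                "hint": "query timed out; add a time filter or a LIMIT"}
    except PermissionError:
        # Non-retryable: the same class of query will never work.
        return {"ok": False, "retryable": False,
                "hint": "table not accessible to this agent"}
    if len(rows) > max_rows:
        # Retryable: the raw result would flood the context window.
        return {"ok": False, "retryable": True,
                "hint": f"{len(rows)} rows returned; aggregate or narrow the WHERE clause"}
    return {"ok": True, "rows": rows}
```

The point is that a "failure" with a good hint is training signal inside the loop, while a bare exception just dead-ends the agent.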

Also, I agree with "significant weight of human input and judgement": we spent lots of time optimizing the indexes and thinking about how to organize the data so queries perform at scale. Claude wasn't very helpful there.

whoami4041•40m ago
Very interesting work here, no doubt. It's a measured approach to using an LLM with SQL rather than trying to make it responsible for everything end-to-end.
dylan604•55m ago
> IMO analytics/SQL will always be a space that needs a significant weight of human input and judgement in generating.

Isn't that precisely what is done when prompting?

whoami4041•38m ago
The key to my point is in the word "generating" - meaning human input/judgement by actually typing more SQL than the LLM produces. The model's reasoning and code-generation pipelines are typically two separate code paths, so it may not always do what it intends, which can lead to unexpected results.
blharr•2m ago
"LLMs are good at [task I'm not good enough at to tell the LLM is bad at]" is becoming common
sathish316•1h ago
SQL is the best exploratory interface for LLMs. But most of the observability data we have today - metrics, logs, traces - is hidden behind layers of semantics and custom syntax that make it hard for an agent to translate exploration or debugging intent into the actual query language.

Large-scale data like metrics, logs, and traces is optimised for storage and access patterns, and OLAP/SQL systems may not be the optimal way to store or retrieve it. This is one of the reasons I've been working on a Text2SQL / Intent2SQL engine for observability data, to let an agent explore the schema, semantics, and syntax of any metrics or logs data. It is open sourced as the Codd Text2SQL engine - https://github.com/sathish316/codd_query_engine/

It is far from done - it currently works with Prometheus, Loki, and Splunk for a few scenarios and is open to OSS contributions. You can find it in action, used by Claude Code to debug with metrics and logs queries:

Metric analyzer and Log analyzer skills for Claude code - https://github.com/sathish316/precogs_sre_oncall_skills/tree...

testbjjl•39m ago
> SQL is the best exploratory interface for LLMs

Any qualifiers here from your experience or documentation?

shad42•23m ago
From my own experience it's true, and I think it's due to the amount of SQL content (docs, best practices, code) you can find online, which is now in every LLM's training corpus.

Same applies when picking a programming language nowadays.

p0w3n3d•31m ago
That's contrary to my experience. Logs contain a lot of noise and unnecessary information, especially in Java, so it's best to prepare them before feeding them to an LLM. Not to mention the wasted tokens...
shad42•14m ago
LLMs are better now at pulling context (as opposed to you feeding everything you can into the prompt), so you can expose enough query primitives to the LLM that it's able to filter out the noise.

I don't think implementing filtering at log ingestion is the right approach, because you don't know what is noise at that stage. We spent more time thinking about the schema and indexes to make sure complex queries perform at scale.
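A toy version of "ingest everything, expose query primitives" - the schema, indexes, and query here are assumptions sketched with SQLite, not Mendral's actual setup:

```python
import sqlite3

# Hypothetical CI-log schema: everything is ingested; noise is filtered
# at query time, so indexes on (job, ts) and level do the heavy lifting.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logs (ts INTEGER, job TEXT, level TEXT, line TEXT);
CREATE INDEX idx_logs_job_ts ON logs(job, ts);
CREATE INDEX idx_logs_level ON logs(level);
""")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", [
    (1, "build", "INFO",  "compiling 1200 targets"),
    (2, "build", "ERROR", "undefined reference to foo"),
    (3, "test",  "INFO",  "420 tests passed"),
    (4, "test",  "ERROR", "test_bar flaked: connection reset"),
])

# The kind of primitive an agent might be handed: errors per job,
# newest first, capped so the result always fits in context.
rows = conn.execute("""
    SELECT job, ts, line FROM logs
    WHERE level = 'ERROR'
    ORDER BY ts DESC
    LIMIT 50
""").fetchall()
```

Because nothing was dropped at ingestion, the same table can later answer a question nobody anticipated (say, correlating INFO lines around a flake), which a pre-filter would have made impossible.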

kikki•30m ago
Unrelated; what does "mendral" mean? It's a very... unmemorable word
shad42•19m ago
I am sure you've heard it before: there are only two hard things in CS - cache invalidation and naming things.

In the history of this company, I can honestly say that this SQL/LLM thing wasn't the hardest :)

THESMOKINGUN•5m ago
We gave an autocomplete terabytes of text. The results will shock you.
buryat•3m ago
I just wrote a tool for reducing logs for LLM analysis (https://github.com/ascii766164696D/log-mcp)

Lots of logs contain uninteresting information, so they easily pollute the context. Instead, my approach uses a TF-IDF classifier plus a BERT model on GPU to further classify log lines and reduce the number of logs that then get fed to an LLM.

I trained it on ~90GB of logs and provide scripts to retrain the models (https://github.com/ascii766164696D/log-mcp/tree/main/scripts)

It's meant to be used with the Claude Code CLI so it can use these tools instead of trying to read the log files directly.
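For flavor, here is a stdlib-only sketch of the first-stage idea - scoring lines by token rarity so repeated boilerplate drops out. This is an illustration of the TF-IDF concept, not log-mcp's actual classifier, and the `keep_ratio` threshold is an assumption:

```python
import math
from collections import Counter

def tfidf_scores(lines):
    """Score each log line by the rarity of its tokens: boilerplate that
    repeats across the whole log scores low, unusual lines score high."""
    docs = [line.lower().split() for line in lines]
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    n = len(docs)
    scores = []
    for doc in docs:
        if not doc:
            scores.append(0.0)
            continue
        tf = Counter(doc)
        scores.append(sum((tf[t] / len(doc)) * math.log(n / df[t]) for t in tf))
    return scores

def prefilter(lines, keep_ratio=0.25):
    """Keep only the highest-scoring fraction of lines before any model sees them."""
    scores = tfidf_scores(lines)
    k = max(1, int(len(lines) * keep_ratio))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [l for l, s in zip(lines, scores) if s >= threshold]
```

A line repeated across the whole log gets a near-zero score (log of n/df approaches zero), while a one-off stack trace or segfault line survives the cut.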
