
Open-source LLM-as-judge eval suite with root cause analysis and failure mining

https://github.com/colingfly/cane-eval
1•colinfly•2h ago

Comments

colinfly•2h ago
Built an eval toolkit for AI agents that goes beyond pass/fail scoring. Define test suites in YAML, use Claude as an LLM judge, then automatically analyze why your agent fails and turn those failures into training data.

The main loop:

1. Define test cases with expected answers and weighted criteria
2. Run against any agent (HTTP endpoint, CLI command, or Python callable)
3. Claude judges each response on your criteria (0-100 per criterion)
4. Root cause analysis finds patterns across failures (knowledge gaps, prompt issues, missing sources)
5. Failure mining classifies each failure and uses an LLM to rewrite bad answers
6. Export as DPO/SFT/OpenAI fine-tuning JSONL

The RCA piece is what I think is most useful. Instead of just seeing "5 tests failed," you get things like "Agent consistently fabricates refund policies because no refund documentation exists in the knowledge base" with specific fix recommendations.
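
To make the weighted criteria and 0-100 judge scores from the main loop concrete, here is a rough Python sketch of how per-criterion scores might be combined and compared to a threshold. This is not cane-eval's actual code. The `Criterion` class, the `weighted_score` and `is_failure` helpers, the criterion names, and the assumption that the threshold applies to a weighted average are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    weight: float  # relative importance (hypothetical field, not cane-eval's schema)


def weighted_score(criteria, scores):
    """Combine per-criterion judge scores (0-100) into one weighted average."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criteria weights must sum to a positive number")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


def is_failure(score, threshold=60.0):
    """Assumption: a test fails when its overall score is below the threshold."""
    return score < threshold


# Illustrative values only: accuracy counts three times as much as tone.
criteria = [Criterion("accuracy", 3.0), Criterion("tone", 1.0)]
judge_scores = {"accuracy": 40, "tone": 90}

overall = weighted_score(criteria, judge_scores)
print(overall, is_failure(overall))  # prints: 52.5 True
```

With these weights, a polite but inaccurate answer still fails: the high tone score cannot lift the average above 60.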

CLI:

  pip install cane-eval
  cane-eval run tests.yaml
  cane-eval rca tests.yaml --threshold 60
  cane-eval run tests.yaml --mine --export dpo
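
For anyone unfamiliar with what a DPO export looks like, here is a minimal Python sketch of turning a mined failure into a preference pair written as JSONL. The `prompt`/`chosen`/`rejected` field names follow a convention commonly used for DPO datasets. Whether cane-eval writes exactly these fields is an assumption, and the helper names and sample data are made up for illustration.

```python
import json


def to_dpo_record(question, bad_answer, improved_answer):
    """Pair the agent's failing answer (rejected) with the rewritten one (chosen)."""
    return {
        "prompt": question,
        "chosen": improved_answer,
        "rejected": bad_answer,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical mined failure, loosely modeled on the refund-policy example above.
mined = [
    {
        "question": "What is the refund policy?",
        "bad": "Refunds are always granted within 90 days.",
        "improved": "I could not find a refund policy in the knowledge base.",
    },
]

jsonl_text = to_jsonl(
    to_dpo_record(m["question"], m["bad"], m["improved"]) for m in mined
)
print(jsonl_text)
```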

GitHub: https://github.com/colingfly/cane-eval

MIT licensed, pure Python, uses the Anthropic API. Happy to answer questions about the approach.

Biases abound when predicting biomarkers from histological images

https://www.nature.com/articles/s41551-026-01616-8
1•PaulHoule•38s ago•0 comments

What Did You Forget to Prompt? $87,500 in Fraud from Vibe-Coded Startup

https://qualitymax.io/vibe-check
2•qualitymax•2m ago•0 comments

Do Not A/B Test My Workflow

https://backnotprop.com/blog/do-not-ab-test-my-workflow/
3•ramoz•5m ago•0 comments

BuzzFeed Nearing Bankruptcy After Disastrous Turn Toward AI

https://futurism.com/artificial-intelligence/buzzfeed-disastrous-earnings-ai
2•jsheard•6m ago•0 comments

Spaced Repetition Algorithm: A Three‐Day Journey from Novice to Expert

https://github.com/open-spaced-repetition/fsrs4anki/wiki/Spaced-Repetition-Algorithm:-A-Three%E2%...
1•primenumber1•7m ago•0 comments

I got tired of GitHub's starred repo search, so I built a better one

https://github.com/alonronin/orbit
2•alonronin•8m ago•0 comments

Free Multi Cloud TCO and Feature Comparison

https://cloudcompare.online
1•rohan044•13m ago•0 comments

systemd 260-rc3 Released With AI Agents Documentation Added

https://www.phoronix.com/news/systemd-260-rc3
2•voxadam•14m ago•0 comments

Trump: U.S. hit military sites on Iranian Kharg island

https://www.iranintl.com/en/202603137666
2•ukblewis•14m ago•1 comments

After Decapitation, What's Next?

https://www.gzeromedia.com/by-ian-bremmer/after-decapitation-whats-next
1•petethomas•15m ago•0 comments

Senate Votes to Block Private Equity from Buying Homes

https://www.thebignewsletter.com/p/boom-senate-votes-to-block-private
4•pseudolus•15m ago•2 comments

Desperate for skilled workers, a furniture maker looks to apprenticeships

https://www.npr.org/2026/03/13/nx-s1-5727509/apprenticeships-manufacturing-workforce-trump-arkansas
1•toomuchtodo•16m ago•0 comments

We visited "ground zero" for hospice fraud: Los Angeles, California

https://www.cbsnews.com/projects/2026/hospice-fraud/
2•gmays•17m ago•0 comments

Hollywood Hacks OT: Cybersecurity Lessons from the Movies

https://www.emberot.com/resources/blog/ot-cybersecurity-lessons-from-the-movies/
2•TheWiggles•19m ago•0 comments

40 Years of Wireless Evolution Leads to a Smart, Sensing Network

https://spectrum.ieee.org/telecom-history-1g-to-6g
2•Brajeshwar•20m ago•0 comments

"Added 1M context window for Opus 4.6 by default for Max, Team, and Enterprise"

https://raw.githubusercontent.com/anthropics/claude-code/refs/heads/main/CHANGELOG.md
2•taspeotis•22m ago•1 comments

Could a Day Job Be the Foundation of an Artist's Success?

https://3quarksdaily.com/3quarksdaily/2026/03/could-a-day-job-be-the-foundation-of-an-artists-suc...
1•herbertl•24m ago•0 comments

Japanese government makes indie game devs eligible for grants up to $60k USD

https://automaton-media.com/en/news/japanese-government-makes-indie-game-developers-eligible-for-...
2•maenbalja•26m ago•0 comments

Pick one of catastrophic or equitable. Are founder clean breaks possible?

1•mehctothroaway•27m ago•0 comments

How the Strait of Hormuz closure affects global oil supply

https://www.reuters.com/graphics/IRAN-CRISIS/OIL-LNG/mopaokxlypa/
4•aanet•27m ago•1 comments

Iran and Region Monitor of Attacks and Major Events

https://newsfeed-staging.pages.dev/
1•msukhareva•28m ago•0 comments

Smaller Than a Fingernail: Unboxing the Tiniest Books [video]

https://www.youtube.com/watch?v=faN_yEghseo
1•gnabgib•30m ago•0 comments

Electron microscopy shows 'mouse bite' defects in semiconductors

https://news.cornell.edu/stories/2026/03/electron-microscopy-shows-mouse-bite-defects-semiconductors
1•hhs•30m ago•0 comments

Show HN: diz – SSH key exchange in one command each side

https://github.com/noahra/diz
1•noahra•31m ago•0 comments

The Playbook and Play-Engine Site (2003)

https://www.wisdom.weizmann.ac.il/~playbook/
1•turtleyacht•34m ago•1 comments

Digg cuts jobs after facing AI bot surge

https://www.reuters.com/technology/digg-cuts-jobs-after-facing-ai-bot-surge-2026-03-13/
2•geox•36m ago•1 comments

macOS backups with Kopia and Backblaze (2023)

https://hmarr.com/blog/mac-backups-with-kopia/
2•chmaynard•36m ago•0 comments

Dust Outbreak Reaches Europe

https://science.nasa.gov/earth/earth-observatory/dust-outbreak-reaches-europe/
1•gnabgib•37m ago•0 comments

How the Iran War Threatens Big Tech's AI Data Center Buildout in the Middle East [video]

https://www.youtube.com/watch?v=-vhTIkq9-ng
2•mgh2•38m ago•0 comments

Harnessing eDNA to help conserve Australia's oceans

https://phys.org/news/2026-03-harnessing-edna-australia-oceans.html
1•Brajeshwar•40m ago•0 comments