Show HN: Cobalt – Unit tests for AI agents, like Jest but for LLMs

https://github.com/basalt-ai/cobalt
3•fdefitte•1h ago
Hey HN, I built Cobalt, an open-source testing framework for AI agents and LLM apps.

Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.

  npm install @basalt-ai/cobalt
  npx cobalt init
  npx cobalt run

Write experiments as code:

  import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

  // Load the dataset by name from Langfuse.
  const dataset = Dataset.fromLangfuse('support-tickets')

  // myAgent stands in for your own agent function; it is not imported from Cobalt.
  experiment('support-agent', dataset, async ({ item }) => {
    const result = await myAgent(item.input)
    return { output: result }
  }, {
    // Each evaluator asks an LLM judge to score the returned output.
    evaluators: [
      new Evaluator({ name: 'Helpful', type: 'llm-judge', prompt: 'Is this response helpful and accurate? {{output}}' }),
      new Evaluator({ name: 'No hallucination', type: 'llm-judge', prompt: 'Does this contain fabricated info? {{output}}' }),
    ]
  })

`npx cobalt run --ci` exits with code 1 if thresholds are violated. The GitHub Action posts score tables on PRs and auto-compares against the base branch.
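
To make the "fail the build when quality drops" idea concrete, here is a minimal sketch of a threshold gate. It is not Cobalt's implementation or API: the `Scores` and `Thresholds` shapes, the function names, the use of a mean score, and the sample numbers are all assumptions made up for illustration. It only shows how per-evaluator scores could map to an exit code of 0 or 1.

```typescript
// Hypothetical sketch of a CI quality gate; NOT Cobalt's actual code or API.
type Scores = Record<string, number[]>;    // evaluator name -> per-item scores in [0, 1]
type Thresholds = Record<string, number>;  // evaluator name -> minimum acceptable mean

// Mean of a list of scores; an empty list counts as 0 so a missing evaluator fails.
function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Names of the evaluators whose mean score is below its threshold.
function findViolations(scores: Scores, thresholds: Thresholds): string[] {
  return Object.entries(thresholds)
    .filter(([name, min]) => mean(scores[name] ?? []) < min)
    .map(([name]) => name);
}

// 0 means the run passes; 1 means at least one threshold was violated.
function exitCodeFor(scores: Scores, thresholds: Thresholds): number {
  return findViolations(scores, thresholds).length > 0 ? 1 : 0;
}

// Made-up sample data for illustration only.
const scores: Scores = {
  Helpful: [0.9, 0.8, 1.0],
  "No hallucination": [1.0, 0.5, 0.6],
};
const thresholds: Thresholds = { Helpful: 0.8, "No hallucination": 0.8 };

console.log(findViolations(scores, thresholds), exitCodeFor(scores, thresholds));
```

In a real CI step, the returned code would be handed to the process exit so the job is marked as failed.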

The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. Turns eval from a chore into a conversation.

Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results are stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.
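
For the plain-file case, a JSONL dataset is just one JSON object per line. The sketch below shows one way such a file could be turned into items with an `input` field, matching the `item.input` used in the experiment above. It is an illustration under assumptions, not Cobalt's loader: `parseJsonl`, the `Item` shape, and the `expected` field are hypothetical.

```typescript
// Hypothetical sketch of reading a JSONL dataset; NOT Cobalt's actual loader.
interface Item {
  input: string;            // text passed to the agent
  expected: string | null;  // assumed optional reference answer (hypothetical field)
}

// Parses JSONL text (one JSON object per line), skipping blank lines.
function parseJsonl(text: string): Item[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, i) => {
      const row = JSON.parse(line) as { input?: unknown; expected?: unknown };
      if (typeof row.input !== "string") {
        // i counts non-blank records, not physical lines.
        throw new Error(`record ${i + 1}: missing string "input" field`);
      }
      return {
        input: row.input,
        expected: typeof row.expected === "string" ? row.expected : null,
      };
    });
}

// Made-up sample data for illustration only.
const sample =
  '{"input":"Where is my order?","expected":"tracking"}\n\n{"input":"Cancel my plan"}\n';
const items = parseJsonl(sample);
console.log(items.length, items.map((it) => it.input));
```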

Show HN: BeadHub, Beads-based coordination for multiple coding agents

https://github.com/beadhub/beadhub
1•juanre•1m ago•0 comments

Georgian wine culture dates back, uninterrupted, approximately 8k years

https://www.wsetglobal.com/knowledge-centre/blog/2023/july/05/exploring-georgian-wine-history-gra...
1•Anon84•2m ago•0 comments

Fall-from-grace: A prompt engineering functional programming language

https://github.com/Gabriella439/grace
1•bwestergard•5m ago•0 comments

Turns Out There Was Voter Fraud in Georgia–By Elon Musk

https://newrepublic.com/post/206857/georgia-voter-fraud-elon-musk
1•mandeepj•6m ago•1 comments

The AI security nightmare is here and it looks suspiciously like lobster

https://www.theverge.com/ai-artificial-intelligence/881574/cline-openclaw-prompt-injection-hack
1•cschick•7m ago•1 comments

YouTube tests 'conversational AI' on TV apps

https://9to5google.com/2026/02/19/youtube-tv-conversational-ai-test/
1•geox•8m ago•0 comments

Exploring Linux on a LoongArch Mini PC

https://www.wezm.net/v2/posts/2026/loongarch-mini-pc-m700s/
2•naves•10m ago•0 comments

Interview with Steve Klabnik

https://alexalejandre.com/programming/steve-klabnik-interview/
2•veqq•10m ago•0 comments

Radical Forces in Germany (1931)

https://www.foreignaffairs.com/articles/germany/1931-04-01/radical-forces-germany
2•jjmarr•12m ago•1 comments

Venting Doesn't Reduce Anger, but Something Else Does, Review Finds

https://www.sciencealert.com/venting-doesnt-reduce-anger-but-something-else-does-review-finds
1•PaulHoule•13m ago•0 comments

The century of the maxxer: things are happening in America

https://samkriss.substack.com/p/the-century-of-the-maxxer
2•thinkingemote•14m ago•0 comments

Trump directs US Government to prepare release of files on aliens and UFOs

https://www.bbc.co.uk/news/articles/c4g57gqqln1o
3•smurda•15m ago•0 comments

Pdf-light: Enterprise-grade, lightweight HTML to PDF generator for Node.js

https://github.com/thisha-me/pdf-light
2•thunderbong•15m ago•0 comments

Show HN: AetherCam, a video recorder focusing on audio

https://aethercamera.pro
1•miloo94•17m ago•0 comments

Did GPT 5.2 make a breakthrough discovery in theoretical physics?

https://huggingface.co/blog/dlouapre/gpt-single-minus-gluons
1•ibobev•17m ago•0 comments

France Bets on Carbon Capture as North Sea Rivals Surge Ahead

https://oilprice.com/Energy/Energy-General/France-Bets-on-Carbon-Capture-as-North-Sea-Rivals-Surg...
1•PaulHoule•18m ago•0 comments

I found a Vulnerability. They found a Lawyer

https://dixken.de/blog/i-found-a-vulnerability-they-found-a-lawyer
2•toomuchtodo•19m ago•0 comments

Oxide plans new rack attack, packing in Zen 5 CPUs and DDR5 RAM

https://www.theregister.com/2026/02/13/whats_next_for_oxide_computer/
3•naltun•20m ago•0 comments

Aurea – The Living Code

https://docs.google.com/document/d/1X-5f6KnDckRIzlq7kdL5qzVDhRAPxJ_7OOLuPUcYT7c/edit?usp=sharing
1•CWHBEATZ•20m ago•1 comments

Tesla loses bid to toss $243M verdict in fatal Autopilot crash suit

https://www.cnbc.com/2026/02/20/tesla-loses-bid-toss-243-million-verdict-fatal-autopilot-crash-su...
3•1vuio0pswjnm7•21m ago•0 comments

Instance segmentation model that extracts 3D geometry from 2D floor plans

2•acaciabengo•21m ago•0 comments

I hacked ChatGPT and Google's AI – and it only took 20 minutes

https://www.bbc.com/future/article/20260218-i-hacked-chatgpt-and-googles-ai-and-it-only-took-20-m...
3•Tomte•24m ago•1 comments

The Unlikely Success of an Alabama Bookstore

https://www.newyorker.com/books/page-turner/the-unlikely-success-of-a-strange-alabama-bookstore
2•robenkleene•27m ago•0 comments

Building My Own Blog System with React and Supabase on My Website (Just for Fun)

https://mywebsite-3.vercel.app/blog/Building-My-Own-Blog-System-with-React-+-Supabase-(Just-for-Fun)
1•Jimmy6929•27m ago•0 comments

VOIS – O(1) similarity search, 2.7x faster than FAISS HNSW with perfect recall

https://pxquantum.com/technology.html
1•PXQuantumLabs•32m ago•1 comments

I built an open src tool called crashvault

https://github.com/Ak-dude/crashvault
1•Gremm•33m ago•1 comments

The splines are hallucinating now: how I built and what got built by AI mayors

https://dunn.us/notes/the-splines-are-hallucinating/
2•aed•34m ago•0 comments

Testing Super Mario Using a Behavior Model Autonomously

https://testflows.com/blog/testing-super-mario-using-a-behavior-model-autonomously-part1/
7•Naulius•35m ago•1 comments

How will OpenAI compete?

https://www.ben-evans.com/benedictevans/2026/2/19/how-will-openai-compete-nkg2x
2•sanj•35m ago•0 comments

Contribution: CLI tool to draw an image on your GitHub contribution graph

https://github.com/blaise-io/contribution
1•rootforce•35m ago•0 comments