Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.
```bash
npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run
```
Write experiments as code:

```ts
import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

// Pull an eval dataset straight from Langfuse
const dataset = Dataset.fromLangfuse('support-tickets')

experiment('support-agent', dataset, async ({ item }) => {
  // Run the agent under test on each dataset item
  const result = await myAgent(item.input)
  return { output: result }
}, {
  // LLM-as-judge evaluators score each output
  evaluators: [
    new Evaluator({ name: 'Helpful', type: 'llm-judge', prompt: 'Is this response helpful and accurate? {{output}}' }),
    new Evaluator({ name: 'No hallucination', type: 'llm-judge', prompt: 'Does this contain fabricated info? {{output}}' }),
  ]
})
```
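Because experiments are plain TypeScript, side-by-side model comparisons (like the GPT 5.2 vs 5.1 run mentioned below) can be written as two experiments over the same dataset and judge, then diffed in the results. A rough sketch reusing only the API shown above; `callModel` is my own stand-in for however you invoke the two model versions, not part of Cobalt:

```ts
// Sketch: two experiments over the same dataset and judge, so their score
// tables can be compared. `callModel` is a hypothetical wrapper around your
// LLM client; everything else reuses the API from the example above.
const helpful = new Evaluator({
  name: 'Helpful',
  type: 'llm-judge',
  prompt: 'Is this response helpful and accurate? {{output}}',
})

experiment('support-agent-gpt-5.1', dataset, async ({ item }) => {
  return { output: await callModel('gpt-5.1', item.input) } // assumed helper
}, { evaluators: [helpful] })

experiment('support-agent-gpt-5.2', dataset, async ({ item }) => {
  return { output: await callModel('gpt-5.2', item.input) } // assumed helper
}, { evaluators: [helpful] })
```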
`npx cobalt run --ci` exits with code 1 if thresholds are violated. The GitHub Action posts score tables on PRs and auto-compares against the base branch (a minimal workflow sketch is below).

The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. Turns eval from a chore into a conversation.
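Back to the CI piece: you don't strictly need the GitHub Action to gate merges, since any workflow that runs the documented command and respects its exit code will do. A minimal sketch with placeholder job name, Node version, and secret (the official Action layers the PR score tables and base-branch comparison on top of something like this):

```yaml
# Minimal sketch: fail the PR check when `npx cobalt run --ci` exits non-zero.
# Job name, Node version, and env/secret names are placeholders for your setup.
name: evals
on: [pull_request]

jobs:
  cobalt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx cobalt run --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```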
Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.
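For the plain-file route, I haven't checked Cobalt's exact schema, but a JSONL dataset is conventionally one item per line carrying whatever fields your experiment reads (here just `input`, matching the example above):

```jsonl
{"input": "My invoice was charged twice, can I get a refund?"}
{"input": "How do I rotate my API key without downtime?"}
{"input": "The export button is greyed out on the billing page."}
```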