Testing AI agents is painful. Every test run calls the LLM API, costs real money, takes minutes, and gives different results each time. CI? Forget about it.

Evalcraft fixes this with cassette-based capture and replay — think VCR for HTTP, but for LLM calls and tool use.

How it works:

1. Run your agent once with real API calls. Evalcraft records every LLM request, tool call, and response into a JSON cassette file.

2. In tests, replay from the cassette. Zero API calls, zero cost, deterministic output.

3. Assert on what matters: tool call sequences, output content, cost budgets, token counts.

  run = replay("cassettes/support_agent.json")
  assert_tool_called(run, "lookup_order", with_args={"order_id": "ORD-1042"})
  assert_tool_order(run, ["lookup_order", "search_knowledge_base"])
  assert_cost_under(run, max_usd=0.01)
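To make the record/replay idea concrete, here is a toy sketch in plain Python. It is not Evalcraft's implementation or cassette format: `ToyCassette`, `run_agent`, `fake_lookup_order`, and the JSON layout are all hypothetical names made up for illustration. It only shows the general pattern of recording a call once and then serving it back from a file.

```python
import json
import tempfile
from pathlib import Path


class ToyCassette:
    """Hypothetical recorder/replayer; NOT Evalcraft's real API or file format."""

    def __init__(self, path, mode):
        self.path = Path(path)
        self.mode = mode  # "record" or "replay"
        self.events = []
        self._cursor = 0
        if mode == "replay":
            self.events = json.loads(self.path.read_text())

    def call(self, kind, name, args, real_fn):
        if self.mode == "record":
            # Hit the real dependency and remember what came back.
            result = real_fn(**args)
            self.events.append(
                {"kind": kind, "name": name, "args": args, "result": result}
            )
            return result
        # Replay: serve the next recorded event and never touch real_fn.
        event = self.events[self._cursor]
        self._cursor += 1
        if (event["kind"], event["name"], event["args"]) != (kind, name, args):
            raise AssertionError(f"unexpected call: {kind} {name} {args}")
        return event["result"]

    def save(self):
        self.path.write_text(json.dumps(self.events, indent=2))


def tool_names(events):
    """Names of recorded tool calls, in the order they happened."""
    return [e["name"] for e in events if e["kind"] == "tool"]


real_calls = []  # counts how often the "expensive" dependency actually runs


def fake_lookup_order(order_id):
    # Stand-in for a slow, billable call (an LLM request or a real tool).
    real_calls.append(order_id)
    return {"order_id": order_id, "status": "shipped"}


def run_agent(cassette):
    order = cassette.call(
        "tool", "lookup_order", {"order_id": "ORD-1042"}, fake_lookup_order
    )
    return f"Order {order['order_id']} is {order['status']}."


cassette_path = Path(tempfile.mkdtemp()) / "support_agent.json"

# Step 1: run once against the real dependency and write the cassette.
recorder = ToyCassette(cassette_path, "record")
recorded_answer = run_agent(recorder)
recorder.save()

# Step 2: run again from the cassette; the real dependency is not called.
player = ToyCassette(cassette_path, "replay")
replayed_answer = run_agent(player)

# Step 3: assert on what matters.
assert replayed_answer == recorded_answer
assert tool_names(player.events) == ["lookup_order"]
assert len(real_calls) == 1
```

In replay mode the agent code is unchanged; only the cassette's mode differs, which is why the second run is free and deterministic in this sketch.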

It's pytest-native — fixtures, markers, CLI flags. Works with OpenAI, Anthropic, LangGraph, CrewAI, AutoGen, and LlamaIndex out of the box. Adapters auto-instrument your agent with zero code changes.

Also ships with golden-set management, regression detection, PII sanitization, and 16 CLI commands for inspecting/diffing cassettes.

555 tests, MIT licensed, `pip install evalcraft`.

Repo: https://github.com/beyhangl/evalcraft
PyPI: https://pypi.org/project/evalcraft/
Docs: https://beyhangl.github.io/evalcraft/docs/

Would love feedback from anyone testing agents in CI.

Show HN: Evalcraft – cassette-based testing for AI agents (pytest, $0/run)