The core problem with agent evals today is that one run tells you nothing: same prompt, same model, same tools, different result every time.
I built agentrial to fix this. It's a pytest-style CLI that runs your agent N times and gives you:
- Wilson confidence intervals on the pass rate
- Step-level failure attribution: a Fisher exact test pinpoints which tool call or reasoning step diverges between passing and failing runs (both statistics are sketched below)
- Real API cost, pulled from response metadata
- A GitHub Action that blocks PRs when reliability drops
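To make the first two bullets concrete, here is a minimal sketch of the underlying statistics, using the standard Wilson score formula and scipy's Fisher exact test. It's illustrative, not agentrial's actual code, and the contingency-table layout is my own framing of step attribution:

```python
# Illustrative sketch, not agentrial's implementation.
from math import sqrt

from scipy.stats import fisher_exact


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; better behaved than the naive
    normal approximation at small N or extreme pass rates."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))


# 17 passes out of 20 trials -> roughly (0.64, 0.95): wide enough that a
# single-digit trial count would tell you very little.
print(wilson_interval(17, 20))

# Step-level attribution: did failing runs take a particular divergent step
# more often than passing runs? 2x2 table of [took step, skipped step]
# for failing vs. passing runs (numbers made up).
table = [[9, 1],
         [2, 8]]
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher exact p = {p_value:.4f}")
```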
Usage is minimal: install it, write a YAML config, and run "agentrial run":
pip install agentrial
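To give a rough feel for the config, here is an illustrative sketch; the field names are hypothetical, not agentrial's documented schema:

```yaml
# Hypothetical config, for illustration only -- not the real schema.
agent: my_agent.graph:build_agent   # import path to your agent factory (assumed)
trials: 100                         # how many times to run each case
cases:
  - name: refund-request
    input: "A customer asks for a refund on order #1234"
    pass_if:
      contains: "refund has been issued"
```

The idea is that "agentrial run" then repeats each case N times and reports the pass rate with its interval, the step attribution, and the cost.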
Tested extensively with LangGraph agents. 100 trials cost $0.06.
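For the CI gate, a bare-bones workflow that just installs agentrial and runs the trials on every pull request could look roughly like the sketch below. This is not the shipped GitHub Action; the fail-on-nonzero-exit behavior and the API-key variable are assumptions on my part.

```yaml
# Illustrative workflow, not the bundled GitHub Action.
name: agent-reliability
on: [pull_request]

jobs:
  trials:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install agentrial
      # Assumption: a reliability regression makes this exit nonzero,
      # which fails the job and blocks the PR.
      - run: agentrial run
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # whatever key your agent needs
```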
MIT licensed, no telemetry, runs locally.

Looking for feedback on what metrics matter most when you're shipping agents to production.