Existing eval tools treat agents like deterministic functions: run once, check output, done. But agents aren't deterministic - same input, different tool calls, different outputs.
agentrial runs every test N times (default 10) and gives you:
- Pass rate with Wilson confidence intervals (not "72%" but "72%, CI 55-84%") - a sketch of the interval math follows this list
- Step-level failure attribution (which exact step diverged between success and fail runs)
- Real cost tracking from API metadata
- A GitHub Action for CI/CD
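If you're wondering what the interval buys you over a raw percentage, here's the textbook Wilson score interval on its own (this is just the standard formula, not agentrial's internals). At the default N=10, a 7/10 pass rate comes with a very wide range, which is exactly why a single-run "pass" or "fail" tells you so little:

```python
# Standard Wilson score interval for a pass rate -- illustrative only,
# not agentrial's code. Numbers below are made up (7 passes in 10 trials).
from math import sqrt

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval (z=1.96) for an observed pass rate."""
    p_hat = passes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(7, 10)
print(f"pass rate 70%, 95% CI {lo:.0%}-{hi:.0%}")  # roughly 40%-89% at N=10
```

The Wilson interval behaves better than the usual normal approximation at small N and near 0% or 100%, which is the regime agent evals actually live in.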
Tested with Claude 3 Haiku on a 3-tool LangGraph agent: 100 trials, $0.06 total, full trajectory capture.
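For context, the test target was along these lines - a hypothetical stand-in using LangGraph's prebuilt ReAct agent and the LangChain Anthropic integration; the tool names and prompts here are illustrative, not the actual benchmark:

```python
# Hypothetical 3-tool LangGraph agent, roughly the shape of the test target.
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def search_orders(customer_id: str) -> str:
    """Look up a customer's recent orders."""
    return f"orders for {customer_id}: [#1001, #1002]"

@tool
def get_refund_policy(category: str) -> str:
    """Return the refund policy for a product category."""
    return f"{category}: refundable within 30 days"

@tool
def issue_refund(order_id: str) -> str:
    """Issue a refund for a given order."""
    return f"refund issued for {order_id}"

model = ChatAnthropic(model="claude-3-haiku-20240307")
agent = create_react_agent(model, [search_orders, get_refund_policy, issue_refund])

result = agent.invoke({"messages": [("user", "Refund order #1002 for customer C42")]})
```

Run the same invocation N times and the tool-call sequence won't always match - that trajectory divergence is what the step-level attribution is built to surface.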
pip install agentrial
Open source, MIT licensed. LangGraph is supported today; CrewAI and AutoGen are coming next.