I built a small benchmark to test CLI coding agents on blind bug detection.

A challenger agent injects bugs and writes ground truth (`bugs.json`). A different reviewer agent audits the repo without seeing ground truth, and an LLM matcher scores bug-to-finding assignments.
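To make the scoring step concrete, here is a minimal sketch of how matcher output might be turned into a per-review count. It assumes the matcher emits (bug, finding) pairs and that each finding may be credited to at most one bug; the `Match` type, the `score_review` function, and the ID format are all hypothetical and are not taken from the benchmark's code or from the real `bugs.json` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One bug-to-finding assignment proposed by the matcher (hypothetical shape)."""
    bug_id: str
    finding_id: str


def score_review(injected_bug_ids, matches):
    """Return (detected, total) for one review.

    Assumption: matches are one-to-one, so a single vague finding
    cannot be credited against several injected bugs.
    """
    valid_bugs = set(injected_bug_ids)
    used_findings = set()
    detected = set()
    for m in matches:
        if m.bug_id not in valid_bugs:
            continue  # matcher referenced a bug that is not in ground truth
        if m.bug_id in detected or m.finding_id in used_findings:
            continue  # enforce the one-to-one constraint
        detected.add(m.bug_id)
        used_findings.add(m.finding_id)
    return len(detected), len(valid_bugs)
```

Whether the real matcher enforces a one-to-one constraint is not stated here; without it, a reviewer that emits a few broad findings could be over-credited.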

Current run: 50 repos, 150 challenges, 450 reviews, 2,603 injected bugs.

Weighted detection: Claude 58.05%, Codex 37.84%, Gemini 27.81%.
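One possible reading of "weighted" is that each review counts in proportion to the number of bugs injected into it, i.e. total detected over total injected, rather than an average of per-review rates. The sketch below contrasts the two under that assumption; the function names and the sample numbers are invented for illustration and are not results from the run.

```python
def weighted_detection(reviews):
    """Pooled rate: sum of detected bugs over sum of injected bugs.

    `reviews` is a list of (detected, injected) pairs, one per review.
    """
    detected = sum(d for d, _ in reviews)
    injected = sum(t for _, t in reviews)
    return detected / injected if injected else 0.0


def unweighted_detection(reviews):
    """Mean of per-review rates; every review counts equally."""
    rates = [d / t for d, t in reviews if t]
    return sum(rates) / len(rates) if rates else 0.0


# Made-up data: one small challenge and one large one.
sample = [(1, 2), (9, 30)]
pooled = weighted_detection(sample)    # 10 detected out of 32 injected
macro = unweighted_detection(sample)   # average of 1/2 and 9/30
```

With the made-up sample, the pooled rate is about 0.31 while the per-review mean is about 0.40, so the choice of aggregation can shift a headline number noticeably when challenge sizes vary.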

LLM-judge benchmarks are easy to get wrong, so I’d really appreciate critical feedback on benchmark fairness, scoring/matching methodology, and obvious failure modes I’m missing.

Full dataset is linked in the docs.

Show HN: Cheddar-bench – unsupervised benchmark for coding agents