What sets it apart: Unlike benchmarks like SWE-Bench (which tests code generation on open-ended GitHub issues) or general agent evaluation suites (which mix diverse reasoning, coding, and interaction tasks), Tracecore focuses on deterministic episodes where agents must use constrained actions (e.g., file operations, ops triage) to achieve exact outcomes, with strict validation. It includes 15+ tasks across suites like operations and games, and supports running agents via adapters for frameworks like OpenClaw and Autogen, or via custom scripts.
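To make the "deterministic episode with strict validation" idea concrete, here is a minimal sketch of what such a task could look like. This is purely illustrative: the class name, fields, and validator below are my assumptions, not Tracecore's actual API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical shape of a deterministic episode (not Tracecore's real API)."""
    name: str
    allowed_actions: set[str]       # constrained action space the agent may use
    expected_state: dict[str, str]  # the exact outcome the agent must produce

    def validate(self, final_state: dict[str, str]) -> bool:
        # Strict validation: the final state must match the expected state exactly,
        # so success/failure is unambiguous and reproducible across runs.
        return final_state == self.expected_state

task = Task(
    name="rename-config",
    allowed_actions={"read_file", "write_file", "move_file"},
    expected_state={"config.yaml": "renamed"},
)
print(task.validate({"config.yaml": "renamed"}))   # exact match passes
print(task.validate({"config.yaml": "deleted"}))   # anything else fails
```

The key design point is that success is a pure function of the final state, which is what makes episodes deterministic and results comparable across agents and frameworks.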
You can try it out by installing through pip/uv, or by cloning the repo and installing the optional dev dependencies, then running the dashboard, the CLI wizard, or individual CLI commands. It outputs structured results: success/failure status, steps used, traces for analysis, diffs, bundles, and more.
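As a sketch of how those structured results might be consumed downstream, here is a small example that parses a result record and summarizes it. The field names (`task`, `success`, `steps_used`, `trace`) are assumptions for illustration, not Tracecore's actual output schema.

```python
import json

# Hypothetical result record; Tracecore's real schema may differ.
result_json = """
{
  "task": "rename-config",
  "success": true,
  "steps_used": 7,
  "trace": [
    {"step": 1, "action": "read_file", "arg": "config.yaml"},
    {"step": 2, "action": "move_file", "arg": "config.yaml"}
  ]
}
"""

result = json.loads(result_json)

# Structured output makes pass/fail aggregation and trace analysis trivial.
if result["success"]:
    print(f"{result['task']}: passed in {result['steps_used']} steps")
else:
    print(f"{result['task']}: failed after {result['steps_used']} steps")

for entry in result["trace"]:
    print(f"  step {entry['step']}: {entry['action']}({entry['arg']})")
```

Because results are machine-readable rather than free-form logs, they can be diffed across runs or fed into a dashboard without any scraping.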
I've been iterating on this over the past few weeks, adding new tasks and improving the harness. Previous discussions on AI eval tools were helpful in shaping the design. Feedback welcome, especially on expanding task suites or integration ideas.