The root cause wasn't the team or clients. It was how we designed the agent: there were no clear boundaries unless you adopted a well-known agent framework.
I started this project because giving agents the kind of clear boundaries developers are already familiar with felt like the right thing to do.
To dogfood it, I defined a game-dev expert with a simple topology (plan → build → verify + coordinator) and ran the same task across 5 models.
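For a concrete picture of what that topology means, here is a minimal sketch. This is not the project's actual API; the `Step` and `Coordinator` names are hypothetical stand-ins for however the framework wires plan → build → verify under a coordinator:

```python
from dataclasses import dataclass

@dataclass
class Step:
    # One stage of the pipeline: a name plus the work it performs.
    name: str
    run: callable

@dataclass
class Coordinator:
    # The coordinator owns the ordering: each step's output feeds the next.
    steps: list

    def execute(self, task):
        trace = []
        result = task
        for step in self.steps:
            result = step.run(result)
            trace.append(step.name)
        return result, trace

# Illustrative pipeline; the lambdas stand in for real model-backed experts.
pipeline = Coordinator(steps=[
    Step("plan",   lambda t: f"plan({t})"),
    Step("build",  lambda t: f"build({t})"),
    Step("verify", lambda t: f"verify({t})"),
])

output, trace = pipeline.execute("dungeon-crawler")
# trace records the order the coordinator actually ran: plan, build, verify
```

The point of defining the topology once like this is that the same ordering can then be replayed unchanged against any model provider.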
Here are the results: https://github.com/perstack-ai/demo-catalog
The query was simple: "Create a Wizardry-like dungeon crawler..."
For evaluation, I focused on just three things. (1) Does the expert adhere to my instructions? (2) Is the outcome verified and actually working? (3) Is the API cost affordable?
Why these three? Because even if the harness architecture is solid, an agent needs to be evaluated on instruction adherence, minimum quality assurance, and cost efficiency. That's what I learned from working with clients.
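Those three criteria can be encoded as a simple rubric. A sketch only: the field names (`followed_pipeline`, `verify_passed`, `api_cost_usd`) and the budget parameter are illustrative assumptions, not part of the project's API.

```python
def evaluate(run: dict, budget_usd: float) -> dict:
    """Score one model run on the three criteria from the post."""
    return {
        # (1) Did the expert adhere to instructions (follow plan -> build -> verify)?
        "instruction_adherence": run["followed_pipeline"],
        # (2) Was the outcome verified and actually working?
        "verified_output": run["verify_passed"],
        # (3) Was the API cost within budget?
        "affordable": run["api_cost_usd"] <= budget_usd,
    }

# Example: a run that followed the pipeline, passed verification, and cost $3.43.
score = evaluate(
    {"followed_pipeline": True, "verify_passed": True, "api_cost_usd": 3.43},
    budget_usd=5.0,
)
```

A run only "passes" when all three are true; a cheap run that skips verification still fails the rubric.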
What I noticed:
- 3 out of 5 models followed the full plan → build → verify pipeline and produced verified working output, with no provider-specific tuning. The topology was defined once and ran as-is.
- Claude (4.6 Opus + 4.6 Sonnet) produced the richest output with flawless instruction adherence. It also achieved the highest cache hit rate (96%) among all providers, but pricing still pushed the total cost to 8× the nearest competitor's.
- Kimi K2.5 produced excellent output at $3.43 and was the most faithful to delegation. In this test, it outperformed GPT and Gemini in both instruction adherence and quality.
- Gemini (3.1 Pro + 3.0 Flash) followed the full pipeline and produced a verified working game. But its output was buggier than GPT's and almost unplayable.
- GPT (5.4 + 5-mini) was the fastest and cheapest, but skipped the verify step entirely. It called build three times instead of following the pipeline.
- MiniMax M2.5 ignored instructions entirely and made a browser-based HTML game. Instruction adherence is a challenge, but the newest version, M2.7, was recently announced with adherence improvements, so I'm looking forward to it.
It's one task from a demo catalog. But the full execution logs for every run are in the repo, so you can see exactly what each model did and reproduce it yourself.