The root cause wasn't the team or clients. It was how we designed the agent: there were no clear boundaries unless you adopted a well-known agent framework.
I started this project because giving agents the kind of clear boundaries developers are already familiar with felt like the right thing to do.
To dogfood it, I defined a game-dev expert with a simple topology (plan → build → verify + coordinator) and ran the same task across 5 models.
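For a concrete picture of what that topology means, here is a minimal sketch. This is not the project's actual API; the `Step` and `Coordinator` names are hypothetical stand-ins for however the framework wires plan → build → verify under a coordinator:

```python
from dataclasses import dataclass

@dataclass
class Step:
    # One stage of the pipeline: a name plus the work it performs.
    name: str
    run: callable

@dataclass
class Coordinator:
    # The coordinator owns the ordering: each step's output feeds the next.
    steps: list

    def execute(self, task):
        trace = []
        result = task
        for step in self.steps:
            result = step.run(result)
            trace.append(step.name)
        return result, trace

# Illustrative pipeline; the lambdas stand in for real model-backed experts.
pipeline = Coordinator(steps=[
    Step("plan",   lambda t: f"plan({t})"),
    Step("build",  lambda t: f"build({t})"),
    Step("verify", lambda t: f"verify({t})"),
])

output, trace = pipeline.execute("dungeon-crawler")
# trace records the order the coordinator actually ran: plan, build, verify
```

The point of defining the topology once like this is that the same ordering can then be replayed unchanged against any model provider.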
Here are the results: https://github.com/perstack-ai/demo-catalog
The query was simple: "Create a Wizardry-like dungeon crawler..."
For evaluation, I focused on just three things. (1) Does the expert adhere to my instructions? (2) Is the outcome verified and actually working? (3) Is the API cost affordable?
Why these three? Because even if the harness architecture is solid, an agent needs to be evaluated on instruction adherence, minimum quality assurance, and cost efficiency. That's what I learned from working with clients.
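Those three criteria can be encoded as a simple rubric. A sketch only: the field names (`followed_pipeline`, `verify_passed`, `api_cost_usd`) and the budget parameter are illustrative assumptions, not part of the project's API.

```python
def evaluate(run: dict, budget_usd: float) -> dict:
    """Score one model run on the three criteria from the post."""
    return {
        # (1) Did the expert adhere to instructions (follow plan -> build -> verify)?
        "instruction_adherence": run["followed_pipeline"],
        # (2) Was the outcome verified and actually working?
        "verified_output": run["verify_passed"],
        # (3) Was the API cost within budget?
        "affordable": run["api_cost_usd"] <= budget_usd,
    }

# Example: a run that followed the pipeline, passed verification, and cost $3.43.
score = evaluate(
    {"followed_pipeline": True, "verify_passed": True, "api_cost_usd": 3.43},
    budget_usd=5.0,
)
```

A run only "passes" when all three are true; a cheap run that skips verification still fails the rubric.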
What I noticed:
- 3 out of 5 models followed the full plan → build → verify pipeline and produced verified working output, with no provider-specific tuning. The topology was defined once and ran as-is.
- Claude (4.6 Opus + 4.6 Sonnet) produced the richest output with flawless instruction adherence. It also achieved the highest cache hit rate (96%) among all providers, but pricing still pushed the total cost to 8× the nearest competitor's.
- Kimi K2.5 produced excellent output at $3.43 and was the most faithful to delegation. In this test, it outperformed GPT and Gemini in both instruction adherence and quality.
- Gemini (3.1 Pro + 3.0 Flash) followed the full pipeline and produced a verified working game. But its output was buggier than GPT's and almost unplayable.
- GPT (5.4 + 5-mini) was the fastest and cheapest, but skipped the verify step entirely. It called build three times instead of following the pipeline.
- MiniMax M2.5 ignored instructions entirely and made a browser-based HTML game. Instruction adherence is a challenge, but the newest version, M2.7, was recently announced with adherence improvements, so I'm looking forward to it.
It's one task from a demo catalog. But the full execution logs for every run are in the repo, so you can see exactly what each model did and reproduce it yourself.