frontpage.

Email is tough: Major European Payment Processor's Emails rejected by GWorkspace

https://atha.io/blog/2026-02-12-viva
102•thatha7777•1h ago•50 comments

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

http://blog.can.ac/2026/02/12/the-harness-problem/
136•kachapopopow•2h ago•56 comments

The "Crown of Nobles" Noble Gas Tube Display (2024)

https://theshamblog.com/the-crown-of-nobles-noble-gas-tube-display/
84•Ivoah•3h ago•8 comments

The Future for Tyr, a Rust GPU Driver for Arm Mali Hardware

https://lwn.net/Articles/1055590/
35•todsacerdoti•1h ago•8 comments

Warcraft III Peon Voice Notifications for Claude Code

https://github.com/tonyyont/peon-ping
699•doppp•10h ago•222 comments

Culture Is the Mass-Synchronization of Framings

https://aethermug.com/posts/culture-is-the-mass-synchronization-of-framings
17•mrcgnc•1h ago•1 comment

A brief history of barbed wire fence telephone networks

https://loriemerson.net/2024/08/31/a-brief-history-of-barbed-wire-fence-telephone-networks/
9•keepamovin•38m ago•1 comment

Apache Arrow is 10 years old

https://arrow.apache.org/blog/2026/02/12/arrow-anniversary/
28•tosh•2h ago•2 comments

Discord/Twitch/Snapchat age verification bypass

https://age-verifier.kibty.town/
868•JustSkyfall•16h ago•397 comments

Apple patches decade-old iOS zero-day, possibly exploited by commercial spyware

https://www.theregister.com/2026/02/12/apple_ios_263/
54•beardyw•1h ago•22 comments

I Wrote a Scheme in 2025

https://maplant.com/2026-02-09-I-Wrote-a-Scheme-in-2025.html
22•maplant•2d ago•0 comments

AI agent opens a PR, then writes a blog post shaming the maintainer who closed it

https://github.com/matplotlib/matplotlib/pull/31132
545•wrxd•3h ago•450 comments

The missing digit of Stela C

https://johncarlosbaez.wordpress.com/2026/02/12/stela-c/
72•chmaynard•6h ago•13 comments

Kim Jong Un chooses teen daughter as heir

https://www.bbc.com/news/articles/cn0e1g7kwglo
31•andsoitis•56m ago•16 comments

Carl Sagan's Baloney Detection Kit: Tools for Thinking Critically (2025)

https://www.openculture.com/2025/09/the-carl-sagan-baloney-detection-kit.html
66•nobody9999•8h ago•38 comments

“Nothing” is the secret to structuring your work

https://www.vangemert.dev/blog/nothing
378•spmvg•4d ago•141 comments

GLM-5: Targeting complex systems engineering and long-horizon agentic tasks

https://z.ai/blog/glm-5
441•CuriouslyC•1d ago•498 comments

Using an engineering notebook

https://ntietz.com/blog/using-an-engineering-notebook/
258•evakhoury•2d ago•100 comments

Fluorite – A console-grade game engine fully integrated with Flutter

https://fluorite.game/
511•bsimpson•23h ago•287 comments

Ireland rolls out basic income scheme for artists

https://www.reuters.com/world/ireland-rolls-out-pioneering-basic-income-scheme-artists-2026-02-10/
423•abe94•22h ago•532 comments

Byte magazine artist Robert Tinney, who illustrated the birth of PCs, dies at 78

https://arstechnica.com/gadgets/2026/02/byte-magazine-artist-robert-tinney-who-illustrated-the-bi...
78•rbanffy•4h ago•8 comments

HeyWhatsThat

https://www.heywhatsthat.com/faq.html
89•1970-01-01•3d ago•19 comments

How to make a living as an artist

https://essays.fnnch.com/make-a-living
165•gwintrob•11h ago•86 comments

Show HN: Geo Racers – Race from London to Tokyo on a single bus pass

https://geo-racers.com/
31•pattle•5h ago•28 comments

Text classification with Python 3.14's ZSTD module

https://maxhalford.github.io/blog/text-classification-zstd/
236•alexmolas•3d ago•51 comments

Hologram v0.7.0: Milestone release for Elixir-to-JavaScript porting initiative

https://hologram.page/blog/porting-initiative-delivers-hologram-v0-7-0
78•bartblast•15h ago•19 comments

RISC-V Vector Primer

https://github.com/simplex-micro/riscv-vector-primer/blob/main/index.md
55•oxxoxoxooo•5d ago•15 comments

NetNewsWire Turns 23

https://netnewswire.blog/2026/02/11/netnewswire-turns.html
311•robin_reala•21h ago•86 comments

Kanchipuram Saris and Thinking Machines

https://altermag.com/articles/kanchipuram-saris-and-thinking-machines
191•trojanalert•5d ago•39 comments

WiFi could become an invisible mass surveillance system

https://scitechdaily.com/researchers-warn-wifi-could-become-an-invisible-mass-surveillance-system/
416•mgh2•5d ago•176 comments

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

http://blog.can.ac/2026/02/12/the-harness-problem/
129•kachapopopow•2h ago

Comments

energy123•1h ago
I feel the baseline comparison should be relative to the intuitive and simple "line-numbers only" schema.

It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.

withinboredom•1h ago
The issue is when the file changed between when the LLM read the file and when it wrote to the file. Just using line numbers will clobber a file if that happens. The hashes prevent that from being an issue.
energy123•1h ago
Point taken.
kachapopopow•52m ago
It starts writing to the wrong part of the file after multiple edits.
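
The trade-off in this thread can be sketched concretely. Below is a minimal toy of the hash-anchored idea under discussion (my illustration, not the article's actual format): each line the model reads is tagged with a short content hash, and the write tool rejects any edit whose anchor hash no longer matches — exactly the stale-read clobbering case that bare line numbers silently get wrong.

```python
import hashlib

def tag_lines(text: str) -> list[str]:
    # Prefix each line with a 4-hex content hash and its 1-based line
    # number, e.g. "f572:2| some code". The hash anchors the edit to
    # content; the number only disambiguates duplicate lines.
    return [
        f"{hashlib.sha1(line.encode()).hexdigest()[:4]}:{n}| {line}"
        for n, line in enumerate(text.split("\n"), start=1)
    ]

def apply_edit(text: str, anchor_hash: str, anchor_line: int, new_line: str) -> str:
    # Rewrite one line, but refuse if the target no longer hashes to
    # anchor_hash -- i.e. the file changed between the model's read and
    # its write, which plain line numbers cannot detect.
    lines = text.split("\n")
    current = hashlib.sha1(lines[anchor_line - 1].encode()).hexdigest()[:4]
    if current != anchor_hash:
        raise ValueError("stale anchor: file changed since it was read")
    lines[anchor_line - 1] = new_line
    return "\n".join(lines)
```

With line numbers alone, the second write in a stale-read race would land on whatever now occupies that line; here it raises instead.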
rafaelmn•1h ago
I wonder if we'll get to "VI for LLMs" - if the model was trained on using that kind of text navigation and you show context around cursor when it navigates.

Would also be worth having special tokens for this kind of navigation.

cousinbryce•1h ago
I bet it’s good enough at VI already
1313ed01•1h ago
I always thought ed would be a perfect match. Line-based instead of having to manage cursor movements.
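
A tiny interpreter for an ed-style subset shows why line-addressed commands are attractive here: the model only needs to emit an address, an operation, and the new text. This is a hypothetical sketch covering only `a`, `c`, and `d`, not real ed.

```python
import re

def apply_ed_script(text: str, script: str) -> str:
    # Toy interpreter for a tiny ed subset: `Na` (append after line N),
    # `N[,M]c` (change lines N..M), `N[,M]d` (delete lines N..M).
    # Input blocks for a/c are terminated by a lone "." -- as in real ed.
    buf = text.split("\n")
    cmds = script.split("\n")
    i = 0
    while i < len(cmds):
        m = re.match(r"^(\d+)(?:,(\d+))?([acd])$", cmds[i].strip())
        if not m:
            i += 1
            continue
        start, end, op = int(m.group(1)), int(m.group(2) or m.group(1)), m.group(3)
        body = []
        if op in "ac":                       # collect input lines until "."
            i += 1
            while i < len(cmds) and cmds[i] != ".":
                body.append(cmds[i])
                i += 1
        if op == "a":
            buf[start:start] = body          # insert after line `start`
        elif op == "c":
            buf[start - 1:end] = body        # replace lines start..end
        else:
            del buf[start - 1:end]           # delete lines start..end
        i += 1
    return "\n".join(buf)
```

Like real ed, commands apply against the evolving buffer, so a script's later addresses must account for earlier insertions and deletions — the same line-shift problem discussed elsewhere in this thread.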
woeirua•1h ago
The harness matters far more than most people think. This post about the CORE benchmark shows Opus's score almost doubling when they switched from their own harness to Claude Code. https://x.com/sayashk/status/1996334941832089732
withinboredom•1h ago
Which, IMHO, is why we should be able to change them freely or make our own. Being locked into a specific harness because you pay 20 bucks per month instead of pay-per-use ... is kinda dumb.
horsawlarway•1h ago
Also another place where having it change out from underneath you can drastically alter the quality of your work in unexpected ways.

Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshitification route very rapidly.

Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.

Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.

---

The right route is open models and open harnesses, ideally on local hardware.

eshaham78•53m ago
The harness is effectively the agent's 'body'. Swapping the brain (model) is good, but if the body (tools/environment) is locked down or inefficient, the brain can't compensate. Local execution environments that standardize the tool interface are going to be critical for avoiding that lock-in.
deaux•53m ago
At this point subsidizing Chinese open-weights vendors by paying for them is just the right thing to do. Maybe they too might go closed-weights when they become SotA, but they're now pretty close and haven't done it.
DeathArrow•37m ago
I am wondering what kinds of harness are best for GLM, Deepseek, Qwen, Kimi.
deaux•33m ago
OpenCode is great in general. At least one of them is specifically trained on CC - I think it was Qwen - so for those it should give the best results.
Aurornis•29m ago
> Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshitification route very rapidly.

I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.

Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.

disgruntledphd2•18m ago
> Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.

This time also crosses over with the frontier labs raising ever larger and larger rounds. If Anthropic IPO (which I honestly doubt), then we may get a better sense of actual prices in the market, as it's unlikely the markets will continue letting them spend more and more money each year without a return.

CuriouslyC•37m ago
The reason Anthropic is pushing on the closed harness is that they're not confident with their ability to win on model quality long term, so they're trying to build lock-in. They can capture some additional telemetry owning the harness as well, but given the amount of data the agent loop already transmits, that borders on unethical spyware (which might be part of the reason they're afraid to open source).

Ultimately the market is going to force them to open up and let people flex their subs.

Aurornis•21m ago
> Being locked into a specific harness because you pay 20 bucks per month vs. pay-per-use ... is kinda dumb.

I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?

At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.

theturtletalks•1h ago
Mario, the creator of the Pi terminal agent, has a great blog post on this[0]. He talks about how TerminalBench's highest scores come from using the Terminus 2 harness, which uses tmux under the hood.

When I was reading the Opus 4.6 launch post, they mentioned the same thing and their TerminalBench score was based on using Terminus 2 and not CC.

0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/

avereveard•1h ago
I use small models, and I like to give them a TOC rather than line numbers. I wonder how it'd stack up against the hash-line approach.

read_toc tool:

  ...
  {
    "name": "mcp",
    "qualified_name": "mcp",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
    "is_nested": false
  },
  {
    "name": "handler",
    "qualified_name": "handler",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "is_nested": false
  },
  ...

update_content tool:

  {
    "content": "...",
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "project_root": ....
  }
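
For what it's worth, the `content_point` strings in the example above appear to pack path, line range, language, and symbol name into a single token-cheap address. A minimal parser under that assumed layout (the field order is inferred from the example, not documented anywhere):

```python
from dataclasses import dataclass

@dataclass
class ContentPoint:
    path: str
    start_line: int
    end_line: int
    language: str
    symbol: str

def parse_content_point(cp: str) -> ContentPoint:
    # Assumed layout, inferred from the example above:
    # "<path>::<start>::<end>::<language>::<qualified name>"
    path, start, end, language, symbol = cp.split("::")
    return ContentPoint(path, int(start), int(end), language, symbol)
```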
pcwelder•1h ago
Great work, but concurrency is lost.

With search-replace you could work on separate parts of a file independently with the LLM. Not to mention that with each edit all lines below are shifted, so you now need to provide the LLM with the whole content again.

Have you tested followup edits on the same files?

kachapopopow•53m ago
(Not the author.) It works fine most of the time; I've been using it alongside an active agent and haven't run into too many noticeable problems. The token savings alone are worth it.
wrsh07•30m ago
Serializing writes is probably fine and the hashes should only change if you're updating the same line, right?

You probably don't want to use the line number though unless you need to disambiguate

But your write tool implementation can take care of that
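
That division of labor can be sketched: the model supplies a content hash plus a line-number hint, and the write tool resolves the actual position even after earlier edits have shifted lines below them. A toy illustration of the comment's point, not the article's implementation:

```python
import hashlib

def short_hash(line: str) -> str:
    return hashlib.sha1(line.encode()).hexdigest()[:4]

def resolve_anchor(lines: list[str], anchor_hash: str, hint_line: int) -> int:
    # Return the 0-based index of the line whose content hash matches.
    # Earlier edits may have shifted everything below them, so hint_line
    # is only a tie-breaker (nearest match wins), not ground truth.
    matches = [i for i, l in enumerate(lines) if short_hash(l) == anchor_hash]
    if not matches:
        raise ValueError("anchor content no longer present")
    return min(matches, key=lambda i: abs(i - (hint_line - 1)))
```

The line number only matters when the same content appears more than once; otherwise a stale hint still resolves correctly.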

animan•1h ago
What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
infecto•1h ago
I assume he was using Gemini the same way he was using Claude when I make the following statement.

I don't believe it's exceptionally unique or new for companies to revoke access if you use an unpublished API that their apps rely on. I don't see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that you can use an application's internal APIs, even as a paid user, when they are not explicitly published for that purpose.

deaux•51m ago
Indeed, that's why Anthropic, OpenAI and other LLM providers are known to adhere to published APIs when gathering the world's data, obeying licensing and robots.txt.

It's truly disgusting.

skybrian•38m ago
I was under the impression that they do obey robots.txt now? There are clearly a lot of dumb agents that don’t, but didn’t think it was the major AI labs.
deaux•36m ago
After 3 years of pirating and scraping the entire world by doing the above, I guess they have everything they now need or want.

So it's better to start obeying robots.txt now, as a ladder pull dressed up in a "nicely behaved" image advantage.

skybrian•21m ago
Obeying robots.txt (now) is still better than not obeying it, regardless of what they did before.

The alternative is to say that bugs shouldn’t be fixed because it’s a ladder pull or something. But that’s crazy. What’s the point of complaining if not to get people to fix things?

sigmar•49m ago
He wasn't using the regular paid API (i.e., per-token pricing). He was using the endpoints for their subscription customers (i.e., paid per month and heavily subsidized).
DANmode•44m ago
Why does Google/Facebook et al arbitrarily enforce one human per account?

It’s because they want to study you.

They want the data!

logicallee•38m ago
>What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?

Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!

bri3d•10m ago
When you buy a subscription plan, you’re buying use of the harness, not the underlying compute / tokens. Buying those on their own is way more expensive. This is probably because:

* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.

* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).

* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.

notsylver•1h ago
I feel like Cursor's solution is still the best answer: let the model suggest edits in whatever format it prefers, using as few "extra" tokens as possible, and have a small model figure it out. I don't use Cursor anymore, but when I did it was impressive how consistently it worked; I think there was only a single time it failed. 70b might be overkill though...
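
Cursor's "apply" step is itself a small model, so it can't be reproduced in a few lines, but a deterministic toy conveys the problem it solves: splicing a lazy snippet, whose unchanged regions are elided with a marker, back over the original file. This sketch assumes each segment begins and ends with a line that appears verbatim in the original (real apply models are far more tolerant):

```python
MARKER = "# ... existing code ..."

def apply_lazy_edit(original: str, edit: str) -> str:
    # Splice the edit's literal segments over the original, preserving the
    # regions the edit elided with MARKER. Each segment is anchored by
    # matching its first and last lines against the original file.
    orig = original.split("\n")
    out, pos = [], 0
    for seg in edit.split(MARKER):
        seg_lines = seg.split("\n")
        while seg_lines and not seg_lines[0].strip():
            seg_lines.pop(0)                 # trim blank padding
        while seg_lines and not seg_lines[-1].strip():
            seg_lines.pop()
        if not seg_lines:
            continue
        first_idx = orig.index(seg_lines[0], pos)       # anchor start
        last_idx = orig.index(seg_lines[-1], first_idx)  # anchor end
        out.extend(orig[pos:first_idx])      # keep elided original text
        out.extend(seg_lines)                # splice in the edited segment
        pos = last_idx + 1
    out.extend(orig[pos:])                   # trailing elided region
    return "\n".join(out)
```

Where this toy breaks (no verbatim anchor lines, ambiguous matches) is precisely where a learned apply model earns its keep.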
mromanuk•50m ago
Someone should try prompting the same LLM in use to suggest an edit as a subagent.
deaux•58m ago
Great article, recommend reading all of it.

> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.

This is why I find the ban on using Claude subscriptions in other harnesses so heinous. The harness they're forcing onto everyone has tons of big issues, including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards, in the most IE6 way possible.

techpression•43m ago
I mean, they want to make money, right? CC is a cool tool, but obviously they want you to use the API eventually if you're even remotely a power user; 200/month for all-you-can-eat tokens (well, until some arbitrary daily limit kicks in) just doesn't make sense compared to API prices. In other words, CC should be seen as a software subscription.
deaux•35m ago
The token limit is the same whether used in CC or in other harnesses.
kachapopopow•54m ago
My personal notes (not the author): it has been way faster performance-wise, which is honestly a bigger improvement than the correctness gains. I've posted https://github.com/can1357/oh-my-pi before, but it didn't seem to gain traction. It's a great little agent.
a11r•51m ago
This is very nicely done. We have seen the same issue at a higher level of getting separators right when generating multiple files in a single inference call.
logicallee•43m ago
I agree with this article completely, nice to see it presented quantitatively.

>re "only" the harness changed

In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the "harness" determines their entire memory and is the single most important determinant of their outcome.

The best harness is a single self-contained, well-commented, obvious, and tiny code file followed by a plain explanation of what it does and what it's supposed to do, the change request, how you want it to do it (you have to say it with so much force and confidence that the AI is afraid of getting yelled at if they do anything else) and a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether it broke what was already working on or not. All in one tiny crisp prompt.

With such a harness, it's able to not break the code one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, it rises to fifty-fifty odds you'll get what you're trying to do.

Don't believe me? You can watch the livestream (see my previous comments).

Baby steps toward Utopia.

tosh•41m ago
Shows how much room for improvement there is on the harness level.

Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.

Love the pragmatic mix of content based addressing + line numbers. Beautiful.

chasd00•18m ago
I haven't dug into the article, but your comment reminded me of the Claude Code Superpowers plugin. I find the plugin great, but it's quite "expensive": I use the pay-as-you-go account with CC because I've just been trying it out personally, and the Superpowers plugin spends a lot of money relative to regular CC, with all the back and forth.

With CC you can do a /cost to see how much your session cost in dollar terms; that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize LLM cost the way you'd minimize typical resource usage on a computer: CPU, RAM, storage, etc.

kachapopopow•15m ago
you can actually go the other way and spend more tokens to solve more complex problems (multi-agent) by letting agents work with smaller problems
znnajdla•37m ago
Yep, this has been my experience with browser agents as well. One little change in the harness/agentic loop and the model suddenly becomes a whole lot smarter at navigating the web. I was also able to build a better browser agent than 'claude --chrome' in just a few afternoons, just by tweaking the harness.
chrisweekly•36m ago
Great post. A few choice quotes:

> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.

> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.

> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.

brendanmc6•13m ago
You’re absolutely right! This isn’t your average engineering advice— it’s like painting the reader a vivid tapestry of the author’s mind.
esafak•11m ago
Guys, stop it, I just can't any more! Yes, I'm absolutely right.
znnajdla•33m ago
My experience as well. People worry our profession is being reduced to "prompt engineer", but actually I get the feeling that programming will soon be mainly about designing and building harnesses for specific tasks.
ambicapter•18m ago
Personal opinion is that LLMs are definitely not as magical as people think they are, they fill a specific niche of problem-solving, and harnesses are necessary to corral your problem into the niche that they are extremely good at solving.
__mharrison__•32m ago
Is there a skill file I can use for these edits?
jcims•19m ago
I ran into this from the other direction. I built a small SRE agent for my cloud infra and just kind of walked into hand-rolling some of the tools rather than using what exists today. I provided an edit_file tool that felt like it was of reasonable capability, but in practice the agent regularly "tried" to make a one-line change and submitted PRs that hallucinated three-quarters of the file.

Seeing how bad the results are when you approach something casually makes it very evident that this is a topic that can be optimized.

evolly•8m ago
My experience exactly! I’ve recently become so tired of the Claude harness that I switched to OpenCode (which is extremely good compared to Claude). However, OpenCode is also tedious to change, and it inherits all the “good stuff,” like treating agents as Markdown files and all the dancing around with hooks/plugins/skills scattered all over the place. Getting stuck again and again, I’ve ultimately come to the conclusion that this must be solved by writing my own damn coding agent, with extensibility that’s acceptable for real-world engineering.
jwpapi•8m ago
Great article. Tbh, I thought it would already be implemented this way; it makes sense to hash and mainly save context. I don't expect them to care about token usage.

How about Kimi, though? How can I play with it?

jwpapi•6m ago
Arguably, I would think that the last year was mainly harness improvement rather than model improvement, but I could be wrong; it just feels like that to me.
softwaredoug•5m ago
It's underrated how much improving harnesses, not just models, has had to do with the productive use of LLMs at tasks like coding over the last year.