
SaySigned – The e-signature service built for AI agents

https://saysigned.com/
1•dklvs•3m ago•0 comments

Figure skating is the oldest winter Olympic sport, debuting at the 1908 Summer Olympics

https://www.nbcolympics.com/news/figure-skating-101-olympic-history
2•avonmach•5m ago•0 comments

Show HN: The GPG Guide – Practical OpenPGP for 2026

https://leanpub.com/gpg-guide
1•tgies•5m ago•0 comments

Transcription APIs – OpenAI vs. Groq vs. Mistral

https://techstackups.com/comparisons/transcription-api-agent-experience-openai-groq-mistral/
1•sixhobbits•6m ago•0 comments

Allocators from C to Zig

https://antonz.org/allocators/
5•blenderob•7m ago•0 comments

Did you want that link to be permanent?

https://thehistoryoftheweb.com/did-you-want-that-link-to-be-permanent/
1•Brajeshwar•7m ago•0 comments

Scratch – minimalist, open-source, offline-first Markdown note-taking app for Mac

https://github.com/erictli/scratch
1•Brajeshwar•7m ago•0 comments

Chipping Away

https://www.weisser.io/chipping-away
1•Brajeshwar•8m ago•0 comments

Rethinking rush hour with vehicle automation

https://www.ornl.gov/news/rethinking-rush-hour-vehicle-automation
1•geox•9m ago•0 comments

HySparse: A Hybrid Sparse Attention Architecture

https://arxiv.org/abs/2602.03560
1•readitalready•10m ago•0 comments

Technical "Whitepaper" for Afl-Fuzz

https://lcamtuf.coredump.cx/afl/technical_details.txt
1•todsacerdoti•12m ago•0 comments

Resist and Unsubscribe

https://www.resistandunsubscribe.com
3•znq•13m ago•0 comments

Don't kill my pretty RSS feed

https://justinjackson.ca/xslt
3•mijustin•13m ago•0 comments

Show HN: HZ Chat – A simple session-based chat tool

https://hzclog.com/
1•hezhichaohk•13m ago•0 comments

Someone's attacking SolarWinds WHD to steal high‑privilege credentials

https://www.theregister.com/2026/02/09/solarwinds_mystery_whd_attack/
2•nallerooth•14m ago•0 comments

Restore pkg_resources #5174

https://github.com/pypa/setuptools/issues/5174
2•nioj•14m ago•0 comments

Show HN: VibeDB – store anything with zero config

https://pypi.org/project/vibedb/
1•StevenSLXie•14m ago•0 comments

When Execution Is Cheap, Judgment Becomes Scarce

https://brajeshwar.com/2026/when-execution-is-cheap-judgment-becomes-scarce/
1•speckx•15m ago•0 comments

6.2M names, birthdays and passport details leaked from Odido

https://nos.nl/artikel/2602080-hack-bij-odido-gegevens-miljoenen-klanten-in-handen-van-criminelen
2•yread•16m ago•0 comments

Administration working to strip citizenship from foreign-born Americans

https://www.nbcnews.com/politics/immigration/trump-administration-working-expand-effort-strip-cit...
5•OutOfHere•16m ago•2 comments

Show HN: Mem – deterministic CLI memory sidecar for dev workflows

https://github.com/amcbstudio/mem
2•arthurborgesdev•16m ago•0 comments

Show HN: BetterDB – Valkey/Redis monitoring that persists what servers forget

2•kaliades•17m ago•0 comments

A brief history of barbed wire fence telephone networks

https://loriemerson.net/2024/08/31/a-brief-history-of-barbed-wire-fence-telephone-networks/
2•keepamovin•17m ago•0 comments

Show HN: Stock skill for OpenClaw – 6,500 stocks, 900 days of data

https://bananafarmer.app/developers
1•AaronBMoore•18m ago•0 comments

Nearly half of ammo seized by Mexican government came from US Army plant

https://www.icij.org/news/2026/02/nearly-half-of-powerful-50-caliber-ammo-seized-by-mexican-gover...
10•wslh•18m ago•1 comment

A vintage electric shower with bare 240V elements in the water [video]

https://www.youtube.com/watch?v=M8f28WsTYDU
1•rwmj•18m ago•0 comments

Party Line (Telephony)

https://en.wikipedia.org/wiki/Party_line_(telephony)
2•keepamovin•18m ago•0 comments

Are CDs Making a Comeback? A Statistical Analysis

https://www.statsignificant.com/p/are-cds-making-a-comeback-a-statistical
3•naves•18m ago•1 comment

How We AI

https://jimmyislive.github.io/how-we-ai/
1•jimmyislive•19m ago•0 comments

add kdoc for napi_consume_skb()

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cd18e8ac030e646ea88...
1•tosh•19m ago•0 comments

Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed

http://blog.can.ac/2026/02/12/the-harness-problem/
104•kachapopopow•1h ago

Comments

energy123•1h ago
I feel the baseline comparison should be relative to the intuitive and simple "line-numbers only" scheme.

It's less token-heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.

withinboredom•1h ago
The issue is when the file changes between when the LLM read it and when it wrote to it. Line numbers alone will clobber the file if that happens; the hashes prevent that from being an issue.
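
A minimal sketch of the hash-guarded scheme under discussion, assuming each line is addressed by its number plus a short content hash; the names and the 4-hex-char format are illustrative assumptions, not the article's actual implementation:

  import hashlib

  def line_hash(line: str) -> str:
      # Short per-line content hash; 4 hex chars is purely illustrative.
      return hashlib.sha1(line.encode()).hexdigest()[:4]

  def render_for_llm(text: str) -> str:
      # What the model reads: "<lineno>:<hash>| <content>"
      return "\n".join(f"{i}:{line_hash(l)}| {l}"
                       for i, l in enumerate(text.splitlines(), 1))

  def apply_edit(text: str, lineno: int, expected: str, new_line: str) -> str:
      lines = text.splitlines()
      if line_hash(lines[lineno - 1]) != expected:
          # The file changed since the model read it: refuse, don't clobber.
          raise ValueError("stale edit: line no longer matches hash")
      lines[lineno - 1] = new_line
      return "\n".join(lines)

The line-numbers-only baseline is this same sketch minus the hash check, and that check is exactly what catches the stale-read case.
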
energy123•1h ago
Point taken.
kachapopopow•31m ago
It starts writing to the wrong part of the file after multiple edits.
rafaelmn•1h ago
I wonder if we'll get to "vi for LLMs": a model trained on that kind of text navigation, shown the context around the cursor as it navigates.

Would also be worth having special tokens for this kind of navigation.

cousinbryce•1h ago
I bet it’s good enough at vi already
1313ed01•54m ago
I always thought ed would be a perfect match. Line-based instead of having to manage cursor movements.
woeirua•1h ago
The harness matters far more than most people think. This post about the CORE benchmark shows Opus’ score almost doubling when they switched from their own harness to Claude Code. https://x.com/sayashk/status/1996334941832089732
withinboredom•1h ago
Which, IMHO, is exactly why we should be able to change them freely or make our own. Being locked into a specific harness because you pay 20 bucks per month instead of pay-per-use ... is kinda dumb.
horsawlarway•51m ago
It's also another place where having it change out from underneath you can drastically alter the quality of your work in unexpected ways.

Like most things, assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.

Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.

Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.

---

The right route is open models and open harnesses, ideally on local hardware.

eshaham78•32m ago
The harness is effectively the agent's 'body'. Swapping the brain (model) is good, but if the body (tools/environment) is locked down or inefficient, the brain can't compensate. Local execution environments that standardize the tool interface are going to be critical for avoiding that lock-in.
deaux•32m ago
At this point subsidizing Chinese open-weights vendors by paying for them is just the right thing to do. Maybe they too might go closed-weights when they become SotA, but they're now pretty close and haven't done it.
DeathArrow•17m ago
I am wondering what kinds of harnesses work best for GLM, DeepSeek, Qwen, Kimi.
deaux•12m ago
OpenCode is great in general. At least one of them (I think it was Qwen) is specifically trained on CC, so for that one CC should give the best results.
Aurornis•9m ago
> Like most things, assume the "20/100/200" dollar deals that are great now are going to go down the enshittification route very rapidly.

I don’t assume this at all. In fact, the opposite has been happening in my experience: I try multiple providers at the same time and the $20/month plans have only been getting better with the model improvements and changes. The current ChatGPT $20/month plan goes a very long way even when I set it to “Extra High” whereas just 6 months ago I felt like the $20/month plans from major providers were an exercise in bouncing off rate limits for anything non-trivial.

Inference costs are only going to go down from here and models will only improve. I’ve been reading these warnings about the coming demise of AI plans for 1-2 years now, but the opposite keeps happening.

CuriouslyC•16m ago
The reason Anthropic is pushing the closed harness is that they're not confident in their ability to win on model quality long term, so they're trying to build lock-in. They can also capture some additional telemetry by owning the harness, but given the amount of data the agent loop already transmits, that borders on unethical spyware (which might be part of the reason they're afraid to open source it).

Ultimately the market is going to force them to open up and let people flex their subs.

theturtletalks•56m ago
Mario, the creator of the Pi terminal agent, has a great blog post on this[0]. He talks about how TerminalBench's highest scores come from using the Terminus 2 harness, which uses tmux under the hood.

When I was reading the Opus 4.6 launch post, they mentioned the same thing: their TerminalBench score was based on using Terminus 2, not CC.

0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/

avereveard•59m ago
I use small models, and I like to give them a TOC rather than line numbers. I wonder how it'd stack up against the hash-line approach.

read_toc tool:

  ...
  {
    "name": "mcp",
    "qualified_name": "mcp",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
    "is_nested": false
  },
  {
    "name": "handler",
    "qualified_name": "handler",
    "type": "constant",
    "docstring": null,
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "is_nested": false
  },
  ...

update_content tool:

  {
    "content": "...",
    "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
    "project_root": ...
  }
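
For reference, a rough sketch of how a read_toc tool like the one above could be built for Python files with the stdlib ast module. The content_point format mirrors the example; everything else (function name, which node types are handled) is a guess at this commenter's setup:

  import ast, json
  from pathlib import Path

  def read_toc(path: str) -> list[dict]:
      tree = ast.parse(Path(path).read_text(encoding="utf-8"))
      kinds = {ast.FunctionDef: "function", ast.AsyncFunctionDef: "function",
               ast.ClassDef: "class"}
      entries = []
      for node in tree.body:  # top level only; nested defs omitted for brevity
          if type(node) in kinds:
              name, kind, doc = node.name, kinds[type(node)], ast.get_docstring(node)
          elif isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
              name, kind, doc = node.targets[0].id, "constant", None
          else:
              continue
          entries.append({
              "name": name,
              "qualified_name": name,
              "type": kind,
              "docstring": doc,
              "content_point": f"{path}::{node.lineno}::{node.end_lineno}::python::{name}",
              "is_nested": False,
          })
      return entries

  print(json.dumps(read_toc("src/mcps/code_help/server.py"), indent=2))
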
pcwelder•57m ago
Great work, but concurrency is lost.

With search-replace you could work on separate parts of a file independently with the LLM. Not to mention that with each edit, all lines below it are shifted, so you now need to re-provide the LLM with the whole content.

Have you tested followup edits on the same files?

kachapopopow•32m ago
(Not the author.) It works fine most of the time; I've been using it alongside an active agent and haven't run into too many noticeable problems. The token savings alone are worth it.
wrsh07•10m ago
Serializing writes is probably fine, and the hashes should only change if you're updating the same line, right?

You probably don't want to use the line number, though, unless you need to disambiguate.

But your write tool implementation can take care of that.
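
A sketch of that disambiguation, reusing the illustrative line_hash() addressing from the earlier sketch (the locate() helper here is hypothetical): resolve the target by content hash first, and use the possibly-stale line number only to choose between duplicate lines.

  import hashlib

  def line_hash(line: str) -> str:
      return hashlib.sha1(line.encode()).hexdigest()[:4]

  def locate(lines: list[str], target_hash: str, hint_lineno: int) -> int:
      matches = [i for i, l in enumerate(lines) if line_hash(l) == target_hash]
      if not matches:
          raise ValueError("target line no longer exists")
      # Unrelated edits above may have shifted the line; pick the match
      # closest to where the model last saw it.
      return min(matches, key=lambda i: abs(i - (hint_lineno - 1)))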

animan•54m ago
What was the point of Claude Code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
infecto•47m ago
I assume he was using Gemini the same way he was using Claude when I make the following statement.

I don't believe it's exceptionally unique or new for companies to revoke access when you use an unpublished API that their apps rely on. I don't see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that you can use an application's APIs, even as a paid user, when they are not explicitly published for outside use.

deaux•30m ago
Indeed, that's why Anthropic, OpenAI and other LLM providers are known to adhere to published APIs when gathering the world's data, obeying licensing and robots.txt.

It's truly disgusting.

skybrian•17m ago
I was under the impression that they do obey robots.txt now? There are clearly a lot of dumb agents that don't, but I didn't think it was the major AI labs.
deaux•15m ago
After 3 years of pirating and scraping the entire world by doing the above, I guess they now have everything they need or want.

So now it's better to start obeying robots.txt as a ladder pull, buying a "nicely behaved" image in the process.

sigmar•29m ago
He wasn't using the regular paid API (i.e. per-token pricing). He was using the endpoints for their subscription customers (i.e. paid per month and heavily subsidized).
DANmode•23m ago
Why do Google, Facebook, et al. arbitrarily enforce one human per account?

It’s because they want to study you.

They want the data!

logicallee•18m ago
>What was the point of Claude code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?

This underscores the importance of sovereign models you can run on the edge, fine-tune yourself, and run offline. At State of Utopia, we're working on it!

notsylver•43m ago
I feel like Cursor's solution is still the best answer. Let the model suggest edits in whatever format it prefers, using as few "extra" tokens as possible, and have a small model figure it out. I don't use Cursor anymore, but when I did, it was impressive how consistently it worked; I think there was only a single time it failed. 70B might be overkill though...
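
For illustration, a toy version of that two-stage flow: the big model emits a lazy edit with placeholder markers, and a deterministic merge stands in for Cursor's learned apply model (which handles far messier output). It assumes each edited chunk begins and ends with a context line copied verbatim from the original; all names here are made up:

  PLACEHOLDER = "# ... existing code ..."

  def merge(original: str, lazy: str) -> str:
      # Splice a lazy edit into the original file.
      orig = original.splitlines()
      out, cursor = [], 0
      for chunk in lazy.split(PLACEHOLDER):
          lines = chunk.strip("\n").splitlines()
          if not lines:
              continue
          start = orig.index(lines[0], cursor)       # anchor: leading context line
          out.extend(orig[cursor:start])             # copy untouched lines
          out.extend(lines)                          # emit the edited chunk
          cursor = orig.index(lines[-1], start) + 1  # resume after trailing context
      out.extend(orig[cursor:])                      # tail after the final placeholder
      return "\n".join(out)
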
mromanuk•30m ago
Someone should try prompting the same LLM in use to suggest an edit as a subagent.
deaux•37m ago
Great article, recommend reading all of it.

> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.

This is why I find the ban on using Claude subscriptions in other harnesses so heinous. The harness they're forcing onto everyone has tons of big issues, including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards, in the most IE6 way possible.

techpression•22m ago
I mean, they want to make money, right? CC is a cool tool, but obviously they want you to use the API eventually if you're even remotely a power user; 200/month for all-you-can-eat tokens (well, until some arbitrary limit of the day kicks in) just doesn't make sense compared to API prices. In other words, CC should be seen as a software subscription.
deaux•15m ago
The token limit is the same whether used in CC or in other harnesses.
kachapopopow•33m ago
My personal notes (not the author): it has been way faster performance-wise, which is honestly an even bigger improvement than the correctness gains. I've posted https://github.com/can1357/oh-my-pi before, but it didn't seem to gain traction. It's a great little agent.
a11r•30m ago
This is very nicely done. We have seen the same issue at a higher level: getting separators right when generating multiple files in a single inference call.
logicallee•22m ago
I agree with this article completely, nice to see it presented quantitatively.

>re "only" the harness changed

In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the "harness" determines their entire memory and is the single most important determinant of their outcome.

The best harness is a single self-contained, well-commented, obvious, and tiny code file, followed by a plain explanation of what it does and what it's supposed to do, the change request, and how you want it done (you have to say it with so much force and confidence that the AI is afraid of getting yelled at if it does anything else), plus a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether it broke what was already working or not. All in one tiny, crisp prompt.

With such a harness, it's able to not break the code one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, the odds you'll get what you're trying to do rise to fifty-fifty.

Don't believe me? You can watch the livestream (see my previous comments).

Baby steps toward Utopia.
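
Rendered literally as a template, that recipe looks something like the sketch below; the wording and section markers are one illustrative guess, not anything the commenter published:

  PROMPT = """You are editing the single, self-contained, well-commented file below.

  === FILE ===
  {code}

  === WHAT IT DOES AND WHAT IT SHOULD DO ===
  {explanation}

  === CHANGE REQUEST ===
  {change_request}

  === HOW TO DO IT (exactly this way, no deviations) ===
  {method}

  Do NOT break anything that already works. Do not touch code outside the
  change request. Then write a test that passes. Finally, give your judgment:
  did you break anything that was already working?"""

  def build_prompt(code, explanation, change_request, method):
      # One tiny, crisp prompt, per the recipe above.
      return PROMPT.format(code=code, explanation=explanation,
                           change_request=change_request, method=method)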

tosh•21m ago
Shows how much room for improvement there is at the harness level.

Agents waste a lot of tokens on editing, sandboxes, and passing info back and forth between tool calls and subagents.

Love the pragmatic mix of content-based addressing + line numbers. Beautiful.

znnajdla•17m ago
Yep, this has been my experience with browser agents as well. One little change in the harness/agentic loop and the model suddenly becomes a whole lot smarter at navigating the web. I was also able to build a better browser agent than 'claude --chrome' in just a few afternoons by tweaking the harness.
chrisweekly•15m ago
Great post. A few choice quotes:

> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.

> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.

> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.

znnajdla•12m ago
My experience as well. People worry our profession is being reduced to "prompt engineer", but actually I get the feeling that programming will soon be mainly about designing and building harnesses for specific tasks.
__mharrison__•11m ago
Is there a skill file I can use for these edits?