Unless I'm misunderstanding what's being described here, running Claude Code with different backend models is pretty common.
https://docs.z.ai/scenario-example/develop-tools/claude
It doesn't perform on par with Anthropic's models in my experience.
Why do you think that is the case? Are Anthropic's models just better, or do they train the models to somehow work better with the harness?
If you want to look at some of the tooling and process for this, check out verifiers (https://github.com/PrimeIntellect-ai/verifiers), hermes (https://github.com/nousresearch/hermes-agent) and accompanying trace datasets (https://huggingface.co/datasets/kai-os/carnice-glm5-hermes-t...), and other open source tools and harnesses.
I think spec-driven generation is the antithesis of chat-style coding for this reason. With tools like Claude Code, you are the one tracking what was already built, what interfaces exist, and why something was generated a certain way.
I built Ossature[1] around the opposite model. You write specs describing behavior, it audits them for gaps and contradictions before any code is written, then it produces a build plan in TOML where each task declares exactly which spec sections and upstream files it needs. The LLM never sees more than that, and there is no accumulated conversation history to drift from. Every prompt and response is saved to disk, so traceability is built in rather than something you reconstruct by scrolling back through a chat. I used it over the last couple of days to build a CHIP-8 emulator entirely from specs[2]. I have some more example projects on GitHub[3].
1: https://github.com/ossature/ossature
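To make the "each task declares its own context" idea concrete, here is a minimal Python sketch. It is not Ossature's actual schema or code: the `Task` fields, `build_prompt`, `run_task`, and the stub `fake_generate` are all hypothetical names chosen for illustration. It only shows the general shape: a prompt built from declared spec sections and upstream files, no chat history, and every exchange written to disk.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Task:
    """One build-plan task (hypothetical shape, not Ossature's real schema)."""
    name: str
    spec_sections: list                                  # ids of spec sections the task may see
    upstream_files: list = field(default_factory=list)   # files produced by earlier tasks


def build_prompt(task, specs, files):
    """Assemble a prompt from only the context the task declares."""
    parts = [f"# Task: {task.name}"]
    for section in task.spec_sections:
        parts.append(f"## Spec: {section}\n{specs[section]}")
    for path in task.upstream_files:
        parts.append(f"## File: {path}\n{files[path]}")
    return "\n\n".join(parts)


def run_task(task, specs, files, generate, log_dir):
    """Run one task with a fresh prompt and save the exchange to disk."""
    prompt = build_prompt(task, specs, files)
    response = generate(prompt)  # no conversation history is passed in
    log_dir.mkdir(parents=True, exist_ok=True)
    record = {"task": task.name, "prompt": prompt, "response": response}
    (log_dir / f"{task.name}.json").write_text(json.dumps(record, indent=2))
    return response


def fake_generate(prompt):
    """Stand-in for a real LLM call."""
    return f"(stub output for a {len(prompt)}-character prompt)"


# Illustrative data only: two spec sections, one upstream file.
specs = {
    "cpu.fetch": "Fetch a 2-byte opcode at PC.",
    "display.draw": "XOR sprites onto a 64x32 screen.",
}
files = {"memory.py": "class Memory: ..."}
task = Task("cpu_fetch", ["cpu.fetch"], ["memory.py"])

with tempfile.TemporaryDirectory() as tmp:
    run_task(task, specs, files, fake_generate, Path(tmp) / "trace")
    saved = json.loads((Path(tmp) / "trace" / "cpu_fetch.json").read_text())
```

Because the task never lists `display.draw`, that section never reaches the prompt, and the saved JSON record is the full trace of what the model was shown.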
How does the human intervention work out? Do you use a mix of spec and audit editing to get to the ready-to-generate state? How high is the success/error rate when you generate code from tasks? Do LLMs forget or mess things up, or does it feel better?
The spec-driven approach is potentially better for writing things from scratch. Do you have any plans for existing code?
Any reason you’ve opted for custom markdown formats with the @ syntax rather than using something like frontmatter?
I'm very conscious that this would prevent any Markdown rendering on GitHub, etc.
It's also a good example that you can turn any useful code component that requires 1k LOC into a mess of 500k LOC.
armcat•2h ago
esafak•1h ago
Yokohiii•11m ago
stanleykm•1h ago
HarHarVeryFunny•42m ago
For a preview of what it'd be like, just tell your AI chat app that you'll run bash commands for it, and please change the app in your "current directory" to "sort the output before printing it", or some such request.
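If you'd rather not paste commands back and forth by hand, the same experiment can be sketched in a few lines of Python. This is only a rough illustration under assumptions of my own: the `RUN: ` prefix convention, `agent_loop`, and `scripted_model` (a stand-in that replays canned replies instead of calling a real chat model) are all made up for the example, and it needs `bash` on the PATH.

```python
import subprocess


def run_bash(command, timeout=30):
    """Run one shell command and return its exit code and output as text."""
    result = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True, timeout=timeout
    )
    return f"exit code: {result.returncode}\n{result.stdout}{result.stderr}"


def agent_loop(ask_model, request, max_turns=10):
    """Alternate between the model proposing a command and us running it."""
    transcript = [("user", request)]
    for _ in range(max_turns):
        reply = ask_model(transcript)
        transcript.append(("assistant", reply))
        if not reply.startswith("RUN: "):
            return transcript  # anything else is treated as the final answer
        output = run_bash(reply[len("RUN: "):])
        transcript.append(("user", output))  # feed the result back, as you would by hand
    return transcript


def scripted_model(transcript):
    """Stand-in for a real chat model: replays a fixed list of replies."""
    script = ["RUN: echo hello", "DONE: nothing more to run"]
    turns_taken = sum(1 for role, _ in transcript if role == "assistant")
    return script[turns_taken]


transcript = agent_loop(scripted_model, "sort the output before printing it")
```

Swapping `scripted_model` for a function that calls an actual model gives you the bare-bones version of the loop; everything a real harness adds (file editing tools, context management, permissions) sits on top of this.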
Yokohiii•13m ago
stanleykm•10m ago
senko•9m ago
So, yes, it can work.
HarHarVeryFunny•1h ago
I suspect that more could be done in terms of translating semi-naive user requests into the steps that a senior developer would take to enact them, maybe including the tools needed to do so.
It's interesting that the author believes that the best open source models may already be good enough to compete with the best closed source ones, given an optimized agent and maybe a bit of fine-tuning. I guess the bar isn't really being able to match the SOTA model, but being close to competent human level: it's a fixed bar, not a moving one. Adding more developer expertise by having the agent translate/augment the user's request/intent into execution steps would certainly seem to have the potential to lower the bar of what the model needs to be capable of one-shotting from the raw prompt.
Yokohiii•1h ago