Show HN: I put an AI agent on a $7/month VPS with IRC as its transport layer

https://georgelarson.me/writing/2026-03-23-nullclaw-doorman/
110•j0rg3•3h ago•38 comments

Why so many control rooms were seafoam green (2025)

https://bethmathews.substack.com/p/why-so-many-control-rooms-were-seafoam
617•Amorymeltzer•1d ago•123 comments

Apple discontinues the Mac Pro with no plans for future hardware

https://9to5mac.com/2026/03/26/apple-discontinues-the-mac-pro/
110•bentocorp•5h ago•101 comments

Judge blocks Pentagon effort to 'punish' Anthropic with supply chain risk label

https://www.cnn.com/2026/03/26/business/anthropic-pentagon-injunction-supply-chain-risk
196•prawn•3h ago•126 comments

Chicago artist creates tourism posters for city's neighborhoods

https://www.chicagotribune.com/2026/03/25/chicago-neighborhood-posters/
57•NaOH•3h ago•28 comments

Moving from GitHub to Codeberg, for lazy people

https://unterwaditzer.net/2025/codeberg.html
535•jslakro•12h ago•263 comments

DOOM Over DNS

https://github.com/resumex/doom-over-dns
218•Venn1•3d ago•68 comments

From 0% to 36% on Day 1 of ARC-AGI-3

https://www.symbolica.ai/blog/arc-agi-3
10•lairv•1h ago•6 comments

Anthropic Subprocessor Changes

https://trust.anthropic.com
47•tencentshill•4h ago•23 comments

My minute-by-minute response to the LiteLLM malware attack

https://futuresearch.ai/blog/litellm-attack-transcript/
305•Fibonar•10h ago•126 comments

Dobase – Your workspace, your server

https://dobase.co/
19•frenkel•3d ago•7 comments

Whistler: Live eBPF Programming from the Common Lisp REPL

https://atgreen.github.io/repl-yell/posts/whistler/
33•varjag•3d ago•0 comments

Chroma Context-1: Training a Self-Editing Search Agent

https://www.trychroma.com/research/context-1
5•philip1209•7h ago•0 comments

We haven't seen the worst of what gambling and prediction markets will do

https://www.derekthompson.org/p/we-havent-seen-the-worst-of-what
583•mmcclure•6h ago•411 comments

HyperAgents: Self-referential self-improving agents

https://github.com/facebookresearch/hyperagents
137•andyg_blog•2d ago•57 comments

Order Granting Preliminary Injunction – Anthropic vs. U.S. Department of War [pdf]

https://storage.courtlistener.com/recap/gov.uscourts.cand.465515/gov.uscourts.cand.465515.134.0.pdf
109•theindieman•3h ago•15 comments

OpenTelemetry profiles enters public alpha

https://opentelemetry.io/blog/2026/profiles-alpha/
150•tanelpoder•10h ago•18 comments

CERN to host a new phase of Open Research Europe

https://home.cern/news/news/cern/cern-host-europes-flagship-open-access-publishing-platform
197•JohnHammersley•7h ago•16 comments

John Bradley, author of xv, has died

https://voxday.net/2026/03/25/rip-john-bradley/
224•linsomniac•7h ago•69 comments

Show HN: Fio: 3D World editor/game engine – inspired by Radiant and Hammer

https://github.com/ViciousSquid/Fio
40•vicioussquid•5h ago•3 comments

Using FireWire on a Raspberry Pi

https://www.jeffgeerling.com/blog/2026/firewire-on-a-raspberry-pi/
59•jandeboevrie•6h ago•28 comments

Show HN: Veil – Dark mode PDFs without destroying images, runs in the browser

https://veil.simoneamico.com/
44•simoneamico•14h ago•7 comments

Show HN: Turbolite – a SQLite VFS serving sub-250ms cold JOIN queries from S3

https://github.com/russellromney/turbolite
115•russellthehippo•7h ago•25 comments

Colibri – chat platform built on the AT Protocol for communities big and small

https://colibri.social/
102•todotask2•9h ago•63 comments

Running Tesla Model 3's computer on my desk using parts from crashed cars

https://bugs.xdavidhu.me/tesla/2026/03/23/running-tesla-model-3s-computer-on-my-desk-using-parts-...
868•driesdep•1d ago•300 comments

How much precision can you squeeze out of a table?

https://www.johndcook.com/blog/2026/03/26/table-precision/
45•nomemory•6h ago•4 comments

$500 GPU outperforms Claude Sonnet on coding benchmarks

https://github.com/itigges22/ATLAS
79•yogthos•9h ago•23 comments

Swift 6.3

https://www.swift.org/blog/swift-6.3-released/
296•ingve•19h ago•201 comments

Stripe Projects: Provision and manage services from the CLI

https://projects.dev/
114•piinbinary•10h ago•28 comments

What Does a Hologram Trademark Signify When the Hologram Isn't There?

https://blog.ericgoldman.org/archives/2026/03/what-does-a-hologram-trademark-signify-when-the-hol...
15•hn_acker•3d ago•3 comments

$500 GPU outperforms Claude Sonnet on coding benchmarks

https://github.com/itigges22/ATLAS
79•yogthos•9h ago

Comments

memothon•5h ago
I'm always skeptical, because you can make a model pass the benchmarks, and then when you actually use it, it turns out not to be practically useful the way an extremely general model is.

Cool work though, really excited for the potential of slimming down models.

yogthos•3h ago
You obviously have to try it out to see how it works for you, but the trick they use is pretty clever. When you ask an AI to write code, it doesn’t always get it right. Sometimes the code has bugs, sometimes it misunderstands the problem entirely. A naive way to address that is to generate a few solutions and test each one. The odds that at least one works go way up. ATLAS generates multiple attempts, running each through a test suite. Each retry also gets told what went wrong with the previous attempt, so it can try to avoid the same mistake.

But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
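The generate-and-test loop with error feedback described above can be sketched roughly like this (a toy illustration, not code from the ATLAS repo; `generate` stands in for whatever model call you use, and all names are made up):

```python
import subprocess
import sys
import tempfile

def run_candidate(code: str, test: str) -> tuple[bool, str]:
    """Run a candidate solution plus its test suite in a subprocess
    sandbox. Returns (passed, error_output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test)
        path = f.name
    proc = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=30
    )
    return proc.returncode == 0, proc.stderr

def solve_with_retries(generate, task: str, test: str, k: int = 3):
    """Best-of-k with feedback: each retry sees the previous failure,
    so the model can try to avoid repeating the same mistake."""
    feedback = ""
    for _ in range(k):
        code = generate(task + feedback)
        ok, err = run_candidate(code, test)
        if ok:
            return code
        feedback = f"\nPrevious attempt failed with:\n{err}\nAvoid that mistake."
    return None
```

The subprocess per candidate is exactly the overhead the comment goes on to describe: spinning up an isolated run for every attempt adds up fast.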

ATLAS also asks the model for an embedding of what it just wrote, which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints, and a well-written, confident solution will produce a different fingerprint than a confused, buggy one.

These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.

So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
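The scoring step might look something like this toy stand-in (the repo's actual Cost Field architecture isn't described here, so this is a made-up single-layer scorer; `W` and `b` stand in for whatever weights were learned ahead of time):

```python
import numpy as np

def cost_field_score(embedding: np.ndarray, W: np.ndarray, b: np.ndarray) -> float:
    """Tiny pretrained scorer: lower score = more likely correct.
    W and b would come from training on (embedding, was-it-correct) pairs."""
    hidden = np.maximum(W @ embedding + b, 0.0)  # one ReLU layer
    return float(hidden.sum())

def pick_candidate(embeddings, W, b):
    """Score each candidate's fingerprint and return the index of the
    lowest-scoring one -- only that candidate gets actually tested."""
    scores = [cost_field_score(e, W, b) for e in embeddings]
    return int(np.argmin(scores)), scores
```

The point of the design is that one cheap forward pass per candidate replaces one sandboxed test run per candidate, and you only pay the testing cost for the winner.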

zar1048576•3h ago
Really intriguing set of techniques to improve accuracy by generating multiple solutions. Even with the work to predict the most likely solutions, it's not clear to me based on the description how this could all be done efficiently. Would definitely be really impressive if it pans out on real-world use cases. Will look to kick the tires on this if I can get some time.
yogthos•2h ago
Seems like the key insight is to train a small model that acts as a heuristic for embeddings that resemble quality code. I imagine a lot depends on how well this model is trained. And you could probably create specialized versions for different languages and domains.

Another interesting approach could be to use this set up with a language like Clojure or Common Lisp which facilitates interactive development. If you could hook up the agent directly to a REPL in a running program, then it could run tests with a lot less overhead.

xyzzy123•1h ago
I'm super confused. The small model "cost field" `rag-api/geometric_lens/cost_field.py` was trained on PASS_TASKS like "Write a function that counts vowels in a string." and FAIL_TASKS like "Write a function that converts a regular expression string to an NFA using Thompson's construction, then converts the NFA to a DFA.".

So it seems like it's a difficulty classifier for task descriptions written in English.

This is then used to score embeddings of Python code, which is a completely different distribution.

Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.

But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.

yogthos•38m ago
I think the goal is to have a light heuristic that helps find plausibly useful solutions. They're still going to go through a testing phase as a next step, so this is just a very simple filter to decide what's even worth testing.
negativegate•2h ago
Am I still SOL on AMD (9070 XT) when it comes to this stuff?
dangus•2h ago
Well, this specific solution was only set up on specific hardware, and is Nvidia-dependent, as the readme states.

That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.

Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.

dannyw•46m ago
Unfortunately AMD is much worse with supporting AI features like FSR4 on older hardware generations, despite the capability and leaked INT8 models being there. Totally unlike NVIDIA.

It’s absurd I have to use open source programs to get INT8 FSR4 support.

patshead•58m ago
No, but yes? OmniCoder 9B at Q6 fits on my 9070 XT with 200k+ tokens of context, and it works pretty well with OpenCode. It is for sure the best local model that I've managed to squeeze onto my GPU, and it even works at 120k context at Q3 on an 8GB RX 580 GPU.

I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.

Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!

riidom•2h ago
Not a word about the tok/sec, unfortunately.
arjie•53m ago
It won’t be meaningful considering the architecture: it’s a harness around the model that generates multiple solutions in multiple passes, using the tests to measure compliance and to repair broken solutions. The resulting program won’t be streamed to you, because it has already existed for minutes by the time it finishes that cycle. It’s more for an asynchronous use case.

I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.

Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.

superkuh•1h ago
If anyone else was hoping this was using Q8 internally, so that converted to Q4 it could fit in 12GB VRAM: unfortunately it's already at Q4_K_M (~9GB), and the 16GB requirement comes from other parts, not from the 14B at 8-bit plus KV cache you might guess.
selcuka•1h ago
It's a race to the bottom. DeepSeek beats all the others (single-shot), and its API price is roughly half what the electricity alone would cost to run this locally.

> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot

> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline

mikestorrent•1h ago
> cheaper than the cost of local electricity only.

Can you explain what that means?

simonw•1h ago
I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.

Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.

jojobas•1h ago
China has cheap electricity.
ericd•55m ago
Well, also, LLM servers get much more efficient at request queue depth >1: tokens per second per GPU are massively higher with 100 concurrent requests than with 1 on e.g. vLLM.
atoav•18m ago
It means that the electricity you would have to pay for if you did the computations yourself would cost more than paying them to do it. Part of that has to do with the fact that China has cheap electricity, partly due to their massive push into renewables. Part of it is just economies of scale: a big server farm can run more efficiently than your PC, on average.
yogthos•34m ago
You could use this approach with DeepSeek as well. The innovation here is that you can generate a bunch of solutions, use a small model to pick promising candidates and then test them. Then you feed errors back to the generator model and iterate. In a way, it's sort of like a genetic algorithm that converges on a solution.
mmaunder•1h ago
I’d encourage devs to try MiniMax, Kimi, etc. on real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning-token use, slower outputs, and degradation that is palpable. Sadly, you do get what you pay for right now. However, that doesn’t prevent you from saving a ton through smart model routing, being careful with reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
XCSme•58m ago
Yup, they do quite poorly on random non-coding tasks:

https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...

limoce•49m ago
The title should be "Adaptive Test-time Learning and Autonomous Specialization".