AI2: Open Coding Agents

https://allenai.org/blog/open-coding-agents
64•publicmatt•4h ago

Comments

jauntywundrkind•2h ago
Awesome stuff. Output speed looks crazy fast too.

I wonder if this will indeed start prompting more language-specific work.

Afaik training still requires not just looking at sample code, but also being able to write loss functions and to have problems the AI can actually work on. That seems hard.

One random thought: are there training styles where you just delete some code from "good" projects, then make the AI get it working again?

CuriouslyC•1h ago
The technique people use is to capture PR diffs from public repos, extract the tests, and then check whether agents can reconstruct a patch that satisfies those tests.
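
A minimal sketch of that mining loop (hedged: the pr dict fields, test_cmd, and the "test in path" split heuristic are hypothetical stand-ins, not the actual SWE-bench/SWE-smith pipeline):

    import subprocess

    def split_patch(diff_text):
        """Split a unified diff into source hunks and test-file hunks."""
        src, tests, current, is_test = [], [], [], False
        for line in diff_text.splitlines(keepends=True):
            if line.startswith("diff --git"):
                # Flush the previous file's hunks, then start a new block.
                (tests if is_test else src).extend(current)
                current, is_test = [line], "test" in line  # crude heuristic
            else:
                current.append(line)
        (tests if is_test else src).extend(current)
        return "".join(src), "".join(tests)

    def make_task(repo_dir, pr):
        """Turn one merged PR into a (repo state, failing tests) task.

        pr is a hypothetical dict with "diff", "base_commit", "test_cmd".
        """
        src_patch, test_patch = split_patch(pr["diff"])
        subprocess.run(["git", "checkout", pr["base_commit"]],
                       cwd=repo_dir, check=True)
        # Apply only the new/changed tests; the source fix is withheld,
        # so the tests fail until the agent reconstructs the patch.
        subprocess.run(["git", "apply", "-"], input=test_patch.encode(),
                       cwd=repo_dir, check=True)
        return {"repo": repo_dir,
                "fail_to_pass_cmd": pr["test_cmd"],
                "reference_patch": src_patch}

The agent's patch is then scored by running the task's test command and checking that the previously failing tests now pass.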
ahmadyan•1h ago
Claims in the article are incorrect. They conveniently ignore Meta's CWM models, which are open-source [1] and open-weight [2], sit at 65% SWE-bench Verified (with TTS) and 54% pass@1, and are the same size (32B dense). So making claims like "surpassing prior open-source state-of-the-art coding models of comparable sizes and context lengths" while leaving the previous OSS SOTA out of your eval tables is ... sketch.

[1] https://github.com/facebookresearch/cwm
[2] https://huggingface.co/facebook/cwm

philipkglass•1h ago
The difference is that the Allen Institute models have open training data, not just open code and weights. Meta doesn't share the training data you would need to reproduce their final models. For many uses open-weight models are nearly as good, but for advancing research it's much better to have everything in the open.
kevmo314•58m ago
Reading their paper, it wasn't trained from scratch; it's a fine-tune of a Qwen3-32B model. I think this approach is correct, but it does mean that only a subset of the training data is really open.
mhitza•1h ago
The linked open-weight model disallows commercial use and is licensed only for research purposes.
ethan_l_shen•1h ago
Hey! These are great observations. So first, while TTS can improve performance, we wanted to evaluate the raw capability of our model. This meant generating only one rollout per evaluation instance, which follows other papers in the space like SWE-smith and BugPilot. In addition, TTS adds extra inference cost and is reliant on how rollouts are ranked, two confounding factors for deployable models where memory and inference speed are extremely important.

Following that line of reasoning, context length is another very large confounding factor. Longer context lengths improve performance, but they also result in enormous increases in KV cache size and memory requirements. We decided to control for this in our paper and focus on the 32K context length for 32B-size models, a context length that already pushes the bounds of what can be "deployable" locally.

Still, we evaluate at 64K context length using YaRN and are able to outperform CWM's 54% (non-TTS) performance, which it achieves using 128K context, a substantial increase over what we use. This is also pretty significant because we only ever train at 32K context, while CWM trains at a full 128K.
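
For a concrete sense of the KV cache trade-off, a back-of-the-envelope calculation (assuming Qwen3-32B-style attention dimensions: 64 layers, 8 KV heads via GQA, head dim 128, fp16 cache; check the actual model config before trusting these numbers):

    # KV cache per sequence: 2 (K and V) x layers x kv_heads x head_dim
    # x seq_len x bytes. Dims are assumed Qwen3-32B-style, not from the paper.
    layers, kv_heads, head_dim, bytes_per_val = 64, 8, 128, 2  # fp16

    def kv_cache_gib(seq_len):
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val / 2**30

    for ctx in (32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):5.1f} GiB")
    # 32K -> ~8 GiB, 64K -> ~16 GiB, 128K -> ~32 GiB per concurrent sequence

So going from 32K to 128K roughly quadruples the per-sequence cache on top of the weights, which is the memory pressure described above.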

khimaros•1h ago
It's great to see this kind of progress in reproducible weights, but color me confused: this claims to be better and smaller than Devstral-Small-2-24B, while clocking in at 32B (larger) and scoring worse?
ethan_l_shen•1h ago
Hey! We are able to outperform Devstral-Small-2-24B when specializing on repositories, and come well within the range of uncertainty with our best SERA-32B model. That being said, our model is a bit larger than Devstral 24B. Could you point out what in the paper gave the impression that we were smaller? If there's something unclear, we would love to revise it.
khimaros•29m ago
"SERA-32B is the first model in Ai2's Open Coding Agents series. It is a state-of-the-art open-source coding agent that achieves 49.5% on SWE-bench Verified, matching the performance of much larger models like Devstral-Small-2 (24B)" from https://huggingface.co/allenai/SERA-32B
ethan_l_shen•23m ago
Ah, great catch. I don't know how we missed that. Thanks! Will fix.
augusteo•1h ago
The ahmadyan comparison is fair. Meta's CWM models hitting 65% vs SERA's 54% is a meaningful gap.

But the interesting number here isn't accuracy. It's the $400 to reproduce top open-source performance. That's the part that matters for teams building internal tooling.

We've been running agents on proprietary codebases at work. The pain isn't model quality. It's customization. Most off-the-shelf agents don't understand your repo structure, your conventions, your test patterns. If you can fine-tune a 32B model on your own codebase for a few hundred dollars, that changes the economics completely (rough sketch at the end of this comment).

But codebases change every day, so fine-tuning will have to be done continuously!

Probably not worth it versus something like Claude Code.

Curious whether anyone's tried this on non-Python codebases. Most SWE-Bench stuff is Python-heavy.
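
On that fine-tuning point, a minimal LoRA sketch (hedged: the base model choice, hyperparameters, and "trajectories.jsonl" are illustrative guesses, not the SERA recipe):

    from datasets import Dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    base = "Qwen/Qwen3-32B"  # any open base model
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    # LoRA trains a tiny adapter instead of all 32B weights, which is
    # where a few-hundred-dollar bill on rented GPUs becomes plausible.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]))

    def encode(ex):
        ids = tok(ex["text"], truncation=True, max_length=4096)
        ids["labels"] = ids["input_ids"].copy()
        return ids

    # trajectories.jsonl: hypothetical agent rollouts / diffs from your repo.
    data = Dataset.from_json("trajectories.jsonl").map(
        encode, remove_columns=["text"])

    Trainer(model=model,
            args=TrainingArguments("out", per_device_train_batch_size=1,
                                   gradient_accumulation_steps=16,
                                   num_train_epochs=1, bf16=True),
            train_dataset=data).train()

In practice you'd likely load the base model quantized (QLoRA-style) to fit on fewer GPUs, and re-run the job whenever the repo drifts far enough from the adapter.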

storystarling•1h ago
The fine-tuning overhead is definitely a factor, but for smaller shops the hard constraint is usually inference VRAM. Running a 32B model locally or on a rented GPU is surprisingly expensive if you aren't saturating it. Even at 4-bit quantization you are looking at dual 3090s or an A6000 to get decent tokens per second. The $400 training cost is impressive but the hosting bill is what actually kills the margin compared to per-token APIs.
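
The weight math behind that, roughly (pure back-of-the-envelope; real quantized checkpoints carry extra overhead for scales, and the KV cache comes on top):

    # Rough weight-memory estimate for a 32B dense model.
    params = 32e9
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{name}: {params * bits / 8 / 2**30:5.1f} GiB")
    # fp16 ~59.6 GiB, int8 ~29.8 GiB, int4 ~14.9 GiB; before cache and
    # activations, hence ~48 GB (dual 3090s / one A6000) as a practical floor.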
nickandbro•1h ago
Great work! Really respect AI2; they open-source everything: the model, the weights, the training pipeline, the inference stack, and the corpus.
Imustaskforhelp•58m ago
Hey, this looks great! Is it available on OpenRouter?

I wish AI2 could release a denser model than the 8B one for free on OpenRouter, as I was using the Devstral model for agentic purposes.

If we can get a good agentic 32B-class model on OpenRouter for ~free, I feel like it will be very interesting to see how things go.

Good luck, AI2! The premise of truly open-source models is really interesting, and I feel like it could bring more innovation to the space imo!

TikTok settles just before social media addiction trial begins

https://www.bbc.com/news/articles/c24g8v6qr1mo
31•ourmandave•44m ago•8 comments

Prism

https://openai.com/index/introducing-prism
180•meetpateltech•3h ago•94 comments

430k-year-old well-preserved wooden tools are the oldest ever found

https://www.nytimes.com/2026/01/26/science/archaeology-neanderthals-tools.html
268•bookofjoe•5h ago•146 comments

Parametric CAD in Rust

https://campedersen.com/vcad
11•ecto•46m ago•3 comments

A few random notes from Claude coding quite a bit last few weeks

https://twitter.com/karpathy/status/2015883857489522876
101•bigwheels•1d ago•126 comments

Lennart Poettering, Christian Brauner founded a new company

https://amutable.com/about
131•hornedhob•2h ago•155 comments

SoundCloud Data Breach Now on HaveIBeenPwned

https://haveibeenpwned.com/Breach/SoundCloud
106•gnabgib•4h ago•44 comments

FBI is investigating Minnesota Signal chats tracking ICE

https://www.nbcnews.com/tech/internet/fbi-investigating-minnesota-signal-minneapolis-group-ice-pa...
265•duxup•3h ago•240 comments

Doing the thing is doing the thing

https://www.softwaredesign.ing/blog/doing-the-thing-is-doing-the-thing
97•prakhar897•15h ago•35 comments

Hypercubic (YC F25) Is Hiring a Founding SWE and COBOL Engineer

https://www.ycombinator.com/companies/hypercubic/jobs
1•sai18•2h ago

Show HN: One Human + One Agent = One Browser From Scratch in 20K LOC

https://emsh.cat/one-human-one-agent-one-browser/
86•embedding-shape•8h ago•50 comments

Amazon closing its Fresh and Go stores

https://finance.yahoo.com/news/amazon-closing-fresh-grocery-convenience-150437789.html
95•trenning•5h ago•290 comments

Show HN: LemonSlice – Upgrade your voice agents to real-time video

38•lcolucci•3h ago•55 comments

Clawdbot Renames to Moltbot

https://github.com/moltbot/moltbot/commit/6d16a658e5ebe6ce15856565a47090d5b9d5dfb6
104•philip1209•3h ago•65 comments

OpenSSL: Stack buffer overflow in CMS AuthEnvelopedData parsing

https://openssl-library.org/news/vulnerabilities/#CVE-2025-15467
62•MagerValp•4h ago•35 comments

How many chess games are possible?

https://win-vector.com/2026/01/27/how-many-chess-games-are-possible/
6•jmount•1h ago•0 comments

I made my own Git

https://tonystr.net/blog/git_immitation
298•TonyStr•10h ago•134 comments

Arm's Cortex A725 Ft. Dell's Pro Max with GB10

https://chipsandcheese.com/p/arms-cortex-a725-ft-dells-pro-max
21•pixelpoet•2h ago•1 comment

Flexible use of a multi-purpose tool by a cow

https://doi.org/10.1016/j.cub.2025.11.059
73•PlaceboGazebo•6d ago•12 comments

Time Station Emulator

https://github.com/kangtastic/timestation
5•FriedPickles•47m ago•0 comments

Show HN: I Wrapped the Zorks with an LLM

https://infocom.tambo.co/
5•alecf•23m ago•1 comment

Try text scaling support in Chrome Canary

https://www.joshtumath.uk/posts/2026-01-27-try-text-scaling-support-in-chrome-canary/
4•linolevan•2h ago•0 comments

LLM-as-a-Courtroom

https://falconer.com/notes/llm-as-a-courtroom/
6•jmtulloss•2h ago•0 comments

The threat eating away at museum treasures

https://www.scientificamerican.com/article/how-extremophile-molds-are-destroying-museum-artifacts/
16•sohkamyung•4d ago•7 comments

Xfwl4 – The Roadmap for a Xfce Wayland Compositor

https://alexxcons.github.io/blogpost_15.html
230•pantalaimon•7h ago•174 comments

Chuck Klosterman on why we've never actually seen a real football game

https://www.latimes.com/entertainment-arts/books/story/2026-01-22/chuck-klosterman-new-book-football
19•proposal•1h ago•35 comments

Bassoontracker, Tracking in the Browser

https://www.stef.be/bassoontracker/
20•jdboyd•12h ago•5 comments

Avoiding duplicate objects in Django querysets

https://johnnymetz.com/posts/avoiding-duplicate-objects-in-django-querysets/
6•johnnymetz•4d ago•1 comment

Twin – The AI Company Builder

https://twin.so/
9•bkolobara•1h ago•5 comments