FrontierCode

https://cognition.ai/blog/frontier-code
46•streamer45•1h ago

Comments

swyx•1h ago
:wave: i was on the team! AMA.

some headlines

- 3000 rubrics on code quality. First benchmark to measure: "would this code actually get merged?"

- 20+ expert open-source maintainer created tasks on their own repos to capture their opinion & taste.

- total 1000+ hours of real life software maintainer work captured in dataset. ON TOP of that, 40+ hours of real human work to turn that real life work into well validated and structured tasks with rubrics (even more work to turn tasks/prompts from devin-infra-specific to pluggable coding agent)

- results in 81% lower false positive rate than SWE-Bench Pro

- High quality bar: many QA stages & each task manually reviewed by Cognition researchers (examples in post)

Opus 4.8 scores 13% on FrontierCode Diamond.

one of my goals was also to datamine interesting stuff even on the easy tasks. for example, if you squint you can see the answer to "WTF Happened in late 2025" with coding models: https://x.com/swyx/status/2064081945567580323

great_psy•1h ago
How do you measure quality at scale? Is there another model that determines whether the code adheres to codebase standards?
swyx•56m ago
see Beyond Unit Tests and Novel Grading Methods in TFA.

i think something like ~60% of the rubrics are llm-as-judge and the rest are graded as described there. every rubric is validated by a maintainer. 3000 rubrics total.

tedsanders•39m ago
Very cool! So glad to see people building and sharing evals that are better than SWE bench.

I'm curious - any particular reason you didn't put error bars on the graphs? Seems like it could be helpful when there are only 50 unique problems in the diamond set.

swyx•28m ago
*50 unique problems but 20-40 rubrics per problem (something I had to keep reminding people internally who were unimpressed with the N)

simple answer is our reporting was pass@5. i feel like you'd need 50+ runs to have reasonable confidence intervals, which somehow i don't see other people doing, so i also didn't insist on it.

hoping to work with <prominent third party evals shop> to get this on their infra and evaluated along with whatever the industry standard is.
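
For readers following the error-bar question, here is a minimal sketch of how pass@k and a confidence interval could be computed together. It is not the method used in the post. It assumes each run of a problem yields a binary pass/fail (the thread notes 20-40 rubrics per problem, so the real scoring may be finer-grained), uses the standard unbiased pass@k estimator, and puts a percentile bootstrap over problems on top. The function names, the run count of 10, and the data are all made up for illustration.

```python
import random
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n runs, c of them passed."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0  # every size-k subset of runs contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def bootstrap_ci(per_problem, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling whole problems."""
    rng = random.Random(seed)
    m = len(per_problem)
    means = sorted(
        sum(rng.choices(per_problem, k=m)) / m for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic data, NOT FrontierCode results: 50 problems, 10 runs each (assumed).
rng = random.Random(1)
passes = [rng.choice([0, 0, 0, 0, 1, 2]) for _ in range(50)]
scores = [pass_at_k(n=10, c=c, k=5) for c in passes]
mean = sum(scores) / len(scores)
low, high = bootstrap_ci(scores)
print(f"pass@5 = {mean:.3f}, 95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```

Resampling problems rather than individual runs reflects the concern raised above: with only 50 unique problems, most of the uncertainty comes from which problems happen to be in the set, and more runs per problem only tighten the per-problem estimate.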

typs•4m ago
What did you do around cross-harness testing? I don't see anything in the blog post about what harnesses were used in evaluation. SOTA benchmarks have consistently shown that frontier model performance is quite sensitive to what tools are exposed (e.g. str_replace vs. apply_patch) as the labs are RLing on their own harnesses. Did you do testing of the models in a standard setup or in their native harnesses?
singpolyma3•1h ago
Since no one knows or can agree on what "code quality" is and we can't measure it for human output, I'm dubious about measuring it for LLMs
fHr•57m ago
babe wake up another eval dropped
einpoklum•33m ago
> Today’s coding benchmarks have established that models can write correct code.

I wouldn't say that.

> But as AI-generated code becomes the dominant path to production

I really hope that's not the case.

zakisaad•7m ago
How do you define "correct" code?
vessenes•31m ago
This looks great. Well reasoned, tons of work put into eval, thanks for building it.

It strikes me as kind of wild that good evals can drive tens to hundreds of millions of dollars of compute deployment in the wild. There’s something new and collaborative and competitive about the eval / frontier model race that’s quite interesting.

In this case “shorter, actually mergeable patches that open source maintainers would accept” feels like a great thing to deliver to the world.

I didn’t deep dive into good and bad patches, but I wonder if swyx or others on the team have predictions on saturation. Both when, and how useful will it be? That is, do you guys think this test is broad enough as written to get better behavior out of models, and if there is saturation on this test, will we see generalized better patch / coding behavior?

swyx•23m ago
thanks - credit to silas, eric, ben, and team for the depth of the evals, and the rest of the research team for doing the transcript reading parties lol

by nature of being based on open source, frontiercode public will saturate very very quickly. frontiercode main will be >80% in less than a year. hopefully diamond will last a bit longer. we can do annual refreshes, but that's not my strategy for staying relevant. what i'm more excited to get funding for is a private held-out version of frontiercode based on repros of real enterprise customer problems. in an ideal agent lab (https://latent.space/p/agent-labs) you meticulously build up this domain understanding, and that is essentially why both model labs and serious customers come to you.

OpenAI Submits S-1 Draft to SEC

https://openai.com/index/openai-submits-confidential-s-1/
108•hackerBanana•1h ago•47 comments

Surveillance Is Not Safety: A statement on the UK's latest threat to privacy [pdf]

https://signal.org/blog/pdfs/2026-06-08-uk-surveillance-is-not-safety.pdf
256•g0xA52A2A•3h ago•65 comments

Siri AI

https://www.apple.com/apple-intelligence/
305•0xedb•4h ago•236 comments

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

https://mimo.xiaomi.com/blog/mimo-tilert-1000tps
450•gainsurier•7h ago•308 comments

Show HN: Performative-UI – A react component library of design tropes

https://vorpus.github.io/performativeUI/
688•lizhang•8h ago•141 comments

Apple reveals new AI architecture built around Google Gemini models

https://www.macrumors.com/2026/06/08/apple-reveals-new-ai-architecture/
282•unclefuzzy•3h ago•251 comments

Why are cells small?

https://burrito.bio/essays/what-limits-a-cells-size
89•mailyk•3h ago•38 comments

EU-banned pesticides found in rice, tea and spices

https://www.foodwatch.org/en/eu-banned-pesticides-found-in-rice-tea-and-spices
185•john-titor•6h ago•66 comments

Apple Core AI Framework

https://developer.apple.com/documentation/coreai/
95•hmokiguess•3h ago•8 comments

xAI is looking more like a datacentre REIT than a frontier lab

https://martinalderson.com/posts/xais-new-rental-business/
336•martinald•7h ago•253 comments

Anti-social: It's fads, not friends, which now dominate social media feeds

https://www.bbc.com/worklife/article/20260520-how-social-media-ceased-to-be-social
499•1vuio0pswjnm7•10h ago•374 comments

Show HN: Gitdot – a better GitHub. Open-source, written in Rust

https://gitdot.io/
82•baepaul•5h ago•72 comments

Ask HN: What are tools you have made for yourself since the advent of AI?

92•aryamaan•4h ago•150 comments

FrontierCode

https://cognition.ai/blog/frontier-code
46•streamer45•1h ago•12 comments

Launch HN: Intuned (YC S22) – Build and run reliable browser automations as code

https://intunedhq.com
96•fkilaiwi•9h ago•44 comments

Doing Something That's Never Been Done Before

https://talglobus.com/p/doing-something-thats-never-been-done-before/
8•surprisetalk•3d ago•0 comments

I'm building a parallel internet, and it's called The Thinnernet

https://inavoyage.blogspot.com/2026/06/im-building-parallel-internet-and-its.html
36•initramfs•2h ago•30 comments

Fooling Go's X.509 Certificate Verification

https://danielmangum.com/posts/fooling-go-x509-certificate-verification/
22•hasheddan•2d ago•11 comments

Switzerland will have a referendum to cap population at 10M

https://www.admin.ch/en/sustainability-initiative
188•napolux•3h ago•394 comments

AI is slowing down

https://www.wheresyoured.at/ai-is-slowing-down/
311•crescit_eundo•6h ago•345 comments

Stop the Apple Music app from launching

https://lowtechguys.com/musicdecoy/
537•bobbiechen•5h ago•213 comments

OCaml Onboarding: Introduction to the Dune build system

https://ocamlpro.com/blog/2025_07_29_ocaml_onboarding_introduction_to_dune/
135•andrewstetsenko•4d ago•16 comments

120k Lines of Rust: Inside the Nosdesk Backend

https://kyle.au/blog/nosdesk-backend-rust
28•kylephillipsau•2d ago•1 comment

Massachusetts bans sale of precise location data in new privacy rights bill

https://techcrunch.com/2026/06/08/massachusetts-votes-to-pass-new-privacy-rights-bill-that-bans-s...
208•01-_-•5h ago•34 comments

1worldflag: A blue dot on a transparent background

https://1worldflag.com/
153•davidbarker•21h ago•129 comments

The Cypherpunk Library

https://www.cypherpunkbooks.com
345•yu3zhou4•14h ago•94 comments

Using XDG-Compliant Config Files (2024)

https://wxwidgets.org/blog/2024/01/using-xdg-compliant-config-files/
27•ankitg12•4d ago•5 comments

How much of Thermo Fisher's antibody data has been manipulated?

https://reeserichardson.blog/2026/05/28/how-much-of-thermo-fishers-antibody-data-has-been-manipul...
381•mhrmsn•15h ago•85 comments

Show HN: Courtside – TUI for NBA Games

https://github.com/NolanFogarty/courtside
10•nolanfogarty•2d ago•3 comments

Apple WWDC 2026

https://www.apple.com/apple-events/event-stream/
224•nextstep•5h ago•434 comments