Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

https://senior-swe-bench.snorkel.ai/
35•matt_d•2h ago

Comments

jonathanleane•1h ago
Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?
lacunary•1h ago
presumably whatever the top model uses and then some, since the human can use the model.

I wonder if a model could score higher if it had a human at its disposal?

danpalmer•1h ago
Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s

But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code and I've worked with "junior engineers" who could run rings around them.

Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer"; it's woo, and it's far better to ask for precise outcomes.

amrrs•1h ago
As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluate all aspects of real life like planning, design choices, problem solving and context usage. From your experience with humans, do you have any recommendations on what could be effective in measuring it?
allan_s•43m ago
I think the source of your issue is in your statement itself: why do you want a task that evaluates things that broad to be only a coding task? Shouldn't it be a planning task, documentation task, knowledge retrieval task, etc.? And very certainly not with just an initial prompt, but with an existing codebase + existing docs + tickets?
glaslong•22m ago
Principal-SWE-Bench will take some time to run, because the LLM needs to wait for a crisis to present its solution, having correctly identified that the same solution would have been organizationally impossible to propose until that moment.
purple-leafy•1h ago
Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.

What you really need is an objective benchmark

echelon•1h ago
> What you really need is an objective benchmark

"When are all the software engineers unemployed?"

purple-leafy•58m ago
Not sure I follow haha
eli•1h ago
I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
charcircuit•14m ago
The issue is that you can't do unsupervised learning if you require humans.
LiamPowell•1h ago
> You are a senior SWE-Bench reviewer, make no mistakes.

I don't know what a better approach would look like while still remaining feasible, but this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.

Madmallard•54m ago
next round of trust me bro benchmarks
dozerly•10m ago
Just wait for the next 100 rounds. People love seeing the 65% -> 85% seemingly over and over again for every new model.
guilhermecgs•19m ago
fable 5?
