Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

https://senior-swe-bench.snorkel.ai/
25•matt_d•1h ago

Comments

jonathanleane•1h ago
Top solve rate is currently 24% with Opus 4.8... What's a competent human supposed to score?
lacunary•1h ago
presumably whatever the top model uses and then some, since the human can use the model.

I wonder if a model could score higher if it had a human at its disposal?

danpalmer•1h ago
Why didn't they just make it "Staff SWE-Bench", would be much better smh. /s

But seriously, as an industry we're terrible at assessing engineering levels. I've worked with "senior engineers" who can't code, and I've worked with "junior engineers" who could run rings around them.

Benchmarks like this should be much more precise about what they're actually testing, and what axes they're hard on. We also need to rise above prompts like "you are a senior engineer"; it's woo, and it's far better to ask for precise outcomes.

amrrs•50m ago
As someone who's trying to get better assessments, I'm struggling to come up with objective coding tasks that evaluate all aspects of real life, like planning, design choices, problem solving and context usage. From your experience with humans, do you have any recommendations on what could be effective in measuring it?
allan_s•14m ago
I think the source of your issue is in your statement itself: why do you want a task that evaluates things that broad to be only a coding task? Shouldn't it be a planning task, documentation task, knowledge retrieval task, etc.? And very certainly not with just an initial prompt, but with an existing codebase + existing docs + tickets?
purple-leafy•1h ago
Benchmarks are great, but I feel like there’s a better way; this seems quite subjective.

What you really need is an objective benchmark

echelon•49m ago
> What you really need is an objective benchmark

"When are all the software engineers unemployed?"

purple-leafy•29m ago
Not sure I follow haha
eli•43m ago
I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
LiamPowell•48m ago
> You are a senior SWE-Bench reviewer, make no mistakes.

I don't know what a better approach would look like while still remaining feasible, but this approach of telling an LLM to make a subjective judgement seems fundamentally flawed.

Madmallard•24m ago
next round of trust me bro benchmarks

Web Apps for Meta Ray-Ban Display Glasses

https://wearables.developer.meta.com/docs/develop/webapps/build/
1•arbayi•34s ago•0 comments

I hacked this temu camera. what I found should be illegal. [video][8 mins]

https://www.youtube.com/watch?v=RuqQR_R-dEQ
1•Bender•53s ago•0 comments

Show HN: Claude Desktop Switcher for separating the whole Claude Desktop suite

https://matsumotory.github.io/claude-desktop-switcher/
1•matsumotory•1m ago•0 comments

Kimi K2.7 Code is generally available in GitHub Copilot

https://github.blog/changelog/2026-07-01-kimi-k2-7-is-now-available-in-github-copilot/
2•unliftedq•3m ago•0 comments

Show HN: Margarita - Programming Agents with Markdown

https://www.margarita.run
1•margarita_dev•4m ago•0 comments

US reportedly mulls pulling troops from Saudi Arabia as ties sour over Iran war

https://www.timesofisrael.com/us-reportedly-mulls-pulling-troops-from-saudi-arabia-as-ties-sour-o...
1•koolhead17•7m ago•0 comments

Herdr: One terminal to rule them all

https://herdr.dev/
1•handfuloflight•7m ago•0 comments

Microsoft/Flint-Chart

https://github.com/microsoft/flint-chart
1•geoffbp•7m ago•0 comments

Society of Saint Pius X

https://en.wikipedia.org/wiki/Society_of_Saint_Pius_X
1•febed•8m ago•0 comments

Can solar and wind and batteries provide 24/7/365 electricity?

https://unpopular-truth.com/2026/06/19/can-solar-and-wind-batteries-really-provide-24-7-365-elect...
1•dpraburaj•8m ago•0 comments

Show HN: A complete AI agency at your fingertips

https://github.com/msitarzewski/agency-agents
1•adithyaharish•12m ago•0 comments

Chemurgy

https://en.wikipedia.org/wiki/Chemurgy
1•petethomas•12m ago•0 comments

State of X402 – Independent Audit – Every Faciliator, Every Chain

https://x402stats.io
1•pro_methe5•14m ago•0 comments

Zig - All Package Management Functionality Moved from Compiler to Build System

https://ziglang.org/devlog/2026/?2026-06-30#2026-06-30
3•Retro_Dev•16m ago•0 comments

Ask HN: What are you building first with Fable?

4•akashwadhwani35•25m ago•1 comments

Trump's plan to redesign every .gov website leads to AI-designed horrors

https://arstechnica.com/tech-policy/2026/06/trumps-plan-to-redesign-every-gov-website-leads-to-ai...
2•duxup•29m ago•1 comments

Understanding Why Language Models Hallucinate: Testing Reasoning Against Priors

https://neohughus.github.io/Understanding_Why_Language_Models_Hallucinate/
1•ilreb•30m ago•0 comments

Who Shuts Down the Internet the Most?

https://pulse.internetsociety.org/en/shutdowns/
1•Bender•32m ago•0 comments

White House accelerates plans for AI model standards

https://www.ft.com/content/0bb7e2f9-007b-4577-9c4a-858948ee969a
1•OutOfHere•32m ago•0 comments

Carson Block: If you thought the global financial crisis was bad

https://www.economist.com/by-invitation/2026/06/28/if-you-thought-the-global-financial-crisis-was...
1•burntcaramel•33m ago•0 comments

Circle of Empathy

https://video.cirquedusoleil.com/originals/cirque-du-sound/season-1/ep5
1•andsoitis•34m ago•0 comments

Your AI lover will change you

https://www.newyorker.com/culture/the-weekend-essay/your-ai-lover-will-change-you
1•andsoitis•35m ago•0 comments

Arena, the AI leaderboard everyone uses, is now a $100M business

https://techcrunch.com/2026/06/29/arena-the-ai-leaderboard-everyone-uses-is-now-a-100m-business/
2•doppp•42m ago•0 comments

Ask HN: What things might help me to become inference engineer?

2•chalshik•45m ago•2 comments

Apple 'Hide My Email' Vulnerability Reveals Peoples' Real Email Addresses

https://www.404media.co/apple-hide-my-email-vulnerability-reveals-peoples-real-email-addresses/
2•phyzome•49m ago•0 comments

Show HN: One Page Life Calendar

https://beta.actions.life
1•eltonlin•49m ago•0 comments

Egg Libor Was Also Manipulated

https://www.bloomberg.com/opinion/newsletters/2026-07-01/egg-libor-was-also-manipulated
1•greyface-•54m ago•0 comments

The Graduate-School Dropout Toppling China's Academic Stars

https://www.wsj.com/science/the-graduate-school-dropout-toppling-chinas-academic-stars-3c1e5d86
1•petethomas•57m ago•0 comments

UK-US trade deal will lead to more than 200k avoidable deaths

https://www.theguardian.com/society/2026/jul/01/us-uk-drug-deal-could-result-in-229000-excess-dea...
2•secretslol•1h ago•0 comments

Stashing data in WOFF2 color glyphs to get free Brotli decompression via Canvas

https://github.com/EtherDream/brpack/
1•etherdream•1h ago•0 comments