DeepSWE: A contamination-free benchmark for long-horizon coding agents

https://deepswe.datacurve.ai/blog
27•ammar_x•5h ago

Comments

ammar_x•5h ago
https://x.com/serenaa_ge/status/2059308400866111692
dnnssl2•3h ago
70% at launch seems pretty saturated. Why ship a benchmark that frontier models are about to top out on?
vanuatu•27m ago
sell data for them to hillclimb :)
charleyslee•3h ago
tysm for posting this! i'm charley, cofounder of datacurve, we created this benchmark and my team and i are here to answer any q's.
toastmaster11•2h ago
What happened that placed Opus 4.6 on max reasoning below Sonnet 4.6 on a lowered reasoning level?
vanuatu•40m ago
This benchmark matches my experience with GPT (I occasionally go back to Claude when I run into limits and frequently run into forgotten requirements and reward hacking)

I do have two questions / critiques:

- The verifier doesn't seem to check for code quality / maintainability, which I would posit is one of the major qualms with SOTA coding models, i.e. they lack code 'taste'. Ofc this is a difficult problem to solve at scale, but I wanted to point that out nonetheless.

- This almost feels like it was written as a critique of SWE Bench Pro. Hopefully they fix the issues with that benchmark!

vanuatu•29m ago
Out of curiosity, I examined the worst task:

https://deepswe.datacurve.ai/data/trials/quill-shared-toolba...

It seems like GPT here is failing due to an environment issue with connecting to chromium, even though its local unit tests passed. All the models failed 4/4, and when I checked Opus, it had run into the same problem.

I checked some other tasks and they seemed legit, although in general the prompts seem somewhat contrived vs. what a typical user would ask their coding agent (such is the difficulty of benchmark construction)

JacobAsmuth•18m ago
I wonder why they didn't test Gemini 3.5 Flash (High).

Chemistry behind the Garden Grove chemical tank

https://www.science.org/content/blog-post/methyl-methacrylate-tank
208•nooks•5h ago•84 comments

Cloudflare Flagship

https://developers.cloudflare.com/flagship/
25•tjek•1h ago•7 comments

Agent Memory: An Anatomy

https://brgsk.xyz/agent-memory-anatomy/
20•brgsk•39m ago•6 comments

A few interesting modern pixel fonts

https://unsung.aresluna.org/a-few-interesting-modern-pixel-fonts/
234•zdw•1d ago•52 comments

I Bypassed Adobe and Microsoft to Build a Git-Tracked Book Production Pipeline

https://www.djspeckhals.com/posts/2026-05-22-how-i-bypassed-adobe-and-microsoft-to-build-a-git-tr...
151•dustin1114•4d ago•36 comments

C array types are weird

https://anselmschueler.com/blogposts/2025-c-pointers/
50•signa11•1d ago•24 comments

A portentous reunion

https://bcantrill.dtrace.org/2026/05/25/a-portentous-reunion/
39•cafkafk•18h ago•17 comments

Big tech's anti-labor playbook has come for Wikipedia

https://medium.com/@jakeorlowitz/wikipedia-is-doing-the-capitalist-thing-56a393232943
262•cdrnsf•4h ago•137 comments

Rosalind: A genomics toolkit in Rust running whole-genome pipelines on a laptop

https://github.com/logannye/rosalind
120•samuell•5d ago•29 comments

Spain blocks prediction markets Polymarket, Kalshi over lack of gambling licence

https://www.reuters.com/business/spain-blocks-prediction-markets-polymarket-kalshi-over-lack-gamb...
758•thm•11h ago•345 comments

The Steinwinter Supercargo

https://www.thedrive.com/article/12603/the-forgotten-steinwinter-supercargo-is-unlike-anything-on...
39•itronitron•3d ago•6 comments

Dropbox CEO Drew Houston to step down

https://www.cnbc.com/2026/05/26/dropbox-ceo-drew-houston-ashraf-alkarmi.html
280•aghuang•11h ago•318 comments

Launch HN: Minicor (YC P26) – Windows desktop automations at scale

https://www.minicor.com/
69•fchishtie•9h ago•46 comments

The real cost of owning a home

https://ericturner.dev/posts/cost-of-home-ownership/
268•ggcr•8h ago•592 comments

Liverpool and Manchester Railway

https://en.wikipedia.org/wiki/Liverpool_and_Manchester_Railway
6•daverol•2d ago•1 comment

The Ballad of TIGIT

https://www.owlposting.com/p/the-ballad-of-tigit
93•crescit_eundo•9h ago•17 comments

What color is your function? (2015)

https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/
91•tosh•8h ago•107 comments

C64 Basic: Game Map Overhead “Camera View”

https://retrogamecoders.com/overhead-camera-view/
75•ibobev•11h ago•11 comments

Sonny Rollins, Jazz's Saxophone Colossus and Greatest Improvisor, Dead at 95

https://www.rollingstone.com/music/music-news/sonny-rollins-jazz-legend-saxophone-colossus-dead-o...
12•boarsofcanada•54m ago•2 comments

Use boring languages with LLMs

https://jry.io/writing/use-boring-languages-with-llms/
166•evakhoury•4d ago•137 comments

Sage Care (YC S24) Is Hiring Software Engineers

https://www.ycombinator.com/companies/sagecare/jobs/xtloH8r-senior-software-engineer
1•ian-gillis•7h ago

Outsourcing plus local AI will soon become more economical vs. frontier labs

https://www.signalbloom.ai/posts/outsourcing-plus-localai-will-soon-become-more-economical-vs-fro...
238•GodelNumbering•12h ago•260 comments

Show HN: Rapel – chunked resumable downloads in unstable networks

https://github.com/redraw/rapel
4•autorun•21h ago•1 comment

Are we self-sovereign PKI yet?

https://buffrr.dev/blog/are-we-self-sovereign-pki-yet/
73•ca98am79•5d ago•44 comments

Opaque Types in Python

https://blog.glyph.im/2026/05/opaque-types-in-python.html
111•lumpa•3d ago•51 comments

RescueRadar – UK Emergency Services Flight Tracking Since 2013

https://rescueradar.co.uk/about
8•dp-hackernews•2d ago•1 comment

Netherlands blocks US takeover of vital digital supplier

https://www.politico.eu/article/netherlands-blocks-us-takeover-vital-digital-supplier/
522•vrganj•13h ago•207 comments

The worst job interview I ever had

https://www.oliverio.dev/blog/the-worst-job-interview-i-had
133•oliverio•4h ago•115 comments

Phantasy Star IV – 1993 Developer Interviews

https://shmuplations.com/phantasystariv/
133•speckx•4d ago•56 comments

The user is visibly frustrated

https://pscanf.com/s/354/
273•croes•20h ago•243 comments