frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

SWE-Bench Pro

https://github.com/scaleapi/SWE-bench_Pro-os
58•tosh•2h ago

Comments

siliconc0w•2h ago
Looks like the associated article is: https://scale.com/research/swe_bench_pro (link in the repo is wrong)
gpt5•2h ago
Slightly tangent question - they said that they have protected the public test set with a strong copyleft license to prevent training private models on them.

Does it actually work? Isn’t AI training so far simply ignores all license and copyright restrictions completely?

ej88•1h ago
https://scale.com/leaderboard/swe_bench_pro_commercial

I definitely trust the totally private dataset more.

stephendause•1h ago
This is a key question in my opinion. It's one of the things that make benchmarking the SWE capabilities of LLMs difficult. It's usually impossible to know whether the LLM has seen a problem before, and coming up with new, representative problem sets is time-consuming.
CuriouslyC•1h ago
You can just fuzz names and switch to a whitespace compact representation.
stri8ed•1h ago
Not a chance. Even if American companies did abide by it, there is no reason Chinese companies would. And good luck definitely proving that a model trained on it.
candiddevmike•1h ago
Sir, we've already ingested 503,377 copyleft licensed codebases, I don't think the training set can take anymore!
BoorishBears•47m ago
I feel like public datasets are something we're holding onto for LLM benchmarks for historical reasons, but need to move on from.

Older, non-instruction tuned models needed post-training on public datasets to even reliably produce meaningful answers.

Now we're testing tasks that are so complex that the LLM should reasonably be expected to answer without additional post-training.

Once you have a public dataset, even feeding those examples to an LLM and producing synthetic variations is enough to let you game the benchmark. And the worst part is you don't need to be unethical to do this: some people would say it's just a good way to expand your training data even though it incidentally allows you to overfit on the task, without overfitting on the public dataset.

So everyone's doing stuff like that, and we're getting models that are increasing overfit to a few narrow tasks.

-

The alternative is just giving detailed plain english descriptions of the tasks in question. Those can be used to generate synthetic tasks, but won't result in matching the benchmark's "shape" perfectly (as long as the questions stay hidden), and that alone is enough to ensure some level of generalization takes place.

WhitneyLand•1h ago
Recently it was pointed out that models were sometimes finding SWE-Bench verified cheats by scanning parts of the repo not meant to be visible.

Hope they’re addressing that at the same time.

hereme888•1h ago
Is it possible to benchmark the GPT-5-Pro model?
nyrikki•40m ago
> Larger models (e.g., Opus 4.1) often fail on semantic or algorithmic correctness in large, multi-file edits, whereas smaller models (e.g., Qwen 3 32B) more frequently fail due to issues in syntax and formatting, tool use, or context management.

While I haven’t dug into the details of this benchmark, this absolutely matches my personal experience.

Assuming “semantic correctness” is in the sense of Rice and runtime behavior.

While syntactic correctness has dramatically improved, security and architectural erosion and other long term issues have not.

Unfortunately Rice’s theorem also applies to finite programs in finite time too.

Actually it can apply to total functions in the general case.

I am still optimistic that coding agents will provide value long term in some fashion.

But the open domain frame problem simply reduces to the halting problem, yes and humans struggle with it too.

But fundamentally, PAC learning has to be reduced to _trivial_ problems, with only T/F.

We have found clever ways to work within these s limitations, but they still exist.

Hopefully we find clever ways to keep humans engaged with the code, while gaining the potential force multiplier that ML may offer.

The long tailed problems are particularly important, and while human SREs make mistakes and organizations often have constraints that add to the problem, SREs do a lot more to avoid those long tailed problems than they are given credit for.

IMHO that has always been one of the hardest parts of the industry and a true measure for what makes great team members.

Unfortunately the metrics and incentives often don’t capture that value.

leoh•9m ago
Frankly, several repos and tools from Google/DeepMind look a lot better.

https://github.com/google-deepmind/bbeh?tab=readme-ov-file

https://github.com/google/lmeval

I hesitate to say this lest folks adapt, but does anyone else immediately distrust a repo when it has a bunch of emojis in the README? It is often a giveaway that they were LLM-generated.

OpenAI and Nvidia announce partnership to deploy 10GW of Nvidia systems

https://openai.com/index/openai-nvidia-systems-partnership/
258•meetpateltech•2h ago•298 comments

A collection of technical things every software developer should know about

https://github.com/mtdvio/every-programmer-should-know
18•redbell•38m ago•5 comments

PlanetScale for Postgres is now GA

https://planetscale.com/blog/planetscale-for-postgres-is-generally-available
193•munns•3h ago•97 comments

Cloudflare is sponsoring Ladybird and Omarchy

https://blog.cloudflare.com/supporting-the-future-of-the-open-web/
430•jgrahamc•6h ago•266 comments

SWE-Bench Pro

https://github.com/scaleapi/SWE-bench_Pro-os
58•tosh•2h ago•12 comments

A board member's perspective of the RubyGems controversy

https://apiguy.substack.com/p/a-board-members-perspective-of-the
52•janpio•2h ago•27 comments

Qwen3-Omni: Native Omni AI Model for Text, Image & Video

https://github.com/QwenLM/Qwen3-Omni
14•meetpateltech•1h ago•1 comments

A simple way to measure knots has come unraveled

https://www.quantamagazine.org/a-simple-way-to-measure-knots-has-come-unraveled-20250922/
80•baruchel•4h ago•35 comments

The Beginner's Textbook for Fully Homomorphic Encryption

https://arxiv.org/abs/2503.05136
128•Qision•1d ago•21 comments

Mentra (YC W25) Is Hiring to build smart glasses

1•caydenpiercehax•2h ago

Choose Your Own Adventure

https://www.filfre.net/2025/09/choose-your-own-adventure/
8•naves•40m ago•0 comments

Cap'n Web: a new RPC system for browsers and web servers

https://blog.cloudflare.com/capnweb-javascript-rpc-library/
188•jgrahamc•5h ago•82 comments

Morgan and Morgan takes Disney to court over 'Steamboat Willie' in ads

https://www.clickorlando.com/news/local/2025/09/17/morgan-morgan-takes-disney-to-court-over-right...
43•wrayjustin•2d ago•25 comments

Easy Forth (2015)

https://skilldrick.github.io/easyforth/
150•pkilgore•7h ago•80 comments

CompileBench: Can AI Compile 22-year-old Code?

https://quesma.com/blog/introducing-compilebench/
101•jakozaur•6h ago•36 comments

What is algebraic about algebraic effects?

https://interjectedfuture.com/what-is-algebraic-about-algebraic-effects/
56•iamwil•4h ago•22 comments

Show HN: Python Audio Transcription: Convert Speech to Text Locally

https://www.pavlinbg.com/posts/python-speech-to-text-guide
4•Pavlinbg•44m ago•0 comments

A New Internet Business Model?

https://blog.cloudflare.com/cloudflare-2025-annual-founders-letter/
167•mmaia•3h ago•162 comments

Testing is better than Data Structures and Algorithms

https://nedbatchelder.com/blog/202509/testing_is_better_than_dsa.html
20•rsyring•2h ago•6 comments

Appleii Air Attack.BAS

https://basic-code.bearblog.dev/applesoft-air-attackbas/
4•ibobev•3d ago•0 comments

Beyond the Front Page: A Personal Guide to Hacker News

https://hsu.cy/2025/09/how-to-read-hn/
143•firexcy•9h ago•64 comments

SGI demos from long ago in the browser via WASM

https://github.com/sgi-demos
203•yankcrime•10h ago•53 comments

The Strange Tale of the Hotchkiss

https://www.edrdg.org/~jwb/mondir/hotchkiss.html
17•rwmj•1d ago•2 comments

AI-Generated "Workslop" Is Destroying Productivity

https://hbr.org/2025/09/ai-generated-workslop-is-destroying-productivity
33•McScrooge•56m ago•6 comments

The American Nations regions across North America

https://colinwoodard.com/new-map-the-american-nations-regions-across-north-america/
61•loughnane•3h ago•78 comments

Diffusion Beats Autoregressive in Data-Constrained Settings

https://blog.ml.cmu.edu/2025/09/22/diffusion-beats-autoregressive-in-data-constrained-settings/
4•djoldman•42m ago•1 comments

California issues historic fine over lawyer's ChatGPT fabrications

https://calmatters.org/economy/technology/2025/09/chatgpt-lawyer-fine-ai-regulation/
74•geox•2h ago•40 comments

Human-Oriented Markup Language

https://huml.io/
34•vishnukvmd•3h ago•33 comments

Dear GitHub: no YAML anchors, please

https://blog.yossarian.net/2025/09/22/dear-github-no-yaml-anchors
148•woodruffw•4h ago•117 comments

Anti-*: The Things We Do but Not All the Way

https://blog.jim-nielsen.com/2025/my-antis/
35•gregwolanski•3h ago•13 comments