frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Visual data modelling in the browser (open source)

https://github.com/sqlmodel/sqlmodel
1•Sean766•37s ago•0 comments

Show HN: Tharos – CLI to find and autofix security bugs using local LLMs

https://github.com/chinonsochikelue/tharos
1•fluantix•1m ago•0 comments

Oddly Simple GUI Programs

https://simonsafar.com/2024/win32_lights/
1•MaximilianEmel•1m ago•0 comments

The New Playbook for Leaders [pdf]

https://www.ibli.com/IBLI%20OnePagers%20The%20Plays%20Summarized.pdf
1•mooreds•1m ago•0 comments

Interactive Unboxing of J Dilla's Donuts

https://donuts20.vercel.app
1•sngahane•3m ago•0 comments

OneCourt helps blind and low-vision fans to track Super Bowl live

https://www.dezeen.com/2026/02/06/onecourt-tactile-device-super-bowl-blind-low-vision-fans/
1•gaws•4m ago•0 comments

Rudolf Vrba

https://en.wikipedia.org/wiki/Rudolf_Vrba
1•mooreds•5m ago•0 comments

Autism Incidence in Girls and Boys May Be Nearly Equal, Study Suggests

https://www.medpagetoday.com/neurology/autism/119747
1•paulpauper•6m ago•0 comments

Wellness Hotels Discovery Application

https://aurio.place/
1•cherrylinedev•7m ago•1 comments

NASA delays moon rocket launch by a month after fuel leaks during test

https://www.theguardian.com/science/2026/feb/03/nasa-delays-moon-rocket-launch-month-fuel-leaks-a...
1•mooreds•7m ago•0 comments

Sebastian Galiani on the Marginal Revolution

https://marginalrevolution.com/marginalrevolution/2026/02/sebastian-galiani-on-the-marginal-revol...
1•paulpauper•10m ago•0 comments

Ask HN: Are we at the point where software can improve itself?

1•ManuelKiessling•11m ago•0 comments

Binance Gives Trump Family's Crypto Firm a Leg Up

https://www.nytimes.com/2026/02/07/business/binance-trump-crypto.html
1•paulpauper•11m ago•0 comments

Reverse engineering Chinese 'shit-program' for absolute glory: R/ClaudeCode

https://old.reddit.com/r/ClaudeCode/comments/1qy5l0n/reverse_engineering_chinese_shitprogram_for/
1•edward•11m ago•0 comments

Indian Culture

https://indianculture.gov.in/
1•saikatsg•14m ago•0 comments

Show HN: Maravel-Framework 10.61 prevents circular dependency

https://marius-ciclistu.medium.com/maravel-framework-10-61-0-prevents-circular-dependency-cdb5d25...
1•marius-ciclistu•14m ago•0 comments

The age of a treacherous, falling dollar

https://www.economist.com/leaders/2026/02/05/the-age-of-a-treacherous-falling-dollar
2•stopbulying•14m ago•0 comments

Ask HN: AI Generated Diagrams

1•voidhorse•17m ago•0 comments

Microsoft Account bugs locked me out of Notepad – are Thin Clients ruining PCs?

https://www.windowscentral.com/microsoft/windows-11/windows-locked-me-out-of-notepad-is-the-thin-...
4•josephcsible•17m ago•0 comments

Show HN: A delightful Mac app to vibe code beautiful iOS apps

https://milq.ai/hacker-news
5•jdjuwadi•20m ago•1 comments

Show HN: Gemini Station – A local Chrome extension to organize AI chats

https://github.com/rajeshkumarblr/gemini_station
1•rajeshkumar_dev•20m ago•0 comments

Welfare states build financial markets through social policy design

https://theloop.ecpr.eu/its-not-finance-its-your-pensions/
2•kome•24m ago•0 comments

Market orientation and national homicide rates

https://onlinelibrary.wiley.com/doi/10.1111/1745-9125.70023
4•PaulHoule•24m ago•0 comments

California urges people avoid wild mushrooms after 4 deaths, 3 liver transplants

https://www.cbsnews.com/news/california-death-cap-mushrooms-poisonings-liver-transplants/
1•rolph•25m ago•0 comments

Matthew Shulman, co-creator of Intellisense, died 2019 March 22

https://www.capenews.net/falmouth/obituaries/matthew-a-shulman/article_33af6330-4f52-5f69-a9ff-58...
3•canucker2016•26m ago•1 comments

Show HN: SuperLocalMemory – AI memory that stays on your machine, forever free

https://github.com/varun369/SuperLocalMemoryV2
1•varunpratap369•27m ago•0 comments

Show HN: Pyrig – One command to set up a production-ready Python project

https://github.com/Winipedia/pyrig
1•Winipedia•29m ago•0 comments

Fast Response or Silence: Conversation Persistence in an AI-Agent Social Network [pdf]

https://github.com/AysajanE/moltbook-persistence/blob/main/paper/main.pdf
1•EagleEdge•29m ago•0 comments

C and C++ dependencies: don't dream it, be it

https://nibblestew.blogspot.com/2026/02/c-and-c-dependencies-dont-dream-it-be-it.html
1•ingve•30m ago•0 comments

Show HN: Vbuckets – Infinite virtual S3 buckets

https://github.com/danthegoodman1/vbuckets
1•dangoodmanUT•30m ago•0 comments
Open in hackernews

SWE-Bench Pro

https://github.com/scaleapi/SWE-bench_Pro-os
101•tosh•4mo ago

Comments

siliconc0w•4mo ago
Looks like the associated article is: https://scale.com/research/swe_bench_pro (link in the repo is wrong)
gpt5•4mo ago
Slightly tangent question - they said that they have protected the public test set with a strong copyleft license to prevent training private models on them.

Does it actually work? Isn’t AI training so far simply ignores all license and copyright restrictions completely?

ej88•4mo ago
https://scale.com/leaderboard/swe_bench_pro_commercial

I definitely trust the totally private dataset more.

stephendause•4mo ago
This is a key question in my opinion. It's one of the things that make benchmarking the SWE capabilities of LLMs difficult. It's usually impossible to know whether the LLM has seen a problem before, and coming up with new, representative problem sets is time-consuming.
CuriouslyC•4mo ago
You can just fuzz names and switch to a whitespace compact representation.
Uehreka•4mo ago
If you fuzz the names they won’t mean the same thing anymore, and then it’s no longer the same test. If you remove the whitespace the LLM will just run a formatter on the code. It’s not like the LLM just loads in all the code and then starts appending its changes.
CuriouslyC•4mo ago
I've never had a LLM try to run a formatter on my code with probably a few thousand hours logged driving agents (driving 4+ agents at once in most of those). Fuzzing makes the semantics slightly less immediately obvious, but LLMs are more robust to this than you or I, the biggest difference is the reduction in memorization carryover. If it feels like too different of a test for you, not sure what to tell you, but I know the world would appreciate a better way to test for training set contamination if you can figure one out.
flare_blitz•4mo ago
And your basis for saying this is...?
CuriouslyC•4mo ago
I've done it? I have a benchmark called scramblebench that will do rewriting to evaluate model performance degradation with symbol replacement and layers of indirection.
stri8ed•4mo ago
Not a chance. Even if American companies did abide by it, there is no reason Chinese companies would. And good luck definitely proving that a model trained on it.
candiddevmike•4mo ago
Sir, we've already ingested 503,377 copyleft licensed codebases, I don't think the training set can take anymore!
ipnon•4mo ago
Snark is against the rules but sometimes a good one-liner says more than a whole paragraph.
BoorishBears•4mo ago
I feel like public datasets are something we're holding onto with LLM benchmarks for historical reasons, but need to move on from.

Older, non-instruction tuned models needed post-training on public datasets to even reliably produce meaningful answers.

Now we're testing tasks that are so complex that the LLM should reasonably be expected to answer without additional post-training.

Once you have a public dataset, even feeding those examples to an LLM and producing synthetic variations is enough to let you game the benchmark. And the worst part is you don't need to be unethical to do this: some people would say it's just a good way to expand your training data even though it incidentally allows you to overfit on the task, without overfitting on the public dataset.

So everyone's doing stuff like that, and we're getting models that are increasing overfit to a few narrow tasks.

-

The alternative is just giving detailed plain english descriptions of the tasks in question. Those can be used to generate synthetic tasks, but won't result in matching the benchmark's "shape" perfectly (as long as the questions stay hidden), and that alone is enough to ensure some level of generalization takes place.

joefkelley•4mo ago
I happen to have worked on exactly this at Google. No, we don't train on restrictively-licensed code to the best of our abilities.
pama•4mo ago
Out of curiosity, and IANAL, what is it in a GPL/copyleft license that would make it undesirable to train LLMs on projects with such license? Or are there stronger yet copyleft licenses that you had in mind?

(FWIW, and not directly related to my question, I always thought of GPL as the less (not more) restrictive license from the perspective of a user, because I could always ask for the source code and debug a problem on my own.)

kenstler•4mo ago
One of the authors here -- we should clarify that strong copyleft license is a best attempt at decontamination for the public set. It's part of the tradeoff for having an open source set -- true decontamination is available with the private commercial set, but we can't release these, and if we did they'd be immediately susceptible to future contamination.
heavyset_go•4mo ago
If courts find model training and inference to be fair use of data sets, licenses mean nothing.

It looks like one court did in a non-precedent binding case, but I might be remembering incorrectly.

WhitneyLand•4mo ago
Recently it was pointed out that models were sometimes finding SWE-Bench verified cheats by scanning parts of the repo not meant to be visible.

Hope they’re addressing that at the same time.

hereme888•4mo ago
Is it possible to benchmark the GPT-5-Pro model?
nyrikki•4mo ago
> Larger models (e.g., Opus 4.1) often fail on semantic or algorithmic correctness in large, multi-file edits, whereas smaller models (e.g., Qwen 3 32B) more frequently fail due to issues in syntax and formatting, tool use, or context management.

While I haven’t dug into the details of this benchmark, this absolutely matches my personal experience.

Assuming “semantic correctness” is in the sense of Rice and runtime behavior.

While syntactic correctness has dramatically improved, security and architectural erosion and other long term issues have not.

Unfortunately Rice’s theorem also applies to finite programs in finite time too.

Actually it can apply to total functions in the general case.

I am still optimistic that coding agents will provide value long term in some fashion.

But the open domain frame problem simply reduces to the halting problem, yes and humans struggle with it too.

But fundamentally, PAC learning has to be reduced to _trivial_ problems, with only T/F.

We have found clever ways to work within these s limitations, but they still exist.

Hopefully we find clever ways to keep humans engaged with the code, while gaining the potential force multiplier that ML may offer.

The long tailed problems are particularly important, and while human SREs make mistakes and organizations often have constraints that add to the problem, SREs do a lot more to avoid those long tailed problems than they are given credit for.

IMHO that has always been one of the hardest parts of the industry and a true measure for what makes great team members.

Unfortunately the metrics and incentives often don’t capture that value.

leoh•4mo ago
Frankly, several repos and tools from Google/DeepMind look a lot better.

https://github.com/google-deepmind/bbeh?tab=readme-ov-file

https://github.com/google/lmeval

I hesitate to say this lest folks adapt, but does anyone else immediately distrust a repo when it has a bunch of emojis in the README? It is often a giveaway that they were LLM-generated.

scosman•4mo ago
Unless this is actually made by the SWE Bench team, and I see no evidence it is, this name is incredibly poor form. Just adding "Pro" to someone else's name not only is infringing on their mark, but implying yours is superior.
stathibus•4mo ago
its the new YOLOv*
philip1209•4mo ago
JavaScript
burgerquizz•4mo ago
why should we trust this benchmark more than an other for coding? geniune question, there are so many out there
segmondy•4mo ago
Silly, if you are going to come up with a new benchmark, then add capable models, they have Opus, Gemini Pro, and then Qwen3-32B.

Why not qwen3-coder-480b, qwen3-235b-instruct, deepseek-v3.1, kimi-k2, GLM-4.5, gpt-oss-120b?

tootyskooty•4mo ago
would be nice to finally see multi-turn coding benchmarks. everything we have so far is single-turn and that's clearly not a realistic scenario.
yangcheng•4mo ago
The public dataset only contains 3 or 4 languages. go-280 python-266 js-165 ts-20

I hope in future the benchmark can cover other widely used languages, such as c++, java, swift, rust etc.