N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

https://ndaybench.winfunc.com
46•mufeedvh•6h ago
N-Day-Bench tests whether frontier LLMs can find known security vulnerabilities in real repository code. Each month it pulls fresh cases from GitHub security advisories, checks out the repo at the last commit before the patch, and gives models a sandboxed bash shell to explore the codebase.
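To make "the last commit before the patch" concrete, here is a minimal sketch of how such a checkout could be prepared. It is not the benchmark's actual code: the function name, the `dest` directory, and the assumption that each advisory resolves to a single patch commit SHA are all hypothetical. It only builds the git commands and does not run them.

```python
from typing import List


def pre_patch_commands(repo_url: str, patch_sha: str, dest: str = "case_repo") -> List[List[str]]:
    """Build git commands that leave `dest` checked out at the parent of the patch commit.

    Hypothetical helper: assumes the advisory maps to exactly one non-merge patch commit.
    """
    return [
        ["git", "clone", "--quiet", repo_url, dest],
        # "<sha>^" is git revision syntax for the first parent of <sha>,
        # i.e. the last commit before the patch was applied.
        ["git", "-C", dest, "checkout", "--detach", f"{patch_sha}^"],
    ]


if __name__ == "__main__":
    for cmd in pre_patch_commands("https://example.com/org/repo.git", "abc123"):
        print(" ".join(cmd))
```

A merge commit has more than one parent, so "the commit before the patch" stops being well defined there, which may be one reason such advisories are hard to use.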

Static vulnerability discovery benchmarks become outdated quickly. Cases leak into training data, and scores start measuring memorization. The monthly refresh keeps the test set ahead of contamination — or at least makes the contamination window honest.

Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.
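The step budget can be pictured as a bounded loop. The sketch below is an illustration under assumptions, not the real harness: `next_action` and `run_shell` are hypothetical stand-ins for the model call and the sandboxed shell, and it assumes that writing the report does not itself consume a shell step.

```python
from typing import Callable, Dict, List, Tuple, Union

# An action is either a shell command (str) or a final structured report (dict).
Action = Union[str, Dict[str, str]]
History = List[Tuple[str, str]]  # (command, output) pairs seen so far


def run_finder(
    next_action: Callable[[History], Action],
    run_shell: Callable[[str], str],
    max_steps: int = 24,
) -> Dict[str, str]:
    """Let the model under test issue at most `max_steps` shell commands."""
    history: History = []
    for _ in range(max_steps):
        action = next_action(history)
        if isinstance(action, dict):
            # The model decided it has enough evidence and submitted its report.
            return action
        history.append((action, run_shell(action)))
    # Budget exhausted without a report (assumed behaviour for this sketch).
    return {"status": "no_report", "steps_used": str(max_steps)}
```

A Finder that never reports simply runs out of steps, so the budget also acts as a cost cap per case.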

Only repos with 10k+ stars qualify. A diversity pass prevents any single repo from dominating the set. Ambiguous advisories (merge commits, multi-repo references, unresolvable refs) are dropped.
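As a rough illustration of these selection rules, the sketch below filters a list of candidate advisories. The `Advisory` fields and the `max_per_repo` cap are assumptions made for the example; the post does not say how the diversity pass is actually implemented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Advisory:
    repo: str        # e.g. "org/project"
    stars: int       # GitHub star count of the repo
    ambiguous: bool  # merge commit, multi-repo reference, or unresolvable ref


def select_cases(
    advisories: Iterable[Advisory],
    min_stars: int = 10_000,
    max_per_repo: int = 2,  # hypothetical cap standing in for the diversity pass
) -> List[Advisory]:
    """Keep unambiguous advisories from popular repos, capped per repo."""
    per_repo: Counter = Counter()
    selected: List[Advisory] = []
    for adv in advisories:
        if adv.stars < min_stars or adv.ambiguous:
            continue  # fails the star threshold or is dropped as ambiguous
        if per_repo[adv.repo] >= max_per_repo:
            continue  # this repo already has its share of cases
        per_repo[adv.repo] += 1
        selected.append(adv)
    return selected
```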

Currently evaluating GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, GLM-5.1, and Kimi K2.5. All traces are public.

Methodology: https://ndaybench.winfunc.com/methodology

Live Leaderboard: https://ndaybench.winfunc.com/leaderboard

Live Traces: https://ndaybench.winfunc.com/traces

Comments

Rohinator•6h ago
Very curious how Claude Mythos will perform here
mbbutler•5h ago
It would be helpful to add in some cases that do not contain any vulnerabilities to assess false-positive rate as well.
mufeedvh•5h ago
This is a good idea.

Will incorporate false-positive rates into the rubric from the next run onwards.

At winfunc, we spent a lot of research time taming these models to bring down their false-positive rate (it's high!), so this does feel important enough to document. Thanks!

cortesoft•5h ago
Any code that we can be certain has no vulnerabilities is going to be pretty trivial to verify.
Cynddl•5h ago
> Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.

Curator, answer key, Finder, shell steps, structured report, sink hints… I understand nothing. Did you use an LLM to generate this HN submission?

It looks like a standard LLM-as-a-judge approach. Do you manually validate or verify some of the results? Done poorly, the results can be very noisy and meaningless.

peyton•5h ago
> Did you use an LLM to generate this HN submission?

Must have.

> The Finder never sees the patch.

I wasn’t worried that this eval would show the answer to the model before evaluating it. Seems requirements leaked into this post.

rohansood15•3h ago
I worked in AppSec in the past, and it made sense to me. Maybe you aren't the target audience?

You don't really need manual verification for these; the CVEs (vulnerabilities) are public and can be programmatically validated.

johnfn•3h ago
Is this really that hard to parse?

Curator and Finder are the names of the agents. "answer key" - haven't you ever taken a test in high school? It's an explanation of the answer. "shell steps" I presume means it gets to run 24 commands on the shell. "structured report" - do I really need to explain to you what a report is? "sink hints" - I admit I didn't know this one, but a bit of searching indicates that it's a hint at where the vulnerability lies.

spicyusername•2h ago
I'd love to see some of the open source models in there
linzhangrun•2h ago
Definitely possible. In January, I tried using Gemini to perform black-box/white-box testing on an existing system in my company (it's quite old). It successfully exploited a hidden SQL injection vulnerability to penetrate the system and extract password hashes (the passwords were not particularly strong, and the hashes were cracked using a public website). In terms of pure skill level, I'd say this is at least the level of a mid-level cybersecurity professional, not even considering the significant efficiency improvement.
sacrelege•1h ago
Thanks for putting N-Day-Bench together - really interesting benchmark design and results.

I'd love to see how the model we serve, Qwen3.5 122B A10B, stacks up against the rest on this benchmark. AI Router Switzerland (aiRouter.ch) can sponsor free API access for about a month if that helps for adding it to the evaluation set.
