Show HN: A benchmark for SAST exploit chain and evasion detection
Traditional SAST benchmarks are great at measuring simple source-to-sink taint flows, but real-world attacks have moved past that. I spent some time building a benchmark suite to test the things that current static analysis tools structurally struggle to see.
Design Principles
- Test cases are written from security knowledge, not from knowledge of any specific SAST engine's detection capabilities.
- No vulnerability hints in source code -- the CSV answer key is the ONLY ground truth. No comments, no CWE references, no category names in filenames or function names.
- A 50/50 TP/TN balance prevents classifier gaming -- a tool that flags everything scores 0%, not 100%.
- Category-averaged scoring prevents large categories from dominating small ones.
- A minimum of 25 TP + 25 TN per category ensures statistical significance (per-case swing in Youden's J ≤ 4%).
- Tool-agnostic SARIF-based scoring -- any SAST tool that exports SARIF 2.1.0 can be scored.
- 1 file = 1 test case for the baseline language benchmarks (standalone functions with no cross-file dependencies), while the Chain Detection tests explicitly use multi-file application structures.
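To make the scoring math concrete, here is a minimal sketch of category-averaged Youden's J over per-category TP/FN/TN/FP counts. The function names and dict layout are my own for illustration, not necessarily the repo's actual format:

```python
# Sketch of category-averaged Youden's J scoring.
# The dict layout here is hypothetical, not the benchmark's real data format.

def youden_j(tp: int, fn: int, tn: int, fp: int) -> float:
    """J = sensitivity + specificity - 1, ranging over [-1, 1]."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0

def benchmark_score(per_category: dict) -> float:
    """Average J across categories so large categories cannot dominate."""
    scores = [youden_j(c["tp"], c["fn"], c["tn"], c["fp"])
              for c in per_category.values()]
    return sum(scores) / len(scores)

# A tool that flags every case gets J = 0, not a perfect score:
flag_everything = {"sqli": {"tp": 25, "fn": 0, "tn": 0, "fp": 25}}
print(benchmark_score(flag_everything))  # 0.0
```

With 25 TP and 25 TN per category, flipping one case moves sensitivity or specificity by 1/25, so J shifts by at most 4 percentage points, which is where the stated per-case swing comes from.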
It focuses heavily on two main areas:
Chain Detection: 500 test cases that measure whether a tool can correlate multiple low-severity findings across different files into a compound exploit path.
Adversarial Evasion: tests of whether a tool can detect intentional concealment, such as payloads hidden inside invisible Unicode characters or visual deception using Bidi overrides.
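For a feel of what the evasion category targets, here is a minimal, hypothetical illustration (not an actual benchmark case) of the invisible-Unicode tricks involved:

```python
# Hypothetical illustration of invisible-Unicode evasion, not a benchmark case.
# A zero-width space (U+200B) makes two strings render identically in most
# fonts while comparing unequal, so a naive allowlist check can be bypassed.
visible = "admin"
smuggled = "ad\u200bmin"
assert visible != smuggled          # identical on screen, different to the machine

# A right-to-left override (U+202E) changes how a filename *displays*:
# bidi-aware renderers show this as "file_exe.txt", hiding the real .exe suffix.
deceptive = "file_\u202etxt.exe"
assert deceptive.endswith(".exe")   # the true extension is still .exe
```

A tool that only inspects rendered or normalized text misses both; the benchmark checks that the analyzer sees the raw code points.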
Since there was no public ground truth for Go, Rust, Bash, PHP, and Ruby, I also built baseline vulnerability benchmarks for those languages as part of the suite, bringing the total to over 7,700 test cases.
Building ground truth at this scale as a solo developer is a massive undertaking, and right now I have a serious echo chamber problem. I am the student taking the exam, the master designing it, and the professor grading it. It sucks, and I know I have blind spots in my test designs.
I am releasing this openly because imperfect ground truth that invites correction is more valuable than no ground truth at all. If you work in AppSec, build SAST engines, or just enjoy breaking logic, I would love your scrutiny. Finding my misclassifications and edge cases will make this infinitely more valuable for everyone.
Repo link: https://github.com/TheAuditorTool/sast-benchmark
ThailandJohn, TheAuditorTool maintainer.