Show HN: A benchmark for SAST exploit chain and evasion detection
Traditional SAST benchmarks are great at measuring simple source-to-sink taint flows, but real-world attacks have moved past that. I spent some time building a benchmark suite to test the things that current static analysis tools structurally struggle to see.
Design Principles
- Test cases are written from security knowledge, not from knowledge of any specific SAST engine's detection capabilities.
- No vulnerability hints in source code -- the CSV answer key is the ONLY ground truth. No comments, no CWE references, no category names in filenames or function names.
- A 50/50 TP/TN balance prevents classifier gaming -- a tool that flags everything scores 0%, not 100%.
- Category-averaged scoring prevents large categories from dominating small ones.
- A minimum of 25 TP + 25 TN per category ensures statistical significance (per-case swing in Youden's J ≤ 4%).
- Tool-agnostic SARIF-based scoring -- any SAST tool that exports SARIF 2.1.0 can be scored.
- 1 file = 1 test case for the baseline language benchmarks (standalone functions with no cross-file dependencies), while the Chain Detection tests explicitly use multi-file application structures.
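To make the scoring math concrete, here is a minimal sketch of category-averaged Youden's J over per-category TP/FN/TN/FP counts. The function names and dict layout are my own for illustration, not necessarily the repo's actual format:

```python
# Sketch of category-averaged Youden's J scoring.
# The dict layout here is hypothetical, not the benchmark's real data format.

def youden_j(tp: int, fn: int, tn: int, fp: int) -> float:
    """J = sensitivity + specificity - 1, ranging over [-1, 1]."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0

def benchmark_score(per_category: dict) -> float:
    """Average J across categories so large categories cannot dominate."""
    scores = [youden_j(c["tp"], c["fn"], c["tn"], c["fp"])
              for c in per_category.values()]
    return sum(scores) / len(scores)

# A tool that flags every case gets J = 0, not a perfect score:
flag_everything = {"sqli": {"tp": 25, "fn": 0, "tn": 0, "fp": 25}}
print(benchmark_score(flag_everything))  # 0.0
```

With 25 TP and 25 TN per category, flipping one case moves sensitivity or specificity by 1/25, so J shifts by at most 4 percentage points, which is where the stated per-case swing comes from.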
It focuses heavily on two main areas:
Chain Detection: 500 test cases that measure whether a tool can correlate multiple low-severity findings across different files into a compound exploit path.
Adversarial Evasion: tests of whether a tool can detect intentional concealment, such as payloads hidden inside invisible Unicode characters or visual deception using Bidi overrides.
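For a feel of what the evasion category targets, here is a minimal, hypothetical illustration (not an actual benchmark case) of the invisible-Unicode tricks involved:

```python
# Hypothetical illustration of invisible-Unicode evasion, not a benchmark case.
# A zero-width space (U+200B) makes two strings render identically in most
# fonts while comparing unequal, so a naive allowlist check can be bypassed.
visible = "admin"
smuggled = "ad\u200bmin"
assert visible != smuggled          # identical on screen, different to the machine

# A right-to-left override (U+202E) changes how a filename *displays*:
# bidi-aware renderers show this as "file_exe.txt", hiding the real .exe suffix.
deceptive = "file_\u202etxt.exe"
assert deceptive.endswith(".exe")   # the true extension is still .exe
```

A tool that only inspects rendered or normalized text misses both; the benchmark checks that the analyzer sees the raw code points.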
Since there was no public ground truth for Go, Rust, Bash, PHP, and Ruby, I also built baseline vulnerability benchmarks for those languages as part of the suite, bringing the total to over 7,700 test cases.
Building ground truth at this scale as a solo developer is a massive undertaking, and right now I have a serious echo chamber problem. I am the student taking the exam, the master designing it, and the professor grading it. It sucks, and I know I have blind spots in my test designs.
I am releasing this openly because imperfect ground truth that invites correction is more valuable than no ground truth at all. If you work in AppSec, build SAST engines, or just enjoy breaking logic, I would love your scrutiny. Finding my misclassifications and edge cases will make this infinitely more valuable for everyone.
Repo link: https://github.com/TheAuditorTool/sast-benchmark
ThailandJohn, TheAuditorTool maintainer.