Over the past year I built and analyzed a dataset of 23K+ vulnerabilities extracted from smart contract audit reports published between 2023 and 2025. Sources include private auditors, audit firms, and competitive platforms such as Code4rena and Sherlock.
The dataset was cleaned before analysis: 99% of Informational-severity findings and ~40% of Low-severity were removed, as they consistently lacked sufficient detail to be informative.
The goal was to quantify report quality — not just flag vulnerabilities, but measure how well each one is documented. This became the foundation for a RAG-based audit assistant I've been building, where data quality has an outsized effect on output quality.
Scoring methodology:
Each finding was scored on three primary dimensions — description depth, remediation quality, and presence of a PoC. PoC carried the highest weight, as it is the most reliable signal of a useful report. Solidity snippets and severity level contributed additional points. Raw scores (0–15) were log-normalized to 0–1 to prevent score concentration at the top.
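As a concrete sketch of the normalization step (the post doesn't give the exact formula, so the `log1p` transform below is an assumption):

```python
import numpy as np

def normalize_score(raw, max_raw=15):
    """Map a raw 0-15 quality score onto the 0-1 range.

    A log transform is one way to keep scores from concentrating
    at one end of the scale; log1p handles a raw score of 0.
    The exact transform used in the analysis is an assumption here.
    """
    return np.log1p(raw) / np.log1p(max_raw)

print(normalize_score(0))   # 0.0
print(normalize_score(15))  # 1.0
print(normalize_score(7))   # 0.75 — the log lifts mid-range raw scores
```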
Key findings:
— Total findings analyzed: 23,625
— Mean score: 0.32 | Median: 0.27
— Distribution is multimodal with three distinct quality tiers (~0.05, ~0.25, ~0.60)
— ~25% of findings score above 0.51 — these form the high-quality tier (the "golden" subset of the data)
— All three normality tests applied reject normality — the distribution is significantly non-Gaussian
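The normality checks are easy to reproduce on any score column. Here's a hedged sketch on synthetic data shaped like the three tiers above (the post doesn't name the specific tests, so Shapiro–Wilk, D'Agostino, and Anderson–Darling are my choice of three common ones):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic stand-in for the score column: a three-component
# mixture near the reported tiers (~0.05, ~0.25, ~0.60).
scores = np.concatenate([
    rng.normal(0.05, 0.03, 8000),
    rng.normal(0.25, 0.06, 10000),
    rng.normal(0.60, 0.10, 5600),
])

# Three common normality tests; all should reject for multimodal data.
shapiro_p = stats.shapiro(rng.choice(scores, 5000, replace=False)).pvalue
dagostino_p = stats.normaltest(scores).pvalue
ad = stats.anderson(scores, dist="norm")

print(shapiro_p < 0.05)                        # True: rejects normality
print(dagostino_p < 0.05)                      # True: rejects normality
print(ad.statistic > ad.critical_values[-1])   # True: rejects at the 1% level
```

Note that `scipy.stats.anderson` returns a statistic plus critical values rather than a p-value, so its rejection check looks different from the other two.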
Most counterintuitive result: Critical-severity bugs score lower on average (0.33) than High-severity ones (0.53). Critical findings tend to be reported as brief alerts without PoC — the severity speaks for itself, so the write-up gets less attention. High findings, by contrast, typically include more thorough documentation. This is a problem: the bugs most likely to cause catastrophic losses are often the least well-documented.
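For anyone reproducing this, the severity comparison boils down to a group-by mean over the findings table. A toy sketch (the `severity` and `score` column names are assumptions, and the values below merely echo the reported averages):

```python
import pandas as pd

# Tiny illustrative table; values chosen so the group means match
# the reported 0.33 (Critical) vs 0.53 (High).
findings = pd.DataFrame({
    "severity": ["Critical", "Critical", "High", "High", "High"],
    "score":    [0.30, 0.36, 0.55, 0.50, 0.54],
})

mean_by_severity = findings.groupby("severity")["score"].mean()
print(round(mean_by_severity["Critical"], 2))  # 0.33
print(round(mean_by_severity["High"], 2))      # 0.53
```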
What this means in practice:
The three-peak distribution reflects real behavioral patterns in how auditors write reports. The first cluster (scores ~0.05) represents minimal one-liner findings with no context. The second (~0.25) covers standard reports with a description but no PoC. The third (~0.60) is the minority that includes everything: a clear description, remediation steps, and working exploit code. Only this last group is genuinely useful for both AI training and human review.
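One way to verify that the three clusters are real peaks rather than histogram-binning artifacts is to look for local maxima of a kernel density estimate. A sketch on synthetic tier-shaped data (the actual peak-detection method used in the analysis isn't stated):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
# Synthetic scores shaped like the three described tiers.
scores = np.concatenate([
    rng.normal(0.05, 0.03, 8000),   # one-liners, no context
    rng.normal(0.25, 0.06, 10000),  # description, no PoC
    rng.normal(0.60, 0.10, 5600),   # full write-up with PoC
])

# Evaluate a KDE on a grid and pick out prominent local maxima.
grid = np.linspace(-0.1, 1.0, 500)
density = gaussian_kde(scores)(grid)
peak_idx, _ = find_peaks(density, prominence=0.1)
print(np.round(grid[peak_idx], 2))  # peaks near 0.05, 0.25, 0.60
```

The `prominence` threshold filters out minor wiggles in the estimated density so only the genuine modes survive.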
For the broader ecosystem, the takeaway is uncomfortable: the current standard of audit reporting leaves most findings underexplained. A well-documented bug with a PoC can be understood, reproduced, and fixed in hours. A vague one-liner can stay misunderstood for weeks — or get silently ignored in the next audit cycle.
If you want to see the full distribution charts and statistics for yourself, I put together an interactive notebook with all the visualizations:
https://colab.research.google.com/drive/1Wp4yyEmXYjHATak7Bmy...
Open to questions on methodology or dataset composition.