Autonomous Bug Bounty Agent: Reached #86 on HackerOne, DoD Triage

1 point • Layer_8 • 1h ago

Hello HN,

We’re three security researchers in Tokyo building an autonomous agent framework for authorized security testing (VDP/Bug Bounty).

We wanted to share our experimental results running this agent against live targets (as of Feb 8):

Real-World Impact: Reached #86 globally on the HackerOne VDP leaderboard (90 days).

Gov Targets: 3 vulnerabilities triaged by the U.S. Department of Defense (DoD).

Benchmark: Solved 84% of PortSwigger Web Security Academy labs autonomously.

Interestingly, we ran into an "Impact Gap": the agent finds technically valid exploits, but it often struggles to assess business criticality, so many of its reports are closed as "Informative".

We released our architecture design and safety proxy details on GitHub. We'd love to hear your thoughts on bridging this gap between technical exploitability and business impact.

URL: https://github.com/cyberprobe-ai/autonomous-pentest-agent-research

Comments

Layer_8•1h ago

Quick clarifications (to avoid ambiguity / keep this responsible):

Authorized only: we run this strictly within explicit VDP/bug bounty scopes. We do not run it as a general internet crawler.

Human-in-the-loop: the system drafts a report + evidence, but a human makes the final call and we never auto-submit.

Scope-enforcing proxy: all outbound traffic is forced through a gate with default-deny, FQDN allowlists, method constraints, rate/concurrency caps, and full allow/deny logging.

"Safe PoC" policy: we prioritize read-only verification patterns and stop on signs of instability (error spikes, account risk, unexpected side effects). We're not sharing real-world exploit payloads here.

Metrics: "84% labs solved" refers to server-side lab completion outcomes; details / breakdown are in the README.

The thing we're most interested in feedback on is the "impact gap": how would you teach an agent to estimate business severity (or chain low-severity issues into a meaningful impact narrative) without pushing into risky/destructive testing?
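For readers wondering what a scope-enforcing gate with default-deny, an FQDN allowlist, method constraints, a rate cap, and allow/deny logging might look like, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's actual code: `ScopeGate`, its fields, and the example hostnames are all hypothetical, and concurrency caps are left out to keep it short.

```python
import logging
from dataclasses import dataclass, field
from urllib.parse import urlsplit

log = logging.getLogger("scope_gate")


@dataclass
class ScopeGate:
    """Hypothetical default-deny gate (illustrative, not the authors' code)."""

    allowed_hosts: frozenset          # exact FQDNs that are in scope
    allowed_methods: frozenset = frozenset({"GET", "HEAD", "OPTIONS"})
    max_requests_per_window: int = 5  # simple sliding-window rate cap
    window_seconds: float = 1.0
    _recent: list = field(default_factory=list)

    def check(self, method: str, url: str, now: float):
        """Return (allowed, reason). Anything not explicitly allowed is denied."""
        host = (urlsplit(url).hostname or "").rstrip(".")
        if host not in self.allowed_hosts:
            return self._decide(False, "host not in allowlist", method, url)
        if method.upper() not in self.allowed_methods:
            return self._decide(False, "method not permitted", method, url)
        # Drop timestamps that have fallen out of the window, then apply the cap.
        self._recent = [t for t in self._recent if now - t < self.window_seconds]
        if len(self._recent) >= self.max_requests_per_window:
            return self._decide(False, "rate cap reached", method, url)
        self._recent.append(now)
        return self._decide(True, "in scope", method, url)

    def _decide(self, allowed: bool, reason: str, method: str, url: str):
        # Log every decision, allow and deny alike.
        log.info("%s %s %s (%s)", "ALLOW" if allowed else "DENY", method, url, reason)
        return allowed, reason
```

A caller would pass the current time explicitly (for example `time.monotonic()`), which keeps the gate easy to test. Matching on exact hostnames rather than suffixes is a deliberate choice in this sketch: it avoids accidentally admitting out-of-scope subdomains, at the cost of listing every in-scope FQDN.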
