I figured the best environment for such a tool would be a weekend CTF event. I like web challenges since you get a nice dump of source code, as well as a Dockerfile or docker-compose setup for running everything locally. Usually, I can complete 2-3 web challenges before I get stuck, and to get unstuck I found myself increasingly turning to LLMs as a pairing partner.
I'm a fan of devcontainers, so I figured I could apply a similar concept with an agent*, where I load the agent into a container, mount the source code, and even start up any provided Dockerfile or docker-compose.yml so that the agent can actually test real `curl` commands!
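To make that concrete, here's a minimal sketch of what such a setup might look like as a docker-compose file. All of the names here (`my-agent`, `./challenge`, the ports) are illustrative assumptions, not the actual configuration from my repo:

```yaml
# Hypothetical sketch: run the agent alongside a CTF challenge.
# Image names, paths, and ports are placeholders.
services:
  agent:
    image: my-agent:latest            # the LLM agent runtime (hypothetical image)
    privileged: true                  # see the footnote on --privileged below
    volumes:
      - ./challenge:/workspace:ro     # mount the challenge source read-only
    working_dir: /workspace
  challenge:
    build: ./challenge                # the challenge's own provided Dockerfile
    ports:
      - "8080:8080"
```

With both services on the same compose network, the agent can hit the running challenge directly, e.g. `curl http://challenge:8080`, instead of reasoning about the code in a vacuum.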
So how did it go? It was pretty helpful for the web challenges; I was able to cruise through five between Friday and Saturday. I then decided to see how it would do in the other categories, without any input or guidance from me, since I typically stick to web.
In total *we* solved 19 challenges. Its best category was crypto with 4/7 solved, and its worst was pwn with 2/5.
I was also curious how different providers would fare. Because this was an automated agent, I started off with xAI, since they were the cheapest.
xAI was able to solve 8 challenges autonomously with just source code and challenge descriptions.
I then pivoted to Gemini as the next cheapest; it did pretty well, building on xAI's "analysis" to solve 5 additional challenges.
I then tried Anthropic's Opus model, but it wasn't able to crack any additional challenges, and I got frustrated by constant 429 rate-limit errors (so I kind of wish I'd switched to OpenAI's GPT-5 instead, as it seems Anthropic doesn't really like agents other than Claude calling their models).
In terms of cost breakdown, I spent:
- $33.06 with xAI
- $35.61 with Google
- $24.04 with Anthropic

That brings the total to just under $100 for a weekend benchmarking exercise.
Going forward, I'm not really interested in paying to copy-paste CTF flags, but I did find the agent helpful for brainstorming solutions. It worked a lot better when connected to the source code, given access to a locally running instance, and augmented with MCP tools for concept and source-code searching. I'm planning to use similar concepts to build out a dev/review agent.
The source code for my setup is here: https://github.com/edelauna/prompt2pwn
* My initial version does require setting `--privileged` on the Docker runtime. I originally tried to use podman, but I ran into networking/DNS issues with how I wanted to make MCP tools available to the agent. Please open an issue on the repo if you have any ideas for how to harden this.