I figured the best environment for such a tool would be a weekend CTF event. I like web challenges since you get a nice dump of source code, as well as a Dockerfile or docker-compose setup for running everything locally. Usually, I can complete 2-3 web challenges before I get stuck, and to get unstuck I found myself increasingly turning to LLMs as a pairing partner.
I'm a fan of devcontainers, so I figured I could apply a similar concept with an agent*, where I load the agent into a container, mount the source code, and even start up any provided Dockerfile or docker-compose.yml so that the agent can actually test real `curl` commands!
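To make that concrete, here's a minimal sketch of what such a setup might look like as a docker-compose file. All of the names here (`my-agent`, `./challenge`, the ports) are illustrative assumptions, not the actual configuration from my repo:

```yaml
# Hypothetical sketch: run the agent alongside a CTF challenge.
# Image names, paths, and ports are placeholders.
services:
  agent:
    image: my-agent:latest            # the LLM agent runtime (hypothetical image)
    privileged: true                  # see the footnote on --privileged below
    volumes:
      - ./challenge:/workspace:ro     # mount the challenge source read-only
    working_dir: /workspace
  challenge:
    build: ./challenge                # the challenge's own provided Dockerfile
    ports:
      - "8080:8080"
```

With both services on the same compose network, the agent can hit the running challenge directly, e.g. `curl http://challenge:8080`, instead of reasoning about the code in a vacuum.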
So how did it go? It was pretty helpful for the web challenges; I was able to cruise through five between Friday and Saturday. I then decided to see how it would do in the other categories, without any input or guidance from me, since I typically stick to web.
In total *we* solved 19 challenges. Its best category was crypto with 4/7 solved, and its worst was pwn with 2/5.
I was also curious how different providers would fare. Because this was an automated agent, I started off with xAI, since they were the cheapest.
xAI was able to solve 8 challenges autonomously with just source code and challenge descriptions.
I then pivoted to Gemini as the next cheapest; it did pretty well, building on xAI's "analysis" to solve 5 additional challenges.
I then tried Anthropic's Opus model, but it wasn't able to crack any additional challenges, and I got frustrated by constant 429 rate-limit errors (so I kind of wish I'd switched to OpenAI's GPT-5 instead, as it seems Anthropic doesn't really like agents other than Claude calling their models).
In terms of cost breakdown, I spent:
- $33.06 with xAI
- $35.61 with Google
- $24.04 with Anthropic

That brings the total to just under $100 for a weekend benchmarking exercise.
Going forward, I'm not really interested in paying to copy-paste CTF flags, but I did find the agent helpful for brainstorming solutions. It worked a lot better when connected to the source code, given access to a locally running instance, and augmented with MCP tools for concept and source-code searching. I'm planning to use similar concepts to build out a dev/review agent.
The source code for my setup is here: https://github.com/edelauna/prompt2pwn
* My initial version does require setting `--privileged` on the Docker runtime. I originally tried to use podman, but I ran into networking/DNS issues with how I wanted to make MCP tools available to the agent. Please open an issue on the repo if you have any ideas for how to harden this.