I spent $100 benchmarking LLM providers on a weekend CTF

1•wwdmaxwell•1h ago
This past weekend, I decided to test out a CLI tool I've been building to help me do source code reviews _faster_.

I figured the best environment for such a tool would be a weekend CTF event. I like web challenges since you get a nice dump of source code, as well as a Dockerfile or docker compose setup for running everything locally. Usually, I can complete 2-3 web challenges before I get stuck. To get unstuck, I found myself increasingly turning to LLMs as a pairing partner.

I'm a fan of devcontainers, so I figured I could apply a similar concept with an agent*, where I load the agent into a container, mount the source code, and even start up any provided Dockerfile or docker-compose.yml so that the agent can actually test real `curl` commands!
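To make the container idea concrete, here is a minimal sketch of how such a launch command could be assembled. It is not taken from the linked repo: the image name `agent-image:latest`, the `/workspace` mount point, and the function name are all hypothetical. Only the standard `docker run` flags `--rm`, `-v`, `-w`, and `--privileged` are used.

```python
import shlex
from pathlib import Path


def build_agent_run_command(source_dir, image="agent-image:latest", privileged=False):
    """Build the argv for a `docker run` that mounts source_dir into the agent container.

    The image name and the /workspace mount point are placeholders, not the
    values used by the author's tool.
    """
    source = Path(source_dir).resolve()
    cmd = ["docker", "run", "--rm"]
    if privileged:
        # Needed if the agent has to start the challenge's own containers from inside.
        cmd.append("--privileged")
    # Mount the challenge source and make it the working directory.
    cmd += ["-v", f"{source}:/workspace", "-w", "/workspace", image]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_agent_run_command(".", privileged=True)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would start the container, assuming Docker is installed and the image exists.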

So how did it go? It was pretty helpful for the web challenges. I was able to cruise through 5 between Friday and Saturday. I decided to see how it would do in the other categories - without any input / guidance from me as I typically stick to web.

In total *we* solved 19 challenges. Its best category was crypto with 4/7 solved, and its worst was pwn with 2/5.

I was also curious how different providers would fare. Because this was an automated agent, I started off using xai since they were the cheapest.

xai was able to solve 8 challenges autonomously with just source code and challenge descriptions.

I then pivoted to gemini as the next cheapest, and it did pretty well and was able to build on xai's "analysis" and solve 5 additional challenges.

I then tried pivoting to anthropic's Opus model, but it wasn't able to crack any additional challenges, and I got frustrated since I kept getting rate limited with 429 errors. I kind of wish I had switched to openai 5 instead, as it seems like Anthropic doesn't really like agents other than Claude calling their models.
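For anyone hitting the same 429s, one common mitigation is retrying with exponential backoff and jitter. The sketch below is a generic illustration, not what the agent in this post does: `RateLimitError` and `call_with_backoff` are hypothetical names, and a real client would raise its own exception type for a 429 response.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever a provider client raises on HTTP 429."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request(), retrying with exponential backoff plus jitter on RateLimitError."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # The base wait doubles on every attempt; jitter adds up to 50% more
            # so parallel agents do not all retry at the same instant.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable only so the function can be exercised without actually waiting.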

In terms of cost breakdown, I spent:

$ 33.06 with xai

$ 35.61 with google

$ 24.04 with anthropic

That brings the total to $92.71, just under $100, for a weekend benchmarking exercise.
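As a rough way to read those numbers, the sketch below totals the spend and divides each provider's bill by the challenges it solved. Pairing the spend with the solve counts (8 for xai, 5 additional for gemini, 0 additional for Opus) is my own assumption for illustration. The gemini and Opus runs built on earlier analysis, so this is not a like-for-like comparison.

```python
# Figures from the post; pairing spend with solve counts is an assumption.
spend = {"xai": 33.06, "google": 35.61, "anthropic": 24.04}
additional_solves = {"xai": 8, "google": 5, "anthropic": 0}


def cost_per_solve(spend, solves):
    """Return dollars per solved challenge, or None where nothing was solved."""
    return {
        provider: round(spend[provider] / solves[provider], 2) if solves[provider] else None
        for provider in spend
    }


total = round(sum(spend.values()), 2)
per_solve = cost_per_solve(spend, additional_solves)

print(total)      # 92.71
print(per_solve)  # {'xai': 4.13, 'google': 7.12, 'anthropic': None}
```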

Going forward I'm not really interested in paying to copy-paste CTF flags, but I did find the agent helpful for brainstorming solutions, and it worked a lot better when connected to the source code, with access to an instance running locally, and also augmented with MCP tools that allowed concept and source code searching. I'm planning to use similar concepts to build out a dev/review agent.

The source code for my setup is here: https://github.com/edelauna/prompt2pwn

* My initial version does require setting `--privileged` on the Docker runtime. I originally tried to use podman, but I ran into networking / DNS issues with how I wanted to make MCP tools available to the agent. Please open an issue on the repo and let me know if you have any ideas on how to harden this.
