Show HN: We built a public CTF to stress-test AI agent guardrails

https://vault.aport.io/

1•uchibeke•1h ago

Comments

uchibeke•1h ago

Since October I've been building APort — an authorization layer that intercepts every AI agent tool call before execution and evaluates it against a versioned policy. The problem I kept running into: internal tests always passed. My test suite maps the space I imagined, which is exactly what an adversarial input tries to escape.

So I built this CTF to find the gaps I couldn't find myself.

A few things we learned in the two weeks we spent breaking it ourselves before opening it publicly:

• Prompt injection worked better than expected. Not because detection was weak, but because we were matching on content, not intent. Reframing "retrieve the restricted file" as "open the user-requested file" shifted the evaluator's judgment. We fixed this by mapping semantic equivalence: every synonym of a blocked operation routes to the same evaluation path.

• Policy ambiguity was a free pass. Any undefined term in a policy is exploitable. "Don't read sensitive files" left "sensitive" undefined. We moved to explicit default-deny: if the policy doesn't explicitly allow it, it's denied.

• Multi-step chaining went undetected. Our guardrail evaluated each call independently. A denied macro-action split into ten individually approved micro-actions passed clean. We only caught it by looking at the full session replay. This is the same composability problem as transaction laundering in fintech: each transaction passes compliance, the composed behavior doesn't.
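
To make the first two fixes concrete, here is a minimal sketch of synonym routing plus explicit default-deny. It is not APort's actual engine: the `CANONICAL_OPS` table, the `Rule` type, the `/public/` prefix, and the `evaluate` function are all hypothetical names chosen for illustration, and a real evaluator would need far more than a verb lookup to judge intent.

```python
import posixpath
from dataclasses import dataclass

# Hypothetical synonym table: every verb we know about maps to one
# canonical operation, so "open" and "retrieve" hit the same rule as "read".
CANONICAL_OPS = {
    "read": "file.read",
    "open": "file.read",
    "retrieve": "file.read",
    "fetch": "file.read",
    "write": "file.write",
    "save": "file.write",
}


@dataclass(frozen=True)
class Rule:
    op: str           # canonical operation name
    path_prefix: str  # only paths under this prefix are allowed


# Explicit allow-list (assumed policy). Anything not listed is denied.
ALLOW = (Rule("file.read", "/public/"),)


def evaluate(verb: str, path: str) -> str:
    """Return "allow" only if an explicit rule matches; otherwise "deny"."""
    op = CANONICAL_OPS.get(verb.lower())
    if op is None:
        return "deny"  # unknown verb: default-deny, never default-allow
    # Normalize so "/public/../restricted" cannot pass a prefix check.
    normalized = posixpath.normpath(path)
    for rule in ALLOW:
        if rule.op == op and normalized.startswith(rule.path_prefix):
            return "allow"
    return "deny"
```

With this shape, `evaluate("open", "/restricted/secrets.txt")` and `evaluate("retrieve", "/restricted/secrets.txt")` take the same path and both come back denied, and a verb missing from the table is denied rather than waved through.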

We fixed what we found before launch. Level 5 (full system bypass) hasn't been cracked yet. I'm genuinely uncertain whether the architecture has a systemic weakness, and finding out is the point of opening it up.

Runs on a Hetzner VPS, ~$10/month. Levels 1 and 2 are free, no sign-up. Levels 3-5 pay out $500/$1,000/$5,000.

Happy to go deep on the policy engine design, the evaluation architecture, or anything about how the levels were constructed.

ollybrinkman•1h ago

Interesting approach to security testing. One angle we've been exploring: what if the authentication layer itself was the guardrail?

With x402, every API call requires a signed payment. No API keys to steal, no credentials to leak. The economic cost of each call is itself a rate limiter and audit trail.

Not a replacement for proper guardrails, but it eliminates the credential-based attack surface entirely.

uchibeke•33m ago

That's a genuinely useful distinction to draw. x402 solves the "who is authorized to make this call" problem: it removes credential theft as an attack vector and adds economic friction. APort is trying to solve a different layer: "what is this call actually doing in the context of everything else in the session."

The multi-step chaining issue from my post still fires even when every call is authenticated and paid for. Ten individually-approved calls, each costing a fraction of a cent, composing into a full exfiltration: each one passes x402, the composed behavior doesn't.

The AML analogy maps directly: transaction monitoring doesn't care if each payment was legitimate. It cares whether the pattern of payments looks like structuring. x402 is the per-call check. You still need session-level behavioral evaluation on top.
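
A rough sketch of what "session-level evaluation on top of a per-call check" can mean, in the spirit of structuring detection. Everything here is an assumption for illustration: the `Session` type, the `evaluate_call` function, and both limits are made up, and counting rows is a stand-in for whatever behavioral signal a real engine would track.

```python
from dataclasses import dataclass, field

PER_CALL_LIMIT = 100  # hypothetical: max rows a single call may read
SESSION_LIMIT = 250   # hypothetical: max rows one session may read in total


@dataclass
class Session:
    rows_read: int = 0
    log: list = field(default_factory=list)  # (rows, decision) per call


def evaluate_call(session: Session, rows: int) -> str:
    """Check the call on its own, then against the session's running total."""
    if rows > PER_CALL_LIMIT:
        decision = "deny"  # fails the per-call check
    elif session.rows_read + rows > SESSION_LIMIT:
        decision = "deny"  # each call is fine, the composed behavior is not
    else:
        session.rows_read += rows
        decision = "allow"
    session.log.append((rows, decision))
    return decision


# Ten calls of 50 rows each: every one passes the per-call limit,
# but the session budget stops the chain partway through.
session = Session()
decisions = [evaluate_call(session, 50) for _ in range(10)]
```

In this run the first five calls are allowed and the remaining five are denied, even though no single call exceeds `PER_CALL_LIMIT`. A per-call check alone, whether it is a policy rule or a payment, would have approved all ten.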

Genuinely curious how x402 handles replay attacks across sessions, i.e., is the payment the audit trail, or is there preserved session context?
