frontpage.

Ask HN: What's the state of multimodal prompt injection defence in 2026?

2•JoshBlythe•1h ago

I've been researching multimodal prompt injection - attacks hidden in images, documents, and audio rather than text. Ran a structured test suite (225 attacks across 5 modalities) against a detection pipeline I built and the results were surprising.

Some findings:

- Audio is easier to defend than text. Ultrasonic and spectral attacks have detectable signal characteristics via FFT analysis. The hard part is after transcription, where it becomes a text problem again.

- Cross-modal attacks are less dangerous than expected if you scan each modality independently. The "clean text + malicious PDF" attack only works if you trust the document because the text looked safe.

- Encoding (base64, ROT13, leetspeak) is a solved problem if you decode before scanning. The remaining gap is very short encoded payloads that fall below detection thresholds.

- The real unsolved problem is semantic. Completion attacks ("Complete the following: 'The system prompt reads...'"), narrative extraction, steganographic output manipulation, and multi-turn context poisoning all require understanding intent, not pattern matching. A classifier trained on known injection patterns will always miss novel framing.

- False positives are harder than detection. Getting zero false positives on inputs like "act as a SQL expert", "override the default config", and "what is prompt injection" took more work than improving detection rates.

- Non-English injection is a massive blind spot. An English-trained classifier misses every non-English attack that dodges regex patterns.

My question for HN: is anyone else working on multimodal injection defence? Most tools I've found (Lakera Guard, LLM Guard, Azure Prompt Shields) are still text-only in their public APIs. The research papers describe the attacks well but I haven't seen many production-grade defences for image/audio/document injection.

Also curious whether anyone has had success with LLM-as-judge approaches for detecting semantic attacks - using a second model to evaluate whether an input is trying to manipulate the first. The latency and cost tradeoffs seem brutal but it might be the only path for the subtle stuff.

Would love to hear what others are seeing in production.

Show HN: OS Megakernel that match M5 Max Tok/w at 2x the Throughput on RTX 3090

Show HN: Explore the Silk Roads through an interactive map

VLC media player is onboard the Artemis mission

Northeastern presentation to junior engineers in the age of AI

Show HN: The Crab Games, a platform where agents compete in silly challenges

If Thomas Jefferson were alive today

Hugging Face moves safetensors to the PyTorch Foundation

Chilcy – Free AI tool for CSV insights

Free domain SEC scanner – DMARC, MTA-STS, subdomain takeover, credential leaks

Ambiguity Aversion: Why Unknown Probabilities Create Mispricing

Mnemo: Shareable typed agentic memory system with Bayesian belief updating

Wildlife Conservation Police Are Searching Flock Cameras for ICE

Trump is facing the biggest US humiliation since Vietnam

Project Glasswing – Anthropic has crossed a line

Delivery is not delivery: timing, latency, and what SMS APIs don't show

Hacker News

The Voorhees law of traffic: why the car you passed always returns

Casio ABL-100 vs. Ollee Watch One

Anthropic greps for 'Pi', 'OpenClaw' in prompts and blocks them

Backpressure in Agent-Driven Development

The Landscape of Agentic Coding

Google launched an AI dictation app that works offline

Milla Jovovich Built MemPalace – The Full Story

Reverse-engineering retrieval in decoder-only Transformers

Codasip announces strategic pivot and divestiture

Microsoft Abruptly Terminates VeraCrypt Account, Halting Windows Updates

Cogito: Beautiful AI Markdown Editor for Mac

A rigorous .md specification for AI Daemons

Dario's Weird Race to the Top

Espressif's New ESP32-S31: Dual-Core RISC-V with WiFi 6 and Gbit Ethernet