Show HN: Nyx – multi-turn, adaptive, offensive testing harness for AI agents

16•zachdotai•1h ago

We built Nyx to solve a problem we kept hitting while building agents: AI agents break in ways traditional software doesn't. Logic bugs, reasoning failures, edge cases that manual testing and static benchmarks never explore.

Nyx is an autonomous testing harness that probes your AI agents to find failure modes before users do. It’s used to find logic bugs, instruction following failures, edge cases in agent behavior, and for red-team security testing (jailbreaks, prompt injection, tool hijacking)

Technical approach: * Pure blackbox (no special access needed - test like your users interact) * Multi-turn adaptive conversations * Multi-modal testing (voice, text, images, documents, browser interactions) * Massively parallel by default

Instead of spending time writing static evals for the key failure modes of your AI agents, point Nyx at any system and it autonomously discovers failure modes that matter. We typically find issues in under 10 minutes that manual audits take hours to surface.

This is early work and we know the methodology is still going to evolve. We would love nothing more than feedback from the community as we iterate on this.

Comments

adam_rida•1h ago

Very cool!

aacudad•1h ago

I am not sure this will work, seems like added complexity to something simple

ibrahim-fab•1h ago

Nice. Definitely true that evaluating agents behavior is by far the toughest part of building them. Also most eval cases are added without thought and not maintained when agent behaviour updates. Interesting approach.

zachdotai•26m ago

We wrote some thoughts on static vs. dynamic evals and how it relates to understanding the security posture of an AI system. Static security evals no longer carry the signal they used to. A one-shot pass/fail tells you almost nothing about real-world risk.

Would love your thoughts on this: https://fabraix.com/blog/adversarial-cost-to-exploit

ljhasdr•1h ago

i need to try this before mythos comes to attack our service. thanks!

Slava's Monoid Zoo

Nicholas Pardini on real inflation vs. CPI and financial repression [video]

Show HN: Brygga – A modern, fast, feature-rich IRC client for macOS

Arm Helium Technology Reference Book (2023) [pdf]

To Dissect a Mockingbird: A Graphical Notation for the Lambda Calculus (1996)

Vercel April 2026 security incident

The game was rigged by the one man hired to prevent rigging

Swiss AI Initiative

Got an Old Kindle? It Might Not Work Anymore

2,100 Swiss municipalities showing which provider handles their official email

Vercel hack – an old fashioned honey pot?

NVFP4 on Nvidia DGX Spark is slower than FP8 on the same model

Ultimate Guide to Vibe Coding

Ukraine Moves to Replace Frontline Soldiers with 25,000 Ground Robots

Headless Everything for Personal AI

Fear and Loathing Among the Haves and Have Mores in San Francisco

Interview with 0.1x Engineer [video]

Banned by Anthropic?

Show HN: Turn your PCB into a Minecraft world

Ex-CEO, ex-CFO of bankrupt AI company charged with fraud

US Draft Update: Major Tech Company Urges Universal National Service

Cathode Ray Tube Memory (RAM)

Audit Tool for Vercel Exposure of Environment Variables

Improving Office+Photoshop+Fusion on Linux with Adversarial Drinking

Is Prolog Worth Learning?

Claude UI Feature Request

Reminder: Enable ZRAM on your Linux system to optimize RAM usage

Aliens.gov will be running as a WordPress multisite

Show HN: Developerpod, K-Cups for Code

Fuse Shared Libraries into ELFs