Show HN: Simulation-Based Testing for Agents Using AG-UI Protocol

6•0xdeafcafe•5h ago

Comments

rchaves•5h ago

Hello HN!

tl;dr: We built Scenario, an open-source testing library for AI agents. It simulates real conversations with your agent, its code-driven, and lets you assert anything mid-dialogue. Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/

I'm Rogerio, founder of LangWatch, I've been helping many customers building LLM applications in this past two years and worked with Alex on this.

Most of the efforts for LLM quality so far were about evaluations, single-turn, there was nothing actually good to test agents, it all felt forced, but we believe we cracked it now, we have built an agent testing library that test your agent by simulating a user and playing a conversation back and forth with it.

One of the key challenges there was that we had to make it compatible with all the 273+ AI frameworks (and counting) there are. Luckliy AG-UI protocol popped up recently, standardizing agents frameworks and UI interactions, this is perfect, because at the end of the day, we want our user simulator to "see" just the same that the user sees.

So we made Scenario in a way that is really easy to connect to any agent no matter the tech stack, from a simple string <-> string connection, to openai standard messages format, to AG-UI.

The other key challenge was to balance testing the open-endedness of agents vs having reliable cases you want to test, so we worked a lot on thinking through the autopilot simulation vs the fully scripted one, and here again, the goal was complete interoperability. At the end of the day, the design we achieved was simply having lambdas, that you can call at any point of the test, so it's just code, where you can connect any other evaluation or assertion tool you want, we are not restrictive.

Check out the repo and the docs, we would love to get some feedback in here!

Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/

Show HN: Releasepages.dev make release pages from Git commits

Engine thrust incidents spur safety alert over biocides (2020)

Ghosting and 'breadcrumbing': the impact of bad behavior on dating apps

Restoring Arctic Exceptionalism

Lateralized sleeping positions in domestic cats

Las Vegas Through Landsat 7's Eyes (2024)

Tech in Iran-Israel conflict: internet blackout, crypto burns and camera spying

Show HN: Natively – AI mobile app builder (iOS and Android)

OpenAI Is Ruthless [video]

The consequences of Starbucks on startup culture in neighborhoods

The Ant Mill: How theoretical high-energy physics descended into groupthink

National Archives to restrict public access starting July 7

Mexico is now Chinas No. 1 car export market

Python Tools Are Quickly Adopting the New pylock.toml Standard

The Discovery Engine (automated system for scientific discovery)

Show HN: Vybetr – Hire AI app developers using tools like Lovable, Bolt and more

Using Lxcfs Together with Podman

Lessons from LangChain and Slack and MCP Integration

Use of ch unit considered inappropriate (in certain circumstances)

Brit Watchdog Cracks Down on Data Collection by Smart TVs, Speakers, Air Fryers

Thoughts on the AI 2027 Discourse

Childhood and Education #10: Behaviors

When Can I Stop Listening to My Enemy's Points?

Show HN: Letter Lockbox – A word game I built over the weekend with Claude Code

Programmers and Their Monospace Blogs

Ask HN: What's your fastest conversion from cold outreach to prepaid client?

Namespaced Pundit Policies Without the Repetition Racket

The Legacy of "The Gastronomical Me"

Show HN: How Usage Works

Why Your Car's Touchscreen Is More Dangerous Than Your Phone