frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

Open in hackernews

Agent simulations = unit testing for AI?

2•draismaa•6h ago
In traditional software, we write unit tests to catch regressions before they reach users. In AI systems—especially agentic ones that model breaks down. You can test inputs and outputs, use evals, but agents operate over time, across tools, mcps, apis, and unpredictable user input. The failure modes are non-obvious and often emerge only in edge cases. I'm seeing an emerging practice: agent simulations—structured, repeatable scenarios that test how an AI agent behaves in complex or long-tail situations.

Think: What if the upstream tool fails mid-execution? What if the user flips intent mid-dialogue? What if the agent’s assumptions were subtly wrong?

from self-driving cars to AI agents? The above aren’t one-off tests. They’re like AV simulations: controlled environments to explore failure boundaries. Autonomous vehicle teams learned long ago that real-world data isn't enough. The rarest events are the most important—and you need to generate and replay them systematically. That same long-tail distribution applies to LLM agents. We’ve started treating scenario testing as a core part of the dev loop—versioning simulations, running them in CI, and evolving them as our agent behavior changes. It’s not about perfect coverage,it’s about shifting from “test after” to “test through simulation” as part of iterative agent development. Curious if others here are doing something similar. How are you testing your agents beyond a few prompts and metrics? Would love to hear how the HN crowd is thinking about agent reliability and safety—not just in research, but in real-world deployments.

Comments

aszen•4h ago
We are just starting to introduce AI and for now rely on simple evals as unit tests that Dev's run locally to fine tune prompts and context.

Your idea of simulating agent interactions is interesting, but I want to know how are you actually evaluating simulation runs?

jangletown•42m ago
hello aszen, I work with draismaa, the way we have developed our simulations is by putting a few agents in a loop to simulate the conversation:

- the agent under test - a user simulator agent, sending messages as a user would - a judge agent, overlooking and stopping the simulation with a verdict when achieved

it then takes a description of the simulation scenario, and a list of criteria for the judge to eval, and that's enough to run the simulation

this is allowing us to tdd our way into building those agents, like, before adding something to the prompt, we can add a scenario/criteria first, see it fail, then fix the prompt, and see it playing out nicely (or having to vibe a bit further) until the test is green

we put this together in a framework called Scenario:

https://github.com/langwatch/scenario

the cool thing is that we also built in a way to control the simulation, so you can go as flexible as possible (just let it play out on autopilot), or define what the user said, mock what agent replied and so on to carry on a situation

and then in the middle of this turns we can throw in any additional evaluation, for example checking if a tool was called, it's really just a simple pytest/vitest assertion, it's a function callback so any other eval can also be called

Unless users take action, Android will let Gemini access third-party apps

https://arstechnica.com/security/2025/07/unless-users-take-action-android-will-let-gemini-access-third-party-apps/
1•azinman2•1m ago•0 comments

Zenefits' Aggressive Growth: Robby Allen on growing a 250-rep sales team

https://www.dock.us/grow-and-tell/ep-21-robby-allen
1•mooreds•2m ago•0 comments

Nvidia-backed Perplexity takes on Google with new AI-powered browser

https://qz.com/nvidia-perplexity-comet-ai-powered-browser-google
1•mikece•2m ago•0 comments

Why This Python Performance Trick Doesn't Matter Anymore

https://blog.codingconfessions.com/p/old-python-performance-trick
2•rbanffy•5m ago•0 comments

Short story written by a friendly AI

2•apolloartemis•6m ago•0 comments

Revitalizing Legacy Code

https://javapro.io/2025/06/26/revitalizing-legacy-code/
1•andrewstetsenko•7m ago•0 comments

Proposing: The American Autonomy Initiative

https://boydinstitute.org/p/american-autonomy-initiative
3•jeffgiesea•7m ago•2 comments

Show HN: Reverse-engineering 100 devtool landing pages

https://evilmartians.com/chronicles/we-studied-100-devtool-landing-pages-here-is-what-actually-works-in-2025
1•vicamelnikova•7m ago•0 comments

The Ghost of Muriel Spark

https://www.newstatesman.com/culture/books/2025/06/the-ghost-of-muriel-spark
2•Caiero•8m ago•0 comments

California achieved significant groundwater recharge last year

https://www.latimes.com/environment/story/2025-06-24/california-2024-groundwater-report
1•PaulHoule•10m ago•0 comments

Show HN: Is there a way to market BL1NG – where people pay to flex?

https://www.bl1ng.com
1•eflay•11m ago•0 comments

What Trump's Big Beautiful Bill means for Wi-Fi 6E and 7 users: It's not pretty

https://www.zdnet.com/home-and-office/networking/what-trumps-big-beautiful-bill-means-for-wi-fi-6e-and-wi-fi-7-users-hint-its-not-pretty/
1•CrankyBear•11m ago•0 comments

I made a TikTok video downloader website with no ads.. yet

https://www.tdown.app/
1•henrymuddleton•12m ago•0 comments

Bezos-funded climate satellite is lost in space

https://www.theverge.com/news/703091/methane-satellite-methanesat-lost-bezos-edf
1•Bluestein•14m ago•0 comments

AI Agents ≠ Zapier–A Better Mental Model

2•chandan_maruthi•15m ago•0 comments

Building Proactive AI Agents

https://substack.com/home/post/p-164375851
1•Mernit•15m ago•0 comments

Inertia.js in Rails: a new era of effortless integration (2024)

https://evilmartians.com/chronicles/inertiajs-in-rails-a-new-era-of-effortless-integration
2•mooreds•17m ago•0 comments

Show HN: DBUF

https://github.com/bintoca/dbuf
1•pierogitus•18m ago•0 comments

Tsukudani and hot rice: Still a go-to meal in Japan centuries after its creation

https://apnews.com/article/tsukudani-japan-side-tokyo-traditional-food-fa63e1f3f59d2b9e177a327f7c814ffe
1•petethomas•20m ago•0 comments

Building a timberframe home from scratch

https://massiehouse.blogspot.com/
1•xdfg13345•21m ago•0 comments

Robot surgery on humans could be trialled within decade after success on pigs

https://www.theguardian.com/science/2025/jul/09/robot-surgery-on-humans-could-be-trialled-within-decade-after-success-on-pig-organs
2•Bluestein•22m ago•0 comments

Unpatchable Vulnerabilities in Windows 10/11: Security Report 2025

https://zenodo.org/records/15850090
1•vinhatson•25m ago•1 comments

Show HN: A Nextflow ↔ Python Integration Plugin

https://github.com/royjacobson/nf-python
1•unddoch•26m ago•0 comments

TikTok Sans

https://fonts.google.com/specimen/TikTok+Sans
2•Tiberium•27m ago•0 comments

Managed Postgres Overview

https://fly.io/docs/mpg/overview/
1•sergiotapia•29m ago•0 comments

What are your dream companies to work at?

1•ssc23•29m ago•0 comments

A simple monthly injection allows mice to live 25% longer and free from diseases

https://english.elpais.com/science-tech/2024-07-17/a-simple-monthly-injection-allows-mice-to-live-25-longer-and-free-from-diseases.html
3•speckx•30m ago•0 comments

Symbolic 'science fair' showcases research cut by Trump team

https://www.nature.com/articles/d41586-025-02164-y
2•Bluestein•30m ago•0 comments

Scientists 3D print tumors for cancer research

https://www.tomshardware.com/3d-printing/scientists-3d-print-tumors-for-cancer-research-tissuetinker-using-3d-bioprinting-to-create-miniature-models-of-healthy-and-diseased-tissue-for-side-by-side-comparison-backed-by-mcgill
1•giuliomagnifico•31m ago•0 comments

Perplexity just launched Comet, an AI web browser

https://www.theverge.com/news/703037/perplexity-ai-web-browser-comet-launch
2•cpeterso•36m ago•0 comments