frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Ask HN: Anyone orchestrating multiple AI coding agents in parallel?

1•buildingwdavid•53s ago•0 comments

Show HN: Knowledge-Bank

https://github.com/gabrywu-public/knowledge-bank
1•gabrywu•6m ago•0 comments

Show HN: The Codeverse Hub Linux

https://github.com/TheCodeVerseHub/CodeVerseLinuxDistro
3•sinisterMage•7m ago•0 comments

Take a trip to Japan's Dododo Land, the most irritating place on Earth

https://soranews24.com/2026/02/07/take-a-trip-to-japans-dododo-land-the-most-irritating-place-on-...
2•zdw•7m ago•0 comments

British drivers over 70 to face eye tests every three years

https://www.bbc.com/news/articles/c205nxy0p31o
5•bookofjoe•7m ago•1 comments

BookTalk: A Reading Companion That Captures Your Voice

https://github.com/bramses/BookTalk
1•_bramses•8m ago•0 comments

Is AI "good" yet? – tracking HN's sentiment on AI coding

https://www.is-ai-good-yet.com/#home
1•ilyaizen•9m ago•1 comments

Show HN: Amdb – Tree-sitter based memory for AI agents (Rust)

https://github.com/BETAER-08/amdb
1•try_betaer•10m ago•0 comments

OpenClaw Partners with VirusTotal for Skill Security

https://openclaw.ai/blog/virustotal-partnership
2•anhxuan•10m ago•0 comments

Show HN: Seedance 2.0 Release

https://seedancy2.com/
2•funnycoding•11m ago•0 comments

Leisure Suit Larry's Al Lowe on model trains, funny deaths and Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
1•thelok•11m ago•0 comments

Towards Self-Driving Codebases

https://cursor.com/blog/self-driving-codebases
1•edwinarbus•11m ago•0 comments

VCF West: Whirlwind Software Restoration – Guy Fedorkow [video]

https://www.youtube.com/watch?v=YLoXodz1N9A
1•stmw•12m ago•1 comments

Show HN: COGext – A minimalist, open-source system monitor for Chrome (<550KB)

https://github.com/tchoa91/cog-ext
1•tchoa91•13m ago•1 comments

FOSDEM 26 – My Hallway Track Takeaways

https://sluongng.substack.com/p/fosdem-26-my-hallway-track-takeaways
1•birdculture•13m ago•0 comments

Show HN: Env-shelf – Open-source desktop app to manage .env files

https://env-shelf.vercel.app/
1•ivanglpz•17m ago•0 comments

Show HN: Almostnode – Run Node.js, Next.js, and Express in the Browser

https://almostnode.dev/
1•PetrBrzyBrzek•17m ago•0 comments

Dell support (and hardware) is so bad, I almost sued them

https://blog.joshattic.us/posts/2026-02-07-dell-support-lawsuit
1•radeeyate•18m ago•0 comments

Project Pterodactyl: Incremental Architecture

https://www.jonmsterling.com/01K7/
1•matt_d•18m ago•0 comments

Styling: Search-Text and Other Highlight-Y Pseudo-Elements

https://css-tricks.com/how-to-style-the-new-search-text-and-other-highlight-pseudo-elements/
1•blenderob•20m ago•0 comments

Crypto firm accidentally sends $40B in Bitcoin to users

https://finance.yahoo.com/news/crypto-firm-accidentally-sends-40-055054321.html
1•CommonGuy•20m ago•0 comments

Magnetic fields can change carbon diffusion in steel

https://www.sciencedaily.com/releases/2026/01/260125083427.htm
1•fanf2•21m ago•0 comments

Fantasy football that celebrates great games

https://www.silvestar.codes/articles/ultigamemate/
1•blenderob•21m ago•0 comments

Show HN: Animalese

https://animalese.barcoloudly.com/
1•noreplica•22m ago•0 comments

StrongDM's AI team build serious software without even looking at the code

https://simonwillison.net/2026/Feb/7/software-factory/
3•simonw•22m ago•0 comments

John Haugeland on the failure of micro-worlds

https://blog.plover.com/tech/gpt/micro-worlds.html
1•blenderob•23m ago•0 comments

Show HN: Velocity - Free/Cheaper Linear Clone but with MCP for agents

https://velocity.quest
2•kevinelliott•23m ago•2 comments

Corning Invented a New Fiber-Optic Cable for AI and Landed a $6B Meta Deal [video]

https://www.youtube.com/watch?v=Y3KLbc5DlRs
1•ksec•25m ago•0 comments

Show HN: XAPIs.dev – Twitter API Alternative at 90% Lower Cost

https://xapis.dev
2•nmfccodes•25m ago•1 comments

Near-Instantly Aborting the Worst Pain Imaginable with Psychedelics

https://psychotechnology.substack.com/p/near-instantly-aborting-the-worst
2•eatitraw•31m ago•0 comments
Open in hackernews

Agent simulations = unit testing for AI?

2•draismaa•7mo ago
In traditional software, we write unit tests to catch regressions before they reach users. In AI systems—especially agentic ones that model breaks down. You can test inputs and outputs, use evals, but agents operate over time, across tools, mcps, apis, and unpredictable user input. The failure modes are non-obvious and often emerge only in edge cases. I'm seeing an emerging practice: agent simulations—structured, repeatable scenarios that test how an AI agent behaves in complex or long-tail situations.

Think: What if the upstream tool fails mid-execution? What if the user flips intent mid-dialogue? What if the agent’s assumptions were subtly wrong?

from self-driving cars to AI agents? The above aren’t one-off tests. They’re like AV simulations: controlled environments to explore failure boundaries. Autonomous vehicle teams learned long ago that real-world data isn't enough. The rarest events are the most important—and you need to generate and replay them systematically. That same long-tail distribution applies to LLM agents. We’ve started treating scenario testing as a core part of the dev loop—versioning simulations, running them in CI, and evolving them as our agent behavior changes. It’s not about perfect coverage,it’s about shifting from “test after” to “test through simulation” as part of iterative agent development. Curious if others here are doing something similar. How are you testing your agents beyond a few prompts and metrics? Would love to hear how the HN crowd is thinking about agent reliability and safety—not just in research, but in real-world deployments.

Comments

aszen•7mo ago
We are just starting to introduce AI and for now rely on simple evals as unit tests that Dev's run locally to fine tune prompts and context.

Your idea of simulating agent interactions is interesting, but I want to know how are you actually evaluating simulation runs?

jangletown•7mo ago
hello aszen, I work with draismaa, the way we have developed our simulations is by putting a few agents in a loop to simulate the conversation:

- the agent under test - a user simulator agent, sending messages as a user would - a judge agent, overlooking and stopping the simulation with a verdict when achieved

it then takes a description of the simulation scenario, and a list of criteria for the judge to eval, and that's enough to run the simulation

this is allowing us to tdd our way into building those agents, like, before adding something to the prompt, we can add a scenario/criteria first, see it fail, then fix the prompt, and see it playing out nicely (or having to vibe a bit further) until the test is green

we put this together in a framework called Scenario:

https://github.com/langwatch/scenario

the cool thing is that we also built in a way to control the simulation, so you can go as flexible as possible (just let it play out on autopilot), or define what the user said, mock what agent replied and so on to carry on a situation

and then in the middle of this turns we can throw in any additional evaluation, for example checking if a tool was called, it's really just a simple pytest/vitest assertion, it's a function callback so any other eval can also be called