I have built TrainForgeTester, an open-source scenario test runner for AI agents that take actions (call tools).
The idea: test how agents perform in company-specific scenarios, not just on general benchmarks. More specifically, test for the failure modes that matter in production: taking the wrong action, skipping a required step, calling the wrong tool, or passing the wrong arguments.
TrainForgeTester lets you run multi-turn scenarios (you create these scenarios from your own use case and data, following the provided scenario schema) and check:
* tool calls and arguments
* strict or unordered tool execution
* expected responses
* regressions after model, prompt, or tool changes
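To make the checks concrete, here's a minimal sketch in Python of what comparing expected vs. actual tool calls in strict or unordered mode looks like. This is an illustration only, not the repo's actual API or schema:

```python
def check_tool_calls(expected, actual, strict_order=True):
    """Compare expected vs. actual tool calls.

    Each call is a (tool_name, args_dict) pair. In strict mode the
    sequence must match exactly; in unordered mode every expected call
    must appear somewhere in the actual calls, regardless of order.
    Returns a list of human-readable failure messages (empty = pass).
    """
    failures = []
    if strict_order:
        for i, (exp, act) in enumerate(zip(expected, actual)):
            if exp[0] != act[0]:
                failures.append(f"step {i}: expected tool '{exp[0]}', got '{act[0]}'")
            elif exp[1] != act[1]:
                failures.append(f"step {i}: '{exp[0]}' called with {act[1]}, expected {exp[1]}")
        if len(actual) < len(expected):
            missing = [name for name, _ in expected[len(actual):]]
            failures.append(f"missing calls: {missing}")
        elif len(actual) > len(expected):
            extra = [name for name, _ in actual[len(expected):]]
            failures.append(f"unexpected extra calls: {extra}")
    else:
        remaining = list(actual)
        for name, args in expected:
            match = next((c for c in remaining if c == (name, args)), None)
            if match is None:
                failures.append(f"missing call: {name} with {args}")
            else:
                remaining.remove(match)
    return failures


# Hypothetical example: the agent skipped the required 'verify_identity' step.
expected = [("verify_identity", {"user_id": "42"}),
            ("issue_refund", {"order_id": "A-17", "amount": 30})]
actual = [("issue_refund", {"order_id": "A-17", "amount": 30})]
print(check_tool_calls(expected, actual, strict_order=True))
```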
This scenario tester is the first part of the project (roughly v0.1.0).
I’m now working on the next part: a "scenario generator" that takes messy historical company data (customer support logs, agent traces, tool calls, transcripts, etc.) and turns it into testable scenarios for this framework. Again, I'm trying to make this as deterministic as possible.
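As a rough illustration of what I mean by deterministic, the generator essentially walks a raw trace and mechanically projects it into the scenario shape the runner checks, with no model call involved. The field names below are hypothetical, not the real schema:

```python
def trace_to_scenario(trace, scenario_id):
    """Project a raw agent trace into a testable scenario (hypothetical shape).

    `trace` is a list of events like:
      {"role": "user", "content": "..."}
      {"role": "assistant", "tool": "lookup_order", "args": {...}}
      {"role": "assistant", "content": "final answer"}
    The mapping is purely mechanical, so the same trace always yields
    the same scenario.
    """
    scenario = {
        "id": scenario_id,
        "turns": [],                # user messages that drive the conversation
        "expected_tool_calls": [],  # tool/args pairs mined from the trace
        "expected_response": None,  # last assistant message, if any
    }
    for event in trace:
        if event["role"] == "user":
            scenario["turns"].append(event["content"])
        elif event["role"] == "assistant" and "tool" in event:
            scenario["expected_tool_calls"].append(
                {"tool": event["tool"], "args": event["args"]})
        elif event["role"] == "assistant":
            scenario["expected_response"] = event["content"]
    return scenario


trace = [
    {"role": "user", "content": "I never received order A-17, refund me."},
    {"role": "assistant", "tool": "verify_identity", "args": {"user_id": "42"}},
    {"role": "assistant", "tool": "issue_refund", "args": {"order_id": "A-17", "amount": 30}},
    {"role": "assistant", "content": "Refund issued, you should see it in 3-5 days."},
]
print(trace_to_scenario(trace, "refund-missing-order"))
```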
Repo: https://github.com/TrainForge/TrainForgeTester
I’d love feedback on:
* real agent-testing use cases this does not cover yet (browser use, audio, video, mouse use)
* whether this direction makes sense
* where this could go as a product/devtool
* issues, edge cases, or missing features in the repo
Any GitHub issues, forks, or PRs would be highly appreciated.