Show HN: Beval – Simple evaluations for your AI product

https://www.beval.space/

1•raviisoccupied•1h ago

I have been working on a web app called Beval - Simple evaluations for your AI product.

In my day to day as a Product Manager working in a team that ships AI products, I often found myself wanting to do 'quick and dirty' LLM-based evaluation on conversation transcripts and traces. I didn't need anything fancy, just 'did the agent answer the question', 'did the agent cover the 5 things it needed to' - that type of thing.

I found myself blocked by 'Gemini in Google Sheets': it was too slow and cumbersome, and it didn't handle eval changes well - particularly when trying to associate evals with ground truth. And because I was exploring or working on new and experimental features, it wasn't helpful to try and set up something more robust with the team.

To fix the problem I eventually learned to call the OpenAI API in Python, but I really felt that I wanted a 'product' to help me and potentially help others who need answers fast - outside of building infrastructure and pipelines.

So over the last few weeks I built: https://beval.space

It has:
- LLM-as-judge evals: boolean checks (yes/no), scores (1-5), categories, and freeform comments
- Reusable eval definitions you can run across different datasets
- Ground truth labelling so you can compare eval versions against human judgments
- Per-trace reasoning so you can see why the judge scored something the way it did
- An example dataset so you can try it without having your own traces ready
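For anyone wondering what "compare eval versions against human judgments" amounts to for a boolean check, here is a rough sketch. It is not Beval's actual implementation; the `Trace` record, its field names, and `agreement_report` are hypothetical names made up for illustration. It just counts how often the judge's yes/no matches the human label and lists the traces where they disagree.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical record: one transcript with a judge verdict and a human label.
    trace_id: str
    judge_verdict: bool  # what the LLM judge answered for a yes/no check
    human_label: bool    # the ground-truth label from a human reviewer


def agreement_report(traces):
    """Compare boolean judge verdicts against human labels."""
    if not traces:
        raise ValueError("need at least one labelled trace")
    agree = sum(t.judge_verdict == t.human_label for t in traces)
    return {
        # Fraction of traces where judge and human gave the same answer.
        "agreement": agree / len(traces),
        # Judge said yes, human said no.
        "false_pass": [t.trace_id for t in traces
                       if t.judge_verdict and not t.human_label],
        # Judge said no, human said yes.
        "false_fail": [t.trace_id for t in traces
                       if not t.judge_verdict and t.human_label],
    }


if __name__ == "__main__":
    sample = [
        Trace("t1", True, True),
        Trace("t2", True, False),
        Trace("t3", False, False),
        Trace("t4", False, False),
    ]
    print(agreement_report(sample))
```

Running the same report for two versions of an eval prompt against the same labelled traces gives a quick way to tell whether a prompt change moved the judge closer to or further from the human labels.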

One of our early users described it as 'quick n dirty evals when you don't want to touch a shit load of infra.' I'm trying to figure out if that's a common need or just a niche thing.

Free during beta. Would love HN's take — what's missing, and would you actually use something like this?

Comments

warwickmcintosh•24m ago

LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time though: the same prompt on a different day sometimes gives different scores.
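To make "stability tracking" concrete, something like the sketch below would do as a first pass. It assumes you have saved 1-5 scores from several runs of the same eval on the same traces; the function names, the `runs` layout, and the `max_spread` threshold are all hypothetical, not anything Beval exposes.

```python
from statistics import pstdev


def stability_by_trace(runs):
    """runs: {run_label: {trace_id: score}} from repeated runs of one eval.

    Returns, per trace present in every run, the scores seen, the
    max-min spread, and the population standard deviation.
    """
    if not runs:
        raise ValueError("need at least one run")
    # Only compare traces that were scored in every run.
    common = set.intersection(*(set(scores) for scores in runs.values()))
    report = {}
    for trace_id in sorted(common):
        scores = [run[trace_id] for run in runs.values()]
        report[trace_id] = {
            "scores": scores,
            "spread": max(scores) - min(scores),
            "stdev": pstdev(scores),
        }
    return report


def unstable_traces(report, max_spread=1):
    """Trace ids whose score moved by more than max_spread across runs."""
    return [tid for tid, row in report.items() if row["spread"] > max_spread]


if __name__ == "__main__":
    runs = {
        "monday": {"t1": 4, "t2": 2},
        "tuesday": {"t1": 4, "t2": 5},
    }
    report = stability_by_trace(runs)
    print(report)
    print(unstable_traces(report))
```

The traces this flags are the ones worth checking against ground truth first, since a score that swings between runs says more about the judge than about the transcript.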
