Show HN: Beval – Simple evaluations for your AI product

https://www.beval.space/
1•raviisoccupied•1h ago
I have been working on a web app called Beval - Simple evaluations for your AI product.

In my day to day as a Product Manager working in a team that ships AI products, I often found myself wanting to do 'quick and dirty' LLM-based evaluation on conversation transcripts and traces. I didn't need anything fancy, just 'did the agent answer the question', 'did the agent cover the 5 things it needed to' - that type of thing.

I found myself blocked by 'Gemini in Google Sheets': it was too slow and cumbersome, and it didn't handle eval changes well, particularly when trying to associate evals with ground truth. And because I was exploring or working on new and experimental features, it wasn't worth trying to set up something more robust with the team.

To fix the problem I eventually learned to call the OpenAI API in Python, but I really felt that I wanted a 'product' to help me and potentially help others who need answers fast - outside of building infrastructure and pipelines.

So over the last few weeks I built: https://beval.space

It has:
- LLM-as-judge evals: boolean checks (yes/no), scores (1-5), categories, and freeform comments
- Reusable eval definitions you can run across different datasets
- Ground truth labelling so you can compare eval versions against human judgments
- Per-trace reasoning so you can see why the judge scored something the way it did
- An example dataset so you can try it without having your own traces ready
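For readers who have not done this by hand, here is a rough sketch of what a boolean check compared against ground truth can look like. It is not Beval's code: `parse_verdict`, `agreement`, and `fake_judge` are hypothetical names, and `fake_judge` is a keyword-matching stand-in for a real LLM call so the example runs offline.

```python
from typing import Callable, Sequence


def parse_verdict(raw: str) -> bool:
    """Map a judge's reply to a boolean; raise if it is neither yes nor no."""
    word = raw.strip().lower().rstrip(".")
    if word in ("yes", "no"):
        return word == "yes"
    raise ValueError(f"unparseable verdict: {raw!r}")


def agreement(
    traces: Sequence[str],
    labels: Sequence[bool],
    judge: Callable[[str], str],
) -> float:
    """Fraction of traces where the judge's verdict matches the human label."""
    if not labels or len(traces) != len(labels):
        raise ValueError("traces and labels must be non-empty and equal in length")
    verdicts = [parse_verdict(judge(trace)) for trace in traces]
    matches = sum(v == label for v, label in zip(verdicts, labels))
    return matches / len(labels)


def fake_judge(trace: str) -> str:
    """Hypothetical stand-in for an LLM call: 'did the agent mention a refund?'"""
    return "Yes" if "refund" in trace.lower() else "No"


# Made-up example traces and human (ground truth) labels.
traces = [
    "You can get a refund within 30 days.",
    "Please contact support.",
    "Our policy covers returns.",
]
labels = [True, False, True]

score = agreement(traces, labels, fake_judge)
print(f"judge agrees with ground truth on {score:.0%} of traces")
```

Swapping `fake_judge` for a function that calls a real model leaves the comparison logic unchanged, which is what makes it possible to score two versions of an eval against the same labels.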

One of our early users described it as 'quick n dirty evals when you don't want to touch a shit load of infra.' I'm trying to figure out if that's a common need or just a niche thing.

Free during beta. Would love HN's take — what's missing, and would you actually use something like this?

Comments

warwickmcintosh•24m ago
LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time though: the same prompt on a different day sometimes gives different scores.
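One simple way to approximate the stability tracking described here, sketched under assumptions: rerun the same eval several times, store the 1-5 scores per trace, and flag traces whose scores spread out too much. The function name, the trace IDs, the scores, and the 0.5 threshold are all made up for illustration.

```python
from statistics import pstdev


def unstable_traces(
    runs: dict[str, list[int]], max_spread: float = 0.5
) -> dict[str, float]:
    """Return traces whose repeated scores have a standard deviation above max_spread.

    runs maps a trace id to the scores the same eval gave it on repeated runs.
    Traces with fewer than two runs are skipped, since spread is undefined for them.
    """
    spread = {
        trace_id: pstdev(scores)
        for trace_id, scores in runs.items()
        if len(scores) >= 2
    }
    return {trace_id: s for trace_id, s in spread.items() if s > max_spread}


# Made-up scores from three runs of the same eval on different days.
runs = {
    "trace-1": [4, 4, 4],
    "trace-2": [2, 4, 5],
    "trace-3": [3, 3, 4],
}

flagged = unstable_traces(runs)
for trace_id, s in flagged.items():
    print(f"{trace_id}: score std dev {s:.2f}")
```

With these numbers only `trace-2` is flagged. A population standard deviation is a crude measure; for boolean checks a flip rate between runs would be the closer analogue.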
