Ask HN: How are people doing AI evals these days?

8•yelmahallawy•20h ago
With the buzz that's happening with all the new AI models that get released (what feels like every other week), how are companies running internal AI evals to determine which model is best for their use case?

Comments

alexhans•17h ago
Very, very heterogeneous and fast-moving space.

Depending on how they're made up, different teams do vastly different things.

Some do no evals at all, some run integration tests with no tooling, and some use mixed observability tools like Langfuse in their CI/CD. Others use tools like Arize Phoenix, DeepEval, Braintrust, promptfoo, or Pydantic AI throughout development.

It's definitely an afterthought for most teams although we are starting to see increased interest.

My hope is that we can start thinking about evals as a common language for "product" across role families, so I'm doing some advocacy [1], trying to keep it very simple, including wrapping coding agents like Claude. Sandboxing and observability "for the masses" is still quite a hard concept, but the UX is getting better with time.

What are you doing for yourself/teams? If not much yet, I'd recommend just starting and figuring out where the friction/value is for you.

- [1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception)

bisonbear•5h ago
Assuming you're referring to coding agents: I don't think people are. If they are, it's likely using:

- AI to evaluate itself (e.g. ask Claude to test out its own skill)
- a custom-built platform (I see interest in this space)

I've actually been thinking about this problem a lot and am working on making a custom eval runner for your codebase. What would your use case be for this?

celestialcheese•2h ago
A mix of promptfoo and ad-hoc Python scripts, with Langfuse for observability.

Definitely not happy with it, but everything is moving too fast to feel like it's worth investing in.

kelseyfrog•1h ago
Automated benchmarking.

We were lucky enough to have PMs create a set of questions. We did a round of generation and labeled each response pass/fail.

From there we bootstrapped AI-as-a-judge and approximately replicated the results. Now we plug in new models and change prompts and pipelines while still being able to approximate the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings.

We're able to confidently make changes without accidentally breaking something else. Overall win, but it can get costly if the iteration count is high.
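
For anyone wanting to try the workflow described above, here is a minimal sketch of the "check the judge against human labels" step. It is an illustration, not the commenter's actual setup: `LabeledExample`, `judge_agreement`, and `keyword_judge` are hypothetical names, the four examples are made up, and `keyword_judge` is a trivial stand-in for a real LLM-as-judge call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LabeledExample:
    question: str
    response: str
    human_pass: bool  # pass/fail annotation from a human reviewer


def judge_agreement(
    examples: Sequence[LabeledExample],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of examples where the judge's verdict matches the human label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    matches = sum(
        judge(ex.question, ex.response) == ex.human_pass for ex in examples
    )
    return matches / len(examples)


def keyword_judge(question: str, response: str) -> bool:
    # Hypothetical stand-in: a real judge would call an LLM with a grading prompt.
    return "refund" in response.lower()


# Made-up labeled set, for illustration only.
examples = [
    LabeledExample("How do I get my money back?",
                   "You can request a refund within 30 days.", True),
    LabeledExample("How do I get my money back?",
                   "Please contact support.", False),
    LabeledExample("Can I cancel?",
                   "Yes, cancel anytime from settings.", True),
    LabeledExample("Can I cancel?",
                   "Refunds are not available.", False),
]

agreement = judge_agreement(examples, keyword_judge)
print(f"judge agrees with humans on {agreement:.0%} of examples")
```

If agreement is too low to trust, the judge prompt needs work before its verdicts can stand in for the human labels on new models or prompts.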

maxalbarello•1h ago
Also wondering how to eval agentic pipelines. For instance, I generated memories from my ChatGPT conversation history. How do I know whether they are accurate or not?

I would like a single number to optimize the pipeline against, but I find it hard to figure out what that number should measure.
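
One candidate for that number, assuming you are willing to hand-label a random sample of generated memories as accurate or not, is precision on that sample, reported with a confidence interval so small samples don't look more certain than they are. The sketch below uses the standard Wilson score interval; the `labels` list is made-up data and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Made-up hand labels for a sample of 10 generated memories:
# True means the memory matches what the conversation history actually says.
labels = [True, True, False, True, True, True, False, True, True, True]

accurate = sum(labels)
precision = accurate / len(labels)
low, high = wilson_interval(accurate, len(labels))

print(f"precision: {precision:.2f} (95% interval {low:.2f} to {high:.2f})")
```

Precision only says how many of the generated memories are right. It says nothing about memories the pipeline should have produced but didn't, so a recall-style check on a few known facts may be worth tracking alongside it.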

dkoy•1h ago
Curious who’s used OpenAI Evals

raviisoccupied•15m ago
I have been working on a web app called Beval ("Simple evaluations for your AI product") that is meant to be a 'lay person' introduction to evals.

In my day-to-day as a Product Manager working in a team that ships AI products, I often found myself wanting to do 'quick and dirty' LLM-based evaluation on conversation transcripts and traces. I found myself blocked by 'Gemini in Google Sheets': it was too slow and cumbersome, and it didn't handle eval changes well. And because I was exploring, it wasn't helpful to try to set up something more robust with the team.

To fix the problem I eventually learned to call the OpenAI API in Python and picked up more sophisticated approaches like some listed here, but I really felt that I wanted a 'product' to help me and potentially help others.

You can check it out at https://www.beval.space

Full disclosure - this is vibe coded and still a work in progress.
