
Unit Tests for LLMs?

6•simantakDabhade•4mo ago
Is there any package that helps with vitest-style quick sanity checks on the output of an LLM, something I can automate to see if I have regressed on something while changing my prompt?

For example, this agent for a realtor kept offering virtual viewings (even though that isn't a thing) instead of doing a handoff (I modified the prompt for this). So I'm after a package where I can write a test that says: hey, for this input, do not mention this, or never mention those things. Or, for certain inputs, always call this tool.

I started engineering my own little utility for this, but before I dive deep and build my own package, I wanted to see if something like this already exists or if I'm heading down the wrong path here!

P.S. Not sure if this should be called evals. It's kind of overlapping, but yeah, what should this even be called?
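
A minimal sketch of the kind of check described above, written as plain pytest-style test functions. Everything here is hypothetical: `AgentReply`, `run_agent`, the tool name `handoff_to_human`, and the banned phrases are placeholders, and `run_agent` is a canned stub standing in for a real LLM call.

```python
from dataclasses import dataclass, field


@dataclass
class AgentReply:
    """Hypothetical shape of one agent turn: reply text plus tool calls made."""
    text: str
    tool_calls: list[str] = field(default_factory=list)


def run_agent(user_message: str) -> AgentReply:
    """Stand-in for the real LLM call; replace with your own client code."""
    if "viewing" in user_message.lower():
        return AgentReply(
            text="Let me connect you with one of our agents to arrange that.",
            tool_calls=["handoff_to_human"],
        )
    return AgentReply(text="How can I help you today?")


def assert_never_mentions(reply: AgentReply, banned: list[str]) -> None:
    """Fail if the reply text contains any banned phrase (case-insensitive)."""
    lowered = reply.text.lower()
    for phrase in banned:
        assert phrase.lower() not in lowered, f"reply mentioned {phrase!r}"


def assert_calls_tool(reply: AgentReply, tool: str) -> None:
    """Fail if the expected tool was not called during this turn."""
    assert tool in reply.tool_calls, f"expected {tool!r}, got {reply.tool_calls}"


def test_viewing_request_hands_off() -> None:
    reply = run_agent("Can I book a viewing for the flat on Elm Street?")
    assert_never_mentions(reply, ["virtual viewing", "video tour"])
    assert_calls_tool(reply, "handoff_to_human")
```

Because a real model is not deterministic, a test like this would probably need to run each input several times, or pin the sampling settings, before a pass means much.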

Comments

gberger•4mo ago
You want to do evals, yeah.
ivape•4mo ago
Have the LLM evaluate its own response. User to LLM to LLM (validates its own response) to User.
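
One way the user → LLM → LLM → user loop could look, as a rough sketch. The `generate` and `validate` callables are assumptions standing in for two separate model calls; the retry count and fallback message are also made up for illustration.

```python
from typing import Callable


def answer_with_self_check(
    user_message: str,
    generate: Callable[[str], str],        # first LLM call: drafts a reply
    validate: Callable[[str, str], bool],  # second LLM call: approves or rejects
    max_attempts: int = 2,
    fallback: str = "Let me hand you over to a colleague.",
) -> str:
    """Return the first draft the validator accepts, else a safe fallback."""
    for _ in range(max_attempts):
        draft = generate(user_message)
        if validate(user_message, draft):
            return draft
    return fallback


# Canned stand-ins so the sketch runs without a model.
def fake_generate(user_message: str) -> str:
    return "Sure, we can offer a virtual viewing."


def fake_validate(user_message: str, draft: str) -> bool:
    return "virtual viewing" not in draft.lower()


reply = answer_with_self_check("Can I see the flat?", fake_generate, fake_validate)
```

Here the validator rejects every draft, so `reply` ends up as the fallback text. This guards responses at runtime; it does not replace an offline test suite.
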
SleepyWalrus•4mo ago
How are you approaching this? I assume it's some combo of unit tests and integration tests where you make sure the response is generally consistent across multiple runs of the same prompt, or, if you need to change the prompt, that the result is the same as before.

From what I've seen using LLMs, your best bet is to have evals for 100ish (if possible) examples with known ground truths. That way you get a statistical measure of how accurately your LLM prompt is working. Having more examples tightens that estimate, so the occasional hallucination doesn't swing the result as much.
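
To put a number on why more examples help, here is a small sketch that turns pass/fail results into an accuracy plus a rough 95% margin of error. The normal-approximation formula is my choice for illustration, not something the comment specifies, and the 90% pass rate is invented.

```python
import math


def accuracy_with_margin(results: list[bool], z: float = 1.96) -> tuple[float, float]:
    """Return (accuracy, margin) using the normal approximation to a proportion.

    `results` holds one True/False per graded example; z=1.96 is roughly 95%.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one graded example")
    p = sum(results) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# Same 90% pass rate, measured on 100 and then on 400 examples.
small = accuracy_with_margin([True] * 90 + [False] * 10)
large = accuracy_with_margin([True] * 360 + [False] * 40)
```

With 100 examples the margin is about ±0.059; with 400 it is about ±0.029, so quadrupling the set roughly halves the uncertainty.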

Unfortunately, things get a little harder with qualitative responses, where you are expecting certain words or sentences in the response. Your best bet here is to also have 100 examples of what you expect the response to be, plus some form of semantic similarity comparison between the response and your ground truth.
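
A rough sketch of that comparison step. Note the big assumption: `embed` here is only a bag-of-words counter so the example runs with the standard library alone. It catches word overlap, not meaning, so a real setup would swap in a sentence-embedding model. The 0.6 threshold is arbitrary and would need tuning on your own examples.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Placeholder for a real model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_close_enough(response: str, expected: str, threshold: float = 0.6) -> bool:
    """True when the response scores at or above the (arbitrary) threshold."""
    return cosine_similarity(embed(response), embed(expected)) >= threshold
```

In a test you would loop over the ground-truth examples, call `is_close_enough(model_reply, expected_reply)` for each, and report the fraction that pass.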

vismit2000•4mo ago
https://hamel.dev/blog/posts/evals/