frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Apple finalizes Gemini / Siri deal

https://www.engadget.com/ai/apple-reportedly-plans-to-reveal-its-gemini-powered-siri-in-february-...
1•vedantnair•23s ago•0 comments

Italy Railways Sabotaged

https://www.bbc.co.uk/news/articles/czr4rx04xjpo
1•vedantnair•45s ago•0 comments

Emacs-tramp-RPC: high-performance TRAMP back end using MsgPack-RPC

https://github.com/ArthurHeymans/emacs-tramp-rpc
1•fanf2•2m ago•0 comments

Nintendo Wii Themed Portfolio

https://akiraux.vercel.app/
1•s4074433•6m ago•1 comments

"There must be something like the opposite of suicide "

https://post.substack.com/p/there-must-be-something-like-the
1•rbanffy•8m ago•0 comments

Ask HN: Why doesn't Netflix add a “Theater Mode” that recreates the worst parts?

2•amichail•9m ago•0 comments

Show HN: Engineering Perception with Combinatorial Memetics

1•alan_sass•15m ago•1 comments

Show HN: Steam Daily – A Wordle-like daily puzzle game for Steam fans

https://steamdaily.xyz
1•itshellboy•17m ago•0 comments

The Anthropic Hive Mind

https://steve-yegge.medium.com/the-anthropic-hive-mind-d01f768f3d7b
1•spenvo•17m ago•0 comments

Just Started Using AmpCode

https://intelligenttools.co/blog/ampcode-multi-agent-production
1•BojanTomic•18m ago•0 comments

LLM as an Engineer vs. a Founder?

1•dm03514•19m ago•0 comments

Crosstalk inside cells helps pathogens evade drugs, study finds

https://phys.org/news/2026-01-crosstalk-cells-pathogens-evade-drugs.html
2•PaulHoule•20m ago•0 comments

Show HN: Design system generator (mood to CSS in <1 second)

https://huesly.app
1•egeuysall•20m ago•1 comments

Show HN: 26/02/26 – 5 songs in a day

https://playingwith.variousbits.net/saturday
1•dmje•21m ago•0 comments

Toroidal Logit Bias – Reduce LLM hallucinations 40% with no fine-tuning

https://github.com/Paraxiom/topological-coherence
1•slye514•24m ago•1 comments

Top AI models fail at >96% of tasks

https://www.zdnet.com/article/ai-failed-test-on-remote-freelance-jobs/
5•codexon•24m ago•2 comments

The Science of the Perfect Second (2023)

https://harpers.org/archive/2023/04/the-science-of-the-perfect-second/
1•NaOH•25m ago•0 comments

Bob Beck (OpenBSD) on why vi should stay vi (2006)

https://marc.info/?l=openbsd-misc&m=115820462402673&w=2
2•birdculture•28m ago•0 comments

Show HN: a glimpse into the future of eye tracking for multi-agent use

https://github.com/dchrty/glimpsh
1•dochrty•29m ago•0 comments

The Optima-l Situation: A deep dive into the classic humanist sans-serif

https://micahblachman.beehiiv.com/p/the-optima-l-situation
2•subdomain•29m ago•1 comments

Barn Owls Know When to Wait

https://blog.typeobject.com/posts/2026-barn-owls-know-when-to-wait/
1•fintler•30m ago•0 comments

Implementing TCP Echo Server in Rust [video]

https://www.youtube.com/watch?v=qjOBZ_Xzuio
1•sheerluck•30m ago•0 comments

LicGen – Offline License Generator (CLI and Web UI)

1•tejavvo•33m ago•0 comments

Service Degradation in West US Region

https://azure.status.microsoft/en-gb/status?gsid=5616bb85-f380-4a04-85ed-95674eec3d87&utm_source=...
2•_____k•33m ago•0 comments

The Janitor on Mars

https://www.newyorker.com/magazine/1998/10/26/the-janitor-on-mars
1•evo_9•35m ago•0 comments

Bringing Polars to .NET

https://github.com/ErrorLSC/Polars.NET
3•CurtHagenlocher•37m ago•0 comments

Adventures in Guix Packaging

https://nemin.hu/guix-packaging.html
1•todsacerdoti•38m ago•0 comments

Show HN: We had 20 Claude terminals open, so we built Orcha

1•buildingwdavid•38m ago•0 comments

Your Best Thinking Is Wasted on the Wrong Decisions

https://www.iankduncan.com/engineering/2026-02-07-your-best-thinking-is-wasted-on-the-wrong-decis...
1•iand675•38m ago•0 comments

Warcraftcn/UI – UI component library inspired by classic Warcraft III aesthetics

https://www.warcraftcn.com/
2•vyrotek•40m ago•0 comments
Open in hackernews

Unit Tests for LLMs?

6•simantakDabhade•4mo ago
is theres any package that helps do like vitest style like quick sanity checks on the output of an llm that I can automate to see if I have regressed on smthin while changing my prompt.

For example this agent for a realtor kept offering virtual viewings (even though that isnt a thing) instead of doing a handoff, (modified prompt for this) so a package where I can write a test so that, hey for this input, do not mention this or never mention those things. Or for certain inputs, always call this tool.

Started engineering my own little utility for this, but before I dove deep and built my own package, wanted to see if something like this alr exists or if im heading down the wrong path here!

p.s. not sure if this should be called evals, kinda overlapping but yeah what should this even be called?

Comments

gberger•4mo ago
You want to do evals, yeah.
ivape•4mo ago
Have the LLM evaluate its own response. User to LLM to LLM (validates its own response) to User.
SleepyWalrus•4mo ago
How are you approaching this? I assume it's some combo of unit tests and integration tests where you are making sure the response is generally consistent across multiple runs of the same prompt - or if you need to change the prompt to make sure the result was the same as before.

From what i've seen using LLMs, your best bet is to have evals for 100ish (if possible) examples with know ground truths. This way you will statistically get results of how accurate your LLM prompt is working. Having more examples will help increase the precision when hallucinations come in.

Unfortunately things get a little harder with qualitative responses, where you are expecting certain words or sentences in the response. Your best bet here is to also have 100 examples of what you expect the response would be and a form of semantic similarity comparison between the response and your ground truth.

vismit2000•4mo ago
https://hamel.dev/blog/posts/evals/