frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Show HN: Minecraft Creeper meets 90s Tamagotchi

https://github.com/danielbrendel/krepagotchi-game
1•foxiel•5m ago•0 comments

Show HN: Termiteam – Control center for multiple AI agent terminals

https://github.com/NetanelBaruch/termiteam
1•Netanelbaruch•5m ago•0 comments

The only U.S. particle collider shuts down

https://www.sciencenews.org/article/particle-collider-shuts-down-brookhaven
1•rolph•8m ago•1 comments

Ask HN: Why do purchased B2B email lists still have such poor deliverability?

1•solarisos•8m ago•0 comments

Show HN: Remotion directory (videos and prompts)

https://www.remotion.directory/
1•rokbenko•10m ago•0 comments

Portable C Compiler

https://en.wikipedia.org/wiki/Portable_C_Compiler
2•guerrilla•12m ago•0 comments

Show HN: Kokki – A "Dual-Core" System Prompt to Reduce LLM Hallucinations

1•Ginsabo•13m ago•0 comments

Software Engineering Transformation 2026

https://mfranc.com/blog/ai-2026/
1•michal-franc•14m ago•0 comments

Microsoft purges Win11 printer drivers, devices on borrowed time

https://www.tomshardware.com/peripherals/printers/microsoft-stops-distrubitng-legacy-v3-and-v4-pr...
2•rolph•14m ago•0 comments

Lunch with the FT: Tarek Mansour

https://www.ft.com/content/a4cebf4c-c26c-48bb-82c8-5701d8256282
2•hhs•18m ago•0 comments

Old Mexico and her lost provinces (1883)

https://www.gutenberg.org/cache/epub/77881/pg77881-images.html
1•petethomas•21m ago•0 comments

'AI' is a dick move, redux

https://www.baldurbjarnason.com/notes/2026/note-on-debating-llm-fans/
3•cratermoon•22m ago•0 comments

The source code was the moat. But not anymore

https://philipotoole.com/the-source-code-was-the-moat-no-longer/
1•otoolep•22m ago•0 comments

Does anyone else feel like their inbox has become their job?

1•cfata•22m ago•0 comments

An AI model that can read and diagnose a brain MRI in seconds

https://www.michiganmedicine.org/health-lab/ai-model-can-read-and-diagnose-brain-mri-seconds
2•hhs•26m ago•0 comments

Dev with 5 of experience switched to Rails, what should I be careful about?

1•vampiregrey•28m ago•0 comments

AlphaFace: High Fidelity and Real-Time Face Swapper Robust to Facial Pose

https://arxiv.org/abs/2601.16429
1•PaulHoule•29m ago•0 comments

Scientists discover “levitating” time crystals that you can hold in your hand

https://www.nyu.edu/about/news-publications/news/2026/february/scientists-discover--levitating--t...
2•hhs•31m ago•0 comments

Rammstein – Deutschland (C64 Cover, Real SID, 8-bit – 2019) [video]

https://www.youtube.com/watch?v=3VReIuv1GFo
1•erickhill•31m ago•0 comments

Tell HN: Yet Another Round of Zendesk Spam

3•Philpax•32m ago•0 comments

Postgres Message Queue (PGMQ)

https://github.com/pgmq/pgmq
1•Lwrless•35m ago•0 comments

Show HN: Django-rclone: Database and media backups for Django, powered by rclone

https://github.com/kjnez/django-rclone
2•cui•38m ago•1 comments

NY lawmakers proposed statewide data center moratorium

https://www.niagara-gazette.com/news/local_news/ny-lawmakers-proposed-statewide-data-center-morat...
1•geox•40m ago•0 comments

OpenClaw AI chatbots are running amok – these scientists are listening in

https://www.nature.com/articles/d41586-026-00370-w
3•EA-3167•40m ago•0 comments

Show HN: AI agent forgets user preferences every session. This fixes it

https://www.pref0.com/
6•fliellerjulian•42m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model

https://github.com/ghostty-org/ghostty/pull/10559
2•DustinEchoes•44m ago•0 comments

Show HN: SSHcode – Always-On Claude Code/OpenCode over Tailscale and Hetzner

https://github.com/sultanvaliyev/sshcode
1•sultanvaliyev•44m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/microsoft-appointed-a-quality-czar-he-has-no-direct-reports-and-no-b...
3•RickJWagner•46m ago•0 comments

Multi-agent coordination on Claude Code: 8 production pain points and patterns

https://gist.github.com/sigalovskinick/6cc1cef061f76b7edd198e0ebc863397
1•nikolasi•46m ago•0 comments

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

https://www.nytimes.com/2026/02/07/technology/washington-post-will-lewis.html
15•jbegley•47m ago•3 comments
Open in hackernews

Show HN: Create LLM graders and run evals in JavaScript with one file

https://github.com/bolt-foundry/bolt-foundry/tree/main/packages/bff-eval
28•randall•8mo ago
Hi HN!

Run it: OPENROUTER_API_KEY="sk" npx bff-eval --demo

We built a tool to help people take LLM outputs and easily grade them / eval them to know how good an assistant response is.

We've built a number of LLM apps, and while we could ship decent tech demos, we were disappointed with how they'd perform over time. We worked with a few companies who had the same problem, and found out scientifically building prompts and evals is far from a solved problem... writing these things feels more like directing a play than coding.

Inspired by Anthropic's constitutional ai concepts, and amazing software like DSPy, we're setting out to make fine tuning prompts, not models, the default approach to improving quality using actual metrics and structured debugging techniques.

Our approach is pretty simple: you feed it a JSONL file with inputs and outputs, pick the models you want to test against (via OpenRouter), and then use an LLM-as-grader file in JS that figures out how well your outputs match the original queries.

If you're starting from scratch, we've found TDD is a great approach to prompt creation... start by asking an LLM to generate synthetic data, then you be the first judge creating scores, then create a grader and continue to refine it till its scores match your ground truth scores.

If you’re building LLM apps and care about reliability, I hope this will be useful! Would love any feedback. The team and I are lurking here all day and happy to chat. Or hit me up directly on Whatsapp: +1 (646) 670-1291

We have a lot bigger plans long-term, but we wanted to start with this simple (and hopefully useful!) tool.

Run it: OPENROUTER_API_KEY="sk" npx bff-eval --demo

Comments

rbalicki•8mo ago
Very cool! This lets you grade output across different base models. Does it also allow you grade output across different prompts?
randall•8mo ago
that’s the next step… we have a structured approach to prompting too that we think will help people build better prompts too.