
Without benchmarking LLMs, you're likely overpaying 5-10x

https://karllorey.com/posts/without-benchmarking-llms-youre-overpaying
70•lorey•3h ago

Comments

petcat•2h ago
> He's a non-technical founder building an AI-powered business.

It sounds like he's building some kind of AI support chatbot.

I despise these things.

r_lee•1h ago
And the whole article is about promoting his benchmarking service, of course.
montroser•1h ago
The whole post is just an advert for this person's startup. Their "friend" doesn't exist...
lorey•1h ago
Totally agree with your point. While I can't be specific, it's a traditional (German) business that he's running vertically integrated with AI. Customer support is really bad in this traditional niche, and by layering AI on top of doing the support himself 24/7, he was able to make it his competitive edge.
verdverm•2h ago
I'd second this wholeheartedly

Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of pro, super fast by comparison, and some basic a/b testing shows little to no difference in output on the majority of tasks I tried.

Cut all my subs, spend less money, don't get rate limited
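
The a/b part was nothing fancy, roughly this shape (assumes an OpenAI-compatible endpoint; the model names and prompts are placeholders, not what I actually ran):

    # Minimal A/B sketch: same prompts through both models, compare output,
    # speed, and token cost side by side. Assumes an OpenAI-compatible
    # endpoint; model names and prompts are placeholders.
    import time
    from openai import OpenAI

    client = OpenAI()  # reads API key / base URL from the environment

    MODELS = ["flash-tier-placeholder", "pro-tier-placeholder"]
    PROMPTS = [
        "Refactor this function to remove the duplicated branch: ...",
        "Write a unit test for a date-parsing helper.",
    ]

    for prompt in PROMPTS:
        for model in MODELS:
            start = time.time()
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            print(f"[{model}] {time.time() - start:.1f}s, "
                  f"{resp.usage.total_tokens} tokens")
            print(resp.choices[0].message.content[:200])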

r_lee•2h ago
Plus I've found that with "thinking" models, the benefit is more like extra working memory than an actual perf boost. It might even be worse: if the model goes even slightly wrong in the "thinking" part, it'll then commit to that for the actual response.
verdverm•49m ago
For sure. The difference in the most recent model generations makes them far more useful for many daily tasks. This is the first gen with thinking as a significant mid-training focus, and it shows.

gemini-3-flash stands well above gemini-2.5-pro

dpoloncsak•1h ago
Yeah, on one of my first projects one of my buddies asked, "Why aren't you using [ChatGPT 4.0] nano? It's 99% the effectiveness with 10% the price."

I've been using the smaller models ever since. Nano/mini, flash, etc.

phainopepla2•1h ago
I have been benchmarking many of my use cases, and the GPT nano models have fallen completely flat on every single one except very short summaries. I would call them 25% effectiveness at best.
verdverm•29m ago
Flash is not a small model; it's still over 1T parameters. It's a hyper MoE, as I understand it.
walthamstow•1h ago
Flash Lite 2.5 is an unbelievably good model for the price
sixtyj•1h ago
Yup.

I found out recently that Grok-4.1-fast has similar pricing (in cents) but a 10x larger context window (2M tokens instead of the ~128-200k of gpt-4.1-nano). And ~4% hallucination, the lowest in blind tests on LLM Arena.

verdverm•50m ago
You use stuff from xAI and Elmo?

I'm unwilling to look past Musk's politics, immorality, and manipulation on a global scale

rudhdb773b•34m ago
Grok is the best general purpose LLM in my experience. Only Gemini is comparable. It would be silly to ignore it, and xAI is less evil than Google these days.
verdverm•30m ago
Um, nonconsensual sexualized imagery does not bode well for that evil part

Anecdotal experience about which model is better is pointless. There are too many variables, the gap in the benchmarks is minimal, and the tool wielder makes more difference.

andy99•1h ago
Depends on what you're doing. Using the smaller / cheaper LLMs will generally make your system way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you're worried about people messing with it, about weird behavior on edge cases, about stability, you'd have to do a bunch of robustness testing as well, and bigger models will be better.

Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.

I'd be careful, is all.

candiddevmike•1h ago
One point in favor of smaller/self-hosted LLMs: more consistent performance, and you control your upgrade cadence, not the model providers.

I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.

andy99•1h ago
How much you value control is one part of the optimization problem. Obviously self-hosting gives you more, but it costs more. And re: evals, I trust GPT, Gemini, and Claude a lot more than some smaller thing I self-host, and would end up wanting to do way more evals if I self-hosted a smaller model.

(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)

jmathai•1h ago
You may also be getting a worse result for higher cost.

For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. We were pleasantly surprised when the LLM-as-judge scored gpt-5-mini as the clear winner. I don't think I would have considered using it for these specific use cases, assuming higher reasoning was necessary.

Still waiting on human evaluation to confirm the LLM Judge was correct.
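
The harness itself was nothing special, shaped roughly like this (model names are placeholders, and generate / score_with_judge are stand-ins for our actual API calls and judge prompt):

    # Rank candidate models by mean LLM-judge score over an eval set.
    # Model names are placeholders; generate() and score_with_judge()
    # are stand-ins for the real API call and judge rubric.
    from statistics import mean

    CANDIDATES = ["anthropic-model-placeholder", "openai-mini-placeholder"]
    CASES = ["case 1 ...", "case 2 ..."]  # real de-identified cases in practice

    def generate(model: str, case: str) -> str:
        return f"{model} answer to {case}"  # stand-in for the API call

    def score_with_judge(case: str, answer: str) -> float:
        return 1.0  # stand-in: the judge returns a 0-1 rubric score

    ranking = sorted(
        ((mean(score_with_judge(c, generate(m, c)) for c in CASES), m)
         for m in CANDIDATES),
        reverse=True,
    )
    for score, model in ranking:
        print(f"{model}: {score:.2f}")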

andy99•1h ago
You obviously know what you're looking for better than me, but personally I'd want to see a narrative that made sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation; it feels very dicey without one.
lorey•35m ago
That's interesting. Similarly, we found that for very simple tasks the older Haiku models are worth a look, as they're cheaper than the latest Haiku models and often perform equally well.
lorey•1h ago
You're right. We did a few use cases, and I have to admit that while customer service is the easiest to explain, it's also where I'd not choose the cheapest model, for the reasons you give.
epolanski•1h ago
The author of this post should benchmark his own blog for accessibility metrics; the text contrast is dreadful.

On the other hand, this would be interesting for measuring agents in coding tasks, but there's quite a lot of context to provide there; both input and output would be massive.

lorey•1h ago
Appreciate the feedback, will work on that.
faeyanpiraat•1h ago
One more vote on fixing contrast from me.
lorey•1h ago
Will fix, thanks :)
faeyanpiraat•50m ago
Tried Evalry, it's a really nice concept, thanks for sharing it!
lorey•50m ago
Pushed a fix. Could you check, please?

Any resources you can recommend to properly tackle this going forward?

gridspy•1h ago
Wow, this was some slick long form sales work. I hope your SaaS goes well. Nice one!
hamiltont•1h ago
Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale, use boolean criteria instead, then weight manually, e.g.

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: it reduces the volatility of responses while still keeping the creativity (temperature) needed for good intuition.
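
A minimal sketch of what I mean (the judge model name is a placeholder, and the criteria/weights are the ones above):

    # Boolean-criteria judge: ask for JSON of true/false verdicts, then
    # apply the weights by hand. Judge model name is a placeholder.
    import json
    from openai import OpenAI

    client = OpenAI()

    CRITERIA = {
        "cites_return_policy": 0.5,  # Did it cite the 30-day return policy?
        "tone_ok": 0.3,              # Tone professional and empathetic?
        "next_steps": 0.2,           # Offered clear next steps?
    }

    def judge(reply: str) -> float:
        prompt = (
            "Evaluate the customer-support reply below. Answer ONLY with a "
            f"JSON object with boolean values for keys {sorted(CRITERIA)}.\n\n"
            f"Reply:\n{reply}"
        )
        raw = client.chat.completions.create(
            model="judge-model-placeholder",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # the judge should be boring even if the agent isn't
        ).choices[0].message.content
        verdicts = json.loads(raw)
        return sum(w for k, w in CRITERIA.items() if verdicts.get(k))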

pocketarc•1h ago
I use this approach for a ticket-based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails; others, like you brought up, are just a weighted ding to the response's final score.

Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
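
Roughly this shape, with stand-ins for the real generate/check calls (names, weights, and thresholds here are illustrative, not our production code):

    # Gate-and-regenerate loop: hard fails block the draft outright, soft
    # fails ding a weighted score; failed checks are fed back into the next
    # attempt. generate() and run_checks() are stand-ins for the real calls.
    from dataclasses import dataclass

    @dataclass
    class Check:
        name: str
        passed: bool
        hard: bool     # hard fail: never let the draft through
        weight: float  # soft fail: counts against the score instead

    def generate(ticket: str, feedback: list[str]) -> str:
        # stand-in for the LLM call; feedback gets folded into the prompt
        return f"Draft reply for: {ticket} (fixing: {feedback or 'nothing'})"

    def run_checks(ticket: str, draft: str) -> list[Check]:
        # stand-in for the real rule-based and LLM-judged boolean checks
        return [
            Check("no_pii_leak", True, hard=True, weight=0.0),
            Check("cites_policy", True, hard=False, weight=0.6),
            Check("clear_next_steps", True, hard=False, weight=0.4),
        ]

    def release(ticket: str, retries: int = 3, min_score: float = 0.8) -> str:
        feedback: list[str] = []
        for _ in range(retries):
            draft = generate(ticket, feedback)
            checks = run_checks(ticket, draft)
            failed = [c.name for c in checks if not c.passed]
            if not any(c.hard and not c.passed for c in checks):
                score = sum(c.weight for c in checks if c.passed)
                if score >= min_score:
                    return draft
            feedback = failed  # regenerate with the failures as feedback
        return "ESCALATE_TO_HUMAN"  # fallback when retries are exhausted

    print(release("Order #123 arrived damaged"))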

Imustaskforhelp•1h ago
This actually seems like really good advice. I'm interested in how you might adapt this to things like programming-language benchmarks.

By having independent tests and then seeing whether the code passes them (yes or no), with some more complicated tasks weighted higher? Or how exactly?
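
Something like this is what I'm picturing (tasks, tests, and weights invented purely to illustrate; the candidate source would come from the model being benchmarked):

    # Grade generated code with independent pass/fail tests, weighting the
    # more complicated tasks higher. Everything here is made up to show the
    # shape; candidate_src would come from the model under test.
    import subprocess, sys, tempfile

    TASKS = [
        # (weight, candidate source from the model, test appended to it)
        (1.0, "def add(a, b):\n    return a + b\n",
              "assert add(2, 3) == 5"),
        (3.0, "def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)\n",
              "assert [fib(i) for i in range(7)] == [0, 1, 1, 2, 3, 5, 8]"),
    ]

    score = total = 0.0
    for weight, candidate_src, test in TASKS:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(candidate_src + "\n" + test + "\n")
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True)
        score += weight * (result.returncode == 0)  # boolean pass/fail
        total += weight
    print(f"weighted pass rate: {score / total:.0%}")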

hamiltont•1h ago
Not sure I'm fully following your question, but maybe this helps:

IME deep thinking has moved from upfront architecture to post-prototype analysis.

Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging

With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate

When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.

The shift: from "design away problems" to "evaluate into solutions."

lorey•1h ago
Yes, absolutely. This aligns with what we found. It seems to be necessary to be very clear on scoring (at least for Opus 4.5).
46493168•1h ago
Isn’t this just rubrics?
piskov•5m ago
How come accuracy has only 50% weight?

“You’re absolutely right! Nice catch how I absolutely fooled you”

deepsquirrelnet•1h ago
This is just evaluation, not "benchmarking". If you haven't set up evaluation on something you're putting into production, then what are you even doing?

Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.

andy99•1h ago
What does that look like in your opinion, what do you use?
lorey•1h ago
This went straight to prod, even earlier than I'd have opted for. What do you mean?
deepsquirrelnet•17m ago
I’m totally in alignment with your blog post (other than terminology). I meant it more as a plea to all these projects that are trying to go into production without any measures of performance behind them.

It’s shocking to me how often it happens. Aside from just the necessity to be able to prove something works, there are so many other benefits.

Cost and model commoditization are part of it, like you point out. There's also the potential for degraded performance because off-the-shelf benchmarks aren't generalizing how you expect. Add to that an inability to migrate to newer models as they come out, potentially leaving performance on the table. There are like 95 serverless models in Bedrock now, and as soon as you can evaluate them on your task they immediately become a commodity.

But fundamentally you can’t even justify any time spent on prompt engineering if you don’t have a framework to evaluate changes.

Evaluation has been a critical practice in machine learning for years. IMO it's no less imperative when building with LLMs.

nickphx•1h ago
Ah yes... nothing like using one nondeterministic black box of nonsense to judge/rate the output of another... then charging others for it. lol
coredog64•1h ago
Amazon Bedrock Guardrails uses a purpose-built model to look for safety issues in the model inputs/outputs. While you won't get any specific guarantees from AWS, they will point you at datasets that you can use to evaluate the product and then determine if it's fit for purpose according to your risk tolerance.
OutOfHere•1h ago
You don't need a fancy UI to try the mini model first.
empiko•15m ago
I don't disagree with the post, but I'm surprised that a post basically explaining very basic dataset construction is so high up here. But I guess most people just read the headline?

California is free of drought for the first time in 25 years

https://www.latimes.com/california/story/2026-01-09/california-has-no-areas-of-dryness-first-time...
25•thnaks•13m ago•2 comments

A 26,000-year astronomical monument hidden in plain sight (2019)

https://longnow.org/ideas/the-26000-year-astronomical-monument-hidden-in-plain-sight/
271•mkmk•4h ago•46 comments

Inside the secret world of Japanese snack bars

https://www.bbc.com/travel/article/20260116-inside-the-secret-world-of-japanese-snack-bars
36•rmason•1h ago•10 comments

The Unix Pipe Card Game

https://punkx.org/unix-pipe-game/
156•kykeonaut•6h ago•43 comments

The challenges of soft delete

https://atlas9.dev/blog/soft-delete.html
20•buchanae•1h ago•3 comments

I'm addicted to being useful

https://www.seangoedecke.com/addicted-to-being-useful/
441•swah•12h ago•216 comments

Running Claude Code dangerously (safely)

https://blog.emilburzo.com/2026/01/running-claude-code-dangerously-safely/
258•emilburzo•10h ago•207 comments

Maintenance: Of Everything, Part One

https://press.stripe.com/maintenance-part-one
38•mitchbob•3h ago•8 comments

Unconventional PostgreSQL Optimizations

https://hakibenita.com/postgresql-unconventional-optimizations
226•haki•8h ago•27 comments

Provably Unmasking Malicious Behavior Through Execution Traces

https://arxiv.org/abs/2512.13821
3•PaulHoule•34m ago•0 comments

Show HN: Mastra 1.0, open-source JavaScript agent framework from the Gatsby devs

https://github.com/mastra-ai/mastra
48•calcsam•6h ago•21 comments

Show HN: Agent Skills Leaderboard

https://skills.sh
4•andrewqu•1h ago•0 comments

Show HN: wxpath – Declarative web crawling in XPath

https://github.com/rodricios/wxpath
51•rodricios•6d ago•8 comments

Nvidia Stock Crash Prediction

https://entropicthoughts.com/nvidia-stock-crash-prediction
304•todsacerdoti•6h ago•250 comments

Fast Concordance: Instant concordance on a corpus of >1,200 books

https://iafisher.com/concordance/
24•evakhoury•4d ago•1 comments

RCS for Business

https://developers.google.com/business-communications/rcs-business-messaging
4•sshh12•18h ago•1 comments

TopicRadar – Track trending topics across Hacker News, GitHub, ArXiv, and more

https://apify.com/mick-johnson/topic-radar
3•MickolasJae•8h ago•1 comments

Linux kernel framework for PCIe device emulation, in userspace

https://github.com/cakehonolulu/pciem
208•71bw•15h ago•74 comments

Danish pension fund divesting US Treasuries

https://www.reuters.com/business/danish-pension-fund-divest-its-us-treasuries-2026-01-20/
540•mythical_39•7h ago•580 comments

Scheme implementation as O'Reilly book via Claude Code

https://ezzeriesa.notion.site/Scheme-implementation-as-O-Reilly-book-via-Claude-Code-2ee1308b4204...
3•kurinikku•8h ago•1 comments

Cloudflare zero-day: Accessing any host globally

https://fearsoff.org/research/cloudflare-acme
4•2bluesc•6h ago•0 comments

Dockerhub for Skill.md

https://skillregistry.io/
3•tomaspiaggio12•7h ago•0 comments

When "likers'' go private: Engagement with reputationally risky content on X

https://arxiv.org/abs/2601.11140
29•linolevan•4h ago•20 comments

Model Market Fit

https://www.nicolasbustamante.com/p/model-market-fit
3•nbstme•5h ago•1 comments

Channel3 (YC S25) Is Hiring

https://www.ycombinator.com/companies/channel3/jobs/3DIAYYY-backend-engineer
1•aschiff1•10h ago

Show HN: Fence – Sandbox CLI commands with network/filesystem restrictions

https://github.com/Use-Tusk/fence
5•jy-tan•4h ago•1 comments

Electricity use of AI coding agents

https://www.simonpcouch.com/blog/2026-01-20-cc-impact/
4•linolevan•4h ago•0 comments

Level S4 solar radiation event

https://www.swpc.noaa.gov/news/g4-severe-geomagnetic-storm-levels-reached-19-jan-2026
593•WorldPeas•1d ago•197 comments

The fix for a segfault that never shipped

https://www.recall.ai/blog/the-fix-for-a-segfault-that-never-shipped
11•davidgu•4h ago•1 comments

The Zen of Reticulum

https://github.com/markqvist/Reticulum/blob/master/Zen%20of%20Reticulum.md
87•mikece•9h ago•56 comments