
Ask HN: What are you working on? (May 2026)

256•david927•1d ago•948 comments

Ask HN: We just had an actual UUID v4 collision...

459•mittermayr•3d ago•334 comments

Ask HN: How often do you investigate issues in production vs. looking at logs?

3•aspectrr•5h ago•1 comments

Ask HN: How do you choose a model for a task?

6•bix6•5h ago•7 comments

Rumors of my death are slightly exaggerated

1654•CliffStoll•5d ago•253 comments

Ask HN: Best static site generator for a docs site in 2026?

9•agenttestjekuqz•12h ago•8 comments

Ask HN: What would you like to be working on?

6•DDerTyp•8h ago•6 comments

Ask HN: Do you know the ethics of Developers?

7•eropatori•12h ago•16 comments

Ask HN: Will low quality AI customer support be the new normal?

22•0-bad-sectors•1d ago•16 comments

Our keyboards are tracking us

6•tukunjil•11h ago•5 comments

Ask HN: What Wintel/AMD (Laptop) Hardware are you liking?

3•aagha•19h ago•0 comments

Remind HN: Today is Mother's Day, call your moms

368•rationalist•1d ago•159 comments

Ask HN: Is this the SWE workflow of the future?

15•mc-0•1d ago•10 comments

Ask HN: Which LLM are you using to evaluate your ideas?

5•Marius77•1d ago•9 comments

Ask HN: Can tinnitus be triggered by high frequency noises?

6•tinnitus_crazy•1d ago•15 comments

Tell HN: Claude claims the AGPLv3 license violates its content policy

12•freedomben•1d ago•0 comments

Cancelling Claude subscription renewal immediately revokes Design access

5•o10449366•1d ago•1 comments

Best AI coding plan alternative to Claude and ChatGPT

15•Jsttan•1d ago•10 comments

Ask HN: Former master-tech building AI systems – how to break into software?

4•nicku711•2d ago•3 comments

Ask HN: Before Open Source took over the server, what was the discourse like?

7•mbgerring•1d ago•3 comments

Ask HN: What is your go-to solution for a personal wiki in 2026?

16•ex-aws-dude•4d ago•21 comments

How are folks affordably self-training in AI?

7•macartain•2d ago•9 comments


Ask HN: How do you choose a model for a task?

6•bix6•5h ago
How do you decide a model is good enough for a given task? Right now I use Opus for planning and harder tasks and switch to Sonnet for more well-defined tasks. But I feel like Sonnet is kind of stupid and introduces issues because it can't grasp the larger context. Is there some definitive way to say a model is good enough for a task, or is it all vibes?

Comments

PaulHoule•5h ago
Evaluation is harder than you think because of statistics.

Like, if you want to know accurately whether one model is better than another, you have to test them on hundreds if not thousands of examples that are carefully graded in difficulty, not in the training sets, etc.
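To put rough numbers on why a handful of examples isn't enough, here is a minimal Python sketch (not taken from the papers linked below) of a 95% Wilson score interval for a model's pass rate. It shows the same observed 2/3 pass rate at increasing sample sizes; the counts are made up for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate of successes/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half_width, center + half_width


# Same observed 2/3 pass rate, backed by more and more graded examples.
for successes, n in [(2, 3), (20, 30), (200, 300), (2000, 3000)]:
    lo, hi = wilson_interval(successes, n)
    print(f"{successes}/{n}: plausible true pass rate {lo:.2f} to {hi:.2f}")
```

With 3 examples the interval covers most of the range from 0 to 1; it only narrows to a few percentage points once you are into the thousands.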

Practically, you might try model A and model B, use each one 2-3 times on different tasks, and walk away with the impression that A is really good and B sux. But it could be that you happened to ask A to do things it is good at, or that it just got lucky and landed the right answer anyway.
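A quick way to see the "got lucky" effect is to compute, under assumed true pass rates, how often the weaker model looks at least as good as the stronger one in a short trial. This Python sketch assumes independent tasks and made-up pass rates of 70% vs. 50%; it illustrates the statistics and is not a measurement of any real model.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k passes out of n independent tasks with pass rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_weaker_ties_or_wins(p_strong: float, p_weak: float, n: int) -> float:
    """P(weak model's pass count >= strong model's), each model run on n tasks."""
    total = 0.0
    for a in range(n + 1):          # passes for the stronger model
        for b in range(a, n + 1):   # passes for the weaker model, at least a
            total += binom_pmf(a, n, p_strong) * binom_pmf(b, n, p_weak)
    return total


for n in (3, 30, 300):
    print(f"n={n}: P(weaker ties or wins) = {prob_weaker_ties_or_wins(0.7, 0.5, n):.3f}")
```

With 3 tasks each, the weaker model ties or beats the stronger one about 46% of the time; with 300 tasks each, that probability becomes negligible.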

See https://arxiv.org/html/2410.12972v1 and https://arxiv.org/pdf/2505.14810. Those papers consider a general space of tasks, but you could totally do the same kind of eval for the tasks you care about.
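One way to run that kind of eval on your own tasks is a paired comparison: give both models the identical task list, grade each answer pass/fail, and test only the tasks where they disagree. The Python sketch below uses an exact sign test (an exact McNemar test) on those disagreements. That choice of test is an assumption, not necessarily what the linked papers do, and the pass/fail lists are placeholders for whatever grading you already have (unit tests, manual review, etc.).

```python
from math import comb


def exact_sign_test(a_only: int, b_only: int) -> float:
    """Two-sided exact sign test on discordant pairs (equivalent to exact McNemar)."""
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagreed, so there is no evidence either way
    k = min(a_only, b_only)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


def compare(results_a: list[bool], results_b: list[bool]) -> tuple[int, int, float]:
    """Compare per-task pass/fail grades for two models run on the same tasks."""
    if len(results_a) != len(results_b):
        raise ValueError("both models must be graded on the same task list")
    a_only = sum(a and not b for a, b in zip(results_a, results_b))  # only A passed
    b_only = sum(b and not a for a, b in zip(results_a, results_b))  # only B passed
    return a_only, b_only, exact_sign_test(a_only, b_only)


# Placeholder grades: True means the answer passed your check for that task.
toy_a = [True, True, True]
toy_b = [False, True, False]
print(compare(toy_a, toy_b))  # prints (2, 0, 0.5)
```

In the 3-task toy run, A wins every disagreement and still gets p = 0.5, which is the statistical version of the point above: a few tries can't separate the two models.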

bix6•3h ago
Have you implemented any of this in practice? E.g., are you benchmarking models?
PaulHoule•1h ago
I've done some for classification, ranking, and other sorts of non-generative tasks.
freedomben•5h ago
This is a hard problem for me as well. Right now I've just been using the best model available (like Opus, GPT 5.5, or Gemini Pro), but it's not ideal. My problem is that any time I step down, the results are subtly worse, and sometimes I don't notice immediately, depending on what I'm doing.

As for Opus vs. GPT 5.5 etc., I generally decide with:

1. Code? -> Opus

2. Docs? -> GPT

3. Real-time or recent information needed? -> Gemini
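The rule of thumb above is essentially an ordered routing table. Here is a minimal Python sketch, assuming tasks are tagged by hand and that the model names are just labels, not API model IDs. The fallback to Opus is a hypothetical default based on "just using the best model available".

```python
# Ordered rules: the first tag that matches wins, mirroring the numbered list.
RULES: list[tuple[str, str]] = [
    ("code", "Opus"),
    ("docs", "GPT 5.5"),
    ("recent_info", "Gemini"),
]


def pick_model(tags: set[str], default: str = "Opus") -> str:
    """Return the model label for the first rule whose tag is present."""
    for tag, model in RULES:
        if tag in tags:
            return model
    return default  # hypothetical fallback when no rule applies


print(pick_model({"docs", "recent_info"}))  # prints GPT 5.5
```

Keeping the rules in an ordered list makes the tie-breaking explicit: a task tagged both "docs" and "recent_info" goes to whichever rule comes first.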

It's far from perfect, though. Would love to hear others' thoughts.

bix6•3h ago
Opus eats tokens so fast that I try to minimize it, but compared to Sonnet I definitely see fewer issues in my larger projects. Sonnet has gone off the rails a few times.
shouvik12•4h ago
For short, stateless stuff (definitions, formatting, quick lookups) I have never noticed a meaningful difference between models. But for anything that requires reasoning across a lot of prior context, it's usually Claude Sonnet or Opus. Though it feels like the vibes will soon take me to Codex.
noashavit•2h ago
Gemini for recent search and Google Workspace automation

Perplexity for deep research

Claude Opus for coding, Sonnet for writing

Gemma4 for local AI overviews and analysis

Qwen Coder for local prototyping