Ask HN: How do you choose a model for a task?

6•bix6•1h ago
How do you decide a model is good enough for a given task? Right now I use Opus for planning and harder tasks and switch to Sonnet for more well-defined tasks. But I feel like Sonnet is kind of stupid and introduces issues because it can’t grasp the larger context. Is there some definitive way to say a model is good enough for a task? Or is it all vibes?

Comments

PaulHoule•1h ago
Evaluation is harder than you think because of statistics.

Like if you want to accurately know if one model is better than another you have to test it on hundreds if not thousands of examples which are carefully graded in difficulty, not in the training sets, etc.

Practically you might try model A and model B, use each one 2-3 times on different tasks, and walk out with the impression that A is really good and B sux. But it could be that model A just got lucky: either you happened to ask it for things it is good at, or it stumbled onto the right answer anyway.

See https://arxiv.org/html/2410.12972v1 and https://arxiv.org/pdf/2505.14810 -- those papers are considering a general space of tasks but you could totally do the same kind of eval for the tasks you care about.
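
To make the sample-size point concrete, here is a minimal sketch (not taken from either paper) that puts an approximate 95% Wilson score interval around a pass rate. The function names and the pass counts are made up for illustration, and "intervals don't overlap" is only a rough, conservative check, not a proper significance test.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Hypothetical counts: (passes, attempts) for model A and model B.
    for label, a, b in [
        ("3 tries each", (3, 3), (1, 3)),
        ("300 tries each", (240, 300), (180, 300)),
    ]:
        ia, ib = wilson_interval(*a), wilson_interval(*b)
        print(label, "A:", ia, "B:", ib, "overlap:", intervals_overlap(ia, ib))
```

With these made-up counts, 3 out of 3 versus 1 out of 3 gives intervals that overlap heavily, so the "A is great, B sux" impression is not supported yet. At 300 attempts each, 80% versus 60% separates cleanly.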

freedomben•59m ago
This is a hard problem for me as well. Right now I've just been using the best model available (like Opus, or GPT 5.5, or Gemini Pro) but it's not ideal. My problem is that anytime I step down, the results are subtly worse, and sometimes I don't notice immediately depending on what I'm doing.

As far as Opus vs. GPT 5.5 etc, I generally decide with:

1. Code? -> Opus

2. Docs? -> GPT

3. Real-time or recent information needed? -> Gemini

It's far from perfect though. Would love to hear others' thoughts.
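
The three rules above could be written down as a small lookup table, roughly like this. It is only a sketch: the task labels, the `pick_model` name, and the fallback default are assumptions made for illustration, not something from the comment.

```python
# Hypothetical task labels mapped to the model families named in the list above.
ROUTES = {
    "code": "Opus",
    "docs": "GPT",
    "recent_info": "Gemini",
}


def pick_model(task_kind: str, default: str = "Opus") -> str:
    """Return the model for a task label; the fallback default is an assumption."""
    return ROUTES.get(task_kind, default)


if __name__ == "__main__":
    for kind in ("code", "docs", "recent_info", "something_else"):
        print(kind, "->", pick_model(kind))
```

Keeping the rules in one table makes it easy to change a route later, once you have per-task results to back the change.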

shouvik12•22m ago
For short, stateless stuff (definitions, formatting, quick lookups) I have never noticed a meaningful difference between models. But for anything that requires reasoning across a lot of prior context, it's usually Claude Sonnet or Opus. It feels like the vibe will soon take me to Codex, though.

Show HN: AriaType v0.5 – Context-aware voice input for writing

https://github.com/joe223/AriaType
1•Joe_Harris•23s ago•0 comments

Despite rise of AI is there still hope for EE's translators?

https://www.theguardian.com/technology/2026/may/08/being-human-helps-despite-rise-of-ai-is-there-...
1•speckx•43s ago•0 comments

Freeman Dyson's Model of a Cell

https://chillphysicsenjoyer.substack.com/p/freeman-dysons-model-of-a-cell
1•crescit_eundo•52s ago•0 comments

The Highest-Flying Repo Men Are Collecting Spirit Airlines' Jets

https://www.wsj.com/business/airlines/spirit-airlines-jets-liquidation-repo-men-5c44a46f
1•jbredeche•1m ago•0 comments

Why MistralAI Grows Faster Than OpenAI/Anthropic

https://productify.substack.com/p/why-mistralai-grows-faster-than-openaianthropic
1•gmays•2m ago•0 comments

Contra Kotler and Friston on 'the body keeps the score'

https://essays.debugyourpain.com/p/contra-kotler-friston-et-al-on-the
1•yichab0d•2m ago•0 comments

The Tool Tax: Why MCP Servers Are Devouring Your AI Budget

https://jakubpradzynski.substack.com/p/podatek-od-narzedzi-dlaczego-serwery
1•jakubpradzynski•3m ago•0 comments

Rassvet, Russia's Answer to Starlink

https://www.wired.com/story/meet-rassvet-russias-answer-to-starlink/
1•smurda•3m ago•0 comments

Are old Java Developer Journals or Dr. Dobbs mags worth anything?

https://old.reddit.com/r/java/comments/1t3uqc7/are_old_java_developer_journals_or_dr_dobbs_mags/
1•theanonymousone•4m ago•0 comments

Wordle to Become Prime-Time TV Show, with Savannah Guthrie as Host

https://www.nytimes.com/2026/05/11/business/media/wordle-nbc-savannah-guthrie.html
1•apparent•4m ago•1 comments

The left-wing case for AI

https://www.seangoedecke.com/the-left-wing-case-for-ai/
1•rhazn•5m ago•0 comments

State of the Map 2027 – Call for Venues

https://blog.openstreetmap.org/2026/05/11/state-of-the-map-2027-call-for-venues-is-now-open/
1•rhazn•5m ago•0 comments

Build a Basic AI Agent from Scratch

https://www.ruxu.dev/articles/ai/build-a-basic-ai-agent/
1•ruxudev•7m ago•0 comments

Microsoft Israel chief leaves amid ethical controversy

https://en.globes.co.il/en/article-microsoft-israel-chief-leaves-amid-ethical-controversy-1001542602
1•bhouston•8m ago•0 comments

Porting 3D Movie Maker to Linux

https://benstoneonline.com/posts/porting-3d-movie-maker-to-linux/
1•speckx•9m ago•0 comments

AIQ Rank – a score for how AI-native your workflow is

https://www.aiqrank.com
2•grahac•9m ago•0 comments

Pull Request: Java language implementation of value classes and objects

https://github.com/openjdk/jdk/pull/31120
2•munksbeer•11m ago•0 comments

Ask HN: Agents for coding feels like a L2 automation system. Yay/Nay?

1•ak681443•11m ago•0 comments

PaidSync, Run Google/Meta/LinkedIn/TikTok Ads from Claude or ChatGPT

https://paidsync.ai/
1•ahmedashrav•14m ago•0 comments

Mythos 'Discovered' a CVE in Its Training Data – That's Still Worrying

https://rival.security/posts/mythos-discovered-a-cve-already-in-its-training-data---and-thats-sti...
1•theanonymousone•14m ago•0 comments

Show HN: I built a mobile website builder on the Claude Agent SDK

https://sitespin.app/
1•adamhsn•16m ago•0 comments

Redistricting, Democrats are playing as the away team

https://www.natesilver.net/p/on-redistricting-democrats-are-playing
1•7777777phil•17m ago•0 comments

Show HN: Design.md and Skill.md Generator Chrome Extension

https://chromewebstore.google.com/detail/ai-design-taste-designmd/peclkdlolmcclhhgpoehpikgknbmkknc
4•novateg•19m ago•0 comments

How to Leave Instagram

https://www.a-side.social/blog/how-to-leave-instagram/
11•proletarian•20m ago•5 comments

AI's Next Phase Plays into TSMC's Hands

https://www.wsj.com/tech/ais-next-phase-plays-into-tsmcs-hands-3d1f2b60
1•Brajeshwar•20m ago•0 comments

Unauthorized Anthropic stock sales and investment scams

https://support.claude.com/en/articles/13704655-unauthorized-anthropic-stock-sales-and-investment...
4•dylanpyle•22m ago•1 comments

Want a better DDD domain model in 5 minutes?

https://www.esdm.io/getting-started/your-first-model-with-ai/
1•goloroden•22m ago•0 comments

Estimating Levenshtein Distance for Large Documents Using Compact Signatures

https://zenodo.org/records/20125438
2•coatespt•23m ago•0 comments

Excavate - Find the "hidden landmines" in your codebase

https://www.npmjs.com/package/excavate
1•Heysonics•24m ago•0 comments

Fostering breakthrough AI innovation through customer-back engineering

https://www.technologyreview.com/2026/05/11/1136967/fostering-breakthrough-ai-innovation-through-...
1•joozio•25m ago•0 comments