Ask HN: How do you find SOTA LLMs for a task?

1•throwaw12•7h ago

There are thousands of models available on Hugging Face at the moment. But whenever I need a model for a specific task, I struggle to find the SOTA one. Can you recommend a way to find it?

I am not an ML practitioner; I just need models for my work. For coding, for example, I know we can use Claude/Gemini models, but sometimes I want to compare them to SOTA open-source models. Something better comes out every week, and reading month-old articles or finding an LLM leaderboard for a specific task is sometimes difficult. I think some kind of model picker already exists, but I don't know where.

Comments

Oras•7h ago

I usually go to the OpenRouter usage rankings to see that by category: https://openrouter.ai/rankings

Scroll down to categories, and select from the dropdown at the top right of the chart.

throwaw12•6h ago

That's a nice addition to my tool set :) thanks!

But it seems to mostly reflect proprietary models (because they are easier to serve).

incomingpain•7h ago

For open source, you're not going to find good usage stats online. OpenHands + Devstral doesn't touch the internet, so it won't show up in many stats.

You can look at benchmarks.

https://livebench.ai/#/?Agentic+Coding=a

Keep scrolling until you see something your size. DeepSeek R1 is nice, but 600B isn't running on my hardware. You'll also notice they aren't covering everything; the list is dominated by the SaaS options.

https://huggingface.co/models

This is sorted by trending by default. That tends to show interest, but not necessarily which model is best.
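
If you'd rather not click through the site, the same listing can probably be pulled from the Hub's public REST endpoint. This is only a rough sketch: it assumes the `filter`, `sort`, `direction` and `limit` query parameters of `https://huggingface.co/api/models` behave as I remember, and `build_models_url` / `fetch_model_ids` are made-up helper names, not part of any library. Sorting by downloads is still just another measure of interest, not of quality.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Hub endpoint for listing models (query parameters below are assumptions).
HUB_MODELS_API = "https://huggingface.co/api/models"


def build_models_url(task_tag="text-generation", sort="downloads", limit=10):
    """Build a listing URL for one task tag, sorted in descending order."""
    query = {
        "filter": task_tag,  # tag to filter on, e.g. a pipeline/task tag
        "sort": sort,        # field to sort by
        "direction": -1,     # -1 means descending
        "limit": limit,      # number of entries to return
    }
    return f"{HUB_MODELS_API}?{urlencode(query)}"


def fetch_model_ids(url):
    """Fetch the listing and return the model ids (needs network access)."""
    with urlopen(url, timeout=30) as resp:
        return [entry["id"] for entry in json.load(resp)]


print(build_models_url())
```

Calling `fetch_model_ids(build_models_url())` should then give the ten most-downloaded text-generation models, which you can cross-check against a benchmark before trying any of them.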

throwaw12•6h ago

> Deepseek R1 is nice, but 600B isnt running on my hardware.

Yeah, this is my concern as well. Usually the top SOTA general-purpose models are good at many tasks, but I can't quickly test them locally on my machine. Especially when I see claims that a 32B model outperforms proprietary models in benchmarks, I really want to test it myself on my tasks, but after some time these models drop out of the news/trends and become difficult to find.
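
For a rough sense of "will it fit on my machine", back-of-the-envelope arithmetic on the weights is usually enough: parameter count times bits per parameter. A minimal sketch, assuming the weights dominate memory use (it ignores the KV cache, activations and runtime overhead, so real usage is higher); `weight_memory_gb` is a made-up helper name, and the bit widths are just common quantization levels.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Sizes taken from the thread: a 32B model and a ~600B model.
for size_b in (32, 600):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(size_b, bits)
        print(f"{size_b}B params at {bits}-bit: ~{gb:.0f} GB for weights")
```

By this estimate a 32B model at 4-bit needs about 16 GB for weights, while a 600B model needs about 300 GB even at 4-bit, which is why the big ones are hard to try locally.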
