
Ask HN: What benchmarks are you using to judge AI models?

4•cowpig•9mo ago
There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally. What benchmarks have you found to be especially indicative of real-world performance?

I use:

* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding:

https://aider.chat/docs/leaderboards/

* I generally assume OpenRouter usage to be an indicator of a model's popularity, and by proxy, utility:

https://openrouter.ai/rankings

* LLM-Stats has a lot of charts of benchmarks that I look at:

https://llm-stats.com/
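
A rough sketch of how signals like the three above could be folded into a shortlist of models to test by hand: average each model's rank across the sources and keep the best few. The model names, the source labels (`coding`, `usage`, `general`), and every rank value here are hypothetical placeholders, not data from any of the linked leaderboards, and the plain mean is just one possible weighting.

```python
from statistics import mean

# Hypothetical placeholder data: names and ranks are invented for illustration
# and are NOT taken from Aider, OpenRouter, or LLM-Stats. Rank 1 is best.
ranks = {
    "model-a": {"coding": 1, "usage": 2, "general": 2},
    "model-b": {"coding": 2, "usage": 1, "general": 3},
    "model-c": {"coding": 3, "usage": 3, "general": 1},
    "model-d": {"coding": 4, "usage": 4, "general": 4},
}


def shortlist(ranks, top_n=2):
    """Return the top_n model names ordered by mean rank (lower is better)."""
    avg = {name: mean(per_source.values()) for name, per_source in ranks.items()}
    # Break ties on the name so the ordering is deterministic.
    return sorted(avg, key=lambda name: (avg[name], name))[:top_n]


print(shortlist(ranks))
```

Averaging ranks rather than raw scores sidesteps the fact that the sources measure different things on different scales, at the cost of ignoring how large the gaps between models are.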

Comments

paulcole•9mo ago
> There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally

Just pick one and use it. The ones you’ve heard of (if you are not obsessively refreshing AI model rankings pages) are basically the same.

I’m sure I’ll get a ton of pushback that the one somebody loves is obviously so much better than the other one, but whatever.

Just give me OpenAI’s most popular model, their fastest model, and their newest model. I’ll pick among those 3 based on what I’m prioritizing in the moment (speed, depth, everyday use).

kadushka•9mo ago
For me it's the opposite - we don't get enough models to test. In the last 6 months, we got Claude 3.7, OpenAI o1, Grok 3, Gemini 2.5 Pro, and OpenAI o3. That's it - 5 frontier models. It's not that hard to test each of them manually, which I did for many hours and on many different tasks. o3 (after moving on from o1) and Gemini 2.5 Pro are the ones I'm using the most.

I couldn't care less about benchmarks - I know what these models are capable of from personal experience.