I built this to answer a question for myself: which model should I actually route each type of task to? The harness runs 38 deterministic tests (CSV transforms, letter counting, modular arithmetic, regex extraction, code gen, multi-step instructions), costs $2.29 per full run across all 15 models, and all scoring is programmatic. No LLM judge for primary scores.
The surprising part was the QA process. My initial results showed Haiku beating Sonnet. That turned out to be a json_array scorer bug where max_score was set to expected_row_count instead of len(expected_rows), producing quality scores above 100%. A thin-space Unicode character (U+2009) in Gemini Flash responses broke three regex scorers silently. I ended up running 5 separate QA passes, each using a different model, and each pass found bugs the previous ones missed.
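For anyone curious what the max_score bug looked like, here's a minimal sketch. The names (`score_json_array`, `expected_row_count`) are illustrative, not the actual harness code; the point is that dividing by a count that can disagree with `len(expected_rows)` lets the ratio exceed 1.0:

```python
import json

def score_json_array(response: str, expected_rows: list[dict],
                     expected_row_count: int) -> float:
    """Score a model's JSON-array response against the expected rows."""
    try:
        rows = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    matched = sum(1 for row in rows if row in expected_rows)
    # BUG: max_score taken from a separately tracked count instead of the
    # actual expected data. When the two disagree, scores exceed 100%.
    max_score = expected_row_count
    # Fix: max_score = len(expected_rows)
    return matched / max_score
```

With five expected rows but a stale `expected_row_count` of 4, a fully correct response scores 125%, which is exactly the kind of impossible number that made Haiku "beat" Sonnet.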
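The thin-space failure is easy to reproduce. A sketch of the failure mode and one possible fix (the pattern here is made up for illustration; the real scorers differ):

```python
import re

# A scorer regex that expects an ASCII space after the colon.
pattern = re.compile(r"answer: (\d+)")

response = "answer:\u200942"  # U+2009 THIN SPACE, as emitted by Gemini Flash

# The match fails silently: no exception, just a zero score.
assert pattern.search(response) is None

# One fix: collapse all Unicode whitespace to ASCII spaces before scoring.
# Python's \s matches Unicode whitespace (including U+2009) on str patterns.
normalized = re.sub(r"\s+", " ", response)
match = pattern.search(normalized)
assert match is not None and match.group(1) == "42"
```

"Silently" is the key word: a scorer that returns 0 on a non-match is indistinguishable from a model that got the answer wrong, which is why this survived until a QA pass diffed raw responses against scores.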
Gemini 2.5 Flash scored 97.1% at $0.003/run with a 1.1s median response time. Opus scored 100% at $0.69/run. GPT-oss-20b scored 98.3% for $0. The cost spread across models that all score above 95% is genuinely hard to justify for most tasks.
Scoring code and raw results are in the post. Happy to answer questions about methodology.