One thing that stood out was the trade-off between accuracy and inference throughput, especially with low-precision numeric formats like NVFP4 vs BF16.
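For context, here's a toy sketch of where that accuracy gap comes from: FP4 (E2M1) can only represent 8 magnitudes per sign, while BF16 keeps a full 8-bit exponent and 7-bit mantissa. This is just an illustration, not how real kernels work; actual NVFP4 also applies a per-block scale factor, which I've omitted, and real BF16 conversion rounds rather than truncates.

```python
import struct

# E2M1 (FP4) representable magnitudes -- the grid values get rounded onto.
# (NVFP4 adds a per-block FP8 scale factor; omitted here for simplicity.)
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def to_fp4(x: float) -> float:
    """Round-to-nearest onto the signed E2M1 grid."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(FP4_GRID, key=lambda g: abs(abs(x) - g))

def to_bf16(x: float) -> float:
    """Keep the top 16 bits of a float32 (simplified: truncation, not rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# A few sample weights: BF16 stays close, FP4 snaps to a coarse grid.
for w in [0.07, -1.23, 2.9, 5.1]:
    print(f"{w:>6}: bf16={to_bf16(w):.4f}  fp4={to_fp4(w)}")
```

The coarse grid is why per-block scaling matters so much in practice, and why the accuracy hit shows up more on some benchmarks than others.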
I'm really interested to know which benchmarks folks here actually rely on when they're checking out models for real-life tasks. What seems to work best for you?
Do you rely more on reasoning benchmarks, coding benchmarks, or long-context tests?