One thing that stood out was the trade-off between accuracy and inference throughput, especially with low-precision numeric formats like NVFP4 vs BF16.
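For context, here's a toy sketch of where that accuracy gap comes from: FP4 (E2M1) can only represent 8 magnitudes per sign, while BF16 keeps a full 8-bit exponent and 7-bit mantissa. This is just an illustration, not how real kernels work; actual NVFP4 also applies a per-block scale factor, which I've omitted, and real BF16 conversion rounds rather than truncates.

```python
import struct

# E2M1 (FP4) representable magnitudes -- the grid values get rounded onto.
# (NVFP4 adds a per-block FP8 scale factor; omitted here for simplicity.)
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def to_fp4(x: float) -> float:
    """Round-to-nearest onto the signed E2M1 grid."""
    sign = -1.0 if x < 0 else 1.0
    return sign * min(FP4_GRID, key=lambda g: abs(abs(x) - g))

def to_bf16(x: float) -> float:
    """Keep the top 16 bits of a float32 (simplified: truncation, not rounding)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# A few sample weights: BF16 stays close, FP4 snaps to a coarse grid.
for w in [0.07, -1.23, 2.9, 5.1]:
    print(f"{w:>6}: bf16={to_bf16(w):.4f}  fp4={to_fp4(w)}")
```

The coarse grid is why per-block scaling matters so much in practice, and why the accuracy hit shows up more on some benchmarks than others.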
I'm really interested to know which benchmarks folks here actually rely on when they're checking out models for real-life tasks. What seems to work best for you?
Do you rely more on reasoning benchmarks, coding benchmarks, or long-context tests?