Data Viz: Mapping Model Performance on Reasoning vs. Honesty Benchmarks

https://claude.ai/public/artifacts/068899b8-19fc-4927-8561-736a075c5018
1•lout332•1h ago

Comments

lout332•1h ago
Was curious about how different model families scale, so I plotted their HLE (reasoning) vs. MASK (honesty) scores. Found some interesting patterns, especially with the Claude and Gemini series. Might be relevant for those thinking about model reliability and robustness. Here's the data...