Bullshit benchmark for LLMs

https://twitter.com/petergostev/status/2026396163637731794
1•gpvos•1h ago

Comments

noemit•1h ago
The underlying data looks scarce. If there are only a few questions per "category" of bullshit, the results can easily be gamed to favor one model over another.
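To put rough numbers on the small-sample concern, here is a minimal sketch. It assumes each question is an independent pass/fail trial; the question counts (5, 20, 100) and the 60% pass rate are made up for illustration and are not taken from the benchmark, and the function names are hypothetical.

```python
import math


def standard_error(pass_rate: float, n_questions: int) -> float:
    """Binomial standard error of a pass rate observed over n independent questions."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_questions)


def swing_per_question(n_questions: int) -> float:
    """How far a category score moves when a single question flips."""
    return 1.0 / n_questions


# Illustrative values only: a hypothetical 60% pass rate at three category sizes.
for n in (5, 20, 100):
    se = standard_error(0.6, n)
    swing = swing_per_question(n)
    print(f"n={n:>3}  std. error={se:.3f}  one-question swing={swing:.2f}")
```

With only a handful of questions in a category, adding or dropping a single question moves the score by a large fraction, which is the sense in which a small category could be tuned toward one model.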