Bullshit benchmark for LLMs
https://twitter.com/petergostev/status/2026396163637731794
1 • gpvos • 1h ago
Comments
noemit • 1h ago
The underlying data looks scarce. If there are only a few questions per "category" of bullshit, they can easily be gamed to favor one model over another.