Bullshit benchmark for LLMs

https://twitter.com/petergostev/status/2026396163637731794

1•gpvos•1h ago

Comments

noemit•1h ago

The underlying data looks sparse. If there are only a few questions per "category" of bullshit, the results can easily be gamed to favor one model over another.
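
To put a rough number on the small-sample worry, here is a minimal sketch, not anything taken from the benchmark itself. It computes a 95% Wilson score interval for a per-category pass rate. The counts (4 of 5 and 160 of 200) are made up for illustration, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: the same 80% pass rate measured on 5 vs 200 questions.
small = wilson_interval(4, 5)
large = wilson_interval(160, 200)

for label, (lo, hi) in (("n=5", small), ("n=200", large)):
    print(f"{label}: [{lo:.2f}, {hi:.2f}] width={hi - lo:.2f}")
```

With five questions the interval covers more than half of the 0-to-1 range, so swapping or rewording a single question could plausibly reorder models within a category. With a couple hundred questions the same pass rate is pinned down much more tightly.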