- propose skills we should measure in language models—live examples include difficult math, memory-manipulation resistance, code generation, and empathy under bad news
- write evals (graded prompts) for those skills; a rough sketch of one follows this list
- forecast which models will score highest once GPT-5 is released
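For concreteness, here is a minimal sketch of what one of these graded prompts could look like. The field names (`skill`, `prompt`, `rubric`), the 0–10 scale, and the grading helper are illustrative assumptions, not the tool's actual schema.

```python
# Hypothetical sketch of an eval as a graded prompt. Field names and the
# 0-10 scale are illustrative assumptions, not the tool's real schema.
from dataclasses import dataclass

@dataclass
class Eval:
    skill: str   # the proposed skill being measured
    prompt: str  # what gets sent to the model
    rubric: str  # instructions a human (or judge) grades against

def grade(ev: Eval, model_response: str, score: int) -> dict:
    """Record one human grade (0-10) for one model response."""
    assert 0 <= score <= 10
    return {
        "skill": ev.skill,
        "prompt": ev.prompt,
        "response": model_response,
        "score": score,
    }

example = Eval(
    skill="empathy under bad news",
    prompt="A user says their startup just lost its biggest customer. Reply.",
    rubric="10 = acknowledges the loss and offers concrete next steps; "
           "0 = ignores the emotional content entirely.",
)
```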
Why this exists
Benchmarks leak into training data quickly; scores become unreliable, yet labs keep declaring progress. The prediction tool aims to keep the target moving by letting the crowd define both the tasks and the scoring. Accepted evals remain private until GPT-5 ships; afterward every prompt, response, and grade is published, with a checksum pinned on-chain for provenance.
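As a sketch of that provenance step, one way to produce the checksum is to hash a canonical JSON serialization of the published records with SHA-256 and pin the digest. The record structure below is an assumption about how prompts, responses, and grades might be bundled, and the on-chain write itself is omitted.

```python
# Sketch of the provenance step: compute a checksum over the published
# artifacts so the digest can be pinned on-chain. The record structure is
# an illustrative assumption; the on-chain transaction is out of scope.
import hashlib
import json

def bundle_checksum(records: list[dict]) -> str:
    """SHA-256 over a canonical JSON serialization of all published records."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

records = [
    {"prompt": "example prompt", "response": "example response", "grade": 7},
    {"prompt": "another prompt", "response": "another response", "grade": 3},
]
print(bundle_checksum(records))  # this digest is what would get pinned on-chain
```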
A deeper bet
Simon Willison’s “draw a pelican on a bike” is such a great example of a custom prompt that surfaced insights no benchmark caught. Andrej Karpathy has noted that random crowds often miss the better answer. Some people have sharper intuition about model behavior than others; if we can identify that signal and fold it into the tooling, the crowd can stay ahead of the models instead of trailing them.
Implementation
Built almost entirely with Claude Code (subagents enabled). We started out working with a designer via the Figma MCP, but by the end it was faster for him to learn Claude Code himself and prompt design iterations straight into code. Rough edges remain, but the loop works end-to-end.
I’d value feedback—especially new skill ideas or failure modes we’ve missed.