Turns out that at my rating, positions the engine calls "+2" only get won around 58% of the time. That's not much better than a coin flip. Meanwhile the engine is acting like it's already game over.
So I built something that blends Stockfish evaluations with real game data from millions of Lichess games at your specific rating AND time control. The interesting part is how it surfaces "gems" - moves that statistically outperform based on their confidence intervals.
Each move gets a score that accounts for sample size uncertainty. A move with a 60% win rate over 100 games might look good, but its confidence interval is wide enough to overlap with a move showing 54% over 100K games. So the tool ranks moves by the lower end of that interval (the 5th percentile of each move's win-rate distribution) instead of the raw percentage. That way a move only stands out if it still looks good under a pessimistic estimate, regardless of sample size differences. Gems get visually highlighted so you can spot them fast.
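For anyone curious what that lower-bound comparison looks like, here's a rough Python sketch. It is not the tool's actual code, just an illustration under a few assumptions of mine: a flat Beta(1, 1) prior, draws lumped in with losses, and a sampled estimate of the 5th percentile using only the standard library. The function name and parameters are made up for the example.

```python
import random

def lower_bound(wins, games, q=0.05, draws=20000, seed=0):
    """Estimate the q-th quantile of a Beta(wins + 1, non_wins + 1) posterior.

    Assumption: flat Beta(1, 1) prior, and every game is either a win or not.
    The quantile is approximated by sorting random samples from the posterior.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same number
    non_wins = games - wins
    samples = sorted(rng.betavariate(wins + 1, non_wins + 1) for _ in range(draws))
    return samples[int(q * draws)]

# The two moves from the example above.
small_sample = lower_bound(60, 100)          # 60% over 100 games
large_sample = lower_bound(54_000, 100_000)  # 54% over 100K games

print(f"60% over 100 games  -> lower bound {small_sample:.3f}")
print(f"54% over 100K games -> lower bound {large_sample:.3f}")
```

With these numbers the 54% move ends up with the higher lower bound, because 100 games leave a lot of room for the 60% to be luck. A move with few games has to post a much bigger win rate before it outranks a well-sampled one.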
The wildest thing I found: in the Scandinavian (1. e4 d5 2. exd5), the tool flagged 2...Bg4 as a gem at 1600 bullet. That's hanging a bishop for zero compensation (Qxg4 just takes it) - engine says it's losing by a mile. But in fast games, opponents premove Nc3 or d4 over 40% of the time and lose their queen to ...Bxd1. What looks like 99% for White on paper is actually a net win for Black at that level. Never would have found that scanning raw Explorer stats.
Tech: Stockfish 17 in WASM for local analysis, Beta distributions for the confidence math, Firebase for caching (< 50ms responses).
trueelo.app if anyone wants to try it. No signup, totally free. Works for all time controls and rating ranges.