I built Search Bench as a small experiment to compare search engines without showing which engine produced which results. It was inspired by the idea behind the LLM Arena, but applied to search.
How it works:
1. You enter a query.
2. You see two result sets side-by-side (search engine names hidden).
3. You pick which is better, or mark them as similar.
Methodology:
- Each vote is a pairwise comparison; a tie counts as half a win for each engine.
- Ratings come from a Bradley–Terry model: ability scores are updated iteratively and normalized so their geometric mean is 1 (a sketch of the update follows this list).
- Final scores are mapped to an Elo-like scale, 1500 + 400 * log10(ability), so they read like Elo ratings but are derived from the Bradley–Terry fit.
- Pair selection is adaptive: it prioritizes under-sampled search engines and close matchups via an uncertainty × closeness weighting (also sketched below).
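For the curious, here's roughly what the rating step could look like: the standard minorization-maximization update for Bradley–Terry abilities, with ties counted as half-wins, a geometric-mean normalization, and the log scaling from above. This is a simplified sketch; the function names, epsilon smoothing, and convergence check are illustrative, not the actual Search Bench code.

```python
import math
from collections import defaultdict


def bradley_terry(comparisons, iters=200, tol=1e-8, eps=1e-6):
    """Fit Bradley-Terry ability scores from pairwise votes.

    `comparisons` is a list of (engine_a, engine_b, outcome) tuples,
    where outcome is 1.0 if a won, 0.0 if b won, and 0.5 for a tie
    (so a tie counts as half a win for each side).
    """
    if not comparisons:
        return {}
    engines = sorted({e for a, b, _ in comparisons for e in (a, b)})
    ability = {e: 1.0 for e in engines}

    # Fractional wins per engine and comparison counts per unordered pair.
    wins = defaultdict(float)
    games = defaultdict(float)
    for a, b, outcome in comparisons:
        wins[a] += outcome
        wins[b] += 1.0 - outcome
        games[frozenset((a, b))] += 1.0

    for _ in range(iters):
        new = {}
        for i in engines:
            # Minorization-maximization update for Bradley-Terry abilities.
            denom = sum(
                games[frozenset((i, j))] / (ability[i] + ability[j])
                for j in engines
                if j != i and games[frozenset((i, j))] > 0
            )
            # eps keeps an engine with zero recorded wins from collapsing to 0.
            new[i] = (wins[i] + eps) / denom if denom > 0 else ability[i]

        # Normalize so the geometric mean of abilities is 1.
        gm = math.exp(sum(math.log(p) for p in new.values()) / len(new))
        new = {e: p / gm for e, p in new.items()}

        converged = max(abs(new[e] - ability[e]) for e in engines) < tol
        ability = new
        if converged:
            break

    return ability


def display_score(ability):
    """Map a Bradley-Terry ability onto the Elo-like display scale."""
    return 1500 + 400 * math.log10(ability)
```

The geometric-mean normalization is what makes the display scale behave like Elo: abilities average to 1 in log space, so the average displayed score sits at 1500 while the pairwise win probabilities still come from the Bradley–Terry fit.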
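And a similarly hedged sketch of the adaptive pair selection. The specific terms here (inverse square root of the pair's comparison count for uncertainty, and distance of the predicted win probability from 0.5 for closeness) are stand-ins to illustrate the uncertainty × closeness idea, not the exact weights Search Bench uses.

```python
import math
import random


def pick_pair(engines, ability, pair_counts, rng=random):
    """Pick the next matchup, weighting toward under-sampled engines
    (uncertainty) and evenly matched pairs (closeness)."""
    pairs, weights = [], []
    for i, a in enumerate(engines):
        for b in engines[i + 1:]:
            n = pair_counts.get(frozenset((a, b)), 0)
            # Fewer past comparisons for this pair -> higher uncertainty weight.
            uncertainty = 1.0 / math.sqrt(n + 1)
            # Predicted win probability near 0.5 -> higher closeness weight.
            p = ability[a] / (ability[a] + ability[b])
            closeness = 1.0 - 2.0 * abs(p - 0.5)
            pairs.append((a, b))
            weights.append(uncertainty * closeness)
    # Sample one pair in proportion to its combined weight.
    return rng.choices(pairs, weights=weights, k=1)[0]
```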
This definitely isn't an objective ranking: queries and voters are self-selected, results vary by context, and what counts as “better” depends on the person. Right now, the dataset is small (≈200 comparisons, mostly from me), so I'm especially interested in seeing:
- Whether results change with more independent voters.
- Whether there's a real quality signal at scale, or if most differences disappear once brand bias is removed.
If you have a minute, comparing a few queries yourself would be very helpful! I'd also appreciate critique, especially around statistical validity, bias sources, aggregation methods, or ways this could be gamed or misinterpreted.