Show HN: Improving search ranking with chess Elo scores

https://www.zeroentropy.dev/blog/improving-rag-with-elo-scores

89•ghita_•4h ago

Hello HN,

I'm Ghita, co-founder of ZeroEntropy (YC W25). We build high-accuracy search infrastructure for RAG and AI Agents.

We just released two new state-of-the-art rerankers, zerank-1 and zerank-1-small. One of them is fully open-source under Apache 2.0.

We trained these models using a novel Elo-score-inspired pipeline, which we describe in detail in the attached blog post. In a nutshell, here is an outline of the training steps:

* Collect soft preferences between pairs of documents using an ensemble of LLMs.
* Fit an Elo-style rating system (Bradley-Terry) to turn pairwise comparisons into absolute per-document scores.
* Normalize relevance scores across queries using a bias correction step, modeled using cross-query comparisons and solved with MLE.
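To make the Bradley-Terry step concrete, here is a minimal sketch of fitting per-document scores from soft pairwise preferences. It is not ZeroEntropy's actual code: the function name `fit_bradley_terry`, the `(i, j, s)` input format, and the use of the classic minorization-maximization update are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def fit_bradley_terry(n_items, comparisons, iters=200, eps=1e-9):
    """Fit Bradley-Terry scores from soft pairwise preferences.

    comparisons: list of (i, j, s), where s in [0, 1] is the soft
    preference that document i is more relevant than document j.
    Returns log-strengths centred at zero (higher = more relevant).
    """
    # Soft "win" totals; eps keeps every strength strictly positive.
    wins = [eps] * n_items
    # Number of comparisons per ordered pair (stored symmetrically).
    counts = defaultdict(float)
    for i, j, s in comparisons:
        wins[i] += s
        wins[j] += 1.0 - s
        counts[(i, j)] += 1.0
        counts[(j, i)] += 1.0

    strength = [1.0] * n_items
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = [0.0] * n_items
        for (i, j), n in counts.items():
            denom[i] += n / (strength[i] + strength[j])
        new = [
            wins[i] / denom[i] if denom[i] > 0 else strength[i]
            for i in range(n_items)
        ]
        # Strengths are only defined up to scale, so renormalize.
        total = sum(new)
        strength = [x * n_items / total for x in new]

    logs = [math.log(x) for x in strength]
    mean = sum(logs) / n_items
    return [x - mean for x in logs]


# Toy data: three documents for one query, with made-up preferences.
toy = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.95)]
scores = fit_bradley_terry(3, toy)
```

Under this model the probability that document i beats document j is a logistic function of the score difference, which is the same form the Elo system uses. The scores are only comparable within one query here; the cross-query bias correction described in the last step is not shown.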

You can try the models either through our API (https://docs.zeroentropy.dev/models), or via HuggingFace (https://huggingface.co/zeroentropy/zerank-1-small).

We would love this community's feedback on the models, and the training approach. A full technical report is also going to be released soon.

Thank you!

Comments

sippeangelo•3h ago

Really cool stuff! Just want to let you know you forgot to link to the evals at the end.

ghita_•2h ago

oh wow, thanks for flagging, just fixed!

esafak•2h ago

I would have titled it "Improving ranking..."

I like that it works with `sentence_transformers`

ghita_•2h ago

yes we found it hard to find a good title for this, thanks for the feedback

dang•1h ago

We could change the title to "Improving search ranking with chess Elo scores". Anybody object?

Edit: ok, done. Submitted title was "Show HN: Improving RAG with chess Elo scores".

ashwindharne•2h ago

Cool stuff! We use a similar process internally to rerank and filter our cold outbound lists. We just use an off-the-shelf model as the judge, give it custom criteria, and let it run for a set number of iterations. It's helped narrow down wide searches to the maximally relevant set of people (a few thousand medium-bad matches down to a few hundred good matches).

It's not cheap and it's not fast, but it definitely works pretty well!
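A loop like this could be sketched roughly as follows. The comment does not say which update rule is used, so this assumes the standard Elo update; `elo_rerank` and `judge` are hypothetical names, and the toy judge stands in for a real LLM call.

```python
import random


def elo_rerank(items, judge, rounds=2000, k=16.0, seed=0):
    """Rank items by repeatedly asking a pairwise judge.

    judge(a, b) returns 1.0 if a is preferred, 0.0 if b is
    preferred, and 0.5 for a tie. Items must be hashable.
    """
    rng = random.Random(seed)
    rating = {item: 1000.0 for item in items}
    for _ in range(rounds):
        a, b = rng.sample(items, 2)
        # Expected score of a under the Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400.0))
        outcome_a = judge(a, b)
        # Zero-sum update: what a gains, b loses.
        delta = k * (outcome_a - expected_a)
        rating[a] += delta
        rating[b] -= delta
    return sorted(items, key=rating.get, reverse=True)


# Toy judge (an assumption): prefers the larger number.
def toy_judge(a, b):
    return 1.0 if a > b else 0.0


ranked = elo_rerank([3, 1, 4, 15, 9, 2, 6], toy_judge)
```

Each round costs one judge call, which is where the expense comes from when the judge is an LLM. Keeping only the top of `ranked` gives the filtered set.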

jayunit•48m ago

Very interesting! What are some examples of criteria that you can evaluate pairwise, but couldn't score individually?

bravura•17m ago

Pairwise rank constraints involve fewer assumptions than per-item scoring about the underlying nature of the data, so they are more robust.

yalok•2h ago

What’s the expected additional latency due to running this re-ranker?

ghita_•2h ago

It actually runs pretty fast, our benchmarks show ~149ms for 12665 bytes. It's faster than many other models

esafak•1h ago

I would prominently display your benchmarks (against your competitors, of course). That's your selling point, right?

ghita_•1h ago

Yes! We did this here: https://www.zeroentropy.dev/blog/announcing-zeroentropys-fir... We wanted to share the approach with the community in this post. It does do better than competitors though!

seanhunter•2h ago

Fun fact about ELO. It's natural to think that it is some kind of initialism, but in fact ELO doesn't stand for anything. It's the name of the guy who invented the system. https://en.wikipedia.org/wiki/Arpad_Elo

So don't say it "E.L.O." (unless you're talking about the band, I guess), say "ee-low"

ghita_•2h ago

oh interesting, had no idea, thanks for sharing

amelius•1h ago

What was his ELO rating?

homarp•1h ago

https://chess.stackexchange.com/questions/35420/what-was-arp...

2065

esafak•27m ago

It should be Elo rating! https://en.wikipedia.org/wiki/Elo_rating_system

rahulnair23•1h ago

Interesting work.

For a slightly different take using a similar intuition, see our paper [at ACL 2024](https://arxiv.org/abs/2402.14860) on ranking LLMs which may be of interest.

Our HuggingFace space has some examples: https://huggingface.co/spaces/ibm/llm-rank-themselves

ghita_•1h ago

thank you, will check out the paper, the hf space is very cool!

mkaszkowiak•1h ago

Happy to see competition in rerankers! Good luck with your product.

My questions: what languages do your models currently support? Did you perform multilingual benchmarks? Couldn't find an answer on the website

ghita_•1h ago

Thanks! We trained on most European languages (English, French, Spanish, Russian...), Arabic, and Chinese, so it does well on those! We haven't tested much on other languages, but we're happy to do so if there is a use case.
