Show HN: Improving search ranking with chess Elo scores

https://www.zeroentropy.dev/blog/improving-rag-with-elo-scores
119•ghita_•7h ago
Hello HN,

I'm Ghita, co-founder of ZeroEntropy (YC W25). We build high accuracy search infrastructure for RAG and AI Agents.

We just released two new state-of-the-art rerankers, zerank-1 and zerank-1-small. One of them is fully open-source under Apache 2.0.

We trained those models using a novel Elo score inspired pipeline which we describe in detail in the blog attached. In a nutshell, here is an outline of the training steps:

* Collect soft preferences between pairs of documents using an ensemble of LLMs.
* Fit an ELO-style rating system (Bradley-Terry) to turn pairwise comparisons into absolute per-document scores.
* Normalize relevance scores across queries using a bias correction step, modeled using cross-query comparisons and solved with MLE.
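The second step (turning soft pairwise preferences into per-document scores) can be sketched as below. This is a minimal illustration, not the ZeroEntropy pipeline: the function name `fit_bradley_terry`, the toy preference triples, the learning rate, and the step count are all assumptions made for the example. It fits Bradley-Terry scores by plain gradient ascent on the log-likelihood.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_bradley_terry(n_docs, prefs, steps=2000, lr=0.1):
    """Fit one score per document from soft pairwise preferences.

    prefs is a list of (i, j, p): p in [0, 1] is the soft preference
    that document i is more relevant than document j (hypothetical data).
    Returns zero-mean scores in log-odds units.
    """
    scores = [0.0] * n_docs
    for _ in range(steps):
        grad = [0.0] * n_docs
        for i, j, p in prefs:
            # d/ds_i of p*log(q) + (1-p)*log(1-q), with q = sigmoid(s_i - s_j)
            err = p - sigmoid(scores[i] - scores[j])
            grad[i] += err
            grad[j] -= err
        scores = [s + lr * g for s, g in zip(scores, grad)]
        # Scores are only identified up to a constant, so pin the mean to 0.
        mean = sum(scores) / n_docs
        scores = [s - mean for s in scores]
    return scores


# Toy preferences for three documents under one query.
prefs = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.95)]
scores = fit_bradley_terry(3, prefs)
ranking = sorted(range(3), key=lambda d: scores[d], reverse=True)
print(ranking)  # [0, 1, 2]
```

Because the scores are zero-mean within a query, they say nothing about whether the whole candidate list is good or bad. That is the gap the third step (bias correction) is meant to close.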

You can try the models either through our API (https://docs.zeroentropy.dev/models), or via HuggingFace (https://huggingface.co/zeroentropy/zerank-1-small).

We would love this community's feedback on the models, and the training approach. A full technical report is also going to be released soon.

Thank you!

Comments

sippeangelo•6h ago
Really cool stuff! Just want to let you know you forgot to link to the evals at the end.
ghita_•5h ago
oh wow, thanks for flagging, just fixed, thanks!
esafak•5h ago
I would have titled it "Improving ranking..."

I like that it works with `sentence_transformers`

ghita_•5h ago
yes we found it hard to find a good title for this, thanks for the feedback
dang•5h ago
We could change the title to "Improving search ranking with chess Elo scores". Anybody object?

Edit: ok, done. Submitted title was "Show HN: Improving RAG with chess Elo scores".

ashwindharne•5h ago
Cool stuff! We use a similar process internally to rerank and filter our cold outbound lists. We just use an off-the-shelf model as the judge, give it custom criteria, and let it run for a set number of iterations. It's helped narrow down wide searches to the maximally relevant set of people (a few thousand medium-bad matches down to a few hundred good matches).

It's not cheap and it's not fast, but it definitely works pretty well!
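For readers who have not seen this kind of loop, here is a rough sketch of an iterative Elo ranking driven by a pairwise judge. Everything specific is an assumption: `toy_judge` stands in for the LLM call, and the starting rating of 1000, the K-factor of 16, and the iteration count are arbitrary choices. Only the expected-score and update formulas are the standard Elo ones.

```python
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_rank(items, judge, iterations=2000, k=16.0, seed=0):
    """Rank items by repeatedly sampling a pair and asking the judge.

    judge(a, b) returns True when a is the better match (hypothetical
    interface; in practice this would wrap a model call).
    """
    rng = random.Random(seed)
    ratings = {item: 1000.0 for item in items}
    for _ in range(iterations):
        a, b = rng.sample(items, 2)
        outcome = 1.0 if judge(a, b) else 0.0
        delta = k * (outcome - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
    return sorted(items, key=lambda it: ratings[it], reverse=True)


def toy_judge(a: str, b: str) -> bool:
    # Stand-in criterion so the example runs offline: longer text wins.
    return len(a) > len(b)


profiles = ["a", "bbbb", "cc", "ddddddd", "eee"]
print(elo_rank(profiles, toy_judge))
```

The cost is one judge call per iteration, which is why this approach is neither cheap nor fast when the judge is an LLM.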

jayunit•3h ago
Very interesting! What are some examples of criteria that you can evaluate pairwise, but couldn't score individually?
bravura•3h ago
Pairwise rank constraints involve fewer assumptions than per-item scoring about the underlying nature of the data, thus they are more robust.
npip99•1h ago
Yeah that's exactly what we observed. Our goal was to create an absolute score that's completely independent from the Corpus, which is difficult because naturally all ELO distributions are inherently tied to the corpus itself!

When we were exploring the mathematical foundations, we considered ELO scoring against a "Universal Corpus" based on the natural entropy of human language (Obviously that's intractable, but sometimes this term cancels out like in the DPO proof).

But eventually we figured out a method using cross-query comparisons to assign an "ELO bias" to all document ELOs within a given query's candidate list. This normalizes it correctly such that when a candidate list is all bad, the ELOs shift low. And when the candidate list is all good, the ELOs shift high. Even when the relative ELOs are all the same.
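One way such a per-query bias could be fit is sketched below. This is a guess at the general shape, not the method described in the blog post: the function `fit_query_biases`, the tuple layout of `cross_prefs`, and the toy numbers are all hypothetical. Within-query ratings are held fixed (in log-odds units), and a single offset per query is fit by maximum likelihood from cross-query comparisons.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_query_biases(n_queries, cross_prefs, steps=3000, lr=0.05):
    """Fit one additive bias per query from cross-query comparisons.

    cross_prefs holds (qa, rating_a, qb, rating_b, p): p is the soft
    preference that the document rated rating_a under query qa is a
    better match for its query than the one rated rating_b under qb.
    """
    bias = [0.0] * n_queries
    for _ in range(steps):
        grad = [0.0] * n_queries
        for qa, ra, qb, rb, p in cross_prefs:
            err = p - sigmoid((ra + bias[qa]) - (rb + bias[qb]))
            grad[qa] += err
            grad[qb] -= err
        bias = [b + lr * g for b, g in zip(bias, grad)]
        mean = sum(bias) / n_queries  # biases are relative; pin the mean to 0
        bias = [b - mean for b in bias]
    return bias


# Two queries whose candidate lists have identical relative ratings (+1, -1),
# but the judge prefers query 0's documents at equal relative rating.
cross_prefs = [(0, 1.0, 1, 1.0, 0.88), (0, -1.0, 1, -1.0, 0.88)]
bias = fit_query_biases(2, cross_prefs)
absolute_q0 = [r + bias[0] for r in (1.0, -1.0)]
absolute_q1 = [r + bias[1] for r in (1.0, -1.0)]
print(bias, absolute_q0, absolute_q1)
```

In this toy case the relative ratings are the same for both queries, yet query 0's list shifts up and query 1's list shifts down, which is the behavior described above.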

ashwindharne•3h ago
It's all unstructured text (title, company, company size, experience, skills, raw text, etc.) and LLMs are pretty bad at assigning numerical scores in a vacuum. To make it work, we'd have to provide a representative set of examples, break scoring down by specific field, etc.

Kind of a lot of work compared to just dumping the text of 2 profiles into a context window along with a vague description of what I want, and having the LLM make the binary judgment.

yalok•5h ago
What’s the expected additional latency due to running this re-ranker?
ghita_•5h ago
It actually runs pretty fast: our benchmarks show ~149ms for 12665 bytes. It's faster than many other models.
esafak•5h ago
I would prominently display your benchmarks (against your competitors, of course). That's your selling point, right?
ghita_•4h ago
Yes! We did this here: https://www.zeroentropy.dev/blog/announcing-zeroentropys-fir... We wanted to share the approach with the community in this post. It does do better than competitors though!
seanhunter•5h ago
Fun fact about ELO. It's natural to think that it is some kind of initialism, but in fact ELO doesn't stand for anything. It's the name of the guy who invented the system. https://en.wikipedia.org/wiki/Arpad_Elo

So don't say it "E.L.O." (unless you're talking about the band, I guess), say "ee-low"

ghita_•5h ago
oh interesting, had no idea, thanks for sharing
amelius•4h ago
What was his ELO rating?
homarp•4h ago
https://chess.stackexchange.com/questions/35420/what-was-arp...

2065

esafak•3h ago
It should be Elo rating! https://en.wikipedia.org/wiki/Elo_rating_system
reactordev•2h ago
It’s also popular for ranking online players in games… really any game where there’s a win/loss ranking.
kayge•2h ago
Thanks for this :) I had never heard of Elo until I noticed this morning that the new Chess course in Duolingo gives you an Elo ranking after a few rounds against Oscar. Probably would have skipped right over this story and comments otherwise, but now I have a fun bit of non-tech trivia to share if it ever comes up in small talk someday.
npip99•2h ago
I often see it rendered as "Elo" but I've always found it more natural to capitalize as "ELO", but perhaps I should swap to "Elo" given this. Pronouncing "ee-low" is certainly the way it's done in chess/esports though!
bbstats•1h ago
(also because it's a name, you don't capitalize all three letters)
fvdessen•23m ago
Similar to the 'Gini coefficient', named after Corrado Gini, former president of the Italian Genetics and Eugenics Society and author of 'The Scientific Basis of Fascism'

https://en.wikipedia.org/wiki/Corrado_Gini

rahulnair23•4h ago
Interesting work.

For a slightly different take using a similar intuition, see our paper [at ACL 2024](https://arxiv.org/abs/2402.14860) on ranking LLMs which may be of interest.

Our HuggingFace space has some examples: https://huggingface.co/spaces/ibm/llm-rank-themselves

ghita_•4h ago
thank you, will check out the paper, the hf space is very cool!
mkaszkowiak•4h ago
Happy to see competition in rerankers! Good luck with your product.

My questions: what languages do your models currently support? Did you perform multilingual benchmarks? Couldn't find an answer on the website

ghita_•4h ago
Thanks! We trained on most European languages (English, French, Spanish, Russian...), Arabic, and Chinese, so it does well on those! We haven't tested too much on other languages, but happy to do so if there is a use case.
Neywiny•2h ago
I have a paper that got denied, but it was about using 2AFC sorting to do this instead of Elo. It has a defined end, unlike Elo scores. The code is on my GitHub and focuses on humans sorting images, but basically, if you have a Python sort function, you put your comparison as the key instead of assigning the comparison a numeric score. Then the algorithm does the rest.

Code: https://github.com/Neywiny/merge-sort

Conference/abstract presentation: https://www.spiedigitallibrary.org/conference-proceedings-of...
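In Python, "put your comparison as the key" most likely means `functools.cmp_to_key`. The sketch below shows the idea only and is not taken from the linked repository: `forced_choice` is a hypothetical stand-in for a human or LLM making a two-alternative forced choice, using a toy rule so the example runs on its own.

```python
from functools import cmp_to_key

comparisons = 0


def forced_choice(a: str, b: str) -> int:
    """Hypothetical 2AFC judge: -1 if `a` is preferred, 1 if `b` is.

    It never returns 0, so every comparison is a forced choice.
    Toy rule standing in for a real decision: the longer string wins.
    """
    global comparisons
    comparisons += 1
    return -1 if len(a) > len(b) else 1


items = ["bb", "a", "dddd", "ccc"]
ranked = sorted(items, key=cmp_to_key(forced_choice))
print(ranked)       # ['dddd', 'ccc', 'bb', 'a']
print(comparisons)  # number of judge calls the sort needed
```

The sort stops on its own after a bounded number of comparisons, which is the "defined end" mentioned above. An iterative rating scheme needs a separate stopping rule.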

ghita_•2h ago
would love to check out the code if you have it!
Neywiny•2h ago
https://github.com/Neywiny/merge-sort

It was actually done to counter Elo based approaches so there's some references in the readme on how to prove who's better. I haven't run this code in 5 years and haven't developed on it in maybe 6, but I can probably fix any issues that come up. My co-author looks to have diverged a bit. Haven't checked out his code. https://github.com/FrankWSamuelson/merge-sort . There may also be a fork by the FDA itself, not sure. This work was done for the FDA's medical imaging device evaluation division

reactordev•2h ago
I was going to mention this approach as well. The problem with the OP is that it has assumption bias and the entire chain is based on that assumption. It’s novel. But the original idea was to more evenly distribute scores so you can find real relevance and I think 2AFC is better. But I don’t have time to verify and post a paper about it.
Neywiny•2h ago
It's probably because that's what we used, but nAFC has been my go-to since I first learned about it. Literally any time there's a ranking, even for dumb stuff like tier list videos on YouTube, they're too arbitrary. Ok you ranked this snack an 8/10. Based on what? And then they go back and say "actually I'm going to move that to a 7". AFC fixes all of that.
npip99•1h ago
Yes, our pairwise method is based entirely on 2AFC comparisons, for both intra-query and inter-query ELO calculations.

It's definitely the best, if not the only, way to get extremely high signal and a score assignment that actually converges the more you sample.

In terms of the "F" in 2AFC, we actually have this amusing snippet from our prompt:

> Do NOT output a score of 0.0, ensure to focus on which document is superior, and provide a negative or positive float between -1.0 and 1.0.

Alex3917•2h ago
Out of curiosity, is there a reason why you are using ELO proper, rather than one of the ELO variants that doesn't make assumptions about the distribution of results? E.g.:

https://github.com/pfmonville/whole_history_rating

npip99•2h ago
Hey! We actually did a lot of research into ELO consistency, i.e. to check whether or not the NxN pairwise matrix followed the ELO model. It was a long road that's probably grounds for an entirely separate blog post, but the TLDR is that we observe that:

For each document, there is a secret hidden score "s" which is the "fundamental relevance according to the LLM". Then, when we sample (q, d1, d2) from the LLM, the LLM follows the statistical property that:

- The "fundamental hidden preference" is `pref = s_{d1} - s_{d2}`, usually ranging between -4 and 4.

- The LLM will sample a normal distribution around the `pref` with stddev ~0.2, which is some "inner noise" that the LLM experiences before coming to a judgement.

- The preference will pass through the sigmoid to get a sampled_score \in [0, 1].

- There is an additional 2% noise. i.e., 0.98 * sampled_score + 0.02 * random.random()

When we use Maximum Likelihood Estimation to find the most likely predicted "hidden scores" \hat{s} associated with each document, and then sample pairwise matrices according to `0.98 * sigmoid( \hat{s}_1 - \hat{s}_2 + N(0, 0.02) ) + 0.02 * Uniform(0, 1)`, we get pairwise matrices with virtually identical statistical properties to the observed pairwise matrices.
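The noise model described in this comment can be simulated in a few lines. This is a sketch under stated assumptions: the function name `sample_preference` and the example hidden scores are made up, and the inner noise uses the stddev of ~0.2 from the bullet list. The formula writes `N(0, 0.02)`, and the comment does not say which parameterization was meant.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sample_preference(s1, s2, rng, inner_std=0.2, outer_noise=0.02):
    """Draw one simulated judgment for a document pair.

    s1 and s2 are the hidden scores; inner_std and outer_noise follow
    the numbers quoted in the comment (assumed, not verified).
    """
    pref = s1 - s2                            # fundamental hidden preference
    noisy = pref + rng.gauss(0.0, inner_std)  # "inner noise" before judging
    sampled_score = sigmoid(noisy)            # squash into [0, 1]
    # Mix in a small amount of uniform noise.
    return (1.0 - outer_noise) * sampled_score + outer_noise * rng.random()


rng = random.Random(0)
samples = [sample_preference(1.5, 0.5, rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
print(round(mean, 3))
```

With a hidden preference of 1.0 the samples cluster around `0.98 * sigmoid(1.0) + 0.01`, which is roughly 0.73.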

etk934•2h ago
Will the reranker trained with MSE be better calibrated than those trained with InfoNCE? Will thresholding on reranker scores be more useful in RAG applications?
npip99•1h ago
We tried a Bradley-Terry loss function, as calculated with https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/SJ8sRl1Zge

We found that MSE after Elo adjustment worked equally well. And MSE lets you shuffle (q, d) across the dataset, which has good statistical properties (versus contrastive, which makes you sample the same query many times within a single minibatch).

In this case "InfoNCE" isn't applicable because the reranker's output is a scalar, not a vector. So that's why we checked both Bradley-Terry and MSE.
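To make the contrast concrete, here are the two kinds of loss in their textbook form. The exact Bradley-Terry loss in the linked note may differ. The function names and example values are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry_loss(score_a: float, score_b: float, target_pref: float) -> float:
    """Pairwise loss: cross-entropy between sigmoid(score_a - score_b)
    and a soft target preference in [0, 1]. Needs two documents for the
    same query in every training example."""
    q = sigmoid(score_a - score_b)
    return -(target_pref * math.log(q) + (1.0 - target_pref) * math.log(1.0 - q))


def mse_loss(predicted_score: float, target_score: float) -> float:
    """Pointwise loss against a precomputed (adjusted) target score.
    Each (query, document) pair is an independent example, so the
    dataset can be shuffled freely."""
    return (predicted_score - target_score) ** 2


print(bradley_terry_loss(0.0, 0.0, 0.5))  # log(2): no preference predicted or observed
print(mse_loss(1.0, 0.5))                 # 0.25
```

The structural difference is in the inputs: the pairwise loss consumes two model outputs per example, while the pointwise loss consumes one.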

pbhjpbhj•1h ago
So this is for recruitment?

I like the pairwise approach but in the field I'm interested in, at the document level there can be a lot of relevance (we historically use scoring based on TF-IDF) but we tend to get a corpus of documents that then need involved human analysis to give the relevant sections. It seems that paragraph-level vectors are probably at the right conceptual level for refinement.

Ultimately, I guess, what is considered a document is somewhat arbitrary. But I wondered if you'd looked at - or if someone here knows about - ML models for retrieval that consider documents at a mix of conceptual levels to improve retrieval. So, pairwise paragraph-level after a broader retrieval would be a simple example.

I guess for looking at CVs/resumes that might relate to finding someone who was a gardener at Google and then later used ML for graphic design, vs someone who did ML at Google ... which might be a similar document vector (poor example, but you get the picture).

Currently I'm seeing document level references to source material, snippets based on keywords, but not paragraph level referencing as you'd have for legal decisions.

bbstats•1h ago
Little reminder that Elo is a guy, not an acronym :)