you find a lot of things that are unsatisfying, such as never getting a relevance score better than p=0.7 or so, and even that is very rare. There are many specific problems in IR for which that kind of probability would be really helpful, such as combining results that came from different sources or returning a stream of new documents from a collection. But it was an early decision in TREC not to reward ranking functions that were good probability estimators, or even good at the top-1 or top-3 positions, but rather to reward them for still being enriched in relevant results deep into the ranking (say, 1000 results deep).
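(To make the "combining results from different sources" point concrete: if each source emitted a well-calibrated P(relevant), merging becomes trivial. A minimal sketch with made-up document IDs and scores:)

```python
# Hypothetical example: two sources return (doc_id, calibrated P(relevant)).
# With comparable probabilities, merging is just a sort on that probability.
source_a = [("doc1", 0.62), ("doc2", 0.41)]
source_b = [("doc9", 0.55), ("doc3", 0.12)]

merged = sorted(source_a + source_b, key=lambda d: d[1], reverse=True)
# merged[0] is ("doc1", 0.62) -- the highest-probability doc across sources
```

Without calibration, a score of 0.62 from source A and 0.55 from source B are not comparable, and this simple merge is unjustified.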
softwaredoug•6h ago
Interesting! I did not know this about TREC's decision or the scikit-learn calibration module.
PaulHoule•6h ago
https://scikit-learn.org/stable/modules/calibration.html