Counterfactual evaluation for recommendation systems

https://eugeneyan.com/writing/counterfactual-evaluation/

91•kurinikku•3w ago

Comments

ZouBisou•3w ago

I wish there was a HN for data content. Articles like this are always my favorite ways to learn something new from a domain expert. I work in fraud detection & would like to write something similar!

spott•2w ago

https://datatau.net used to be kinda that… but it has been dead for a while, and it looks like the spam bots have taken over.

yearolinuxdsktp•2w ago

I admit this is way over my head, I am still trying to grok it. This seems to require an existing model to start from—-I am not sure how one would arrive at a model from scratch (I guess start from the same weights on all items?)

I think the point about A/B testing in production to confirm if a new model is working is really important, but quite important is to also do A/B/Control testing, where Control is random (seeded to the context or user) or no recommendations, which helps not only with A vs B, but helps validate that A or B isn’t performing worse than Control. What percentage of traffic (1% or 5%) goes to Control depends on traffic levels, but also requires convincing to run control.

I think one important technique is to pre-aggregate your data on a user-centered or item-centered basis. This can make it much more palatable to collect this data on a massive scale without having to store a log for every event.

Contextual bandit is one technique that attempts to deal with confounding factors and bias from actual recommendations. However, I think there’s a major challenge to scale it to large counts of items.

I think the quality of collected non-click data is also important—-did the user actually scroll down to see the recommendations or were they served but not looked at? Likewise, I think it’s important to add depth to the “views” or “clicks” metric—-if something was clicked, how long did the user spend viewing/interacting with the item? Did they click and immediately go back or did they click and look at it for a while? Did they add the item to cart? Or if we are talking about articles, did they spend time reading it? Item interest can be estimated more closely than just views, clicks and purchases. Of course, we know that purchases (or more generally conversion rates) have a direct business value, but, for example, an add to cart is somewhat of a proxy of purchase probability and can enhance the quality of the data used to train (and thus a higher proxy business value).

It’s probably impractical to train on control interactions only (and also difficult to keep the same user in control group between visits).

The SNIPS normalization technique reminds me of the Mutual Information factor correction when training co-occurrence (or association) models, where Mutual Information rewards items less likely to randomly co-occur.

jlamberts•2w ago

Re: existing model, for recsys, as long as the product already exists you have some baseline available, even if it's not very good. Anything from "alphabetical order" to "random order" to "most popular" (a reasonable starting point for a lot of cases) is a baseline model.

I agree that a randomized control is extremely valuable, but more as a way to collect unbiased data than a way to validate that you're outperforming random: it's pretty difficult to do worse than random in most recommendation problems. A more palatable way to introduce some randomness is by showing a random item in a specific position with some probability, rather than showing totally random items for a given user/session. This has the advantage of not ruining the experience for an unlucky user when they get a page of things totally unrelated to their interests.

dfajgljsldkjag•2w ago

It is annoying how often our offline metrics look perfect while the actual A/B test shows zero lift. The article explains that gap really well by framing recommendations as an interventional problem rather than just an observational one. I guess we really need to start looking at counterfactual evaluation if we want our offline tests to actually mean something.

westurner•2w ago

From https://news.ycombinator.com/item?id=46663105 (flagged?) :

> There are a number of different types of counterfactuals; Describe the different types of counterfactuals in statistics: Classical counterfactuals, Pearl's counterfactuals, Quantum counterfactuals, Constructor theory counterfactuals

Why did the author believe that that counterfactual model was appropriate for this?

We Mourn Our Craft

Speed up responses with fast mode

Hoot: Scheme on WebAssembly

Stories from 25 Years of Software Development

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

Al Lowe on model trains, funny deaths and working with Disney

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

The AI boom is causing shortages everywhere else

The Waymo World Model

Reinforcement Learning from Human Feedback

Start all of your commands with a comma (2009)

I Write Games in C (yes, C)

SectorC: A C Compiler in 512 bytes

Vocal Guide – belt sing without killing yourself

France's homegrown open source online office suite

Coding agents have replaced every framework I used

A Fresh Look at IBM 3270 Information Display System

Selection Rather Than Prediction

History and Timeline of the Proco Rat Pedal (2021)

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Monty: A minimal, secure Python interpreter written in Rust for use by AI

Making geo joins faster with H3 indexes

Software factories and the agentic moment