There was a constant stream of new content (i.e., arms for the bandits) to choose from. Instead of running manual experiments (e.g., A/B tests or other designs), the bandits would sample the new set of options and arrive at a new optimal mix much more quickly.
But we did want to run experiments with other things around the content that was managed by the bandits (e.g., UI flow, overall layout, other algorithmic things, etc.). It turns out bandits complicate these experiments significantly. Any changes to the context in which the bandits operate lead them to shift things more towards exploration to find a new optimal mix, hurting performance for some period of time.
We had a choice we could make here... treat all traffic, regardless of cohort, as a single universe that the bandits are managing (so they would optimize for the mix of cohorts as a whole). Or we could setup bandit stats for each cohort. If things are combined, then we can't use an experiment design that assumes independence between cohorts (e.g., A/B testing) because the bandits break independence. But the optimal mix will likely look different for one cohort vs. another vs. all of them combined. So it's better for experiment validity to isolate the bandits for each cohort. Now small cohorts can take quite a while to converge before we can measure how well things work. All of this puts a real limit on iteration speed.
Things also become very difficult to reason about because their is state in the bandit stats that are being used to optimize things. You can often think of that as a black box, but sometimes you need to look inside and it can be very difficult.
Much (all?) of this comes from bandits being feedback loops - these same problems are present in other approaches where feedback loops are used (e.g., control theory based approaches). Feedback mechanisms are incredibly powerful, but they couple things together in ways that can be difficult to tease apart.
On a random day, a light went off in my head that hierarchy perfectly addresses the stratification vs aggregation problems that arise in bandits. Unfortunately I’ve never had a chance to apply this (and thus see the issues) in a relevant setting since.
esafak•1h ago