There was a constant stream of new content (i.e., arms for the bandits) to choose from. Instead of running manual experiments (e.g., A/B tests or other designs), the bandits would sample the new set of options and arrive at a new optimal mix much more quickly.
But we did want to run experiments with other things around the content that was managed by the bandits (e.g., UI flow, overall layout, other algorithmic things, etc.). It turns out bandits complicate these experiments significantly. Any changes to the context in which the bandits operate lead them to shift things more towards exploration to find a new optimal mix, hurting performance for some period of time.
We had a choice we could make here... treat all traffic, regardless of cohort, as a single universe that the bandits are managing (so they would optimize for the mix of cohorts as a whole). Or we could set up bandit stats for each cohort. If things are combined, then we can't use an experiment design that assumes independence between cohorts (e.g., A/B testing) because the bandits break independence. But the optimal mix will likely look different for one cohort vs. another vs. all of them combined. So it's better for experiment validity to isolate the bandits for each cohort. But now small cohorts can take quite a while to converge before we can measure how well things work. All of this puts a real limit on iteration speed.
Things also become very difficult to reason about because there is state in the bandit stats that is being used to optimize things. You can often think of that as a black box, but sometimes you need to look inside, and it can be very difficult.
Much (all?) of this comes from bandits being feedback loops - these same problems are present in other approaches where feedback loops are used (e.g., control theory based approaches). Feedback mechanisms are incredibly powerful, but they couple things together in ways that can be difficult to tease apart.
On a random day, a light went off in my head that hierarchy perfectly addresses the stratification vs aggregation problems that arise in bandits. Unfortunately I’ve never had a chance to apply this (and thus see the issues) in a relevant setting since.
Can you explain, please?
For example, cohorts with very light traffic are likely to get an undue benefit: a lot of the exploration may already have been done by the time the smaller cohort needs to select an arm, so things are closer to convergence.
Another example is if there are wildly different outcomes between cohorts. More of the exploration will be done in cohorts with more traffic, leading bandit optimizations to fit large cohorts better than lower traffic cohorts.
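One way to picture the "hierarchy" idea from a few comments up is partial pooling: each cohort's per-arm estimate gets shrunk toward the global estimate, with the lightest-traffic cohorts shrunk the most, so small cohorts neither free-ride on everyone else's exploration nor get drowned out by it. A minimal sketch (the counts and the pooling strength are made-up illustrative values, not anything from a real system):

```python
import numpy as np

def pooled_ctr_estimates(clicks, impressions, prior_strength=50.0):
    """Partial pooling for one arm: blend each cohort's observed CTR with
    the global CTR, weighting low-traffic cohorts toward the global rate.

    clicks, impressions: per-cohort counts for a single arm.
    prior_strength: pseudo-impressions given to the global estimate
                    (an arbitrary illustrative value).
    """
    clicks = np.asarray(clicks, dtype=float)
    impressions = np.asarray(impressions, dtype=float)

    global_ctr = clicks.sum() / impressions.sum()

    # Posterior-mean-style blend: a cohort's own data dominates once its
    # traffic is large relative to prior_strength.
    return (clicks + prior_strength * global_ctr) / (impressions + prior_strength)

# Cohort A is large, cohort B is tiny; B's noisy 3/10 = 0.30 estimate gets
# pulled most of the way back toward the global rate (~0.05).
print(pooled_ctr_estimates(clicks=[500, 3], impressions=[10_000, 10]))
```

The blend weight grows with a cohort's own traffic, so big cohorts mostly keep their own estimates while tiny cohorts borrow strength from the aggregate.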
Even if you do manage to make things independent, you have to wait for bandit convergence before you know what converged results will look like. That requires being able to measure convergence, which isn't always easy, especially if you don't know to do that before designing the system.
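For a Bernoulli-reward (click/no-click) bandit with Beta posteriors, one convergence heuristic is to estimate the probability that the currently leading arm really is the best one, and only start measuring once that clears some threshold. A sketch (the counts and the 0.95 threshold mentioned in the comment are arbitrary illustrative choices):

```python
import numpy as np

def prob_leader_is_best(successes, failures, n_samples=20_000, seed=0):
    """Monte Carlo estimate of P(the currently leading arm is truly best),
    assuming Beta(1 + successes, 1 + failures) posteriors per arm."""
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes, dtype=float)
    failures = np.asarray(failures, dtype=float)
    draws = rng.beta(1 + successes, 1 + failures, size=(n_samples, len(successes)))
    leader = int(np.argmax(successes / (successes + failures)))
    return float(np.mean(np.argmax(draws, axis=1) == leader))

# One possible rule: only start trusting the converged mix once this
# probability clears some threshold, e.g. 0.95 (an arbitrary choice).
print(prob_leader_is_best(successes=[120, 95, 40], failures=[880, 905, 960]))
```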
Despite all of these problems, we kept bandits, and even expanded their application, at least for the 10 years I was still around. They are incredibly powerful. But there was a lot of "I wish those damned bandits didn't work so well!"
For anyone who is not aware, A/B tests assume cohorts behave independently of each other. The less true that is, the less reliable the results are. This was even worse for us in parts of our system where there were no bandits, but there were direct interactions between individuals.
At some scales it can make sense, but for the vast majority of products/companies the statistical benefits don't outweigh the added complexity.
Bandits work best for relatively simple optimization choices at very large scale. If you care at all about the information gained in an experiment, it's fairly unlikely bandits are a good choice for you (or any reinforcement learning algorithm at that point).
How to cull arms, so that there are enough samples for any kind of convergence, is another problem in this setup. We eventually built an ML model to select arms and used bandits to pick between them. This proved "too effective". The arms were all user-generated content, and the bandits on top of the model, both set up to maximize clicks, were stupidly good at surfacing inappropriate content because it got a lot of clicks. We ended up having to put more safeties on arm selection for certain categories of our content where we had the most inappropriate submissions.
"Bandits work best for relatively simple optimization choices at very large scale"
Have you considered different methods to address this shortcoming?
I used control theory as an example of another type of feedback loop because we were using controllers in other parts of the system. One was roughly the proportional term of a PID controller. Another was both the proportional and integral terms, but with a lot of domain specific heuristics added on (e.g., ML models to set baselines that help the controllers understand how quickly to move values). This was all incredibly powerful, but also required experiment design and analysis that was far more complicated than we would have liked.
This was not a ride-sharing company, but you can imagine dynamic pricing for Uber or Lyft working this way. Use a controller to try to keep the number of available riders and drivers in balance. If riders exceed drivers, increase payment to drivers (and therefore cost to riders), thereby attracting more drivers. If drivers are sitting idle, decrease payment. Ideally, this is done constantly to respond to time of day, weather, traffic, etc. (with some predictive baseline expectations to keep it from being too slow to respond). When you get it right, riders get the cheapest price that ensures a driver is available. In economic terminology, this is finding the market clearing price. A controller can get you a long way in a hurry on such a system. But it's certainly not the only approach that can work, and it takes a lot to get it right (I have no idea how Uber and Lyft actually do any of this).
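To make the controller idea concrete, here is a minimal sketch of a proportional + integral price adjustment for a hypothetical version of that balancing problem. The gains, the baseline, and the supply/demand signal are all made up; this is not how any real pricing system necessarily works:

```python
class SurgeController:
    """Tiny proportional + integral controller nudging a price multiplier
    so that ride requests and available drivers stay roughly balanced."""

    def __init__(self, kp=0.5, ki=0.05, baseline=1.0):
        self.kp = kp              # proportional gain (illustrative value)
        self.ki = ki              # integral gain (illustrative value)
        self.baseline = baseline  # could come from a predictive model instead
        self.accumulated_error = 0.0

    def update(self, requests, available_drivers):
        # Positive error means demand exceeds supply, so the price should rise.
        error = (requests - available_drivers) / max(requests, 1)
        self.accumulated_error += error
        multiplier = (self.baseline
                      + self.kp * error
                      + self.ki * self.accumulated_error)
        return max(multiplier, 1.0)  # never price below the baseline fare

controller = SurgeController()
for requests, drivers in [(120, 100), (150, 100), (110, 115)]:
    print(round(controller.update(requests, drivers), 3))
```

The predictive baseline expectations mentioned above would feed in through `baseline` instead of a constant, which is part of what keeps a controller like this from reacting too slowly.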
One way to peek into the state is to use Bayesian models to represent the "belief" state of the bandits. For example, the arm's "utility" can be a linear function of the features of the arm. At each period, you can inspect the coefficients (and their distribution) for each arm.
See this package:
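Independent of any particular package, here is a minimal sketch of that kind of belief state: a Gaussian posterior over the coefficients of a linear utility model, updated from observed rewards using the conjugate known-noise update (the feature values and rewards below are made up):

```python
import numpy as np

class BayesianLinearBelief:
    """Gaussian posterior over the coefficients of a linear utility model,
    utility(arm) ~= features(arm) @ weights, using the conjugate update
    with a known noise variance."""

    def __init__(self, n_features, prior_var=1.0, noise_var=1.0):
        self.precision = np.eye(n_features) / prior_var  # posterior precision
        self.b = np.zeros(n_features)                    # accumulates X^T y / noise_var
        self.noise_var = noise_var

    def update(self, features, reward):
        x = np.asarray(features, dtype=float)
        self.precision += np.outer(x, x) / self.noise_var
        self.b += reward * x / self.noise_var

    @property
    def mean(self):
        return np.linalg.solve(self.precision, self.b)

    @property
    def covariance(self):
        return np.linalg.inv(self.precision)

belief = BayesianLinearBelief(n_features=2)
for features, reward in [([1.0, 0.0], 0.9), ([1.0, 0.0], 1.1), ([0.0, 1.0], 0.2)]:
    belief.update(features, reward)

# "Looking inside the black box": coefficient estimates and their uncertainty.
print("coefficient means:", belief.mean)
print("coefficient stds: ", np.sqrt(np.diag(belief.covariance)))
```

Printing the posterior means and standard deviations is the "look inside the black box" step: you can see which arm features the model currently believes drive utility, and how sure it is about each one.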
What's challenging about that though?
* we have a single optimization goal (e.g., the metric to minimize or maximize). The hard part isn't defining an optimization goal. The hard part is identifying what tradeoff between different metrics stakeholders actually want. If you have goal A and goal B, where B is more aggressive than A, then getting agreement on which is better is hard. This is a people problem.
* MAB seems to be a good proof of concept that something can be optimized, but it isn't an "end game" optimization path.
* MAB for A/B testing is going to mess up your A/B data and make everything more difficult. You should have a treatment that uses the MAB algorithm and a treatment that doesn't.
All of the above is for non-contextual MAB. I am currently learning about different MAB algorithms, although each of them is pretty similar. The ones I've read about are all effectively linear/logistic regression, and the differences come from the exploration mechanism and how uncertainty is represented. Epsilon-greedy has no uncertainty model; exploration just happens a fixed percentage of the time. UCB is optimistic about uncertainty, and the amount of optimism controls exploration. Thompson sampling uses statistical (Beta) distributions to represent uncertainty; exploration happens less as confidence about a particular set of options increases.
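A rough side-by-side sketch of those three selection rules for a Bernoulli-reward (click/no-click) bandit, given per-arm click and impression counts (the epsilon, seed, and counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def epsilon_greedy(clicks, pulls, epsilon=0.1):
    # No uncertainty model: explore uniformly a fixed fraction of the time.
    if rng.random() < epsilon:
        return int(rng.integers(len(clicks)))
    return int(np.argmax(clicks / pulls))

def ucb1(clicks, pulls):
    # Optimism in the face of uncertainty: a bonus that shrinks as an arm
    # accumulates pulls, so under-explored arms look temporarily better.
    bonus = np.sqrt(2 * np.log(pulls.sum()) / pulls)
    return int(np.argmax(clicks / pulls + bonus))

def thompson(clicks, pulls):
    # Sample each arm's CTR from its Beta posterior; exploration fades on
    # its own as the posteriors tighten around the observed rates.
    return int(np.argmax(rng.beta(1 + clicks, 1 + (pulls - clicks))))

clicks = np.array([30.0, 20.0, 1.0])
pulls = np.array([300.0, 250.0, 20.0])
for policy in (epsilon_greedy, ucb1, thompson):
    print(policy.__name__, "picks arm", policy(clicks, pulls))
```

With these counts, UCB's optimism bonus favors the under-explored third arm, while the greedy estimate favors the first; Thompson sampling lands somewhere in between depending on the draw.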
Overall it's a fun area to work in that is quite different from your typical CRUD development, which is a nice change of pace.
edit: ah ok yes it does: https://en.wikipedia.org/wiki/Multi-armed_bandit
"We consider the basic model with IID rewards, called stochastic bandits. An algorithm has K possible actions to choose from, a.k.a. arms, and there are T rounds, for some known K and T . In each round, the algorithm chooses an arm and collects a reward for this arm. The algorithm’s goal is to maximize its total reward over the T rounds."