[原文链接：http://engineering.richrelevance.com/bandits-recommendation-systems/。]

### [本文链接：http://www.cnblogs.com/breezedeus/p/3775316.html，转载请注明出处]

#### Bandits for Recommendation Systems

06/02/2014 • Topics: Bayesian, Big data, Data Science

*This is the first in a series of three blog posts on bandits for recommendation systems. *

In this blog post, we will discuss the bandit problem and how it relates to online recommender systems. Then, we'll cover some classic algorithms and see how well they do in simulation.

A common problem for internet-based companies is: *which piece of content should we display?* Google has this problem (which ad to show), Facebook has this problem (which friend's post to show), and RichRelevance has this problem (which product recommendation to show). Many of the promising solutions come from the study of the *multi-armed bandit* problem. A one-armed "bandit" is another way to say slot machine (probably because both will leave you with empty pockets). Here is a description that I hijacked from Wikipedia:

"The **multi-armed bandit problem **is the problem a gambler faces at a row of slot machines when deciding which machines to play, how many times to play each machine and in which order to play them.^{ }When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls."

Let's rewrite this in retail language. Each time a shopper looks at a webpage, we have to show them one of K product recommendations. They either click on it or do not, and we log this (binary) reward. Next, we proceed to either the next shopper or the next page view of this shopper and have to choose one of Kproduct recommendations again. (Actually, we have to choose multiple recommendations per page, and our 'reward' could instead be sales revenue, but let's ignore these aspects for now.)

Multi-armed bandits come in two flavors: stochastic and adversarial. The stochastic case is where each bandit doesn't change in response to your actions, while in the adversarial case the bandits learn from your actions and adjust their behavior to minimize your rewards. We care about the stochastic case, and our goal is to find the arm which has the largest *expected reward*. I will index the arms by a , and the probability distribution over possible rewards r for each arm a can be written as \( p_a(r)\) . We have to find the arm with the largest mean reward

\( μ_a=E_a[r] \)

as quickly as possible while accumulating the most rewards along the way. One important point is that in practice \( p_a(r)\) are *non-stationary*, that is, rewards change over time, and we have to take that into account when we design our algorithms.

#### Approach #1: A Naive Algorithm

We need to figure out the mean reward (expected value) of each arm. So, let's just try each arm 100 times, take the sample mean of the rewards we get back, and then pull the arm with the best sample mean forever more. Problem solved?

Not exactly. This approach will get you in trouble in a few key ways:

- If K is of even moderate size (10-100), you'll spend a long time gathering data before you can actually benefit from feedback.
- Is 100 samples for each arm enough? How many should we use? This is an arbitrary parameter that will require experimentation to determine.
- If after 100 samples (or however many), the arm you settle on is not actually the optimal one, you can never recover.
- In practice, the reward distribution is likely to change over time, and we should use an algorithm that can take that into account.

OK, so maybe the naive approach won't work. Let's move on to a few that are actually used in practice.

#### Approach #2: The ϵ− Greedy Algorithm

What we'd like to do is start using what we think is the best arm as soon as possible, and adjust course when information to the contrary becomes available. And we don't want to get stuck in a sub-optimal state forever. The " ϵ− greedy" algorithm addresses both of these concerns. Here is how it works: with probability 1−ϵ pull the arm with the best current sample mean reward, and otherwise pull a random other arm (uniformly). The advantages over the naive are:

- Guaranteed to not get stuck in a suboptimal state forever.
- Will use the current best performing arm a large proportion of the time.

But setting ϵ is *hard*. If it’s too small, learning is slow at the start, and you will be slow to react to changes. If we happen to sample, say, the second-best arm the first few times, it may take a long time to discover that another arm is actually better. If ϵ is too big, you’ll waste many trials pulling random arms without gaining much. After a while, we'll have enough samples to be pretty sure which is best, but we will still be wasting an ϵ of our traffic exploring other options. In short, ϵ is a parameter that gives poor performance at the extremes, and we have little guidance as to how to set it.

#### Approach #3: Upper Confidence Bound Algorithms

In the world of statistics, whenever you estimate some unknown parameter (such as the mean of a distribution) using random samples, there is a way to quantify the uncertainty inherent in your estimate. For example, the true mean of a fair six-sided die is 3.5. But if you only roll it once and get a 2, your best estimate of the mean is just 2. Obviously that estimate is not very good, and we can quantify just how variable it is. There are *confidence bounds* which can be written, for example, as: "The mean of this die is 2, with a 95-th percentile lower bound of 1.4 and a 95-th percentile upper bound of 5.2."

The upper confidence bound (UCB) family of algorithms, as its name suggests, simply selects the arm with the largest upper confidence bound at each round. The intuition is this: the more times you roll the die, the tighter the confidence bounds. If your roll the die an infinite number of times then the width of the confidence bound is zero, and before you ever roll it the width of the confidence bound is the largest it will ever be. So, as the number of rolls increases, the uncertainty decreases, and so does the width of the confidence bound.

In the bandit case, imagine that you have to introduce a brand new choice to the set of K choices a week into your experiment. The ϵ− greedy algorithm would keep chugging along, showing this new choice rarely (if the initial mean is defined to be 0). But the upper confidence bound of this new choice will be very large because of the uncertainty that results from us never having pulled it. So UCB will choose this new arm until its upper bound is below the upper bound of the more established choices.

So, the advantages are:

- Take uncertainty of sample mean estimate into account in a smart way.
- No parameters to validate.

And the major disadvantage is that the confidence bounds designed in the machine learning literature require heuristic adjustments. One way to get around having to wade through heuristics is to recall the central limit theorem. I'll skip the math but it says that the distribution of the sample mean computed from samples from *any* distribution converges to a Normal (Gaussian) as the number of samples increases (and fairly quickly). Why does that matter here? Because we are estimating the true expected reward for each arm with a sample mean. Ideally we want a posterior of where the true mean is, but that's hard in non-Bernoulli and non-Gaussian cases. So we will instead content ourselves with an approximation and use a Gaussian distribution centered at the sample mean instead. We can thus always use a, say, 95% upper confidence bound, and be secure in the knowledge that it will become more and more accurate the more samples we get. I will discuss this in more detail in the next blog post.

#### Simulation Comparison

So how do these 3 algorithms perform? To find out, I ran a simple simulation 100 times with K=5 and binary rewards (aka a Bernoulli bandit). Here are the 5 algorithms compared:

- Random - just pick a random arm each time without learning anything.
- Naive - with 100 samples of each arm before committing to the best one
- ϵ− Greedy - with ϵ=0.01
- UCB - with (1 - 1/t) bounds (heuristic modification of UCB1)
- UCB - with 95% bounds

The metric used to compare these algorithms is average (over all the trials) *expected regret* (lower is better), which quantifies how much reward we missed out on by pulling the suboptimal arm at each time step. The Python code is here and the results are in the plot below.

What can we conclude from this plot?

- Naive is as bad as random for the first 100K rounds, but then has effectively flat performance. In the real world, the arms have shifting rewards, so this algorithm is impractical because it over-commits
- ϵ− greedy is OK but without a decaying ϵ we're still wasting 1% of traffic on exploration when it may no longer be necessary.
- The UCB algorithms are great. It's not clear which one is the winner in this limited horizon, but both handily beat all of the other algorithms.

Now you know all about bandits, and have a good idea of how they might be relevant to online recommender systems. But there's more to do before we have a system that is really up to the job.

Coming up next: let's get Bayesian with Thompson Sampling!

About Sergey Feldman:

Sergey Feldman is a data scientist & machine learning cowboy with the RichRelevance Analytics team. He was born in Ukraine, moved with his family to Skokie, Illinois at age 10, and now lives in Seattle. In 2012 he obtained his machine learning PhD from the University of Washington. Sergey loves random forests and thinks the Fourier transform is pure magic.