COMP 424 – Artificial Intelligence
Bandits
Instructor: Jackie CK Cheung
Recall: Lotteries
• A lottery is a set of outcomes, each associated with some probability
• E.g., suppose you had to choose between these two lotteries:
• L1: win $1M for sure.
• L2: win $5M with prob. 0.1,
      win $1M with prob. 0.89,
      win $0 with prob. 0.01.
Recall: Utilities
• Utilities map outcomes (or states) to real values.
• MEU principle: Choose the action that maximizes expected utility.
• Most widely accepted as a standard for rational behavior.
Outline for Today
Learning utilities from data
• Bandit problems
• Action values
• Exploration-exploitation tradeoff
• Simple exploration strategies
Bandit Problems
A model of decision making under uncertainty; the name "bandit" comes from the slang term for a slot machine (a "one-armed bandit").
• The environment is unknown: we don't know what the lottery actually is!
• We would like to learn the (expected) utility of each action.
• A k-armed bandit is a collection of k actions (arms), each with a lottery associated with it.
• e.g., a 3-armed bandit
Application #1: Internet advertising
• A large Internet company is interested in selling advertising on their website.
• It receives money when a company places an ad on the website and that ad gets clicked by a visitor to the website.
• What are the bandit’s arms?
Application #1: Internet advertising
• A large Internet company is interested in selling advertising on their website.
• It receives money when a company places an ad on the website and that ad gets clicked by a visitor to the website.
• On a webpage, you can choose to display any of n possible ads.
• Each ad has an unknown click-through rate (probability of being clicked).
• If the ad is clicked, there is a reward (utility), otherwise none.
• Q: What is the best advertisement strategy to maximize return?
Application #2: Network server selection
• Suppose you can choose to send a job from a user to be processed on one of several servers.
• The servers have different processing speed (e.g. due to geographic location, load, etc.)
• What are the bandit’s arms?
Application #2: Network server selection
• Suppose you can choose to send a job from a user to be processed on one of several servers.
• The servers have different processing speed (e.g. due to geographic location, load, etc.)
• Each server can be viewed as an action (arm).
• Over time, you want to learn which action is the best to take.
• This is used in routing, DNS server selection, cloud computing, etc.
Playing a bandit
• You can choose repeatedly among the k actions; each choice is called a play.
• After each play a_t, the machine gives a reward r_t drawn from the distribution associated with a_t.
• The value of action a is its expected utility: Q*(a) = E[r | a]
• Note that r is drawn from a probability distribution that depends on a.
• Objective: Choose arms in a way that maximizes the reward obtained in the long run (e.g. over 1000 rounds).
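To make the setup concrete, here is a minimal Python sketch of a k-armed bandit whose arms pay Gaussian rewards (the class name, the Gaussian payoffs, and the unit variance are illustrative assumptions, not part of the slides):

```python
import random

class GaussianBandit:
    """A k-armed bandit: arm a pays a reward drawn from N(Q*(a), 1)."""

    def __init__(self, k, seed=None):
        self.rng = random.Random(seed)
        # The true action values Q*(a) are hidden from the player.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, a):
        """Play arm a; the reward is a noisy sample, not the expected value Q*(a)."""
        return self.rng.gauss(self.q_star[a], 1.0)

bandit = GaussianBandit(k=3, seed=0)
print(bandit.pull(0), bandit.pull(0))  # two plays of the same arm give different rewards
```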
Estimating action values
• Suppose that action a has been chosen n times, and the rewards received were r_1, r_2, …, r_n.
• Then we can estimate the value of the action as the sample average of the rewards obtained:
  Q_n(a) = (r_1 + r_2 + … + r_n) / n
• As we take the action more, this estimate becomes more accurate (law of large numbers), and in the limit:
  lim_{n→∞} Q_n(a) = Q*(a)
• One can even measure how fast Q_n approaches Q*.
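As a minimal sketch (the function name is mine), the sample-average estimate is just the mean of the rewards seen for that arm:

```python
def sample_average(rewards):
    """Q_n(a) = (r_1 + ... + r_n) / n, given the n rewards observed for arm a."""
    return sum(rewards) / len(rewards)

print(sample_average([1.0, 0.0, 2.0]))  # 1.0
```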
Estimating action values incrementally
• Do we need to remember all rewards so far to estimate Qn(a) ?
• No! Just keep a running average: Qn+1(a) = (r1 + r2 + … + rn+1) / n+1
= (r1 + r2 + … + rn) / (n+1) + rn+1 / (n+1) = nQn(a) / (n+1) + rn+1 / (n+1)
= (nQn(a)+Qn(a)-Qn(a)) / (n+1) + rn+1 / (n+1) = Qn(a) + (rn+1 – Qn(a)) / (n+1)
• The first term is the old value estimate.
• The second term is the error between the new sample and the old
value estimate, weighted by a decreasing step size (=1/(n+1)).
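A small sketch of the incremental form (names are mine); feeding rewards one at a time reproduces the batch sample average:

```python
def incremental_update(q, n, reward):
    """One running-average step: Q_{n+1} = Q_n + (r_{n+1} - Q_n) / (n + 1).

    q is the current estimate Q_n(a); n is how many rewards it is based on.
    Returns the updated estimate and count without storing past rewards.
    """
    return q + (reward - q) / (n + 1), n + 1

q, n = 0.0, 0
for r in [1.0, 0.0, 2.0]:
    q, n = incremental_update(q, n, r)
print(q)  # 1.0, the same as (1 + 0 + 2) / 3
```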
Exploration-exploitation trade-off
Tension between the two:
• Explore actions, to figure out which one is best (which means some amount of random choice.)
• Exploit the knowledge you already have, which means picking the greedy action: a_t* = argmax_a Q_t(a)
• You cannot explore all the time (because you will lose big-time).
• You cannot exploit all the time, because you may end up with a suboptimal policy.
What if the problem is non-stationary?
• Sample averaging works if problem is stationary (i.e. Q*(a) does not change over time).
• In some applications (e.g. advertising), Q*(a) may change over time → the most recent rewards should be emphasized!
• Instead of 1/n, use a constant step size α ∈ (0,1) for value updates:
  Q_{n+1}(a) = Q_n(a) + α (r_{n+1} − Q_n(a))
• This leads to a recency-weighted average estimate:
  Q_{n+1}(a) = (1−α)^(n+1) Q_0(a) + Σ_{i=1..n+1} α (1−α)^(n+1−i) r_i
• Because α ∈ (0,1), the most recent reward, r_{n+1}, has the highest weight.
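A sketch of the constant step-size update (the value of α and the toy reward sequence are illustrative); note how the estimate tracks a change in the arm's payoff instead of averaging over all history:

```python
def constant_step_update(q, reward, alpha=0.1):
    """Recency-weighted update: Q_{n+1} = Q_n + alpha * (r_{n+1} - Q_n)."""
    return q + alpha * (reward - q)

# The arm's payoff shifts from 0 to 1 halfway through; the estimate follows it.
q = 0.0
for r in [0.0] * 50 + [1.0] * 50:
    q = constant_step_update(q, r, alpha=0.1)
print(round(q, 3))  # close to 1.0, dominated by the recent rewards
```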
Exploration in non-stationary bandits
• If the environment is stationary, you may want to reduce exploration over time.
• If the environment is not stationary, you can never stop exploring.
• Why? Because optimal action may change over time!
Strategies
1. Greedy
2. Epsilon-greedy
3. Softmax action selection
4. Optimistic initialization
5. Upper confidence bound
Strategy 1: Greedy
• Always pick the arm a with the best estimate of Qn(a)
• This favours exploitation over exploration.
• May result in suboptimal performance over time
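A minimal sketch of greedy selection over the current estimates (the function name is mine):

```python
def greedy_action(q_values):
    """Always pick the arm with the highest current estimate Q_n(a)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(greedy_action([0.2, 0.5, 0.1]))  # 1
```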
Strategy 2: ε-greedy action selection
• Pick ε∈(0,1) (a constant), usually small (e.g. ε=0.1).
• On every play:
• With probability ε you pull a random arm (explore).
• With probability 1-ε, pull the best arm according to current estimates (exploit).
• Advantage: Very simple! Easy to understand, easy to implement.
• Disadvantage: With a constant ε you keep exploring forever, so you asymptotically converge to a suboptimal strategy.
• But you can make ε depend on time (e.g. 1/t, 1/√t, …)
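A minimal sketch of ε-greedy selection (the tie-breaking rule and the default ε are my choices):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Return an arm index: uniformly random with prob. epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                                   # explore
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])       # exploit

print(epsilon_greedy([0.2, 0.5, 0.1]))  # usually 1, occasionally a random arm
```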
Example: 10-armed bandit problem
• The mean reward for each arm is chosen from a normal distribution with mean 0 and standard deviation 1
• Rewards are generated from a normal distribution around the true mean, with st. dev. 1.
• We average 2000 independent runs; each starts from Q(a) = 0 and does 1000 pulls using the ε–greedy strategy (a sketch of a single run is given below).
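A sketch of a single run of this testbed (the averaging over 2000 runs is omitted; the names and the use of sample-average estimates are assumptions consistent with the setup above):

```python
import random

def testbed_run(k=10, pulls=1000, epsilon=0.1, seed=0):
    """One run of the k-armed testbed with epsilon-greedy and sample-average estimates."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]   # true (hidden) action values
    q = [0.0] * k                                      # estimates start at Q(a) = 0
    n = [0] * k                                        # pull counts
    total_reward = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # explore
        else:
            a = max(range(k), key=lambda i: q[i])      # exploit
        r = rng.gauss(q_star[a], 1.0)                  # noisy reward around Q*(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                      # incremental sample average
        total_reward += r
    return total_reward / pulls

# Averaging this over many independent runs gives curves like those on the next slides.
print(testbed_run())
```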
Example: 10-armed bandit
from (Sutton and Barto, 2016)
• If ε=0, convergence to a sub-optimal strategy.
• If ε is too low, convergence is very slow.
• If ε is too high (not pictured here) rewards received during learning may be too low, and have high variance.
Example: 10-armed bandit (cont’d)
from (Sutton and Barto, 2016)
Strategy 3: Softmax action selection
• Key idea: make the action probabilities a function of the current action values, Qt(a).
• At time t, we choose action a with probability proportional to:
Pr(a_t = a) ∝ e^(Q_t(a)/τ)
• Normalize probabilities so they sum to 1 over the actions.
• τ is the temperature parameter.
• Similar to simulated annealing.
How does τ influence the exploration strategy?
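A sketch of softmax (Boltzmann) selection; the max-subtraction for numerical stability and the default τ are my additions:

```python
import math
import random

def softmax_action(q_values, tau=0.5, rng=random):
    """Sample an arm with probability proportional to exp(Q(a) / tau)."""
    m = max(q_values)                                   # subtract max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_values]
    total = sum(weights)
    threshold, cumulative = rng.random() * total, 0.0
    for a, w in enumerate(weights):
        cumulative += w
        if threshold < cumulative:
            return a
    return len(q_values) - 1                            # guard against round-off

print(softmax_action([0.2, 0.5, 0.1], tau=0.05))  # low tau: almost always the greedy arm
```

A small τ makes the choice nearly greedy, while a large τ makes it nearly uniform.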
Strategy 4: Optimistic initialization
• If you do not know anything about an action, assume it’s great! (optimism in the face of uncertainty)
• Very powerful idea – recall A* and admissible heuristics.
• Implementation:
• Initialize action values higher than they could possibly be, Q0(a)=∞, or a large positive constant.
• Choose actions according to deterministic strategy: always pick the action with the best current estimate (break ties randomly).
• Whatever you do, you will be “disappointed”, but…
• If the action is not so disappointing, you will try it often.
• If an action is very disappointing, you will learn quickly to avoid it.
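A sketch of optimistic initialization on a toy bandit (the initial value q0 = 5 and the Gaussian arms are illustrative assumptions):

```python
import random

def run_optimistic(k=3, pulls=200, q0=5.0, seed=0):
    """Purely greedy selection, but with optimistically high initial estimates Q_0(a) = q0."""
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]    # unknown true values
    q = [q0] * k                                        # every arm looks great at first
    n = [0] * k
    for _ in range(pulls):
        best = max(q)
        a = rng.choice([i for i, v in enumerate(q) if v == best])  # greedy, random ties
        r = rng.gauss(q_star[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # each real reward is a "disappointment" that lowers Q(a)
    return q_star, q

print(run_optimistic())
```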
Illustration: Optimistic initialization
from (Sutton and Barto, 2016)
Leads to more rapid exploration than ε–greedy strategy, which is bad in the short run but good in the long run.
Once the optimal strategy is found, it stays there (since there is no randomness).
A similar idea: Confidence intervals
• Suppose you have a random variable X with mean E[X]=μ.
• The standard deviation of X measures how much X is spread around its mean: σ = √(E[(X − μ)²])
• This can be estimated from samples.
• Idea for exploration: Add 1 standard deviation to the estimate Q_t(a), and choose actions with respect to this upper-confidence bound.
Remember: UCT (Upper Confidence Trees)
UCT score = Q(s, a) + c √(ln N(s) / N(s, a))   (exploitation term + exploration bonus)
Used for search and game trees!
Strategy 5: Upper-confidence bound (UCB)
• In UCB, the confidence bounds are computed from the rewards observed over time, rather than generated through simulations.
• Very similar formula. Pick a greedily using:
  a_t = argmax_a [ Q_t(a) + c √(ln n / n(a)) ]
  where n is the total number of plays so far and n(a) is the number of times action a was picked (UCB1 uses c = √2).
• Several tweaks of the upper-bound term have been proposed, and a lot of theoretical analysis has been done for this type of method.
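A sketch of UCB selection using the bound above (the constant c and the rule of trying each untried arm first are my choices):

```python
import math

def ucb_action(q_values, counts, total_plays, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln(n) / n(a)) ]."""
    for a, n_a in enumerate(counts):
        if n_a == 0:
            return a                                    # untried arm: its bonus is effectively infinite
    scores = [q + c * math.sqrt(math.log(total_plays) / n_a)
              for q, n_a in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])

# The rarely tried arm 2 wins despite its low estimate, because its exploration bonus is large.
print(ucb_action([0.2, 0.5, 0.1], counts=[10, 30, 5], total_plays=45))
```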
Which algorithm to use?
• Good news: All algorithms converge in the limit to the correct values (given appropriate parameter changes if need be), assuming the environment is stationary.
• UCB has provably the fastest convergence when considering the regret
  regret(t) = Σ_{i=1..t} ( max_a Q*(a) − Q*(a_i) ).
  This sum is O(log(t)) for UCB, and a matching lower bound exists.
• However, when considering a finite training period after which the greedy policy is used forever, simpler strategies often perform better in practice.
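As a small illustration (the function name and the toy values are mine), the regret of a sequence of pulls measured against the true action values:

```python
def regret(q_star, actions):
    """Total regret: sum over pulls of (best true value - true value of the chosen arm)."""
    best = max(q_star)
    return sum(best - q_star[a] for a in actions)

print(regret([0.2, 0.5, 0.1], [1, 1, 0, 2]))  # 0.0 + 0.0 + 0.3 + 0.4 = 0.7
```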
Contextual bandits
• The usual bandit problem has no notion of “state”; we just observe some interactions and payoffs.
• In general, more information is available; e.g., when placing ads on a webpage, you can observe the words currently displayed.
• Contextual bandits have some state information, summarized in a vector of measurements s.
• The state is chosen by the environment. The agent chooses the action. The action's value depends on the state: Q*(s, a) = E[r | s, a]
• Exploration methods remain similar (greedy, Boltzmann, UCB).
• It is necessary to learn the parameters w of the value function in this problem.
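As a hedged illustration only (the slides do not specify the form of the model): a contextual bandit sketch in which each action's value is approximated by a linear function w_a · s and selection is ε-greedy; all names are mine.

```python
import random

def predict(w, s):
    """Linear value estimate Q(s, a) ≈ w_a · s for one action's weight vector."""
    return sum(wi * si for wi, si in zip(w, s))

def choose_action(weights, s, epsilon=0.1, rng=random):
    """weights[a] is the weight vector for action a; s is the observed context."""
    if rng.random() < epsilon:
        return rng.randrange(len(weights))
    return max(range(len(weights)), key=lambda a: predict(weights[a], s))

def update(w, s, reward, lr=0.1):
    """One gradient step pushing w_a · s toward the observed reward."""
    error = reward - predict(w, s)
    return [wi + lr * error * si for wi, si in zip(w, s)]

# Toy interaction: 2 actions, 3 context features.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
s = [1.0, 0.5, -0.2]
a = choose_action(weights, s)
weights[a] = update(weights[a], s, reward=1.0)
print(a, weights[a])
```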
1. Come up with an example where this strategy results in not finding the optimal action.
2. Come up with the general condition where this strategy results in a suboptimal solution.