Lecturer: Ben Rubinstein
Lecture 17. Multi-armed bandits
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
• Bandit setting vs Learning with experts
∗ Have to pick an expert (aka. arm)
∗ Observe rewards only for chosen arm
• Aka. Sequential decision making under uncertainty
∗ Simplest explore-vs-exploit setting
∗ Incredibly rich area with heaps of industrial applications
• Basic algorithms
∗ Greedy
∗ ε-Greedy
∗ Upper Confidence Bound (UCB)
• More: Contextual bandits, RL, …
Multi-Armed Bandits
Where we learn to take actions; we receive only
indirect supervision in the form of rewards; and we
only observe rewards for actions taken – the simplest
setting with an explore-exploit trade-off.
Exploration vs. Exploitation
• “Multi-armed bandit” (MAB)
∗ Simplest setting for balancing exploration, exploitation
∗ Same family of ML tasks as reinforcement learning
• Numerous applications
∗ Online advertising
∗ Caching in databases
∗ Stochastic search in games (e.g. AlphaGo!)
∗ Adaptive A/B testing
∗ ….
Stochastic MAB setting
• Possible actions $1, \dots, k$ called "arms"
∗ Arm $i$ has distribution $P_i$ on bounded rewards with mean $\mu_i$
• In round $t = 1, \dots, T$
∗ Play action $i_t \in \{1, \dots, k\}$ (possibly randomly)
∗ Receive reward $R_{i_t}(t) \sim P_{i_t}$
• Goal: minimise cumulative regret
∗ $\mu^* T - \sum_{t=1}^{T} E[R_{i_t}(t)]$, where $\mu^* = \max_i \mu_i$
(first term: best expected cumulative reward with hindsight; second term: expected cumulative reward of the bandit)
∗ Intuition: Do as well as a rule that is simple but has knowledge of the future
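To make the protocol concrete, here is a minimal Python sketch of the stochastic MAB loop and the regret quantity. It assumes Gaussian rewards (matching the "kicking the tyres" experiments later in the lecture); the `pull` helper and the uniformly random placeholder policy are illustrative, not part of the lecture.

```python
# Minimal sketch of the stochastic MAB protocol and cumulative regret
# (illustrative only; Gaussian rewards are an assumption).
import numpy as np

rng = np.random.default_rng(0)
k, T = 10, 300
mu = rng.normal(0.0, 1.0, size=k)        # unknown arm means mu_i
mu_star = mu.max()                       # mu* = max_i mu_i

def pull(i):
    """Sample a reward R_i(t) ~ P_i, here Normal(mu_i, 1)."""
    return rng.normal(mu[i], 1.0)

total_reward = 0.0
for t in range(1, T + 1):
    i_t = rng.integers(k)                # placeholder policy: pick an arm uniformly
    total_reward += pull(i_t)

regret = mu_star * T - total_reward      # empirical estimate of cumulative regret
```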
Greedy
• At round t
∗ Estimate value of each arm i as average reward observed
$$Q_{t-1}(i) = \begin{cases} \dfrac{\sum_{s=1}^{t-1} R_i(s)\,\mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\[1ex] Q_0, & \text{otherwise} \end{cases}$$
… some init constant $Q_0(i) = Q_0$ used until arm $i$ has been pulled
∗ Exploit, baby, exploit!
$i_t \in \arg\max_{1 \le i \le k} Q_{t-1}(i)$
∗ Tie breaking randomly
• What do you expect this to do? Effect of initial Qs?
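A minimal sketch of Greedy under the setup above; `pull`, `run_greedy` and the default `Q0` are illustrative names and choices, not from the slides. The question about the effect of the initial Q values can be probed by passing different `Q0` arguments.

```python
# Sketch of the Greedy policy: empirical-mean value estimates, random
# tie-breaking, and an init constant Q0 used until an arm has been pulled.
import numpy as np

def run_greedy(pull, k, T, Q0=0.0, rng=np.random.default_rng(1)):
    sums = np.zeros(k)                   # running reward sums per arm
    counts = np.zeros(k)                 # N_{t-1}(i): number of pulls per arm
    rewards = []
    for t in range(1, T + 1):
        # Q_{t-1}(i): empirical mean if arm i has been pulled, else Q0
        Q = np.where(counts > 0, sums / np.maximum(counts, 1), Q0)
        i_t = rng.choice(np.flatnonzero(Q == Q.max()))   # break ties randomly
        r = pull(i_t)
        sums[i_t] += r
        counts[i_t] += 1
        rewards.append(r)
    return np.array(rewards)
```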
ε-Greedy
• At round t
∗ Estimate value of each arm i as average reward observed
$$Q_{t-1}(i) = \begin{cases} \dfrac{\sum_{s=1}^{t-1} R_i(s)\,\mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\[1ex] Q_0, & \text{otherwise} \end{cases}$$
… some init constant $Q_0(i) = Q_0$ used until arm $i$ has been pulled
∗ Exploit, baby, exploit… probably; or possibly explore
$$i_t \sim \begin{cases} \arg\max_{1 \le i \le k} Q_{t-1}(i) & \text{w.p. } 1 - \varepsilon \\ \mathrm{Unif}(1, \dots, k) & \text{w.p. } \varepsilon \end{cases}$$
∗ Tie breaking randomly
• Hyperparam. ε controls exploration vs. exploitation
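A sketch of ε-Greedy mirroring the Greedy sketch above; `pull`, `run_eps_greedy`, and the default values are illustrative, not from the slides.

```python
# Sketch of epsilon-Greedy: same value estimates as Greedy, but explore
# uniformly at random with probability eps (the hyperparameter from the slide).
import numpy as np

def run_eps_greedy(pull, k, T, eps=0.1, Q0=0.0, rng=np.random.default_rng(1)):
    sums, counts, rewards = np.zeros(k), np.zeros(k), []
    for t in range(1, T + 1):
        Q = np.where(counts > 0, sums / np.maximum(counts, 1), Q0)
        if rng.random() < eps:
            i_t = rng.integers(k)                            # explore w.p. eps
        else:
            i_t = rng.choice(np.flatnonzero(Q == Q.max()))   # exploit, random ties
        r = pull(i_t)
        sums[i_t] += r
        counts[i_t] += 1
        rewards.append(r)
    return np.array(rewards)
```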
Kicking the tyres
• 10-armed bandit
• Rewards $P_i = \mathrm{Normal}(\mu_i, 1)$ with $\mu_i \sim \mathrm{Normal}(0, 1)$
• Play game for 300 rounds
• Repeat 1,000 games, plot average per-round rewards
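A sketch of this experimental protocol, reusing the hypothetical `run_greedy` helper from the earlier sketch; `make_bandit` is likewise an illustrative name.

```python
# 10-armed Gaussian testbed: mu_i ~ Normal(0,1), rewards ~ Normal(mu_i, 1);
# play 300 rounds, repeat over 1,000 games, average per-round rewards.
import numpy as np

def make_bandit(rng):
    mu = rng.normal(0.0, 1.0, size=10)
    return lambda i: rng.normal(mu[i], 1.0)

rng = np.random.default_rng(2)
games = np.stack([run_greedy(make_bandit(rng), k=10, T=300) for _ in range(1000)])
avg_per_round = games.mean(axis=0)       # the curve plotted on the next slides
```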
Kicking the tyres: More rounds
• Greedy increases fast, but levels off at low rewards
• 𝜀𝜀-Greedy does better long-term by exploring
• 0.01-Greedy initially slow (little exploration) but eventually superior to 0.1-Greedy (exploits after enough exploration)
Optimistic initialisation improves Greedy
• Pessimism: Init Q's below observable rewards ⇒ only try one arm
• Optimism: Init Q's above observable rewards ⇒ explore arms once
• Middle-ground init Q ⇒ explore arms at most once
But pure Greedy never explores an arm more than once
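For example, reusing the hypothetical `run_greedy` and `make_bandit` helpers sketched earlier (illustrative names, not from the slides): rewards in this lecture's experiments rarely fall outside roughly [-4, 4], so Q0 = 5 is an optimistic initialisation that forces every arm to be tried once before Greedy settles into exploitation.

```python
# Optimistic initialisation (illustrative): Q0 above any plausibly observable reward.
import numpy as np

rewards_optimistic = run_greedy(make_bandit(np.random.default_rng(4)), k=10, T=300, Q0=5.0)
```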
Limitations of ε-Greedy
• While we can improve on basic Greedy with optimistic initialisation and decreasing ε…
• Exploration and exploitation are too “distinct”
∗ Exploration actions completely blind to promising arms
∗ Initialisation trick only helps with “cold start”
• Exploitation is blind to confidence of estimates
• These limitations are serious in practice
Mini Summary
• Multi-armed bandit setting
∗ Simplest instance of an explore-exploit problem
∗ Greedy approaches cover exploitation fine
∗ Greedy approaches overly simplistic with exploration (if they explore at all!)
• Compared to: learning with experts
∗ Superficial changes: Experts → Arms; Losses → Rewards
∗ Choose one arm (like probabilistic experts algorithm)
∗ Big difference: Only observe rewards on chosen arm
Next: A better way, via the optimism-in-the-face-of-uncertainty principle
Upper-Confidence Bound (UCB)
Optimism in the face of uncertainty;
A smarter way to balance exploration-exploitation.
(Upper) confidence interval for Q estimates
• Theorem: Hoeffding’s inequality
∗ Let $R_1, \dots, R_n$ be i.i.d. random variables in $[0,1]$ with mean $\mu$; denote by $\bar{R}_n$ their sample mean
∗ For any $\varepsilon \in (0,1)$, with probability at least $1 - \varepsilon$
$$\mu \le \bar{R}_n + \sqrt{\frac{\log(1/\varepsilon)}{2n}}$$
• Application to the $Q_{t-1}(i)$ estimate – it is also a sample mean of i.i.d. rewards!
∗ Take $n = N_{t-1}(i) = \sum_{s=1}^{t-1} \mathbb{1}[i_s = i]$, the number of plays of arm $i$
∗ Then $\bar{R}_n = Q_{t-1}(i)$
∗ Critical level $\varepsilon = 1/t$ (Lai & Robbins '85); take $\varepsilon = 1/t^4$
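Plugging $\varepsilon = 1/t^4$ and $n = N_{t-1}(i)$ into the bound above gives the exploration bonus that appears in the UCB rule on the next slide:
$$\sqrt{\frac{\log(1/\varepsilon)}{2n}} = \sqrt{\frac{\log t^4}{2\,N_{t-1}(i)}} = \sqrt{\frac{4\log t}{2\,N_{t-1}(i)}} = \sqrt{\frac{2\log t}{N_{t-1}(i)}}$$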
Upper Confidence Bound (UCB) algorithm
• At round t
∗ Estimate value of each arm i as average reward observed plus an exploration bonus
$$Q_{t-1}(i) = \begin{cases} \hat{\mu}_{t-1}(i) + \sqrt{\dfrac{2\log t}{N_{t-1}(i)}}, & \text{if } N_{t-1}(i) > 0 \\[1ex] Q_0, & \text{otherwise} \end{cases}$$
…some constant $Q_0(i) = Q_0$ used until arm $i$ has been pulled; where:
$$N_{t-1}(i) = \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] \qquad\qquad \hat{\mu}_{t-1}(i) = \frac{\sum_{s=1}^{t-1} R_i(s)\,\mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}$$
∗ “Optimism in the face of uncertainty”
$i_t \in \arg\max_{1 \le i \le k} Q_{t-1}(i)$
…tie breaking randomly
• Addresses several limitations of ε-Greedy
• Can “pause” in a bad arm for a while, but eventually find best
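A minimal UCB sketch under the same illustrative setup as the earlier Greedy sketch; the optimistic default `Q0 = np.inf` (forcing each arm to be tried once) is an assumption, not from the slides.

```python
# Sketch of UCB: empirical mean plus sqrt(2 log t / N_{t-1}(i)) bonus,
# Q0 for unpulled arms, random tie-breaking.
import numpy as np

def run_ucb(pull, k, T, Q0=np.inf, rng=np.random.default_rng(3)):
    sums = np.zeros(k)
    counts = np.zeros(k)                             # N_{t-1}(i)
    rewards = []
    for t in range(1, T + 1):
        with np.errstate(divide="ignore", invalid="ignore"):
            mean = sums / counts                     # hat{mu}_{t-1}(i); nan if unpulled
            bonus = np.sqrt(2.0 * np.log(t) / counts)
        Q = np.where(counts > 0, mean + bonus, Q0)   # unpulled arms fall back to Q0
        i_t = rng.choice(np.flatnonzero(Q == Q.max()))   # tie-breaking randomly
        r = pull(i_t)
        sums[i_t] += r
        counts[i_t] += 1
        rewards.append(r)
    return np.array(rewards)
```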
Kicking the tyres: How does UCB compare?
• UCB quickly overtakes the ε-Greedy approaches
• Continues to outpace on per round rewards for some time
• More striking when viewed as mean cumulative rewards
Notes on UCB
• Theoretical regret bounds: optimal up to a multiplicative constant
∗ Grows like $O(\log t)$, i.e. the average per-round regret goes to zero!
• Tunable $\rho > 0$ exploration hyperparameter replaces the "2"
$$Q_{t-1}(i) = \begin{cases} \hat{\mu}_{t-1}(i) + \sqrt{\dfrac{\rho \log t}{N_{t-1}(i)}}, & \text{if } N_{t-1}(i) > 0 \\[1ex] Q_0, & \text{otherwise} \end{cases}$$
∗ Captures different $\varepsilon$ rates & bounded rewards outside $[0,1]$
• Many variations e.g. different confidence bounds
• Basis for Monte Carlo Tree Search used in AlphaGo!
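As a sketch, the only change to the earlier UCB code would be a ρ-parameterised bonus (here `rho` is assumed to be passed in as an extra hyperparameter):

```python
import numpy as np

def ucb_bonus(t, counts, rho=2.0):
    # Exploration bonus with tunable rho; rho = 2 recovers the bonus used above.
    return np.sqrt(rho * np.log(t) / counts)
```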
Mini Summary
• Addressing limitations of greedy
∗ Exploration blind to how good an arm is
∗ Exploration/exploitation blind to confidence of arm estimates
• Upper-confidence bound (UCB) algorithm
∗ Exploration boost: Upper-confidence bound (like PAC bound!)
∗ Example of the optimism-in-the-face-of-uncertainty principle
∗ Achieves practical performance and good regret bounds
Next: Wrap-up with a few related directions
Beyond basic bandits
Adding state with contextual bandits;
State transitions/dynamics with reinforcement learning.
But wait, there’s more!! Contextual bandits
• Adds concept of “state” of the world
∗ Arms’ rewards now depend on state
∗ E.g. best ad depends on user and webpage
• Each round, observe arbitrary context (feature) vector representing state, $\boldsymbol{X}_i(t)$ per arm
∗ Profile of web page visitor (state)
∗ Web page content (state)
∗ Features of a potential ad (arm)
• Reward estimation
∗ Was unconditional: $E[R_i(t)]$
∗ Now conditional: $E[R_i(t) \mid \boldsymbol{X}_i(t)]$
• A regression problem!!!
Still choose the arm maximising a UCB; but the UCB is now on a regression prediction given the context vector, rather than on a simple mean.
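The lecture doesn't give an algorithm here, but LinUCB (named in the final summary) is one standard instantiation of "UCB on a regression prediction". Below is a heavily simplified, LinUCB-flavoured sketch; the dimension `d`, the width parameter `alpha`, and all helper names are illustrative assumptions, not from the slides.

```python
# LinUCB-flavoured sketch (disjoint model): per-arm ridge regression of rewards
# on contexts, plus a confidence-width bonus on each arm's prediction.
import numpy as np

d, k = 5, 10                                   # context dimension, number of arms (assumed)
A = [np.eye(d) for _ in range(k)]              # per-arm X^T X + I (ridge statistics)
b = [np.zeros(d) for _ in range(k)]            # per-arm X^T r

def choose_arm(contexts, alpha=1.0):
    """contexts: one d-dimensional context vector x_i(t) per arm."""
    ucbs = []
    for x, A_i, b_i in zip(contexts, A, b):
        A_inv = np.linalg.inv(A_i)
        theta = A_inv @ b_i                    # regression estimate of E[R_i(t) | x]
        ucbs.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))   # prediction + bonus
    return int(np.argmax(ucbs))

def update(i, x, r):
    A[i] += np.outer(x, x)                     # accumulate sufficient statistics
    b[i] += r * x
```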
MABs vs. Reinforcement Learning
• Contextual bandits introduce state
∗ But don’t model actions as causing state transitions
∗ New state arrives “somehow”
• RL has rounds of states, actions, rewards too
• But (state,action) determines the next state
∗ E.g. playing Go, moving a robot, planning logistics
• Thus, RL still learns value functions with regression, but has to “roll out” predicted rewards into the future
Mini Summary
• Lecture: Stochastic multi-armed bandits
∗ Sequential decision making under uncertainty
∗ Simplest explore-vs-exploit setting
∗ (ε-)Greedy, UCB, LinUCB
• Related directions:
∗ Contextual bandits: adding state; regression estimates rewards
∗ Reinforcement learning: introducing state transitions
Next lecture: It’s Bayesian time!