Lecture 13. Multi-armed bandits
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Stochastic multi-armed bandits
∗ Where we learn to take actions; we receive only indirect supervision in the form of rewards; and we only observe rewards for actions taken – the simplest setting with an explore-exploit trade-off.
∗ Sequential decision making under uncertainty
∗ (ε-)greedy
∗ UCB algorithm
Exploration vs. Exploitation
• “Multi-armed bandit” (MAB)
∗ Simplest setting for balancing exploration, exploitation
∗ Same family of ML tasks as reinforcement learning
• Numerous applications
∗ Online advertising
∗ Portfolio selection
∗ Caching in databases
∗ Stochastic search in games (e.g. AlphaGo!)
∗ Adaptive A/B testing
∗ ….
Stochastic MAB setting
• Possible actions 1, …, k called “arms”
∗ Arm i has distribution P_i on bounded rewards with mean μ_i
• In round t = 1 … T
∗ Play action i_t ∈ {1, …, k} (possibly randomly)
∗ Receive reward X_{i_t}(t) ~ P_{i_t}
• Goal: minimise cumulative regret
  μ*T − ∑_{t=1}^{T} E[X_{i_t}(t)],   where μ* = max_i μ_i
  (μ*T is the best expected cumulative reward with hindsight; the sum is the expected cumulative reward of the bandit)
∗ Intuition: Do as well as a rule that is simple but has knowledge of the future
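To make the definition concrete, here is a minimal Python sketch (the function name and example values are illustrative, not from the lecture) that computes cumulative regret from the true arm means and the sequence of arms played:

import numpy as np

def cumulative_regret(true_means, chosen_arms):
    # mu* times T, minus the expected reward collected by the plays actually made
    true_means = np.asarray(true_means, dtype=float)
    mu_star = true_means.max()                        # mu* = max_i mu_i
    expected = true_means[np.asarray(chosen_arms)]    # E[X_{i_t}] for each round t
    return mu_star * len(chosen_arms) - expected.sum()

# e.g. arms with means (0.1, 0.5, 0.9): playing arms [0, 1, 2, 2] gives 0.9*4 − 2.4 = 1.2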
Greedy
• At round t
∗ Estimate value of each arm i as average reward observed
  Q_{t-1}(i) = ( ∑_{s=1}^{t-1} X_{i_s}(s) 1[i_s = i] ) / ( ∑_{s=1}^{t-1} 1[i_s = i] ),   if ∑_{s=1}^{t-1} 1[i_s = i] > 0
  Q_{t-1}(i) = Q_0,   otherwise
  … some init constant Q_0(i) = Q_0 used until arm i has been pulled
∗ Exploit, baby, exploit!
  i_t ∈ argmax_{1≤i≤k} Q_{t-1}(i)
∗ Tie breaking randomly
• What do you expect this to do? Effect of initial Qs?
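A minimal Python sketch of the above, using the incremental form of the sample-average update (class and variable names are illustrative only):

import numpy as np

class GreedyAgent:
    """Greedy bandit agent: always plays an arm with the highest estimated value."""
    def __init__(self, k, q0=0.0):
        self.q = np.full(k, float(q0))    # Q estimates, initialised to Q0
        self.n = np.zeros(k, dtype=int)   # N(i): number of pulls of arm i

    def select(self):
        best = np.flatnonzero(self.q == self.q.max())
        return np.random.choice(best)     # tie breaking randomly

    def update(self, arm, reward):
        self.n[arm] += 1
        # incremental update of the sample average of rewards observed for this arm;
        # after the first pull the Q0 value is replaced by the observed reward
        self.q[arm] += (reward - self.q[arm]) / self.n[arm]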
ε-Greedy
• At round t
∗ Estimate value of each arm i as average reward observed
  Q_{t-1}(i) = ( ∑_{s=1}^{t-1} X_{i_s}(s) 1[i_s = i] ) / ( ∑_{s=1}^{t-1} 1[i_s = i] ),   if ∑_{s=1}^{t-1} 1[i_s = i] > 0
  Q_{t-1}(i) = Q_0,   otherwise
  … some init constant Q_0(i) = Q_0 used until arm i has been pulled
∗ Exploit, baby, exploit… probably; or possibly explore
  i_t = argmax_{1≤i≤k} Q_{t-1}(i)   with probability 1 − ε
  i_t ~ Unif(1, …, k)               with probability ε
∗ Tie breaking randomly
• Hyperparameter ε controls exploration vs. exploitation
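Continuing the illustrative GreedyAgent sketch above, ε-greedy only changes the selection rule:

class EpsilonGreedyAgent(GreedyAgent):
    def __init__(self, k, epsilon=0.1, q0=0.0):
        super().__init__(k, q0)
        self.epsilon = epsilon

    def select(self):
        # explore uniformly at random with probability epsilon, otherwise exploit as before
        if np.random.rand() < self.epsilon:
            return np.random.randint(len(self.q))
        return super().select()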
Kicking the tyres
• 10-armed bandit
• Rewards P_i = Normal(μ_i, 1) with μ_i ~ Normal(0, 1)
• Play game for 300 rounds
• Repeat 1,000 games, plot average per-round rewards
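The exact experimental code is not given in the lecture; a rough sketch of the testbed described above, reusing the illustrative agents from the earlier sketches, might look like:

def run_game(make_agent, k=10, rounds=300):
    mu = np.random.normal(0.0, 1.0, size=k)           # arm means mu_i ~ Normal(0, 1)
    agent = make_agent(k)
    rewards = np.zeros(rounds)
    for t in range(rounds):
        arm = agent.select()
        rewards[t] = np.random.normal(mu[arm], 1.0)    # reward ~ Normal(mu_i, 1)
        agent.update(arm, rewards[t])
    return rewards

# average per-round reward over 1,000 games, e.g. for 0.1-greedy:
# curve = np.mean([run_game(lambda k: EpsilonGreedyAgent(k, 0.1)) for _ in range(1000)], axis=0)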
Kicking the tyres: More rounds
• Greedy increases fast, but levels off at low rewards
• ε-Greedy does better long-term by exploring
• 0.01-Greedy initially slow (little exploration) but eventually superior to 0.1-Greedy (exploits after enough exploration)
Optimistic initialisation improves Greedy
  i_t = argmax_{1≤i≤k} Q_{t-1}(i)   with probability 1 − ε
  i_t ~ Unif(1, …, k)               with probability ε
• Pessimism: init Q’s below observable rewards → only try one arm
• Optimism: init Q’s above observable rewards → explore arms once
• Middle-ground init Q → explore arms at most once
• But pure greedy never explores an arm more than once
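In terms of the illustrative GreedyAgent sketch from earlier, optimistic initialisation is just a choice of Q0 well above any plausible reward (in the testbed above, rewards are roughly Normal(μ_i, 1) with μ_i ~ Normal(0, 1)):

optimistic_greedy = lambda k: GreedyAgent(k, q0=5.0)     # every arm looks great until pulled once
pessimistic_greedy = lambda k: GreedyAgent(k, q0=-5.0)   # the first arm pulled looks best thereafter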
Limitations of ε-Greedy
• While we can improve on basic Greedy with optimistic initialisation and decreasing ε…
• Exploration and exploitation are too “distinct”
∗ Exploration actions completely blind to promising arms
∗ Initialisation trick only helps with “cold start”
• Exploitation is blind to confidence of estimates
• These limitations are serious in practice
(Upper) confidence interval for Q estimates
• Theorem: Hoeffding’s inequality
∗ Let X_1, …, X_n be i.i.d. random variables in [0,1] with mean μ; denote by X̄_n their sample mean
∗ For any ε ∈ (0,1), with probability at least 1 − ε
  μ ≤ X̄_n + √( log(1/ε) / (2n) )
• Application to the Q_{t-1}(i) estimate – also an i.i.d. mean!
∗ Take n = N_{t-1}(i) = ∑_{s=1}^{t-1} 1[i_s = i], the number of plays of arm i
∗ Then X̄_n = Q_{t-1}(i)
∗ Critical level ε = 1/t (Lai & Robbins ’85); take ε = 1/t⁴
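Substituting n = N_{t-1}(i) and ε = 1/t⁴ into the bound gives exactly the exploration bonus used by UCB on the next slide:

  μ_i ≤ μ̂_{t-1}(i) + √( log(t⁴) / (2 N_{t-1}(i)) ) = μ̂_{t-1}(i) + √( 2 log(t) / N_{t-1}(i) )

where μ̂_{t-1}(i) denotes the sample mean of the rewards observed from arm i.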
Upper Confidence Bound (UCB) algorithm
• At round t
∗ Estimate value of each arm i as average reward observed
  Q_{t-1}(i) = μ̂_{t-1}(i) + √( 2 log(t) / N_{t-1}(i) ),   if ∑_{s=1}^{t-1} 1[i_s = i] > 0
  Q_{t-1}(i) = Q_0,   otherwise
  … some constant Q_0(i) = Q_0 used until arm i has been pulled; where:
  N_{t-1}(i) = ∑_{s=1}^{t-1} 1[i_s = i],   μ̂_{t-1}(i) = ( ∑_{s=1}^{t-1} X_{i_s}(s) 1[i_s = i] ) / ( ∑_{s=1}^{t-1} 1[i_s = i] )
∗ “Optimism in the face of uncertainty”
  i_t ~ argmax_{1≤i≤k} Q_{t-1}(i)   … tie breaking randomly
• Addresses several limitations of ε-greedy
• Can “pause” in a bad arm for a while, but eventually find best
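A minimal Python sketch of this rule, in the same illustrative style as the earlier agents. Here Q0 = +∞ is an assumed convenience so every arm is tried once before the bonus applies, and the exploration constant is a parameter in anticipation of the tunable ρ discussed under “Notes on UCB”:

import numpy as np

class UCBAgent:
    def __init__(self, k, rho=2.0):
        self.rho = rho                     # exploration constant; rho = 2 matches the slide
        self.mu_hat = np.zeros(k)          # sample mean of observed rewards per arm
        self.n = np.zeros(k, dtype=int)    # N(i): number of pulls of arm i
        self.t = 0

    def select(self):
        self.t += 1
        bonus = np.sqrt(self.rho * np.log(self.t) / np.maximum(self.n, 1))
        q = np.where(self.n > 0, self.mu_hat + bonus, np.inf)   # unpulled arms get Q0 = +inf
        best = np.flatnonzero(q == q.max())
        return np.random.choice(best)      # tie breaking randomly

    def update(self, arm, reward):
        self.n[arm] += 1
        self.mu_hat[arm] += (reward - self.mu_hat[arm]) / self.n[arm]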
Kicking the tyres: How does UCB compare?
• UCB quickly overtakes the ε-Greedy approaches
• Continues to outpace on per-round rewards for some time
• More striking when viewed as mean cumulative rewards
Notes on UCB
• Theoretical regret bounds, optimal up to a multiplicative constant
∗ Regret grows like O(log t), i.e. average per-round regret goes to zero!
• Tunable ρ > 0 exploration hyperparameter replaces the “2”
  Q_{t-1}(i) = μ̂_{t-1}(i) + √( ρ log(t) / N_{t-1}(i) ),   if ∑_{s=1}^{t-1} 1[i_s = i] > 0
  Q_{t-1}(i) = Q_0,   otherwise
∗ Captures different ε rates & bounded rewards outside [0,1]
• Many variations, e.g. different confidence bounds
• Basis for Monte Carlo Tree Search used in AlphaGo!
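With the illustrative UCBAgent sketch from the previous slide, tuning ρ is just a constructor argument:

cautious_ucb = lambda k: UCBAgent(k, rho=0.5)   # less exploration than the default rho = 2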
MABs vs. Reinforcement Learning
• Contextual bandits introduce state
∗ But don’t model actions as causing state transitions
∗ New state arrives “somehow”
• RL has rounds of states, actions, rewards too
• But (state, action) determines the next state
∗ E.g. playing Go, moving a robot, planning logistics
∗ Covered in COMP90054, as it involves elements of planning
• Both RL and contextual bandits learn value functions with regression, but RL “rolls out” predicted rewards into the future due to state transitions
This lecture
• Stochastic multi-armed bandits
∗ Sequential decision making under uncertainty
∗ Simplest explore-vs-exploit setting
∗ (ε-)greedy, UCB
• Many applications, variations:
∗ Adversarial MAB: rewards not stochastic, but anything
∗ Contextual bandits: act based on context feature vector
∗ Reinforcement learning: more general setting