
Lecturer: Ben Rubinstein

Lecture 17. Multi-armed bandits

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne


This lecture

• Bandit setting vs Learning with experts
∗ Have to pick an expert (aka. arm)
∗ Observe rewards only for chosen arm

• Aka. Sequential decision making under uncertainty
∗ Simplest explore-vs-exploit setting
∗ Incredibly rich area with heaps of industrial applications

• Basic algorithms
∗ Greedy
∗ ε-Greedy
∗ Upper Confidence Bound (UCB)

• More: Contextual bandits, RL, …


Multi-Armed Bandits


Where we learn to take actions; we receive only indirect supervision in the form of rewards; and we only observe rewards for actions taken – the simplest setting with an explore-exploit trade-off.



Exploration vs. Exploitation

• “Multi-armed bandit” (MAB)
∗ Simplest setting for balancing exploration, exploitation
∗ Same family of ML tasks as reinforcement learning

• Numerous applications
∗ Online advertising
∗ Caching in databases
∗ Stochastic search in games (e.g. AlphaGo!)
∗ Adaptive A/B testing
∗ ….


Stochastic MAB setting

• Possible actions 1, …, k called “arms”
∗ Arm i has distribution P_i on bounded rewards with mean μ_i

• In round t = 1, …, T
∗ Play action i_t ∈ {1, …, k} (possibly randomly)
∗ Receive reward R_{i_t}(t) ~ P_{i_t}

• Goal: minimise cumulative regret

$$\mu^* T - \sum_{t=1}^{T} E\big[R_{i_t}(t)\big], \qquad \text{where } \mu^* = \max_i \mu_i$$

∗ The first term is the best expected cumulative reward with hindsight; the second is the expected cumulative reward of the bandit
∗ Intuition: Do as well as a rule that is simple but has knowledge of the future
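To make the definition concrete, here is a minimal Python sketch (function name ours, not from the lecture) of empirical cumulative regret in simulation, where the true arm means are known:

```python
import numpy as np

def cumulative_regret(mu, chosen_arms):
    """Cumulative regret after T rounds, given true arm means `mu`
    (knowable in simulation only) and the arm index played each round."""
    mu = np.asarray(mu)
    best = mu.max()  # mu* = max_i mu_i
    # Expected sum of rewards = sum of the chosen arms' true means
    return best * len(chosen_arms) - mu[np.asarray(chosen_arms)].sum()

# e.g. cumulative_regret([0.1, 0.9], [0, 1, 1])  ->  0.9*3 - (0.1+0.9+0.9) = 0.8
```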


Greedy

• At round t
∗ Estimate value of each arm i as average reward observed

$$Q_{t-1}(i) = \begin{cases} \dfrac{\sum_{s=1}^{t-1} R_i(s)\,\mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$$

…some initial constant Q_0(i) = Q_0 used until arm i has been pulled

∗ Exploit, baby, exploit!

$$i_t \in \arg\max_{1 \le i \le k} Q_{t-1}(i)$$

∗ Ties broken randomly

• What do you expect this to do? Effect of initial Qs?
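As a concrete illustration, a minimal Python sketch of Greedy (the class and helper names are ours, not from the lecture); the `q0` argument is the initial constant Q_0:

```python
import numpy as np

rng = np.random.default_rng(0)

def argmax_random_tie(values):
    """Index of a maximum element, breaking ties uniformly at random."""
    values = np.asarray(values)
    return rng.choice(np.flatnonzero(values == values.max()))

class Greedy:
    def __init__(self, k, q0=0.0):
        self.q = np.full(k, q0, dtype=float)  # Q estimates, initialised to Q0
        self.n = np.zeros(k, dtype=int)       # N(i): number of pulls of arm i

    def select(self):
        return argmax_random_tie(self.q)      # pure exploitation

    def update(self, arm, reward):
        self.n[arm] += 1
        # incremental running mean: Q <- Q + (R - Q)/N (equals the sample mean)
        self.q[arm] += (reward - self.q[arm]) / self.n[arm]
```

Note that setting `q0` above any observable reward gives the optimistic initialisation discussed on a later slide: every arm looks attractive until it has been pulled at least once.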


ε-Greedy

• At round t
∗ Estimate value of each arm i as average reward observed

$$Q_{t-1}(i) = \begin{cases} \dfrac{\sum_{s=1}^{t-1} R_i(s)\,\mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$$

…some initial constant Q_0(i) = Q_0 used until arm i has been pulled

∗ Exploit, baby, exploit… probably; or possibly explore

$$i_t \sim \begin{cases} \arg\max_{1 \le i \le k} Q_{t-1}(i) & \text{w.p. } 1 - \varepsilon \\ \mathrm{Unif}(1, \dots, k) & \text{w.p. } \varepsilon \end{cases}$$

∗ Ties broken randomly

• Hyperparameter ε controls exploration vs. exploitation
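Continuing the sketch above, ε-Greedy only changes the selection step:

```python
class EpsilonGreedy(Greedy):
    """Extends the Greedy sketch above; epsilon is the exploration probability."""
    def __init__(self, k, epsilon=0.1, q0=0.0):
        super().__init__(k, q0)
        self.epsilon = epsilon

    def select(self):
        if rng.random() < self.epsilon:
            return int(rng.integers(len(self.q)))  # explore: uniform random arm
        return argmax_random_tie(self.q)           # exploit: greedy choice
```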


Kicking the tyres

• 10-armed bandit
• Rewards $P_i = \mathrm{Normal}(\mu_i, 1)$ with $\mu_i \sim \mathrm{Normal}(0, 1)$
• Play game for 300 rounds
• Repeat 1,000 games, plot average per-round rewards

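A sketch of this experiment under the stated assumptions, reusing the agent classes and `rng` sketched earlier (harness name ours):

```python
import numpy as np

def run_testbed(make_agent, k=10, rounds=300, games=1000):
    """Average per-round reward across independent games."""
    avg = np.zeros(rounds)
    for _ in range(games):
        mu = rng.normal(0.0, 1.0, size=k)      # true means mu_i ~ Normal(0, 1)
        agent = make_agent(k)
        for t in range(rounds):
            arm = agent.select()
            reward = rng.normal(mu[arm], 1.0)  # R ~ Normal(mu_i, 1)
            agent.update(arm, reward)
            avg[t] += reward
    return avg / games

# e.g. plot run_testbed(Greedy) against
#      run_testbed(lambda k: EpsilonGreedy(k, 0.1)) and epsilon = 0.01
```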

Kicking the tyres: More rounds

• Greedy increases fast, but levels off at low rewards
• ε-Greedy does better long-term by exploring
• 0.01-Greedy initially slow (little exploration) but eventually superior to 0.1-Greedy (exploits after enough exploration)


Optimistic initialisation improves Greedy

Recall the ε-Greedy rule:

$$i_t \sim \begin{cases} \arg\max_{1 \le i \le k} Q_{t-1}(i) & \text{w.p. } 1 - \varepsilon \\ \mathrm{Unif}(1, \dots, k) & \text{w.p. } \varepsilon \end{cases}$$

• Pessimism: Init Q’s below observable rewards → Only try one arm
• Optimism: Init Q’s above observable rewards → Explore arms once
• Middle-ground init Q → Explore arms at most once
But pure Greedy never explores an arm more than once


Limitations of ε-Greedy

• While we can improve on basic Greedy with optimistic initialisation and a decreasing ε…

• Exploration and exploitation are too “distinct”
∗ Exploration actions completely blind to promising arms
∗ Initialisation trick only helps with “cold start”

• Exploitation is blind to confidence of estimates

• These limitations are serious in practice


Mini Summary

• Multi-armed bandit setting
∗ Simplest instance of an explore-exploit problem
∗ Greedy approaches cover exploitation fine
∗ Greedy approaches are overly simplistic with exploration (if they have any!)

• Compared to: learning with experts
∗ Superficial changes: Experts → Arms; Losses → Rewards
∗ Choose one arm (like probabilistic experts algorithm)
∗ Big difference: Only observe rewards on chosen arm

Next: A better way, the optimism in the face of uncertainty principle


Upper-Confidence Bound (UCB)


Optimism in the face of uncertainty;
A smarter way to balance exploration-exploitation.


(Upper) confidence interval for Q estimates

• Theorem: Hoeffding’s inequality
∗ Let R_1, …, R_n be i.i.d. random variables in [0,1] with mean μ; denote by $\bar{R}_n$ their sample mean
∗ For any ε ∈ (0,1), with probability at least 1 − ε

$$\mu \le \bar{R}_n + \sqrt{\frac{\log(1/\varepsilon)}{2n}}$$

• Application to the Q_{t-1}(i) estimate – also an i.i.d. mean!!
∗ Take $n = N_{t-1}(i) = \sum_{s=1}^{t-1} \mathbb{1}[i_s = i]$, the number of plays of arm i
∗ Then $\bar{R}_n = Q_{t-1}(i)$
∗ Critical level ε = 1/t (Lai & Robbins ’85); take ε = 1/t⁴
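Filling in the algebra (our own step, not on the slide): plugging ε = 1/t⁴ into Hoeffding’s bound gives exactly the exploration bonus used by the UCB algorithm on the next slide,

$$\mu_i \;\le\; Q_{t-1}(i) + \sqrt{\frac{\log t^4}{2\,N_{t-1}(i)}} \;=\; Q_{t-1}(i) + \sqrt{\frac{2\log t}{N_{t-1}(i)}}$$

where Q_{t-1}(i) here denotes the plain sample mean, written $\hat{\mu}_{t-1}(i)$ on the next slide.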


Upper Confidence Bound (UCB) algorithm
• At round t
∗ Estimate value of each arm i as average reward observed, plus an exploration bonus

$$Q_{t-1}(i) = \begin{cases} \hat{\mu}_{t-1}(i) + \sqrt{\dfrac{2\log t}{N_{t-1}(i)}}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$$

…some constant Q_0(i) = Q_0 used until arm i has been pulled; where

$$N_{t-1}(i) = \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] \qquad \hat{\mu}_{t-1}(i) = \dfrac{\sum_{s=1}^{t-1} R_i(s)\,\mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}$$

∗ “Optimism in the face of uncertainty”

$$i_t \in \arg\max_{1 \le i \le k} Q_{t-1}(i)$$

…ties broken randomly

• Addresses several limitations of ε-Greedy
• Can “pause” on a bad arm for a while, but eventually finds the best

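A minimal sketch, again extending the Greedy class above (here `self.q` holds the sample mean μ̂ and the bonus is added at selection time; the default ρ = 2 recovers the slide’s rule):

```python
import math

class UCB(Greedy):
    """Extends the Greedy sketch; self.q stores the sample mean mu_hat."""
    def __init__(self, k, rho=2.0):
        super().__init__(k, q0=0.0)
        self.rho = rho
        self.t = 0  # rounds played

    def select(self):
        self.t += 1
        # Unpulled arms get an infinite bonus, so every arm is tried once first
        ucb = [
            self.q[i] + math.sqrt(self.rho * math.log(self.t) / self.n[i])
            if self.n[i] > 0 else float("inf")
            for i in range(len(self.q))
        ]
        return argmax_random_tie(ucb)
```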


Kicking the tyres: How does UCB compare?

• UCB quickly overtakes the ε-Greedy approaches
• Continues to outpace on per round rewards for some time
• More striking when viewed as mean cumulative rewards


Notes on UCB
• Theoretical regret bounds: optimal up to a multiplicative constant
∗ Regret grows like O(log t), i.e. averaged regret goes to zero!

• Tunable ρ > 0 exploration hyperparameter replaces the “2”

$$Q_{t-1}(i) = \begin{cases} \hat{\mu}_{t-1}(i) + \sqrt{\dfrac{\rho \log t}{N_{t-1}(i)}}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$$

∗ Captures different ε rates & bounded rewards outside [0,1]

• Many variations e.g. different confidence bounds

• Basis for Monte Carlo Tree Search used in AlphaGo!

Mini Summary

• Addressing limitations of greedy
∗ Exploration blind to how good an arm is
∗ Exploration/exploitation blind to confidence of arm estimates

• Upper-confidence bound (UCB) algorithm
∗ Exploration boost: Upper-confidence bound (like PAC bound!)
∗ Example of the optimism in the face of uncertainty principle
∗ Achieves practical performance and good regret bounds

Next: Wrap-up with a few related directions


Beyond basic bandits


Adding state with contextual bandits;
State transitions/dynamics with reinforcement learning.


But wait, there’s more!! Contextual bandits

• Adds concept of “state” of the world
∗ Arms’ rewards now depend on state
∗ E.g. best ad depends on user and webpage

• Each round, observe an arbitrary context (feature) vector X_i(t) per arm, representing state
∗ Profile of web page visitor (state)
∗ Web page content (state)
∗ Features of a potential ad (arm)

• Reward estimation
∗ Was unconditional: E[R_i(t)]
∗ Now conditional: E[R_i(t) | X_i(t)]

• A regression problem!!!


Still choose the arm maximising the UCB; but now the UCB is on a regression prediction given the context vector, not on a simple mean.
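The closing summary mentions LinUCB; as a hedged sketch, here is the disjoint-model version from Li et al. (2010), where each arm keeps its own ridge-regression estimate and `alpha` plays the role of the confidence width (all identifiers are ours):

```python
import math
import numpy as np

class LinUCBArm:
    """One ridge-regression model per arm (disjoint LinUCB, Li et al. 2010)."""
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)       # regularised Gram matrix: I + sum of x x^T
        self.b = np.zeros(d)     # sum of reward-weighted contexts: sum of r x

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b   # ridge estimate of this arm's weights
        # regression prediction + confidence width, replacing mean + bonus
        return float(theta @ x + self.alpha * math.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Each round: observe context x_i(t) per arm, play argmax_i arms[i].ucb(x_i),
# then update only the chosen arm with its observed reward.
```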


MABs vs. Reinforcement Learning

• Contextual bandits introduce state
∗ But don’t model actions as causing state transitions
∗ New state arrives “somehow”

• RL has rounds of states, actions, rewards too

• But (state,action) determines the next state
∗ E.g. playing Go, moving a robot, planning logistics

• Thus, RL still learns value functions with regression, but has to “roll out” predicted rewards into the future


Mini Summary

• Lecture: Stochastic multi-armed bandits
∗ Sequential decision making under uncertainty
∗ Simplest explore-vs-exploit setting
∗ (ε-)Greedy, UCB, LinUCB

• Related directions:
∗ Contextual bandits: adding state; regression estimates rewards
∗ Reinforcement learning: introducing state transitions

Next lecture: It’s Bayesian time!

