Lecture 16. Learning with expert advice
COMP90051 Statistical Machine Learning
Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• Learning from expert advice / multiplicative weights
∗ Learner listens to some/all experts making predictions
∗ True outcomes are ADVERSARIAL!
∗ Learner updates weights over experts based on their losses
∗ The algorithms are all forms of “multiplicative weights”
∗ Nice clean bounds on total mistakes/loss via the “potential function” technique
• Infallible expert (one always perfect)
∗ Majority Vote Algorithm
• Imperfect experts (none guaranteed perfect) – increasingly better
∗ Weighted Majority Vote Algorithm by Halving
∗ Weighted Majority Voting by General Multiplicative Weights
∗ Probabilistic Experts Algorithm
An infallible expert and the Majority Algorithm
A warm-up example
Warm-up: Case of the infallible expert
• Experts $E_1, \dots, E_n$ predict the stock market daily
∗ Each expert’s prediction is binary: stocks will go up/down
• Learner’s game, daily:
∗ Observe predictions of all experts
∗ Make a prediction of its own
∗ Observe the outcome (could be anything!)
∗ Goal: minimise the total number of mistakes
• Infallible expert assumption:
∗ One or more experts make no mistakes
Infallible Expert Algorithm: Majority Vote
1. Initialise the set of experts who haven’t made mistakes: $E = \{1, \dots, n\}$
2. Repeat per round:
a) Observe predictions of $E_i$ for all $i \in \{1, \dots, n\}$
b) Make the majority prediction $\arg\max_{y \in \{-1,1\}} \sum_{i \in E} \mathbf{1}[E_i = y]$
c) Observe the correct outcome
d) Remove mistaken experts from $E$
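A minimal Python sketch of this loop (illustrative only; the function name and the (T, n) array layout are assumptions, with predictions and outcomes encoded as ±1 as above):

```python
import numpy as np

def majority_vote(expert_preds, outcomes):
    """Infallible-expert setting. expert_preds: (T, n) array of +/-1 predictions;
    outcomes: length-T array of +/-1; at least one column is assumed always correct."""
    T, n = expert_preds.shape
    alive = np.ones(n, dtype=bool)                # experts with no mistakes so far
    mistakes = 0
    for t in range(T):
        vote = expert_preds[t, alive].sum()       # unweighted majority of surviving experts
        y_hat = 1 if vote >= 0 else -1
        mistakes += int(y_hat != outcomes[t])
        alive &= (expert_preds[t] == outcomes[t])  # drop every mistaken expert
    return mistakes                               # <= log2(n) under the infallibility assumption
```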
Mistake Bound for Majority Vote
Proposition: Under the infallible expert assumption, majority vote makes total mistakes $M \le \log_2 n$.
Proof
• Loop invariant: if the algorithm makes a mistake, then at least $|E|/2$ of the surviving experts must have been wrong
• I.e. for every incorrect prediction, $|E|$ is reduced by at least half; so after $M$ mistakes, $|E| \le n/2^M$
• By infallibility, at all times $1 \le |E|$
• Combine to $1 \le |E| \le n/2^M$, then solve for $M$.
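Spelling out that final step:

```latex
1 \le |E| \le \frac{n}{2^M}
\;\Longrightarrow\; 2^M \le n
\;\Longrightarrow\; M \le \log_2 n .
```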
(Intuition: halving, as in tree data structures, so expect to see a log.)
($|E|$ is the potential function.)
How is this “online learning”?
Learning
• Weights on which experts are worth listening to
• (Infallible case: 0/1 weights)
• Making predictions / taking actions
• Incurring loss (so far 0/1)
• IID “distribution” replaced by adversarial outcomes
Online
• A repeated game
http://cesa-bianchi.di.unimi.it/predbook/
Mini Summary
• Learning with expert advice paradigm
∗ Abstraction of the online learning problem
∗ Later: Applications abound
∗ Adversarial feedback
• Bounds on mistakes (later, losses) are “easy”
∗ Involve the “potential function” technique
∗ Later: interested in scaling with the best expert’s performance
Next: Imperfect experts. Downweight, don’t drop, bad experts.
Imperfect experts and the Halving Algorithm
Similar proof technique; similar algorithm; much more interesting setting
No one’s perfect
• No more guarantee of an infallible expert
• What breaks?
∗ We could end up with $E = \emptyset$; how would we predict then?
∗ “Zero tolerance” (dropping experts on a single mistake) makes no sense
• Very general setting / very few assumptions
∗ Not assuming anything about expert error rates
∗ Not assuming anything about correlation of expert errors
∗ Not assuming anything about outcome observations: not even stochastic (could be adversarial!)
Imperfect experts: Halving Algorithm
1. Initialise weights $w_i = 1$ for each expert $E_i$
2. Repeat per round:
a) Observe predictions of $E_i$ for all $i \in \{1, \dots, n\}$
b) Make the weighted majority prediction $\arg\max_{y \in \{-1,1\}} \sum_{i=1}^{n} w_i \mathbf{1}[E_i = y]$
c) Observe the correct outcome
d) Downweigh each mistaken expert: $w_i \leftarrow w_i / 2$
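A minimal Python sketch of the halving loop (same illustrative conventions as the earlier sketch; the function name is hypothetical):

```python
import numpy as np

def halving_weighted_majority(expert_preds, outcomes):
    """Imperfect experts. expert_preds: (T, n) array of +/-1 predictions;
    outcomes: length-T array of +/-1. Mistaken experts are halved, never dropped."""
    T, n = expert_preds.shape
    w = np.ones(n)
    mistakes = 0
    for t in range(T):
        vote = np.sum(w * expert_preds[t])            # weighted majority vote
        y_hat = 1 if vote >= 0 else -1
        mistakes += int(y_hat != outcomes[t])
        w[expert_preds[t] != outcomes[t]] /= 2.0      # halve each mistaken expert's weight
    return mistakes     # roughly bounded by 2.4 (m + log2 n), m = best expert's mistakes
```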
Mistake Bound for Halving
Proposition: If the best expert makes $m$ mistakes, then weighted majority vote makes $M \le 2.4\,(m + \log_2 n)$ mistakes.
Proof
• Invariant: if the algorithm makes a mistake, then the weight of the wrong experts is at least half the total weight $W = \sum_{i=1}^{n} w_i$
• The weight of the wrong experts is halved, so the total weight drops to at most $3/4$ of its value. I.e. after $M$ mistakes, $W \le n\,(3/4)^M$
• The best expert $E_i$ has $w_i = (1/2)^m$
• Combine to $(1/2)^m = w_i \le W \le n\,(3/4)^M$
• Taking logs, $-m \le \log_2 n + M \log_2(3/4)$; solving, $M \le \frac{m + \log_2 n}{\log_2(4/3)}$
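Making the final algebra explicit:

```latex
\left(\tfrac{1}{2}\right)^m \le n\left(\tfrac{3}{4}\right)^M
\;\Longrightarrow\; -m \le \log_2 n + M\log_2\tfrac{3}{4}
\;\Longrightarrow\; M \le \frac{m + \log_2 n}{\log_2(4/3)} \approx 2.41\,(m + \log_2 n).
```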
Compare, compare: What’s going on?
• Price of imperfection (vs. infallibility) is $\mathcal{O}(m)$
∗ Infallible case: $M \in \mathcal{O}(\log n)$
∗ Imperfect case: $M \in \mathcal{O}(m + \log n)$
• Scaling to many experts is no problem
• Online learning vs. PAC frameworks:
∗ PAC: i.i.d. losses (due to e.g. Hoeffding; for ERM, see L6c). Ultimate goal: small estimation error $R(f_m) - R(f^*)$, bounded in terms of the family’s VC dimension.
∗ Online learning: adversarial/arbitrary losses. Ultimate goal: small $M - m$, bounded in terms of the number of experts.
Mini Summary
• Imperfect expert setting
∗ Don’t drop bad experts, just halve their weight
∗ Predict by weighted majority, not simple majority
∗ Mistake bound follows a similar “potential function” pattern!
• Learning with expert advice paradigm
∗ Key difference to PAC is adversarial feedback
∗ Similarity: also concerned with performance relative to “best in class”
Next: Imperfect experts continued. Generalising halving.
From Halving to Multiplying weights by $1 - \varepsilon$
Generalising weighted majority.
Useful (but otherwise boring) inequalities
• Lemma 1: For any $\varepsilon \in (0, 0.5)$, we have $-\varepsilon - \varepsilon^2 \le \log_e(1 - \varepsilon) < -\varepsilon$
• Proof:
∗ Upper bound by Taylor expansion, dropping all but the first term (the rest are negative)
∗ Lower bound by convexity of $\exp(-\varepsilon - \varepsilon^2)$
• Lemma 2: For all $\varepsilon \in [0,1]$ we have
$1 - \varepsilon x \ge (1 - \varepsilon)^x$ if $x \in [0,1]$, and $1 - \varepsilon x \ge (1 + \varepsilon)^{-x}$ if $x \in [-1,0]$
• Proof: by convexity of the right-hand-side functions (with equality at the endpoints)
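A quick numerical spot-check of both lemmas over a grid of test points (not a proof, just a sanity check):

```python
import numpy as np

# Lemma 1: for eps in (0, 0.5),  -eps - eps^2 <= ln(1 - eps) < -eps
eps = np.linspace(0.01, 0.49, 100)
assert np.all(-eps - eps**2 <= np.log(1 - eps))
assert np.all(np.log(1 - eps) < -eps)

# Lemma 2: for eps in [0, 1],  1 - eps*x >= (1 - eps)^x    on x in [0, 1]
#                         and  1 - eps*x >= (1 + eps)^(-x) on x in [-1, 0]
for e in (0.0, 0.25, 0.5, 0.75, 1.0):
    x = np.linspace(0.0, 1.0, 100)
    assert np.all(1 - e * x >= (1 - e) ** x)
    x = np.linspace(-1.0, 0.0, 100)
    assert np.all(1 - e * x >= (1 + e) ** (-x))

print("all inequalities verified numerically")
```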
Weighted Majority Vote Algorithm
1. Initialise weights $w_i = 1$ for each expert $E_i$
2. Repeat per round:
a) Observe predictions of $E_i$ for all $i \in \{1, \dots, n\}$
b) Make the weighted majority prediction $\arg\max_{y \in \{-1,1\}} \sum_{i=1}^{n} w_i \mathbf{1}[E_i = y]$
c) Observe the correct outcome
d) Downweigh each mistaken expert: $w_i \leftarrow (1 - \varepsilon)\,w_i$
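The halving sketch changes only in its update step; a parameterised version (again a hypothetical sketch, not course-provided code) might look like:

```python
import numpy as np

def weighted_majority(expert_preds, outcomes, eps=0.3):
    """Same loop as the halving sketch, but mistaken experts are scaled by (1 - eps).
    eps = 0.5 recovers the halving algorithm."""
    T, n = expert_preds.shape
    w = np.ones(n)
    mistakes = 0
    for t in range(T):
        y_hat = 1 if np.sum(w * expert_preds[t]) >= 0 else -1
        mistakes += int(y_hat != outcomes[t])
        w[expert_preds[t] != outcomes[t]] *= (1.0 - eps)   # multiplicative downweighting
    return mistakes        # bounded by 2(1 + eps) m + 2 ln(n) / eps
```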
Mistake Bound
Proposition: If the best expert makes $m$ mistakes, then $(1-\varepsilon)$-weighted majority makes $M \le 2(1 + \varepsilon)\,m + (2\log_e n)/\varepsilon$ mistakes.
Proof
• Whenever the learner makes a mistake, at least half of the total weight is reduced by a factor of $(1 - \varepsilon)$. So after $M$ mistakes, $W \le n\,(1 - \varepsilon/2)^M$
• The best expert $E_i$ has $w_i = (1 - \varepsilon)^m$
• Combine to $(1 - \varepsilon)^m = w_i \le W \le n\,(1 - \varepsilon/2)^M$
• Taking logs: $m\log_e(1 - \varepsilon) \le \log_e n + M\log_e(1 - \varepsilon/2)$
• Lemma 1 bounds both logarithms: $-m(\varepsilon + \varepsilon^2) \le \log_e n - M\varepsilon/2$
• Solving for $M$ proves the bound.
The bound improves the dependence on $m$ compared to halving. Why?
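Solving the proof’s last inequality for $M$ gives exactly the claimed bound:

```latex
-m(\varepsilon + \varepsilon^2) \le \log_e n - \frac{M\varepsilon}{2}
\;\Longrightarrow\; \frac{M\varepsilon}{2} \le \log_e n + m\varepsilon(1 + \varepsilon)
\;\Longrightarrow\; M \le 2(1 + \varepsilon)\,m + \frac{2\log_e n}{\varepsilon}.
```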
Dependence on $m$ provably near optimal!
• New to lower bounds? An example shows that an analysis, or even an algorithm, cannot do better than some limit
• Weighted majority almost achieves the $2m$ dependence, with $2(1 + \varepsilon)m$ (taking the number of experts as fixed)
• Example with $M = 2m$:
∗ Consider $n = 2$ with $E_1$ ($E_2$) correct on odd (even) days
∗ Then the best expert makes mistakes half the time
∗ But after the 1st round, for any $\varepsilon$, the mistaken expert holds at least half the weight, so (breaking ties against the learner) the weighted majority vote is wrong every time
• Consequence? Can’t improve the constant factor of 2 on $m$
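A small simulation of this two-expert construction (a hypothetical sketch; ties in the vote are broken against the learner, as an adversary would choose):

```python
import numpy as np

def alternating_lower_bound(eps=0.3, T=1000):
    """Two experts: E1 always predicts +1, E2 always predicts -1; the true outcome
    alternates +1, -1, ..., so each expert is wrong on exactly half the rounds.
    Ties in the weighted vote are broken against the learner (adversarially)."""
    preds = np.array([+1, -1])                   # fixed expert predictions
    w = np.ones(2)
    learner_mistakes = 0
    for t in range(1, T + 1):
        y = +1 if t % 2 == 1 else -1             # truth alternates each day
        vote = float(np.sum(w * preds))
        y_hat = (1 if vote > 0 else -1) if vote != 0 else -y   # adversarial tie-break
        learner_mistakes += int(y_hat != y)
        w[preds != y] *= (1.0 - eps)             # downweight the mistaken expert
    best_expert_mistakes = T // 2                # each expert errs on half the rounds
    return learner_mistakes, best_expert_mistakes

M, m = alternating_lower_bound()
print(M, m, M / m)    # M is (essentially) 2m: the factor of 2 cannot be improved
```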
Mini Summary
• Imperfect expert setting continued…
• From halving to general multiplicative weights!
∗ Mistake bound proved as usual via the “potential function” trick
∗ The bound’s dependence on the best expert improved to a $2(1 + \varepsilon)$ factor
• Lower bound / impossibility result
∗ The factor of 2 is optimal for (deterministic) multiplicative weights!
Next: Imperfect experts continued. Randomise!!
The probabilistic experts algorithm
wherein randomisation helps us do better!
Probabilistic experts algorithm
• Change 1 from mistakes: loss $\ell_i^{(t)} \in [0,1]$ of expert $E_i$ in round $t$
• Change 2: a randomised algorithm means bounding expected losses (sound familiar? It should… a risk!)
1. Initialise weights $w_i = 1$ for each expert $E_i$
2. Repeat per round:
a) Observe predictions of $E_i$ for all $i \in \{1, \dots, n\}$
b) Predict as expert $E_i$ with probability $w_i / W$, where $W = \sum_{i=1}^{n} w_i$
c) Observe losses
d) Update each weight: $w_i \leftarrow (1 - \varepsilon)^{\ell_i^{(t)}} w_i$
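A minimal Python sketch of this algorithm (hypothetical names; the synthetic losses at the end are made up purely to exercise the bound):

```python
import numpy as np

def probabilistic_experts(losses, eps=0.1):
    """losses: (T, n) array of per-round expert losses in [0, 1].
    Returns the algorithm's total expected loss (expectation over its random choices)."""
    T, n = losses.shape
    w = np.ones(n)
    expected_loss = 0.0
    for t in range(T):
        p = w / w.sum()                          # weights normalised into a distribution
        expected_loss += float(p @ losses[t])    # expected loss of the randomly chosen expert
        w *= (1.0 - eps) ** losses[t]            # multiplicative update, exponent = loss
    return expected_loss

# Synthetic check of the bound L <= ln(n)/eps + (1 + eps) L*
rng = np.random.default_rng(0)
T, n, eps = 2000, 10, 0.1
losses = rng.random((T, n))
L = probabilistic_experts(losses, eps)
L_star = losses.sum(axis=0).min()                # best single expert's total loss in hindsight
print(L <= np.log(n) / eps + (1 + eps) * L_star)
```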
Probabilistic experts: Expected loss bound
• Proposition: The expected loss of the probabilistic experts algorithm is $L \le \frac{\log_e n}{\varepsilon} + (1 + \varepsilon)L^*$, where $L^*$ is the minimum total loss over the experts.
• Proof: next; follows a similar “potential” pattern
• Beats deterministic! Shaves off the constant 2 that deterministic algorithms cannot avoid
• Generalises in many directions. An active area of research in ML, control and economics, in top labs.
http://cesa-bianchi.di.unimi.it/predbook/
Proof: Upper bounding potential function
• Learner’s round-$t$ expected loss: $L_t = \frac{\sum_{i=1}^{n} w_i^{(t)} \ell_i^{(t)}}{W^{(t)}}$
• By Lemma 2, since losses are in $[0,1]$, the updated weight satisfies $w_i^{(t+1)} \leftarrow (1 - \varepsilon)^{\ell_i^{(t)}} w_i^{(t)} \le \left(1 - \varepsilon \ell_i^{(t)}\right) w_i^{(t)}$
• Rearrange to obtain a recurrence relation:
$W^{(t+1)} \le \sum_{i=1}^{n} \left(1 - \varepsilon \ell_i^{(t)}\right) w_i^{(t)} = \left(\sum_{i=1}^{n} w_i^{(t)}\right)\left(1 - \varepsilon \frac{\sum_{i=1}^{n} w_i^{(t)} \ell_i^{(t)}}{\sum_{i=1}^{n} w_i^{(t)}}\right) = W^{(t)}\left(1 - \varepsilon L_t\right)$
• Initialisation gave $W^{(0)} = n$, so telescoping we get $W^{(T)} \le n \prod_{t=1}^{T} \left(1 - \varepsilon L_t\right)$
Proof: Lower bounding potential, Wrap up
• Proved upper bound: $W^{(T)} \le n \prod_{t=1}^{T} (1 - \varepsilon L_t)$
• Lower bound from the best expert’s total loss $L^*$: $W^{(T)} \ge (1 - \varepsilon)^{L^*}$
• Combining the bounds and taking logs: $L^* \log_e(1 - \varepsilon) \le \log_e n + \sum_{t=1}^{T} \log_e(1 - \varepsilon L_t)$
• By Lemma 1: $-L^*(\varepsilon + \varepsilon^2) \le \log_e n - \varepsilon \sum_{t=1}^{T} L_t$
• By linearity of expectation, $L = \sum_{t=1}^{T} L_t$; rearranging: $L \le \frac{\log_e n}{\varepsilon} + (1 + \varepsilon)L^*$
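Filling in the intermediate step of that rearrangement:

```latex
-L^*(\varepsilon + \varepsilon^2) \le \log_e n - \varepsilon L
\;\Longrightarrow\; \varepsilon L \le \log_e n + L^*\varepsilon(1 + \varepsilon)
\;\Longrightarrow\; L \le \frac{\log_e n}{\varepsilon} + (1 + \varepsilon)\,L^*.
```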
Applications of multiplicative weights [Kale thesis 2007]
• Learning quantum states from noisy measurements
• Derandomising algorithms
• Solving certain zero-sum games
• Fast graph partitioning
• Fast solving of semidefinite programming problems
• Portfolio optimisation
• A basis for boosting
• Sparse vector technique in differential privacy
Mini Summary
• Introducing randomisation to learning with experts
∗ The algorithm chooses a random expert to follow
∗ Weights become probabilities
∗ Mistakes generalise to losses
• Loss bound
∗ Have to bound the expected loss (hey, risk!!)
∗ Shaves off that factor of 2. Proves that randomisation really helps!
Next: Bandits, where we only observe the reward of the chosen expert!