Lecturer: Ben Rubinstein
Lecture 2. Statistical Schools of Thought
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
How do learning algorithms come about?
• Frequentist statistics
• Statistical decision theory
• Extremum estimators
• Bayesian statistics
Types of probabilistic models
• Parametric vs. Non-parametric
• Generative vs. Discriminative
Frequentist Statistics
Wherein unknown model parameters are treated as having fixed but unknown values.
Frequentist statistics
• Abstract problem
∗ Given: X1, X2, …, Xn drawn i.i.d. from some distribution
∗ Want to: identify unknown distribution, or a property of it
• Parametric approach (“parameter estimation”)
∗ Class of models $\{p_\theta(x) : \theta \in \Theta\}$ indexed by parameters $\theta \in \Theta$ (could be a real number, or vector, or …)
∗ Point estimate $\hat{\theta}(X_1, \dots, X_n)$, a function (or statistic) of the data
• Examples
∗ Given n coin flips, determine probability of landing heads
∗ Learning a classifier
(Hat means estimate or estimator; i.i.d. = independent and identically distributed.)
Estimator Bias
Frequentists seek good behaviour, in ideal conditions
• Bias: $B_\theta(\hat{\theta}) = E_\theta[\hat{\theta}(X_1, \dots, X_n)] - \theta$
Example: for i = 1, …, 40
• $X_{i,1}, \dots, X_{i,20} \sim p_\theta = \mathrm{Normal}(\theta = 0, \sigma^2 = 1)$
• $\hat{\theta}_i = \frac{1}{20}\sum_{j=1}^{20} X_{i,j}$, the sample mean, plotted per replication
(Subscript $\theta$ means the data really comes from $p_\theta$.)
[Plot: running average of $\hat{\theta}_1, \dots, \hat{\theta}_{40}$ vs. the true $\theta = 0$ — the gap converges to the bias (in probability)!]
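A quick check of this picture, as a minimal NumPy sketch (illustrative only, not from the slides; `rng`, `n_reps`, and `n_samples` are my own names):

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n_samples = 40, 20          # 40 replications of 20 draws each
theta_true, sigma = 0.0, 1.0

# Each row is one replication; the sample mean is the estimator theta_hat_i
X = rng.normal(theta_true, sigma, size=(n_reps, n_samples))
theta_hat = X.mean(axis=1)

# Running average of the estimates should approach theta_true + bias = 0
running_avg = np.cumsum(theta_hat) / np.arange(1, n_reps + 1)
print(running_avg[-1])               # close to 0: the sample mean is unbiased
```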
Estimator Variance
Frequentists seek good behaviour, in ideal conditions
• Variance: $\mathrm{Var}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - E_\theta[\hat{\theta}])^2]$
Example cont.
• Plot each $(\hat{\theta}_i - E_\theta[\hat{\theta}_i])^2 = \hat{\theta}_i^2$ per replication
($\hat{\theta}$ is still a function of the data.)
[Plot: running average of $\hat{\theta}_1^2, \dots, \hat{\theta}_{40}^2$ vs. the true $\mathrm{Var}_\theta(\hat{\theta}) = \sigma^2/20 = 0.05$ — once again, the average converges to the truth (in probability)!]
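Continuing the illustrative sketch above, the squared estimates average out to the estimator's variance (again my own snippet, not course code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(40, 20))
theta_hat = X.mean(axis=1)

# E_theta[theta_hat] = 0 here, so (theta_hat - E[theta_hat])^2 = theta_hat^2
empirical_var = (theta_hat ** 2).mean()
print(empirical_var, "vs true", 1.0 / 20)   # both near 0.05
```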
Asymptotically Well Behaved
For our example estimator (the sample mean), we could calculate its exact bias (zero) and variance ($\sigma^2/n$). Usually we can't guarantee low bias/variance exactly.
Asymptotic properties often hold!
• Consistency: $\hat{\theta}(X_1, \dots, X_n) \to \theta$ in probability (bias closer and closer to zero)
• Asymptotic efficiency: $\mathrm{Var}_\theta(\hat{\theta}(X_1, \dots, X_n))$ converges to the smallest possible variance of any estimator of $\theta$ (variance closer and closer to optimal)
Amazing Cramér–Rao lower bound (outside subject scope):
$\mathrm{Var}_\theta(\tilde{\theta}) \geq \frac{1}{I(\theta)}$ for any unbiased estimator $\tilde{\theta}$, with $I(\theta)$ the Fisher information of $p_\theta$
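As an illustration (my own working, beyond what the slides show): for the running Normal example with known $\sigma^2$, writing $I_1(\theta)$ for the per-observation Fisher information, the $n$-sample bound is

```latex
I_1(\theta) = -\,E_\theta\!\left[\frac{\partial^2}{\partial\theta^2}\log p_\theta(X)\right]
            = \frac{1}{\sigma^2},
\qquad
\mathrm{Var}_\theta(\hat{\theta}) \;\ge\; \frac{1}{n\,I_1(\theta)} = \frac{\sigma^2}{n},
```

and the sample mean attains it: $\mathrm{Var}_\theta(\bar{X}) = \sigma^2/n$, matching the $\sigma^2/20 = 0.05$ seen earlier.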
Maximum-Likelihood Estimation
• A general principle for designing estimators
• Involves optimisation
• $\hat{\theta}(x_1, \dots, x_n) \in \operatorname{argmax}_{\theta \in \Theta} \prod_{i=1}^{n} p_\theta(x_i)$
• “The best estimate is one under
which observed data is most likely”
Later: MLE estimators usually well-behaved asymptotically
Fisher
Example I: Bernoulli
• Know data comes from Bernoulli distribution with
unknown parameter (e.g., biased coin); find mean
• MLE for mean
∗ $p_\theta(x) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases} = \theta^x (1-\theta)^{1-x}$
(note: $p_\theta(x) = 0$ for all other $x$)
∗ Maximising the likelihood yields $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Derivation: https://stats.stackexchange.com/questions/275380/maximum-likelihood-estimation-for-bernoulli-distribution
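A condensed version of the linked derivation (my own sketch of the standard argument):

```latex
\log L(\theta) = \sum_{i=1}^{n}\big[X_i\log\theta + (1-X_i)\log(1-\theta)\big],
\qquad
\frac{\partial\log L}{\partial\theta}
 = \frac{\sum_i X_i}{\theta} - \frac{n-\sum_i X_i}{1-\theta} = 0
 \;\Rightarrow\; \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}X_i .
```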
Example II: Normal
• Know data comes from Normal distribution with
variance 1 but unknown mean; find mean
• MLE for mean
∗ $p_\theta(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}(x - \theta)^2\right)$
∗ Maximising the likelihood yields $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$
• Exercise: derive the MLE for the variance $\sigma^2$ based on
$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$ with $\theta = (\mu, \sigma^2)$
Derivation: https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood
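For the known-variance case, the working on the linked page reduces to one line (a sketch; the variance MLE is left as the exercise above):

```latex
\log L(\theta) = -\frac{1}{2}\sum_{i=1}^{n}(X_i-\theta)^2 + \text{const},
\qquad
\frac{\partial\log L}{\partial\theta} = \sum_{i=1}^{n}(X_i-\theta) = 0
\;\Rightarrow\; \hat{\theta} = \frac{1}{n}\sum_{i=1}^{n}X_i .
```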
MLE ‘algorithm’
1. Given data 𝑋𝑋1, … ,𝑋𝑋𝑛𝑛 define probability distribution,
𝑝𝑝𝜃𝜃, assumed to have generated the data
2. Express the likelihood of the data, $\prod_{i=1}^{n} p_\theta(X_i)$
(usually its logarithm… why?)
3. Optimise to find the best (most likely) parameters $\hat{\theta}$ (a numerical sketch follows below)
1. take partial derivatives of the log likelihood w.r.t. $\theta$
2. set to 0 and solve
(failing that, use gradient descent)
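A minimal numerical sketch of this recipe (illustrative, not from the lecture; assumes a Normal model with unknown mean and variance and uses scipy.optimize.minimize on the negative log-likelihood):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)   # step 1: assume a Normal model

def neg_log_likelihood(params):
    mu, log_sigma = params                        # optimise log(sigma) so sigma > 0
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))  # step 2

# step 3: optimise numerically (no closed-form solving here)
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and the (biased) MLE std dev
```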
Mini Summary
• Frequentist school of thought
• Point estimates
• Quality: bias, variance, consistency, asymptotic efficiency
• Maximum-likelihood estimation (MLE)
Next: Statistical Decision Theory, Extremum estimators
Statistical Decision Theory
Branch within statistics, optimisation, economics,
control, emphasising utility maximisation.
Decision theory
• Act to maximise utility – connected to
economics and operations research
• Decision rule $\delta(\boldsymbol{x}) \in A$, an action space
∗ E.g. point estimate $\hat{\theta}(x_1, \dots, x_n)$
∗ E.g. out-of-sample prediction $\hat{Y}_{n+1} \mid X_1, Y_1, \dots, X_n, Y_n, X_{n+1}$
• Loss function $l(a, \theta)$: economic cost, error metric (see the code sketch below)
∗ E.g. square loss of an estimate, $(\hat{\theta} - \theta)^2$
∗ E.g. 0-1 loss of classifier predictions, $\mathbf{1}[y \neq \hat{y}]$
Wald
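Two of the losses just listed, written as code (an illustrative sketch; the function names are mine):

```python
import numpy as np

def square_loss(theta_hat, theta):
    """Square loss of an estimate: (theta_hat - theta)^2."""
    return (theta_hat - theta) ** 2

def zero_one_loss(y_pred, y_true):
    """0-1 loss of classifier predictions: 1[y != y_hat]."""
    return (np.asarray(y_pred) != np.asarray(y_true)).astype(float)

print(square_loss(0.3, 0.0))                 # 0.09
print(zero_one_loss([1, 0, 1], [1, 1, 1]))   # [0. 1. 0.]
```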
Risk & Empirical Risk Minimisation (ERM)
• In decision theory, really care about expected loss
• Risk: $R_\theta(\delta) = E_{\boldsymbol{X} \sim \theta}[\,l(\delta(\boldsymbol{X}), \theta)\,]$
∗ E.g. true test error
∗ aka generalization error
• Want: choose $\delta$ to minimise $R_\theta(\delta)$
• Can’t directly! Why?
• ERM: use the training set $\boldsymbol{X}$ to approximate $p_\theta$
∗ Minimise the empirical risk $\hat{R}_\theta(\delta) = \frac{1}{n}\sum_{i=1}^{n} l(\delta(X_i), \theta)$ (see the sketch below)
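A tiny ERM sketch (illustrative only; here the decision rule is a constant prediction and the loss is squared error, so the empirical risk minimiser should match the sample mean):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(1.0, 2.0, size=100)            # training data

def empirical_risk(c, data):
    """Average square loss of the constant decision rule delta(x) = c."""
    return np.mean((data - c) ** 2)

# Minimise the empirical risk over a grid of candidate actions
candidates = np.linspace(-5, 5, 1001)
risks = [empirical_risk(c, X) for c in candidates]
c_star = candidates[int(np.argmin(risks))]
print(c_star, X.mean())                        # ERM solution matches the sample mean
```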
Decision theory vs. Bias-variance
We’ve already seen
• Bias: $B_\theta(\hat{\theta}) = E_\theta[\hat{\theta}(X_1, \dots, X_n)] - \theta$
• Variance: $\mathrm{Var}_\theta(\hat{\theta}) = E_\theta[(\hat{\theta} - E_\theta[\hat{\theta}])^2]$
But are they equally important? How related?
• Bias-variance decomposition of square-loss risk:
$E_\theta[(\theta - \hat{\theta})^2] = [B(\hat{\theta})]^2 + \mathrm{Var}_\theta(\hat{\theta})$
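The derivation behind this identity, adding and subtracting $E_\theta[\hat{\theta}]$ (my own filling-in of the standard step):

```latex
E_\theta\big[(\theta-\hat{\theta})^2\big]
 = E_\theta\big[\big((\theta - E_\theta[\hat{\theta}]) + (E_\theta[\hat{\theta}] - \hat{\theta})\big)^2\big]
 = \underbrace{(E_\theta[\hat{\theta}]-\theta)^2}_{[B(\hat{\theta})]^2}
 + \underbrace{E_\theta\big[(\hat{\theta}-E_\theta[\hat{\theta}])^2\big]}_{\mathrm{Var}_\theta(\hat{\theta})},
```

the cross term vanishing because $E_\theta[E_\theta[\hat{\theta}] - \hat{\theta}] = 0$.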
Extremum estimators
Very general framework that covers elements of
major statistical learning frameworks; enjoys good
asymptotic behaviour in general!!
Extremum estimators
• $\hat{\theta}_n(\boldsymbol{X}) \in \operatorname{argmin}_{\theta \in \Theta} Q_n(\boldsymbol{X}, \theta)$ for any objective $Q_n(\cdot)$
• Generalises bits of all statistical frameworks. W00t!
∗ MLE and ERM seen earlier this lecture; and
∗ MAP seen later in this lecture.
∗ These are all M-estimators, with Q as a sum over data
(i.e. of log-likelihood, loss, or log-likelihood plus log prior)
• And it generalises other frameworks too!
Consistency of Extremum Estimators
• Recall consistency: stochastic convergence to 0 bias
• Theorem for extremum estimators: $\hat{\theta}_n \to \theta$ in probability,
if there's a ("limiting") function Q() such that:
1. Q() is uniquely maximised by 𝜃𝜃.
That is, no other parameters make Q() as large as 𝑄𝑄(𝜃𝜃).
2. The parameter family Θ is “compact”
(a generalisation of the familiar “closed” & “bounded” set, like [0,1])
3. Q() is a continuous function
4. Uniform convergence: $\sup_{\theta \in \Theta} |Q_n(\theta) - Q(\theta)| \to 0$ in probability.
[Figure: objective value vs. parameter $\theta$ — the limiting objective $Q(\theta)$ with bands $Q(\theta) \pm \varepsilon$, and the sample objective $Q_n(\theta)$; the maximiser (MLE) $\hat{\theta}_n$ of $Q_n$ must lie in $[\theta_L, \theta_U]$.]
A game changer
• Frequentists: estimators that aren't even correct given infinite data (inconsistent) aren't adequate in practice
• Proving consistency for every new
estimator? Ouch!
• So many estimators are extremum estimators – the general guarantees make it much easier (but not easy!) to prove consistency
• Asymptotic normality
∗ Extremum estimators converge to Gaussian in distribution
∗ Asymptotic efficiency: the variance of that limiting Gaussian
• Practical: Confidence intervals – think error bars!!
Frequentists like to have this asymptotic theory for their algorithms
Mini Summary
• Decision theory: Utility-based, Minimise risk
• Many familiar learners minimise loss over data (ERM)
• Extremum estimators generalise ERM, MLE, (later: MAP)
∗ Amazingly, consistent: Gives us confidence that they work
(eventually)
∗ Amazingly, asymptotically normal: Helps make confidence
intervals
Next: Last but not least, the Bayesian paradigm
Bayesian Statistics
Wherein unknown model parameters have
associated distributions reflecting prior belief.
Bayesian statistics
• Probabilities correspond to beliefs
• Parameters
∗ Modeled as r.v.’s having distributions
∗ Prior belief in $\theta$ encoded by a prior distribution $P(\theta)$
• Parameters are modeled like r.v.'s (even if not really random)
• Thus: the data likelihood $P_\theta(X)$ is written as the conditional $P(X \mid \theta)$
∗ Rather than a point estimate $\hat{\theta}$, Bayesians update the belief $P(\theta)$ with observed data to $P(\theta \mid X)$, the posterior distribution
Laplace
Tools of probabilistic inference
• Bayesian probabilistic inference
∗ Start with the prior $P(\theta)$ and likelihood $P(X \mid \theta)$
∗ Observe data $X = x$
∗ Update the prior to the posterior $P(\theta \mid X = x)$
• Primary tools to obtain the posterior
∗ Bayes Rule: reverses the order of conditioning
$P(\theta \mid X = x) = \dfrac{P(X = x \mid \theta)\, P(\theta)}{P(X = x)}$
∗ Marginalisation: eliminates unwanted variables
$P(X = x) = \sum_t P(X = x, \theta = t)$
Bayes
(These are general tools of probability, not specific to Bayesian stats/ML. The denominator $P(X = x)$ is called the evidence.)
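These two tools on a discretised parameter grid (an illustrative sketch, not course code; the prior, likelihood, and grid are assumptions of the example and anticipate the worked example on the next slide):

```python
import numpy as np
from scipy.stats import norm

theta_grid = np.linspace(-4, 4, 2001)                  # discretise theta
prior = norm.pdf(theta_grid, loc=0, scale=1)           # P(theta): N(0, 1)

x_observed = 1.0
likelihood = norm.pdf(x_observed, loc=theta_grid, scale=1)  # P(X = x | theta)

# Marginalisation: evidence P(X = x) = sum_t P(X = x, theta = t)
joint = likelihood * prior
evidence = np.trapz(joint, theta_grid)

# Bayes rule: posterior P(theta | X = x)
posterior = joint / evidence
print(np.trapz(theta_grid * posterior, theta_grid))    # posterior mean ~ 0.5
```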
Example
• We model $X \mid \theta$ as $N(\theta, 1)$ with prior $N(0, 1)$
• Suppose we observe X=1, then update posterior
$P(\theta \mid X = 1) = \dfrac{P(X = 1 \mid \theta)\, P(\theta)}{P(X = 1)}$
$\propto P(X = 1 \mid \theta)\, P(\theta)$
$= \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(1-\theta)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\theta^2}{2}\right)$
$\propto N(0.5, 0.5)$
NB: allowed to push constants out front and “ignore” as
these get taken care of by normalisation
The same update, worked step by step:
$P(\theta \mid X = 1) \propto \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(1-\theta)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\theta^2}{2}\right)$
The name of the game is to get the posterior into a recognisable form — an exp of a quadratic must be a Normal:
• Discard constants w.r.t. $\theta$
• Collect the exp's
• Make the leading numerator term $\theta^2$ by moving its coefficient into the denominator
• Complete the square in the numerator, moving excess constants out
• Factorise
• Recognise the result as an (unnormalised) Normal, reading off its mean and variance/std deviation
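Filling in the algebra these steps describe (my own working, not from the slides):

```latex
P(\theta \mid X = 1)
  \;\propto\; \exp\!\Big(-\tfrac{(1-\theta)^2}{2}\Big)\,\exp\!\Big(-\tfrac{\theta^2}{2}\Big)
  \;=\; \exp\!\Big(-\tfrac{2\theta^2 - 2\theta + 1}{2}\Big)
  \;\propto\; \exp\!\big(-(\theta^2 - \theta)\big)
  \;\propto\; \exp\!\Big(-\tfrac{(\theta - 1/2)^2}{2\cdot\tfrac{1}{2}}\Big),
```

which is an unnormalised Normal with mean $0.5$ and variance $0.5$, matching the $N(0.5, 0.5)$ above.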
How Bayesians make point estimates
• They don’t, unless forced at gunpoint!
∗ The posterior carries full information, why discard it?
• But, there are common approaches
∗ Posterior mean: $E_{\theta \mid X}[\theta] = \int \theta\, P(\theta \mid X)\, d\theta$
∗ Posterior mode: $\operatorname{argmax}_\theta P(\theta \mid X)$ (maximum a posteriori, or MAP)
∗ There are Bayesian decision-theoretic interpretations of these
[Figure: a posterior density $P(\theta \mid X)$ over $\theta$, with the mode $\theta_{\mathrm{MAP}}$ marked.]
MLE in Bayesian context
• MLE formulation: find parameters that best fit data
$\hat{\theta} \in \operatorname{argmax}_\theta P(X = x \mid \theta)$
• Consider the MAP under a Bayesian formulation
$\hat{\theta} \in \operatorname{argmax}_\theta P(\theta \mid X = x)$
$= \operatorname{argmax}_\theta \dfrac{P(X = x \mid \theta)\, P(\theta)}{P(X = x)}$
$= \operatorname{argmax}_\theta P(X = x \mid \theta)\, P(\theta)$
• The prior $P(\theta)$ acts as a weighting; MLE is like a uniform prior $P(\theta) \propto 1$
• Extremum estimator: maximise $\log P(X = x \mid \theta) + \log P(\theta)$ (sketch below)
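A minimal sketch of MAP as an extremum estimator (illustrative, assuming a Normal likelihood with a $N(0,1)$ prior on the mean); note how the log-prior acts like a penalty added to the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=10)

def neg_log_posterior(theta):
    log_lik = np.sum(norm.logpdf(data, loc=theta, scale=1.0))  # log P(X = x | theta)
    log_prior = norm.logpdf(theta, loc=0.0, scale=1.0)         # log P(theta)
    return -(log_lik + log_prior)

theta_map = minimize_scalar(neg_log_posterior).x
theta_mle = data.mean()
print(theta_map, theta_mle)   # MAP is shrunk towards the prior mean 0
```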
Frequentists vs Bayesians – Oh My!
• Two key schools of statistical
thinking
∗ Decision theory complements both
• Past: controversy; animosity;
almost a ‘religious’ choice
• Nowadays: deeply connected
https://xkcd.com/1132/ CC-NC 2.5
(Some) Categories of
Probabilistic Models
Parametric vs non-parametric models
• Parametric: determined by a fixed, finite number of parameters; limited flexibility; efficient statistically and computationally
• Non-parametric: number of parameters grows with the data, potentially infinite; more flexible; less efficient
Examples to come! There are non/parametric models
in both the frequentist and Bayesian schools.
Generative vs. discriminative models
• X’s are instances, Y’s are labels (supervised setting!)
∗ Given: i.i.d. data (X1, Y1),… ,(Xn, Yn)
∗ Find model that can predict Y of new X
• Generative approach
∗ Model full joint P(X, Y)
• Discriminative approach
∗ Model conditional P(Y|X) only
• Both have pros and cons
Examples to come! There are generative/discriminative
models in both the frequentist and Bayesian schools.
Mini Summary
• Bayesian paradigm: It's all in the prior!
• Bayesian point estimate: MAP (an extremum estimator)
• Parametric vs Non-parametric models
• Discriminative vs. Generative models
Next: Logistic regression (unlike you’ve ever seen before)
Workshops week #2: learning Bayes one coin flip at a time!