Lecture 2. Statistical Schools of Thought
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Updated 2019-08-02_18-10: solution to posterior update; pointer to Gaussian, Bernoulli MLEs. Copyright: University of Melbourne
This lecture
• How do learning algorithms come about?
  ∗ Frequentist statistics
  ∗ Statistical decision theory
  ∗ Bayesian statistics
• Types of probabilistic models
  ∗ Parametric vs. non-parametric
  ∗ Generative vs. discriminative
Frequentist Statistics
Wherein unknown model parameters are treated as having fixed but unknown values.
Frequentist statistics
• Abstract problem
  ∗ Given: X1, X2, …, Xn drawn i.i.d. from some distribution
  ∗ Want to: identify the unknown distribution, or a property of it
  (i.i.d. = independent and identically distributed)
• Parametric approach (“parameter estimation”)
  ∗ Class of models $\{p_\theta(x) : \theta \in \Theta\}$ indexed by parameters $\theta \in \Theta$ (could be a real number, or vector, or …)
  ∗ Point estimate $\hat{\theta}(x_1, \ldots, x_n)$: a function (or statistic) of the data (the hat means estimate or estimator)
• Examples
  ∗ Given n coin flips, determine probability of landing heads
  ∗ Choosing a classifier
Bias, variance and asymptotic versions
Frequentists seek good behaviour in ideal conditions:
• Bias: $B_\theta(\hat\theta) = E_\theta[\hat\theta(X_1, \ldots, X_n)] - \theta$
• Variance: $\mathrm{Var}_\theta(\hat\theta) = E_\theta[(\hat\theta - E_\theta[\hat\theta])^2]$
  (Subscript $\theta$ means the data really comes from $p_\theta$; $\hat\theta$ is still a function of the data.)
Asymptotic properties:
• Consistency: $\hat\theta(X_1, \ldots, X_n)$ converges to $\theta$ as $n \to \infty$
• Efficiency: asymptotic variance is as small as possible
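A quick way to build intuition for these properties is simulation. The sketch below is a minimal illustration, assuming NumPy is available; the true parameter, sample sizes, and repeat counts are arbitrary choices, not from the slides. It approximates the bias and variance of the sample-mean estimator for a Bernoulli parameter, and shows consistency empirically as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7          # true Bernoulli parameter (illustrative choice)
n, repeats = 50, 5_000    # sample size and number of simulated datasets

# Draw many datasets from p_theta and compute the sample-mean estimator on each
estimates = rng.binomial(n=1, p=theta_true, size=(repeats, n)).mean(axis=1)

bias = estimates.mean() - theta_true      # approximates E_theta[theta_hat] - theta
variance = estimates.var()                # approximates Var_theta(theta_hat)
print(f"bias ≈ {bias:.4f}, variance ≈ {variance:.4f}")

# Consistency: as n grows, theta_hat concentrates around the true theta
for m in (10, 100, 1_000):
    est = rng.binomial(1, theta_true, size=(repeats, m)).mean(axis=1)
    print(m, f"mean abs error ≈ {np.abs(est - theta_true).mean():.4f}")
```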
Maximum-Likelihood Estimation
• A general principle for designing estimators
• Involves optimisation
• $\hat\theta(x_1, \ldots, x_n) \in \arg\max_{\theta \in \Theta} \prod_{i=1}^{n} p_\theta(x_i)$
• MLE estimators are consistent (under technical conditions)
[Portrait: Fisher]
Example I: Bernoulli
• Know data comes from a Bernoulli distribution with unknown parameter (e.g., biased coin); find mean
• MLE for mean
  ∗ $p_\theta(x) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases} = \theta^x (1-\theta)^{1-x}$  (note: $p_\theta(x) = 0$ for all other $x$)
  ∗ Maximising likelihood yields $\hat\theta = \frac{1}{n} \sum_{i=1}^{n} X_i$
Derivation: https://stats.stackexchange.com/questions/275380/maximum-likelihood-estimation-for-bernoulli-distribution
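As a sanity check, the closed-form MLE (the sample mean) can be compared against a direct numerical maximisation of the log-likelihood. A minimal sketch assuming NumPy/SciPy; the simulated coin flips and the true parameter are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)     # simulated coin flips (true theta = 0.3)

def neg_log_likelihood(theta):
    # -log prod_i theta^{x_i} (1 - theta)^{1 - x_i}
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

closed_form = x.mean()
numeric = minimize_scalar(neg_log_likelihood,
                          bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(closed_form, numeric)            # the two estimates should agree closely
```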
Example II: Normal
• Know data comes from a Normal distribution with variance 1 but unknown mean; find mean
• MLE for mean
  ∗ $p_\theta(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x - \theta)^2\right)$
  ∗ Maximising likelihood yields $\hat\theta = \frac{1}{n} \sum_{i=1}^{n} X_i$
• Exercise: derive MLE for variance $\sigma^2$ based on
  $p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$ with $\theta = (\mu, \sigma^2)$
Derivation: https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood
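The same numeric check works here, including for the exercise: the linked derivation gives the closed-form MLEs as the sample mean and the (1/n-normalised) sample variance. A minimal sketch assuming NumPy/SciPy, with illustrative simulated data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.5, size=500)    # simulated data (true mu=2, sigma=1.5)

def neg_log_likelihood(params):
    mu, log_sigma2 = params                     # parameterise by log-variance to keep it positive
    sigma2 = np.exp(log_sigma2)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

# Closed-form MLEs: sample mean, and (1/n) sum (x_i - mu_hat)^2
print(mu_hat, x.mean())
print(sigma2_hat, np.mean((x - x.mean()) ** 2))
```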
MLE ‘algorithm’
1. Given data $X_1, \ldots, X_n$, define the probability distribution $p_\theta$ assumed to have generated the data
2. Express the likelihood of the data, $\prod_{i=1}^{n} p_\theta(X_i)$ (usually its logarithm… why?)
3. Optimise to find the best (most likely) parameters $\hat\theta$:
   1. take partial derivatives of the log likelihood w.r.t. $\theta$
   2. set to 0 and solve
   (failing that, use an iterative gradient method)
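Applying these steps to the Bernoulli example above reproduces the stated result (this is the same standard derivation the linked Stack Exchange answer walks through):

```latex
\begin{aligned}
\log \prod_{i=1}^{n} p_\theta(x_i)
  &= \sum_{i=1}^{n}\big(x_i \log\theta + (1-x_i)\log(1-\theta)\big) \\
\frac{\partial}{\partial\theta}\log L(\theta)
  &= \frac{\sum_i x_i}{\theta} - \frac{n-\sum_i x_i}{1-\theta} \;=\; 0
  \quad\Longrightarrow\quad \hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i .
\end{aligned}
```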
Statistical Decision Theory
Branch within statistics, optimisation, economics, control, emphasising utility maximisation.
Decision theory
• Act to maximise utility: connected to economics and operations research
• Decision rule $\delta(\boldsymbol{x}) \in A$, an action space
  ∗ E.g. point estimate $\hat\theta(x_1, \ldots, x_n)$
  ∗ E.g. out-of-sample prediction $\hat{Y}_{n+1} \mid X_1, Y_1, \ldots, X_n, Y_n, X_{n+1}$
• Loss function $l(a, \theta)$: economic cost, error metric
  ∗ E.g. square loss of an estimate, $(\hat\theta - \theta)^2$
  ∗ E.g. 0-1 loss of classifier predictions, $1[y \neq \hat{y}]$
[Portrait: Wald]
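For concreteness, the two example losses can be written directly as functions of a decision and the truth. A minimal sketch assuming NumPy; the function names are illustrative, not from the slides.

```python
import numpy as np

def square_loss(theta_hat, theta):
    # l(theta_hat, theta) = (theta_hat - theta)^2, used for point estimation
    return (theta_hat - theta) ** 2

def zero_one_loss(y_hat, y):
    # l(y_hat, y) = 1[y != y_hat], used for classifier predictions
    return (np.asarray(y_hat) != np.asarray(y)).astype(float)

print(square_loss(0.6, 0.5))               # (0.6 - 0.5)^2 = 0.01, up to floating point
print(zero_one_loss([1, 0, 1], [1, 1, 1])) # [0., 1., 0.]
```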
Risk & Empirical Risk Minimisation (ERM)
• Risk: $R_\theta[\delta] = E_{X \sim \theta}\left[\, l(\delta(X), \theta) \,\right]$
• In decision theory, we really care about expected loss
  ∗ E.g. true test error
  ∗ aka generalization error
• Want: choose $\delta$ to minimise $R_\theta[\delta]$
• Can’t directly! Why?
• ERM: use training set X to approximate $p_\theta$
  ∗ Minimise empirical risk $\hat{R}_\theta[\delta] = \frac{1}{n} \sum_{i=1}^{n} l(\delta(X_i), \theta)$
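The risk can't be minimised directly because the expectation is taken under the unknown true distribution; ERM replaces it with an average over the training sample. Below is a minimal supervised sketch assuming NumPy, where the unknown quantity is a label and the decision rules are simple threshold classifiers scored by empirical 0-1 loss; the data and rule family are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy 1-D classification data: label is 1 when x exceeds 0.6, with 10% label noise
X = rng.uniform(0, 1, size=300)
Y = (X > 0.6).astype(int)
Y = np.where(rng.uniform(size=300) < 0.1, 1 - Y, Y)

def empirical_risk(delta, X, Y):
    # R_hat[delta] = (1/n) sum_i l(delta(X_i), Y_i) with 0-1 loss
    return np.mean(delta(X) != Y)

# Candidate decision rules: threshold classifiers delta_t(x) = 1[x > t]
candidates = np.linspace(0, 1, 101)
risks = [empirical_risk(lambda x, t=t: (x > t).astype(int), X, Y) for t in candidates]
best_t = candidates[int(np.argmin(risks))]
print(best_t, min(risks))   # threshold near 0.6, empirical risk near the noise rate
```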
Looking ahead to L3
• Optimisation and ML
  ∗ Max likelihood estimation
  ∗ Empirical risk minimisation
  ∗ … many others
• Cannot do ML without it
• We will cover a little (requires multivariate/vector calculus)
[Figure: optimisation over parameters $\theta_1$, $\theta_2$. Wikimedia Commons. Author: Nicoguaro (CC4)]
Is this “Just Theoretical”™?
• Recall Lecture 1
• Those evaluation metrics? They’re just estimators of a performance parameter
• Example: error
• Bias, variance, etc. indicate quality of approximation
Bias-variance decomposition
• Bias: $B_\theta(\hat\theta) = E_\theta[\hat\theta(X_1, \ldots, X_n)] - \theta$
• Variance: $\mathrm{Var}_\theta(\hat\theta) = E_\theta[(\hat\theta - E_\theta[\hat\theta])^2]$
• Bias-variance decomposition of square-loss risk:
  $E_\theta\left[(\theta - \hat\theta)^2\right] = [B(\hat\theta)]^2 + \mathrm{Var}_\theta(\hat\theta)$
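The decomposition follows by adding and subtracting $E_\theta[\hat\theta]$ inside the square; the cross term vanishes because the bias is a constant:

```latex
\begin{aligned}
E_\theta\!\left[(\theta-\hat\theta)^2\right]
 &= E_\theta\!\left[\big(\theta - E_\theta[\hat\theta] + E_\theta[\hat\theta] - \hat\theta\big)^2\right] \\
 &= \big(\theta - E_\theta[\hat\theta]\big)^2
    + 2\big(\theta - E_\theta[\hat\theta]\big)\, E_\theta\!\big[E_\theta[\hat\theta]-\hat\theta\big]
    + E_\theta\!\left[\big(E_\theta[\hat\theta]-\hat\theta\big)^2\right] \\
 &= [B(\hat\theta)]^2 + 0 + \mathrm{Var}_\theta(\hat\theta).
\end{aligned}
```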
Bayesian Statistics
Wherein unknown model parameters have associated distributions reflecting prior belief.
Bayesian statistics
• Probabilities correspond to beliefs
• Parameters
  ∗ Modeled as r.v.’s having distributions
  ∗ Prior belief in $\theta$ encoded by prior distribution $P(\theta)$
• Parameters are modeled like r.v.’s (even if not really random)
• Thus: data likelihood $P_\theta(X)$ written as conditional $P(X \mid \theta)$
  ∗ Rather than a point estimate $\hat\theta$, Bayesians update belief $P(\theta)$ with observed data to $P(\theta \mid X)$, the posterior distribution
[Portrait: Laplace]
Tools of probabilistic inference
• Bayesian probabilistic inference
∗ Start with prior P(𝜃𝜃) and likelihood P(𝑋𝑋|𝜃𝜃)
∗ Observe data 𝑋𝑋 = 𝑥𝑥
∗ Update prior to posterior P(𝜃𝜃|𝑋𝑋 = 𝑥𝑥)
[Portrait: Bayes]
• Primary tools to obtain the posterior (these are general tools of probability, not specific to Bayesian stats/ML)
  ∗ Bayes rule, which reverses the order of conditioning:
    $P(\theta \mid X = x) = \frac{P(X = x \mid \theta)\, P(\theta)}{P(X = x)}$
  ∗ Marginalisation, which eliminates unwanted variables:
    $P(X = x) = \sum_t P(X = x, \theta = t)$
    (this quantity is called the evidence)
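Both tools can be exercised numerically: discretise $\theta$ on a grid, multiply prior by likelihood pointwise, and normalise by the marginalised evidence. A minimal sketch assuming NumPy, using a coin-flip likelihood with a uniform prior as an illustrative choice (not the example on the next slide):

```python
import numpy as np

theta_grid = np.linspace(0.001, 0.999, 999)     # discretised parameter space
prior = np.ones_like(theta_grid)                # uniform prior P(theta), unnormalised
prior /= prior.sum()

x = np.array([1, 0, 1, 1, 0, 1])                # observed coin flips (illustrative data)
likelihood = theta_grid ** x.sum() * (1 - theta_grid) ** (len(x) - x.sum())  # P(X=x|theta)

evidence = np.sum(likelihood * prior)           # marginalisation: P(X=x) = sum_t P(X=x, theta=t)
posterior = likelihood * prior / evidence       # Bayes rule: P(theta|X=x)
print(posterior.sum())                          # sums to 1 on the grid
```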
Example
• We model $X \mid \theta$ as $N(\theta, 1)$ with prior $N(0, 1)$
• Suppose we observe $X = 1$, then update the posterior:
  $P(\theta \mid X = 1) = \frac{P(X = 1 \mid \theta)\, P(\theta)}{P(X = 1)}$
  $\propto P(X = 1 \mid \theta)\, P(\theta)$
  $= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(1-\theta)^2}{2}\right) \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\theta^2}{2}\right)$
  $\propto N(0.5,\, 0.5)$
NB: we are allowed to push constants out front and “ignore” them, as these get taken care of by normalisation.
The same update, step by step (the name of the game is to get the posterior into a recognisable form; an exp of a quadratic must be a Normal):
• Discard constants w.r.t. $\theta$ and collect the exp’s
• Arrange for the leading numerator term to be $\theta^2$ by moving its coefficient into the denominator
• Complete the square in the numerator, moving excess constants out front
• Factorise and recognise the result as an (unnormalised) Normal, reading off its constant and variance/std deviation
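The title-slide update note mentions a solution to this posterior update; filling in the steps listed above explicitly (standard completing-the-square algebra, so it should match the intended solution):

```latex
\begin{aligned}
P(\theta \mid X=1)
 &\propto \exp\!\left(-\tfrac{(1-\theta)^2}{2}\right)\exp\!\left(-\tfrac{\theta^2}{2}\right)
   && \text{discard constants w.r.t.\ } \theta \\
 &= \exp\!\left(-\tfrac{(1-\theta)^2 + \theta^2}{2}\right)
   && \text{collect the exponentials} \\
 &= \exp\!\left(-\tfrac{2\theta^2 - 2\theta + 1}{2}\right)
   = \exp\!\left(-\big(\theta^2 - \theta + \tfrac{1}{2}\big)\right)
   && \text{move the leading coefficient into the denominator} \\
 &= \exp\!\left(-\big(\theta - \tfrac{1}{2}\big)^2 - \tfrac{1}{4}\right)
   && \text{complete the square} \\
 &\propto \exp\!\left(-\frac{\big(\theta - \tfrac{1}{2}\big)^2}{2 \cdot \tfrac{1}{2}}\right)
   && \text{an unnormalised } N(0.5,\, 0.5).
\end{aligned}
```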
How Bayesians make point estimates
• They don’t, unless forced at gunpoint!
  ∗ The posterior carries full information, why discard it?
• But, there are common approaches
  ∗ Posterior mean: $E_{\theta \mid X}[\theta] = \int \theta \, P(\theta \mid X) \, d\theta$
  ∗ Posterior mode: $\arg\max_\theta P(\theta \mid X)$ (max a posteriori or MAP)
  ∗ There’re Bayesian decision-theoretic interpretations of these
[Figure: a posterior density $P(\theta \mid X)$ over $\theta$, with the mode $\theta_{\mathrm{MAP}}$ marked]
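Given a posterior, for instance one computed on a grid as in the earlier sketch, both point estimates are one-liners. A minimal sketch assuming NumPy; the prior and observed counts are illustrative choices.

```python
import numpy as np

theta_grid = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta_grid) / theta_grid.size     # uniform prior on the grid
likelihood = theta_grid ** 4 * (1 - theta_grid) ** 2    # e.g. 4 heads, 2 tails observed
posterior = likelihood * prior
posterior /= posterior.sum()                            # normalise on the grid

posterior_mean = np.sum(theta_grid * posterior)         # E[theta | X]
theta_map = theta_grid[np.argmax(posterior)]            # argmax_theta P(theta | X)
print(posterior_mean, theta_map)                        # ≈ 0.625 and ≈ 0.667 here
```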
MLE in Bayesian context
• MLE formulation: find parameters that best fit the data
  $\hat\theta \in \arg\max_\theta P(X = x \mid \theta)$
• Consider the MAP under a Bayesian formulation
  $\hat\theta \in \arg\max_\theta P(\theta \mid X = x) = \arg\max_\theta \frac{P(X = x \mid \theta)\, P(\theta)}{P(X = x)} = \arg\max_\theta P(X = x \mid \theta)\, P(\theta)$
• Prior $P(\theta)$ weights; MLE like uniform $P(\theta) \propto 1$
Frequentists vs Bayesians – Oh My!
• Two key schools of statistical thinking
  ∗ Decision theory complements both
• Past: controversy; animosity;
almost a ‘religious’ choice
• Nowadays: deeply connected
https://xkcd.com/1132/ CC-NC2.5
(Some) Categories of Probabilistic Models
Parametric vs non-parametric models
Parametric                                         | Non-Parametric
---------------------------------------------------|-----------------------------------------------------------
Determined by fixed, finite number of parameters   | Number of parameters grows with data, potentially infinite
Limited flexibility                                | More flexible
Efficient statistically and computationally        | Less efficient
Examples to come! There are non/parametric models in both the frequentist and Bayesian schools.
Generative vs. discriminative models
• X’s are instances, Y’s are labels (supervised setting!)
  ∗ Given: i.i.d. data (X1, Y1), …, (Xn, Yn)
  ∗ Find model that can predict Y of new X
• Generative approach
  ∗ Model full joint P(X, Y)
• Discriminative approach
  ∗ Model conditional P(Y|X) only
• Both have pros and cons
Examples to come! There are generative/discriminative models in both the frequentist and Bayesian schools.
Summary
• Philosophies: frequentist vs Bayesian
• Principles behind many learners:
  ∗ MLE
∗ Risk minimisation
∗ Probabilistic inference, MAP
• Parametric vs Non-parametric models
• Discriminative vs. Generative models
Next time: Linear regression (demos these ideas) and Optimisation (needed for MLE, ERM, etc.)
Workshops week #2: learning Bayes one coin flip at a time!