Textbook Reference
Duda et al. Pattern Classification, 2nd Edition (2000)
• This week: Sections 3.1–3.5
1/25
Recap: Bayes Decision Theory
Recap:
• Knowing class priors and class-conditioned data densities, we can infer posterior probabilities using Bayes' theorem:

P(ωj | x) = p(x | ωj) · P(ωj) / p(x)

• From there, optimal classifiers can be built, e.g. the maximum accuracy classifier:

argmax_j { P(ωj | x) } = argmax_j { log p(x | ωj) + log P(ωj) }

Question:
• Can we assume that we know in advance the data densities p(x | ωj), or should they be learned from the data?
2/25
Learning p(x | ωj ): Histogram Approach
Idea:
• Build a grid in input space and, for each class, build a histogram that counts the number of observations falling into each bin.
Problem:
• The number of bins is s^d, where s is the number of steps along each dimension and d is the number of dimensions.
• Many bins will have zero observations, not because the data probability is truly zero, but because only finitely many examples have been observed.

[Figure: histogram grid over a two-dimensional input space with axes x1, x2]
3/25
Learning p(x | ωj ): Model Approach
Idea:
• Assume that p(x | ωj) has a known parametric form, e.g. N(μj, Σj), where μj, Σj are the parameters of the distribution.
• Estimate the parameters of the distribution that best fit the few observations we have.
Advantage:
• With the model-based approach, we need to estimate a finite and typically small number of model parameters rather than the whole data distribution.

[Figure: parametric density model over a two-dimensional input space with axes x1, x2]
4/25
Maximum Likelihood Estimation
Goal: Let {p(x | θ), θ ∈ Θ} be a set of density functions (e.g. Gaussian), where θ denotes a parameter vector (e.g. mean / covariance). We would like to find the parameter θ that is most likely with respect to the data.
Maximum Likelihood (ML) Approach:
• Assume that we have a dataset D = (x1, …, xN).
• Assume that each example xk ∈ ℝ^d in the dataset has been generated independently from the same density function p(x | θ).
• In that case, the joint density function can be written as:

p(D | θ) = ∏_{k=1}^N p(xk | θ)

We also call this quantity the likelihood of θ w.r.t. the dataset D.
• The best parameter is then given by θ̂ = argmax_θ p(D | θ).
5/25
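The product form of the likelihood can be sketched in code. Below is a minimal Python illustration (an assumption, not part of the slides): a unit-variance Gaussian density is taken as the model, the likelihood p(D | θ) is the product of per-example densities, and a grid search over candidate parameters picks the best θ.

```python
import math

def density(x, theta):
    """Unit-variance Gaussian density p(x | theta) (model assumption)."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2 * math.pi)

def likelihood(data, theta):
    """p(D | theta) = product over independent examples of p(x_k | theta)."""
    result = 1.0
    for x in data:
        result *= density(x, theta)
    return result

data = [1.0, 2.0, 3.0]                    # toy dataset D
grid = [i / 100 for i in range(0, 401)]   # candidate theta in [0, 4]
best = max(grid, key=lambda t: likelihood(data, t))
print(best)  # the grid maximizer: the empirical mean 2.0
```

For larger datasets the product of many small densities underflows quickly, which is one practical reason to work with the log-likelihood instead, as the later slides do.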
Maximum Likelihood, Simple Gaussian Case
Assume the data density is modeled as a univariate Gaussian with unit variance and unknown mean θ. For a given data point xk , the density function can be written as:
p(xk | θ) = (1/√(2π)) exp(−½ (xk − θ)²)

Using the independence assumption, the joint density becomes:

p(D | θ) = ∏_{k=1}^N (1/√(2π)) exp(−½ (xk − θ)²)

Question: How do we find the θ that maximizes p(D | θ)?
6/25
Finding the Maximum of a Function
For some function f of interest (e.g. the data likelihood) we would like to compute:
θˆ=argmax f(θ) θ
When the function to optimize is continuously differentiable and concave, the maximum is found at the point where the gradient is zero.
∇θ f(θ) = (∂f/∂θ1, ∂f/∂θ2, …, ∂f/∂θh)ᵀ = 0
7/25
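The zero-gradient condition can be illustrated numerically. The sketch below (a toy example, not from the slides) uses the concave objective f(θ) = −(θ − 3)² + 1: since f is concave and continuously differentiable, its maximizer is exactly where the derivative vanishes, which we locate by bisection on the derivative.

```python
def f(theta):
    # A concave toy objective (assumption, not from the slides).
    return -(theta - 3.0) ** 2 + 1.0

def dfdtheta(theta, h=1e-6):
    # Central finite-difference approximation of the derivative.
    return (f(theta + h) - f(theta - h)) / (2 * h)

def bisect_root(g, lo, hi, tol=1e-10):
    """Find a zero of g on [lo, hi] by bisection, assuming a sign change."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# For a concave differentiable f, the maximizer is where the gradient vanishes.
theta_hat = bisect_root(dfdtheta, -10.0, 10.0)
print(theta_hat)  # close to 3.0, the maximizer of f
```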
Maximum Likelihood, Simple Gaussian Case
Observation: The function p(D | θ) is not concave in θ.
Idea: Transform the function in a way that (i) doesn't change its maximum and (ii) makes it concave.
Applying the logarithm ensures the two properties above:

log p(D | θ) = log ∏_{k=1}^N (1/√(2π)) exp(−½ (xk − θ)²)
             = ∑_{k=1}^N ( −½ log(2π) − ½ (xk − θ)² )
8/25
Maximum Likelihood, Simple Gaussian Case
Having found the log-likelihood w.r.t. D to be

log p(D | θ) = ∑_{k=1}^N ( −½ log(2π) − ½ (xk − θ)² ),

the best parameter θ̂ can then be found by solving ∇θ log p(D | θ) = 0, which yields θ̂ = (1/N) ∑_{k=1}^N xk, i.e. the empirical mean.
9/25
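As a quick check of this result, the following Python sketch (with toy data, an assumption beyond the slides) confirms that the empirical mean zeroes the gradient of the log-likelihood and that any perturbed parameter scores worse:

```python
import math

def log_likelihood(data, theta):
    """log p(D | theta) for the unit-variance Gaussian model."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2
               for x in data)

data = [0.5, 1.5, 2.0, 4.0]
theta_hat = sum(data) / len(data)  # empirical mean, the ML solution

# The gradient d/dtheta log p(D|theta) = sum_k (x_k - theta) vanishes at theta_hat ...
grad = sum(x - theta_hat for x in data)
print(abs(grad))  # 0.0

# ... and the log-likelihood is lower at any perturbed parameter.
assert log_likelihood(data, theta_hat) > log_likelihood(data, theta_hat + 0.1)
assert log_likelihood(data, theta_hat) > log_likelihood(data, theta_hat - 0.1)
```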
Maximum Likelihood, Multivariate Case
The log-likelihood of a multivariate Gaussian distribution w.r.t. D is given by

log p(D | θ) = ∑_{k=1}^N ( −½ log((2π)^d det(Σ)) − ½ (xk − μ)ᵀ Σ⁻¹ (xk − μ) )

Question: Assuming Σ is fixed, what is the optimal parameter vector μ?
10/25
Building a Classifier End-to-End with ML
Consider a labeled dataset containing two examples for the first class and four examples for the second class. Points are in ℕ0 and we assume they are generated from classes ω1, ω2 following geometric probability distributions of unknown parameters θ1 and θ2 respectively.

Known:
• P(ω1) = 0.5, P(ω2) = 0.5
• P(x = k | θ1) = (1 − θ1)^k θ1, P(x = k | θ2) = (1 − θ2)^k θ2
• D1 = {0, 2}, D2 = {0, 2, 0, 1}
Unknown:
• θ1, θ2
11/25
Building a Classifier End-to-End with ML
Recall:
• P(ω1) = 0.5, P(ω2) = 0.5
• P(x = k | θ1) = (1 − θ1)^k θ1, P(x = k | θ2) = (1 − θ2)^k θ2
• D1 = {0, 2}, D2 = {0, 2, 0, 1}
12/25
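For the geometric model above, zeroing the derivative of the log-likelihood ∑_k ( xk log(1 − θ) + log θ ) gives the closed form θ̂ = n / (n + ∑_k xk). The Python sketch below (illustrative, not from the slides) computes θ̂1, θ̂2 for D1, D2 and applies the resulting ML classifier with the equal priors P(ω1) = P(ω2) = 0.5:

```python
def ml_geometric(data):
    """ML estimate for P(x=k|theta) = (1-theta)^k * theta.
    Zeroing the log-likelihood derivative gives theta = n / (n + sum x_k)."""
    return len(data) / (len(data) + sum(data))

def p_geom(k, theta):
    return (1 - theta) ** k * theta

D1, D2 = [0, 2], [0, 2, 0, 1]
theta1, theta2 = ml_geometric(D1), ml_geometric(D2)
print(theta1, theta2)  # 0.5 and 4/7 ≈ 0.571

# With equal priors, the ML classifier picks the class whose estimated
# class-conditional probability at the observed x = k is larger.
for k in range(3):
    winner = 1 if p_geom(k, theta1) > p_geom(k, theta2) else 2
    print(k, winner)
```

With these estimates, x = 0 is assigned to ω2 while x = 1 and x = 2 go to ω1.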
From ML to Bayes Classifiers
ML classifier: Class posteriors are given by:

P(ωi | x, θ̂i) = p(x | ωi, θ̂i) P(ωi) / p(x | θ̂)

where θ̂i is the maximum likelihood parameter for class ωi.

Bayes classifier: Class posteriors are computed by bypassing the intermediate computation of the parameter θ̂, and instead conditioning directly on the data:

P(ωi | x, D) = p(x | D, ωi) p(D | ωi) P(ωi) / ( p(x | D) p(D) )

where

p(x | D, ωi) = ∫ p(x | θi, D, ωi) p(θi | D, ωi) dθi

and

p(θi | Di) = p(Di | θi) p(θi) / ∫ p(Di | θi) p(θi) dθi.
15/25
From ML to Bayes Classifiers
Simplified Bayes classifier: Applying the Bayes rule internally, assuming that each class generates its dataset independently, assuming that priors are known, leveraging the fact that new data points are also generated i.i.d., and making the dependence on ωi implicit through the dataset Di or parameter θi, we arrive after some steps at the simplified form:

P(ωi | x, D) = p(x | Di) P(ωi) / p(x | D)    (1)

where

p(x | Di) = ∫ p(x | θi) p(θi | Di) dθi    (2)

and

p(θi | Di) = p(Di | θi) p(θi) / ∫ p(Di | θi) p(θi) dθi.    (3)
We will use this simplified form in the following slides and also in the homeworks.
16/25
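Equations (2)-(3) can be evaluated numerically for the running geometric example with a uniform prior p(θi) on [0, 1]. The Python sketch below (the midpoint-rule grid is an implementation choice, not from the slides) approximates the predictive density p(x = k | Di):

```python
def p_geom(k, theta):
    return (1 - theta) ** k * theta

def predictive(k, data, n_grid=50000):
    """p(x=k | D_i) = integral of p(x=k|theta) p(theta|D_i) dtheta,
    with a uniform prior on [0, 1], approximated by a midpoint rule."""
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
    # Unnormalized posterior p(D_i|theta) p(theta); p(theta) = 1 on [0, 1].
    post = []
    for t in thetas:
        lik = 1.0
        for x in data:
            lik *= p_geom(x, t)
        post.append(lik)
    z = sum(post)  # normalizer (the common grid width cancels)
    return sum(p_geom(k, t) * w for t, w in zip(thetas, post)) / z

D1, D2 = [0, 2], [0, 2, 0, 1]
print(predictive(0, D1))  # ≈ 0.5
print(predictive(1, D1))  # ≈ 3/14 ≈ 0.214
print(predictive(1, D2))  # ≈ 2/9 ≈ 0.222
```

These predictive values replace the plug-in densities p(x | θ̂i) of the ML classifier when evaluating Eq. (1).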
Building a Classifier End-to-End with Bayes
Recall: we consider the following data generating process
Known:
• P(ω1) = 0.5, P(ω2) = 0.5
• P(x = k | θ1) = (1 − θ1)^k θ1, P(x = k | θ2) = (1 − θ2)^k θ2
• D1 = {0, 2}, D2 = {0, 2, 0, 1}
Unknown:
• θ1, θ2
17/25
Building a Classifier End-to-End with Bayes
Recall:
• P(ω1) = 0.5, P(ω2) = 0.5
• P(x = k | θ1) = (1 − θ1)^k θ1, P(x = k | θ2) = (1 − θ2)^k θ2
• D1 = {0, 2}, D2 = {0, 2, 0, 1}
… and we further set:
• θ1 ∼ U(0, 1), θ2 ∼ U(0, 1) (uniform priors)
18/25
ML vs. Bayes Classifiers
Observations:
• ML and Bayes classifiers do not always produce the same decisions.
• Bayes classifiers are influenced by the prior distribution and are consequently less sensitive to the data.
• Bayes classifiers will tend to favor the outcome that is supported by a larger amount of data.
22/25
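The first observation can be verified on the running geometric example. With a uniform prior, the posterior p(θi | Di) is a Beta distribution, and the predictive integral has a closed form in terms of the Beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b) (a standard identity, not derived in the slides). The sketch below shows the ML and Bayes decisions disagreeing at x = 1:

```python
import math

def beta_fn(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def ml_predictive(k, data):
    theta = len(data) / (len(data) + sum(data))  # ML estimate (cf. earlier slides)
    return (1 - theta) ** k * theta

def bayes_predictive(k, data):
    n, s = len(data), sum(data)
    # With a uniform prior: integral of (1-t)^k t * t^n (1-t)^s dt,
    # divided by the normalizer integral of t^n (1-t)^s dt.
    return beta_fn(n + 2, s + k + 1) / beta_fn(n + 1, s + 1)

D1, D2 = [0, 2], [0, 2, 0, 1]
k = 1
ml_choice = 1 if ml_predictive(k, D1) > ml_predictive(k, D2) else 2
bayes_choice = 1 if bayes_predictive(k, D1) > bayes_predictive(k, D2) else 2
print(ml_choice, bayes_choice)  # 1 2 -- with equal priors, the decisions differ
```

The Bayes classifier favors ω2 at x = 1 because the larger dataset D2 supports its parameter estimate more strongly.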
ML vs. Bayes: Gaussian Case
Consider the simple data density p(x | μ) = N(μ, σ²) with unknown parameter μ. We would like to compare the ML and Bayes approaches to estimate the parameter μ from some dataset D = {x1, …, xn}.

ML approach:
• The maximum likelihood estimate is given by μ̂ = (1/n) ∑_{k=1}^n xk, i.e. the empirical mean (cf. previous slides).

Bayes approach:
• Assuming some prior distribution p(μ) = N(μ0, σ0²), the posterior distribution can be computed as:

p(μ | D) = p(D | μ) p(μ) / p(D) = α ∏_{k=1}^n p(xk | μ) · p(μ)

where α is a normalizing factor.
23/25
ML vs. Bayes: Gaussian Case
Bayes approach (continued):
• The posterior distribution can be expanded and corresponds to a new Gaussian distribution N(μn, σn²) with parameters

μn = (σn² / (σ²/n)) μ̂ + (σn² / σ0²) μ0   and   σn² = ( 1/(σ²/n) + 1/σ0² )⁻¹

• We observe that (1) the mean estimate μn is pulled towards the prior if too little data is available, and (2) the mean estimate comes with a variance term σn² that can be useful for confidence estimation.
24/25
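A small numerical sketch (toy data and prior chosen for illustration, not from the slides) makes the shrinkage behavior concrete, using the μn and σn² formulas from this slide:

```python
data = [2.4, 1.9, 2.8]    # toy observations (assumption)
sigma2 = 1.0              # known data variance sigma^2
mu0, sigma02 = 0.0, 0.5   # prior N(mu0, sigma0^2)

n = len(data)
mu_ml = sum(data) / n     # ML estimate: the empirical mean

# Posterior parameters (formulas from the slide):
sigma_n2 = 1.0 / (n / sigma2 + 1.0 / sigma02)
mu_n = sigma_n2 * (n * mu_ml / sigma2 + mu0 / sigma02)

print(mu_ml, mu_n, sigma_n2)
# The posterior mean lies between the prior mean and the empirical mean,
# and the posterior variance is smaller than the prior variance.
assert min(mu0, mu_ml) < mu_n < max(mu0, mu_ml)
assert sigma_n2 < sigma02
```

As n grows, σ²/n shrinks, so μn approaches the empirical mean μ̂ and the influence of the prior vanishes.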
Summary
• In practice, parameters of the class-conditioned distributions are not known and must be inferred from some finite dataset.
• Two main approaches exist for learning these parameters: (1) maximum likelihood estimation and (2) Bayesian inference.
• Bayesian inference is often more difficult than maximum likelihood estimation (both analytically and computationally), because it requires integration.
• Bayesian inference readily incorporates useful functionalities (inclusion of priors, construction of confidence estimates).
25/25