
Probabilistic Generative Models
CS 542 Machine Learning

Today
• Probabilistic classification
• Linear Discriminant Analysis
• Generalized Linear Models

Probabilistic Classification
Data: D = {(x^(i), y^(i))}, with x ∈ R^p and y ∈ {1, …, K}
• Can model the output value directly, but having a probability is often more useful
• Bayes classifier: minimizes the probability of misclassification,
  ŷ = argmax_k p(Y = k | X = x)
• Want to model the conditional distribution p(Y = y | X = x), then assign a label based on it

Two approaches to classification
• Discriminative: represent p(Y | X) as a function of parameters θ, then learn θ from training data
• Generative: use Bayes' rule to write
  P(Y = k | X = x) = P(X = x | Y = k) P(Y = k) / P(X = x)
  then learn the parameters of the class-conditional density p(X | Y) and the class prior p(Y); the evidence p(X) can be ignored
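
As a rough sketch of the generative route (not from the slides; the 1-D Gaussian class-conditionals and all parameter values below are made up for illustration), the posterior follows directly from Bayes' rule once p(X | Y) and p(Y) have been estimated; p(X = x) is just the normalizer.

```python
# Minimal sketch: a generative classifier via Bayes' rule with hypothetical parameters.
import numpy as np
from scipy.stats import norm

priors = np.array([0.7, 0.3])                      # P(Y=0), P(Y=1)
class_conditionals = [norm(loc=0.0, scale=1.0),    # P(X | Y=0)
                      norm(loc=2.0, scale=1.0)]    # P(X | Y=1)

def posterior(x):
    """P(Y=k | X=x): multiply class-conditional density by prior, then normalize."""
    joint = np.array([p.pdf(x) for p in class_conditionals]) * priors  # P(X=x|Y=k) P(Y=k)
    return joint / joint.sum()                                         # divide by P(X=x)

x = 1.2
print(posterior(x), posterior(x).argmax())  # posterior over classes and the Bayes prediction
```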

Generative vs Discriminative
Intuition

• Suppose you own a cookie factory
• Want to detect bad cookies and discard them

Cookie Robots
"The Chef" (the generative robot)
– Can make good and bad cookies
– Compares a new cookie to those
– Decides if it is good or bad
"The Critic" (the discriminative robot)
– Cannot make cookies
– Has seen lots of good and bad cookies
– Decides if it is good or bad

Generative vs Discriminative
• Generative: model the class-conditional distribution of features, P(X | Y), together with the prior P(Y); e.g. LDA, Naïve Bayes. Can sample from the distribution.
• Discriminative: model the decision boundary P(Y | X) directly; e.g. Logistic Regression, SVM. Cannot sample from the distribution.
[Figure: "Good"/"Bad" points in (x1, x2), shown once with fitted class-conditional densities P(X | Y), P(Y) and once with only the decision boundary P(Y | X).]

Linear Discriminant Analysis Derivation
Slide credits: Sergio Bacallado

Bayes Classifier
Find an estimate of P(Y | X). Then, given an input x_0, we predict the response as in a Bayes classifier:
  ŷ_0 = argmax_y P(Y = y | X = x_0)

Generative Classifier
Instead of estimating P(Y | X), we will estimate:
1. P(X | Y): given the category, what is the distribution of the inputs.
2. P(Y): how likely each of the categories is.
Then, we use Bayes' rule to obtain the estimate:
  P(Y = k | X = x) = P(X = x | Y = k) P(Y = k) / Σ_j P(X = x | Y = j) P(Y = j)

Linear Discriminant Analysis (LDA)
Instead of estimating P(Y | X), we will estimate:
1. P(X = x | Y = k) = f_k(x), modeled as a multivariate normal distribution.
2. P(Y = k) = π_k, estimated by the fraction of training samples of class k.

LDA prior and class-conditional density
Suppose that:
► We know P(Y = k) = π_k exactly.
► P(X = x | Y = k) is multivariate normal with density:
  f_k(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ_k)^T Σ^{−1} (x − μ_k) )
  μ_k: mean of the inputs for category k.
  Σ: covariance matrix (common to all categories).
Then, what is the Bayes classifier?
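
As a minimal sketch (the values of μ_k and Σ below are hypothetical), the class-conditional density above can be evaluated directly:

```python
# Minimal sketch: evaluating the LDA class-conditional density f_k(x).
import numpy as np

def gaussian_density(x, mu_k, Sigma):
    """Multivariate normal density with mean mu_k and shared covariance Sigma."""
    p = len(mu_k)
    diff = x - mu_k
    norm_const = 1.0 / (np.power(2 * np.pi, p / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.inv(Sigma) @ diff        # (x - mu_k)^T Sigma^{-1} (x - mu_k)
    return norm_const * np.exp(-0.5 * quad)

# Hypothetical 2-D example
mu_k = np.array([1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(gaussian_density(np.array([0.5, 0.2]), mu_k, Sigma))
```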

LDA Solution
Slide credits: Sergio Bacallado

LDA has linear decision boundaries
By Bayes' rule, the probability of category k given the input x is:
  P(Y = k | X = x) = f_k(x) π_k / P(X = x)
The denominator does not depend on the response k, so we can write it as a constant:
  P(Y = k | X = x) = C × f_k(x) π_k
Now, expanding f_k(x):
  P(Y = k | X = x) = C π_k / ((2π)^{p/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ_k)^T Σ^{−1} (x − μ_k) )

LDA has linear decision boundaries
  P(Y = k | X = x) = C π_k / ((2π)^{p/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ_k)^T Σ^{−1} (x − μ_k) )
Now, let us absorb everything that does not depend on k into a constant C′:
  P(Y = k | X = x) = C′ π_k exp( −(1/2) (x − μ_k)^T Σ^{−1} (x − μ_k) )
and take the logarithm of both sides:
  log P(Y = k | X = x) = log C′ + log π_k − (1/2) (x − μ_k)^T Σ^{−1} (x − μ_k)
The constant log C′ is the same for every category k, so we want to find the maximum of the remaining terms over k.

LDA has linear decision boundaries
Goal: maximize the following over k:
  log π_k − (1/2) (x − μ_k)^T Σ^{−1} (x − μ_k)
  = log π_k − (1/2) [ x^T Σ^{−1} x + μ_k^T Σ^{−1} μ_k ] + x^T Σ^{−1} μ_k
  = C″ + log π_k − (1/2) μ_k^T Σ^{−1} μ_k + x^T Σ^{−1} μ_k
where C″ = −(1/2) x^T Σ^{−1} x does not depend on k. We define the objective:
  δ_k(x) = log π_k − (1/2) μ_k^T Σ^{−1} μ_k + x^T Σ^{−1} μ_k
At an input x, we predict the response with the highest δ_k(x).
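
A minimal sketch of this prediction rule, assuming the class means, shared covariance, and priors have already been estimated (all values below are hypothetical):

```python
# Minimal sketch: the LDA discriminant delta_k(x) and the prediction argmax_k delta_k(x).
import numpy as np

def lda_scores(x, means, Sigma, priors):
    """delta_k(x) = log pi_k - 0.5 mu_k^T Sigma^{-1} mu_k + x^T Sigma^{-1} mu_k for each class."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for mu_k, pi_k in zip(means, priors):
        scores.append(np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k)
    return np.array(scores)

# Hypothetical two-class example in 2-D
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = [0.6, 0.4]
x = np.array([1.0, 0.5])
scores = lda_scores(x, means, Sigma, priors)
print(scores, scores.argmax())   # predicted class = index of the largest delta_k(x)
```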

LDA has linear decision boundaries
What is the decision boundary? It is the set of points at which two classes do equally well:
  δ_k(x) = δ_l(x)
  log π_k − (1/2) μ_k^T Σ^{−1} μ_k + x^T Σ^{−1} μ_k = log π_l − (1/2) μ_l^T Σ^{−1} μ_l + x^T Σ^{−1} μ_l
This is a linear equation in x.

Estimating π_k
  π_k = #{i : y_i = k} / n
In English: the fraction of training samples of class k.

Estimating the parameters of f_k(x)
Estimate the center of each class, μ_k:
  μ_k = (1 / #{i : y_i = k}) Σ_{i : y_i = k} x_i
Estimate the common covariance matrix Σ:
► One dimension (p = 1):
  σ² = (1 / (n − K)) Σ_{k=1}^{K} Σ_{i : y_i = k} (x_i − μ_k)²
► Many dimensions (p > 1): compute the vectors of deviations (x_1 − μ_{y_1}), (x_2 − μ_{y_2}), …, (x_n − μ_{y_n}) and use them to estimate the covariance matrix Σ.
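
A minimal sketch of these estimators, assuming a data matrix X of shape (n, p) and integer labels y (the function and variable names here are mine, not the course's):

```python
# Minimal sketch: estimating pi_k, mu_k, and the pooled covariance Sigma from training data.
import numpy as np

def fit_lda_parameters(X, y):
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors, means = [], []
    Sigma = np.zeros((p, p))
    for k in classes:
        X_k = X[y == k]
        priors.append(len(X_k) / n)      # pi_k = #{i : y_i = k} / n
        mu_k = X_k.mean(axis=0)          # class mean mu_k
        means.append(mu_k)
        dev = X_k - mu_k                 # deviations (x_i - mu_{y_i})
        Sigma += dev.T @ dev             # accumulate within-class scatter
    Sigma /= (n - K)                     # pooled covariance, same normalizer as the p = 1 case
    return np.array(priors), np.array(means), Sigma

# Usage with toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
priors, means, Sigma = fit_lda_parameters(X, y)
print(priors, means, Sigma)
```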

LDA prediction
For an input x, predict the class with the largest:
  δ_k(x) = log π_k − (1/2) μ_k^T Σ^{−1} μ_k + x^T Σ^{−1} μ_k
The decision boundaries are defined by:
  log π_k − (1/2) μ_k^T Σ^{−1} μ_k + x^T Σ^{−1} μ_k = log π_l − (1/2) μ_l^T Σ^{−1} μ_l + x^T Σ^{−1} μ_l
[Figure: the decision boundaries appear as solid lines in the accompanying plot.]
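
For reference, the same model is available off the shelf. As a hedged usage sketch (assuming scikit-learn is installed), LinearDiscriminantAnalysis fits π_k, μ_k, and the shared Σ and predicts with the same rule:

```python
# Hedged usage sketch: LDA with scikit-learn on toy data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()      # shared covariance across classes, linear boundary
lda.fit(X, y)
print(lda.predict([[1.0, 0.5]]))        # class with the largest discriminant score
print(lda.predict_proba([[1.0, 0.5]]))  # posterior P(Y = k | X = x)
```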

Quadratic discriminant analysis (QDA)
The assumption that the inputs of every class have the same covariance Σ can be quite restrictive:
[Figure: example data plotted against X1 and X2.]

Quadratic discriminant analysis (QDA)
In quadratic discriminant analysis we estimate a mean μ_k and a covariance matrix Σ_k for each class separately.
Given an input, it is easy to derive an objective function:
  δ_k(x) = log π_k − (1/2) μ_k^T Σ_k^{−1} μ_k + x^T Σ_k^{−1} μ_k − (1/2) x^T Σ_k^{−1} x − (1/2) log |Σ_k|
This objective is now quadratic in x, and so are the decision boundaries.
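
A minimal sketch of the QDA score, again with made-up per-class parameters, showing where the quadratic term and the log |Σ_k| term enter:

```python
# Minimal sketch: the QDA discriminant delta_k(x), quadratic because each class has its own Sigma_k.
import numpy as np

def qda_score(x, mu_k, Sigma_k, pi_k):
    Sigma_inv = np.linalg.inv(Sigma_k)
    return (np.log(pi_k)
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            - 0.5 * x @ Sigma_inv @ x                  # quadratic in x
            - 0.5 * np.log(np.linalg.det(Sigma_k)))    # per-class normalizer

# Hypothetical two classes with different covariances
x = np.array([1.0, 0.5])
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
]
scores = [qda_score(x, mu, S, pi) for mu, S, pi in params]
print(scores, int(np.argmax(scores)))
```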

Quadratic discriminant analysis (QDA)
[Figure: decision boundaries plotted against X1 and X2.]
► Bayes boundary (dashed)
► LDA (dotted)
► QDA (solid)

Linear Discriminant Analysis
More intuition

Illustration of Decision Boundary
Setting δ_k(x) = δ_l(x) for two classes gives
  log(π_k / π_l) − (1/2) (μ_k^T Σ^{−1} μ_k − μ_l^T Σ^{−1} μ_l) + x^T Σ^{−1} (μ_k − μ_l) = 0
The labeled pieces in the slide figure:
• class prior log-ratio: log(π_k / π_l)
• diff. in class means: (μ_k − μ_l)
• input covariance: Σ
• constant: the terms that do not depend on x
Can re-write as θ_0 + x^T θ = 0, with θ = Σ^{−1} (μ_k − μ_l): a linear boundary separating the "Good" and "Bad" points in (x1, x2), as in the sketch below.
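
A minimal sketch of assembling θ_0 and θ for the two-class boundary from assumed values of μ_1, μ_2, Σ, and the priors:

```python
# Minimal sketch: the two-class LDA boundary theta_0 + x^T theta = 0, from delta_1(x) = delta_2(x).
import numpy as np

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])   # hypothetical class means
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])              # shared input covariance
pi1, pi2 = 0.5, 0.5                                     # class priors

Sigma_inv = np.linalg.inv(Sigma)
theta = Sigma_inv @ (mu1 - mu2)                         # direction: Sigma^{-1} (mu_1 - mu_2)
theta0 = (np.log(pi1 / pi2)                             # class prior log-ratio
          - 0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2))

x = np.array([1.0, 0.5])
print(theta0 + x @ theta)   # > 0: class 1 side of the hyperplane, < 0: class 2 side
```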

Effect of Covariance Matrix
• The covariance matrix determines the shape of the Gaussian density, so
• in LDA, the Gaussian densities for different classes have the same shape but are shifted versions of each other (different mean vectors).

Effect of Class Prior
• What effect does the prior p(class), or π_k, have?
• Let's look at an example for two classes, where the prior enters through the class prior log-ratio…

p(C1 | x) ∝ p(x | C1) p(C1)
Model the class-conditional probability of a 2D feature vector for class 1 as a multivariate Gaussian density.

p(C2 | x) ∝ p(x | C2) p(C2)
Now consider class 2 with a similar Gaussian conditional density, which has the same covariance but a different mean.

p(C1 | x) ∝ p(x | C1) p(C1),  p(C2 | x) ∝ p(x | C2) p(C2)
Decision hypersurface: if the priors for each class are the same (i.e. 0.5), then the decision hypersurface cuts directly between the two means, with a direction aligned with the elliptical contours of the Gaussian densities shaped by their (identical) covariance matrices.

p(C1 | x) ∝ p(x | C1) p(C1),  p(C2 | x) ∝ p(x | C2) p(C2)
Decision hypersurface: now if the priors for each class are unequal, the decision hypersurface cuts between the two means with the same direction as before, but it is located further from the mean of the more likely class. This biases the predictor in favor of the more likely class.

Posterior probability p(C1 | x) for two classes C1, C2
Bishop, Figure 4.10: The left-hand plot shows the class-conditional densities for two classes, denoted red and blue. On the right is the corresponding posterior probability p(C1 | x), which is given by a logistic sigmoid of a linear function of x. The surface in the right-hand plot is coloured using a proportion of red ink given by p(C1 | x) and a proportion of blue ink given by p(C2 | x) = 1 − p(C1 | x).

More than two classes, unequal covariances
• more general case of unequal covariances (here shown for four classes)
• QDA
• the decision hypersurface is no longer a hyperplane, i.e. it is nonlinear.

Generative vs Discriminative
• Generative: model the class-conditional distribution of features
  • Pros: can use it to generate new features
  • Cons: more parameters; e.g. LDA has O(p²) parameters in the feature dimension p (it estimates a p × p covariance matrix)
• Discriminative: model the decision boundary directly, e.g. Logistic Regression
  • Pros: fewer parameters; e.g. LogReg has O(p)
  • Cons: cannot generate new features

Do they produce the same classifier?
• The generative LDA approach will estimate μ_1, μ_2, and Σ to maximize the joint likelihood p(x, y) and then compute the linear decision boundary, i.e., θ_j and θ_0 are functions of those parameters. In particular, θ_j and θ_0 are not completely independent.
• The discriminative approach (logistic regression) will directly estimate θ_j and θ_0, without assuming any constraints between them, by maximizing the conditional likelihood p(y | x).
• The two methods will in general give different decision boundaries, even though both are linear.

LDA for image classification
• Discriminative Decorrelation for Clustering and Classification Hariharan, Malik and Ramanan, 2012
• Showed that LDA requires a lot less training than discriminative models for this task
Learned LDA model for class “bicycle”
http://home.bharathh.info/pubs/pdfs/BharathECCV2012.pdf

Probabilistic Generative Models
Generalized Linear Models

Model p(Y|X) as sigmoid
For two classes C1, C2:
  p(C1 | x) = p(x | C1) p(C1) / ( p(x | C1) p(C1) + p(x | C2) p(C2) ) = 1 / (1 + exp(−a)) = σ(a)
where we used
  a = ln[ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ]
and σ(a) = 1 / (1 + e^{−a}) is the sigmoid/logistic function.
When a ≥ 0, σ(a) ≥ 0.5, and a ≥ 0 is equivalent to p(C1 | x) ≥ p(C2 | x).
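
A minimal sketch of this identity (the 1-D Gaussian class-conditionals and priors below are made up): computing the log-odds a and passing it through σ reproduces p(C1 | x).

```python
# Minimal sketch: the two-class posterior as a sigmoid of the log-odds a(x).
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

p1, p2 = 0.5, 0.5              # class priors P(C1), P(C2)
px_c1 = norm(0.0, 1.0)         # p(x | C1)
px_c2 = norm(2.0, 1.0)         # p(x | C2)

x = 0.8
a = np.log((px_c1.pdf(x) * p1) / (px_c2.pdf(x) * p2))   # log-odds a
print(sigmoid(a))              # equals p(C1 | x); >= 0.5 exactly when a >= 0
```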

Generalized Linear Models
If a(x) is a linear function of x, then we have a generalized linear model.
Generative
• Estimate the probability densities
• Leads to Linear Discriminant Analysis
Discriminative
• Fit a parametrized function, e.g. a(x) = θ^T x
• Logistic Regression, neural network
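
A minimal sketch of the discriminative route, with hypothetical toy data: fit a(x) = θ^T x by gradient ascent on the conditional log-likelihood, i.e. logistic regression.

```python
# Minimal sketch: logistic regression, a(x) = theta^T x fed through the sigmoid.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
X = np.hstack([np.ones((100, 1)), X])        # prepend 1 so theta[0] acts as theta_0
y = np.array([0] * 50 + [1] * 50)

theta = np.zeros(3)
lr = 0.1
for _ in range(500):
    a = X @ theta                            # linear function a(x) = theta^T x
    p = 1.0 / (1.0 + np.exp(-a))             # sigma(a) = p(C1 | x)
    theta += lr * X.T @ (y - p) / len(y)     # gradient of the conditional log-likelihood

print(theta)                                 # learned boundary theta_0 + x^T theta = 0
```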