
Announcements
Reminder: the midterm goes out on Thursday and will be available on Blackboard (it must be completed by Friday).
90 minutes, finished in one sitting once you start. Closed book.
Have scratch paper ready: some problems ask you to write out steps/show your work on the scratch paper. Make sure to label which problems you are showing work for. Upload the scratch paper via the midterm form on Piazza right after you complete your test, or it won't be counted!
• ps4 self-grading form out, due 10/30
• Lab this week – midterm review

Probabilistic Generative Models
CS 542 Machine Learning

Today
• Probabilistic classification
• Linear Discriminant Analysis

Probabilistic Classification
Data: D = {(x^(i), y^(i))}, with inputs x ∈ R^p and labels y ∈ {1, …, K}
• We could model the output value directly, but having a probability is often more useful
• Bayes classifier: minimizes the probability of misclassification,
ŷ = argmax_k p(Y = k | X = x)
• So we want to model the conditional distribution p(Y = y | X = x), then assign the label based on it

Two approaches to classification
• Discriminative: represent p(Y | X) as a function of parameters θ, then learn θ from training data
• Generative: use Bayes' rule to write

P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\, P(Y = k)}{P(X = x)}

then learn the parameters of the class-conditional density p(X | Y) and the class prior p(Y); the denominator p(X) can be ignored when classifying

Generative vs Discriminative
Intuition

Cookie Robots
• Suppose you own a cookie factory
• You want to detect bad cookies and discard them

Cookie Robots
P(X | Y), P(Y) — “The Chef”
– Can make good and bad cookies
– Compares a new cookie to those
– Decides if it is good or bad

P(Y | X) — “The Critic”
– Cannot make cookies
– Has seen lots of good and bad cookies
– Decides if it is good or bad

Generative vs Discriminative
P(X | Y), P(Y) — Generative: model the class-conditional distribution of the features, e.g. LDA, Naïve Bayes. Can sample from the distribution.
P(Y | X) — Discriminative: model the decision boundary directly, e.g. Logistic Regression, SVM. Cannot sample from the distribution.
[Figure: “Good” and “Bad” cookies plotted in (x1, x2) feature space, contrasting class-conditional densities with a decision boundary.]

Linear Discriminant Analysis Derivation
Slide credits: Sergio Bacallado

Bayes Classifier
Find an estimate of P(Y | X). Then, given an input x_0, we predict the output as a Bayes classifier does:

\hat{y}_0 = \arg\max_y P(Y = y \mid X = x_0)

Generative Classifier
Instead of estimating P(Y | X) directly, we will estimate:
1. P(X | Y): given the category, what is the distribution of the inputs.
2. P(Y): how likely each of the categories is.
Then we use Bayes' rule to obtain the estimate:

P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\, P(Y = k)}{P(X = x)} = \frac{P(X = x \mid Y = k)\, P(Y = k)}{\sum_j P(X = x \mid Y = j)\, P(Y = j)}
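To make the Bayes-rule step concrete, here is a minimal numerical sketch (the likelihood and prior values are made up purely for illustration): the posterior is the likelihood–prior product, normalized by its sum over classes.

```python
import numpy as np

# Hypothetical class-conditional likelihoods p(x | Y = k) evaluated at one
# input x, and class priors p(Y = k), for K = 3 classes (illustrative numbers).
likelihoods = np.array([0.02, 0.10, 0.05])   # p(x | Y = k)
priors      = np.array([0.50, 0.30, 0.20])   # p(Y = k)

# Bayes' rule: posterior is proportional to likelihood * prior; dividing by
# the sum over classes implements the denominator sum_j p(x | Y = j) p(Y = j).
unnormalized = likelihoods * priors
posterior = unnormalized / unnormalized.sum()

y_hat = np.argmax(posterior)   # Bayes classifier: pick the most probable class
print(posterior, "-> predict class", y_hat)
```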

Linear Discriminant Analysis (LDA)
Instead of estimating P(Y | X) directly:
1. We model P(X = x | Y = k) = f_k(x) as a multivariate normal distribution.
2. P(Y = k) = π_k is estimated by the fraction of training samples of class k.

LDA prior and class-conditional density
Suppose that:
► We know P(Y = k) = π_k exactly.
► P(X = x | Y = k) is multivariate normal with density

f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

μ_k : mean of the inputs for category k.
Σ : covariance matrix (common to all categories).
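As a sanity check of this formula, here is a small sketch (with made-up values for μ_k and Σ) that evaluates f_k(x) directly and compares it against scipy's multivariate normal density:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for one class k (made-up numbers), with p = 2 features.
mu_k  = np.array([1.0, 2.0])                  # class mean
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])    # shared covariance matrix
x     = np.array([0.5, 1.5])

# Density computed from the formula on the slide.
p = len(x)
diff = x - mu_k
norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)
f_k = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Cross-check against scipy's implementation of the same density.
assert np.isclose(f_k, multivariate_normal.pdf(x, mean=mu_k, cov=Sigma))
print(f_k)
```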

LDA Solution
Slide credits: Sergio Bacallado

LDA has linear decision boundaries
By Bayes' rule, the probability of category k given the input x is

P(Y = k \mid X = x) = \frac{f_k(x)\, \pi_k}{P(X = x)}

The denominator does not depend on the output k, so we can write it as a constant:

P(Y = k \mid X = x) = C \, f_k(x)\, \pi_k

Now, expanding f_k(x):

P(Y = k \mid X = x) = \frac{C\, \pi_k}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left( -\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

Let us absorb everything that does not depend on k into a constant C′:

P(Y = k \mid X = x) = C'\, \pi_k\, \exp\!\left( -\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

and take the logarithm of both sides:

\log P(Y = k \mid X = x) = \log C' + \log \pi_k - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)

The term log C′ is the same for every category k, so we want to find the maximum of the remaining terms over k.

LDA has linear decision boundaries
Goal: maximize the following over k:

\log \pi_k - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)
= \log \pi_k - \tfrac{1}{2}\left[ x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k
= C'' + \log \pi_k - \tfrac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k

We define the objective

\delta_k(x) = \log \pi_k - \tfrac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k

At an input x, we predict the class with the highest δ_k(x).
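A minimal numpy sketch of this rule (with placeholder parameter values, not estimated from data): evaluate δ_k(x) for every class and take the argmax.

```python
import numpy as np

def lda_discriminants(x, pis, mus, Sigma):
    """delta_k(x) = log pi_k - 1/2 mu_k^T Sigma^{-1} mu_k + x^T Sigma^{-1} mu_k."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k
                     for pi_k, mu_k in zip(pis, mus)])

# Illustrative two-class example with made-up parameters (p = 2).
pis   = [0.6, 0.4]
mus   = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.eye(2)

x = np.array([1.5, 0.5])
deltas = lda_discriminants(x, pis, mus, Sigma)
print(deltas, "-> predict class", np.argmax(deltas))
```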

LDA has linear decision boundaries
What is the decision boundary? It is the set of points at which two classes do equally well:

\delta_k(x) = \delta_l(x)

\log \pi_k - \tfrac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k = \log \pi_l - \tfrac{1}{2}\, \mu_l^T \Sigma^{-1} \mu_l + x^T \Sigma^{-1} \mu_l

This is a linear equation in x.

Estimating π_k

\hat{\pi}_k = \frac{\#\{i : y_i = k\}}{n}

In English: the fraction of training samples of class k.

Estimating the parameters of f_k(x)
Estimate the center of each class, μ_k:

\hat{\mu}_k = \frac{1}{\#\{i : y_i = k\}} \sum_{i : y_i = k} x_i

Estimate the common covariance matrix Σ:
► One dimension (p = 1):

\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2

► Many dimensions (p > 1): compute the vectors of deviations (x_1 - \hat{\mu}_{y_1}), (x_2 - \hat{\mu}_{y_2}), \ldots, (x_n - \hat{\mu}_{y_n}) and use them to form an estimate of the covariance matrix Σ.
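A minimal sketch of these estimators in numpy (a sketch under the slide's assumptions, using the n − K divisor for the pooled covariance; not a reference implementation):

```python
import numpy as np

def fit_lda_params(X, y):
    """Estimate class priors pi_k, class means mu_k, and the pooled covariance Sigma."""
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)

    pis, mus = [], []
    scatter = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        pis.append(len(Xk) / n)          # pi_k: fraction of samples in class k
        mu_k = Xk.mean(axis=0)           # mu_k: mean of the inputs for class k
        mus.append(mu_k)
        dev = Xk - mu_k                  # deviations from the class mean
        scatter += dev.T @ dev           # accumulate within-class scatter
    Sigma = scatter / (n - K)            # common covariance, divisor n - K
    return np.array(pis), np.array(mus), Sigma
```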

LDA prediction
For an input x, predict the class with the largest

\delta_k(x) = \log \pi_k - \tfrac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k

The decision boundaries are defined by

\log \pi_k - \tfrac{1}{2}\, \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k = \log \pi_l - \tfrac{1}{2}\, \mu_l^T \Sigma^{-1} \mu_l + x^T \Sigma^{-1} \mu_l

(the solid lines in the accompanying figure).
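For comparison, the same model is available off the shelf; a usage sketch with scikit-learn's LinearDiscriminantAnalysis (its solver differs from the closed-form steps above, but it implements the same model):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Tiny synthetic two-class dataset (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([2.0, 1.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.5, 0.5]]))        # predicted class label for a new input
print(lda.predict_proba([[1.5, 0.5]]))  # posterior P(Y = k | X = x)
```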

Quadratic discriminant analysis (QDA)
The assumption that the inputs of every class have the same covariance Σ can be quite restrictive:
[Figure: two two-class datasets plotted in (X1, X2) feature space.]

Quadratic discriminant analysis (QDA)
In quadratic discriminant analysis we estimate a mean μ_k and a covariance matrix Σ_k for each class separately.
Given an input, it is easy to derive an objective function:

\delta_k(x) = \log \pi_k - \tfrac{1}{2}\, \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k - \tfrac{1}{2}\, x^T \Sigma_k^{-1} x - \tfrac{1}{2} \log |\Sigma_k|

This objective is now quadratic in x, and so are the decision boundaries.
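A minimal numpy sketch of the QDA discriminant (placeholder parameters; the only change from LDA is the class-specific covariance Σ_k and the two extra terms):

```python
import numpy as np

def qda_discriminant(x, pi_k, mu_k, Sigma_k):
    """delta_k(x) for QDA, with a class-specific covariance Sigma_k (quadratic in x)."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    return (np.log(pi_k)
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            - 0.5 * x @ Sigma_inv @ x
            - 0.5 * np.log(np.linalg.det(Sigma_k)))

# Two classes with different covariances (made-up numbers, p = 2).
x  = np.array([1.0, 0.5])
d0 = qda_discriminant(x, 0.5, np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]))
d1 = qda_discriminant(x, 0.5, np.array([2.0, 1.0]), np.array([[2.0, 0.3], [0.3, 0.5]]))
print("predict class", int(d1 > d0))
```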

Quadratic discriminant analysis (QDA)
[Figure: decision boundaries for two two-class datasets in (X1, X2) feature space]
► Bayes boundary (– – –)
► LDA (· · · · · ·)
► QDA (——)

Linear Discriminant Analysis
More intuition

Illustration of Decision Boundary
For two classes (e.g. “Good” vs. “Bad” in the (x1, x2) feature space), the linear boundary combines the class prior log-ratio, a constant, the input covariance, and the difference in class means.
We can re-write it as θ_0 + x^T θ = 0.
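To make the re-write explicit, here is a short derivation (it follows directly from setting δ_k(x) = δ_l(x) on the earlier slides; the grouping into θ_0 and θ is just bookkeeping):

\delta_k(x) - \delta_l(x) = \underbrace{\log\frac{\pi_k}{\pi_l} - \tfrac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \tfrac{1}{2}\,\mu_l^T \Sigma^{-1} \mu_l}_{\theta_0} + x^T \underbrace{\Sigma^{-1}(\mu_k - \mu_l)}_{\theta} = 0

So θ = Σ^{-1}(μ_k − μ_l) points along the difference in class means, whitened by the shared input covariance, while the class prior log-ratio only shifts the offset θ_0.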

Effect of Covariance Matrix
• The covariance matrix determines the shape of the Gaussian density, so
• in LDA, the Gaussian densities for the different classes have the same shape, but are shifted versions of each other (different mean vectors).

Effect of Class Prior
• What effect does the prior p(class), i.e. π_k, have? It enters the decision boundary only through the class prior log-ratio term.
• Let's look at an example for 2 classes…

p(C_1 | x) ∝ p(x | C_1) p(C_1)
Model the class-conditional probability of a 2D feature vector for class 1 as a multivariate Gaussian density.

p(C_2 | x) ∝ p(x | C_2) p(C_2)
Now consider class 2 with a similar Gaussian conditional density, which has the same covariance but a different mean.

p(C_1 | x) ∝ p(x | C_1) p(C_1)    p(C_2 | x) ∝ p(x | C_2) p(C_2)
decision hypersurface
If the priors for the two classes are equal (i.e. 0.5 each), the decision hypersurface cuts directly between the two means, with its direction parallel to the elliptical contours of the Gaussian densities, which are shaped by their (identical) covariance matrices.

p(C_1 | x) ∝ p(x | C_1) p(C_1)    p(C_2 | x) ∝ p(x | C_2) p(C_2)
decision hypersurface
Now, if the priors for the two classes are unequal, the decision hypersurface cuts between the two means with the same direction as before, but is located further from the more likely class. This biases the predictor in favor of the more likely class.
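A one-dimensional worked example (two classes with a shared variance σ², plugging into the δ_k formula from the derivation above) makes the shift explicit:

\delta_1(x) = \delta_2(x) \;\Longrightarrow\; x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_1 - \mu_2}\,\log\frac{\pi_2}{\pi_1}

With equal priors the boundary sits at the midpoint of the two means; if class 1 is more likely (π_1 > π_2, with μ_1 > μ_2), the second term is negative and the boundary x* moves toward μ_2, i.e. away from the more likely class.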

Posterior probability p(C_1 | x) for two classes C_1, C_2
Bishop, Figure 4.10: the left-hand plot shows the class-conditional densities for two classes, denoted red and blue. On the right is the corresponding posterior probability p(C_1 | x), which is given by a logistic sigmoid of a linear function of x. The surface in the right-hand plot is coloured using a proportion of red ink given by p(C_1 | x) and a proportion of blue ink given by p(C_2 | x) = 1 − p(C_1 | x).

More than two classes, unequal covariances
• the more general case of unequal covariances (here shown for four classes)
• QDA
• the decision hypersurface is no longer a hyperplane, i.e. it is nonlinear.

Generative vs Discriminative
• Generative: model the class-conditional distribution of the features
• Pros: can use it to generate (sample) new feature vectors
• Cons: more parameters, e.g. LDA has O(p^2) parameters in the feature dimension p (from the covariance matrix)
• Discriminative: model the decision boundary directly, e.g. Logistic Regression
• Pros: fewer parameters, e.g. logistic regression has O(p)
• Cons: cannot generate new features

Do they produce the same classifier?
• The generative LDA approach estimates μ_1, μ_2, and Σ to maximize the joint likelihood p(x, y), and then computes the linear decision boundary; i.e., θ and θ_0 are functions of those parameters. In particular, θ and θ_0 are not completely independent.
• The discriminative approach (logistic regression) directly estimates θ and θ_0, without assuming any constraints between them, by maximizing the conditional likelihood p(y | x).
• The two methods will generally give different decision boundaries, even though both are linear.
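A quick empirical check of this point (a sketch using scikit-learn on synthetic data; the exact numbers will vary): fit both models on the same data and compare their linear boundary parameters.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with a shared covariance (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([2.0, 1.0], 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
logreg = LogisticRegression().fit(X, y)

# Both boundaries have the form theta_0 + x^T theta = 0, but the fitted
# parameters generally differ because the two methods maximize different likelihoods.
print("LDA:    theta =", lda.coef_[0], " theta_0 =", lda.intercept_[0])
print("LogReg: theta =", logreg.coef_[0], " theta_0 =", logreg.intercept_[0])
```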

LDA for image classification
• “Discriminative Decorrelation for Clustering and Classification”, Hariharan, Malik and Ramanan, 2012
• Showed that LDA requires much less training computation than discriminative models for this task
Learned LDA model for class “bicycle”
http://home.bharathh.info/pubs/pdfs/BharathECCV2012.pdf

Next Class
Midterm! (no lecture)
Next Tuesday:
Probabilistic Models II: Bayesian Methods
priors over parameters; Bayesian linear regression
Reading: Bishop Ch 2.3