
CSC 311: Introduction to Machine Learning
Lectures 5 and 6 – Probabilistic Models
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews

Goal: A more focused discussion on models that explicitly represent probabilities
MLE review
Discriminative vs. Generative models
Generative models
- Naïve Bayes
- Gaussian Discriminant Analysis (and Linear Discriminant Analysis)
Bayesian approach to estimation and inference
Maximum A-Posteriori Estimation (MAP) of parameters

Recall: Maximum Likelihood (MLE)
We have seen before that some ML algorithms can be derived using the Maximum Likelihood Estimation (MLE) principle.
- Example: Regression with squared loss could be obtained as the MLE under a Gaussian noise model
Let’s try to understand it better by starting with a simple example: Estimating the parameter of a biased coin
- You flip a coin N = 100 times. It lands heads N_H = 55 times and tails N_T = 45 times.
- What is the probability it will come up heads if we flip again?
Model: flips are independent Bernoulli random variables with parameter θ.
- Assume the observations are independent and identically distributed (i.i.d.).

Maximum Likelihood
The likelihood function is the density of the observed data, as a function of parameters θ.
In our case, it is the probability of a particular sequence of H's and T's. Under the Bernoulli model with i.i.d. observations, let x_i be the number of heads in the i-th flip (so x_i ∈ {0, 1}). Then
p(x_i = 1 | θ) = θ   and   p(x_i = 0 | θ) = 1 − θ,
which can be written more compactly as
p(x_i | θ) = θ^{x_i} (1 − θ)^{1−x_i}.
The likelihood is given as
L(θ) = p(x_1, …, x_N | θ) = ∏_{i=1}^{N} θ^{x_i} (1 − θ)^{1−x_i} = θ^{N_H} (1 − θ)^{N_T},
where N_H = Σ_i x_i and N_T = N − Σ_i x_i.
We usually work with log-likelihoods:
ℓ(θ) = log L(θ) = N_H log θ + N_T log(1 − θ).

Maximum Likelihood
Good values of θ should assign high probability to the observed data. This motivates the maximum likelihood criterion, i.e., choosing θ that maximizes the likelihood.
We can set the derivative of the log-likelihood to zero to find its maximizer:
dℓ/dθ = d/dθ [ N_H log θ + N_T log(1 − θ) ] = N_H/θ − N_T/(1 − θ)
Setting this to zero gives the maximum likelihood estimate:
θ̂_ML = N_H / (N_H + N_T).
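As a quick sanity check, here is a minimal Python sketch (treating the flip counts from the example above as data) that computes the closed-form estimate:

```python
# Minimal sketch: Bernoulli MLE from a sequence of coin flips (illustrative data).
flips = [1] * 55 + [0] * 45      # 1 = heads, 0 = tails; N = 100

N_H = sum(flips)                 # number of heads
N_T = len(flips) - N_H           # number of tails

theta_ml = N_H / (N_H + N_T)     # closed-form MLE derived above
print(theta_ml)                  # 0.55
```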
With this reminder, we are ready to talk about probabilistic models.

Generative vs Discriminative
Two approaches to classification:
Discriminative approach: estimate parameters of decision boundary/class separator directly from labeled examples.
- Tries to solve: How do I separate the classes?
- Learn p(t | x) directly (logistic regression models)
- Learn mappings from inputs to classes (logistic regression, decision trees, etc.)
Generative approach: model the distribution of inputs generated from the class (Bayes classifier).
- Tries to solve: What does each class "look" like?
- Build a model of p(x | t)
- Apply Bayes' Rule

A Generative Model: Bayes Classifier
Aim to classify text into spam (c = 1) or not-spam (c = 0).
Example: "You are one of the very few who have been selected as a winners for the free $1000 Gift Card."
Use bag-of-words features, get binary vector x for each email
Vocabulary:
- "a": 1
- "car": 0
- "card": 1
- "win": 0
- "winner": 1
- "winter": 0
- …
- "you": 1
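For concreteness, here is a minimal sketch of building such a binary vector; the toy vocabulary and the crude tokenization are illustrative, not the course's actual preprocessing:

```python
# Minimal sketch: binary bag-of-words features over a toy vocabulary.
vocab = ["a", "car", "card", "win", "winner", "winter", "you"]

email = ("You are one of the very few who have been selected as "
         "a winners for the free $1000 Gift Card.")

def normalize(word):
    word = word.strip(".,$").lower()
    # crude plural stripping so that "winners" matches "winner"
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

present = {normalize(w) for w in email.split()}
x = [1 if v in present else 0 for v in vocab]
print(dict(zip(vocab, x)))
# {'a': 1, 'car': 0, 'card': 1, 'win': 0, 'winner': 1, 'winter': 0, 'you': 1}
```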

Bayes Classifier
Given features x = [x_1, x_2, …, x_D]^⊤, we want to compute class probabilities using Bayes' Rule:

p(c | x) = p(x, c) / p(x) = p(x | c) p(c) / p(x)

i.e., the probability of the class given the words (left-hand side) in terms of the probability of the words given the class, p(x | c). Each of these terms has a specific name:

posterior = (class likelihood × prior) / evidence

How can we compute p(x) for the two-class case? (Do we need to?)
p(x) = p(x | c = 0) p(c = 0) + p(x | c = 1) p(c = 1)
To compute p(c|x) we need: p(x|c) and p(c)

Sidenote: Marginalization
The computation of p(x) involves marginalizing over c:

p(x) = Σ_c p(x, c)   (a sum over the possible ways in which x can occur)
     = Σ_c p(x | c) p(c)

When c is binary,

p(x) = Σ_c p(x | c) p(c) = p(x | c = 0) p(c = 0) + p(x | c = 1) p(c = 1)

Naïve Bayes
Assume that we have two classes: spam and non-spam. We have a dictionary of D words, and binary features x = [x_1, …, x_D] saying whether each word appears in the e-mail.
If we define a joint distribution p(c, x_1, …, x_D), this gives enough information to determine p(c) and p(x | c).
Problem: specifying a joint distribution over D + 1 binary variables requires 2^{D+1} − 1 entries. This is computationally prohibitive and would require an absurd amount of data to fit.
We’d like to impose structure on the distribution such that:
- it can be compactly represented
- learning and inference are both tractable

Naïve Bayes
Naïve assumption: Naïve Bayes assumes that the word features x_i are conditionally independent given the class c.
- This means x_i and x_j are independent conditioned on the class label c, i.e., p(x_i, x_j | c) = p(x_i | c) p(x_j | c) for i ≠ j.
- Note: This doesn't mean they are independent.
- Therefore, we have
  p(c, x_1, …, x_D) = p(c) p(x_1 | c) ⋯ p(x_D | c).
Compact representation of the joint distribution
- Prior probability of class: p(c = 1) = π (e.g., spam e-mail)
- Conditional probability of word feature given class: p(x_j = 1 | c) = θ_jc (e.g., word "price" appearing in spam)
- 2D + 1 parameters total (versus 2^{D+1} − 1 before)
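To see the savings concretely, a toy calculation (the vocabulary size D below is just an illustrative number):

```python
# Toy illustration of the parameter savings from the factorized (naive Bayes) model.
D = 10_000                       # illustrative vocabulary size
full_joint = 2 ** (D + 1) - 1    # entries for an unrestricted joint over (c, x_1, ..., x_D)
naive_bayes = 2 * D + 1          # pi, plus theta_jc for j = 1..D and c in {0, 1}
print(naive_bayes)               # 20001
print(full_joint > 10 ** 3000)   # True: astronomically many entries
```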

The Log Likelihood Function l(θ)
Given training data D = {(c^{(1)}, x^{(1)}), (c^{(2)}, x^{(2)}), …, (c^{(N)}, x^{(N)})}:

L(θ) = P(D | θ) = ∏_{i=1}^{N} P(c^{(i)}, x^{(i)})

ℓ(θ) = log L(θ) = log P(D | θ) = Σ_{i=1}^{N} log P(c^{(i)}, x^{(i)})

Naïve Bayes: Learning
The parameters can be learned efficiently because the log-likelihood decomposes into independent terms for each feature.
ℓ(θ) = Σ_{i=1}^{N} log p(c^{(i)}, x^{(i)}) = Σ_{i=1}^{N} log [ p(x^{(i)} | c^{(i)}) p(c^{(i)}) ]
     = Σ_{i=1}^{N} log [ p(c^{(i)}) ∏_{j=1}^{D} p(x_j^{(i)} | c^{(i)}) ]
     = Σ_{i=1}^{N} [ log p(c^{(i)}) + Σ_{j=1}^{D} log p(x_j^{(i)} | c^{(i)}) ]
     = Σ_{i=1}^{N} log p(c^{(i)})  +  Σ_{j=1}^{D} Σ_{i=1}^{N} log p(x_j^{(i)} | c^{(i)})

The first term is the Bernoulli log-likelihood of the labels; each inner sum in the second term is the Bernoulli log-likelihood for feature x_j.
Each of these log-likelihood terms depends on different sets of parameters, so they can be optimized independently.

Naïve Bayes: Learning
We can handle these terms separately. For the prior we maximize Σ_{i=1}^{N} log p(c^{(i)}).
This is a minor variant of our coin flip example. Let p(c^{(i)} = 1) = π. Note p(c^{(i)}) = π^{c^{(i)}} (1 − π)^{1−c^{(i)}}.
Log-likelihood:
Σ_{i=1}^{N} log p(c^{(i)}) = Σ_{i=1}^{N} c^{(i)} log π + Σ_{i=1}^{N} (1 − c^{(i)}) log(1 − π)
Obtain MLEs by setting derivatives to zero:
π̂ = (1/N) Σ_i I{c^{(i)} = 1} = (# spams in dataset) / (total # samples)

Na ̈ıve Bayes: Learning
Each θ_jc can be treated separately: maximize Σ_{i=1}^{N} log p(x_j^{(i)} | c^{(i)}).
This is (again) a minor variant of our coin flip example.
Let θ_jc = p(x_j = 1 | c). Note p(x_j^{(i)} | c) = θ_jc^{x_j^{(i)}} (1 − θ_jc)^{1−x_j^{(i)}}.
Log-likelihood:
Σ_{i=1}^{N} log p(x_j^{(i)} | c^{(i)}) = Σ_{i=1}^{N} c^{(i)} [ x_j^{(i)} log θ_j1 + (1 − x_j^{(i)}) log(1 − θ_j1) ]
                                     + Σ_{i=1}^{N} (1 − c^{(i)}) [ x_j^{(i)} log θ_j0 + (1 − x_j^{(i)}) log(1 − θ_j0) ]
Obtain MLEs by setting derivatives to zero:
θ̂_jc = Σ_i I{x_j^{(i)} = 1 & c^{(i)} = c} / Σ_i I{c^{(i)} = c},
which for c = 1 is (# times word j appears in spams) / (# spams in dataset).
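Putting the two closed-form estimators together, here is a minimal NumPy sketch on a tiny made-up dataset (the arrays and variable names are illustrative):

```python
import numpy as np

# X: N x D binary word-occurrence matrix; c: length-N binary labels (toy data).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
c = np.array([1, 1, 0, 0])

pi_hat = c.mean()                              # fraction of spam e-mails
theta_hat = np.stack([X[c == k].mean(axis=0)   # theta_hat[k, j] = p(x_j = 1 | c = k)
                      for k in (0, 1)])
print(pi_hat)       # 0.5
print(theta_hat)    # [[0.  0.5 0.5]
                    #  [1.  0.5 0.5]]
```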

Some Mathematical tricks
Assume that x_j, c ∈ {0, 1}.
1. log p(x_j | c) = c · log p(x_j | c = 1) + (1 − c) · log p(x_j | c = 0)
2. Since p(x_j = 1 | c = 1) = θ_j1 and p(x_j = 0 | c = 1) = 1 − θ_j1,
   p(x_j | c = 1) = θ_j1^{x_j} (1 − θ_j1)^{1−x_j}
   log p(x_j | c = 1) = x_j log θ_j1 + (1 − x_j) log(1 − θ_j1)
Recall that log x^a = a log x and log(xy) = log(x) + log(y).

Naïve Bayes: Inference
We predict the category of an input x by performing inference in the model.
Apply Bayes’ Rule:
p(c | x) = p(c) p(x | c) / Σ_{c′} p(c′) p(x | c′) = p(c) ∏_{j=1}^{D} p(x_j | c) / Σ_{c′} p(c′) ∏_{j=1}^{D} p(x_j | c′)
We need not compute the denominator if we merely want to determine the most likely c (why?).
Shorthand notation:
p(c | x) ∝ p(c) ∏_{j=1}^{D} p(x_j | c)
For input x, predict by computing the values of p(c) ∏_{j=1}^{D} p(x_j | c) for different c and choosing the largest.
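Here is a minimal sketch of this decision rule, computed in log space for numerical stability. The parameter values are illustrative and are kept strictly inside (0, 1) so that all logs are finite:

```python
import numpy as np

def nb_predict(x, pi_hat, theta_hat):
    """Return argmax_c p(c) * prod_j p(x_j | c), computed via log probabilities."""
    log_prior = np.log(np.array([1.0 - pi_hat, pi_hat]))          # log p(c) for c = 0, 1
    log_lik = (x * np.log(theta_hat)
               + (1 - x) * np.log(1.0 - theta_hat)).sum(axis=1)   # sum_j log p(x_j | c)
    return int(np.argmax(log_prior + log_lik))

# Illustrative parameters; estimates of exactly 0 or 1 would make some
# log-probabilities -inf -- see the data-sparsity discussion below.
pi_hat = 0.5
theta_hat = np.array([[0.1, 0.5, 0.5],     # p(x_j = 1 | c = 0)
                      [0.9, 0.5, 0.5]])    # p(x_j = 1 | c = 1)
print(nb_predict(np.array([1, 1, 0]), pi_hat, theta_hat))   # 1 (spam)
```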

Naïve Bayes
Naïve Bayes is an amazingly cheap learning algorithm!
Training time: estimate parameters using maximum likelihood
- Compute co-occurrence counts of each feature with the labels.
- Requires only one pass through the data.
Test time: apply Bayes’ Rule
- Cheap because of the model structure. (For more general models, Bayesian inference can be very expensive and/or complicated.)
We covered the Bernoulli case for simplicity. But our analysis easily extends to other probability distributions.
Unfortunately, it's usually less accurate in practice compared to discriminative models due to its "naïve" independence assumption.

MLE Issue: Data Sparsity
Maximum likelihood has a pitfall: if you have too little data, it can overfit.
Example: What if you flip the coin twice and get H both times?
θ̂_ML = N_H / (N_H + N_T) = 2 / (2 + 0) = 1
Because it never observed T, the model assigns that outcome probability 0. This is not an intuitive answer: we were simply unlucky not to observe any T in two flips, but that does not mean the coin can never land tails. This is an example of overfitting, and the problem is sometimes known as data sparsity.
We can mitigate this issue by using a Bayesian approach to estimation and inference.

Bayesian Parameter Estimation and Inference
In maximum likelihood, the observations are treated as random variables, but the parameters are not.
The Bayesian approach treats the parameters as random variables as well. The parameter θ has a prior probability, specified by another parameter β.
To define a Bayesian model, we need to specify two distributions:
- The prior distribution p(θ), which encodes our beliefs about the parameters before we observe the data
- The likelihood p(D | θ), same as in maximum likelihood

Bayesian Parameter Estimation and Inference
When we update our beliefs based on the observations, we compute the posterior distribution using Bayes’ Rule:
p(θ | D) = p(θ) p(D | θ) / ∫ p(θ′) p(D | θ′) dθ′.
We rarely ever compute the denominator explicitly. In general, it is computationally intractable.
Note: There is a subtle difference between the interpretation of probability according to a Bayesian and a frequentist (who recommends MLE). For the former, probability is a degree of belief about the truth of a statement; for the latter, it is the long-run fraction of times a statement is true over many observed samples.

Bayesian Parameter Estimation and Inference
Let's revisit the coin example. We already know the likelihood: L(θ) = p(D | θ) = θ^{N_H} (1 − θ)^{N_T}
It remains to specify the prior p(θ).
- We can choose an uninformative prior, which assumes as little as possible. A reasonable choice is the uniform prior.
- But our experience tells us 0.5 is more likely than 0.99. One particularly useful prior that lets us specify this is the beta distribution:
  p(θ; a, b) = [Γ(a + b) / (Γ(a) Γ(b))] θ^{a−1} (1 − θ)^{b−1}.
- Γ is the gamma function and has the property Γ(n) = (n − 1)! for positive integer n.
- The proportionality notation lets us ignore the normalization constant:
  p(θ; a, b) ∝ θ^{a−1} (1 − θ)^{b−1}.

Bayesian Parameter Estimation and Inference
Beta distribution for various values of a, b:
Some observations:
- The expectation is E[θ] = a/(a + b) (easy to derive).
- The distribution gets more peaked when a and b are large.
- The uniform distribution is the special case a = b = 1.
The beta distribution is used as a prior for the Bernoulli distribution.

Bayesian Parameter Estimation and Inference
Computing the posterior distribution:
p(θ | D) ∝ p(θ) p(D | θ)
         ∝ [θ^{a−1} (1 − θ)^{b−1}] [θ^{N_H} (1 − θ)^{N_T}]
         = θ^{a−1+N_H} (1 − θ)^{b−1+N_T}.
This is just a beta distribution with parameters N_H + a and N_T + b.
The posterior expectation of θ is:
E[θ | D] = (N_H + a) / (N_H + N_T + a + b)
The parameters a and b of the prior can be thought of as pseudo-counts.
- The reason this works is that the prior and likelihood have the same functional form. This phenomenon is known as conjugacy (conjugate priors), and it is very useful in the computation of posteriors.
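A minimal sketch of this conjugate update using SciPy's Beta distribution (the prior a = b = 2 and the data counts here are illustrative choices):

```python
from scipy.stats import beta

a, b = 2, 2                          # illustrative prior pseudo-counts
N_H, N_T = 2, 0                      # "two heads in two flips"

posterior = beta(a + N_H, b + N_T)   # posterior is again a Beta distribution
print(posterior.mean())              # (N_H + a) / (N_H + N_T + a + b) = 4/6 ≈ 0.667
```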

Bayesian Parameter Estimation and Inference
Bayesian inference for the coin flip example:
Small data setting: N_H = 2, N_T = 0.   Large data setting: N_H = 55, N_T = 45.
When you have enough observations, the data overwhelm the prior.

Bayesian Parameter Estimation and Inference
What do we actually do with the posterior?
The posterior predictive distribution is the distribution over
future observables given the past observations. We compute this by marginalizing out the parameter(s):
p(D′ | D) = ∫ p(θ | D) p(D′ | θ) dθ.
For the coin flip example:
Pr(x′ = H | D) = ∫ p(θ | D) Pr(x′ = H | θ) dθ
              = ∫ Beta(θ; N_H + a, N_T + b) · θ dθ
              = E_{Beta(θ; N_H + a, N_T + b)}[θ]
              = (N_H + a) / (N_H + N_T + a + b).

Aside: Review
Marginalization:
p(D′ | D) = ∫ p(D′, θ | D) dθ
           = ∫ p(D′ | θ, D) p(θ | D) dθ
           = ∫ p(D′ | θ) p(θ | D) dθ     (if D′ is independent of D given θ)

Expectation:

E_{θ ∼ p(θ; a, b)}[θ] = ∫ p(θ; a, b) · θ dθ

Bayesian Parameter Estimation and Inference
Bayesian estimation of the mean temperature in Toronto
Assume observations are i.i.d. Gaussian with known standard deviation σ and unknown mean μ
Broad Gaussian prior over μ, centered at 0
We can compute the posterior and posterior predictive distributions analytically (derivation omitted)
Why is the posterior predictive distribution more spread out than the posterior distribution?

Bayesian Parameter Estimation and Inference
Comparison of maximum likelihood and Bayesian parameter estimation
The Bayesian approach deals better with data sparsity
Maximum likelihood is an optimization problem, while Bayesian parameter estimation is an integration problem (taking expectation).
- This means maximum likelihood is much easier in practice, since we can just do gradient descent.
- Automatic differentiation packages make it really easy to compute gradients.
- There aren't any comparable black-box tools for Bayesian parameter estimation.

Maximum A-Posteriori Estimation
Maximum a-posteriori (MAP) estimation: find the most likely parameter settings under the posterior
This is an approximation to full Bayesian estimation and inference, because it returns a single point estimate of the parameters instead of a probability distribution over them.

Maximum A-Posteriori Estimation
This converts the Bayesian parameter estimation problem into a maximization problem
θ̂_MAP = argmax_θ p(θ | D)
       = argmax_θ p(θ, D)
       = argmax_θ p(θ) p(D | θ)
       = argmax_θ [log p(θ) + log p(D | θ)]

Maximum A-Posteriori Estimation
Joint probability in the coin flip example:
logp(θ,D) = logp(θ)+logp(D|θ) =Const+(a−1)logθ+(b−1)log(1−θ)+NH logθ+NT log(1−θ) =Const+(NH +a−1)logθ+(NT +b−1)log(1−θ)
Maximize by finding a critical point
0 = d log p(θ, D) = NH + a − 1 − NT + b − 1
Solving for θ,
dθ θ 1−θ θˆMAP = NH + a − 1
NH +NT +a+b−2

Comparison: MLE, MAP, and Bayesian
Comparison of estimates in the coin flip example:
Estimator formulas:
θ̂_ML = N_H / (N_H + N_T),   E[θ | D] = (N_H + a) / (N_H + N_T + a + b),   θ̂_MAP = (N_H + a − 1) / (N_H + N_T + a + b − 2)

With a = b = 2:

                       θ̂_ML            E[θ | D]          θ̂_MAP
N_H = 2,  N_T = 0      2/2 = 1          4/6 ≈ 0.67        3/4 = 0.75
N_H = 55, N_T = 45     55/100 = 0.55    57/104 ≈ 0.548    56/102 ≈ 0.549

θ̂_MAP assigns nonzero probabilities as long as a, b > 1.
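The numbers above can be reproduced directly from the three formulas; a minimal sketch with a = b = 2, matching the table:

```python
def estimates(N_H, N_T, a=2, b=2):
    theta_ml = N_H / (N_H + N_T)
    post_mean = (N_H + a) / (N_H + N_T + a + b)
    theta_map = (N_H + a - 1) / (N_H + N_T + a + b - 2)
    return theta_ml, post_mean, theta_map

print(estimates(55, 45))   # (0.55, 0.548..., 0.549...)
print(estimates(2, 0))     # (1.0, 0.666..., 0.75)
```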

Gaussian Discriminant Analysis
Generative models – data generating distribution p(x | t = k).
Instead of trying to separate classes, try to model what each class "looks like".
Recall that p(x|t = k) may be very complex
p(x_1, …, x_d, t) = p(x_1 | x_2, …, x_d, t) ⋯ p(x_{d−1} | x_d, t) p(x_d | t) p(t)
Naive Bayes used a conditional independence assumption to get
p(x_1, …, x_d, t) = p(x_1 | t) ⋯ p(x_{d−1} | t) p(x_d | t) p(t)
What else could we do?
- Choose a simple distribution.
Next, we will discuss fitting Gaussian distributions to our data.

Bayes Classifier
Let's take a step back. Bayes Classifier:
h(x) = argmax_k p(t = k | x) = argmax_k p(x | t = k) p(t = k) / p(x)
     = argmax_k p(x | t = k) p(t = k)
We previously talked about discrete x. What if x is continuous?

Classification: Diabetes Example
Observation per patient: White blood cell count & glucose value.
How can we model p(x | t = k)?

Multivariate Data
Multiple measurements (sensors)
D inputs/features/attributes
N instances/observations/examples
The data are collected in an N × D matrix X whose i-th row is the transposed input vector [x^{(i)}]^⊤, i.e., X_{ij} = x_j^{(i)}:

X = [ [x^{(1)}]^⊤ ]   [ x_1^{(1)}  x_2^{(1)}  ⋯  x_D^{(1)} ]
    [ [x^{(2)}]^⊤ ] = [ x_1^{(2)}  x_2^{(2)}  ⋯  x_D^{(2)} ]
    [      ⋮      ]   [     ⋮          ⋮      ⋱      ⋮     ]
    [ [x^{(N)}]^⊤ ]   [ x_1^{(N)}  x_2^{(N)}  ⋯  x_D^{(N)} ]

Multivariate Parameters
Mean: E[x^{(i)}] = μ = [μ_1, …, μ_D]^⊤ ∈ R^D

Covariance:
Σ = Cov(x^{(i)}) = E[(x^{(i)} − μ)(x^{(i)} − μ)^⊤] = [ σ_1^2   σ_12   ⋯  σ_1D  ]
                                                     [ σ_12    σ_2^2  ⋯  σ_2D  ]
                                                     [   ⋮       ⋮    ⋱    ⋮   ]
                                                     [ σ_D1    σ_D2   ⋯  σ_D^2 ]

The mean and covariance are enough to represent a Gaussian distribution. This is not true for all distributions.

Multivariate Gaussian Distribution
x ∼ N(μ, Σ), a Gaussian (or normal) distribution, is defined by the density
p(x) = 1 / ((2π)^{D/2} |Σ|^{1/2}) · exp( −(1/2) (x − μ)^⊤ Σ^{−1} (x − μ) ),
where |Σ| is the determinant of the covariance matrix Σ.
The Central Limit Theorem says that sums of independent random variables are approximately Gaussian.
- The r.v. do not need to be Gaussians themselves.
In machine learning, we use Gaussians a lot because they make the
calculations easy.
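A minimal sketch that evaluates this density directly and checks it against SciPy's implementation (the mean, covariance, and query point are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, -0.5])

D = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff             # (x - mu)^T Sigma^{-1} (x - mu)
p = np.exp(-0.5 * quad) / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5)

print(np.isclose(p, multivariate_normal(mu, Sigma).pdf(x)))   # True
```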

Bivariate Normal
Σ = [[1, 0], [0, 1]]        Σ = 0.5 · [[1, 0], [0, 1]]        Σ = 2 · [[1, 0], [0, 1]]
Figure: Probability density function
Figure: Contour plot of the pdf

Bivariate Normal
Σ = [[1, 0], [0, 1]]        Σ = [[1, 0.5], [0.5, 1]]        Σ = [[1, 0.8], [0.8, 1]]
Figure: Probability density function
Figure: Contour plot of the pdf

Maximum Likelihood
Suppose we want to model the distribution of highest and lowest temperatures in Toronto in March, and we’ve recorded the following observations 🙁
(-2.5,-7.5) (-9.9,-14.9) (-12.1,-17.5) (-8.9,-13.9) (-6.0,-11.1) Assume they’re drawn from a Gaussian distribution with mean μ, and covariance Σ. We want to estimate these using data. Log-likelihood function:
ℓ(μ, Σ) = log ∏_{i=1}^{N} [ 1/((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x^{(i)} − μ)^⊤ Σ^{−1} (x^{(i)} − μ) ) ]
        = Σ_{i=1}^{N} log [ 1/((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x^{(i)} − μ)^⊤ Σ^{−1} (x^{(i)} − μ) ) ]
        = Σ_{i=1}^{N} [ −log (2π)^{d/2} − log |Σ|^{1/2} − (1/2) (x^{(i)} − μ)^⊤ Σ^{−1} (x^{(i)} − μ) ]

Maximum Likelihood
Maximize the log-likelihood by setting the derivative to zero:
0 = dℓ/dμ = −Σ_{i=1}^{N} d/dμ [ (1/2) (x^{(i)} − μ)^⊤ Σ^{−1} (x^{(i)} − μ) ]
          = Σ_{i=1}^{N} Σ^{−1} (x^{(i)} − μ)
Solving, we get μ̂ = (1/N) Σ_{i=1}^{N} x^{(i)}. In general, a "hat" denotes an estimator.
This is just the sample mean of the observed values, i.e., the empirical mean.

Maximum Likelihood
A similar calculation for the covariance matrix Σ: set the partial derivatives to zero, just like before:
0 = ∂ℓ/∂Σ  ⟹  Σ̂ = (1/N) Σ_{i=1}^{N} (x^{(i)} − μ̂)(x^{(i)} − μ̂)^⊤
This is called the empirical covariance and comes up quite often, e.g., in PCA in the next lecture.
The derivation in the multivariate case is tedious; no need to worry about it. But it is good practice to derive this in one dimension. See the appendix.
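A minimal NumPy sketch using the five temperature observations from the earlier slide (note that np.cov divides by N − 1 by default, so bias=True is needed to match the MLE's 1/N):

```python
import numpy as np

# (high, low) temperature observations from the earlier slide
X = np.array([[ -2.5,  -7.5],
              [ -9.9, -14.9],
              [-12.1, -17.5],
              [ -8.9, -13.9],
              [ -6.0, -11.1]])

mu_hat = X.mean(axis=0)                          # empirical mean = MLE of mu
Sigma_hat = np.cov(X, rowvar=False, bias=True)   # divide by N to match the MLE
print(mu_hat)       # [-7.88 -12.98]
print(Sigma_hat)
```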