CSC 311: Introduction to Machine Learning
Lecture 7 – Probabilistic Models
University of Toronto, Fall 2021
So far in the course we have adopted a modular perspective, in which the model, loss function, optimizer, and regularizer are specified separately.
Today we begin putting together a probabilistic interpretation of our model and loss, and introduce the concept of maximum likelihood estimation.
Probabilistic modeling of data
A simple coin flip
Let’s start with a simple biased coin example.
- You flip a coin N = 100 times and get outcomes {x_1, ..., x_N}, where x_i ∈ {0, 1} and x_i = 1 is interpreted as heads (H).
- Suppose you had N_H = 55 heads and N_T = 45 tails. We would like to build a model of this phenomenon.
- A good model should help us answer questions such as: what is the probability it will come up heads if we flip again?
- Let's design a model for this scenario, fit it to the data, and use the fitted model to predict the next outcome.
The coin is possibly loaded. So, we can assume that one coin flip outcome x is a Bernoulli random variable for some currently unknown parameter θ ∈ [0, 1].
$$p(x = 1 \mid \theta) = \theta \quad\text{and}\quad p(x = 0 \mid \theta) = 1 - \theta,$$
or more succinctly
$$p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}.$$
It’s sensible to assume that {x1, . . . , xN } are independent and identically distributed (i.i.d.) Bernoullis.
Thus the joint probability of the outcome {x1, . . . , xN } is
$$p(x_1, \ldots, x_N \mid \theta) = \prod_{i=1}^N \theta^{x_i} (1 - \theta)^{1 - x_i}$$
We call the probability mass (or density, for continuous data) of the observed data, viewed as a function of the parameters θ, the likelihood function:
$$L(\theta) = \prod_{i=1}^N \theta^{x_i} (1 - \theta)^{1 - x_i}$$
We usually work with log-likelihoods:
$$\ell(\theta) = \sum_{i=1}^N x_i \log \theta + (1 - x_i) \log(1 - \theta)$$
How can we choose θ? Good values of θ should assign high probability to the observed data. This motivates the maximum likelihood criterion: pick the parameters that maximize the likelihood,
$$\hat\theta_{\mathrm{ML}} = \arg\max_{\theta \in [0,1]} \ell(\theta)$$
Maximum Likelihood Estimation for the Coin Example
Remember how we found the optimal solution to linear regression by setting derivatives to zero? We can do that again for the coin example.
$$\frac{d\ell}{d\theta} = \frac{d}{d\theta} \sum_{i=1}^N \big[ x_i \log \theta + (1 - x_i) \log(1 - \theta) \big] = \frac{d}{d\theta} \big( N_H \log \theta + N_T \log(1 - \theta) \big) = \frac{N_H}{\theta} - \frac{N_T}{1 - \theta},$$
where $N_H = \sum_i x_i$ and $N_T = N - \sum_i x_i$.
Setting this to zero gives the maximum likelihood estimate:
$$\hat\theta_{\mathrm{ML}} = \frac{N_H}{N_H + N_T}$$
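As a quick illustration, here is a minimal sketch of this closed-form estimate in Python, using the N_H = 55, N_T = 45 data from the running example (the array of flips is constructed by hand, not taken from any real dataset):

```python
import numpy as np

# Outcomes of the running example: 55 heads (1) and 45 tails (0).
flips = np.array([1] * 55 + [0] * 45)

N_H = flips.sum()               # number of heads
N_T = len(flips) - N_H          # number of tails
theta_ml = N_H / (N_H + N_T)    # closed-form maximum likelihood estimate
print(theta_ml)                 # 0.55
```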
Maximum Likelihood Estimation
Notice that in the coin example we are actually minimizing cross-entropies!
$$\hat\theta_{\mathrm{ML}} = \arg\max_{\theta \in [0,1]} \ell(\theta) = \arg\min_{\theta \in [0,1]} -\ell(\theta) = \arg\min_{\theta \in [0,1]} \sum_{i=1}^N \big[ -x_i \log \theta - (1 - x_i) \log(1 - \theta) \big]$$
This is an example of maximum likelihood estimation:
- define a model that assigns a probability (or probability density) to a dataset, then
- maximize the likelihood (or, equivalently, minimize the negative log-likelihood), as in the grid-search sketch below.
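A minimal sketch of that recipe, assuming the same hand-built coin-flip data as before: the negative log-likelihood (a sum of binary cross-entropies) is minimized by a crude grid search, and the minimizer matches the closed-form N_H / N.

```python
import numpy as np

flips = np.array([1] * 55 + [0] * 45)   # same hand-built data as before

def neg_log_likelihood(theta, x):
    # Sum of per-example binary cross-entropies: -x log(theta) - (1 - x) log(1 - theta)
    return np.sum(-x * np.log(theta) - (1 - x) * np.log(1 - theta))

# Crude grid search over theta in (0, 1)
thetas = np.linspace(0.01, 0.99, 99)
nll = np.array([neg_log_likelihood(t, flips) for t in thetas])
print(thetas[nll.argmin()])             # approximately 0.55, matching N_H / N
```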
Many examples we’ve considered fall in this framework! Let’s consider classification again.
Strategies for classification
Spam classification
If you are a large company that runs an email service, one of the important predictive problems you may have is the automated detection of spam email. For example:
Example (not spam): "Dear Karim, I think we should postpone the board meeting to be held after Thanksgiving. Regards, Toby"
Example (spam): "I have an incredible opportunity for mining 2 Bitcoin a day. Please contact me at the earliest at +1 123 321 1555. You won't want to miss out on this opportunity. Regards, Ark"
Discriminative Classifiers
Given inputs x and classes y we can do classification in several ways. How?
Discriminative classifiers try to either:
- learn mappings directly from the space of inputs X to class labels {0, 1, 2, ..., K}, or
- model the class probability p(y | x) directly.
[Figure: a discriminative classifier maps the word features of an email ("postpone, board, meeting, Thanksgiving" vs. "mining, Bitcoin, contact, opportunity") to a class probability p(y | x), e.g. "Not spam" vs. "Spam".]
Generative Classifiers
How about this approach: build a model of "what data for a class looks like".
Generative classifiers try to model p(x, y). If we know p(y), we can easily compute p(x | y).
Classification via Bayes rule (thus also called Bayes classifiers)
[Figure: a generative classifier models the probability of the features given the label, p(x | y), for each class label ("postpone, board, meeting, Thanksgiving" for not spam vs. "mining, Bitcoin, contact, opportunity" for spam).]
Generative vs Discriminative
Two approaches to classification:
Discriminative approach: estimate parameters of decision boundary/class separator directly from labeled examples.
- Model p(t | x) directly (e.g. logistic regression)
- Learn mappings from inputs to classes (linear/logistic regression, decision trees, etc.)
- Tries to solve: how do I separate the classes?
Generative approach: model the distribution of inputs characteristic of the class (Bayes classifier).
- Model p(x | t)
- Apply Bayes' rule to derive p(t | x)
- Tries to solve: what does each class "look" like?
Key difference: is there a distributional assumption over inputs?
Naïve Bayes
A Generative Model: Bayes Classifier
Aim to classify text into spam / not-spam (spam: c = 1; not-spam: c = 0)
Example: “You are one of the very few who have been selected as a winners for the free $1000 Gift Card.”
Use bag-of-words features, get binary vector x for each email
Vocabulary:
- "a": 1
- "car": 0
- "card": 1
- "win": 0
- "winner": 1
- "winter": 0
- ...
- "you": 1
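A minimal sketch of extracting such binary bag-of-words features, with a tiny hypothetical vocabulary (a real system would use a far larger vocabulary and normalize word forms, e.g. mapping "winners" to "winner" as the slide's feature vector does; this naive exact-match version does not):

```python
# Tiny, hypothetical vocabulary for illustration only.
vocab = ["a", "car", "card", "win", "winner", "winter", "you"]

def bag_of_words(text, vocab):
    # x_j = 1 if word j appears in the email, 0 otherwise
    words = set(text.lower().replace(".", "").replace("$", "").split())
    return [1 if w in words else 0 for w in vocab]

email = ("You are one of the very few who have been selected as a winners "
         "for the free $1000 Gift Card.")
print(bag_of_words(email, vocab))
# [1, 0, 1, 0, 0, 0, 1] -- "winners" does not match "winner" without stemming
```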
Bayes Classifier
Given features $x = [x_1, x_2, \cdots, x_D]^T$, we want to compute class probabilities using Bayes' rule:
$$\underbrace{p(c \mid x)}_{\text{Pr. class given words}} = \frac{p(x, c)}{p(x)} = \frac{\overbrace{p(x \mid c)}^{\text{Pr. words given class}} \; p(c)}{p(x)}$$
More formally: posterior = (class likelihood × prior) / evidence.
To compute p(c | x) we need: p(x | c) and p(c).
How can we compute p(x) for the two-class case? (Do we need to?)
$$p(x) = p(x \mid c = 0)\, p(c = 0) + p(x \mid c = 1)\, p(c = 1)$$
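A small numeric sketch of this two-class computation; the likelihood and prior values below are invented purely for illustration and are not from the lecture:

```python
# Hypothetical numbers for a single feature vector x.
p_x_given_c = {0: 0.002, 1: 0.030}   # assumed class likelihoods p(x | c)
p_c = {0: 0.9, 1: 0.1}               # assumed prior: 10% of email is spam

# Evidence: p(x) = sum over classes of p(x | c) p(c)
p_x = sum(p_x_given_c[c] * p_c[c] for c in (0, 1))

# Posterior via Bayes' rule
posterior = {c: p_x_given_c[c] * p_c[c] / p_x for c in (0, 1)}
print(posterior)                     # {0: 0.375, 1: 0.625}
```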
Naïve Bayes
Assume we have two classes: spam and non-spam. We have a dictionary of D words, and binary features x = [x1,…,xD] saying whether each word appears in the e-mail.
If we define a joint distribution p(c,x1,…,xD), this gives enough information to determine p(c) and p(x|c).
Problem: specifying a joint distribution over D + 1 binary variables requires $2^{D+1} - 1$ entries. This is computationally prohibitive and would require an absurd amount of data to fit.
We'd like to impose structure on the distribution such that:
- it can be compactly represented
- learning and inference are both tractable
Naïve Bayes
Naïve assumption: Naïve Bayes assumes that the word features $x_i$ are conditionally independent given the class c.
- This means $x_i$ and $x_j$ are independent under the conditional distribution $p(x \mid c)$.
- Note: this doesn't mean they're (unconditionally) independent.
- Mathematically,
$$p(c, x_1, \ldots, x_D) = p(c)\, p(x_1 \mid c) \cdots p(x_D \mid c).$$
Compact representation of the joint distribution (see the counting sketch below):
- Prior probability of class: $p(c = 1) = \pi$ (e.g. the email is spam)
- Conditional probability of a word feature given the class: $p(x_j = 1 \mid c) = \theta_{jc}$ (e.g. the word "price" appearing in spam)
- $2D + 1$ parameters total (instead of $2^{D+1} - 1$)
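To make the parameter count concrete, a quick sketch for an assumed vocabulary size (the D = 10,000 figure is hypothetical):

```python
D = 10_000                       # assumed vocabulary size (hypothetical)

full_joint = 2 ** (D + 1) - 1    # entries for an unrestricted joint over (c, x_1, ..., x_D)
naive_bayes = 2 * D + 1          # pi, plus theta_{j0} and theta_{j1} for each word j
print(naive_bayes)               # 20001
print(len(str(full_joint)))      # ~3011 digits, i.e. an astronomically large table
```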
Bayes Nets
We can represent this model using a directed graphical model, or Bayesian network:
[Figure: Bayesian network with the class c as the parent of each word feature x_1, ..., x_D.]
This graph structure means the joint distribution factorizes as a product of conditional distributions for each variable given its parent(s).
Intuitively, you can think of the edges as reflecting a causal structure. But mathematically, this doesn’t hold without additional assumptions.
Naïve Bayes: Learning
The parameters can be learned efficiently because the log-likelihood decomposes into independent terms for each feature.
$$\ell(\boldsymbol{\theta}) = \sum_{i=1}^N \log p(c^{(i)}, x^{(i)}) = \sum_{i=1}^N \log \Big\{ p(x^{(i)} \mid c^{(i)})\, p(c^{(i)}) \Big\}$$
$$= \sum_{i=1}^N \log \Big\{ p(c^{(i)}) \prod_{j=1}^D p(x_j^{(i)} \mid c^{(i)}) \Big\}$$
$$= \sum_{i=1}^N \Big[ \log p(c^{(i)}) + \sum_{j=1}^D \log p(x_j^{(i)} \mid c^{(i)}) \Big]$$
$$= \underbrace{\sum_{i=1}^N \log p(c^{(i)})}_{\text{Bernoulli log-likelihood of labels}} + \sum_{j=1}^D \underbrace{\sum_{i=1}^N \log p(x_j^{(i)} \mid c^{(i)})}_{\text{Bernoulli log-likelihood for feature } x_j}$$
Each of these log-likelihood terms depends on different sets of parameters, so they can be optimized independently.
Naïve Bayes: Learning
We can handle these terms separately. For the prior we maximize $\sum_{i=1}^N \log p(c^{(i)})$.
This is a minor variant of our coin flip example. Let $p(c^{(i)} = 1) = \pi$. Note $p(c^{(i)}) = \pi^{c^{(i)}} (1 - \pi)^{1 - c^{(i)}}$.
Log-likelihood:
$$\sum_{i=1}^N \log p(c^{(i)}) = \sum_{i=1}^N c^{(i)} \log \pi + \sum_{i=1}^N (1 - c^{(i)}) \log(1 - \pi)$$
Obtain the MLE by setting the derivative to zero:
$$\hat\pi = \frac{\sum_i \mathbb{1}[c^{(i)} = 1]}{N} = \frac{\#\text{ spams in dataset}}{\text{total }\#\text{ samples}}$$
Naïve Bayes: Learning
Each $\theta_{jc}$ can be treated separately: maximize $\sum_{i=1}^N \log p(x_j^{(i)} \mid c^{(i)})$.
This is (again) a minor variant of our coin flip example.
Let $\theta_{jc} = p(x_j = 1 \mid c)$. Note $p(x_j^{(i)} \mid c) = \theta_{jc}^{x_j^{(i)}} (1 - \theta_{jc})^{1 - x_j^{(i)}}$.
Log-likelihood:
$$\sum_{i=1}^N \log p(x_j^{(i)} \mid c^{(i)}) = \sum_{i=1}^N c^{(i)} \Big\{ x_j^{(i)} \log \theta_{j1} + (1 - x_j^{(i)}) \log(1 - \theta_{j1}) \Big\} + \sum_{i=1}^N (1 - c^{(i)}) \Big\{ x_j^{(i)} \log \theta_{j0} + (1 - x_j^{(i)}) \log(1 - \theta_{j0}) \Big\}$$
Obtain the MLEs by setting derivatives to zero:
$$\hat\theta_{jc} = \frac{\sum_i \mathbb{1}[x_j^{(i)} = 1 \,\&\, c^{(i)} = c]}{\sum_i \mathbb{1}[c^{(i)} = c]}, \quad \text{which for } c = 1 \text{ is } \; \frac{\#\text{ times word } j \text{ appears in spams}}{\#\text{ spams in dataset}}$$
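A minimal sketch of these count-based MLEs, assuming a hypothetical binary word-appearance matrix X of shape (N, D), a binary label vector c of length N, and at least one example of each class:

```python
import numpy as np

def naive_bayes_fit(X, c):
    """MLE for Bernoulli naive Bayes.
    X: (N, D) binary word-appearance matrix, c: (N,) binary labels."""
    pi = c.mean()                       # hat{pi} = fraction of spam emails
    D = X.shape[1]
    theta = np.zeros((D, 2))
    for k in (0, 1):
        mask = (c == k)
        # hat{theta}_{jk} = fraction of class-k emails in which word j appears
        theta[:, k] = X[mask].sum(axis=0) / mask.sum()
    return pi, theta
```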
Naïve Bayes: Inference
We predict the category by performing inference in the model. Apply Bayes' rule:
$$p(c \mid x) = \frac{p(c)\, p(x \mid c)}{\sum_{c'} p(c')\, p(x \mid c')} = \frac{p(c) \prod_{j=1}^D p(x_j \mid c)}{\sum_{c'} p(c') \prod_{j=1}^D p(x_j \mid c')}$$
We need not compute the denominator if we're simply trying to determine the most likely c.
Shorthand notation:
$$p(c \mid x) \propto p(c) \prod_{j=1}^D p(x_j \mid c)$$
For input x, predict by comparing the values of $p(c) \prod_{j=1}^D p(x_j \mid c)$ for different c (e.g. choose the largest).
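A sketch of that prediction rule in log space (products of many small probabilities underflow), using the pi and theta returned by the hypothetical naive_bayes_fit sketch above:

```python
import numpy as np

def naive_bayes_predict(x, pi, theta):
    """Most likely class for one binary feature vector x of shape (D,)."""
    log_prior = np.log(np.array([1.0 - pi, pi]))        # log p(c) for c = 0, 1
    # Sum over words of log p(x_j | c) under the Bernoulli model.
    # Note: a zero count gives log(0) = -inf -- the data sparsity issue discussed later.
    log_lik = (x[:, None] * np.log(theta) +
               (1 - x)[:, None] * np.log(1 - theta)).sum(axis=0)
    return int(np.argmax(log_prior + log_lik))           # argmax of log p(c | x) + const
```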
Naïve Bayes
Naïve Bayes is an amazingly cheap learning algorithm!
Training time: estimate parameters using maximum likelihood.
- Compute co-occurrence counts of each feature with the labels.
- Requires only one pass through the data!
Test time: apply Bayes' rule.
- Cheap because of the model structure. (For more general models, Bayesian inference can be very expensive and/or complicated.)
We covered the Bernoulli case for simplicity. But our analysis easily extends to other probability distributions.
Unfortunately, it's usually less accurate in practice compared to discriminative models due to its "naïve" independence assumption.
Bayesian Parameter Estimation
MLE issue: Data Sparsity
Maximum likelihood has a pitfall: if you have too little data, it can overfit.
E.g., what if you flip the coin twice and get H both times?
$$\hat\theta_{\mathrm{ML}} = \frac{N_H}{N_H + N_T} = \frac{2}{2 + 0} = 1$$
Because it never observed T, it assigns this outcome probability 0.
This problem is known as data sparsity.
Bayesian Parameter Estimation
In maximum likelihood, the observations are treated as random variables, but the parameters are not.
The Bayesian approach treats the parameters as random variables as well. (Here, β denotes the set of parameters of the prior distribution over θ.)
To define a Bayesian model, we need to specify two distributions:
- The prior distribution p(θ), which encodes our beliefs about the parameters before we observe the data
- The likelihood p(D | θ), the same as in maximum likelihood
Bayesian Parameter Estimation
When we update our beliefs based on the observations, we compute the posterior distribution using Bayes’ Rule:
$$p(\theta \mid \mathcal{D}) = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{\int p(\theta')\, p(\mathcal{D} \mid \theta')\, d\theta'}$$
We rarely ever compute the denominator explicitly. In general, it is computationally intractable.
Bayesian Parameter Estimation
Let's revisit the coin example. We already know the likelihood:
$$L(\theta) = p(\mathcal{D} \mid \theta) = \theta^{N_H} (1 - \theta)^{N_T}$$
It remains to specify the prior p(θ).
- We can choose an uninformative prior, which assumes as little as possible. A reasonable choice is the uniform prior.
- But our experience tells us 0.5 is more likely than 0.99. One particularly useful prior that lets us specify this is the beta distribution:
$$p(\theta; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \theta^{a-1} (1 - \theta)^{b-1}.$$
- The proportionality notation ∝ lets us ignore the normalization constant:
$$p(\theta; a, b) \propto \theta^{a-1} (1 - \theta)^{b-1}.$$
Bayesian Parameter Estimation
Beta distribution for various values of a, b:
Some observations (checked numerically in the sketch below):
- The expectation is E[θ] = a/(a + b).
- The distribution gets more peaked when a and b are large.
- The uniform distribution is the special case a = b = 1.
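A quick check of the first and third observations, using only the standard library (the grid-based mean below is a crude numerical integration, not part of the lecture):

```python
import math

def beta_pdf(theta, a, b):
    # p(theta; a, b) = Gamma(a + b) / (Gamma(a) Gamma(b)) * theta^(a-1) * (1 - theta)^(b-1)
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * theta ** (a - 1) * (1 - theta) ** (b - 1)

# a = b = 1 recovers the uniform density on [0, 1]
print([round(beta_pdf(t, 1, 1), 3) for t in (0.1, 0.5, 0.9)])      # [1.0, 1.0, 1.0]

# Numerical mean for a = 2, b = 2 is close to a / (a + b) = 0.5
grid = [i / 1000 for i in range(1, 1000)]
print(round(sum(t * beta_pdf(t, 2, 2) for t in grid) / 1000, 3))   # 0.5
```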
The main thing the beta distribution is used for is as a prior for the Bernoulli distribution.
Bayesian Parameter Estimation
Computing the posterior distribution:
$$p(\theta \mid \mathcal{D}) \propto p(\theta)\, p(\mathcal{D} \mid \theta) \propto \big[ \theta^{a-1} (1 - \theta)^{b-1} \big]\, \theta^{N_H} (1 - \theta)^{N_T} = \theta^{a-1+N_H} (1 - \theta)^{b-1+N_T}$$
This is just a beta distribution with parameters $N_H + a$ and $N_T + b$.
The posterior expectation of θ is:
$$\mathbb{E}[\theta \mid \mathcal{D}] = \frac{N_H + a}{N_H + N_T + a + b}$$
The parameters a and b of the prior can be thought of as pseudo-counts.
- The reason this works is that the prior and likelihood have the same functional form. This phenomenon is known as conjugacy (conjugate priors), and it's very useful.
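A tiny sketch of the conjugate update: the posterior is just Beta(N_H + a, N_T + b), so updating amounts to adding pseudo-counts. The Beta(2, 2) prior below is an assumption chosen to match the numbers on the comparison slide later in the lecture.

```python
def beta_bernoulli_update(a, b, N_H, N_T):
    """Beta(a, b) prior + N_H heads, N_T tails -> Beta posterior and its mean."""
    a_post, b_post = a + N_H, b + N_T
    return a_post, b_post, a_post / (a_post + b_post)

# Small-data setting: 2 heads, 0 tails
print(beta_bernoulli_update(2, 2, N_H=2, N_T=0))    # (4, 2, 0.666...)

# Large-data setting: 55 heads, 45 tails -- the prior matters much less
print(beta_bernoulli_update(2, 2, N_H=55, N_T=45))  # (57, 47, ~0.548)
```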
Bayesian Parameter Estimation
Bayesian inference for the coin flip example:
[Figure: posterior densities in the small data setting ($N_H = 2$, $N_T = 0$) and the large data setting ($N_H = 55$, $N_T = 45$).]
When you have enough observations, the data overwhelm the prior.
Maximum A-Posteriori Estimation
Maximum a-posteriori (MAP) estimation: find the most likely parameter settings under the posterior
Maximum A-Posteriori Estimation
This converts the Bayesian parameter estimation problem into a maximization problem
$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid \mathcal{D}) = \arg\max_{\theta}\, p(\theta, \mathcal{D}) = \arg\max_{\theta}\, p(\theta)\, p(\mathcal{D} \mid \theta) = \arg\max_{\theta}\, \big[ \log p(\theta) + \log p(\mathcal{D} \mid \theta) \big]$$
Maximum A-Posteriori Estimation
Joint probability in the coin flip example:
$$\log p(\theta, \mathcal{D}) = \log p(\theta) + \log p(\mathcal{D} \mid \theta) = \text{Const} + (a - 1) \log \theta + (b - 1) \log(1 - \theta) + N_H \log \theta + N_T \log(1 - \theta)$$
$$= \text{Const} + (N_H + a - 1) \log \theta + (N_T + b - 1) \log(1 - \theta)$$
Maximize by finding a critical point:
$$0 = \frac{d}{d\theta} \log p(\theta, \mathcal{D}) = \frac{N_H + a - 1}{\theta} - \frac{N_T + b - 1}{1 - \theta}$$
Solving for θ,
$$\hat\theta_{\mathrm{MAP}} = \frac{N_H + a - 1}{N_H + N_T + a + b - 2}$$
Maximum A-Posteriori Estimation
Comparison of estimates in the coin flip example:
$$\hat\theta_{\mathrm{ML}} = \frac{N_H}{N_H + N_T}, \qquad \mathbb{E}[\theta \mid \mathcal{D}] = \frac{N_H + a}{N_H + N_T + a + b}, \qquad \hat\theta_{\mathrm{MAP}} = \frac{N_H + a - 1}{N_H + N_T + a + b - 2}$$
Using a beta prior with a = b = 2 (the values implied by 57/104 and 56/102 below):
- $N_H = 2$, $N_T = 0$: $\hat\theta_{\mathrm{ML}} = 1$, $\mathbb{E}[\theta \mid \mathcal{D}] = 4/6 \approx 0.67$, $\hat\theta_{\mathrm{MAP}} = 3/4 = 0.75$
- $N_H = 55$, $N_T = 45$: $\hat\theta_{\mathrm{ML}} = 55/100 = 0.55$, $\mathbb{E}[\theta \mid \mathcal{D}] = 57/104 \approx 0.548$, $\hat\theta_{\mathrm{MAP}} = 56/102 \approx 0.549$
$\hat\theta_{\mathrm{MAP}}$ assigns nonzero probabilities as long as a, b > 1.
Multivariate Gaussian Distribution
Classification: Diabetes Example
Observation per patient: White blood cell count & glucose value.
p(x | t = k) for each class is shaped like an ellipse
⇒ we model each class as a multivariate Gaussian
Univariate Gaussian distribution
Recall the Gaussian, or normal, distribution:
$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
Parameterized by mean μ and variance σ².
The Central Limit Theorem says that sums of lots of independent random variables are approximately Gaussian.
In machine learning, we use Gaussians a lot because they make the calculations easy.
Multivariate Mean and Covariance
Mean:
$$\mu = \mathbb{E}[x] = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_d \end{bmatrix}$$
Covariance:
$$\Sigma = \mathrm{Cov}(x) = \mathbb{E}\big[(x - \mu)(x - \mu)^\top\big] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
The statistics μ and Σ uniquely define a multivariate Gaussian (or multivariate normal) distribution, denoted $\mathcal{N}(\mu, \Sigma)$ or $\mathcal{N}(x; \mu, \Sigma)$.
- This is not true for distributions in general!
Multivariate Gaussian Distribution
PDF of the multivariate Gaussian distribution:
$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$$
Compare to the univariate case ($d = 1$, $\Sigma = \sigma^2$):
$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
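A direct implementation sketch of this density formula (just a transcription of the equation, not an optimized routine), checked against the univariate case:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(x; mu, Sigma) for a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

# d = 1 with Sigma = sigma^2 should recover the univariate density
x, mu, sigma = 0.5, 0.0, 2.0
print(mvn_pdf(np.array([x]), np.array([mu]), np.array([[sigma ** 2]])))
print(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma))
```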
Gaussian Intuition: (Multivariate) Shift + Scale
Recall that in the univariate case, all normal distributions are shaped like the standard normal distribution
The densities are related to the standard normal by a shift (μ), a scale (or stretch, or dilation) σ, and a normalization factor
Gaussian Intuition: (Multivariate) Shift + Scale
The same intuition applies in the multivariate case.
We can think of the multivariate Gaussian as a shifted and “scaled” version of the standard multivariate normal distribution.
- The standard multivariate normal has μ = 0 and Σ = I.
The multivariate analog of the shift is simple: it's a vector μ.
But what about the scale?
- In the univariate case, the scale factor was the square root of the variance: $\sigma = \sqrt{\sigma^2}$.
- But in the multivariate case, the covariance Σ is a matrix! Does $\Sigma^{1/2}$ exist, and can we scale by it?
Multivariate Scaling (Intuitive) (optional draw-on slide for intuition)
We call a matrix "positive definite" if it scales the space in orthogonal directions. The univariate analog is a positive scalar α > 0.
Consider, e.g., how these two matrices transform pairs of orthogonal vectors:
- the matrix $\begin{bmatrix} 2 & 0 \\ 0 & 0.5 \end{bmatrix}$ acting on $\begin{bmatrix} 1 \\ 0 \end{bmatrix} \perp \begin{bmatrix} 0 \\ 1 \end{bmatrix}$
- the matrix $\begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$ acting on $\begin{bmatrix} 1 \\ 1 \end{bmatrix} \perp \begin{bmatrix} 1 \\ -1 \end{bmatrix}$
[Draw-on slide: sketch the action of each matrix on its pair of vectors.]
Notice: both matrices are symmetric!
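A quick numerical check of the two examples above: each matrix stretches or shrinks space along its own pair of orthogonal directions.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 0.5]])
B = np.array([[1.0, 0.5], [0.5, 1.0]])

# A scales the axis directions: (1, 0) -> (2, 0) and (0, 1) -> (0, 0.5)
print(A @ np.array([1.0, 0.0]), A @ np.array([0.0, 1.0]))

# B scales the diagonal directions: (1, 1) -> 1.5 * (1, 1) and (1, -1) -> 0.5 * (1, -1)
print(B @ np.array([1.0, 1.0]), B @ np.array([1.0, -1.0]))
```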
Multivariate Scaling (Formal) (details optional)
We summarize some definitions/results from linear algebra (without proof). Knowing them is optional, but they may help with intuition (esp. for PCA).
Definition. A symmetric matrix A is positive semidefinite if $x^\top A x \geq 0$ for all x. It is positive definite if $x^\top A x > 0$ for all non-zero x.
- Any positive definite matrix is positive semidefinite.
- Positive definite matrices have positive eigenvalues, and positive semidefinite matrices have non-negative eigenvalues.
- For any matrix X, $X^\top X$ and $X X^\top$ are positive semidefinite.
Theorem (Unique Positive Square Root). Let A be a positive semidefinite real matrix. Then there is a unique positive semidefinite matrix B such that $A = B^\top B = BB$. We call B, also written $A^{1/2}$, the positive square root of A.
Theorem (Spectral Theorem). The following are equivalent for $A \in \mathbb{R}^{d \times d}$:
1. A is symmetric.
2. $\mathbb{R}^d$ has an orthonormal basis consisting of eigenvectors of A.
3. There exist an orthogonal matrix Q and a diagonal matrix Λ such that $A = Q \Lambda Q^\top$. This is called the spectral decomposition of A.
- The columns of Q are (unit) eigenvectors of A.
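A sketch verifying both theorems numerically with numpy.linalg.eigh, using the symmetric matrix from the draw-on slide:

```python
import numpy as np

A = np.array([[1.0, 0.5], [0.5, 1.0]])       # symmetric and positive definite

# Spectral decomposition: A = Q Lambda Q^T with orthonormal eigenvectors as columns of Q
eigvals, Q = np.linalg.eigh(A)
Lambda = np.diag(eigvals)
print(np.allclose(A, Q @ Lambda @ Q.T))       # True

# Positive square root: A^(1/2) = Q Lambda^(1/2) Q^T satisfies A^(1/2) A^(1/2) = A
A_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T
print(np.allclose(A_sqrt @ A_sqrt, A))        # True
```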