Lecture 7: Probabilistic Models

1 Overview
Learning goals
• Know some terminology for probabilistic models: likelihood, prior distribution, posterior distribution, posterior predictive distribution, i.i.d. assumption, sufficient statistics, conjugate prior
• Be able to learn the parameters of a probabilistic model using maximum likelihood, the full Bayesian method, and the maximum a-posteriori approximation.
• Understand how these methods are related to each other. Understand why they tend to agree in the large data regime, but can often make very different predictions in the small data regime.
In the first half of the course, we introduced backpropagation, a technique we used to train neural nets to minimize a variety of cost functions. One of the cost functions we discussed was cross-entropy, which encourages the network to learn to predict a probability distribution over the targets. This was our first glimpse into probabilistic modeling. But probabilistic modeling is so important that we’re going to spend almost the last third of the course on it. This lecture introduces some of the key principles.
This lecture and the next one aren’t about neural nets. Instead, they’ll introduce the principles of probabilistic modeling in as simple a setting as possible. Then, starting next week, we’re going to apply these principles in the context of neural nets, and this will result in some very powerful models.
2 Maximum likelihood

The first method we'll cover for fitting probabilistic models is maximum likelihood. In addition to being a useful method in its own right, it will also be a stepping stone towards Bayesian modeling.
Let’s begin with a simple example: we have flipped a particular coin 100 times, and it landed heads NH = 55 times and tails NT = 45 times. We want to know the probability that it will come up heads if we flip it again. We formulate the probabilistic model:
The behavior of the coin is summarized with a parameter θ, the probability that a flip lands heads (H). The flips $D = x^{(1)}, \ldots, x^{(100)}$ are independent Bernoulli random variables with parameter θ.
(In general, we will use D as a shorthand for all the observed data.) We say that the indi- vidual flips are independent and identically distributed (i.i.d.); they are independent because one outcome does not influence any of the other outcomes, and they are identically distributed because they all follow the same distribution (i.e. a Bernoulli distribution with parameter θ).
We now define the likelihood function L(θ), which is the probability of the observed data, as a function of θ. In the coin example, the likelihood is the probability of the particular sequence of H’s and T’s being generated:
$$L(\theta) = p(D) = \theta^{N_H}(1-\theta)^{N_T}. \qquad (1)$$
Note that L is a function of the model parameters (in this case, θ), not the observed data. This likelihood function will generally take on extremely small values; for instance, $L(0.5) = 0.5^{100} \approx 7.9 \times 10^{-31}$. Therefore, in practice we almost always work with the
log-likelihood function,
$$\ell(\theta) = \log L(\theta) = N_H \log\theta + N_T \log(1-\theta). \qquad (2)$$
For our coin example, $\ell(0.5) = \log 0.5^{100} = 100\log 0.5 \approx -69.31$. This is a much easier value to work with.
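Here is a quick numerical check of these values (a minimal sketch; Python with NumPy is assumed, matching the SciPy references later in the notes):

```python
import numpy as np

L = 0.5 ** 100            # likelihood of 100 fair-coin flips
ell = 100 * np.log(0.5)   # log-likelihood of the same data
print(L)                  # ~7.9e-31, uncomfortably close to underflow
print(ell)                # ~-69.31, much easier to work with
```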
In general, we would expect good choices of θ to assign high likelihood to the observed data. This suggests the maximum likelihood criterion: choose the parameter θ which maximizes l(θ). If we’re lucky, we can do this analytically by computing the derivative and setting it to zero. (More precisely, we find critical points by setting the derivative to zero. We check which of the critical points, or boundary points, has the largest value.) Let’s try this for the coin example:
$$\frac{d\ell}{d\theta} = \frac{d}{d\theta}\bigl(N_H \log\theta + N_T \log(1-\theta)\bigr) = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} \qquad (3)$$
Setting this to zero, we find the maximum likelihood estimate
$$\hat{\theta}_{\mathrm{ML}} = \frac{N_H}{N_H + N_T}, \qquad (4)$$
i.e. the maximum likelihood estimate is simply the fraction of flips that came up heads. (We put a hat over the parameter to emphasize that it’s an estimate.) Hopefully this seems like a sensible guess for θ. Now let’s look at some more examples.
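Before moving on to those, here is a minimal sketch of the coin example in code, computing the estimate from Eqn 4 and comparing log-likelihoods (the helper function name is just for illustration):

```python
import numpy as np

def bernoulli_log_likelihood(theta, n_heads, n_tails):
    # Eqn 2: l(theta) = N_H log(theta) + N_T log(1 - theta)
    return n_heads * np.log(theta) + n_tails * np.log(1 - theta)

n_heads, n_tails = 55, 45
theta_ml = n_heads / (n_heads + n_tails)                      # Eqn 4
print(theta_ml)                                               # 0.55
print(bernoulli_log_likelihood(theta_ml, n_heads, n_tails))   # about -68.8
print(bernoulli_log_likelihood(0.5, n_heads, n_tails))        # about -69.3, worse
```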
Example 1. Suppose we are interested in modeling the distribution of temper- atures in Toronto in March. Here are the high temperatures, in Celsius, from the first week of March 2014:
-2.5 -9.9 -12.1 -8.9 -6.0 -4.8 2.4
Call these observations x(1),…,x(N), where N = 7. In order to formulate a probabilistic model, we first choose a parametric form for the distribution over temperatures. Often we choose a Gaussian distribution, not because we believe it’s an especially good model, but because it makes the computations easy. So let’s assume the temperatures are drawn from a Gaussian distribution with unknown mean μ and known standard deviation σ = 5. Our likelihood function is given by:
$$\ell(\mu) = \log\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right) = \sum_{i=1}^{N}\log\left[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right)\right] = \sum_{i=1}^{N}\left[-\frac{1}{2}\log 2\pi - \log\sigma - \frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right] \qquad (5)$$

Since μ can take any possible real value, the maximum must occur at a critical point, so let's look for critical points. Setting the derivative to 0,

$$0 = \frac{d\ell}{d\mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\frac{d}{d\mu}\left(x^{(i)}-\mu\right)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right). \qquad (6)$$

Therefore, $\sum_{i=1}^{N}\left(x^{(i)}-\mu\right) = 0$, and solving for μ, we get $\mu = \frac{1}{N}\sum_{i=1}^{N}x^{(i)}$. The maximum likelihood estimate of the mean of a normal distribution is simply the mean of the observed values, or the empirical mean. Plugging in our temperature data, we get $\hat\mu_{\mathrm{ML}} = -5.97$.
Example 2. In the last example, we pulled the standard deviation σ = 5 out of a hat. Really we’d like to learn it from data as well. Let’s add it as a parameter to the model. The likelihood function is the same as before, except now it’s a function of both μ and σ, rather than just μ. To maximize a function
of two variables, we find critical points by setting the partial derivatives to 0. In this case,

$$0 = \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right) \qquad (7)$$

$$0 = \frac{\partial \ell}{\partial \sigma} = \frac{\partial}{\partial\sigma}\sum_{i=1}^{N}\left[-\frac{1}{2}\log 2\pi - \log\sigma - \frac{1}{2\sigma^2}\left(x^{(i)}-\mu\right)^2\right] = \sum_{i=1}^{N}\left[-\frac{1}{2}\frac{\partial}{\partial\sigma}\log 2\pi - \frac{\partial}{\partial\sigma}\log\sigma - \frac{\partial}{\partial\sigma}\frac{1}{2\sigma^2}\left(x^{(i)}-\mu\right)^2\right] = \sum_{i=1}^{N}\left[0 - \frac{1}{\sigma} + \frac{1}{\sigma^3}\left(x^{(i)}-\mu\right)^2\right] \qquad (8)$$
From the first equality, we find that $\hat\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{i=1}^{N}x^{(i)}$ is the empirical mean, just as before. From the second equality, we find $\hat\sigma_{\mathrm{ML}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right)^2}$. In other words, $\hat\sigma_{\mathrm{ML}}$ is simply the empirical standard deviation. In the case of the Toronto temperatures, we get $\hat\mu_{\mathrm{ML}} = -5.97$ (as before) and $\hat\sigma_{\mathrm{ML}} = 4.55$.
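A minimal sketch of these closed-form estimates applied to the Toronto temperatures listed above (NumPy assumed):

```python
import numpy as np

x = np.array([-2.5, -9.9, -12.1, -8.9, -6.0, -4.8, 2.4])

mu_ml = np.mean(x)                              # empirical mean (Example 1)
sigma_ml = np.sqrt(np.mean((x - mu_ml) ** 2))   # empirical std, dividing by N
                                                # (equivalently np.std(x))
print(mu_ml)      # approximately -5.97
print(sigma_ml)   # approximately 4.55
```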
Example 3. We’ve just seen two examples where we could obtain the exact maximum likelihood solution analytically. Unfortunately, this situation is the exception rather than the rule. Let’s consider how to compute the maximum likelihood estimate of the parameters of the gamma distribution, whose PDF is defined as
$$p(x) = \frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}, \qquad (9)$$
where Γ(a) is the gamma function, which is a generalization of the factorial function to continuous values.1 The model parameters are a and b, both of which must take positive values. The log-likelihood, therefore, is
$$\ell(a,b) = \sum_{i=1}^{N}\left[a\log b - \log\Gamma(a) + (a-1)\log x^{(i)} - b x^{(i)}\right] = Na\log b - N\log\Gamma(a) + (a-1)\sum_{i=1}^{N}\log x^{(i)} - b\sum_{i=1}^{N}x^{(i)}. \qquad (10)$$
1The definition is $\Gamma(t) = \int_0^\infty x^{t-1}e^{-x}\,dx$, but we're never going to use the definition in this class.
Most scientific computing environments provide a function which computes log Γ(a). In SciPy, for instance, it is scipy.special.gammaln.
To maximize the log-likelihood, we’re going to use gradient ascent, which is just like gradient descent, except we move uphill rather than downhill. To derive the update rules, we need the partial derivatives:
$$\frac{\partial \ell}{\partial a} = N\log b - N\frac{d}{da}\log\Gamma(a) + \sum_{i=1}^{N}\log x^{(i)} \qquad (11)$$

$$\frac{\partial \ell}{\partial b} = \frac{Na}{b} - \sum_{i=1}^{N}x^{(i)} \qquad (12)$$

Our implementation of gradient ascent, therefore, consists of computing these derivatives, and then updating $a \leftarrow a + \alpha\frac{\partial \ell}{\partial a}$ and $b \leftarrow b + \alpha\frac{\partial \ell}{\partial b}$, where α is the learning rate. Most scientific computing environments provide a function to compute $\frac{d}{da}\log\Gamma(a)$; for instance, it is scipy.special.digamma in SciPy.
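Below is a minimal sketch of such a gradient ascent loop (NumPy/SciPy assumed). The synthetic data, initialization, learning rate, and number of steps are illustrative choices, and the gradients are averaged over the data points, which simply folds a factor of 1/N into the learning rate:

```python
import numpy as np
from scipy.special import gammaln, digamma

def gamma_log_likelihood(a, b, x):
    # Eqn 10: N a log b - N log Gamma(a) + (a - 1) sum(log x) - b sum(x)
    N = len(x)
    return N * a * np.log(b) - N * gammaln(a) + (a - 1) * np.sum(np.log(x)) - b * np.sum(x)

def fit_gamma(x, a=1.0, b=1.0, alpha=0.01, num_steps=20000):
    # The data enter only through two sufficient statistics.
    mean_log_x = np.mean(np.log(x))
    mean_x = np.mean(x)
    for _ in range(num_steps):
        grad_a = np.log(b) - digamma(a) + mean_log_x   # Eqn 11, divided by N
        grad_b = a / b - mean_x                        # Eqn 12, divided by N
        a += alpha * grad_a                            # gradient ascent: move uphill
        b += alpha * grad_b
    return a, b

# Illustrative usage on synthetic data with known parameters a = 3, b = 2.
rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=1000)
a_hat, b_hat = fit_gamma(x)
print(a_hat, b_hat)                     # should land roughly near a = 3, b = 2
print(gamma_log_likelihood(a_hat, b_hat, x))
```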
Here are some observations about these examples:
• In each of these examples, the log-likelihood function l decomposed as a sum of terms, one for each training example. This results from our independence assumption. Because different observations are independent, the likelihood decomposes as a product over training examples, so the log-likelihood decomposes as a sum.
• The derivatives worked out nicely because we were dealing with log-likelihoods. Try taking derivatives of the likelihood function L(θ), and you’ll see that they’re much messier.
• All of the log-likelihood functions we looked at wound up being expressible in terms of certain sufficient statistics of the dataset, such as $\sum_{i=1}^{N}x^{(i)}$, $\sum_{i=1}^{N}[x^{(i)}]^2$, or $\sum_{i=1}^{N}\log x^{(i)}$. When we're fitting the maximum likelihood solution, we can forget the data itself and just remember the sufficient statistics. This doesn't happen for all of our models; for instance, it didn't happen when we fit the neural language model in Assignment 1. However, it does happen for many of the distributions commonly used in practice.2
• We made a lot of questionable assumptions in formulating these models. E.g., we assumed that temperatures on different days were independent; in practice, the temperature tomorrow is likely to be similar to the temperature today. This is also true of models we fit previously; e.g., the Markov assumption we used to justify our neural language model is clearly bogus. If this were a statistics class, we'd talk about ways to test your modeling assumptions. But because this is a machine learning class, we'll throw caution to the wind and fit models that we know are wrong. Hopefully they'll still be good enough that they can make sensible predictions (in the supervised setting) or reveal interesting structure (in the unsupervised setting).

2If you're interested in learning more, the families of distributions whose likelihoods can be written in terms of sufficient statistics are known as exponential families.
Beware of data sparsity
Maximum likelihood is a very powerful technique, but it has a pitfall: if you have too little training data, it can seriously overfit. The most severe pathology is when it assigns probability 0 to things that were never seen in the training set, but which still might actually happen. For instance, suppose we flip a coin twice, and it lands H both times. The maximum likelihood estimate of θ, the probability of H, would be 1. But this is pretty extreme — effectively we’re considering it impossible that the coin will ever come up T! This problem is known as data sparsity.
This problem isn't so different in principle from examples of overfitting which we discussed for other loss functions. We would like our model to generalize well to data it hasn't seen before, i.e. assign the new data high likelihood. We can measure the generalization performance by holding out a separate test set, and measuring the log-likelihood on this test set at the very end. (As before, if we want to choose model hyperparameters, we'd hold out a separate validation set.) In our coin example, if we choose θ = 1 and the coin comes up T even a single time in the test set, this will give us a test log-likelihood of −∞. Clearly it's a bad idea to assign any outcome probability 0 if it might ever occur.
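As a concrete illustration, here is a hypothetical two-flip training set and a test set containing a single T (the split is made up for illustration; NumPy assumed):

```python
import numpy as np

def log_likelihood(theta, flips):
    # Sum of per-flip log-probabilities; flips is a list of 'H'/'T'.
    # NumPy warns about log(0) but returns -inf, which is what we want here.
    return sum(np.log(theta) if f == 'H' else np.log(1 - theta) for f in flips)

train = ['H', 'H']
test = ['H', 'T']

theta_ml = train.count('H') / len(train)   # maximum likelihood: theta = 1
print(theta_ml)
print(log_likelihood(theta_ml, test))      # -inf: the single T is "impossible"
```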
Last week, we talked about regularization as a way to attenuate overfitting. Would that work here? One naive approach would be to add an L2 penalty, $-\frac{1}{2}\theta^2$, to the objective function. (We subtract the penalty since we're maximizing.) But this isn't quite what we want: it would allow (in fact, encourage) the degenerate solution θ = 0. Instead, let's look at Bayesian techniques for parameter estimation. These techniques will turn out to be closely related to regularization.
3 Bayesian parameter estimation
In the maximum likelihood approach, the observations (i.e. the $x^{(i)}$'s) were treated as random variables, but the model parameters were not. In the Bayesian approach, we treat the parameters as random variables as well. We define a model for the joint distribution p(θ, D) over parameters θ and data D. (In our coin example, θ would be the probability of H, and D would be the sequence of 100 flips that we observed.) Then we can perform the usual operations on this joint distribution, such as marginalization and conditioning.
In order to define this joint distribution, we need two things:
• A distribution p(θ), known as the prior distribution. It’s called the prior because
it’s supposed to encode your “prior beliefs,” i.e. everything you believed about the parameters before looking at the data. In practice, we normally choose priors to be computationally convenient, rather than based on any sort of statistical principle. More on this later.
• The likelihood p(D|θ), the probability of the observations given the parameters, just like in maximum likelihood.
Bayesians are primarily interested in computing two things:
• The posterior distribution p(θ|D). This corresponds to our beliefs about the parameters after observing the data. In general, the posterior distribution can be computed using Bayes' Rule:
$$p(\theta\,|\,D) = \frac{p(\theta)\,p(D\,|\,\theta)}{\int p(\theta')\,p(D\,|\,\theta')\,d\theta'}. \qquad (13)$$
However, we don’t normally compute the denominator directly. Instead we work with unnormalized distributions as long as possible, and normalize only when we need to. Bayes’ Rule can therefore be written in a more succinct form, using the symbol ∝ to denote “proportional to”:
p(θ | D) ∝ p(θ)p(D | θ). (14)
• The posterior predictive distribution p(D′ | D), which is the distribution over future observables given past observations. For instance, given that we've observed 55 H's and 45 T's, what's the probability that the next flip will land H? We can compute the posterior predictive distribution by computing the posterior over θ and then marginalizing out θ:
$$p(D'\,|\,D) = \int p(\theta\,|\,D)\,p(D'\,|\,\theta)\,d\theta. \qquad (15)$$
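Here is a minimal numerical sketch of Eqns 13–15 for the coin example, approximating the integrals with a grid over θ; the grid size and the uniform prior are illustrative choices:

```python
import numpy as np

n_heads, n_tails = 55, 45
theta = np.linspace(1e-4, 1 - 1e-4, 1000)   # grid over the parameter
dtheta = theta[1] - theta[0]

prior = np.ones_like(theta)                  # uniform prior, p(theta) = 1
likelihood = theta ** n_heads * (1 - theta) ** n_tails

unnormalized = prior * likelihood                          # Eqn 14
posterior = unnormalized / (unnormalized.sum() * dtheta)   # normalize (Eqn 13)

# Posterior predictive probability that the next flip is H (Eqn 15):
p_next_heads = (posterior * theta).sum() * dtheta
print(p_next_heads)   # about 0.549; the exact answer under this prior is 56/102
```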
The full Bayesian approach
Let’s figure out the posterior distribution and posterior predictive distribution for our coin example. We’ve already specified the likelihood, so it remains to specify the prior. One option is to use an uninformative prior, which assumes as little as possible about the problem. In the case of the coin, this might correspond to the uniform distribution p(θ) = 1. (There is no single recipe for choosing an uninformative prior; statisticians have a few different recipes which often, but not always, agree with each other.)
Alternatively, we can draw upon our lifetime of experience flipping coins. Most coins tend to be fair, i.e. they come up heads around 50% of the time. So perhaps our prior should make θ = 0.5 more likely. There are a lot of distributions which can do this, but a particularly useful one is the beta distribution, parameterized by a, b > 0, and defined by
$$p(\theta; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}. \qquad (16)$$

Figure 1: The PDF of the beta distribution for various values of the parameters a and b. Observe that the distribution becomes more peaked as a and b become large, and the peak is near a/(a + b).
This distribution is visualized in Figure 1. Why did we choose the beta distribution, of all things? Once we work through the example, we'll see that it's actually pretty convenient. Observe that the first term (with all the Γ's) is just a normalizing constant, so it doesn't depend on θ. In most of our computations, we'll only need to work with unnormalized distributions (i.e. ones which don't necessarily integrate to 1), so we can drop the cumbersome normalizing constant and write
$$p(\theta; a, b) \propto \theta^{a-1}(1-\theta)^{b-1}. \qquad (17)$$

A few values are plotted in Figure 1. From these plots, we observe a few things:
• The distribution appears to be centered around a/(a + b). In fact, it’s possible to show that if θ ∼ Beta(a, b), then E[θ] = a/(a + b).
• The distribution becomes more peaked for larger values of a and b.
• The values a = b = 1 correspond to the uniform distribution. Therefore, we can
simply treat the uniform prior as a special case of the beta prior.
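A small sketch (SciPy assumed) that checks these observations numerically for a few illustrative values of a and b:

```python
from scipy import stats

# Mean and spread of Beta(a, b) for a few illustrative parameter settings.
for a, b in [(1.0, 1.0), (2.0, 2.0), (20.0, 20.0), (3.0, 1.0)]:
    dist = stats.beta(a, b)
    print(a, b, dist.mean(), dist.std())   # mean is a/(a+b); std shrinks as a, b grow
```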
Figure 2: Plots of the prior, likelihood, and posterior for the coin flip example, with the prior Beta(2,2). (Left) Small data setting, NH = 2, NT = 0. (Right) Large data setting, NH = 55, NT = 45. In this case, the data overwhelm the prior, so the posterior is determined by the likelihood. Note: for visualization purposes, the likelihood function is normalized to integrate to 1, since otherwise it would be too small to see.
Now let’s compute the posterior and posterior predictive distributions. When we plug in our prior and likelihood terms for the coin example, we get:
$$p(\theta\,|\,D) \propto p(\theta)\,p(D\,|\,\theta) \qquad (18)$$
$$\propto \left[\theta^{a-1}(1-\theta)^{b-1}\right]\theta^{N_H}(1-\theta)^{N_T} \qquad (19)$$
$$= \theta^{a-1+N_H}(1-\theta)^{b-1+N_T}. \qquad (20)$$
But this is just a beta distribution with parameters NH + a and NT + b. Let’s stop and check if our answer makes sense. As we observe more flips, NH and NT both get larger, and the distribution becomes more peaked around a particular value. Furthermore, the peak of the distribution will be near NH /(NH + NT ), our maximum likelihood solution. This reflects the fact that the more data we observe, the less uncertainty there is about the parameter, and the more the likelihood comes to dominate. We say that the data overwhelm the prior.
We now compute the posterior predictive distribution over the next flip $x'$:

$$\theta_{\mathrm{pred}} = \Pr(x' = H\,|\,D) = \int p(\theta\,|\,D)\,\Pr(x' = H\,|\,\theta)\,d\theta = \int \mathrm{Beta}(\theta; N_H + a, N_T + b)\cdot\theta\,d\theta = \mathbb{E}_{\mathrm{Beta}(\theta; N_H + a, N_T + b)}[\theta] = \frac{N_H + a}{N_H + N_T + a + b}, \qquad (21)$$
where the last line follows from the formula for the expectation of a beta random variable. Again, let’s check if this is reasonable. If NH and NT are large, this is approximately NH /(NH +NT ), so our predictions are similar to the ones we get using maximum likelihood. However, if NH and NT are relatively small, then the predictive distribution is smoothed, i.e. less extreme than the maximum likelihood one. The value θpred is somewhere in between the prior and the maximum likelihood estimate.
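A minimal sketch of this conjugate update and the smoothed prediction of Eqn 21, using the prior Beta(2, 2) from Figure 2 and comparing the small data and large data settings:

```python
def posterior_predictive(n_heads, n_tails, a=2.0, b=2.0):
    # Posterior is Beta(N_H + a, N_T + b); Eqn 21 gives Pr(next flip = H | D).
    return (n_heads + a) / (n_heads + n_tails + a + b)

# Small data: 2 heads, 0 tails. Maximum likelihood would say theta = 1.
print(posterior_predictive(2, 0))     # 4/6 = 0.667, pulled toward the prior mean 0.5

# Large data: 55 heads, 45 tails. The data overwhelm the prior.
print(posterior_predictive(55, 45))   # 57/104 = 0.548, close to the MLE 0.55
```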
OK, back to an earlier question. Where did our choice of prior come from? The key thing to notice is Eqn 20, where the posterior wound up belonging to the same family of distributions as the prior. Why did this happen? Let’s compare the formulas for the beta distribution and the likelihood:
$$p(\theta) = \mathrm{Beta}(\theta; a, b) \propto \theta^{a-1}(1-\theta)^{b-1} \qquad (22)$$
$$p(D\,|\,\theta) \propto \theta^{N_H}(1-\theta)^{N_T} \qquad (23)$$
In other words, the prior was chosen to have the same functional form as the likelihood.3 Since we multiply these expressions together to get the (unnormalized) posterior, the pos- terior will also have this functional form. A prior chosen in this way is called a conjugate prior. In this case, the parameters of the prior distribution simply got added to the observed counts, so they are sometimes referred to as pseudo-counts.
Let’s look at some more examples.
Example 4. Let's return to our problem of estimating the mean temperature in Toronto, where our model assumes a Gaussian with unknown mean μ and known standard deviation σ = 5. The first task is to choose a conjugate prior. In order to do this, let's look at the PDF of a single data point:
$$p(x\,|\,\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (24)$$
3The ∝ notation obscures the fact that the normalizing constants in these two expressions may be completely different, since p(θ) is a distribution over parameters, while p(D | θ) is a distribution over observed data. In this example, the latter normalizing constant happens to be 1, but that won’t always be the case.
If we look at this as a function of μ (rather than x), we see that it’s still a Gaussian! This should lead us to conjecture that the conjugate prior for a Gaussian is a Gaussian. Let’s try it and see if it works.
Our prior distribution will be a Gaussian distribution with mean $\mu_{\mathrm{pri}}$ and standard deviation $\sigma_{\mathrm{pri}}$. The posterior is then given by:

$$p(\mu\,|\,D) \propto p(\mu)\,p(D\,|\,\mu)$$

$$= \frac{1}{\sqrt{2\pi}\,\sigma_{\mathrm{pri}}}\exp\left(-\frac{(\mu-\mu_{\mathrm{pri}})^2}{2\sigma_{\mathrm{pri}}^2}\right)\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right)$$

$$\propto \exp\left(-\frac{(\mu-\mu_{\mathrm{pri}})^2}{2\sigma_{\mathrm{pri}}^2} - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(x^{(i)}-\mu\right)^2\right)$$

$$= \exp\left(-\frac{\mu^2}{2\sigma_{\mathrm{pri}}^2} + \frac{\mu_{\mathrm{pri}}\,\mu}{\sigma_{\mathrm{pri}}^2} - \frac{\mu_{\mathrm{pri}}^2}{2\sigma_{\mathrm{pri}}^2} - \frac{1}{2\sigma^2}\sum_{i=1}^{N}[x^{(i)}]^2 + \frac{1}{\sigma^2}\sum_{i=1}^{N}x^{(i)}\,\mu - \frac{N}{2\sigma^2}\mu^2\right)$$

The exponent is a quadratic function of μ, so completing the square (collecting the $\mu^2$ and μ terms, and absorbing everything that doesn't depend on μ into the constant of proportionality) shows that the posterior is again a Gaussian. In other words, the Gaussian prior is conjugate to the Gaussian likelihood with known σ, just as we conjectured.
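A minimal sketch of the resulting update applied to the Toronto data. The formulas below are the standard conjugate-Gaussian result obtained by completing the square (not derived line by line above), and the prior parameters $\mu_{\mathrm{pri}} = 0$, $\sigma_{\mathrm{pri}} = 10$ are illustrative choices, not from the lecture:

```python
import numpy as np

x = np.array([-2.5, -9.9, -12.1, -8.9, -6.0, -4.8, 2.4])   # Toronto temperatures
sigma = 5.0                     # known observation standard deviation, as before
mu_pri, sigma_pri = 0.0, 10.0   # illustrative prior, not from the lecture
N = len(x)

# Standard conjugate-Gaussian update: the posterior precision is the sum of the
# prior precision and N times the likelihood precision.
post_precision = 1.0 / sigma_pri ** 2 + N / sigma ** 2
mu_post = (mu_pri / sigma_pri ** 2 + x.sum() / sigma ** 2) / post_precision
sigma_post = np.sqrt(1.0 / post_precision)

print(mu_post, sigma_post)   # the posterior mean is pulled slightly from -5.97 toward mu_pri
```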