Accelerated Natural Language Processing Week 2/Unit 1
Models and probability estimation
Sharon Goldwater
Videos 1-2: Probabilities in language (introduction and motivation)
A famous quote
It must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term.
Noam Chomsky, 1969
• “useless”: To everyone? To linguists?
• “known interpretation”: What are possible interpretations?
Today’s lecture
• What do we mean by the “probability of a sentence” and what is it good for?
• What is probability estimation? What does it require?
• What is a generative model and what are model parameters?
• What is maximum-likelihood estimation and how do I compute likelihood?

Intuitive interpretation
• “Probability of a sentence” = how likely is it to occur in natural language
– Consider only a specific language (English)
– Not including meta-language (e.g. linguistic discussion)
P(She studies morphosyntax) > P(She studies more faux syntax)

Automatic speech recognition
Sentence probabilities (language model) help decide between similar-sounding options.
speech input
↓ (Acoustic model)
possible outputs: She studies morphosyntax / She studies more faux syntax / She’s studies morph or syntax / …
↓ (Language model)
best-guess output: She studies morphosyntax

Machine translation
Sentence probabilities help decide word choice and word order.
non-English input
↓ (Translation model)
possible outputs: She is going home / She is going house / She is traveling to home / To home she is going / …
↓ (Language model)
best-guess output: She is going home
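Not part of the original slides, but here is a minimal sketch of how the pipeline above could pick a best-guess output: each candidate gets a score from the acoustic (or translation) model and a score from the language model, and the system keeps the candidate with the highest combined score. The candidate sentences are from the slide; the numeric scores are invented purely for illustration.

```python
# Hypothetical ASR candidates with made-up log-probabilities:
# (sentence, log P(acoustics | sentence), log P(sentence)).
# Real systems compute these scores; the numbers here are illustrative only.
candidates = [
    ("She studies morphosyntax",       -10.2, -14.1),
    ("She studies more faux syntax",   -10.0, -19.7),
    ("She’s studies morph or syntax",  -11.5, -23.4),
]

# Best-guess output: maximize log P(acoustics | w) + log P(w),
# i.e. acoustic model score plus language model score.
best_sentence, _, _ = max(candidates, key=lambda c: c[1] + c[2])
print(best_sentence)   # She studies morphosyntax
```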
So, not “entirely useless”…
• Sentence probabilities are clearly useful for language engineering [this course].
• Given time, I could argue why they’re also useful in linguistic science (e.g., psycholinguistics). But that’s another course…
But, what about zero probability sentences?
the Archaeopteryx winged jaggedly amidst foliage
vs
jaggedly trees the on flew
• Neither has ever occurred before. ⇒ both have zero probability.
• But one is grammatical (and meaningful), the other not.
⇒ “Sentence probability” is useless as a measure of grammaticality.
The logical flaw
• “Probability of a sentence” = how likely is it to occur in natural language.
• Is the following statement true?
Sentence has never occurred ⇒ sentence has zero probability
• More generally, is this one?
Event has never occurred ⇒ event has zero probability
Video 3: Zero probability sentences (answer)
Events that have never occurred
• Each of these events has never occurred:
– My hair turns blue
– I injure myself in a skiing accident
– I travel to Finland
• Yet, they clearly have differing (and non-zero!) probabilities.
• Most sentences (and events) have never occurred.
– This doesn’t make their probabilities zero (or meaningless), but
– it does make estimating their probabilities trickier.

Video 4: Estimation and examples of models
Probability theory vs estimation
• Probability theory can solve problems like:
– I have a jar with 6 blue marbles and 4 red ones.
– If I choose a marble uniformly at random, what’s the probability it’s red?
• But what about:
– I have a jar of marbles.
– I repeatedly choose a marble uniformly at random and then replace it before choosing again.
– In ten draws, I get 6 blue marbles and 4 red ones.
– On the next draw, what’s the probability I get a red marble?
• The latter also requires estimation theory. (See the sketch below.)
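A minimal sketch (not from the slides) of the contrast: in the first problem the jar’s contents are known, so probability theory alone gives the answer; in the second we only see the ten draws and must estimate. The relative-frequency estimate used here is one possible choice, discussed later in the unit.

```python
# Case 1: jar contents known -- probability theory gives the answer directly.
blue, red = 6, 4
p_red = red / (blue + red)                  # = 0.4

# Case 2: jar contents unknown -- we only observe 10 draws (6 blue, 4 red)
# and must *estimate* the probability of red on the next draw.
obs_blue, obs_red = 6, 4
p_red_hat = obs_red / (obs_blue + obs_red)  # relative-frequency estimate, = 0.4

print(p_red, p_red_hat)
```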
Example: weather forecasting
What is the probability that it will rain tomorrow?
• To answer this question, we need
– data: measurements of relevant info (e.g., humidity, wind speed/direction, temperature).
– model: equations/procedures to estimate the probability using the data.
• In fact, to build the model, we will need data (including outcomes) from previous situations as well.
• Note that we will never know the “true” probability of rain P(rain), only our estimated probability P̂(rain).
Example: language model
What is the probability of sentence w⃗ = w₁ … wₙ?
• To answer this question, we need
– data: words w₁ … wₙ, plus a large corpus of sentences (“previous situations”, or training data).
– model: equations to estimate the probability using the data.
• Different models will yield different estimates, even with the same data.
• Deep question: what model/estimation method do humans use?
How to get better probability estimates
Better estimates definitely help in language technology. How to improve them?
• More training data. Limited by time, money. (Varies a lot!)
• Better model. Limited by scientific and mathematical knowledge, computational resources.
• Better estimation method. Limited by mathematical knowledge, computational resources.
We will return to the question of how to know if estimates are “better”.
Notation
• When the distinction is important, we will use
– P(w⃗) for true probabilities
– P̂(w⃗) for estimated probabilities
– P_E(w⃗) for estimated probabilities using a particular estimation method E.
• But since we almost always mean estimated probabilities, we may get lazy later and use P(w⃗) for those too.
Example: estimation for coins
I flip a coin 10 times, getting 7T, 3H. What is P̂(T)?
• Model 1: Coin is fair. Then, P̂(T) = 0.5
Video 5: Comments on answers to coin flip example (modelling assumptions)
Example: estimation for coins
I flip a coin 10 times, getting 7T, 3H. What is P̂(T)?
• Model 1: Coin is fair. Then, P̂(T) = 0.5
• Model 2: Coin is not fair.¹ Then, P̂(T) = 0.7 (why?)
¹Technically, the physical process of flipping a coin means that it’s not really possible to have a biased coin flip. To see a bias, we’d actually need to spin the coin vertically and wait for it to tip over. See https://www.stat.berkeley.edu/~nolan/Papers/dice.pdf for an interesting discussion of this and other coin flipping issues.
Video 6: Model structure and parameters; Gaussian example
Example: estimation for coins
I flip a coin 10 times, getting 7T, 3H. What is P̂(T)?
• Model 1: Coin is fair. Then, P̂(T) = 0.5
• Model 2: Coin is not fair. Then, P̂(T) = 0.7 (why?)
• Model 3: Two coins, one fair and one not; choose one at random to flip 10 times. Then, 0.5 < P̂(T) < 0.7.
Each is a generative model: a probabilistic process that describes how the data were generated. (Sketched in code below.)
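A sketch (not in the slides) of the three models written as generative processes, i.e. code that could have produced the ten flips. The value 0.7 for the unfair coin is the one used on the slide; everything else is illustrative.

```python
import random

def model1():
    """Fair coin: P(T) = 0.5 for every flip."""
    return ["T" if random.random() < 0.5 else "H" for _ in range(10)]

def model2():
    """Unfair coin with P(T) = 0.7 (the slide's value)."""
    return ["T" if random.random() < 0.7 else "H" for _ in range(10)]

def model3():
    """Two coins, one fair and one unfair; pick one at random, then flip it 10 times."""
    p_tails = random.choice([0.5, 0.7])
    return ["T" if random.random() < p_tails else "H" for _ in range(10)]

random.seed(1)
print(model1(), model2(), model3(), sep="\n")
```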
Defining a model
Usually, two choices in defining a model:
• Structure (or form) of the model: the form of the equations, usually determined by knowledge about the problem.
• Parameters of the model: specific values in the equations that are usually determined using the training data.
Example: height of 30-yr-old females
Assume the form of a normal distribution (or Gaussian), with parameters (μ, σ):
p(x | μ, σ) = 1/(σ√(2π)) · exp(−(x − μ)² / (2σ²))
Collect data to determine the values of μ, σ that fit this particular dataset. (See the sketch below.)
I could then make good predictions about the likely height of the next 30-yr-old female I meet.
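A minimal sketch (not in the slides) of this fitting step: estimate μ and σ from a sample of heights, then evaluate the resulting density. The height values below are invented for illustration.

```python
import math

# Invented illustrative data: heights (cm) of 30-yr-old females.
heights = [162.0, 158.5, 170.2, 165.1, 167.8, 160.4, 163.3, 169.0]

# Fit the two parameters to the data (these are the maximum-likelihood estimates).
mu = sum(heights) / len(heights)
sigma = math.sqrt(sum((x - mu) ** 2 for x in heights) / len(heights))

def gaussian_pdf(x, mu, sigma):
    """p(x | mu, sigma) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)**2 / (2 * sigma**2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Predicted density for the height of the next 30-yr-old female I meet, e.g. 166 cm.
print(mu, sigma, gaussian_pdf(166.0, mu, sigma))
```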
Model criticism
• Sometimes using an incorrect model structure can still give useful results. (We’ll see examples.)
• But sometimes we might need to revise the model if the data don’t seem to match the model assumptions.
All models are approximations. How good the approximation needs to be depends on what we are trying to do with it.
Example: height of 30-yr-old females
What if our data looked like this? (The figure on this slide shows a dataset whose histogram has two separate peaks rather than a single bell curve.)
The true model
The true generative model for the second dataset was actually:
Assume two groups, each with a Gaussian distribution. For each data point,
1. Choose which group this point belongs to.
2. Conditioned on the group, choose a height value from that group’s distribution.
Question: how many parameters does this model have?
Mixture model
If I use the original model structure (single Gaussian), no estimate of model parameters will lead to accurate predictions.
This model is a mixture of two Gaussians, and has five parameters (see the sketch below):
• The mixing weight: probability of choosing group 1 or group 2 (in this case, 0.5).
• μ and σ for each of the two Gaussian distributions.
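A sketch (not in the slides) of the two-group generative process with its five parameters made explicit; the group means and standard deviations below are invented illustrative values.

```python
import random

# The five parameters of a mixture of two Gaussians:
mix_weight = 0.5              # probability of choosing group 1 (slide's value)
mu1, sigma1 = 162.0, 6.0      # group 1 mean and std dev (illustrative)
mu2, sigma2 = 176.0, 7.0      # group 2 mean and std dev (illustrative)

def sample_height():
    # 1. Choose which group this point belongs to.
    if random.random() < mix_weight:
        mu, sigma = mu1, sigma1
    else:
        mu, sigma = mu2, sigma2
    # 2. Conditioned on the group, choose a height from that group's Gaussian.
    return random.gauss(mu, sigma)

random.seed(0)
data = [sample_height() for _ in range(1000)]   # a dataset with two peaks
```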
Video 7: Estimation for M&Ms; definition of likelihood and MLE
Example: M&M colors
What is the proportion of each color of M&M?
• Assume a discrete distribution with parameters θ.
– θ is a vector! That is, θ = (θR, θO, θY, θG, θBl, θBr).
– For a discrete distribution, the params ARE the probabilities, e.g., P(red) = θR.
– Note: if there are six colors, there are really only five parameters. (why?)
• In 48 packages, I find² 2620 M&Ms, as follows:

Red    Orange  Yellow  Green  Blue  Brown
372    544     369     483    481   371

• How to estimate θ from this data?
²Actually, data from: https://joshmadison.com/2007/12/02/mms-color-distribution-analysis/
Relative frequency estimation
• Intuitive way to estimate discrete probabilities:
P_RF(x) = C(x) / N
where C(x) is the count of x in a large dataset, and N = Σ_{x′} C(x′) is the total number of items in the dataset.
• M&M example: P_RF(red) = θ̂_R = 372/2620 = .142 (computed in the sketch below)
• Or, could estimate probability of word w from a large corpus.
• Can we justify this mathematically?

Relative frequency estimation
As the number of observations approaches infinity, the relative frequency estimate converges to the true probability. In practical terms,
• If our counts are large, estimates are fairly accurate. 150 red M&Ms out of 1000:
P_RF(red) = .15 and P(red) not likely to be .1 or .2.
• If our counts are small, estimates are not so accurate. 3 red M&Ms out of 20:
P_RF(red) = .15 but P(red) could easily be .1 or .2.
(It’s really the size of the numerator that matters, as we’ll see later.)
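A sketch (not in the slides) of relative-frequency estimation applied to the M&M counts from above.

```python
# M&M counts from the slides (48 packages, 2620 M&Ms in total).
counts = {"red": 372, "orange": 544, "yellow": 369,
          "green": 483, "blue": 481, "brown": 371}

N = sum(counts.values())                             # 2620
theta_hat = {c: n / N for c, n in counts.items()}    # relative-frequency estimates

print(round(theta_hat["red"], 3))                    # 0.142, as on the slide
```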
Maximum-likelihood estimation
RF estimation is also called maximum-likelihood estimation (MLE).
• The likelihood is the probability of the observed data d under some particular model with parameters θ: that is, P(d|θ).
• For a fixed d, different choices of θ yield different P(d|θ).
• If we choose θ using relative frequencies, we get the maximum possible value for P(d|θ): the maximum likelihood. (A sketch of the standard derivation follows below.)
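The slides state this fact without proof; for completeness, here is a sketch of the standard derivation for a discrete distribution (not shown on the slides), maximizing the log-likelihood subject to the constraint that the parameters sum to one:

```latex
\log P(d \mid \theta) = \sum_{x} C(x)\,\log \theta_x ,
  \qquad \text{subject to } \sum_{x}\theta_x = 1 .

\text{With Lagrange multiplier } \lambda:\quad
  \frac{\partial}{\partial \theta_x}\Big[\sum_{x'} C(x')\log\theta_{x'}
    + \lambda\big(1 - \sum_{x'}\theta_{x'}\big)\Big]
  = \frac{C(x)}{\theta_x} - \lambda = 0
  \;\Rightarrow\; \theta_x = \frac{C(x)}{\lambda} .

\text{Summing over } x \text{ and using } \sum_x \theta_x = 1
  \text{ gives } \lambda = \sum_x C(x) = N,
  \text{ so } \hat{\theta}_x = \frac{C(x)}{N} = P_{RF}(x) .
```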
Likelihood example
• For a fixed dataset, the likelihood depends on the model we use.
• Our coin example: θ = (θH, θT). Suppose d = HTTTHTHTTT.
• Model 1: Assume coin is fair, so θ̂ = (0.5, 0.5).
– Likelihood of this model: P(HTTTHTHTTT|θ̂) = (0.5)³ · (0.5)⁷ = 0.00097
• Model 2: Use ML estimation, so θ̂ = (0.3, 0.7).
– Likelihood of this model: P(HTTTHTHTTT|θ̂) = (0.3)³ · (0.7)⁷ = 0.00222
• Maximum-likelihood estimate does have higher likelihood! (Checked numerically in the sketch below.)
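A sketch (not in the slides) that recomputes these two likelihoods and checks a grid of other values of θT, confirming that the relative-frequency value 0.7 maximizes the likelihood for this dataset.

```python
data = "HTTTHTHTTT"
n_heads, n_tails = data.count("H"), data.count("T")   # 3 and 7

def likelihood(theta_tails):
    """P(data | theta) for a coin with P(T) = theta_tails and P(H) = 1 - theta_tails."""
    return (1 - theta_tails) ** n_heads * theta_tails ** n_tails

print(likelihood(0.5))   # Model 1: 0.0009765... (shown as 0.00097 on the slide)
print(likelihood(0.7))   # Model 2: 0.0022235... (shown as 0.00222 on the slide)

# The maximum-likelihood estimate theta_T = 7/10 = 0.7 beats every other value on a grid.
grid = [i / 100 for i in range(1, 100)]
print(max(grid, key=likelihood))   # 0.7
```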
Where to go from here?
In the next unit, we’ll start to discuss
• Different generative models for sentences (model structure), and the questions they can address
• Weaknesses of MLE and ways to address them (parameter estimation methods)