
ANLP Week 2/Unit 3 N-gram models and smoothing
Sharon Goldwater
(some slides from Philipp Koehn)
Video 1: Add-one smoothing
Recap: N-gram models
• We can model sentence probs by conditioning each word on N − 1 previous words.
• For example, a bigram model:
  P(w⃗) = ∏_{i=1}^{n} P(wi|wi−1)
• Or trigram model:
  P(w⃗) = ∏_{i=1}^{n} P(wi|wi−2, wi−1)

Recap: MLE estimates for N-grams
• To estimate each word prob, we could use MLE…
  PML(w2|w1) = C(w1, w2) / C(w1)
• But even with a large training corpus, not all N-grams will be observed (we still get zeros): we over-fit the training data.

Today: smoothing methods, which reassign probability mass from observed to unobserved events, to avoid overfitting/zero probs.
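To make the recap concrete, here is a minimal Python sketch (not from the original slides) of MLE bigram estimation on a toy two-sentence corpus; the corpus and the <s>/</s> boundary symbols are assumptions made purely for the example.

    from collections import Counter

    # Toy corpus; a real model would be trained on millions of tokens (e.g. Europarl).
    corpus = [
        ["<s>", "the", "dog", "barks", "</s>"],
        ["<s>", "the", "cat", "meows", "</s>"],
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in corpus:
        for w1, w2 in zip(sent, sent[1:]):
            unigram_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1

    def p_ml(w2, w1):
        # P_ML(w2|w1) = C(w1, w2) / C(w1)
        return bigram_counts[(w1, w2)] / unigram_counts[w1]

    def sentence_prob(sent):
        # P(w⃗) = product over i of P(wi|wi−1)
        p = 1.0
        for w1, w2 in zip(sent, sent[1:]):
            p *= p_ml(w2, w1)
        return p

    print(p_ml("dog", "the"))                                      # 0.5
    print(sentence_prob(["<s>", "the", "dog", "meows", "</s>"]))   # 0.0 -- unseen bigram

The zero probability for the unseen bigram "dog meows" is exactly the problem the smoothing methods below address.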

Today’s lecture:
• How does add-alpha smoothing work, and what are its effects?
• What are some more sophisticated smoothing methods, and what information do they use that simpler methods don’t?
• What are training, development, and test sets used for?
• What are the trade-offs between higher order and lower order n-grams?
• What is a word embedding and how can it help in language modelling?

Add-One Smoothing
• For all possible bigrams, add one more count.
  PML(wi|wi−1) = C(wi−1, wi) / C(wi−1)
  ⇒ P+1(wi|wi−1) = (C(wi−1, wi) + 1) / C(wi−1)   ?

Add-One Smoothing: normalization
• We want:
  Σ_{wi∈V} P(wi|wi−1) = 1
• True for PML, but we increased the numerator; must change denominator too.
• We want:
  Σ_{wi∈V} (C(wi−1, wi) + 1) / (C(wi−1) + x) = 1
• Solve for x:
  Σ_{wi∈V} (C(wi−1, wi) + 1) = C(wi−1) + x
  Σ_{wi∈V} C(wi−1, wi) + Σ_{wi∈V} 1 = C(wi−1) + x
  C(wi−1) + v = C(wi−1) + x
• So, P+1(wi|wi−1) = (C(wi−1, wi) + 1) / (C(wi−1) + v), where v = vocabulary size.
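Continuing the toy sketch above (again an illustration, not part of the slides; it reuses the corpus, bigram_counts and unigram_counts defined there), add-one smoothing can be written as:

    vocab = {w for sent in corpus for w in sent}   # v = vocabulary size
    v = len(vocab)

    def p_add_one(w2, w1):
        # P_+1(w2|w1) = (C(w1, w2) + 1) / (C(w1) + v)
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + v)

    # Unseen bigrams now get a small non-zero probability...
    print(p_add_one("meows", "dog"))                 # 1 / (1 + 7) = 0.125
    # ...and the distribution over the vocabulary still sums to 1:
    print(sum(p_add_one(w, "dog") for w in vocab))   # 1.0 (up to floating-point error)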

Add-One Smoothing: effects
• Add-one smoothing:
  P+1(wi|wi−1) = (C(wi−1, wi) + 1) / (C(wi−1) + v)
• Large vocabulary size means v is often much larger than C(wi−1), overpowers actual counts.
• Example: in Europarl, v = 86,700 word types (30m tokens, max C(wi−1) = 2m).
• Using v = 86,700, compute some example probabilities:

                  C(wi−1) = 10,000        C(wi−1) = 100
  C(wi−1, wi)     PML       P+1           PML       P+1
  100             1/100     ≈ 1/970       1         ≈ 1/870
  10              1/1k      ≈ 1/10k       1/10      ≈ 1/9k
  1               1/10k     ≈ 1/48k       1/100     ≈ 1/43k
  0               0         ≈ 1/97k       0         ≈ 1/87k

Video 2: Add-alpha and train/dev/test
The problem with Add-One smoothing
• All smoothing methods “steal from the rich to give to the poor”
• Add-one smoothing steals way too much
• ML estimates for frequent events are quite accurate, don’t want smoothing to change these much.

Add-α Smoothing
• Add α < 1 to each count:
  P+α(wi|wi−1) = (C(wi−1, wi) + α) / (C(wi−1) + αv)
• Simplifying notation: c is the n-gram count, n is the history count:
  P+α = (c + α) / (n + αv)
• What is a good value for α?

Optimizing α
• Divide corpus into training set (80-90%), held-out (or development or validation) set (5-10%), and test set (5-10%)
• Train model (estimate probabilities) on training set with different values of α
• Choose the value of α that minimizes perplexity on dev set
• Report final results on test set

A general methodology
• Training/dev/test split is used across machine learning
• Development set used for evaluating different models, debugging, optimizing parameters (like α)
• Test set simulates deployment; only used once final model and parameters are chosen. (Ideally: once per paper)
• Avoids overfitting to the training set and even to the test set

Video 3: Answering questions about previous two videos (no slides)

Video 4: Interpolation and backoff

Is add-α sufficient?
• Even if we optimize α, add-α smoothing makes pretty bad predictions for word sequences.
• Some cleverer methods such as Good-Turing improve on this by discounting less from very frequent items. But there’s still a problem...

Remaining problem
• In given corpus, suppose we never observe
  – Scottish beer drinkers
  – Scottish beer eaters
• If we build a trigram model smoothed with Add-α or G-T, which example has higher probability?

Example with add-α smoothing
• We said for bigram model,
  P+α(wi|wi−1) = (C(wi−1, wi) + α) / (C(wi−1) + αv)
• And for any N-gram model where c is the n-gram count and n is the history count:
  P+α = (c + α) / (n + αv)
• Here:
  – c is Count(Scottish beer drinkers/eaters) = 0 for both.
  – n is Count(Scottish beer) for both.
  ⇒ both trigrams get the same smoothed probability.

Remaining problem
• Previous smoothing methods assign equal probability to all unseen events.
• Better: use information from lower order N-grams (shorter histories).
  – beer drinkers
  – beer eaters

Interpolation
• Two ways: interpolation and backoff.
• Higher and lower order N-gram models have different strengths and weaknesses
  – high-order N-grams are sensitive to more context, but have sparse counts
  – low-order N-grams consider only very limited context, but have robust counts
• So, combine them:
  PI(w3|w1,w2) = λ1 P1(w3) + λ2 P2(w3|w2) + λ3 P3(w3|w1,w2)
  e.g. P1(drinkers), P2(drinkers|beer), P3(drinkers|Scottish,beer)
• Note that the λi's must sum to 1:
  1 = Σ_{w3} PI(w3|w1,w2)
    = Σ_{w3} [λ1 P1(w3) + λ2 P2(w3|w2) + λ3 P3(w3|w1,w2)]
    = λ1 Σ_{w3} P1(w3) + λ2 Σ_{w3} P2(w3|w2) + λ3 Σ_{w3} P3(w3|w1,w2)
    = λ1 + λ2 + λ3

Fitting the interpolation parameters
• In general, any weighted combination of distributions is called a mixture model.
• So the λi's are interpolation parameters or mixture weights.
• The values of the λi's are chosen to optimize perplexity on a held-out data set.
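As an illustration of the interpolation formula, here is a small sketch of my own (not from the slides): p1, p2 and p3 stand for already-estimated unigram, bigram and trigram distributions, and the default lambda values are arbitrary placeholders rather than tuned weights.

    def p_interp(w3, w1, w2, p1, p2, p3, lambdas=(0.2, 0.3, 0.5)):
        # P_I(w3|w1,w2) = l1*P1(w3) + l2*P2(w3|w2) + l3*P3(w3|w1,w2)
        l1, l2, l3 = lambdas
        assert abs(l1 + l2 + l3 - 1.0) < 1e-9   # mixture weights must sum to 1
        return l1 * p1(w3) + l2 * p2(w3, w2) + l3 * p3(w3, w1, w2)

    # e.g. p_interp("drinkers", "Scottish", "beer", p1, p2, p3) can assign
    # "Scottish beer drinkers" more probability than "Scottish beer eaters"
    # when the lower-order terms (e.g. P2(drinkers|beer)) are larger, even
    # though both trigrams are unseen.

In practice the lambdas, like α above, would be chosen to minimize perplexity on the held-out set.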
Back-Off
• Trust the highest order language model that contains the N-gram; otherwise “back off” to a lower order model.
• Basic idea:
  – discount the probabilities slightly in the higher order model
  – spread the extra mass between lower order N-grams
• But the maths gets complicated to make probabilities sum to 1.

Back-Off Equation
PBO(wi|wi−N+1, ..., wi−1) =
  { P*(wi|wi−N+1, ..., wi−1)                               if count(wi−N+1, ..., wi) > 0
  { α(wi−N+1, ..., wi−1) · PBO(wi|wi−N+2, ..., wi−1)       else
• Requires
  – adjusted prediction model P*(wi|wi−N+1, ..., wi−1)
  – backoff weights α(w1, ..., wN−1)
• See textbook for details/explanation.
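A structural sketch of the back-off equation above (my own illustration: it assumes the discounted probabilities P* and back-off weights α have already been estimated and stored in dictionaries p_star and alpha, whose estimation is the complicated part the slide defers to the textbook):

    def p_backoff(word, history, counts, p_star, alpha):
        # counts: raw n-gram counts, keyed by tuples like ("Scottish", "beer", "drinkers")
        # p_star: discounted probabilities, keyed by (history, word)
        # alpha:  back-off weights, keyed by history
        if counts.get(history + (word,), 0) > 0:
            return p_star[(history, word)]
        if not history:                       # no shorter history left to back off to
            return p_star.get(((), word), 0.0)
        # Otherwise back off to the shorter history, scaled by the back-off weight.
        return alpha[history] * p_backoff(word, history[1:], counts, p_star, alpha)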
Video 5: Kneser-Ney smoothing

Do our smoothing methods work here?
Example from MacKay and Bauman Peto (1994):
Imagine, you see, that the language, you see, has, you see, a frequently occurring couplet, ‘you see’, you see, in which the second word of the couplet, ‘see’, follows the first word, ‘you’, with very high probability, you see. Then the marginal statistics, you see, are going to become hugely dominated, you see, by the words ‘you’ and ‘see’, with equal frequency, you see.
• P(see) and P(you) both high, but see nearly always follows you.
• So P(see|novel) should be much lower than P(you|novel).
Diversity of histories matters!
• A real example: the word York
  – fairly frequent word in Europarl corpus, occurs 477 times
  – as frequent as foods, indicates and providers
  → in unigram language model: a respectable probability
• However, it almost always directly follows New (473 times)
• So, in unseen bigram contexts, York should have low probability
  – lower than predicted by unigram model used in interpolation or backoff.

Kneser-Ney Smoothing
• Kneser-Ney smoothing takes diversity of histories into account
• Count of distinct histories for a word:
  N1+(•wi) = |{wi−1 : c(wi−1, wi) > 0}|
• Recall: maximum likelihood est. of unigram language model:
  PML(wi) = C(wi) / Σ_{wi} C(wi)
• In KN smoothing, replace raw counts with count of histories:
  PKN(wi) = N1+(•wi) / Σ_{wi} N1+(•wi)

Kneser-Ney in practice
• Original version used backoff, later “modified Kneser-Ney” introduced using interpolation (Chen and Goodman, 1998).
• Fairly complex equations, but until recently the best smoothing method for word n-grams.
• See Chen and Goodman for extensive comparisons of KN and other smoothing methods.
• KN (and other methods) implemented in language modelling toolkits like SRILM (classic), KenLM (good for really big models), OpenGrm Ngram library (uses finite state transducers), etc.
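To make the continuation-count idea concrete, here is a small sketch of my own (not from the slides) of the PKN unigram above; a full (modified) Kneser-Ney model involves considerably more machinery, see Chen and Goodman (1998).

    from collections import defaultdict

    def kn_unigram(bigram_counts):
        # N1+(•w) = number of distinct histories w' with c(w', w) > 0
        histories = defaultdict(set)
        for (w1, w2), c in bigram_counts.items():
            if c > 0:
                histories[w2].add(w1)
        n1plus = {w: len(hs) for w, hs in histories.items()}
        total = sum(n1plus.values())
        # P_KN(w) = N1+(•w) / sum over w' of N1+(•w')
        return {w: n / total for w, n in n1plus.items()}

    # A word like "York" may have a high raw count but, if it almost always
    # follows "New", only one or two distinct histories -- so its P_KN is small.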

Bayesian interpretations of smoothing
• We contrasted MLE (which has a mathematical justification, but practical problems) with smoothing (heuristic approaches with better practical performance).
• It turns out that many smoothing methods are mathematically equivalent to forms of Bayesian estimation (uses priors and uncertainty in parameters). So these have a mathematical justification too!
  – Add-α smoothing: Dirichlet prior
  – Kneser-Ney smoothing: Pitman-Yor prior
See MacKay and Bauman Peto (1994); Goldwater (2006, pp. 13-17); Goldwater et al. (2006); Teh (2006).
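As a concrete instance of this equivalence (a standard result, sketched here rather than taken from the slides): if the words following a given history wi−1 are modelled as multinomial draws with a symmetric Dirichlet(α) prior over the v-word vocabulary, the posterior predictive probability of the next word is

  P(wi | wi−1, data) = (C(wi−1, wi) + α) / (C(wi−1) + αv)

which is exactly the add-α estimate P+α(wi|wi−1).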
Video 6: Moving beyond N-grams

Are we done with smoothing yet?
We’ve considered methods that predict rare/unseen words using
• Uniform probabilities (add-α, Good-Turing)
• Probabilities from lower-order n-grams (interpolation, backoff)
• Probability of appearing in new contexts (Kneser-Ney)
What’s left?

Word similarity
• Two words with C(w1) ≫ C(w2)
  – salmon
  – swordfish
• Can P(salmon|caught two) tell us something about P(swordfish|caught two)?
• n-gram models: no.

Word similarity in language modeling
• Early version: class-based language models (J&M 4.9.2)
  – Define classes c of words, by hand or automatically
  – PCL(wi|wi−1) = P(ci|ci−1) P(wi|ci) (an HMM)
• Recent version: distributed language models
  – Current models have better perplexity than MKN.
  – Ongoing research to make them more efficient.
  – Examples: Log Bilinear LM (Mnih and Hinton, 2007), Recurrent Neural Network LM (Mikolov et al., 2010), LSTM LMs, etc.
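A minimal sketch of the class-based factorization above (my own illustration; the class assignments and the two component distributions are assumed to be given, e.g. estimated by relative frequency):

    def p_class_bigram(w2, w1, word2class, p_class_trans, p_word_given_class):
        # P_CL(w2|w1) = P(c2|c1) * P(w2|c2), where ci is the class of wi
        c1, c2 = word2class[w1], word2class[w2]
        return p_class_trans[(c1, c2)] * p_word_given_class[(w2, c2)]

    # With salmon and swordfish in the same class, an unseen bigram such as
    # "caught swordfish" can inherit probability from the class-level statistics.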
Distributed word representations
(also called word embeddings)
• Each word represented as high-dimensional vector (50-500 dims)
  E.g., salmon is [0.1, 2.3, 0.6, −4.7, . . .]
  E.g., swordfish is [0.3, 2.2, 1.2, −3.6, . . .]
• Similar words represented by similar vectors
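As a side illustration of what “similar vectors” means (my own sketch, using the example vectors from the slide truncated to four dimensions):

    import math

    def cosine(u, v):
        # cosine similarity: dot(u, v) / (||u|| * ||v||); near 1 for similar vectors
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    salmon    = [0.1, 2.3, 0.6, -4.7]
    swordfish = [0.3, 2.2, 1.2, -3.6]

    print(cosine(salmon, swordfish))   # ~0.98: the two vectors point in nearly the same direction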
Training the model
• Goal: learn word representations (embeddings) such that words that behave similarly are close together in high-dimensional space.
• 2-dimensional example: (figure not reproduced in this text version)
We’ll come back to this later in the course…

Training the model
• N-gram LM: collect counts, maybe optimize some parameters
  – (Relatively) quick, especially these days (minutes-hours)
• Distributed LM: learn the representation for each word
  – Solved with machine learning methods (e.g., neural networks)
  – Can be extremely time-consuming (hours-days)
  – Learned embeddings seem to encode both semantic and syntactic similarity (using different dimensions) (Mikolov et al., 2013).

Using the model
Want to compute P(w1 . . . wn) for a new sequence.
• N-gram LM: again, relatively quick
• Distributed LM: can be slow, but varies; often LMs not used in isolation anyway (instead use end-to-end neural model, which does some of the same work).
  – An active area of research for distributed LMs

Other Topics in Language Modeling
Many active research areas in language modeling:
• Factored/morpheme-based/character language models: back off to word stems, part-of-speech tags, and/or sub-word units
• Domain adaptation: when only a small domain-specific corpus is available
• Time efficiency and space efficiency are both key issues (esp on mobile devices!)

Summary
• We can estimate sentence probabilities by breaking down the problem, e.g. by instead estimating N-gram probabilities.
• Longer N-grams capture more linguistic information, but are sparser.
• Different smoothing methods capture different intuitions about how to estimate probabilities for rare/unseen events.
• Still lots of work on how to improve these models.

References
Chen, S. F. and Goodman, J. (1998). An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University.
Goldwater, S. (2006). Nonparametric Bayesian Models of Lexical Acquisition. PhD thesis, Brown University.
Goldwater, S., Griffiths, T. L., and Johnson, M. (2006). Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, pages 459–466, Cambridge, MA. MIT Press.
MacKay, D. and Bauman Peto, L. (1994). A hierarchical Dirichlet language model. Natural Language Engineering, 1(1).
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In INTERSPEECH, pages 1045–1048.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.
Mnih, A. and Hinton, G. (2007). Three new graphical models for statistical language modelling. In Proceedings of the 24th international conference on Machine learning, pages 641–648. ACM.
Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, Sydney, Australia.