
COMP90042 Natural Language Processing Workshop Week 3
Haonan Li – haonan.li@unimelb.edu.au 16 March 2020

Outline
• Text Classification
• N-gram Language Model
• Smoothing

Text Classification Discussion
1. What is text classification? Give some examples.
2. Why is text classification generally a difficult problem? What are some hurdles that need to be overcome?
3. Consider some (supervised) text classification problem, and discuss whether the following (supervised) machine learning models would be suitable:
• k-Nearest Neighbour using Euclidean distance
• k-Nearest Neighbour using Cosine similarity
• Decision Trees using Information Gain
• Naive Bayes
• Logistic Regression
• Support Vector Machines

What is Text Classification?
What is text classification? Give some examples.
• sentiment analysis
• author identification
• automatic fact-checking
• etc.

Why is Text Classification Difficult?
Why is text classification generally a difficult problem?
• document representation — how do we identify features of the document which help us to distinguish between the various classes?
What are some hurdles that need to be overcome?
• Principal source of features: the presence of tokens (words), known as a bag–of–words model.
• many words don’t tell you anything about the classes we want to predict, so feature selection is important.
• single words are often inadequate at modelling the meaningful information in the document
• Multi-word features (e.g. bi-grams, tri-grams) suffer from a sparse data problem.

Machine Learning models for Text Classification
Consider some (supervised) text classification problem, and discuss whether these (supervised) machine learning models would be suitable:
• For concreteness, consider a generic genre identification problem using an entire bag–of–words model (similar to the notebook).

k-Nearest Neighbour
k-Nearest Neighbour using Euclidean distance
• Often this is a bad idea, because Euclidean distance tends to classify documents based upon their length — which is usually not a distinguishing characteristic for classification problems.
k-Nearest Neighbour using Cosine similarity
• Usually better than the previous, because we’re looking at the distribution of terms. However, k-NN suffers from high-dimensionality problems, which means that our feature set based upon the presence of (all) words usually isn’t suitable for this model.
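As a rough illustration of the Euclidean-vs-cosine point (a minimal Python sketch, not from the workshop notebook; the toy documents are invented), a document and a "doubled" copy of it look far apart under Euclidean distance but identical under cosine similarity:

    import math
    from collections import Counter

    def vectorise(doc, vocab):
        """Bag-of-words count vector for a document over a fixed vocabulary."""
        counts = Counter(doc.split())
        return [counts[w] for w in vocab]

    doc_a = "wood chuck chuck wood"
    doc_b = doc_a + " " + doc_a          # same content, twice as long
    vocab = sorted(set(doc_a.split()) | set(doc_b.split()))
    u, v = vectorise(doc_a, vocab), vectorise(doc_b, vocab)

    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    cosine = (sum(x * y for x, y in zip(u, v))
              / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))))

    print(euclidean)   # ~2.83: document length alone makes them look different
    print(cosine)      # 1.0: the term distributions are identical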

Decision Trees using Information Gain
• Decision Trees can be useful for finding meaningful features, however, the feature set is very large, and we might find spurious correlations. More fundamentally, Information Gain is a poor choice because it tends to prefer rare features; in this case, this would correspond to features that appear only in a handful of documents.
• Random Forest?

Naive Bayes
• At first glance, a poor choice, because the assumption that the features are conditionally independent given the class is highly untrue for text. Can you think of an example?
• Also sensitive to a large feature set, in that we are multiplying together many (small) probabilities, which leads to biased interpretations based upon otherwise uninformative features.
• Surprisingly somewhat useful anyway!

Logistic Regression
• Useful, because it relaxes the conditional independence requirement of Naive Bayes.
• Since it has an implicit feature weighting step, it can handle large numbers of mostly useless features, as we have in this problem.

Support Vector Machines
• Linear kernels are often quite effective at modelling some combination of features that are useful (together) for characterising the classes.
• How about multi-class?
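A minimal scikit-learn sketch of a linear-kernel SVM text classifier (the tiny training set and genre labels are invented purely for illustration). LinearSVC handles the multi-class case with a one-vs-rest scheme out of the box:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy documents and genre labels, just to show the shape of the pipeline.
    docs = ["the striker scored a late goal",
            "parliament passed the budget bill",
            "the band released a new album"]
    labels = ["sport", "politics", "music"]

    # Bag-of-words (TF-IDF weighted) features feeding a linear SVM;
    # with three labels, LinearSVC trains one-vs-rest classifiers internally.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(docs, labels)

    print(model.predict(["a new single from the band"]))   # likely 'music'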

Language Model
Tasks:
• Speech Recognition
• Spell Checking
• Machine Translation
• etc.

Probabilistic Language Model
Goal: get a probability for an arbitrary sequence of n words:
P(w1, w2, …, wn)
First step: apply the chain rule to convert the joint probability into conditional ones:
P(w1, w2, …, wn) = P(w1)P(w2|w1)P(w3|w1, w2)…P(wn|w1, …, wn−1)

N-gram Model Discussion
For the following “corpus” of two documents:
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
(a). Which of the following sentences, a wood could chuck or wood would a chuck, is more probable, according to:
I An unsmoothed uni-gram language model?
II A uni-gram language model, with Laplacian (“add-one”) smoothing?
III An unsmoothed bi-gram language model?
IV A bi-gram language model, with Laplacian smoothing?
V An unsmoothed tri-gram language model?
VI A tri-gram language model, with Laplacian smoothing?

N-gram Model Discussion
Uni-gram frequencies, 34 tokens in total (including </s>):
how = 1, much = 1, wood = 8, would = 4, a = 4, chuck = 9, if = 2, the = 1, he = 1, could = 1, </s> = 2
Some bi-gram frequencies (with sentence terminals <s> and </s>):
<s> a = 1, a wood = 4, wood could = 0, would a = 1, wood would = 1, <s> wood = 0, could chuck = 1, a chuck = 0, chuck </s> = 0

An unsmoothed uni-gram language model
• Simply based on the counts of words in the corpus. For example, out of the 34 tokens (including </s>) in the corpus, there were 4 instances of a, so P(a) = 4/34.
• Probability of a sentence: multiply the probabilities of the individual tokens:
P(A) = P(a)P(wood)P(could)P(chuck)P(</s>) = (4/34)(8/34)(1/34)(9/34)(2/34) ≈ 1.27×10⁻⁵
P(B) = P(wood)P(would)P(a)P(chuck)P(</s>) = (8/34)(4/34)(4/34)(9/34)(2/34) ≈ 5.07×10⁻⁵
• Clearly sentence B has the greater likelihood, according to this model.
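The same calculation can be checked in a few lines of Python (a minimal sketch, not from the workshop notebook):

    from collections import Counter

    corpus = [
        "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
        "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
    ]

    # Unigram counts, treating the end-of-sentence marker </s> as an ordinary token.
    counts = Counter(w for sent in corpus for w in sent.split() + ["</s>"])
    total = sum(counts.values())            # 34 tokens

    def unigram_prob(sentence):
        """Unsmoothed uni-gram probability of a sentence (including </s>)."""
        p = 1.0
        for w in sentence.split() + ["</s>"]:
            p *= counts[w] / total
        return p

    print(unigram_prob("a wood could chuck"))   # sentence A: ~1.27e-05
    print(unigram_prob("wood would a chuck"))   # sentence B: ~5.07e-05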

A uni-gram language model, with Laplacian (“add-one”) smoothing
• For each probability, we add 1 to the numerator and the size of the vocabulary, which is 11, to the denominator. For example, P_L(a) = (4+1)/(34+11) = 5/45.
• Everything else proceeds the same way:
P_L(A) = P_L(a)P_L(wood)P_L(could)P_L(chuck)P_L(</s>) = (5/45)(9/45)(2/45)(10/45)(3/45) ≈ 1.46×10⁻⁵
P_L(B) = P_L(wood)P_L(would)P_L(a)P_L(chuck)P_L(</s>) = (9/45)(5/45)(5/45)(10/45)(3/45) ≈ 3.66×10⁻⁵
• Sentence B is still more likely.
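Add-one smoothing is a one-line change to the sketch above (repeated here so it runs on its own):

    from collections import Counter

    corpus = [
        "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
        "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
    ]
    counts = Counter(w for sent in corpus for w in sent.split() + ["</s>"])
    total, V = sum(counts.values()), len(counts)    # 34 tokens, 11 types (including </s>)

    def laplace_unigram_prob(sentence):
        """Add-one smoothed uni-gram probability: +1 on each count, +V on the total."""
        p = 1.0
        for w in sentence.split() + ["</s>"]:
            p *= (counts[w] + 1) / (total + V)
        return p

    print(laplace_unigram_prob("a wood could chuck"))   # sentence A: ~1.46e-05
    print(laplace_unigram_prob("wood would a chuck"))   # sentence B: ~3.66e-05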

An unsmoothed bi-gram language model
• This time, we’re interested in the counts of pairs of word tokens.
• We include sentence terminals, so that the first probability in sentence A is P(a|<s>) = 1/2, because one of the two sentences in the corpus starts with a. Now, we can substitute:
P(A) = P(a|<s>)P(wood|a)P(could|wood)P(chuck|could)P(</s>|chuck) = (1/2)(4/4)(0/8)(1/1)(0/9) = 0
P(B) = P(wood|<s>)P(would|wood)P(a|would)P(chuck|a)P(</s>|chuck) = (0/2)(1/8)(1/4)(0/4)(0/9) = 0
• Because there is a zero–probability element in both of these calculations, they can’t be nicely compared. How can we solve this?

A bi-gram language model, with Laplacian smoothing
• We apply the same idea as for uni-gram add–one smoothing. The vocabulary size is 11.
P_L(A) = P_L(a|<s>)P_L(wood|a)P_L(could|wood)P_L(chuck|could)P_L(</s>|chuck) = (2/13)(5/15)(1/19)(2/12)(1/20) ≈ 2.25×10⁻⁵
P_L(B) = P_L(wood|<s>)P_L(would|wood)P_L(a|would)P_L(chuck|a)P_L(</s>|chuck) = (1/13)(2/19)(2/15)(1/15)(1/20) ≈ 3.60×10⁻⁶
• This time, sentence A has the greater likelihood, mostly because of the common bi-gram a wood.
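Both the unsmoothed and the add-one bi-gram numbers above can be reproduced with one small function; k = 0 gives the unsmoothed model and k = 1 gives Laplacian smoothing (a minimal sketch, using V = 11 as on the slides):

    from collections import Counter

    corpus = [
        "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
        "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
    ]

    # Bi-gram and context counts over <s> ... </s> padded sentences.
    bigrams, contexts = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    V = 11   # vocabulary size used on the slides (10 word types plus </s>)

    def bigram_prob(sentence, k=0):
        """Sentence probability under an add-k bi-gram model (k=0: unsmoothed, k=1: Laplacian)."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, cur in zip(tokens, tokens[1:]):
            p *= (bigrams[(prev, cur)] + k) / (contexts[prev] + k * V)
        return p

    print(bigram_prob("a wood could chuck"), bigram_prob("wood would a chuck"))            # 0.0  0.0
    print(bigram_prob("a wood could chuck", k=1), bigram_prob("wood would a chuck", k=1))  # ~2.25e-05  ~3.60e-06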

An unsmoothed tri-gram language model
• Same idea, longer contexts. Note that we now need two sentence terminals.
P(A) = P(a|<s> <s>)P(wood|<s> a)…P(</s>|could chuck) = (1/2)(1/1)(0/4)(0/0)(0/1) = ?
P(B) = P(wood|<s> <s>)P(would|<s> wood)…P(</s>|a chuck) = (0/2)(0/0)(1/1)(0/1)(0/0) = ?
• Given that the unsmoothed bi-gram probabilities were zero, that also means that the unsmoothed tri-gram probabilities will be zero. Why?
• In this case, they aren’t even well–defined, because of the 0/0 terms, but we wouldn’t be able to meaningfully compare these numbers in any case.

A tri-gram language model, with Laplacian smoothing
• The vocabulary size is 11. Everything proceeds the same way:
P_L(A) = P_L(a|<s> <s>)P_L(wood|<s> a)…P_L(</s>|could chuck) = (2/13)(2/12)(1/15)(1/11)(1/12) ≈ 1.30×10⁻⁵
P_L(B) = P_L(wood|<s> <s>)P_L(would|<s> wood)…P_L(</s>|a chuck) = (1/13)(1/11)(2/12)(1/12)(1/11) ≈ 8.83×10⁻⁶
• Notice that the problem of unseen contexts is now solved (they are just 1/11).
• Sentence A has a slightly greater likelihood here, mostly because of the a wood at the start of one of the sentences.
• The numbers are getting very small; what can we do about that?
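One standard answer to the last question is to work with log probabilities, summing logs instead of multiplying raw probabilities, which avoids numerical underflow. A tiny sketch using the smoothed tri-gram factors for sentence A from above:

    import math

    # Smoothed tri-gram factors for sentence A, taken from the slide above.
    factors_A = [2/13, 2/12, 1/15, 1/11, 1/12]
    log_p_A = sum(math.log(p) for p in factors_A)

    print(log_p_A)              # ~ -11.25, a comfortable magnitude to store and compare
    print(math.exp(log_p_A))    # ~ 1.30e-05, the same value as before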

Other Smoothing Strategies
• Add-k smoothing
• Absolute discounting
• Backoff
• Kneser-Ney smoothing

Continuation Probability
(b). Based on the “corpus”, the vocabulary = {a, chuck, could, he, how, if, much, the, wood, would, </s>}, and the continuation counts of the following words are given as follows:
a = 2, could = 1, he = 1, how = 0, if = 1, much = 1, the = 1, would = 2
What is the continuation probability of chuck and wood?

Continuation Probability
Continuation counts:
a = 2, could = 1, he = 1, how = 0, if = 1, much = 1, the = 1, would = 2
What is the continuation probability of chuck and wood?
• unique context words before chuck: {wood, would, could, chuck}
• unique context words before wood: {the, much, a, chuck}
• Writing #w for the continuation count of w, and summing the continuation counts over the whole vocabulary (including </s>, with #chuck = #wood = 4 and #</s> = 1):
P_cont(chuck) = #chuck / (#a + #could + … + #chuck + #would) = 4 / (2+1+1+0+1+1+1+2+1+4+4) = 4/18 ≈ 0.22
P_cont(wood) = #wood / (#a + #could + … + #chuck + #would) = 4 / (2+1+1+0+1+1+1+2+1+4+4) = 4/18 ≈ 0.22
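The continuation counts (and hence these probabilities) can be computed directly from the corpus (a minimal sketch; as on the slides, sentence-start positions are not counted as contexts):

    from collections import defaultdict

    corpus = [
        "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
        "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
    ]

    # Continuation count of w = number of distinct word types seen immediately before w.
    preceders = defaultdict(set)
    for sent in corpus:
        tokens = sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            preceders[cur].add(prev)

    cont = {w: len(ps) for w, ps in preceders.items()}   # e.g. cont['chuck'] == cont['wood'] == 4
    total_cont = sum(cont.values())                      # 18

    print(cont["chuck"] / total_cont, cont["wood"] / total_cont)   # both 4/18 ~ 0.22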

Back-off and Interpolation
What does back–off mean, in the context of smoothing a language model? What does interpolation refer to?
• Back–off is a smoothing strategy, where we incorporate lower–order n-gram models (in particular, for unseen contexts). For example, if we have never seen some tri-gram from our sentence, we can instead consider the bigram probability.
• Interpolation is a similar idea, but instead of only “falling back” to lower–order n-gram models for unseen events, we can instead consider every probability as a linear combination of all of the relevant n-gram models, where the weights are once more chosen to ensure that the probabilities of all events, given some context, sum to 1.
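A minimal sketch of interpolation (the lambda weights below are invented for illustration; in practice they are tuned on held-out data and must sum to 1). The probabilities plugged in come from the corpus above: the tri-gram a wood could and the bi-gram wood could are unseen, but the uni-gram P(could) = 1/34 still contributes mass:

    def interpolated_prob(tri_p, bi_p, uni_p, lambdas=(0.5, 0.3, 0.2)):
        """Linear interpolation of tri-, bi- and uni-gram estimates of the same event."""
        l3, l2, l1 = lambdas
        return l3 * tri_p + l2 * bi_p + l1 * uni_p

    # P(could | a wood): tri-gram 0/4, bi-gram P(could | wood) 0/8, uni-gram P(could) 1/34.
    print(interpolated_prob(0.0, 0.0, 1/34))   # ~0.0059, non-zero despite the unseen tri-gram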

Questions ⌣