School of Computing and Information Systems
The University of Melbourne
COMP90042 NATURAL LANGUAGE PROCESSING (Semester 1, 2021)
Sample solutions: Week 3
Discussion
1. What is text classification? Give some examples.
• Text classification is the task of automatically assigning a document to one of a set of predefined categories. Numerous examples from the lectures: sentiment analysis, author identification, automatic fact-checking, etc.
(a) Why is text classification generally a difficult problem? What are some hurdles that need to be overcome?
• The main issue is in terms of document representation — how do we identify features of the document which help us to distinguish between the various classes?
• The principal source of features is based upon the presence of tokens (words) in the document (known as a bag-of-words model). However, many words don’t tell you anything about the classes we want to predict, hence feature selection is often important. On the other hand, single words are often inadequate at modelling the meaningful information in the document, but multi-word features (e.g. bi-grams, tri-grams) suffer from a sparse data problem.
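To make the sparse-data point concrete, here is a small sketch (assuming scikit-learn is available; the example documents are invented purely for illustration) showing how the feature space grows once bi-grams are added to the bag-of-words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only.
docs = [
    "the cast was excellent but the plot made no sense",
    "an excellent plot and a wonderful cast",
    "the sequel made no sense at all",
]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)   # single-word features
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)     # uni-grams + bi-grams

print(len(unigrams.vocabulary_))   # number of distinct word features
print(len(uni_bi.vocabulary_))     # many more features, most of them very rare
```

Most of the added bi-gram features occur only once in this tiny collection, which is exactly the sparsity problem described above.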
(b) Consider some (supervised) text classification problem, and discuss whether the following (supervised) machine learning models would be suitable:
• The answers will vary depending on the nature of the problem, the feature set, the class set, and so on. One possible solution, for a generic genre identification problem using an entire bag–of–words model (similar to the notebook) is as follows:
i. k-Nearest Neighbour using Euclidean distance
• Often this is a bad idea, because Euclidean distance tends to classify documents based upon their length — which is usually not a distinguishing characteristic for classification problems.
ii. k-Nearest Neighbour using Cosine similarity
• Usually better than the previous, because we’re looking at the distribution of terms. However, k-NN suffers from high-dimensionality problems, which means that our feature set based upon the presence of (all) words usually isn’t suitable for this model.
iii. Decision Trees using Information Gain
• Decision Trees can be useful for finding meaningful features; however, the feature set is very large, and we might find spurious correlations. More fundamentally, Information Gain is a poor choice because it tends to prefer rare features; in this case, this would correspond to features that appear only in a handful of documents.
iv. Naive Bayes
• At first glance, a poor choice because the assumption that the features are conditionally independent given the class is highly untrue.
• Also sensitive to a large feature set, in that we are multiplying together many (small) probabilities, which leads to biased interpretations based upon otherwise uninformative features.
• Surprisingly somewhat useful anyway!
v. Logistic Regression
• Useful, because it relaxes the conditional independence requirement of Naive Bayes.
• Since it has an implicit feature weighting step, it can handle large numbers of mostly useless features, as we have in this problem.
vi. Support Vector Machines
• Linear kernels often quite effective at modelling some combination of features that are useful (together) for characterising the classes.
• Need substantial re-framing for problems with multiple classes (instead designed for two-class (binary) problems); most text classification tends to be multi-class.
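As a rough, hands-on way of comparing these models, here is a minimal sketch assuming scikit-learn; the two-“genre” documents and labels are toy data invented for illustration (they are not the notebook’s data), so the scores are only indicative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical mini-corpus: 0 = "news", 1 = "fiction".
docs = [
    "the minister announced the new budget policy today",
    "shares fell sharply after the quarterly report",
    "the council voted to approve the housing plan",
    "parliament will debate the tax reform next week",
    "the dragon slept beneath the ruined castle walls",
    "she whispered a spell and the forest fell silent",
    "the knight rode north through the endless snow",
    "an old wizard guarded the gates of the city",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = CountVectorizer().fit_transform(docs)   # bag-of-words counts

models = {
    "k-NN (Euclidean)": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "k-NN (cosine)": KNeighborsClassifier(n_neighbors=3, metric="cosine", algorithm="brute"),
    "Decision Tree (IG)": DecisionTreeClassifier(criterion="entropy"),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, labels, cv=4)
    print(f"{name:20s} mean accuracy = {scores.mean():.2f}")
```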
2. For the following “corpus” of two documents:

1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
I’m going to show the frequencies of the 11 different word uni-grams, as it will make life a little easier in a moment:
a  chuck  could  he  how  if  much  the  wood  would  </s>  Total
4  9      1      1   1    2   1     1    8     4      2     34
(a) Which of the following sentences: A: a wood could chuck; B: wood would a chuck; is more probable, according to:
i. An unsmoothed uni-gram language model?
• An unsmoothed uni-gram language model is simply based on the counts of words in the corpus. For example, out of the 34 tokens (including </s>) in the corpus, there were 4 instances of a, so P(a) = 4/34.
• To find the probability of a sentence using this model, we simply multiply the probabilities of the individual tokens:
P(A) = P(a) P(wood) P(could) P(chuck) P(</s>)
     = 4/34 × 8/34 × 1/34 × 9/34 × 2/34 ≈ 1.27 × 10^-5
P(B) = P(wood) P(would) P(a) P(chuck) P(</s>)
     = 8/34 × 4/34 × 4/34 × 9/34 × 2/34 ≈ 5.07 × 10^-5
• Clearly sentence B has the greater likelihood, according to this model.
ii. A uni-gram language model, with Laplacian (“add-one”) smoothing?
• Recall that in add-one smoothing, for each probability, we add 1 to the numerator and the size of the vocabulary, which is 11, to the denominator. For example, PL(a) = (4+1)/(34+11) = 5/45.
• Everything else proceeds the same way:
PL(A) = PL(a) PL(wood) PL(could) PL(chuck) PL(</s>)
      = 5/45 × 9/45 × 2/45 × 10/45 × 3/45 ≈ 1.46 × 10^-5
PL(B) = PL(wood) PL(would) PL(a) PL(chuck) PL(</s>)
      = 9/45 × 5/45 × 5/45 × 10/45 × 3/45 ≈ 3.66 × 10^-5
• Notice that the probability of sentence A is larger using this model, because the probabilities of the unlikely could and </s> have increased. (The other probabilities have decreased.) Sentence B is still more likely, however.
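The uni-gram calculations above can be checked with a short plain-Python sketch (the corpus is the one from the question; the function names are my own):

```python
from collections import Counter

corpus = [
    "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
    "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
]

counts = Counter()
for sent in corpus:
    counts.update(sent.split() + ["</s>"])     # include the sentence terminal

N = sum(counts.values())                       # 34 tokens
V = len(counts)                                # 11 word types (including </s>)

def unigram_prob(sentence, add_one=False):
    """Multiply the (optionally add-one smoothed) uni-gram probabilities of each token."""
    prob = 1.0
    for w in sentence.split() + ["</s>"]:
        if add_one:
            prob *= (counts[w] + 1) / (N + V)
        else:
            prob *= counts[w] / N
    return prob

print(unigram_prob("a wood could chuck"))                  # ≈ 1.27e-05
print(unigram_prob("wood would a chuck"))                  # ≈ 5.07e-05
print(unigram_prob("a wood could chuck", add_one=True))    # ≈ 1.46e-05
print(unigram_prob("wood would a chuck", add_one=True))    # ≈ 3.66e-05
```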
iii. An unsmoothed bi-gram language model?
• This time, we’re interested in the counts of pairs of word tokens. For example, the probability of the bi-gram wood would is based on the count of that sequence of tokens, divided by the count of wood: 1/8 (because only a single wood is followed by would).
• We include sentence terminals, so that the first probability in sentence A is P(a|<s>) = 1/2 — because one of the two sentences in the corpus starts with a. We also need to predict P(</s>|chuck) = 0/9 — because none of the 9 chucks are followed by the end of the sentence.
• Now, we can substitute:
P(A) = P(a|<s>) P(wood|a) P(could|wood) P(chuck|could) P(</s>|chuck)
     = 1/2 × 4/4 × 0/8 × 1/1 × 0/9 = 0
P(B) = P(wood|<s>) P(would|wood) P(a|would) P(chuck|a) P(</s>|chuck)
     = 0/2 × 1/8 × 1/4 × 0/4 × 0/9 = 0
• Because there is a zero–probability element in both of these calcula- tions, they can’t be nicely compared, leading us to instead consider:
iv. A bi-gram language model, with Laplacian smoothing?
• We do the same idea as uni-gram add-one smoothing. The vocabulary size is 11.
PL(A) = PL(a|<s>) PL(wood|a) PL(could|wood) PL(chuck|could) PL(</s>|chuck)
      = 2/13 × 5/15 × 1/19 × 2/12 × 1/20 ≈ 2.25 × 10^-5
PL(B) = PL(wood|<s>) PL(would|wood) PL(a|would) PL(chuck|a) PL(</s>|chuck)
      = 1/13 × 2/19 × 2/15 × 1/15 × 1/20 ≈ 3.60 × 10^-6
• This time, sentence A has the greater likelihood, mostly because of the common bi-gram a wood.
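A more general sketch (again plain Python, with my own function names) reproduces the bi-gram figures above by padding each sentence with n-1 <s> tokens and one </s>; the same function, called with n=3, also reproduces the tri-gram cases in parts (v) and (vi) below:

```python
from collections import Counter

# The two "documents" from the question.
corpus = [
    "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
    "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
]

def train_ngrams(corpus, n):
    """Count n-grams and their (n-1)-gram contexts, padding each sentence
    with n-1 <s> tokens at the start and one </s> at the end."""
    ngram_counts, context_counts = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            ngram = tuple(tokens[i - n + 1 : i + 1])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts

def sentence_prob(sentence, corpus, n, add_one=False, V=11):
    """Probability of a sentence under an n-gram model, optionally add-one smoothed.
    Returns float('nan') when an unsmoothed context has count zero (the 0/0 case)."""
    ngram_counts, context_counts = train_ngrams(corpus, n)
    tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        ngram = tuple(tokens[i - n + 1 : i + 1])
        num, den = ngram_counts[ngram], context_counts[ngram[:-1]]
        if add_one:
            num, den = num + 1, den + V
        if den == 0:
            return float("nan")
        prob *= num / den
    return prob

for n in (2, 3):
    for add_one in (False, True):
        pA = sentence_prob("a wood could chuck", corpus, n, add_one)
        pB = sentence_prob("wood would a chuck", corpus, n, add_one)
        print(f"n={n} add_one={add_one}: P(A)={pA:.3g}, P(B)={pB:.3g}")
```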
v. An unsmoothed tri-gram language model?
• Same idea, longer contexts. Note that we now need two sentence terminals (<s> <s>) at the start.
P(A) = P(a|<s> <s>) P(wood|<s> a) · · · P(</s>|could chuck)
     = 1/2 × 1/1 × 0/4 × 0/0 × 0/1 = ?
P(B) = P(wood|<s> <s>) P(would|<s> wood) · · · P(</s>|a chuck)
     = 0/2 × 0/0 × 1/1 × 0/1 × 0/0 = ?
• Given that the unsmoothed bi-gram probabilities were zero, that also means that the unsmoothed tri-gram probabilities will be zero. (Exercise for the reader: why?)
• In this case, they aren’t even well-defined, because of the 0/0 terms, but we wouldn’t be able to meaningfully compare these numbers in any case.
vi. A tri-gram language model, with Laplacian smoothing?

• The vocabulary size is 11. Everything proceeds the same way:
PL(A) = PL(a|<s> <s>) PL(wood|<s> a) · · · PL(</s>|could chuck)
      = 2/13 × 2/12 × 1/15 × 1/11 × 1/12 ≈ 1.30 × 10^-5
PL(B) = PL(wood|<s> <s>) PL(would|<s> wood) · · · PL(</s>|a chuck)
      = 1/13 × 1/11 × 2/12 × 1/12 × 1/11 ≈ 8.83 × 10^-6
• Notice that the problem of unseen contexts is now solved (they are just 1/11).
• Sentence A has a slightly greater likelihood here, mostly because of the a at the start of one of the sentences (note that this will continue to be “seen” even at higher orders of n). You can also see that the numbers are getting very small, which is a good motivation for summing log probabilities (assuming no zeroes) rather than multiplying.
(b) Based on the “corpus”, the vocabulary = {a, chuck, could, he, how, if, much, the, wood, would, </s>} and the continuation counts of the following words are given as follows:
• a = 2
• could = 1
• he = 1
• how = 0
• if = 1
• much = 1
• the = 1
• would = 2
• </s> = 1
i. What is the continuation probability of chuck and wood?
• unique context words before chuck = {wood, would, could, chuck}
• unique context words before wood = {the, much, a, chuck}
Pcont(chuck) = #cont(chuck) / (#cont(a) + · · · + #cont(</s>) + #cont(chuck) + #cont(wood))
             = 4 / (2+1+1+0+1+1+1+2+1+4+4) = 4/18
Pcont(wood) = #cont(wood) / (#cont(a) + · · · + #cont(</s>) + #cont(chuck) + #cont(wood))
            = 4 / (2+1+1+0+1+1+1+2+1+4+4) = 4/18
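These continuation counts can be reproduced with a short plain-Python sketch (corpus from the question, function names my own). Note that, following the counts above, <s> is not treated as a context, which is why how has a continuation count of 0:

```python
from collections import defaultdict

corpus = [
    "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
    "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
]

# For each word type, collect the set of distinct word types that precede it.
# </s> counts as a word that can be continued into, but <s> is not used as a context,
# so "how" (which only ever starts a sentence) ends up with no preceders at all.
preceders = defaultdict(set)
for sent in corpus:
    tokens = sent.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        preceders[word].add(prev)

# Total number of distinct bi-gram types (the normaliser): 18 here.
total = sum(len(contexts) for contexts in preceders.values())

def p_continuation(word):
    return len(preceders[word]) / total

print({w: sorted(cs) for w, cs in preceders.items()})
print(p_continuation("chuck"), p_continuation("wood"))   # both 4/18 ≈ 0.22
```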
3. What does back–off mean, in the context of smoothing a language model? What does interpolation refer to?
• Back-off is a different smoothing strategy, where we incorporate lower-order n-gram models (in particular, for unseen contexts). For example, if we have never seen some tri-gram from our sentence, we can instead consider the bi-gram probability (at some penalty, to maintain the probability of all of the events, given some context, summing to 1). If we haven’t seen the bi-gram, we consider the uni-gram probability. If we’ve never seen the uni-gram (this token doesn’t appear in the corpus at all), then we need a so-called “0-gram” probability, which is a default for unseen tokens.
• Interpolation is a similar idea, but instead of only “falling back” to lower-order n-gram models for unseen events, we can instead consider every probability as a linear combination of all of the relevant n-gram models, where the weights are once more chosen to ensure that the probabilities of all events, given some context, sum to 1.
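As a concrete (toy) illustration of interpolation, here is a minimal plain-Python sketch on the question-2 corpus; the lambda weights are arbitrary placeholders rather than tuned values, and sentence-initial <s> contexts are not handled, to keep the sketch short:

```python
from collections import Counter

corpus = [
    "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
    "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
]

# Collect uni-gram, bi-gram and tri-gram counts (with <s> <s> padding and </s>).
uni, bi, tri = Counter(), Counter(), Counter()
for sent in corpus:
    toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
    uni.update(toks[2:])                     # ordinary tokens plus </s>
    bi.update(zip(toks[1:], toks[2:]))
    tri.update(zip(toks, toks[1:], toks[2:]))

def interpolated_prob(w, u, v, lambdas=(0.5, 0.3, 0.2)):
    """P(w | u v) as a weighted sum of tri-gram, bi-gram and uni-gram MLEs.
    The lambdas are illustrative only; in practice they are tuned on held-out
    data and must sum to 1 so the result is still a probability distribution.
    Sentence-initial contexts involving <s> are not handled in this sketch."""
    l3, l2, l1 = lambdas
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / sum(uni.values())
    return l3 * p3 + l2 * p2 + l1 * p1

# The tri-gram "wood could chuck" never occurs, but the interpolated estimate is
# still non-zero because the bi-gram and uni-gram components contribute.
print(interpolated_prob("chuck", "wood", "could"))
```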