
School of Computing and Information Systems
The University of Melbourne
COMP90042 NATURAL LANGUAGE PROCESSING (Semester 1, 2020)
Workshop exercises: Week 3
Discussion
1. What is text classification? Give some examples.
(a) Why is text classification generally a difficult problem? What are some hurdles that need to be overcome?
(b) Consider some (supervised) text classification problem, and discuss whether the following (supervised) machine learning models would be suitable:
i. k-Nearest Neighbour using Euclidean distance
ii. k-Nearest Neighbour using Cosine similarity
iii. Decision Trees using Information Gain
iv. Naive Bayes
v. Logistic Regression
vi. Support Vector Machines
2. For the following “corpus” of two documents:
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
(a) Which of the two sentences, a wood could chuck or wood would a chuck, is more probable according to each of the following? (A counting sketch for checking your answers follows the list.)
i. An unsmoothed uni-gram language model?
ii. A uni-gram language model, with Laplacian (“add-one”) smoothing?
iii. An unsmoothed bi-gram language model?
iv. A bi-gram language model, with Laplacian smoothing?
v. An unsmoothed tri-gram language model?
vi. A tri-gram language model, with Laplacian smoothing?
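To check your hand calculations, a minimal counting sketch in Python along the following lines may help. It ignores sentence-boundary symbols (<s>, </s>); whether and how to include them is a modelling choice, so your by-hand answers may legitimately differ.

    from collections import Counter

    # The two-document "corpus" from this question.
    corpus = [
        "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
        "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
    ]

    unigrams, bigrams = Counter(), Counter()
    for doc in corpus:
        words = doc.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    N = sum(unigrams.values())  # total number of tokens
    V = len(unigrams)           # vocabulary size (no boundary symbols here)

    def unigram_prob(sentence, add_one=False):
        # Product of unigram probabilities, optionally with add-one smoothing.
        p = 1.0
        for w in sentence.split():
            p *= (unigrams[w] + add_one) / (N + add_one * V)
        return p

    def bigram_prob(sentence, add_one=False):
        # Product of P(w_i | w_{i-1}), optionally with add-one smoothing.
        words = sentence.split()
        p = 1.0
        for prev, w in zip(words, words[1:]):
            p *= (bigrams[(prev, w)] + add_one) / (unigrams[prev] + add_one * V)
        return p

    for s in ["a wood could chuck", "wood would a chuck"]:
        print(s, unigram_prob(s), bigram_prob(s, add_one=True))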

(b) Based on the “corpus”, the vocabulary = {a, chuck, could, he, how, if, much, the, wood, would, </s>}, and the continuation counts of the following words are as follows:
• a = 2
• could = 1
• he = 1
• how = 0
• if = 1
• much = 1
• the = 1
• would = 2
• </s> = 1
i. What is the continuation probability of chuck and wood?
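One way to verify continuation counts programmatically is sketched below. It assumes the final vocabulary item above is the sentence-end token </s>, counted as an ordinary word type, and that no sentence-start token is added.

    from collections import defaultdict

    corpus = [
        "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
        "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
    ]

    # Continuation count of w = number of distinct word types preceding w,
    # i.e. the number of distinct bigram types that end in w.
    predecessors = defaultdict(set)
    bigram_types = set()
    for doc in corpus:
        words = doc.split() + ["</s>"]       # assume a sentence-end token
        for prev, w in zip(words, words[1:]):
            predecessors[w].add(prev)
            bigram_types.add((prev, w))

    for w in ["chuck", "wood"]:
        count = len(predecessors[w])
        print(w, count, count / len(bigram_types))  # continuation probability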
3. What does back-off mean, in the context of smoothing a language model? What does interpolation refer to?

Programming
1. In the 03-classification notebook, observe how different tokenisation regimes alter the text classification performance of the various classifiers on the given Reuters dataset problem.
(a) Alter the tokenisation strategy so that it incorporates other stages, for example, handling punctuation, or stemming/lemmatisation. (A sketch follows this question.)
(b) Does performance increase or decrease? Are some classifiers affected more than others? Why do you think that is?
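As a starting point, a replacement tokeniser might look like the sketch below. The tokenise name and the way the notebook consumes it are assumptions; adapt it to however 03-classification builds its features.

    import string
    import nltk                              # nltk.download("punkt") if needed
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def tokenise(text):
        # Lowercase, word-tokenise, drop punctuation tokens, then stem.
        tokens = nltk.word_tokenize(text.lower())
        tokens = [t for t in tokens if t not in string.punctuation]
        return [stemmer.stem(t) for t in tokens]

    print(tokenise("Revenues increased sharply, analysts said."))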
2. Using the IPython notebook 04-ngram, randomly generate some sentences based on the bi-gram models of the Gutenberg corpus and the Penn Treebank. What do you notice about these sentences? Are there any sentences which might get returned for both corpora? Why?
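If you would like to experiment outside the notebook, generation from an unsmoothed bigram model can be sketched roughly as follows, assuming the NLTK Gutenberg corpus is installed; the 04-ngram notebook's own model may differ in its details.

    import random
    import nltk
    from nltk.corpus import gutenberg        # nltk.download("gutenberg") if needed

    # Unsmoothed bigram model: a conditional frequency distribution over
    # "next word given previous word".
    words = [w.lower() for w in gutenberg.words()]
    cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

    def generate(start, length=15):
        out = [start]
        for _ in range(length):
            fd = cfd[out[-1]]
            if not fd:
                break
            # Sample the next word in proportion to its bigram count.
            out.append(random.choices(list(fd.keys()), weights=list(fd.values()))[0])
        return " ".join(out)

    print(generate("the"))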
3. Find a sentence with a higher probability than revenue increased last quarter., according to the models below (a sketch of interpolated bigram probabilities follows the list):
(a) The Gutenberg corpus, using bi-grams with Laplacian smoothing
(b) The Gutenberg corpus, using bi-grams with Interpolation
(c) The Penn Treebank corpus, using bi-grams with Laplacian smoothing
(d) The Penn Treebank corpus, using bi-grams with Interpolation
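A rough sketch of a linearly interpolated bigram probability is given below; the interpolation weight, the lowercasing, and the lack of unigram smoothing are placeholder choices, not the notebook's.

    from collections import Counter
    from nltk.corpus import gutenberg        # nltk.download("gutenberg") if needed

    words = [w.lower() for w in gutenberg.words()]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    N = len(words)

    def interp_prob(prev, w, lam=0.75):
        # lambda * P_bigram(w | prev) + (1 - lambda) * P_unigram(w).
        # Note: words unseen in the corpus still get probability 0 here; a real
        # interpolated model would also smooth the unigram distribution.
        p_uni = unigrams[w] / N
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    def sentence_prob(sentence, lam=0.75):
        tokens = sentence.lower().split()
        p = 1.0
        for prev, w in zip(tokens, tokens[1:]):
            p *= interp_prob(prev, w, lam)
        return p

    print(sentence_prob("revenue increased last quarter ."))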
4. Find the perplexity of the above (smoothed) language models for a number of sentences. Why does Interpolation generally have better perplexity?
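Perplexity itself is cheap to compute once a model has assigned a probability to each token; a minimal sketch with made-up probabilities:

    import math

    def perplexity(token_probs):
        # Perplexity = exp of the negative average log-probability per token.
        n = len(token_probs)
        return math.exp(-sum(math.log(p) for p in token_probs) / n)

    # Made-up per-token probabilities for the same sentence under two models:
    # add-one smoothing tends to flatten the distribution, giving each observed
    # token a smaller probability than interpolation does.
    print(perplexity([0.002, 0.001, 0.003, 0.002]))  # e.g. Laplacian-smoothed bigrams
    print(perplexity([0.02, 0.01, 0.05, 0.03]))      # e.g. interpolated bigrams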
Catch-up
• What is a language model? What is an n-gram language model? Why are language models important?
• What do uni-gram, bi-gram, tri-gram, etc. signify?

• Why is smoothing important?
• Why do we usually use log probabilities when finding the probability of a sentence according to an n-gram language model? (A tiny demonstration follows this list.)
• How might one evaluate a language model?
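As a tiny demonstration of why log probabilities matter, the numbers below are toy values, not drawn from any real model:

    import math
    from functools import reduce

    probs = [1e-5] * 100                      # toy per-token probabilities
    print(reduce(lambda a, b: a * b, probs))  # 0.0 -- the product underflows
    print(sum(math.log(p) for p in probs))    # about -1151.3 -- the log sum is fine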
Get ahead
• Adjust the 03-classification IPython notebook, so that the supervised machine learning model attempts to solve the multi-class problem, rather than the single-class problem (for acq). Does your assessment of the relative utility of the given classifiers change?
• Using the (short) “corpus” from Discussion Q2, generate all of the sentences of length 3. Choose an n-gram language model, and find the most probable sentence. What about length 4? 5? 6? What do you notice about these sentences? Does smoothing (or not) change this?
• Modify the IPython notebook so that it uses back-off smoothing. How does this change the probability of the given sentence? Why? Is the perplexity of this model better than Laplacian smoothing? Interpolation? Why?
• Perform the Programming experiments above using different corpora.