COMP90042 Natural Language Processing
The University of Melbourne
School of Computing and Information Systems
COMP90042 Natural Language Processing Mock Exam
June 2021
Exam duration: 165 minutes (15 minutes reading time + 120 minutes writing time + 30 minutes technical buffer time)
Length: This paper has 4 pages including this cover page.
Instructions to students:
• This exam is worth a total of 120 marks and counts for 40% of your final grade.
• You can read the question paper on a monitor, or print it.
• You are recommended to write your answers on blank A4 paper. Note that some answers require drawing diagrams or tables.
• You will need to scan or take a photo of your answers and upload them via Gradescope. Be sure to label the scans/photos with the question numbers (-10% penalty for each unlabelled question).
• Please answer all questions. Please write your student ID and question number on every page.
Format: Open Book
• While you are undertaking this assessment you are permitted to:
– make use of the textbooks, lecture slides and workshop materials.
• While you are undertaking this assessment you must not:
– make use of any messaging or communications technology;
– make use of any world-wide web or internet-based resources such as Wikipedia, Stack Overflow, or Google and other search services;
– act in any manner that could be regarded as providing assistance to another student who is undertaking this assessment, or will in the future be undertaking this assessment.
• The work you submit must be based on your own knowledge and skills, without assistance from any other person.
page 1 of 4 Continued overleaf . . .
COMP90042 Natural Language Processing
Semester 1, 2021
Total marks: 120 (40% of subject)
Students must attempt all questions
Section A: Short Answer Questions [42 marks]
Answer each of the questions in this section as briefly as possible. Expect to answer each sub-question in no more than a line or two.
Question 1: General Concepts [21 marks]
a) For higher order (n ≥ 2) N-gram language models, what is the key idea that differentiates more sophisticated “smoothing” techniques from stand-alone add-k smoothing? Mention one smoothing technique which instantiates this idea. [6 marks]
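As a refresher on the baseline being contrasted here, stand-alone add-k smoothing for a bigram model can be sketched as below; the tiny corpus and the choice k = 0.5 are illustrative only.

```python
from collections import Counter

def addk_bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, k=0.5):
    """P(w | w_prev) = (count(w_prev, w) + k) / (count(w_prev) + k * |V|)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

# Illustrative toy corpus
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

p = addk_bigram_prob("the", "cat", unigrams, bigrams, len(vocab))
```

Note that the smoothed probabilities for a fixed history still sum to 1 over the vocabulary; the question asks what idea more sophisticated techniques add on top of this.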
b) What is the vanishing gradient problem in recurrent neural networks? Explain one approach for tackling this. [6 marks]
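A numeric illustration of the effect in question, using an assumed toy scalar RNN (a single tanh unit with weight 0.5), may help make it concrete:

```python
import math

# Toy scalar RNN h_t = tanh(w * h_{t-1}): by the chain rule, the gradient of
# h_T with respect to h_0 is the product of the per-step derivatives
# w * (1 - h_t ** 2). Each factor here is at most |w| = 0.5, so the product
# shrinks geometrically with the number of time steps.
w, h = 0.5, 0.1
grad = 1.0
for _ in range(20):
    h = math.tanh(w * h)
    grad *= w * (1 - h * h)
# After 20 steps, grad is vanishingly small
```
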
c) What is discourse? Describe two common discourse applications. [9 marks]
Question 2: Machine Translation [15 marks]
a) Why is “machine translation” a difficult task? Explain with an example. [6 marks]
b) For “statistical machine translation”, what is the rationale for decomposing the model into a language model and a translation model? [3 marks]
c) What is the “information bottleneck” issue in “neural machine translation”? Explain one approach for tackling this. [6 marks]
Question 3: Topic Models [6 marks]
a) Compare “Latent Semantic Analysis” and “Latent Dirichlet Allocation”, identifying two important commonalities and two important differences. [6 marks]
Section B: Method Questions [45 marks]
In this section you are asked to demonstrate your conceptual understanding of the methods that we have studied in this subject.
Question 4: Text Classification [18 marks]
For this question, suppose you have a very large corpus of English texts written by people from 20+ different language backgrounds, and you want to build an automatic Native Language Identification system.
a) Name two types of “features” you think would be appropriate for this task and explain why. [6 marks]
b) Given the nature of the task and the features you have chosen, would you perform “lemmatisation” and/or “stop word removal” over your corpus? Explain why or why not for both preprocessing methods. [6 marks]
c) Given the task and the features you have chosen, do you think a Random Forest classifier would be appropriate? What about a Support Vector Machine? Justify your answers. [6 marks]
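As one illustration of the kind of feature extraction this task might use (character n-grams are one of several defensible choices, not the expected answer), a minimal sketch:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Counts of character n-grams, one feature type often used for
    native language identification (captures spelling and transfer patterns)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

feats = char_ngrams("the theory", n=3)
```

A vector of such counts per author could then be fed to whichever classifier you argue for in part c).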
Question 5: Ethics [12 marks]
You are tasked with developing an NLP tool that can identify a person’s gender (male or female) from a history of their email conversations. Discuss the ethical implications of this application.
Question 6: Lexical Semantics [15 marks]
object
├─ animal
│  ├─ feline
│  │  └─ lion
│  └─ canine
│     └─ dog
└─ vehicle
   └─ car
The questions below are based on the partial lexical hierarchy above.
a) Fill in this sentence with the appropriate -nym: animal is a __________ of lion. [3 marks]
b) Based on simple “path-based” similarity, which is more similar to lion: dog or vehicle? What about with the “Wu-Palmer” similarity metric? [6 marks]
c) If we are using “Lin” similarity, is it possible that lion might be more similar to car than it is to dog? If so, give the condition on the “information content” of dog that must hold (in terms of the IC of other nodes) for this to happen; if not, explain why not. [6 marks]
Section C: Algorithmic Questions [33 marks]
In this section you are asked to demonstrate your understanding of the methods that we have studied in this subject by performing algorithmic calculations.
Question 7: Part-of-Speech and Parsing [18 marks]
This question is about analyzing syntax. Consider the following newspaper headline:
Eye drops off shelf
a) First show the key ambiguity in the sentence by giving two possible part-of-speech tag sequences. You can use any existing POS tagset, or your own, provided it satisfies the basic properties of a tag set and is easily interpretable. The tag set you use need not distinguish inflectional differences. [3 marks]
b) Write a set of CFG productions that can represent and structurally differentiate these two interpretations. Your set of non-terminals should consist of S, NP, VP, and your POS tag set from above, and your rules should have no recursion. [6 marks]
c) Do a CYK parse of the sentence using your grammar. You must include the full table. Be sure to convert your grammar to Chomsky Normal Form, and show which productions must be changed. [9 marks]
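As a refresher on the algorithm itself (using an illustrative CNF grammar, not the grammar you are asked to write above), CYK recognition can be sketched as:

```python
def cyk(words, lexicon, rules):
    """CYK recognition for a grammar in Chomsky Normal Form.
    lexicon: terminal -> set of preterminals; rules: (B, C) -> set of A with A -> B C."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                  # width-1 spans from the lexicon
        table[i][i + 1] = set(lexicon.get(w, set()))
    for span in range(2, n + 1):                   # wider spans, shortest first
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):              # all binary split points
                for B in table[i][k]:
                    for C in table[k][j]:
                        table[i][j] |= rules.get((B, C), set())
    return table

# Illustrative grammar: S -> NP VP, NP -> D N, VP -> V NP
lexicon = {"the": {"D"}, "dog": {"N"}, "bone": {"N"}, "ate": {"V"}}
rules = {("D", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
table = cyk("the dog ate the bone".split(), lexicon, rules)
```

The sentence is accepted when the start symbol appears in the cell spanning the whole input.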
Question 8: Viterbi Decoding [15 marks]
a) Why is decoding difficult for HMMs at test time? Explain this in the context of part-of-speech tagging using an HMM. [6 marks]
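As a refresher on the dynamic-programming routine that makes decoding tractable, a minimal Viterbi sketch is given below; the toy HMM parameters are illustrative and are deliberately not the tables used in part b).

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable tag sequence for obs under an HMM (raw probabilities, no logs)."""
    # v[t][s] = (best score of any path ending in state s at time t, backpointer)
    v = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        col = {}
        for s in states:
            prev = max(states, key=lambda p: v[t - 1][p][0] * trans_p[p][s])
            col[s] = (v[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]], prev)
        v.append(col)
    # Backtrace from the highest-scoring final state
    best = max(states, key=lambda s: v[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(v[t][path[-1]][1])
    return list(reversed(path))

# Illustrative toy HMM (NOT the exam's tables)
states = ["N", "V"]
start_p = {"N": 0.5, "V": 0.5}
trans_p = {"N": {"N": 0.2, "V": 0.8}, "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"walk": 0.7, "runs": 0.3}, "V": {"walk": 0.2, "runs": 0.8}}
tags = viterbi(["walk", "runs"], states, start_p, trans_p, emit_p)
```

For the hand calculation in part b) you are expected to show each cell's score and backpointer, which this routine computes implicitly.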
b) Perform Viterbi decoding given the sentence they can fish and the following emission and transition tables. You should show the full table and the computation steps involved. [9 marks]

        they   can   fish
   N    0.4    0.3   0.3
   V    0.1    0.5   0.4

Table 1: Emission probabilities

          N     V
   <s>   0.6   0.4
   N     0.3   0.7
   V     0.7   0.3

Table 2: Transition probabilities (rows give the previous tag, with <s> the start of the sentence; columns give the next tag)

— End of Exam —