UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS
INFR11125 ACCELERATED NATURAL LANGUAGE PROCESSING
Friday 13th December 2019, 09:30 to 11:30
INSTRUCTIONS TO CANDIDATES
1. Note that ALL QUESTIONS ARE COMPULSORY.
2. DIFFERENT QUESTIONS MAY HAVE DIFFERENT NUMBERS OF TOTAL MARKS. Take note of this in allocating time to questions.
3. CALCULATORS MAY NOT BE USED IN THIS EXAMINATION.
MSc Courses
Convener: V. Nagarajan
External Examiners: W. Knottenbelt, M. Dunlop, M. Niranjan, E. Vasilaki
THIS EXAMINATION WILL BE MARKED ANONYMOUSLY
1. Short answer questions (20 marks total). Many of these questions can be answered with a word or two, and none should require more than a few sentences.
(a) N-gram language models.
For each scenario below, I train a bigram and a trigram model, and compute perplexities. In each case, say which model you would normally expect to have lower perplexity: the bigram model, the trigram model, or either model could have lower perplexity. Justification is not needed.
i. Both models are trained using maximum-likelihood estimation (MLE) on the same corpus. I compute perplexity on the training corpus.
ii. Each model is trained using MLE on a different corpus. I compute the perplexity of each model on the corpus I used to train it.
iii. I collect a very large corpus of English Wikipedia data and set aside some random sentences for testing. I use Kneser-Ney smoothing and build models using standard methods with the non-test data, then compute perplexity on the test data.
[4 marks]
(b) Use the following sentence to answer this question.
The president arrived in Hawaii after a long trip
i. Give the lemma for any word(s) in the sentence where the lemma is different from the word itself.
ii. Draw the dependency parse above the sentence and label each arc using Universal Dependency style labels.
Then, underneath each word in the sentence, write its part of speech, using Penn Treebank style tags. The relevant tags are all included in the table below as a reminder.

CC    coordinating conjunction
CD    cardinal number
DT    determiner
IN    preposition
JJ    adjective
JJR   comparative adjective
MD    modal
NN    noun, singular or mass
NNS   noun, plural
NNP   proper noun
NNPS  proper noun, plural
PRP   personal pronoun
RB    adverb
RBR   comparative adverb
TO    "to"
VB    verb, base form
VBD   verb, past tense
VBG   verb, gerund
VBN   verb, past participle
VBP   verb, non-3sg present
VBZ   verb, 3sg present

[5 marks]
(c) Evaluating text classification.
Suppose I have a spam classifier that returns the predicted probability that each email is spam. I run it on a small set of emails where I know the gold standard label for each email (1 = spam; 0 = not spam). I then rank the emails according to the system’s predicted probability (1 = highest probability; 7 = lowest probability), as shown below.

System ranking   Gold label
1                1
2                1
3                0
4                1
5                0
6                0
7                0

The classifier will label an email as spam if its predicted probability is above some threshold t. Suppose I want the classifier’s precision on these emails to be at least 80%, and I set t accordingly. What will the classifier’s highest possible recall be?
[2 marks]
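As a reminder of how the metrics in (c) behave over a ranked list, here is a minimal sketch of computing precision and recall when the top-k ranked items are labelled positive. The function name and the example ranking are illustrative, not the ones in the question.

```python
# Precision and recall when labelling the top-k ranked emails as spam.
def precision_recall_at_k(gold, k):
    """gold: gold labels in system-ranked order (1 = spam, 0 = not spam)."""
    predicted_spam = gold[:k]        # items scored above the threshold
    tp = sum(predicted_spam)         # true positives among predictions
    precision = tp / k
    recall = tp / sum(gold)          # fraction of all spam recovered
    return precision, recall

# Illustrative ranking (not the one in the question):
gold = [1, 0, 1, 1, 0]
p, r = precision_recall_at_k(gold, 3)   # precision 2/3, recall 2/3
```

Lowering the threshold (increasing k) can only keep or raise recall, but may lower precision.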
(d) What does the Viterbi algorithm require as input, and what does it produce as output? If you include equations in your answer, you must explain what they mean.
[2 marks]
(e) Edit distance.
I use dynamic programming to compute the weighted edit distance between the strings BALL and BILE. The weights (costs) for each edit operation are between 0 and 1, and yield the chart shown below.

         B       A       L       L
    0    ←0.3    ←0.7    ←1.1
B  ↑0.2  ↖0.3    ←0.7    ←1.1
I  ↑0.6  ↑0.7    ↖0.3    ↖↑0.7
L  ↑0.8  ↖↑0.9   ↑0.5    ←0.9
E

i. What is the alignment corresponding to the backtraces shown in bold?
ii. What is the cost of inserting the letter E?
[2 marks]
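A chart of the kind used in (e) is filled by the standard weighted edit-distance recurrence; below is a minimal sketch. The cost functions are parameters, and the uniform-cost example at the end is illustrative (it is ordinary Levenshtein distance, not the weights used in the exam chart).

```python
# Weighted edit distance between a source and target string via dynamic
# programming. chart[i][j] = min cost of turning source[:j] into target[:i].
def weighted_edit_distance(source, target, ins_cost, del_cost, sub_cost):
    n, m = len(source), len(target)
    chart = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):                 # first row: delete source letters
        chart[0][j] = chart[0][j - 1] + del_cost(source[j - 1])
    for i in range(1, m + 1):                 # first column: insert target letters
        chart[i][0] = chart[i - 1][0] + ins_cost(target[i - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            chart[i][j] = min(
                chart[i][j - 1] + del_cost(source[j - 1]),        # <- arrow
                chart[i - 1][j] + ins_cost(target[i - 1]),        # up arrow
                chart[i - 1][j - 1]                               # diagonal
                + (0.0 if source[j - 1] == target[i - 1]
                   else sub_cost(source[j - 1], target[i - 1])),
            )
    return chart[m][n]

# Example with uniform costs of 1 (Levenshtein distance):
d = weighted_edit_distance("BALL", "BILE",
                           ins_cost=lambda c: 1.0,
                           del_cost=lambda c: 1.0,
                           sub_cost=lambda a, b: 1.0)
# BALL -> BILE needs two substitutions (A->I and L->E), so d = 2.0
```

Keeping backpointers to whichever of the three options achieved the minimum yields the arrows (and hence the alignment) shown in such charts.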
(f) Use the following three very short documents to answer this question.

Doc1: The president arrived in Hawaii after a long trip. He rested in a local hotel.
Doc2: Hawaii is beautiful, and you should take a trip there. I can tell you what my favorite hotel in Hawaii is.
Doc3: The president announced another trip yesterday.

i. Find two (distinct) examples in these documents that illustrate different kinds of linguistic agreement. What are the two kinds of agreement your examples illustrate, and which words agree in each case?
ii. Consider the two events:
X = “Hawaii occurs in a document”
Y = “president occurs in a document”
Using MLE on the documents above, find the estimated probabilities P(X) and P(Y) and determine whether X and Y are independent in this data. What does the answer tell us about the value of PMI(X, Y)? (Hint: you don’t need to compute PMI(X, Y) to answer this question.)
[5 marks]
2. Text classification. (10 marks total)
You’re working for a startup that has partnered with the National Health Service (NHS). The NHS wants to explore whether it can partially automate diagnosis of depression by asking patients to write a short essay, and applying NLP to classify the text. If the classifier flags the patient as being likely to have depression, then they would be referred to a doctor for further diagnosis and treatment.
(a) Your boss asks you to build a simple baseline system first using Naive Bayes. She thinks the style and sentiment of the document are more discriminative for this task than the topic, so she suggests you use the following features:
• All stopwords from a standard stopword list.
• The 75 most frequent positive words and the 75 most frequent negative words from a sentiment lexicon.
i. In some of the practical work for the course, we used a sparse vector representation to store word co-occurrence counts for a large data set. Would there be a benefit to using a sparse vector representation to store the word-document co-occurrence counts in this situation? Explain why or why not.
ii. To train a Naive Bayes classifier, you need to estimate P(fi|cj) for each feature fi and class cj. Say what the classes are in this scenario, and give an expression for the add-alpha smoothed estimate of P(happy|c1). Make sure you define all of the terms in your equation and, where possible, say what actual values they have in this case.
[4 marks]
(b) Together with the NHS, how would you obtain data to train the classifier, and what would you need to ensure before using it?
[3 marks]
(c) Your boss thinks that if the system is successful, perhaps your company could also sell it to Facebook so they can apply it to users’ posts and suggest mental health counselling if a user appears to be depressed.
Briefly describe three potential problems with this plan. At least one must be technical, and at least one must be ethical.
[3 marks]
3. Co-reference and gender bias. (10 marks total)
Consider the following two sentences for this question.
S1: The nurse called the plumber to ask her about the blocked drain.
S2: The nurse called the plumber to advise about her back pain.
(a) Is S1 a pro-stereotypical or anti-stereotypical sentence? By changing a single noun phrase, turn it into the opposite kind of sentence.
[2 marks]
(b) Explain how sentences like these have been used to measure gender bias in co-reference systems. Which sentence would you expect to be more affected by gender bias in a co-reference system, and why?
[4 marks]
(c) In class, we discussed a way to reduce gender bias in English co-reference systems by adding artificial gender-swapped examples to the training data. Using concrete examples, explain how this is done in English, and what difficulties you might run into if you tried to use the same method on a morphologically complex language.
[4 marks]
4. Parsing and logistic regression. (10 marks total)
It is 2021 and aliens have landed on Earth. The linguist Panini is trying to understand their language and discovers it has some similarities to human languages. For example, there are four parts of speech (verb: V, noun: N, adjective: A, and determiner: D), and these combine to form syntactic constituents.
Panini hypothesizes that there are three types of constituents (verb phrases: VP, noun phrases: NP, and adjective phrases: AP), and that the type of phrase is determined by the following rules:
R1 If there are more than two words in the phrase labelled as V, it is a VP.
R2 Otherwise, if the leftmost word in the phrase is labelled as N, it is an NP.
R3 Otherwise, if the rightmost word in the phrase is labelled as A, it is an AP.
R4 Otherwise, the sequence of parts of speech (or words) is not a valid constituent (None).
(a) Panini decides to formulate the above rules as a logistic regression model P(y | x), where y ∈ {VP, NP, AP, None} and x is a sequence of POS tags over {V, N, A, D} (for example x = D N V or x = V V).
i. Define four features f1(x, y), f2(x, y), f3(x, y), and f4(x, y) such that the label y will always be predicted correctly given the weights w1 = 8, w2 = 4, w3 = 2, and w4 = 1.
ii. Write down the model’s final decision rule: that is, the formula for choosing y. (Hint: your answer should have the form y = argmax_y [. . .].)
[3 marks]
(b) Panini wants to build a parser using this logistic regression model. The parser takes a sequence of POS tags as input, predicts a phrase label (or None) for each span in the sequence, and then tries to join up all phrases labelled as VP, NP, or AP into a larger parse structure.
For example, for the string V N A, Panini’s rules lead to phrases consisting of the three POS spans (0, 1, V), (1, 2, N), (2, 3, A) and the two additional phrases (0, 3, AP) and (1, 3, NP). The tree built from all these phrases, shown as a labelled bracketing, would be:
[AP V [NP N A]]
If this approach is applied to the sequence V V V D N A, will it be possible to create a valid hierarchical phrase structure tree using all the phrases generated by Panini’s rules? If so, give the tree. If not, explain why not.
[2 marks]
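Rules R1–R4 can be written down directly as a labelling function; here is a minimal sketch (the function name and span representation are my own), applied to the worked V N A example above.

```python
# Assign a phrase label to a POS-tag sequence using Panini's rules R1-R4.
def phrase_label(tags):
    if tags.count("V") > 2:    # R1: more than two words labelled V -> VP
        return "VP"
    if tags[0] == "N":         # R2: otherwise, leftmost word is N -> NP
        return "NP"
    if tags[-1] == "A":        # R3: otherwise, rightmost word is A -> AP
        return "AP"
    return None                # R4: otherwise, not a valid constituent

# Reproducing the V N A example: label every span of length >= 2.
tags = ["V", "N", "A"]
spans = {(i, j): phrase_label(tags[i:j])
         for i in range(len(tags))
         for j in range(i + 2, len(tags) + 1)}
# (0, 2) = "V N" -> None, (1, 3) = "N A" -> "NP", (0, 3) = "V N A" -> "AP"
```

Note that the rules fire in order, so a span that satisfies both R2 and R3 is labelled NP.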
(c) Eventually Panini realizes that his original idea of the grammar was wrong. In fact, the grammar consists of two parts:
• a phrase structure grammar with a wide variety of phrase types, which can be expressed in Chomsky Normal Form (CNF), and
• rules R1, R2, and R3 from above.
This new grammar can be parsed by just modifying the CKY algorithm.
i. Write down pseudocode (or a formula) for what happens in the standard CKY algorithm when chart cell (i, j) is processed.
ii. Now show how you would modify the algorithm to constrain the parse so that a phrase structure is only included if it matches both the CNF grammar and the three additional rules (R1–R3). Will your modification affect the efficiency of the parser? Explain why or why not.
[5 marks]