
ANLP midterm study guide and sample questions
Sharon Goldwater October 19, 2020
1 Content and format
The midterm test will cover content from Weeks 1-4. Questions will be worth 20 marks in total, and each question will indicate the number of marks it is worth. Partial credit may be awarded.
The format is open book. That means we will not be asking questions that simply require you to remember facts or definitions. Instead, the questions will test higher-level thinking such as your ability to apply concepts or evaluate different approaches.
Question types may include multiple choice, numerical answers, or free response. Free response questions will not typically require more than a few sentences to answer, but we may include one or two questions that are more open-ended, where a longer answer would be appropriate. You should expect to see several questions that are relatively straightforward and quick to answer, as well as some questions that may take more thought and/or time to write your response.
2 Sample questions
The sample questions below are questions from previous exams that would be appropriate as open-book questions covering content from Weeks 1-4. Each question is shown with the number of marks that it was worth on a two-hour final exam worth 50 marks in total. Note: These sample questions do not include much about ethics, and nothing about dialect or discrimination, because those topics were added to the course fairly recently. But the midterm test may include questions covering any topic from Weeks 1-4, including those topics.
We do not provide model answers. However, if you have a particular question about one of these questions, or your solution, feel free to post to Piazza.
1. How many types and how many tokens are in the following sentence? [2 marks]

the cat looked at the other cats
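To make the type/token distinction concrete, here is a minimal sketch in Python, run on a different, invented sentence (it is an illustration only, not a model answer to the question above):

    # Count tokens (all word occurrences) and types (distinct words),
    # assuming simple whitespace tokenisation.
    sentence = "a dog saw a cat and a bird"      # invented example sentence
    tokens = sentence.split()
    types = set(tokens)
    print(len(tokens), "tokens,", len(types), "types")   # prints: 8 tokens, 6 types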

2. What affixes are in the word “reactors”? Identify whether each affix is derivational or inflectional. [2 marks]

3. Consider the sentence S = the cat chased the dog .

(a) Suppose we use a unigram language model to compute sentence probabilities. Give a different sentence with the same unigram probability as S. [2 marks]

(b) Now suppose we use a bigram language model. The following table gives smoothed probability estimates for P(w_i | w_{i−1}) under this model, with w_{i−1} labelling the rows and w_i the columns. Is this table complete, or does the model also include probabilities for bigrams that aren’t shown in this table? Justify your answer. [2 marks]
w_{i−1} \ w_i   the       a         cat       dog       girl      chased    </s>
<s>             0.0067    0.0067    0.00056   0.0002    0.00007   0.0045    0
the             0.00012   0.00009   0.03      0.02      0.02      0.005     0.00008
a               0.00012   0.00009   0.05      0.01      0.03      0.00041   0.00009
cat             0.006     0.0081    0.00008   0.00009   0.0004    0.003     0.003
dog             0.0081    0.006     0.00008   0.00009   0.0004    0.004     0.003
girl            0.0016    0.007     0.00005   0.00013   0.00022   0.0061    0.003
chased          0.024     0.006     0.00041   0.0001    0.00037   0.00002   0.0003
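As background for question 3, the sketch below shows how a bigram table like the one above would be used to score a sentence, with <s> and </s> padding. The probabilities in the dictionary are invented for illustration and are not taken from the table:

    # Sketch: scoring a sentence under a bigram model stored as a nested dict.
    # All probability values here are invented.
    bigram = {
        "<s>":    {"the": 0.1},
        "the":    {"cat": 0.02, "dog": 0.01},
        "cat":    {"chased": 0.005},
        "chased": {"the": 0.05},
        "dog":    {"</s>": 0.004},
    }

    def sentence_prob(words, bigram):
        # P(w1..wn) = product over i of P(w_i | w_{i-1}), padded with <s> and </s>
        padded = ["<s>"] + words + ["</s>"]
        p = 1.0
        for prev, cur in zip(padded, padded[1:]):
            p *= bigram.get(prev, {}).get(cur, 0.0)
        return p

    print(sentence_prob("the cat chased the dog".split(), bigram))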
4. Suppose I have a large corpus, and I make a list of all the words that occur exactly once in the first half of the corpus. I then count how many times each of those words occurs in the second half of the corpus. Which of the following should I expect to find? [2 marks]

A. On average, these words occur less than once in the second half.
B. On average, these words occur more than once in the second half.
C. On average, these words occur exactly once in the second half.
D. There is no way to predict; any of the above are equally likely.
5. N-gram language models. [4 marks]

For each scenario below, I train a bigram and a trigram model, and compute perplexities. In each case, say which model you would normally expect to have lower perplexity: the bigram model, the trigram model, or either model could have lower perplexity. Justification is not needed.

(a) Both models are trained using maximum-likelihood estimation (MLE) on the same corpus. I compute perplexity on the training corpus.

(b) Each model is trained using MLE on a different corpus. I compute the perplexity of each model on the corpus I used to train it.
(c) I collect a very large corpus of English Wikipedia data and set aside some random sentences for testing. I use Kneser-Ney smoothing and build models using standard methods with the non-test data, then compute perplexity on the test data.
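To make the comparisons in question 5 concrete, here is a small sketch of MLE bigram estimation and perplexity on a toy corpus (unsmoothed, so it can only be evaluated on data whose bigrams were all seen in training; the corpus is invented):

    import math
    from collections import Counter

    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]   # invented toy corpus

    def bigrams(sent):
        padded = ["<s>"] + sent + ["</s>"]
        return list(zip(padded, padded[1:]))

    bigram_counts = Counter(bg for sent in corpus for bg in bigrams(sent))
    context_counts = Counter()
    for (prev, cur), c in bigram_counts.items():
        context_counts[prev] += c

    def mle_prob(prev, cur):
        return bigram_counts[(prev, cur)] / context_counts[prev]

    def perplexity(sentences):
        log_prob, n = 0.0, 0
        for sent in sentences:
            for prev, cur in bigrams(sent):
                log_prob += math.log2(mle_prob(prev, cur))
                n += 1
        return 2 ** (-log_prob / n)

    print(perplexity(corpus))   # perplexity of the training corpus itself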
6. Suppose I’m building 5-gram language models over characters. I build two models using the same corpus: one uses add-α smoothing, and the other uses backoff smoothing. If ␣ represents the space character, and the sequence xing never occurs in the corpus, which model is likely to assign a higher value to P(␣ | xing), and why?
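As a rough illustration of the two estimators contrasted in question 6, the sketch below computes an add-α estimate and a simple unnormalised (“stupid”) backoff estimate for character n-grams over an invented training string; the backoff schemes discussed in the course differ in how they discount and normalise:

    from collections import Counter

    text = "going_doing_singing_king_"      # invented data; '_' stands for the space character
    N = 5                                   # 5-gram model: history is 4 characters
    vocab = sorted(set(text))

    # counts[k][s] = number of occurrences of the length-k substring s
    counts = {k: Counter(text[i:i+k] for i in range(len(text) - k + 1))
              for k in range(1, N + 1)}

    def add_alpha(char, hist, alpha=0.1):
        k = len(hist)
        return (counts[k+1][hist + char] + alpha) / (counts[k][hist] + alpha * len(vocab))

    def backoff(char, hist, discount=0.4):
        # Use shorter histories when the longer history+char sequence is unseen.
        if hist == "":
            return counts[1][char] / sum(counts[1].values())
        k = len(hist)
        if counts[k+1][hist + char] > 0:
            return counts[k+1][hist + char] / counts[k][hist]
        return discount * backoff(char, hist[1:], discount)

    print(add_alpha("_", "xing"), backoff("_", "xing"))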
7. Edit distance.

I use dynamic programming to compute the weighted edit distance between the strings BALL and BILE. The weights (costs) for each edit operation are between 0 and 1, and yield the chart shown below.

[Chart: a dynamic programming table with B A L L along the top and B I L E down the side; each cell contains a cumulative cost with backtrace arrows (←, ↑, ↖), and the backtraces of one path are shown in bold.]

(a) What is the alignment corresponding to the backtraces shown in bold? [3 marks]

(b) What is the cost of inserting the letter E? [2 marks]
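The chart in question 7 is filled by the standard dynamic-programming recurrence for weighted edit distance. The sketch below uses invented operation costs (not the ones behind the chart above) and records backtraces:

    # Sketch: weighted edit distance with backtraces; costs are invented.
    INS, DEL, SUB = 0.3, 0.2, 0.4

    def edit_distance(src, tgt):
        n, m = len(src), len(tgt)
        D = [[0.0] * (m + 1) for _ in range(n + 1)]      # D[i][j]: cost of src[:i] -> tgt[:j]
        back = [[""] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0], back[i][0] = D[i-1][0] + DEL, "up"
        for j in range(1, m + 1):
            D[0][j], back[0][j] = D[0][j-1] + INS, "left"
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0.0 if src[i-1] == tgt[j-1] else SUB
                options = [(D[i-1][j-1] + sub, "diag"),
                           (D[i-1][j] + DEL, "up"),
                           (D[i][j-1] + INS, "left")]
                D[i][j], back[i][j] = min(options)
        return D, back

    D, back = edit_distance("BILE", "BALL")
    print(D[-1][-1])   # total weighted edit distance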
8. In social media such as Twitter, users sometimes emphasize words by repeating letters. For example, a user might write I lloooovveee it! to emphasize the word love. However, this non-standard spelling can present problems for NLP systems. This question considers some ways to deal with the problems.

(a) The diagram below shows a finite-state automaton that recognizes the word love. Convert this to a finite-state transducer that corrects non-standard to standard spelling for the word love. Assume that the non-standard spelling contains the characters l, o, v, e in that order, but might have more than one of each character. Your FST need not be deterministic. [3 marks]

[Diagram: a finite-state automaton with a START state and arcs labelled l, o, v, e that accepts the word love.]
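Since the diagram itself could not be reproduced, here is the recogniser from part (a) written as a transition table and simulated in Python; the state names are assumptions, and the code only shows why the plain FSA fails on emphasized spellings (it is not the transducer you are asked to construct):

    # Sketch: the FSA that recognizes "love", as a transition table.
    fsa = {
        ("START", "l"): "q1",
        ("q1", "o"): "q2",
        ("q2", "v"): "q3",
        ("q3", "e"): "q4",
    }
    accepting = {"q4"}

    def accepts(word):
        state = "START"
        for ch in word:
            if (state, ch) not in fsa:
                return False
            state = fsa[(state, ch)]
        return state in accepting

    print(accepts("love"), accepts("lloooovveee"))   # True False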

(b) (Note: This part of the question focuses on a problem from the first half of the course, but would be more appropriate on a final exam than a mid-term, so that you would be able to draw on methods discussed in the later part of the course.) Suppose you designed similar transducers for all English words in order to convert emphasized (repeated letter) versions of words into standard spelling. What problem(s) might you have with the results? Outline how you might try to solve these problem(s) using other methods from the course, assuming you have access to the full text where the words are used. Include specific examples in your answer where possible, and discuss whether your solution requires other resources beyond those already mentioned, and any other pros/cons of your solution. [6 marks]
9. Suppose we estimate the parameters of an HMM using add-α smoothing. In class we discussed some problems with add-α smoothing. For an HMM, where would these problems be worse: for estimating the transition probabilities, or for estimating the emission probabilities? Justify your answer. [4 marks]
10. Suppose we are tagging named entities using an HMM. We want to tag each word as either PER for persons, ORG for organizations, or OTH for all other tokens. We also have special start- and end-of-sentence markers <s> and </s> in both the tag sequences and word sequences.

Below are the transition matrix and part of the output matrix for the HMM:

t_{i−1} \ t_i    PER     ORG     OTH     </s>
<s>              0.6     0       0.4     0
PER              0.5     0.05    0.45    0
ORG              0       0.7     0.3     0
OTH              0.02    0.03    0.9     0.05

t \ w    Mr        Craft     David     winked    …
PER      1×10⁻²    3×10⁻³    8×10⁻³    3×10⁻⁸
ORG      2×10⁻³    7×10⁻⁴    6×10⁻⁴    0
OTH      0         3×10⁻⁷    0         5×10⁻⁷
(a) If t⃗ = PER ORG OTH and w⃗ = Mr Craft winked, what is P(t⃗, w⃗) for this HMM? [3 marks]
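For reference, P(t⃗, w⃗) under an HMM is the product of one transition and one output probability per position, plus the final transition to the end marker. The sketch below shows that computation on invented tables, so it does not reproduce the answer to (a):

    # Sketch: joint probability of a tag sequence and a word sequence under an HMM.
    # All probabilities here are invented.
    trans = {("<s>", "A"): 0.7, ("A", "B"): 0.2, ("B", "</s>"): 0.5}
    emit  = {("A", "x"): 0.1, ("B", "y"): 0.3}

    def joint_prob(tags, words, trans, emit):
        # P(t, w) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i), then the transition to </s>
        p, prev = 1.0, "<s>"
        for t, w in zip(tags, words):
            p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
            prev = t
        return p * trans.get((prev, "</s>"), 0.0)

    print(joint_prob(["A", "B"], ["x", "y"], trans, emit))   # 0.7*0.1 * 0.2*0.3 * 0.5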
(b) Below is a partly completed chart computed using the Viterbi algorithm:

        <s>    Mr        Craft    winked
<s>     1      0
PER     0      6×10⁻³
ORG     0      0
OTH     0      2×10⁻⁹
</s>    0      0
Using the Viterbi algorithm, what is the numerical value computed to fill the cell [PER, Craft]? What does this value represent, and where will the backpointer point? [4 marks]
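As a reminder of how cells like [PER, Craft] are filled, here is a compact sketch of the Viterbi recursion over dictionary-based parameters; the toy tags and probabilities are invented and unrelated to the question:

    # Sketch: Viterbi decoding. v[t] is the best probability of any tag sequence
    # ending in tag t at the current word; back[i][t] records the best predecessor.
    def viterbi(words, tags, trans, emit):
        v = {t: trans.get(("<s>", t), 0.0) * emit.get((t, words[0]), 0.0) for t in tags}
        back = [{}]
        for w in words[1:]:
            new_v, ptrs = {}, {}
            for t in tags:
                best_prev = max(tags, key=lambda p: v[p] * trans.get((p, t), 0.0))
                new_v[t] = v[best_prev] * trans.get((best_prev, t), 0.0) * emit.get((t, w), 0.0)
                ptrs[t] = best_prev
            v, back = new_v, back + [ptrs]
        last = max(tags, key=lambda t: v[t] * trans.get((t, "</s>"), 0.0))
        path = [last]
        for ptrs in reversed(back[1:]):
            path.append(ptrs[path[-1]])
        return list(reversed(path))

    tags = ["A", "B"]    # invented toy model, just to show the call
    trans = {("<s>", "A"): 0.7, ("<s>", "B"): 0.3, ("A", "A"): 0.4, ("A", "B"): 0.4,
             ("B", "A"): 0.3, ("B", "B"): 0.5, ("A", "</s>"): 0.2, ("B", "</s>"): 0.2}
    emit = {("A", "x"): 0.5, ("A", "y"): 0.1, ("B", "x"): 0.1, ("B", "y"): 0.6}
    print(viterbi(["x", "y", "y"], tags, trans, emit))   # ['A', 'B', 'B']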
11. This question is about handling unseen words in sequence models. As in the previous question, we want to tag named entities using the PER, ORG, and OTH tags.
The following sentences occur in our test set, and the words Xfinity, GBX, Altoona, Slovensky, and semiconscious have not been seen in training:
Xfinity is going further .
GBX acquired Altoona Corporation for $100 million .
Mr Slovensky looked at the semiconscious attendees .
(a) Consider the following method for handling unseen words using an HMM. In the training corpus, use the new token UNK to replace all words that occur exactly once, and then estimate probabilities as usual. To tag the test sentences, temporarily replace any unseen word with UNK, and then run the Viterbi algorithm. In the output, change each UNK back to its original word.
Using this method, which of the unseen words in the example sentences seem most likely to be tagged correctly, and why? (That is, what information might lead the HMM to predict the correct tag?) [3 marks]
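The preprocessing described in (a) can be sketched as below (this only illustrates the UNK replacement itself; it says nothing about which tags the HMM would then assign). The toy training sentences are invented:

    from collections import Counter

    def unk_train(train_sents):
        # Replace words that occur exactly once in training with the token UNK.
        counts = Counter(w for sent in train_sents for w in sent)
        replaced = [[w if counts[w] > 1 else "UNK" for w in sent] for sent in train_sents]
        return replaced, set(counts)               # also return the set of seen words

    def unk_test(test_sent, seen):
        # Temporarily replace words never seen in training with UNK.
        return [w if w in seen else "UNK" for w in test_sent]

    train = [["Mr", "Jones", "smiled", "."], ["Mr", "Smith", "smiled", "."]]
    replaced, seen = unk_train(train)
    print(replaced)
    print(unk_test(["Mr", "Slovensky", "winked", "."], seen))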
(b) (Note: this part of the question has been removed because it requires knowledge from later in the course.)
12. HMMs are sometimes used for chunking: identifying short sequences of words (chunks) within a text that are relevant for a particular task. For example, if we want to identify all the person names in a text, we could train an HMM using annotated data similar to the following:
On/O Tuesday/O ,/O Mr/B Cameron/I met/O with/O Chancellor/B Angela/I Merkel/I ./O
There are three possible tags for each word: B marks the beginning (first word) of a chunk, I marks words inside a chunk, and O marks words outside a chunk. We also use SS and ES tags to mark the start and end of each sentence, respectively.
Crucially, the O and SS tags may not be followed by I because we need to have a B first indicating the beginning of a chunk.
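The constraint can be checked mechanically: an I tag is only legal immediately after B or I. A small sketch of such a check (illustration only):

    # Sketch: check that a BIO tag sequence obeys the constraint that
    # I may only follow B or I (so O and SS may not be followed by I).
    def valid_bio(tags):
        prev = "SS"                       # implicit start-of-sentence tag
        for tag in tags + ["ES"]:         # implicit end-of-sentence tag
            if tag == "I" and prev not in ("B", "I"):
                return False
            prev = tag
        return True

    print(valid_bio("O O B I O".split()))   # True
    print(valid_bio("O I B I O".split()))   # False: I may not follow O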
Answer the following questions about this HMM.
(a) Write down an expression for the probability of generating the sentence Henry saw Barack Obama tagged with the sequence B O B I. [4 marks]

(b) What, if any, changes would you need to make to the Viterbi algorithm in order to use it for tagging sentences with this BIO scheme? How can you incorporate the constraint on which tags can follow which others? [4 marks]
13. (Note: This question can be answered based just on content from the first half, but your answer might be different or more complete if you answered it at the end of the course, since you would have more to draw on by then.)
Suppose you want to build a sentiment analysis system for movie reviews. That is, given the text of a movie review, your system should classify the review as either positive or negative.
Describe how you would build such a system. Your description should include the following aspects:
• What data or other resources would you need, how you might get them, and how you would use them.
• What model or algorithm you would use, and how you would apply it in this situation. Your description should be high-level and does not need to include equations.
• How you would evaluate the results of the system.
• What, if any, ethical issues might arise in developing the system.
In some cases there might be multiple options for each choice. If so, mention briefly the strengths and weaknesses of each approach, or say what additional information you might need in order to decide which approach is best. [8 marks]
14. The tables below show some of the transition probabilities (top) and output probabilities (bottom) from an HMM. (As usual, row labels indicate the conditioning information.) Use them to answer the following questions.
t_{i−1} \ t_i    DT       NN       VBD       VBG       </s>
<s>              0.6      0.024    0.004     0.001     0
DT               0.005    0.4      0.0016    0.0002    0.01
NN               0.12     0.1      0.16      0.21      0.2
VBD              0.25     0.031    0.009     0.005     0.07
VBG              0.33     0.03     0.00014   0.00015   0.05
…

t \ w    <s>    the       chair     chaired   meeting   …    </s>
<s>      1      0         0         0         0              0
DT       0      0.3       0         0         0              0
NN       0      2×10⁻⁵    3×10⁻⁶    0         5×10⁻⁷         0
VBG      0      0         0         0         6×10⁻⁵         0
VBD      0      0         0         4×10⁻⁷    0              0
</s>     0      0         0         0         0              1
(a) Assuming the values in the output probability table were estimated from an annotated corpus, can you say whether the word meeting is tagged more often as VBG or NN in the corpus? Justify your answer.
(b) Using the model defined by the tables above, compute P(w⃗, t⃗), where w⃗ is the chair chaired the meeting and t⃗ is DT NN VBD DT VBG.
15. Evaluating text classification.

Suppose I have a spam classifier that returns the predicted probability that each email is spam. I run it on a small set of emails where I know the gold standard label for each email (1 = spam; 0 = not spam). I then rank the emails according to the system’s predicted probability (1 = highest probability; 7 = lowest probability), as shown below.

System ranking    Gold label
1                 1
2                 1
3                 0
4                 1
5                 0
6                 0
7                 0

The classifier will label an email as spam if its predicted probability is above some threshold t. Suppose I want the classifier’s precision on these emails to be at least 80%, and I set t accordingly. What will the classifier’s highest possible recall be? [2 marks]
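Precision and recall at a given threshold can be read directly off such a ranking: if the threshold admits the top k emails, precision is the fraction of those k that are spam, and recall is the fraction of all spam emails that fall within the top k. The sketch below uses an invented list of gold labels, not the ones in the table above:

    # Sketch: precision and recall when the top k ranked emails are labelled spam.
    # gold[i] is the gold label of the email at rank i+1 (1 = spam); values are invented.
    gold = [1, 0, 1, 1, 0, 0]

    def precision_recall_at(k, gold):
        tp = sum(gold[:k])                 # spam emails above the threshold
        return tp / k, tp / sum(gold)      # (precision, recall)

    for k in range(1, len(gold) + 1):
        print(k, precision_recall_at(k, gold))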