ANLP Tutorial Exercise Set 2 (v1.0) WITH SOLUTIONS
For tutorial groups in week 4
Sharon Goldwater (School of Informatics, University of Edinburgh)
Goals
This week’s exercises and questions follow the various threads of the week’s materials: text classification, methodological and ethical issues, and in particular issues of dialect. First, they will give you some practice hand-simulating a Naive Bayes classifier, to help you understand the technical side. Second, we want you to start thinking about the choices we make when running experiments (and how to justify them) and about the implications of our tools. The final exercise and the discussion questions focus on these aspects, requiring you to think a little more deeply about some of the readings than what I said in lecture.
Exercise 1
(Adapted from JM3 Exercise 4.1)
Assume the following likelihoods for each word being part of a positive or negative movie review, and equal prior probabilities for each class.
         pos    neg
I        0.09   0.16
always   0.07   0.06
like     0.29   0.06
foreign  0.04   0.15
films    0.08   0.11
For the document 𝑑 = “I always like foreign films”,
a) What is 𝑃(𝑑|pos)?
b) What is 𝑃(𝑑|neg)?
c) What class will Naive Bayes assign to 𝑑?
d) What would the prior probability of a positive review need to be for the classifier to assign equal probability to both classes for this document?
Solutions
a) 𝑃(𝑑|pos) = (0.09)(0.07)(0.29)(0.04)(0.08) ≈ 5.85 × 10⁻⁶
b) 𝑃(𝑑|neg) = (0.16)(0.06)(0.06)(0.15)(0.11) ≈ 9.50 × 10⁻⁶
c) Since the prior probabilities are equal, and the likelihood of the negative class is greater, the classifier will assign the document to the negative class.
d) We want

   𝑃(𝑑|pos)𝑃(pos)/𝑃(𝑑) = 𝑃(𝑑|neg)𝑃(neg)/𝑃(𝑑)

Since 𝑃(neg) = 1 − 𝑃(pos), we have:

   𝑃(pos)𝑃(𝑑|pos) = 𝑃(𝑑|neg)(1 − 𝑃(pos))

Solving for 𝑃(pos) yields:

   𝑃(pos) = 𝑃(𝑑|neg) / (𝑃(𝑑|pos) + 𝑃(𝑑|neg))

or approximately 0.62.
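If you would like to check this arithmetic, here is a minimal Python sketch (not part of the exercise; the dictionary layout and variable names are just one way to set it up):

# Quick numerical check of Exercise 1; word probabilities are taken from the table above.
likelihood = {
    "pos": {"I": 0.09, "always": 0.07, "like": 0.29, "foreign": 0.04, "films": 0.08},
    "neg": {"I": 0.16, "always": 0.06, "like": 0.06, "foreign": 0.15, "films": 0.11},
}
doc = ["I", "always", "like", "foreign", "films"]

# Naive Bayes likelihood of the document under each class: product of the word probabilities.
p_d_pos = 1.0
p_d_neg = 1.0
for w in doc:
    p_d_pos *= likelihood["pos"][w]
    p_d_neg *= likelihood["neg"][w]

print(p_d_pos)  # ~5.85e-06
print(p_d_neg)  # ~9.50e-06

# Prior P(pos) at which the two posteriors are equal:
# P(pos) P(d|pos) = (1 - P(pos)) P(d|neg)  =>  P(pos) = P(d|neg) / (P(d|pos) + P(d|neg))
print(p_d_neg / (p_d_pos + p_d_neg))  # ~0.62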
Exercise 2
(Adapted from JM3 Exercise 4.2)
The following are the features extracted from a set of short movie reviews, and the genre label of each
review (either c for comedy or a for action):
Words in review               Label
fun, couple, love, love         c
fast, furious, shoot            a
couple, fly, fast, fun, fun     c
furious, shoot, shoot, fun      a
fly, fast, shoot, love          a
And here are the features from a new document 𝑑: fast, couple
a) What is the MLE estimate of the prior probability 𝑃 (c)?
b) What is 𝑃_MLE(fast|c)?
c) With 𝛼 = 0.5, what is 𝑃_add-𝛼(fast|c)?
d) Again using Add-𝛼 smoothing and 𝛼 = 0.5, compute the ratio 𝑃(c|𝑑)/𝑃(a|𝑑). What does this value tell us?
e) Suppose we changed the value of 𝛼. Would the classifier’s decision ever change as a result? (Hint: consider what happens at the two extremes, when 𝛼 approaches 0, or when it approaches infinity.)
Solutions
a) 2/5, or 0.4
b) 1/9 ≈ 0.11
c) The vocabulary size is 7, and there are 9 tokens in the c class, so we have

   𝑃_add-𝛼(fast|c) = (1 + 0.5)/(9 + (7)(0.5)) = 1.5/12.5 = 0.12

d)

   𝑃(c|𝑑)/𝑃(a|𝑑) = [𝑃(𝑑|c)𝑃(c)/𝑃(𝑑)] / [𝑃(𝑑|a)𝑃(a)/𝑃(𝑑)]
                 = 𝑃(𝑑|c)𝑃(c) / [𝑃(𝑑|a)𝑃(a)]
                 = [(1.5/12.5)(2.5/12.5)(2/5)] / [(2.5/14.5)(0.5/14.5)(3/5)]
                 ≈ 2.69
The fact that this value is greater than 1 tells us that the classifier prefers class c for this document (and by how much).
Note that in order to compute this ratio (i.e., the relative values of the two posterior probabilities) we never need to compute 𝑃(𝑑), because it cancels out.
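If you would like to check parts (c) and (d) numerically, here is a minimal Python sketch of add-𝛼 Naive Bayes on this toy data (the function and variable names, such as posterior_ratio, are my own, not from the readings):

from collections import Counter

# Toy training data from the table above: (tokens, label).
train = [
    (["fun", "couple", "love", "love"], "c"),
    (["fast", "furious", "shoot"], "a"),
    (["couple", "fly", "fast", "fun", "fun"], "c"),
    (["furious", "shoot", "shoot", "fun"], "a"),
    (["fly", "fast", "shoot", "love"], "a"),
]
vocab = {w for tokens, _ in train for w in tokens}   # |V| = 7
counts = {"c": Counter(), "a": Counter()}
for tokens, label in train:
    counts[label].update(tokens)
priors = {"c": 2 / 5, "a": 3 / 5}                    # MLE priors from the 5 reviews (part a)

def posterior_ratio(doc, alpha):
    """Return P(c|doc) / P(a|doc) under add-alpha Naive Bayes (P(d) cancels out)."""
    score = {}
    for label in ("c", "a"):
        total = sum(counts[label].values())          # 9 tokens for c, 11 for a
        p = priors[label]
        for w in doc:
            p *= (counts[label][w] + alpha) / (total + alpha * len(vocab))
        score[label] = p
    return score["c"] / score["a"]

print(posterior_ratio(["fast", "couple"], alpha=0.5))   # ~2.69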
e) Yes: as 𝛼 grows, the smoothing begins to overwhelm the actual data, and the likelihood terms for all words approach 𝛼/(|𝑉|𝛼) = 1/|𝑉|. Therefore, the likelihoods for each class approach equivalent values, and the classifier simply chooses the class with the higher prior probability. In this case, that is class a, which is not the same as the class chosen with a smaller smoothing value.
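(As a follow-up to the sketch above: calling posterior_ratio(["fast", "couple"], alpha=100.0) with those same toy counts gives a ratio below 1, and as 𝛼 grows the ratio approaches the prior ratio 𝑃(c)/𝑃(a) = 2/3, so the decision flips to class a, in line with the reasoning here.)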
Exercise 3
In the Blodgett and O’Connor paper, the authors evaluated several language identification systems to
determine how accurately they identified tweets as English, depending on whether the tweets contained “African-American aligned” or “white-aligned” language.
a) In order to perform their experiment, B&O’C used an automatic method to divide the tweets into “African-American aligned” and “white-aligned” groups. Why do you think they used this automatic method, rather than getting human annotators to label the tweets? (You should be able to come up
with more than one reason.)
b) The language ID systems B&O’C tested are text classifiers. We talked about using precision and recall to evaluate classifiers, but this paper uses accuracy instead. Why?
Solutions
a) One reason might be that doing so is faster or cheaper than getting annotators to do this task. However, a more important reason is that it might actually be difficult to get annotators to do the task accurately. Consider that the annotators would need to be familiar with the kinds of language that people use on Twitter, and (more importantly) with the kinds of language that are used by each group. Non-AAVE speakers might not necessarily recognize when AAVE is being used, and conversely, AAVE speakers might not realize that certain words or phrases are specific to AAVE. Either problem would make the annotations inaccurate.
b) To run their experiment, B&O’C started by collecting a set of tweets that are assumed to be written in English. If this assumption is valid, then when the classifiers are evaluated with respect to this set, there is no difference between accuracy and recall, and precision is always 1. (The short sketch at the end of these solutions illustrates this point numerically.)
However, it’s important to realize that the lower accuracy of the ID systems on the set of English test tweets actually corresponds to a lower recall with respect to the set of all tweets. That is, if we re-ran the experiment on a set of arbitrary tweets (some of which would not be in English), then the ID systems would have a lower recall for identifying AA-aligned tweets as English, relative to white-aligned tweets. In fact, this is the main point of the paper!
As a follow-up, I would encourage you to consider the extent to which the original assumption about the test set is valid. That is, are there possible issues with how they collected the set of tweets in the experiment which might mean that some of them are not actually in English?
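To make the accuracy/recall point in part (b) concrete, here is a small Python sketch with invented labels (the counts are purely illustrative, not from the paper):

# Toy illustration: when every gold label is "English", precision is 1 and accuracy equals recall.
def precision_recall_accuracy(gold, pred, positive="en"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    correct = sum(g == p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    accuracy = correct / len(gold)
    return precision, recall, accuracy

# All-English gold set (as in the B&O'C evaluation): the classifier misses 3 of 10 tweets.
gold = ["en"] * 10
pred = ["en"] * 7 + ["other"] * 3
print(precision_recall_accuracy(gold, pred))   # (1.0, 0.7, 0.7)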
Discussion questions
You will need to post a brief response (a few sentences) to at least one of these questions by Tuesday at noon in your discussion group channel. I will release further instructions about this by the end of this week.
During the meeting, you should further discuss at least one of them, and ideally more than one. You don’t necessarily need to go over every example that your group members posted; instead, consider picking one or two examples and comparing them or discussing them in more detail. For example, can other group members think of some points that the original poster did not?
1. If you talk to a stranger from your country on the telephone (no video), what do you think they might infer about you because of the dialect you speak? (Could they tell what part of the country you are from? Might they form an opinion about your social background?) Do you think these opinions would be accurate?
2. The Blodgett and O’Connor paper studied AAVE, which is spoken in North America. Can you think of a language or dialect spoken in your own country (other than AAVE, if you’re from N America) that might be disadvantaged by NLP tools, compared to the standard language or dialect? What group(s) of people are likely to be affected, and in what ways or situations? (Some things to consider: are there situations where those people are more (or less) likely to use non-standard language? Are they likely to be fluent in the standard language or dialect as well as the non-standard one?)
3. The Swee Kiat reading discussed a distinction between “allocative” harms and “representational” (or “representative”) harms. Which, if either, of these do you think would arise from the algorithmic
bias that Blodgett and O’Connor demonstrated? Justify your answer.