Homework 1: Preprocessing and Text Classification
Student Name:
Student ID:
General Info
Due date: Sunday, 5 Apr 2020 2pm
Submission method: Canvas submission
Submission materials: completed copy of this iPython notebook
Late submissions: -20% per day (both week and weekend days counted)
Marks: 10% of mark for class (with 9% on correctness + 1% on quality and efficiency of your code)
Materials: See Using Jupyter Notebook and Python page on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn’t run on the marker’s machine, you will lose marks. You should use Python 3.
To familiarize yourself with NLTK, here is a free online book: Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. You may also consult the NLTK API.
Evaluation: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don’t ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.
You will be marked not only on the correctness of your methods, but also on the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to Python style requirements. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.
Updates: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.
Academic misconduct: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s Academic Misconduct policy where inappropriate levels of collusion or plagiarism are deemed to have taken place.
Overview
In this homework, you’ll be working with a collection of tweets. The task is to classify whether a tweet constitutes a rumour event. This homework involves writing code to preprocess data and perform text classification.
1. Preprocessing (5 marks)
Instructions: Run the code below to download the tweet corpus for the assignment. Note: the download may take some time. No implementation is needed.
In [ ]:
import requests
import os
from pathlib import Path

fname = 'rumour-data.tgz'
data_dir = os.path.splitext(fname)[0]  # 'rumour-data'

my_file = Path(fname)
if not my_file.is_file():
    url = "https://github.com/jhlau/jhlau.github.io/blob/master/files/rumour-data.tgz?raw=true"
    r = requests.get(url)

    # Save to the current directory
    with open(fname, 'wb') as f:
        f.write(r.content)

print("Done. File downloaded:", my_file)
Instructions: Run the code below to extract the downloaded archive (a gzip-compressed tar file). Note: the extraction may take a minute or two. No implementation is needed.
In [ ]:
import tarfile

# Decompress rumour-data.tgz
tar = tarfile.open(fname, "r:gz")
tar.extractall()
tar.close()

# Remove superfluous files (e.g. .DS_Store)
extra_files = []
for r, d, f in os.walk(data_dir):
    for file in f:
        if file.startswith("."):
            extra_files.append(os.path.join(r, file))
for f in extra_files:
    os.remove(f)

print("Extraction done.")
Question 1 (1.0 mark)
Instructions: The corpus data is in the rumour-data folder. It contains 2 sub-folders: non-rumours and rumours. As the names suggest, rumours contains all rumour-propagating tweets, while non-rumours has normal tweets. Within rumours and non-rumours, you’ll find some sub-folders, each named with an ID. Each of these IDs constitutes an ‘event’, where an event is defined as consisting of a source tweet and its reactions.
An illustration of the folder structure is given below:
rumour-data
    – rumours
        – 498254340310966273
            – reactions
                – 498254340310966273.json
                – 498260814487642112.json
            – source-tweet
                – 498254340310966273.json
    – non-rumours
Now we need to gather the tweet messages for rumours and non-rumour events. As the individual tweets are stored in json format, we need to use a json parser to parse and collect the actual tweet message. The function get_tweet_text_from_json(file_path) is provided to do that.
Task: Complete the get_events(event_dir) function. The function should return a list of events for a particular class of tweets (e.g. rumours), and each event should contain the source tweet message and all reaction tweet messages.
Check: Use the assertion statements in “For your testing” below for the expected output.
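To make the folder layout concrete, here is a minimal sketch that builds a throwaway event folder and reads it back. It is an illustration on toy data, not the required solution: the helper names read_tweet_text and list_json_files, the event ID 1001 and the tweet texts are all hypothetical, and putting the source tweet before the reactions is an assumption.

```python
import json
import os
import tempfile


def read_tweet_text(file_path):
    """Return the "text" field of one tweet stored as JSON."""
    with open(file_path) as json_file:
        return json.load(json_file)["text"]


def list_json_files(folder):
    """Return the sorted paths of the .json files directly inside folder."""
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        if name.endswith(".json")
    ]


# Build a toy event (hypothetical data) with one source tweet and two reactions.
with tempfile.TemporaryDirectory() as root:
    event_dir = os.path.join(root, "rumours", "1001")
    toy_tweets = {
        os.path.join("source-tweet", "1001.json"): "Is this true?",
        os.path.join("reactions", "1002.json"): "No source given.",
        os.path.join("reactions", "1003.json"): "Seems fake.",
    }
    for relative_path, text in toy_tweets.items():
        path = os.path.join(event_dir, relative_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as json_file:
            json.dump({"text": text}, json_file)

    # Assumption: the source tweet comes first, followed by its reactions.
    source_files = list_json_files(os.path.join(event_dir, "source-tweet"))
    reaction_files = list_json_files(os.path.join(event_dir, "reactions"))
    toy_event = [read_tweet_text(path) for path in source_files + reaction_files]

print(toy_event)
```

The real corpus may contain events whose layout differs from this toy folder, so it is worth checking what the extracted data actually looks like before relying on either sub-folder being present.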
In [ ]:
import json

def get_tweet_text_from_json(file_path):
    with open(file_path) as json_file:
        data = json.load(json_file)
        return data["text"]

def get_events(event_dir):
    event_list = []
    for event in sorted(os.listdir(event_dir)):
        ###
        # Your answer BEGINS HERE
        ###

        ###
        # Your answer ENDS HERE
        ###
    return event_list

# A list of events, where each event is a list of tweets (source tweet + reactions)
rumour_events = get_events(os.path.join(data_dir, "rumours"))
nonrumour_events = get_events(os.path.join(data_dir, "non-rumours"))

print("Number of rumour events =", len(rumour_events))
print("Number of non-rumour events =", len(nonrumour_events))
For your testing:
In [ ]:
assert(len(rumour_events) == 500)
assert(len(nonrumour_events) == 1000)
Question 2 (1.0 mark)
Instructions: Next we need to preprocess the collected tweets to create a bag-of-words representation. The preprocessing steps required here are: (1) tokenize each tweet into individual word tokens (using NLTK TweetTokenizer); and (2) remove stopwords (based on NLTK stopwords).
Task: Complete the preprocess_events(events) function. The function takes a list of events as input, and returns a list of preprocessed events. Each preprocessed event should be a dictionary mapping words to their frequencies.
Check: Use the assertion statements in “For your testing” below for the expected output.
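The shape of a bag-of-words dictionary can be sketched without NLTK. In the example below a regular expression and a tiny hand-written stopword set are stand-ins (assumptions) for TweetTokenizer and the NLTK stopword list, and lowercasing is a choice made only for this sketch; the function name toy_bag_of_words and the tweets are hypothetical.

```python
import re
from collections import Counter

# Stand-ins (assumptions): these replace NLTK's TweetTokenizer and stopword
# list so that the sketch runs without any NLTK data installed.
TOY_STOPWORDS = {"the", "is", "a", "this", "to"}
TOKEN_PATTERN = re.compile(r"#?\w+")


def toy_bag_of_words(tweets):
    """Count the non-stopword tokens over all tweets of one event."""
    counts = Counter()
    for tweet in tweets:
        tokens = TOKEN_PATTERN.findall(tweet.lower())
        counts.update(token for token in tokens if token not in TOY_STOPWORDS)
    return dict(counts)


toy_event = ["This is a rumour #speakup", "Is the rumour true?"]
toy_bow = toy_bag_of_words(toy_event)
print(toy_bow)
```

One dictionary is produced per event, pooling the source tweet and its reactions, which is why the number of preprocessed events should equal the number of events.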
In [ ]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from collections import defaultdict

tt = TweetTokenizer()
stopwords = set(stopwords.words('english'))

def preprocess_events(events):
    ###
    # Your answer BEGINS HERE
    ###

    ###
    # Your answer ENDS HERE
    ###

preprocessed_rumour_events = preprocess_events(rumour_events)
preprocessed_nonrumour_events = preprocess_events(nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))
For your testing:
In [ ]:
assert(len(preprocessed_rumour_events) == 500)
assert(len(preprocessed_nonrumour_events) == 1000)
Instructions: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. Run the code below to collect all unique hashtags in the preprocessed data. No implementation is needed.
In [ ]:
def get_all_hashtags(events):
    hashtags = set()
    for event in events:
        for word, frequency in event.items():
            if word.startswith("#"):
                hashtags.add(word)
    return hashtags

hashtags = get_all_hashtags(preprocessed_rumour_events + preprocessed_nonrumour_events)
print("Number of hashtags =", len(hashtags))
Question 3 (2.0 marks)
Instructions: Our task here is to tokenize the hashtags by implementing a reversed version of the MaxMatch algorithm discussed in class, where matching begins at the end of the hashtag and progresses backwards. NLTK has a list of words that you can use for matching; see the starter code below. Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatizer before matching. When lemmatising a word, you also need to provide the part-of-speech tag of the word. You should use nltk.tag.pos_tag for doing part-of-speech tagging.
Note that the list of words is incomplete; if you are unable to make any longer match, your code should default to matching a single letter. Each tokenized hashtag should be a list of strings, and the starter code below uses slicing to print 20 of the tokenized hashtags.
For example, given “#speakup”, the algorithm should produce: [“#”, “speak”, “up”]. And note that you do not need to delete the hashtag symbol (“#”) from the tokenised outputs.
Task: Complete the tokenize_hashtags(hashtags) function by implementing a reversed MaxMatch algorithm. The function takes as input a set of hashtags, and returns a dictionary where key=”hashtag” and value=”a list of word tokens”.
Check: Use the assertion statements in “For your testing” below for the expected output.
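The “#speakup” example above can be traced with a minimal sketch of reversed MaxMatch. This is a simplified illustration: it uses a toy vocabulary instead of the NLTK word list and omits the lemmatisation and part-of-speech tagging the question asks for. The function name reverse_max_match is hypothetical.

```python
def reverse_max_match(text, vocabulary):
    """Greedily split text from right to left into the longest known words.

    When no suffix of the remaining text is in the vocabulary, fall back
    to a single character.
    """
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shorter ones.
        for start in range(end):
            candidate = text[start:end]
            if candidate in vocabulary or start == end - 1:
                tokens.append(candidate)
                end = start
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


toy_vocabulary = {"speak", "up"}  # a set gives constant-time lookups
print(reverse_max_match("#speakup", toy_vocabulary))
```

Because matching runs from the end, the results can differ from forward MaxMatch whenever a hashtag can be segmented in more than one way.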
In [ ]:
from nltk.corpus import wordnet

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = set(nltk.corpus.words.words())  # the word list provided by NLTK, stored as a set

def tokenize_hashtags(hashtags):
    ###
    # Your answer BEGINS HERE
    ###

    ###
    # Your answer ENDS HERE
    ###

tokenized_hashtags = tokenize_hashtags(hashtags)

print(list(tokenized_hashtags.items())[:20])
For your testing:
In [ ]:
assert(len(tokenized_hashtags) == len(hashtags))
Question 4 (1.0 mark)
Instructions: Now that we have the tokenized hashtags, we need to go back and update the bag-of-words representation for each event.
Task: Complete the update_event_bow(events) function. The function takes a list of preprocessed events, and for each event, it looks for every hashtag it has and updates the bag-of-words dictionary with the tokenized hashtag tokens. Note: you do not need to delete the counts of the original hashtags when updating the bag-of-words (e.g., if a document has “#speakup”:2 in its bag-of-words representation, you do not need to delete this hashtag and its counts).
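As a rough illustration of the update, the sketch below adds the tokens of each hashtag to a toy bag-of-words. Two points are assumptions rather than requirements: each token's count is increased by the frequency of the hashtag it came from, and the function returns an updated copy, whereas the starter code calls update_event_bow without using a return value. The name add_hashtag_tokens and the data are hypothetical.

```python
from collections import Counter


def add_hashtag_tokens(bow, tokenized):
    """Return a copy of bow that also counts the tokens of its hashtags."""
    updated = Counter(bow)
    for word, frequency in bow.items():
        if word.startswith("#"):
            # Assumption: each token is counted once per hashtag occurrence.
            for token in tokenized.get(word, []):
                updated[token] += frequency
    return dict(updated)


toy_bow = {"#speakup": 2, "speak": 1}
toy_tokenized = {"#speakup": ["#", "speak", "up"]}
toy_updated = add_hashtag_tokens(toy_bow, toy_tokenized)
print(toy_updated)
```

The original hashtag entry is kept, as the note above allows.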
In [ ]:
def update_event_bow(events):
    ###
    # Your answer BEGINS HERE
    ###

    ###
    # Your answer ENDS HERE
    ###

update_event_bow(preprocessed_rumour_events)
update_event_bow(preprocessed_nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))
2. Text Classification (4 marks)
Question 5 (1.0 mark)
Instructions: Here we are interested in text classification: predicting, given a tweet and its reactions, whether it is a rumour or not. The task here is to create training, development and test partitions from the preprocessed events and convert the bag-of-words representation into feature vectors.
Task: Create training, development and test partitions with a 60%/20%/20% ratio. Remember to preserve the ratio of rumour/non-rumour events for all your partitions. Next, turn the bag-of-words dictionary of each event into a feature vector, using scikit-learn DictVectorizer.
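One way to preserve the rumour/non-rumour ratio is to split each class separately and then combine the pieces. The sketch below shows this on toy data with the same class sizes as the corpus; the function names, the label encoding (1 for rumour, 0 for non-rumour) and the absence of shuffling are assumptions made for the illustration.

```python
def split_60_20_20(items):
    """Split a list into 60% / 20% / 20% slices, keeping the original order."""
    train_end = len(items) * 6 // 10
    dev_end = len(items) * 8 // 10
    return items[:train_end], items[train_end:dev_end], items[dev_end:]


def stratified_split(rumours, non_rumours):
    """Split each class separately so every partition keeps the class ratio."""
    partitions = []
    for rumour_part, non_rumour_part in zip(
        split_60_20_20(rumours), split_60_20_20(non_rumours)
    ):
        data = rumour_part + non_rumour_part
        # Assumed label encoding: 1 = rumour, 0 = non-rumour.
        labels = [1] * len(rumour_part) + [0] * len(non_rumour_part)
        partitions.append((data, labels))
    return partitions


# Toy bag-of-words dictionaries (hypothetical) with the corpus class sizes.
toy_rumours = [{"rumour": 1}] * 500
toy_non_rumours = [{"news": 1}] * 1000
train, dev, test = stratified_split(toy_rumours, toy_non_rumours)
for name, (data, labels) in (("train", train), ("dev", dev), ("test", test)):
    print(name, "size =", len(data), "rumours =", sum(labels))
```

Each partition here contains one rumour for every two non-rumours, matching the 500/1000 ratio of the full data. When vectorizing, the usual practice is to fit the vectorizer on the training partition only and reuse it to transform the other two.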
In [ ]:
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()

###
# Your answer BEGINS HERE
###

###
# Your answer ENDS HERE
###

print("Vocabulary size =", len(vectorizer.vocabulary_))
Question 6 (2.0 marks)
Instructions: Now, let’s build some classifiers. Here, we’ll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation (hyper)parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do not use cross-validation in the training set, or involve the test set in any way. You don’t need to show all your work, but you do need to print out the accuracy with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output.
Task: Implement two text classifiers: Naive Bayes and Logistic Regression. Tune the hyper-parameters of these classifiers and print the task performance for different hyper-parameter settings.
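The tuning loop can be sketched on toy data as follows. The example assumes that the smoothing parameter alpha of MultinomialNB and the inverse regularisation strength C of LogisticRegression are the hyper-parameters of interest; the candidate values, the helper dev_accuracy and the tiny data set are hypothetical, so the printed accuracies say nothing about the tweet corpus.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy bag-of-words dictionaries (hypothetical data, not the tweet corpus).
train_bows = [
    {"fake": 2, "claim": 1},
    {"fake": 1, "hoax": 1},
    {"official": 2, "report": 1},
    {"report": 1, "confirmed": 2},
]
train_labels = [1, 1, 0, 0]
dev_bows = [{"hoax": 1, "claim": 1}, {"official": 1, "confirmed": 1}]
dev_labels = [1, 0]

# Fit the vocabulary on the training data only, then reuse it for the dev set.
vectorizer = DictVectorizer()
x_train = vectorizer.fit_transform(train_bows)
x_dev = vectorizer.transform(dev_bows)


def dev_accuracy(model):
    """Train on the training partition and score on the development partition."""
    model.fit(x_train, train_labels)
    return accuracy_score(dev_labels, model.predict(x_dev))


candidates = (0.01, 0.1, 1.0, 10.0)
nb_scores = {a: dev_accuracy(MultinomialNB(alpha=a)) for a in candidates}
lr_scores = {
    c: dev_accuracy(LogisticRegression(C=c, max_iter=1000)) for c in candidates
}

for alpha, score in nb_scores.items():
    print(f"Naive Bayes, alpha={alpha}: dev accuracy = {score:.3f}")
for c, score in lr_scores.items():
    print(f"Logistic Regression, C={c}: dev accuracy = {score:.3f}")
```

Labelling every printed line with the classifier and the setting lets the output be read without looking at the code, and a wider or finer grid of values may be needed to show that the chosen setting is near-optimal.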
In [ ]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
###
# Your answer BEGINS HERE
###
###
# Your answer ENDS HERE
###
Question 7 (1.0 mark)
Instructions: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both accuracy and macro-averaged F-score for each classifier. Be sure to label your output.
Task: Compute test performance in terms of accuracy and macro-averaged F-score for both Naive Bayes and Logistic Regression, using optimal hyper-parameter settings.
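To show what the two metrics measure, here is a small hand-rolled sketch on toy labels. The macro-averaged F-score is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The function names and label lists are hypothetical; in the notebook itself, accuracy_score and f1_score(..., average="macro") from sklearn.metrics compute the same quantities.

```python
def accuracy(gold, predicted):
    """Fraction of predictions that equal the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores over the gold labels."""
    f_scores = []
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        true_pos = sum(g == label and p == label for g, p in pairs)
        false_pos = sum(g != label and p == label for g, p in pairs)
        false_neg = sum(g == label and p != label for g, p in pairs)
        # F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0.
        denominator = 2 * true_pos + false_pos + false_neg
        f_scores.append(2 * true_pos / denominator if denominator else 0.0)
    return sum(f_scores) / len(f_scores)


toy_gold = [1, 1, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 1]
print(f"Accuracy = {accuracy(toy_gold, toy_pred):.3f}")
print(f"Macro-averaged F-score = {macro_f1(toy_gold, toy_pred):.3f}")
```

In this toy case the per-class F1 scores are 0.75 for class 0 and 0.5 for class 1, giving a macro average of 0.625, which is lower than the accuracy of about 0.667 because the smaller class is predicted less well.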
In [ ]:
###
# Your answer BEGINS HERE
###
###
# Your answer ENDS HERE
###