CS918: 2017-18
Exercise one: Text preprocessing, N-grams and language models.
Submission: 12pm Thursday March 1st 2018
Notes:
a) This exercise will contribute 15% towards your overall mark.
b) Submission should include Python code, a Jupyter notebook and a
short report.
Preparation: Getting to know Python
Practice using the Jupyter notebooks from the module website.
For this exercise you will be using the “SIGNAL NEWS1” corpus provided on the
module website, available through the following link:
https://warwick.ac.uk/fac/sci/dcs/teaching/material/cs918/signal-news1.tar.bz2
The corpus provides news stories formatted in JSON. Each line contains a JSON
item with a news story. You should be using the “content” field of the news
stories in this exercise.
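
As an illustration of the corpus format described above (one JSON item per line, with the text in the “content” field), the following is a minimal sketch of how the stories might be read. The function name iter_contents and the file name in the usage comment are assumptions made for this example, not names given by the module.

```python
import json


def iter_contents(lines):
    """Yield the "content" field of each JSON item, one item per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        story = json.loads(line)
        # Fall back to an empty string if a story has no "content" field.
        yield story.get("content", "")


# Usage with the extracted corpus (the file name here is hypothetical):
# with open("signal-news1.jsonl", encoding="utf-8") as f:
#     texts = list(iter_contents(f))
```

Reading line by line keeps memory use low, since only one story is parsed at a time.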
You will be delivering the Python code that you developed, a Jupyter notebook
that comments on your code and a short report (2-3 pages) that describes your
work. For the Python code, you should make sure to develop code that runs on
Python 3, and to deliver it as a single, standalone Python script that works on the
corpus files as they were provided (i.e. we should be able to run it with Python 3
having the “SIGNAL NEWS1” corpus downloaded).
The exercise consists of three parts:
Part A: Text preprocessing (25 marks)
1. After lowercasing all the text, use regular expressions to parse and clean
the texts:
a) Remove all non-alphanumeric characters except spaces, i.e. keep only
alphanumeric characters and spaces. [4 marks]
b) Remove words with 3 characters or fewer. [4 marks]
c) Remove numbers that are fully made of digits (e.g. you should remove
the number ‘5’, but in the case of ‘5pm’, made of both digits and letters,
you should keep it as is, without removing the digit that is part of the
word). [4 marks]
d) Remove URLs. Note that URLs may appear in different forms, e.g.
“http://www.*”, “http://domain”, “https://www.*”, “www.*”,
“domain.com”. [7 marks]
NOTE: The preprocessing steps above may need to be applied in a different
order, not necessarily as listed above.
2. Use an English lemmatiser to process all the words. [6 marks]
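
The cleaning steps in Part A can be sketched with Python's re module as follows. This is one possible ordering, not a reference solution: URLs are removed first, because stripping punctuation beforehand would break them into ordinary-looking words. The URL pattern, the short list of top-level domains, the ASCII-only reading of “alphanumeric” and the function name preprocess are all assumptions of this sketch. The lemmatiser is passed in as an optional callable; NLTK's WordNetLemmatizer().lemmatize is one option, provided its WordNet data is installed.

```python
import re

# Heuristic URL pattern (an assumption): scheme-prefixed URLs, "www." URLs,
# and bare domains ending in one of a few common top-level domains.
URL_RE = re.compile(
    r"https?://\S+|www\.\S+|\b(?:[a-z0-9-]+\.)+(?:com|org|net|edu|gov|uk)\b(?:/\S*)?"
)
NON_ALNUM_RE = re.compile(r"[^a-z0-9 ]")       # ASCII letters and digits only
DIGITS_ONLY_RE = re.compile(r"\b\d+\b")        # tokens made entirely of digits
SHORT_WORD_RE = re.compile(r"\b[a-z0-9]{1,3}\b")  # words of 3 characters or fewer


def preprocess(text, lemmatise=None):
    """Return the cleaned tokens of one news story."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # step d) first, while URLs are intact
    text = NON_ALNUM_RE.sub(" ", text)    # step a)
    text = DIGITS_ONLY_RE.sub(" ", text)  # step c): "5" goes, "10pm" stays
    text = SHORT_WORD_RE.sub(" ", text)   # step b)
    tokens = text.split()
    if lemmatise is not None:
        tokens = [lemmatise(token) for token in tokens]
    return tokens
```

Note that a token such as “5pm” survives step c) but is still removed by step b), because it has only three characters.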
Part B: N-grams (20 marks)
With all the texts preprocessed as above, compute the following:
1. Compute N (number of tokens) and V (vocabulary size). [5 marks]
2. List the top 25 bigrams by number of occurrences in the entire
corpus. [5 marks]
3. Using the lists of positive and negative words provided with the corpus,
count the occurrences of positive words and of negative words in the corpus.
[5 marks]
4. Compute the number of news stories with more positive than negative
words, as well as the number of news stories with more negative than
positive words. News stories with a tie (same number of positive and
negative words) should not be counted. [5 marks]
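
The four Part B quantities can be gathered in a single pass over the preprocessed stories, as in the hedged sketch below. It assumes each story is already a list of tokens and that the positive and negative word lists have been loaded into Python sets. The function name corpus_statistics and the keys of the returned dictionary are illustrative only. Bigrams are counted within each story, so no bigram spans two stories, which is a design choice of this sketch rather than a stated requirement.

```python
from collections import Counter


def corpus_statistics(docs, positive, negative, top_k=25):
    """docs: iterable of token lists; positive/negative: sets of words."""
    unigrams, bigrams = Counter(), Counter()
    pos_total = neg_total = 0
    more_pos = more_neg = 0
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs in one story
        pos = sum(1 for t in tokens if t in positive)
        neg = sum(1 for t in tokens if t in negative)
        pos_total += pos
        neg_total += neg
        if pos > neg:
            more_pos += 1
        elif neg > pos:
            more_neg += 1  # ties are counted in neither group
    return {
        "N": sum(unigrams.values()),   # number of tokens
        "V": len(unigrams),            # vocabulary size
        "top_bigrams": bigrams.most_common(top_k),
        "positive_words": pos_total,
        "negative_words": neg_total,
        "stories_more_positive": more_pos,
        "stories_more_negative": more_neg,
    }
```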
Part C: Language models (20 marks)
1. Compute a bigram language model based on the first 16,000 rows of the
corpus. Beginning with the word “They”, produce a sentence of 10 words by
appending the most likely next word each time. [10 marks]
2. Compute the perplexity of the model on the remaining rows of the corpus
(rows 16,001 onwards). [10 marks]
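
A minimal sketch of the Part C steps follows: counting bigrams, greedily appending the most frequent next word, and computing perplexity as exp(-(1/M) * sum of log P(w_i | w_(i-1))) over the M bigrams of the held-out stories. The exercise does not name a smoothing method, so add-one (Laplace) smoothing with one extra slot for unknown words is used here purely as an assumption, to avoid zero probabilities on unseen bigrams. The class name BigramModel and its method names are hypothetical. Since the text is lowercased in Part A, the start word would presumably be passed as "they".

```python
import math
from collections import Counter


class BigramModel:
    def __init__(self, docs):
        self.unigrams = Counter()
        self.followers = {}             # previous word -> Counter of next words
        self.context_totals = Counter()  # how often each word starts a bigram
        for tokens in docs:
            self.unigrams.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.followers.setdefault(prev, Counter())[word] += 1
                self.context_totals[prev] += 1
        self.vocab_size = len(self.unigrams)

    def prob(self, prev, word):
        """Add-one smoothed P(word | prev); the +1 in the denominator
        reserves one slot for unknown words (an assumption of this sketch)."""
        count = self.followers.get(prev, {}).get(word, 0)
        return (count + 1) / (self.context_totals[prev] + self.vocab_size + 1)

    def generate(self, start, length=10):
        """Greedily append the most frequent next word until `length` words."""
        words = [start]
        while len(words) < length:
            candidates = self.followers.get(words[-1])
            if not candidates:
                break  # the last word was never followed by anything
            words.append(candidates.most_common(1)[0][0])
        return words

    def perplexity(self, docs):
        log_sum, n_bigrams = 0.0, 0
        for tokens in docs:
            for prev, word in zip(tokens, tokens[1:]):
                log_sum += math.log(self.prob(prev, word))
                n_bigrams += 1
        return math.exp(-log_sum / n_bigrams) if n_bigrams else float("inf")
```

With docs holding the preprocessed stories in corpus order, the split described above would be BigramModel(docs[:16000]) for training and perplexity(docs[16000:]) for evaluation.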
Total: 65 marks