Introduction to Natural Language Processing CSC 485E / SENG 480C / CSC 581A Assignment 1

Introduction to Natural Language Processing
CSC 485E / SENG 480C / CSC 581A
Assignment 1
Due May 25, 2017, 11:55 PM
One late day permitted for 20% reduction in grade.
No late assignments accepted more than 24 hours after the due date. Undergrad and Grad: 100 points total
All graphs and text answers should be handed in as one pdf (no word docs!!). You should submit two python code files, bayes.py and q5.py
Q1: Naïve Bayes to predict movie review sentiment. (Grad and Undergrad 50 points)
Code for Q1 should be written using the skeleton provided, and handed in with all features from sections b) through e) disabled, and smoothing α=1.
For this question we will be using the sentiment polarity dataset available from http://www.cs.cornell.edu/people/pabo/movie-review-data/. There are many versions available online. To avoid confusion, please use the version attached to the assignment on connex. There should be 5331 positive and 5331 negative reviews. Use the first 5000 from each file as your training set, and the remaining 331 as your test set (for a total of 10000 training examples and 662 test examples).
You should use the scikit-learn CountVectorizer to turn each training example into a row of your data matrix X. Except where described below, you should use this parameterization of CountVectorizer:
CountVectorizer(lowercase=True, stop_words=None, max_df=1.0, min_df=1, max_features=None, binary=True)
a) (Undergrad: 15 points; Grad: 10 points)
Implement Naïve Bayes, assuming that each feature is binary (can only take on values 0 or 1). This is called Bernoulli Naïve Bayes, because each feature is a Bernoulli random variable. Essentially we don’t care if a word appears multiple times in a sentence, we are just recording if it occurs at least once or not at all.
Use the skeleton code provided, as we will be automatically testing your code. Recall that for Naïve Bayes, the classification decision is:
!
? = ??????!(? ? !!!? ?! ? )
where y is the class, p is the total number of attributes (features) in x, and xi is the ith feature of the vector x.

Using the training data, you must calculate P(y) for each of the classes (y), as well as P(xi=0|y) and for each y and every feature xi. Note that P(xi=1|y)=1- P(xi=0|y), so you don’t need to explicitly calculate that value. Consult the class notes for information on how to calculate these values.
Implement additive smoothing with α =1, as discussed in class.
We will be testing your code by creating an instance of MyBayesClassifier and passing it a new binary dataset, so don’t assume that you know the number of features or the number of classes.
Hint: Naïve Bayes can be made much faster by using numpy and linear algebra. Without this, you will need to allow yourself more time to run the experiments required below.
b) Smoothing (Undergrad: 15 points; Grad: 10 points)
Implement additive smoothing https://en.wikipedia.org/wiki/Additive_smoothing
so that you can vary α in your code. Plot of the accuracy on the test set as a function of α, varying from 0.1 to 3 (use steps of 0.1). Based on your experiment, what is the optimal value for alpha?
c) Stemming (Undergrad: 10 points; Grad: 10 points)
What is stemming? Why might we want to use it for this task?
Use the NLTK Porter stemmer to perform stemming on this dataset, and graph your results both with and without stemming, for all values of α from part b. http://www.nltk.org/_modules/nltk/stem/porter.html
Does stemming help for this classification task (i.e. is it better than the results for part b)? Reason about your answer.
d) Stop words (Undergrad: 10 points; Grad: 10 points)
Scikit-learn comes with support for the removal of stop words. What are stop words? Explain one method for producing a list of stop words.
Change the parameters for the CountVectorizer to remove English stop words (stop_words=’english’). Plot your results with both stop word removal and stemming on, using all values of α from part b.
Does stop word removal help for this classification task? Reason about your answer.
e) Character n-grams (Graduate students only: 10 pts)
The CountVectorizer also has support for using character n-grams rather than words as features.

Explain how the character ngram feature works. Explain how word boundaries are detected and handled.
Explore using character n-grams for this classification problem. Report your results using ranges of ngrams [ (2,3), (2,4), (2,5) (2,6) ], both with and without respecting word boundaries. For this question, set smoothing α=1, stemming= false, stop words = None. Character ngrams can be quite memory intensive, so set max_features=100000 for this question only.
Do character n-grams perform better for this classification task? Why or why not? Is this technique a good replacement for stemming?
Q2: POS tagging (Undergrad: 10 points; Grad: 10 points)
Explore existing parsers and try to find some POS errors that they make (you should ignore the tree structure returned by parsers for now). You can use whichever parsers you want, but here are some options that include a web interface for your convenience.
Parsers with web interface:
• http://tomato.banatao.berkeley.edu:8080/parser/parser.html • http://nlp.stanford.edu:8080/parser/
• http://demo.ark.cs.cmu.edu/parse
• https://spacy.io/demos/displacy
• http://lil.cs.washington.edu/easysrl/demo.cgi
These resources might be useful in decoding the part-of-speech tags:
• http://cs.jhu.edu/~jason/465/hw-parse/treebank-notation.pdf
• http://www.surdeanu.info/mihai/teaching/ista555-
fall13/readings/PennTreebankConstituents.html
• http://universaldependencies.org/u/pos/index.html
You might notice that the output of different parsers is based on different conventions, label sets, etc. These details shouldn’t matter for the purpose of this assignment. We’re looking for mistagged words only for this assignment.
Explore at least two different parsers. Find four different sentences whose parses you determine to be incorrect. Specify which parser gave each incorrect parse. Explain why you think the parse is incorrect. Your four sentences should show different types of errors. You don’t have to identify all errors in those sentences. It suffices to identify some errors.

Q3: The Google Ngram Viewer (Undergrad: 10 points; Grad: 10 points)
The Google ngram viewer (https://books.google.com/ngrams) is a great resource that lets you compare the counts of words and phrases over time. Use it to create a graph that shows an interesting pattern, and explain why you think that pattern occurred.
Q5: Ngram Models (Undergrad: 20 points; Grad: 20 points)
For this question, no code skeleton is provided, but please use python. Please hand in the code you write for this question in a file named q5.py (for convenience, an empty q5.py file is provided in the assignment zip).
a) (Undergrad: 10 points; Grad: 5 points) Write a program to compute unsmoothed unigrams and bigrams on the complete works of Shakespeare (download here https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt).
Report the top 15 unigrams and bigrams. Do they look Shakespearian?
Compare the unigram counts those reported by Peter Norvig using Google Books
(http://norvig.com/mayzner.html). What differences do you see?
b) (Undergrad: 10 points; Grad: 10 points) Now use your computed unigrams and bigrams to generate random sentences using Shannon’s method. Generate 5 random sentences and include them in your write up. Are any coherent? Do any seem Shakespearian?
You will need to add sentence start and end tags to the sentences, and consider punctuation as a token. You may want to use the NLTK to detect sentence boundaries (http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize- and-word-tokenize). Be aware that NLTK has support for training and generating from an n-gram model. I expect that you will write your own code for this question, and not use the NLTK ngram code.
c) (Grad only: 5 points) Find a new text dataset and perform generation in part b again. Include a reference to the dataset you chose. Compare your results to the sentences in part b. Can you tell the model was built with a different text dataset?
Q6: HMMs (Undergrad: 10 points; Grad: 10 points)
To answer this question, use the HMM from figure 6.3 of the textbook.
a) (Undergrad: 5 points; Grad: 5 points) Use the forward algorithm to compute the probability of seeing the ice cream eating sequence “3,2,1,1,3”. Show your work.
b) (Undergrad: 5 points; Grad: 5 points) Use the Viterbi algorithm to compute the most likely weather sequences when we see the same ice cream eating sequence “3,2,1,1,3”. Show your work.