Introduction to Natural Language Processing CSC 485E / SENG 480C / CSC 581A Assignment 2

Introduction to Natural Language Processing

CSC 485E / SENG 480C / CSC 581A

Assignment 2

Due June 15, 2017, 11:55 PM
One late day permitted for 20% reduction in grade.
No late assignments accepted more than 24 hours after the due date. Hand in a pdf for your written answers and .py files for your code Undergrad and Grad: 70 points total

Q1: Parse Trees (Undergrad & Grad: 5 points)

Draw parse tree structures for the following phrases. No particular CFG is given for this question; the non-terminals should be fairly obvious. Because they are not sentences, they will not have an S at the root.

  1. a)  Dallas
  2. b)  from Denver
  3. c)  after five p.m.
  4. d)  arriving in Washington
  5. e)  on Thursday

Q2: PCFGs (Undergrad & Grad: 5 points)

In Assignment 1 you generated sentences using an ngram model. One could also generate sentences using a PCFG. To generate a new sentence, start with an S node, and select PCFG rules (weighted by their probability) to generate all the way down to terminals. Do you think the sentences generated by the PCFG will be better or worse than those generated by a bigram model? A trigram model?

Q3: Building a CFG/PCFG from a Treebank (Undergrad & Grad: 30 points)

Consider the following newly created Treebank

Using the extensive collection of trees above:

  1. a)  Write down the CFG (start symbol, terminals, non-terminals, rules) you

    observe in the trees. [10 marks]

  2. b)  Is your grammar from part a) in CNF? If not, convert it. We covered CNF in

    only very high-level terms in class, so you may want to read up on the

    specifics https://en.wikipedia.org/wiki/Chomsky_normal_form [5 marks]

  3. c)  Convert your CFG from part a) into a PCFG [10 marks]
  4. d)  Consider the sentence “Jeff pronounced that Bill ran elegantly.” Using your

    grammar from part c) list all of the parse trees and their probabilities (not just the most probable tree). You do not need to use CKY to find the parse trees, they should be fairly simple to work out by hand. Give the probability for each tree. [5 marks]

Q4: Evaluating semantic models (Undergrad & Grad: 30 points)

For this question we will use the following semantic models: GLOVE (50 dimensional version)

http://nlp.stanford.edu/data/glove.6B.zip

Counter-fitted vectors

https://github.com/nmrksic/counter-fitting/blob/master/word_vectors/counter- fitted-vectors.txt.zip

A) Evaluate the vectors on this word analogy task (Undergrad: 15 points; Grad: 10 points)
Word vectors have been shown to be useful for word analogy tasks.
See this page for an overview https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

We will use a word analogy task from Mikolov et al, 2013 http://download.tensorflow.org/data/questions-words.txt
Each row of this file has four words. For each set of words (a, b, c, d), find the corresponding word vectors xb, xa, xc and xd in each of the above models. Use those vectors to compute a new vector y:
y=xb –xa +xc
Now, check to see if xd is the closest vector to y (via cosine distance) in the model, and if xd is one of the five closest vectors in the model. Calculate the percentage of times these two tests pass.

Before you do this task, you should normalize all vectors (you can use sklearn.preprocessing.normalize L2 norm for this). To save time, you can reduce the search space to only those words that appear in the questions-words.txt file.

If any of the 4 words in an analogy doesn’t appear in the word model, skip that test. See section 5 of this paper for more details

http://www.aclweb.org/anthology/N13-1090

Report, for both word models:

  • The percentage of tests for which xd was the closest word
  • The percentage of tests for which xd was one of the 5 closest word vectors (this

    metric is called precision@k for k=5)
    Also, hand in the code you wrote for this question in a file named q4a.py

B) Evaluate the vectors on a word similarity task (Undergrad: 15 points; Grad: 10 points)
Here we will compare the distance between words in our vector model to people’s perceived distance between word meanings.

For this task, we will calculate the cosine similarity for the vectors of all pairs of words defined in the test set (http://www.cl.cam.ac.uk/~fh295/SimLex-999.zip). That is, for every pair of words in each row of the SimLex-999 file, calculate the cosine distance between the corresponding word vectors.

Then, calculate the spearman correlation between the human ratings (found in the fourth column of the SimLex-999 file) and your calculated cosine distances from the step above. This number tells us if the distances in the word vector space related to the distances intuitively understood by humans.

See the corresponding website

http://www.cl.cam.ac.uk/~fh295/simlex.html

And the paper

https://arxiv.org/pdf/1408.3456v1.pdf

If either word of a word pair doesn’t appear in the word model, skip that pair.

Report the final correlation score for both word vector models. Hand in the code you wrote in a file names q4b.py

C) Are word similarity tasks the right way to evaluate word vector models? (Grad only: 10 points)

Read this paper https://arxiv.org/pdf/1605.02276.pdf
Choose one of the issues outlined in section 2. Describe the problem and some associated solutions.

Bonus Question (Undergrad & Grad: 5 points):

Download the following implementation of a character-level RNN

https://gist.github.com/karpathy/d4dee566867f8291f086

I had really hoped to include this as a non-bonus question on the assignment, but the RNN didn’t work out of the box. After an hour of tinkering, I gave up. Here’s your chance to show me up!
Train the RNN on the Shakespeare data from Assignment 1. Get the model to the point that it can generate coherent sentences.

Report:

  • 5 generated sentences
  • a description of what you changed to get the RNN to work
  • your altered version of the code.

    Some things you might want to consider

  1. 1)  Maybe I didn’t train long enough?
  2. 2)  Maybe the temperature is too high during sampling? See description of

    temperature here https://github.com/karpathy/char-rnn but note that

    temperature is not implemented in the simple code linked to above.

  3. 3)  Maybe need to do something smarter with the learning rate?