Assignment2 PART2 NLP
October 5, 2019
Assignment 2: Part 1 (ML) [7 pts]
In sections 1.2-1.5 of the Machine Learning notebook, there are tasks for you to complete. Be sure to submit BOTH the Machine Learning demo notebook and this notebook.
Assignment 2: Part 2 (NLP) [8 pts]
2.1 Fast Text [3 pts]
FastText [1] is a neural-network-based text classification model designed to be computationally efficient. Your task is to implement the FastText algorithm by completing the code in the following cells. You will need to read through the provided fastText.pdf paper, which explains the algorithm. You do not need to implement hierarchical softmax (section 2.1 of the paper) or n-gram features (section 2.2); you only need to implement the basic architecture described in section 2.
The FastText model will be trained using mini-batch gradient descent. When the training data are sequences of variable length, we cannot simply stack multiple training sequences into one tensor. Instead, it is common to assume a maximal sequence length, so that all sequences in a batch fit into tensors of the same dimensions. Sequences shorter than the maximal length are extended with a special pad word so that all sequences in a batch have the same length. The pad word is a special token whose embedding is an all-zero vector, so that the presence of pad words does not change the output of the model. In this code the pad word has an ID of 0; when implementing your embeddings, you should ensure that this ID is always embedded to a vector of all zeros. Additionally, you will need to know how many words are in each input sentence (before it was padded); this is provided as an input parameter to your FastText model.
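To make the padding behaviour concrete, here is a small illustrative sketch (not part of the required solution; the toy IDs, vocabulary size, and embedding size are made up) showing one way to keep the pad embedding at zero using PyTorch's padding_idx argument, and how the provided sentence lengths can be used to average only over the real words:

import torch
import torch.nn as nn

# Toy batch: two sentences padded to length 4 with the pad ID 0.
batch = torch.tensor([[5, 2, 7, 0],   # true length 3
                      [3, 9, 0, 0]])  # true length 2
sens_lengths = torch.tensor([3, 2])

# padding_idx=0 keeps the embedding vector for ID 0 fixed at all zeros.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
embedded = embedding(batch)  # shape (2, 4, 4)

# Pad positions embed to zero, so summing over the sequence dimension and
# dividing by the true lengths averages over the real words only.
sentence_repr = embedded.sum(dim=1) / sens_lengths.unsqueeze(1).float()
print(sentence_repr.shape)  # torch.Size([2, 4])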
[1] Joulin, Armand; Grave, Edouard; Bojanowski, Piotr; Mikolov, Tomas. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759, 2016. [INCLUDED AS PART OF ASSIGNMENT 2 .ZIP PACKAGE]
In [ ]: # coding: utf-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import collections
import math
import os
import random
import nltk
nltk.download('punkt')
from nltk import word_tokenize
from collections import namedtuple
import sys, getopt
from random import shuffle
num_classes = 3
learning_rate = 0.005
num_epochs = 3
batch_size = 10
embedding_dim = 10
You need to complete the forward() and __init__() functions below [3 pts]
In [ ]: class FastText(nn.Module):
    """Define the computation graph for fasttext model."""

    def __init__(self, vocab_size, num_classes, embedding_dim, learning_rate):
        """Init the model with default parameters/hyperparameters."""
        super(FastText, self).__init__()
        self.num_classes = num_classes
        self.embedding_dim = embedding_dim
        self.learning_rate = learning_rate
        self.loss_func = F.cross_entropy

        # TODO: create all the variables (weights) that the model needs here
        raise NotImplementedError

        self.optimizer = torch.optim.SGD(self.parameters(), lr=learning_rate)

    def forward(self, x, sens_lengths):
        # TODO: implement the FastText computation
        raise NotImplementedError
        return x
In [ ]: from fasttext import load_question_2_1, train_fast_text
word_to_id, train_data, valid_data, test_data = load_question_2_1('question_2-1_data')
model = FastText(len(word_to_id)+2, num_classes, embedding_dim=embedding_dim, learning_rate=learning_rate)
model_file_path = os.path.join('models', 'fasttext_model_file_q2-1')
train_fast_text(model, train_data, valid_data, test_data, model_file_path, batch_size=batch_size)
2.2 Question Classification [3 pts]
Understanding questions is a key problem in chatbots and question answering systems. In the open-domain setting, it is difficult to find the right answers in a huge search space. One approach to tackling this problem is to categorise questions into a finite set of semantic classes, where each semantic class corresponds to a small answer space.
Your task is to implement a question classification model in PyTorch and apply it to the question 2.2 data provided in this assignment.
Notes:
• Please do NOT submit your data directories, pretrained word embeddings, or the PyTorch library!
• You may consider reusing parts of the code above
• Code must be submitted with the assignment for purposes of plagiarism detection
Dataset
The dataset provided contains three files: train.json, validation.json, and test.json, which are the training dataset, validation dataset, and test dataset, respectively. See an example below:
{
    "ID": "S1",
    "Label": 3,
    "Sentence": "What country has the best defensive position in the board game Diplomacy ?"
}
In the training set and the validation set, the response variable is called Label. Your task is to predict the Label for each sentence in the test set.
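As a starting point, here is a minimal sketch of reading one of these files (assuming a hypothetical path question_2-2_data/train.json and that each file is a JSON array of records like the one above; adjust to however the provided files are actually laid out):

import json
from nltk import word_tokenize

# Hypothetical location of the provided training file; change to the actual path.
with open('question_2-2_data/train.json') as f:
    records = json.load(f)  # expected: a list of {"ID": ..., "Label": ..., "Sentence": ...} dicts

ids = [r['ID'] for r in records]
sentences = [word_tokenize(r['Sentence'].lower()) for r in records]
labels = [r['Label'] for r in records]  # test.json has no Label field, so skip this line for the test set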
Evaluation
The performance of your prediction will be evaluated automatically on Kaggle using Accuracy, which is defined as the number of correct predictions divided by the total number of sentences in the test set (https://classeval.wordpress.com/introduction/basic-evaluation-measures/).
It is important to understand that the leaderboard score is computed on only half of the test cases; the score on the remaining half will be computed after the deadline, based on your selected submission. This process ensures that your performance does not only hold for the known test cases but also generalises to the unknown ones. We will combine these two scores to mark this question.
Your score will be computed using a lower bound and an upper bound, which will be shown on the Kaggle leaderboard. Achieving an accuracy equal to or below the lower bound amounts to a grade of zero, while achieving the upper bound amounts to full points (here, 3 points; see the formula below). Consequently, your score for this competition task will be calculated as:
$$\text{Your Score} = \frac{\text{Your Accuracy} - \text{Lower Bound}}{\text{Upper Bound} - \text{Lower Bound}} \times 3$$
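For instance, with a hypothetical lower bound of 0.40, an upper bound of 0.90, and an accuracy of 0.70, the score would be (0.70 - 0.40)/(0.90 - 0.40) × 3 = 1.8 points.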
Notes about the lower-bound and upper-bound predictors:
• The lower bound is the performance obtained by a classifier that always picks the majority class according to the class distribution in the training set.
• The upper bound is generated by an “in-house” classifier trained on the same dataset that you were given.
There are many ways to achieve better results than this. However, the only labelled dataset you may use to train your model is the provided train.json. If you obtain a better performance than the upper bound, you will receive a grade higher than 3 points for this question. This can be useful to compensate for points lost elsewhere in the assignment. However, the total mark for this assignment is capped at 10 marks.
Kaggle competition
• You will be given a link to join the competition during your labs.
• Before submitting the result, first go to the team menu and change your team name to your university ID.
• You need to upload the generated result file to Kaggle. The result file should be in the following format (a sketch of writing such a file is given after this list):
id,category
S101,0
S201,1
S102,2
…
• Note that you are only allowed to upload 5 copies of your results to Kaggle per day. Make every upload count, and don’t waste your opportunities!
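The following is a minimal sketch of writing the submission file in the required format, assuming a hypothetical list predictions of (sentence ID, predicted label) pairs produced by your model:

import csv

# Hypothetical predictions: (sentence ID, predicted label) pairs from your model.
predictions = [('S101', 0), ('S201', 1), ('S102', 2)]

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'category'])  # header row expected by Kaggle
    for sent_id, label in predictions:
        writer.writerow([sent_id, label])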
NB: you need to fill in the cells below with your code. If you fail to provide the code, you will get zero for this question. Your code should be well documented and should provide methods to generate the prediction files and to compute accuracy on the validation set.
In [ ]: import json  # You can use this library to read the .json files into a Python dict: https://docs.
from nltk import word_tokenize  # You can use this to tokenize strings, or do your own pre-processing
In [ ]: """
Your tasks are to
    1. Read in the .json files and create Dataset objects from them. The Dataset constructor requires a list of
       sentences (where each sentence is a list of word ids) and a list of labels (or None if there are no labels,
       as for the test set). You will need to apply appropriate preprocessing to the raw text to get it into the
       appropriate format.
    2. Run the train_fast_text() function on these Datasets and your model.
    3. Convert the output file of predictions into the correct format for Kaggle.
       Kaggle expects a csv with two columns, id and category. You need to have these two columns in your file.
       Your csv should not include any whitespace.
    4. Change the model hyperparameters, training settings, text preprocessing, or anything else you see fit
       in order to improve your model's performance.
"""
num_classes = 6
from prepros import preprocessor
from fasttext import Dataset
raise NotImplementedError
model_file_path = os.path.join('models', 'fasttext_model_file_q2-2')
train_fast_text(model, train_dataset, valid_dataset, test_dataset, model_file_path, batch_size=batch_size)
2.3 Comparison between Absolute Discounting and Kneser-Ney smoothing [2 pts]
Read the code below for interpolated absolute discounting and implement Kneser-Ney smoothing in Python. It is sufficient to assume that the highest order of n-gram is two and that the discount is 0.75. Evaluate your program on the following n-gram corpus and compute the distribution p(x | Granny) for all possible unigrams x in the given corpus.
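For reference, a standard textbook formulation (which may differ slightly from the exact variant expected here) of the interpolated bigram Kneser-Ney estimate with discount d is

$$P_{KN}(y \mid x) = \frac{\max\big(c(x,y) - d,\, 0\big)}{c(x)} + \frac{d \, N_{1+}(x,\bullet)}{c(x)} \, P_{cont}(y), \qquad P_{cont}(y) = \frac{N_{1+}(\bullet,y)}{N_{1+}(\bullet,\bullet)},$$

where c(·) denotes counts, N_{1+}(x,•) is the number of distinct words observed after x, N_{1+}(•,y) is the number of distinct words observed before y, and N_{1+}(•,•) is the number of distinct bigram types. Compared with interpolated absolute discounting, only the lower-order term differs: the unigram relative frequency is replaced by the continuation probability P_cont(y).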
Explain what causes the differences in prediction results between interpolated absolute discounting and Kneser-Ney smoothing.
In [ ]: ngram_corpus = ["Sam eats apple",
“Granny plays with Sam”,
“Sam plays with Smith”,
“Sam likes Smith”,
“Sam likes apple”,
“Sam likes sport”,
“Sam plays tennis”,
“Sam likes games”,
“Sam plays games”,
“Sam likes apple Granny Smith”]
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
class NgramStats:
    """ Collect unigram and bigram statistics. """

    def __init__(self):
        self.bigram_to_count = Counter([])
        self.unigram_to_count = dict()

    def collect_ngram_counts(self, corpus):
        """Collect unigram and bigram counts from the given corpus."""
        unigram_counter = Counter([])
        for sentence in corpus:
            tokens = word_tokenize(sentence)
            bigrams = ngrams(tokens, 2)
            unigrams = ngrams(tokens, 1)
            self.bigram_to_count += Counter(bigrams)
            unigram_counter += Counter(unigrams)
        self.unigram_to_count = {k[0]: int(v) for k, v in unigram_counter.items()}
In [ ]: stats = NgramStats()
stats.collect_ngram_counts(ngram_corpus)
print(stats.bigram_to_count)
print(stats.unigram_to_count)
In [ ]: # Interpolated Absolute Discounting
import operator

class AbsDist:
    """
    Implementation of Interpolated Absolute Discounting
    Reference: slide 25 in https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
    """
    def __init__(self, ngram_stats):
        """ Initialization
        Args:
            ngram_stats (NgramStats) : ngram statistics.
        """
        self.unigram_freq = float(sum(ngram_stats.unigram_to_count.values()))
        self.stats = ngram_stats

    def compute_prop(self, bigram, discount=0.75):
        """ Compute probability p(y | x)
        Args:
            bigram (string tuple) : a bigram (x, y), where x and y each denote a unigram.
            discount (float) : the discount factor for the linear interpolation.
        """
        preceding_word_count = 0
        if bigram[0] in self.stats.unigram_to_count:
            preceding_word_count = self.stats.unigram_to_count[bigram[0]]
        if preceding_word_count > 0:
            left_term = 0
            if bigram in self.stats.bigram_to_count:
                bigram_count = float(self.stats.bigram_to_count[bigram])
                left_term = (bigram_count - discount) / preceding_word_count
            right_term = 0
            if bigram[1] in self.stats.unigram_to_count:
                current_word_count = self.stats.unigram_to_count[bigram[1]]
                num_bigram_preceding_word = 0
                for c_bigram in self.stats.bigram_to_count.keys():
                    if c_bigram[0] == bigram[0]:
                        num_bigram_preceding_word += 1
                normalization_param = (discount * num_bigram_preceding_word) / preceding_word_count
                p_unigram = current_word_count / self.unigram_freq
                right_term = normalization_param * p_unigram
            return left_term + right_term
        return 0
In [ ]: def compute_prop_abs_dist(ngram_stats, preceding_unigram, d=0.75):
    """ Compute the distribution p(y | x) of all y given preceding_unigram
    Args:
        preceding_unigram (string) : the preceding unigram.
        d (float) : the discount factor for the linear interpolation.
    """
    absDist = AbsDist(ngram_stats)
    c_unigram_to_prob = dict()
    for c_unigram in ngram_stats.unigram_to_count.keys():
        if c_unigram not in c_unigram_to_prob:
            c_unigram_to_prob[c_unigram] = absDist.compute_prop((preceding_unigram, c_unigram), d)
    sorted_prob = sorted(c_unigram_to_prob.items(), key=operator.itemgetter(1))
    return sorted_prob

print(compute_prop_abs_dist(stats, 'Granny'))
In [ ]: def compute_prop_KN(ngram_stats, preceding_word, d=0.75):
    # Implement Kneser-Ney smoothing here.
    # Hint: try to reuse the above code as much as possible.
    raise NotImplementedError

print(compute_prop_KN(stats, 'Granny'))
EXPLAIN THE DIFFERENCES REGARDING PREDICTION RESULTS HERE