python tensorflow deep learning natural language processing COMP4650 COMP6490 text analysis

In [ ]:
# coding: utf-8
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np

import collections
import math
import os
import random

from nltk import word_tokenize
from collections import namedtuple

import sys, getopt

from random import shuffle

# Constants
label_to_id = {'World':0, 'Entertainment':1, 'Sports':2}
num_classes = 3
pad_word_id = 0
unknown_word_id = 1

learning_rate = 0.01
num_epochs = 3
embedding_dim = 10

Q1: Batch training [3 pts].

Batch training is widely used to find the optimal parameters of neural network models. In Tensorflow, a batch corresponds to an input tensor of fixed size. This poses a particular challenge when the input data consists of word sequences of variable length. A widely used technique is to assume a maximal sequence length, so that all sequences in a batch fit into tensors of the same dimensions. Sequences shorter than the maximal length are appended with pad words so that all sequences in a batch have the same length. A pad word is a special token whose embedding is an all-zero vector. Your task is to implement this padding technique and make sure the pad words change neither the model outputs nor the model parameters during training. You should implement the padding technique by completing the fasttext model [1] below to support batch training. Note that you need to make sure that i) all elements of the pad word embedding are zero during training; ii) your implementation passes the unit tests in FastTextTest; iii) your implementation works on the provided news title classification dataset.

Hints:

  • You may modify the unit tests accordingly but should not change the goals of the existing tests.
  • Practice your Tensorflow skills with the provided warm-up exercise, though the exercise will not be graded.

[1] Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759., 2016.
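
To make the padding idea concrete, here is a minimal sketch that is independent of the assignment skeleton below. The helper name pad_batch and the toy word ids are made up for illustration; they are not part of the required interface.

In [ ]:
# A minimal sketch of padding a batch of variable-length word-id sequences.
# `pad_batch` is a hypothetical helper, not part of the assignment skeleton.
import numpy as np

def pad_batch(sentences, pad_id=0):
    """Pad each word-id sequence to the length of the longest sequence in the batch."""
    max_len = max(len(s) for s in sentences)
    batch = np.full((len(sentences), max_len), pad_id, dtype=np.int32)
    lengths = np.zeros((len(sentences), 1), dtype=np.float32)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
        lengths[i] = len(s)
    return batch, lengths

# Two sentences of lengths 3 and 2 become a 2 x 3 matrix plus their true lengths.
# Because the pad word's embedding is an all-zero vector, summing embeddings over
# the time axis and dividing by the true lengths still yields the correct mean.
print(pad_batch([[5, 3, 7], [2, 4]]))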

In [3]:
class FastText(object):
    """Define the computation graph for fasttext model."""
    
    def __init__(self, num_classes, embedding_dim, size_vocab, learning_rate, batch_size):
        """Init the model with default parameters/hyperparameters."""
        assert(size_vocab > 2)
        self.num_classes = num_classes
        self.embedding_dim = embedding_dim
        self.size_vocab = size_vocab
        self.learning_rate = learning_rate
        
    def build_graph(self):
        """Build the computation graph."""
        self.declare_placeholders()
        self.declare_variables()
        self.inference()
        self.optimize()
        self.predict()
        self.compute_accuracy()
        
    def declare_placeholders(self):
        """Declare all place holders."""
        with tf.name_scope('fast_text'):
            self.input_sens = tf.placeholder(tf.int32, shape = [None, None], name = 'input_sens')
            self.sens_length = tf.placeholder(tf.float32, shape = [None, 1], name = 'sens_length')
            self.correct_label = tf.placeholder(tf.float32, shape=[None, self.num_classes], name = 'correct_label')
        
    def declare_variables(self):
        """Declare all variables."""
        with tf.name_scope('fast_text'):
            self.W = tf.Variable(tf.zeros([self.embedding_dim, self.num_classes]), name = 'W')
            self.b = tf.Variable(tf.zeros([1, self.num_classes]), name = 'b')
            # Hint: Initialize word embeddings properly.
            embed_matrix = np.random.uniform(-1,1, [self.size_vocab, self.embedding_dim])
            self.embeddings = tf.Variable(embed_matrix, dtype=tf.float32, name = 'embed')
    
    def compute_mean_vector(self):
        # Hints:
        # Make sure that the embedding of the pad word is not updated during BackProp.
        # Make sure that the mean of each instance is correctly computed.
        embed_seq = tf.nn.embedding_lookup(self.embeddings, self.input_sens)
        # Compute the means here.
        # Hint:
        # There are more than one way of doing this.
        # https://www.tensorflow.org/performance/xla/broadcasting
        # https://www.tensorflow.org/api_docs/python/tf/tile
        
       
    
    def inference(self):
        """Compute the logits of x."""
        self.compute_mean_vector()
        self.logit = tf.matmul(self.mean_rep, self.W) + self.b
    
    def optimize(self):
        """Train a fast text model from scratch."""
        self.loss()
        optimizer = tf.train.GradientDescentOptimizer(self.learning_rate, name = 'SGD')
        self.train_step = optimizer.minimize(self.cross_entropy, name = 'train_step')
        
    def compute_accuracy(self):
        """Evaluate the model against a test/validation set"""
        correct_prediction = tf.equal(self.prediction, tf.argmax(self.correct_label, 1))
        self.accuracy = tf.cast(correct_prediction, tf.float32, name = 'accuracy')
        
    def predict(self):
        self.prediction = tf.argmax(self.logit, 1)
    
    def loss(self):
        """Compute the loss of a batch."""
        self.cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels = self.correct_label, logits = self.logit))
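
For reference only, and not the required solution: once pad positions are guaranteed to look up an all-zero embedding vector, the mean over each sentence can be obtained by summing over the time axis and dividing by the true sentence lengths. The helper below, masked_mean, is hypothetical; its arguments mirror the FastText attributes above, and it does not by itself keep the pad row of the embedding variable at zero.

In [ ]:
# Sketch only: a length-aware mean, assuming pad positions map to all-zero embeddings.
import tensorflow as tf

def masked_mean(embeddings, input_sens, sens_length):
    """Average the word embeddings of each sentence, ignoring zero pad vectors."""
    embed_seq = tf.nn.embedding_lookup(embeddings, input_sens)  # [batch, max_len, embedding_dim]
    sum_rep = tf.reduce_sum(embed_seq, axis=1)                  # pad positions contribute zero vectors
    return sum_rep / sens_length                                # sens_length has shape [batch, 1]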
In [ ]:
def generate_inputs(size_vocab, max_length, batch_size):
    """ Generate input data to simulate sentences.
    
    Args:
        size_vocab (int) : The size of the vocabulary.
        max_length (int) : The maximal sequence length.
        batch_size (int) : Batch size.
    """
    input_sens = np.random.randint(1, size_vocab, size = [batch_size, max_length])
    sens_length = np.zeros(shape=[batch_size, 1], dtype=np.float32, order='C')
    # Hints: You may add additional inputs here.
    if max_length > 1:
        for i in range(0, batch_size):
            sens_m_length = random.randint(1, max_length)
            sens_length[i] = sens_m_length
            for j in range(sens_m_length, max_length):
                input_sens[i, j] = 0
            
    return (input_sens,sens_length)
In [ ]:
def generate_labels(batch_size, num_classes):
    """ Generate input data to simulate sentences.
    
    Args:
        num_classes (int) : number of classes.
        batch_size (int) : batch size.
    """
    true_labels = np.random.randint(0, num_classes, size = batch_size)
    label_matrix = np.zeros(shape=[batch_size, num_classes], dtype=np.intp)
    for i in range(0, batch_size):
        label_matrix[i, true_labels[i]] = 1
    return label_matrix
In [ ]:
def init_embeddings(size_vocab, embedding_dim):
    """ Initialize word embedding matrix.
    
    Args:
        size_vocab (int) : size of the vocabulary.
        embedding_dim (int) : the dimension of word embeddings.
    """
    embedding_matrix = np.random.uniform(-1, 1, [size_vocab, embedding_dim])
    # How to deal with the pad word?
    return embedding_matrix
In [ ]:
class FastTextTest(tf.test.TestCase):
    """https://guillaumegenthial.github.io/testing.html"""
    
    def test_computing_mean(self):
        """
            The means computed by Numpy should match the ones computed by FastText.
        """
        vocab_size = 10
        embed_dim = 3
        num_inst_batch = 2
        sens, sens_length = generate_inputs(vocab_size, 5, num_inst_batch)
        embed_matrix = init_embeddings(vocab_size, embed_dim)
        sens_embedding = np.take(embed_matrix, sens, 0)
        true_mean = np.sum(sens_embedding,1) / sens_length
        
        model = FastText(num_classes = 3, embedding_dim = embed_dim, size_vocab = vocab_size, learning_rate = 0.1, batch_size = num_inst_batch)
        model.input_sens = tf.placeholder(tf.int32, shape = [None, None], name = 'input_sens')
        model.sens_length = tf.placeholder(tf.float32, shape = [None, None], name = 'sens_length')
        model.embeddings = tf.Variable(embed_matrix, name = 'embed', dtype=tf.float32)
        model.compute_mean_vector()
        with self.test_session() as sess:
            sess.run(tf.global_variables_initializer())
            mean_vec = sess.run(model.mean_rep, 
                                             feed_dict={model.input_sens: sens, model.sens_length : sens_length})
            self.assertAllClose(mean_vec, true_mean,rtol=1e-06, atol=1e-06, msg="Mean vectors are not equal.")
    
    def test_loss(self):
        """
            The loss computed by Numpy should be the same as the one from the model.
        """
        vocab_size = 10
        embed_dim = 3
        num_inst_batch = 2
        num_classes = 3
        sens, sens_length = generate_inputs(vocab_size, 5, num_inst_batch)
        true_labels = generate_labels(num_inst_batch, num_classes)
        
        embed_matrix = init_embeddings(vocab_size, embed_dim)
        sens_embedding = np.take(embed_matrix, sens, 0)
        true_mean = np.sum(sens_embedding,1) / sens_length
        W = np.random.uniform(-1, 1, size = [embed_dim, num_classes])
        b = np.random.uniform(-1, 1, size = [1, num_classes])
        true_logit = np.matmul(true_mean, W) + b
        exp_logit = np.exp(true_logit)
        true_denominator = np.log(np.sum(exp_logit, 1))
        true_scores = np.sum(np.multiply(true_logit, true_labels),1)
        true_loss = np.mean(true_denominator - true_scores)
        
        model = FastText(num_classes, embedding_dim = embed_dim, size_vocab = vocab_size, learning_rate = 0.1, batch_size = num_inst_batch)
        model.input_sens = tf.placeholder(tf.int32, shape = [num_inst_batch, None], name = 'input_sens')
        model.sens_length = tf.placeholder(tf.float32, shape = [num_inst_batch, None], name = 'sens_length')
        model.correct_label = tf.placeholder(tf.float32, shape = [num_inst_batch, num_classes], name = 'correct_labels')
        
        model.embeddings = tf.Variable(embed_matrix, name = 'embed', dtype=tf.float32)
        model.W = tf.Variable(W, name = 'W', dtype=tf.float32)
        model.b = tf.Variable(b, name = 'b', dtype=tf.float32)
        model.inference()
        model.loss()
        with self.test_session() as sess:
            sess.run(tf.global_variables_initializer())
            loss = sess.run(model.cross_entropy, 
                                             feed_dict={model.input_sens: sens, model.sens_length : sens_length, model.correct_label : true_labels})
            self.assertAllClose(loss, true_loss,rtol=1e-06, atol=1e-06, msg="cross entropy is not equal.")

    def test_computing_accuracy(self):
        """ The accuracy computed by Numpy should match the one computed by the model.
        """
        vocab_size = 10
        embed_dim = 4
        num_inst_batch = 2
        num_classes = 3
        sens, sens_length = generate_inputs(vocab_size, 5, num_inst_batch)
        true_labels = generate_labels(num_inst_batch, num_classes)
        
        embed_matrix = init_embeddings(vocab_size, embed_dim)
        sens_embedding = np.take(embed_matrix, sens, 0)
        true_mean = np.sum(sens_embedding,1) / sens_length
        W = np.random.uniform(-1, 1, size = [embed_dim, num_classes])
        b = np.random.uniform(-1, 1, size = [1, num_classes])
        true_logit = np.matmul(true_mean, W) + b
        true_prediction = np.argmax(true_logit, 1)
        true_label_indices = np.argmax(true_labels, 1)
        boolean_accuracy_matrix = np.equal(true_prediction, true_label_indices)
        true_accuracy_matrix = boolean_accuracy_matrix.astype(np.float32)
        
        model = FastText(num_classes, embedding_dim = embed_dim, size_vocab = vocab_size, learning_rate = 0.1, batch_size = num_inst_batch)
        model.input_sens = tf.placeholder(tf.int32, shape = [num_inst_batch, None], name = 'input_sens')
        model.sens_length = tf.placeholder(tf.float32, shape = [num_inst_batch, None], name = 'sens_length')
        model.correct_label = tf.placeholder(tf.float32, shape = [num_inst_batch, num_classes], name = 'correct_labels')
        
        model.embeddings = tf.Variable(embed_matrix, name = 'embed', dtype=tf.float32)
        model.W = tf.Variable(W, name = 'W', dtype=tf.float32)
        model.b = tf.Variable(b, name = 'b', dtype=tf.float32)
        model.inference()
        model.predict()
        model.compute_accuracy()
        with self.test_session() as sess:
            sess.run(tf.global_variables_initializer())
            accuracy = sess.run(model.accuracy, 
                                             feed_dict={model.input_sens: sens, model.sens_length : sens_length, model.correct_label : true_labels})
            self.assertAllEqual(accuracy, true_accuracy_matrix, msg="accuracy is not equal.")
            
    def test_zero_embeddings(self):
        """ The embedding of the pad word should be all zeros after a few training epochs.
        """
        vocab_size = 10
        embed_dim = 4
        num_inst_batch = 2
        num_classes = 3
        model = FastText(num_classes, embedding_dim = embed_dim, size_vocab = vocab_size, learning_rate = 0.1, batch_size = num_inst_batch)
        model.build_graph()
        pad_word_embeddings = np.zeros(embed_dim)
        with self.test_session() as sess:
            sess.run(tf.global_variables_initializer())
            for epoch in range(0, 6):
                sens, sens_length = generate_inputs(vocab_size, 5, num_inst_batch)
                true_labels = generate_labels(num_inst_batch, num_classes)    
                train_step, loss, embeddings = sess.run([model.train_step, model.cross_entropy, model.embeddings], 
                                                 feed_dict={model.input_sens: sens, model.sens_length : sens_length, model.correct_label : true_labels})
                self.assertAllEqual(embeddings[pad_word_id, :], pad_word_embeddings, msg = "Epoch {} : the embedding of the pad word contains non-zeros.".format(epoch))
    
    def runTest(self):
        pass
            
if __name__ == '__main__':
    test=FastTextTest()
    test.test_computing_mean()
    test.test_loss()
    test.test_computing_accuracy()
    test.test_zero_embeddings()
    print("All tests are passed!")
In [ ]:
# coding: utf-8

class Dataset:
    """
        The class for representing a dataset.
        
    """
    
    def __init__(self, sentences, labels, max_length = -1):
        """
            Args:
                sentences (list) : a list of sentences. Each sentence is a list of word ids.
                labels (list) : a list of label representations. Each label is represented by a one-hot vector.
        """
        if len(sentences) != len(labels):
            raise ValueError("The size of sentences {} does not match the size of labels {}. ".format(len(sentences), len(labels)) )
        if len(labels) == 0:
            raise ValueError("The input is empty.")
        self.sentences = sentences
        self.labels = labels
        self.sens_lengths = [len(sens) for sens in sentences]
        if max_length == -1:
            self.max_sens_length = max(self.sens_lengths)
        else:
            self.max_sens_length = max_length
        
        
    def label_tensor(self):
        """
            Return the label matrix of a batch.
        """
        return np.array(self.labels)
    
    def sent_tensor(self):
        """
            Return the sentence matrix of a batch.
        """
        return np.array(self.sentences)
    
    def sens_length(self):
        """
            Return a vector of sentence length for a batch.
        """
        length_array = np.array(self.sens_lengths, dtype=np.float32)
        return np.reshape(length_array, [len(self.sens_lengths),1])
   
    def subset(self, index_list):
        """ Return a subset of the dataset.
        
            Args:
                index_list (list) : a list of sentence indices.
        """
        sens_subset = []
        labels_subset = []
        for index in index_list:
            if index >= len(self.sentences):
                raise IndexError("index {} is larger than or equal to the size of the dataset {}.".format(index, len(self.sentences)))
            sens_subset.append(self.sentences[index]) 
            labels_subset.append(self.labels[index])
            
        dataset = Dataset(sentences=sens_subset, labels=labels_subset, max_length=self.max_sens_length)
        
        return dataset
    
    def get_batch(self, index_list):
        """ Return a batch.
            
            Args:
                index_list (list) : a list of sentence indices.
        """
        data_subset = self.subset(index_list)
        for sens in data_subset.sentences:
            self.pad_sentence(sens)

        return data_subset
    
    def pad_sentence(self, sens):
        """ Implement padding here.
            
            Args:
                sens (list) : a list of word ids.
        """
        pass
    
    def size(self):
        return len(self.sentences)
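
For illustration only (not the required implementation of Dataset.pad_sentence): padding a single sentence could amount to appending pad_word_id until the target length is reached, as in the hypothetical helper below.

In [ ]:
# Illustration only: pad one word-id list in place up to a target length.
def pad_to_length(sens, max_length, pad_id=pad_word_id):
    """Append pad ids until `sens` has length `max_length` (assumes len(sens) <= max_length)."""
    sens.extend([pad_id] * (max_length - len(sens)))

sample = [4, 9, 2]
pad_to_length(sample, 5)
print(sample)  # [4, 9, 2, 0, 0]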
In [ ]:
def create_label_vec(label):
    """Create one hot representation for the given label.
    
    Args:
        label(str): class name
    """
    label_id = label_to_id[label.strip()]
    label_vec = np.zeros(num_classes, dtype=np.int)
    label_vec[label_id] = 1
    return label_vec
In [ ]:
def tokenize(sens):
    """Tokenize a sentence
    
    Args:
        sens (str) : a sentence.
    """
    return word_tokenize(sens)
In [ ]:
def map_token_seq_to_word_id_seq(token_seq, word_to_id):
    """ Map a word sequence to a word ID sequence.
             
    Args:
        token_seq (list) : a list of words, each word is a string.
        word_to_id (dictionary) : map word to its id.
    """
    return [map_word_to_id(word_to_id,word) for word in token_seq]
In [ ]:
def map_word_to_id(word_to_id, word):
    """ Map a word to its id.
    
        Args:
            word_to_id (dictionary) : a dictionary mapping words to their ids.
            word (string) : a word.
    """
    if word in word_to_id:
        return word_to_id[word]
    else:
        return unknown_word_id
In [ ]:
def build_vocab(sens_file_name):
    """ Build a vocabulary from a train set.
        
        Args:
            sens_file_name (string) : the file path of the training sentences.
    """
    data = []
    with open(sens_file_name) as f:
        for line in f.readlines():
            tokens = tokenize(line)
            data.extend(tokens)
    print('number of tokens is %s. ' % len(data))
    count = [['$PAD$', pad_word_id], ['$UNK$', unknown_word_id]]
    sorted_counts = collections.Counter(data).most_common()
    count.extend(sorted_counts)
    word_to_id = dict()
    for word, _ in count:
        word_to_id[word] = len(word_to_id)
    
    print("PAD word id is %s ." % word_to_id['$PAD$'])
    print("Unknown word id is %s ." % word_to_id['$UNK$'])
    print('size of vocabulary is %s. ' % len(word_to_id))
    return word_to_id
In [ ]:
def read_labeled_dataset(sens_file_name, label_file_name, word_to_id):
    """ Read labeled dataset.
    
        Args:
            sens_file_name (string) : the file path of sentences.
            label_file_name (string) : the file path of sentence labels.
            word_to_id (dictionary) : a dictionary mapping words to their ids.
    """
    with open(sens_file_name) as sens_file, open(label_file_name) as label_file:
        data = []
        data_labels = []
        for label in label_file:
            sens = sens_file.readline()
            word_id_seq = map_token_seq_to_word_id_seq(tokenize(sens), word_to_id)
            if len(word_id_seq) > 0 :
                data.append(word_id_seq)
                data_labels.append(create_label_vec(label))
        print("read %d sentences from %s ." % (len(data), sens_file_name))
        labeled_set = Dataset(sentences=data, labels=data_labels)
        
        return labeled_set
In [ ]:
def write_results(test_results, result_file):
    """ Write predicted labels into file.
        
        Args:
            test_results (list) : a list of predictions.
            result_file (string) : the file path of the prediction result.
    """
    with open(result_file, mode='w') as f:
        for r in test_results:
            f.write("%d\n" % r)
In [ ]:
class DataIter:
    """ An iterator of an dataset instance.
    
    """
    
    def __init__(self, dataset, batch_size = 1):
        """ 
            Args:
                dataset (Dataset) : an instance of Dataset.
                batch_size (int) : batch size.
        """
        self.dataset = dataset
        self.dataset_size = len(dataset.sentences)
        self.shuffle_indices = np.arange(self.dataset_size)
        self.batch_index = 0
        self.batch_size = batch_size
        self.num_batches_per_epoch = int(self.dataset_size/float(self.batch_size))
    
    def __iter__(self):
        return self
        
    def next(self):
        """ return next instance. """
        
        if self.batch_index < self.dataset_size:
            i = self.shuffle_indices[self.batch_index]
            self.batch_index += 1
            return self.dataset.get_batch([i])
        else:
            raise StopIteration
            
    def next_batch(self):
        """ return indices for the next batch. Useful for minibatch learning."""
        
        if self.batch_index < self.num_batches_per_epoch:
            start_index = self.batch_index * self.batch_size
            end_index = (self.batch_index + 1) * self.batch_size
            self.batch_index += 1
            return self.dataset.get_batch(self.shuffle_indices[start_index : end_index])
        else:
            raise StopIteration
    
    def has_next(self):
        
        return self.batch_index < self.num_batches_per_epoch
        
            
    def shuffle(self):
        """ Shuffle the data indices for training"""
        
        self.shuffle_indices = np.random.permutation(self.shuffle_indices)
        self.batch_index = 0
In [ ]:
def save_for_tensorboard(logs_dir = './computation_graphs'):
    """ Save computation graph to logs_dir for tensorboard. 
    
    Args:
        logs_dir (string) : file path to serialized computation graphs.
    
    """
    fast_text = FastText(num_classes, embedding_dim, 10, learning_rate, 5)
    fast_text.build_graph()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        log_writer = tf.summary.FileWriter(logs_dir, sess.graph)
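
After running save_for_tensorboard, the serialized graph can be inspected by launching TensorBoard from a shell, for example with tensorboard --logdir ./computation_graphs, and opening the address it prints in a browser.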
In [2]:
def eval(word_to_id, train_dataset, dev_dataset, test_dataset, model_file_path, batch_size = 1):
    """ Train a fasttext model, evaluate it on the validation set after each epoch, 
    and choose the best one model to evaluate it on the test set. 
    
    Args:
        word_to_id (dictionary) : word to id mapping.
        train_dataset (Dataset) : labeled dataset for training.
        dev_dataset (Dataset) : labeled dataset for validation.
        test_dataset (Dataset) : labeled dataset for test.
        model_file_path (string) : model file path.
        batch_size (int) : the number of instances in a batch.
    
    """
    fast_text = FastText(num_classes, embedding_dim, len(word_to_id),learning_rate, batch_size)
    fast_text.build_graph()
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()
    max_accu = 0
    max_accu_epoch = 0
    with tf.Session() as sess:
        sess.run(init)    
        for epoch in range(num_epochs):
            dataIterator = DataIter(train_dataset, batch_size)
            dataIterator.shuffle()
            total_loss = 0
            # modify here to use batch training
            while dataIterator.has_next():
                batch_data = dataIterator.next_batch()
                sens = batch_data.sent_tensor()
                label = batch_data.label_tensor()
                sens_length = batch_data.sens_length()
                (optimizer, loss) = sess.run([fast_text.train_step, fast_text.cross_entropy], 
                                             feed_dict={fast_text.input_sens: sens, fast_text.sens_length : sens_length, fast_text.correct_label: label})
                total_loss += np.sum(loss)
            model_file = '{}_{}'.format(model_file_path, epoch)
            saver.save(sess, model_file)
            validation_accuracy = compute_accuracy(fast_text, model_file, dev_dataset)
            print('Epoch %d : train loss = %s , validation accuracy = %s .' % (epoch, total_loss, validation_accuracy))
            if validation_accuracy > max_accu:
                max_accu = validation_accuracy
                max_accu_epoch = epoch

        # modify here to use batch evaluation
    final_model_file = '{}_{}'.format(model_file_path, max_accu_epoch)
    print('Accuracy on the test set : %s.' % compute_accuracy(fast_text, final_model_file, test_dataset))
    predictions = predict(fast_text, final_model_file, test_dataset)
    write_results(predictions, './predictions.csv')
In [ ]:
def predict(fast_text, fasttext_model_file, test_dataset):
    """ 
    Predict labels for each sentence in the test_dataset.
    
    Args:
        fast_text (FastText) : an instance of fasttext model.
        fasttext_model_file (string) : file path to the fasttext model.
        test_dataset (Dataset) : labeled dataset to generate predictions.
    """
    
    saver = tf.train.Saver()
   
    test_results = []
    with tf.Session() as sess: 
        saver.restore(sess, fasttext_model_file)
        dataIterator = DataIter(test_dataset)
        while dataIterator.has_next():
            data_record = dataIterator.next()
            sens = data_record.sent_tensor()
            sens_length = data_record.sens_length()
            prediction = fast_text.prediction.eval(feed_dict={fast_text.input_sens: sens, fast_text.sens_length : sens_length})
            test_results.append(prediction)
    return test_results
In [ ]:
def compute_accuracy(fast_text, fasttext_model_file, eval_dataset):
    """ 
    Compute accuracy on the eval_dataset in batch mode. It is useful only for the bonus assignment.
    
    Args:
        fast_text (FastText) : an instance of fasttext model.
        fasttext_model_file (string) : file path to the fasttext model.
        eval_dataset (Dataset) : labeled dataset for evaluation.
    """
    saver = tf.train.Saver()
    dataIterator = DataIter(eval_dataset)
    
    num_correct = 0
    with tf.Session() as sess: 
        saver.restore(sess, fasttext_model_file)
        
        while dataIterator.has_next():
            data_record = dataIterator.next()
            sens = data_record.sent_tensor()
            label = data_record.label_tensor()
            sens_length = data_record.sens_length()
            is_correct = sess.run(fast_text.accuracy, 
                                         feed_dict={fast_text.input_sens: sens, fast_text.sens_length : sens_length, fast_text.correct_label: label})
            num_correct += is_correct
            
    return num_correct/float(eval_dataset.size())
In [ ]:
def run_main(data_folder):
    """
    Train and evaluate the fasttext model.
    
    Args:
        data_folder (string) : the path to the data folder.
    
    """
    trainSensFile = os.path.join(data_folder, 'sentences_train.txt')
    devSensFile = os.path.join(data_folder, 'sentences_dev.txt')
    testSensFile = os.path.join(data_folder, 'sentences_test.txt')
    trainLabelFile = os.path.join(data_folder, 'labels_train.txt')
    devLabelFile = os.path.join(data_folder, 'labels_dev.txt')
    testLabelFile = os.path.join(data_folder, 'labels_test.txt')
    testResultFile = os.path.join(data_folder, 'test_results.txt')
    model_file_path = os.path.join(data_folder, 'fasttext_model_file')

    word_to_id = build_vocab(trainSensFile)
    train_dataset = read_labeled_dataset(trainSensFile, trainLabelFile, word_to_id)
    dev_dataset = read_labeled_dataset(devSensFile, devLabelFile, word_to_id)
    test_dataset = read_labeled_dataset(testSensFile, testLabelFile, word_to_id)
    eval(word_to_id, train_dataset, dev_dataset, test_dataset, model_file_path, batch_size = 10)
In [ ]:
run_main('target_file_path')

Q2 [3 pts]. Question Classification.

Understanding questions is a key problem in chatbots and question answering systems. In the open-domain setting, it is difficult to find the right answers in the huge search space. To tackle this problem, one approach is to categorize questions into a finite set of semantic classes, where each semantic class corresponds to a small answer space.

Your task is to implement a question classification model in Tensorflow and apply it to the datasets provided in this assignment.

Notes:

  • The warm-up exercise will not be graded, though following the exercises and instructions may save you a great deal of time.
  • Please do not submit your data directories, pretrained word embeddings, and Tensorflow library.
  • You may consider reusing part of the code in Q1.
  • Code must be submitted with the assignment for purposes of plagiarism detection.

Dataset

The dataset provided on Wattle contains three files: train.json, validation.json, and test.json, which are the training dataset, validation dataset, and test dataset, respectively. See an example below:

{
   "ID": S1,
   "Label": 3,
   "Sentence":"What country has the best defensive position in the board game Diplomacy ?"
}

In the training set and the validation set, the response variable is called Label. Your task is to predict the Label for each sentence in the test set.

Evaluation

The performance of your prediction will be evaluated automatically on Kaggle using Accuracy, which is defined as the number of correct predictions divided by the total number of sentences in the test set (https://classeval.wordpress.com/introduction/basic-evaluation-measures/).

Your score will be computed using a lower bound and an upper bound, which will be shown on the Kaggle leaderboard. Achieving an accuracy equal to or below the lower bound amounts to a grade of zero, while achieving the upper bound amounts to full points (here 3 points; see the score distribution below). Consequently, your score for this competition task will be calculated as:

$$\text{Your\_Score} = \frac{\text{Your\_Accuracy} - \text{Lower\_Bound}}{\text{Upper\_Bound} - \text{Lower\_Bound}} \times 3$$
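
For example, with a hypothetical lower bound of 0.40 and upper bound of 0.90, an accuracy of 0.65 would earn (0.65 − 0.40) / (0.90 − 0.40) × 3 = 1.5 points.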

Notes about the lower bound and upper bounds predictors:

  • The lower bound is the performance obtained by a classifier that always picks the majority class according to the class distribution in the training set.
  • The upper bound is generated by an “in-house” classifier trained on the same dataset that you were given.

There are many ways to achieve better results than this. However, the only labeled training data for your model should be the provided train.json. If you obtain better performance than the upper bound, you will receive a grade higher than 3 points for this question, which can compensate for any points lost elsewhere in the assignment. However, the total mark of this assignment is capped at 10 marks.

Kaggle competition

  • Join the competition here
  • Before submitting results, first go to the team menu and change your team name to your university ID.
  • You need to upload the generated result file to Kaggle. The result file should be in the following format:
    id,category
    S101,0
    S201,1
    S102,2
    ...
  • Note that you are only allowed to upload 5 copies of your results to Kaggle per day. Make every upload count, and don’t waste your opportunities!
  • For detailed submission instructions, check the end of this notebook.

After completion, please rename this notebook to your_uid.ipynb (e.g. u6000001.ipynb) and submit this file to Wattle. Do not upload any other files to Wattle except this notebook file.

Note: you need to fill in the cells below with your code. If you fail to provide the code, you will get zero for this question. Your code should be well documented and provide methods to generate the prediction files and compute accuracy on the validation set.
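
As a starting point, here is a minimal sketch for reading the provided JSON files and writing a submission in the required id,category format. It assumes each file contains a JSON array of records with the keys ID, Label, and Sentence shown above; the helper names load_records and write_submission are made up and not required.

In [ ]:
# Sketch only: load the provided JSON files and write a Kaggle submission file.
# Assumes each file holds a JSON array of {"ID", "Label", "Sentence"} records;
# adjust the loader if the data is stored as one JSON object per line instead.
import json
import csv

def load_records(path):
    with open(path) as f:
        return json.load(f)

def write_submission(ids, predicted_labels, out_path='submission.csv'):
    """Write predictions in the required 'id,category' format."""
    with open(out_path, mode='w') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'category'])
        for qid, label in zip(ids, predicted_labels):
            writer.writerow([qid, label])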

In [ ]:
# Your code here.

Q3: Comparison between Absolute Discounting and Kneser-Ney Smoothing [2 pts].

Read the code below for interpolated absolute discounting and implement Kneser-Ney smoothing in Python. It is sufficient to assume that the highest ngram order is two and the discount is 0.75. Evaluate your program on the following ngram corpus and compute the distribution p(x | Granny) over all possible unigrams x in the given corpus. Explain what causes the differences in prediction results between interpolated absolute discounting and Kneser-Ney smoothing.
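
For reference (this formula is not part of the provided code): with discount d, the standard interpolated Kneser-Ney bigram estimate is

$$P_{KN}(y \mid x) = \frac{\max(c(x, y) - d,\ 0)}{c(x)} + \frac{d \cdot \lvert\{y' : c(x, y') > 0\}\rvert}{c(x)} \cdot \frac{\lvert\{x' : c(x', y) > 0\}\rvert}{\lvert\{(x', y') : c(x', y') > 0\}\rvert}$$

where c(·) denotes a count over the corpus. The lower-order term is a continuation probability based on how many distinct left contexts y appears in, rather than the raw unigram probability used by absolute discounting.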

In [ ]:
ngram_corpus = ['Sam eats apple',
                'Granny plays with Sam',
                'Sam plays with Smith',
                'Sam likes Smith',
                'Sam likes apple',
                'Sam likes sport',
                'Sam plays tennis',
                'Sam likes games',
                'Sam plays games',
                'Sam likes apple Granny Smith']

from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

class NgramStats:
    """ Collect unigram and bigram statistics. """
    
    def __init__(self):
        self.bigram_to_count = Counter([])
        self.unigram_to_count = dict()
        
    def collect_ngram_counts(self, corpus):
        """Collect unigram and bigram counts from the given corpus."""
        unigram_counter = Counter([])
        for sentence in corpus:
            tokens = word_tokenize(sentence)
            bigrams = ngrams(tokens, 2)
            unigrams = ngrams(tokens, 1)
            self.bigram_to_count += Counter(bigrams)
            unigram_counter += Counter(unigrams)
        self.unigram_to_count = {k[0]:int(v) for k,v in unigram_counter.items()}
In [ ]:
stats = NgramStats()         
stats.collect_ngram_counts(ngram_corpus)
print(stats.bigram_to_count)
print(stats.unigram_to_count)
In [ ]:
# Interpolated Absolute Discounting
import operator
    
class AbsDist:
    """
     Implementation of Interpolated Absolute Discounting
     
     Reference: slide 25 in https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
    """
    def __init__(self, ngram_stats):
        """ Initialization
        
            Args:
                ngram_stats (NgramStats) : ngram statistics.
        """
        self.unigram_freq = float(sum(ngram_stats.unigram_to_count.values()))
        self.stats= ngram_stats
    
    def compute_prop(self, bigram, discount = 0.75):
        """ Compute probability p(y | x)
        
            Args:
                bigram (string tuple) : a bigram (x, y), where x and y denotes an unigram respectively.
                discount (float) : the discounter factor for the linear interpolation.
        """
        preceding_word_count = 0
        if bigram[0] in self.stats.unigram_to_count:
            preceding_word_count = self.stats.unigram_to_count[bigram[0]]
            
        if preceding_word_count > 0:
            left_term = 0
            if bigram in self.stats.bigram_to_count:
                bigram_count = float(self.stats.bigram_to_count[bigram])
                left_term = (bigram_count - discount)/preceding_word_count
            right_term = 0
            if bigram[1] in self.stats.unigram_to_count:
                current_word_count = self.stats.unigram_to_count[bigram[1]]
                num_bigram_preceding_word = 0
                for c_bigram in self.stats.bigram_to_count.keys():
                    if c_bigram[0] == bigram[0] :
                        num_bigram_preceding_word += 1
                normalization_param = (discount * num_bigram_preceding_word)/ preceding_word_count 
                p_unigram = current_word_count/self.unigram_freq
                right_term = normalization_param * p_unigram
            return left_term + right_term
        
        return 0
In [ ]:
def compute_prop_abs_dist(ngram_stats, preceding_unigram, d = 0.75):
    """ Compute the distribution p(y | x) of all y given preceding_unigram

        Args:
            preceding_unigram (string) : the preceding unigram.
            d (float) : the discounter factor for the linear interpolation.
    """
    absDist = AbsDist(ngram_stats)
    c_unigram_to_prob = dict()
    for c_unigram in ngram_stats.unigram_to_count.keys():
        if not c_unigram in c_unigram_to_prob:
            c_unigram_to_prob[c_unigram] = absDist.compute_prop((preceding_unigram, c_unigram), d)
  
    sorted_prob = sorted(c_unigram_to_prob.items(), key=operator.itemgetter(1))
    return sorted_prob

print(compute_prop_abs_dist(stats, 'Granny'))
In [ ]:
def compute_prop_KN(ngram_stats, preceding_word):
    # Implement Kneser Ney Smoothing here.
    # Hint: try to reuse the above code as much as possible.
    pass


print(compute_prop_KN(stats, 'Granny'))

Explain the differences in prediction results between the two smoothing methods here.

Q4 [2 pts]. Transition-based Dependency Parsing.

We have learned in the lecture that Nivre’s parsing algorithm has four parsing actions (Left-Arc, Right-Arc, Reduce, Shift). There are also several alternative transition-based algorithms for dependency parsing. Please learn the Arc-Standard algorithm from http://stp.lingfil.uu.se/~sara/kurser/5LN455-2014/lectures/5LN455-F8.pdf and apply this algorithm to the following sentence. Use the same grammar as given below for parsing.

{Noun->Adj, ROOT->Verb, Noun->Det, Verb->Prep, Verb->Noun, figure->on, on->screen}

Check if you are able to find a priority queue that generates the same parsing tree as in the slides of dependency parsing. Write down all intermediate transitions.

Red figures on the screen indicated falling stocks

Note:

  • The intermediate transitions should be written as:

LA<root, figures on the screen indicated falling stocks, {(figures, red)}>

S

  • In the priority queue, you may include pre-conditions before applying parsing actions.

SOLUTION:

In [ ]: