Assignment 4

In this assignment you will extend the Viterbi algorithm discussed during class.

Your mark will depend on

  • the correctness of your implementation
  • the correctness and clarity of your description of the algorithms

As we have to run the notebooks of all students, and because writing efficient code is important, your notebook should run in at most 5 minutes on your machine.

Setup Instructions

It is important that this file is placed in the correct directory. It will not run otherwise. The correct directory is

DIRECTORY_OF_YOUR_BOOK/assignments/2018/assignment4/problem/

where DIRECTORY_OF_YOUR_BOOK is a placeholder for the directory you downloaded the book to. After you have placed it there, rename the file to your UCPH ID (of the form xxxxxx).

General Instructions

This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

  1. Setup Sections: these sections set up code and resources for assessment. Do not edit these.
  2. Assessment Sections: these sections are used both for evaluating the output of your code and for markers to enter their marks. Do not edit these.
  3. Task Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free-text answers, simply edit the markdown field.

Note that you are free to create additional notebook cells within a task section.

Do not share this assignment publicly, by uploading it online, emailing it to friends etc.

Setup 1: Load Libraries and Data

This cell loads libraries and datasets important for evaluation and assessment of your model. Do not change it.

In [2]:
#! SETUP 1
import sys, os
_snlp_book_dir = "../../../../"
sys.path.append(_snlp_book_dir) 
from collections import defaultdict
import math
import statnlpbook.util as util
import statnlpbook.sequence as seq
import warnings
warnings.filterwarnings('ignore')

train = seq.load_tweebank(_snlp_book_dir + "data/oct27.splits/oct27.train")
dev = seq.load_tweebank(_snlp_book_dir + "data/oct27.splits/oct27.dev")
test = seq.load_tweebank(_snlp_book_dir + "data/oct27.splits/oct27.test")
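
If you want to get a feel for the data before starting, you can inspect a single training example in a new cell. The snippet below assumes (as the batch_predict helper in Setup 2 suggests) that each entry returned by load_tweebank is a pair of a token sequence and its gold tag sequence; it is purely illustrative and not part of the assignment.

x, y = train[0]
print(len(train), len(dev), len(test))  # number of sequences in each split
print(list(zip(x, y))[:5])              # first few (token, tag) pairs of one tweet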

Setup 2: Setup the MEMM

This cell sets up the Maximum Entropy Markov Model (MEMM) for part-of-speech tagging on the Tweebank dataset. It defines a simple feature function, a utility function that is needed later, and instantiates the actual MEMMSequenceLabeler in the variable memm. Do not edit this setup section either. The task below does not require you to make any changes to the functions or variables defined here.

In [3]:
#! SETUP 2
def memm_feat(x,i,hist):
    # Features for token i: word identity, whether the token starts with '@',
    # whether it is lowercase, its 3- and 2-character suffixes, and the previous label.
    return {
        'word:' + str(x[i]): 1.0,
        'first_at:' + str(x[i][0:1] == '@'): 1.0,
        'is_lower:' + str(x[i].islower()): 1.0,
        'last_3' + "".join(x[i][-3:]): 1.0,
        'last_2' + "".join(x[i][-2:]): 1.0,
        'prev_y': hist[0],
    }

def batch_predict(data, beam_predictor):
    # Return the single best label sequence for each instance in 'data'.
    return [beam_predictor(x)[0][0][0] for x,y in data]

memm = seq.MEMMSequenceLabeler(memm_feat, train, order=1, C=10)
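
To see exactly which features the MEMM conditions on, you can call the feature function directly on a token and a dummy label history in a separate cell. This is only an illustration and is not required for the tasks below.

x, y = train[0]
print(memm_feat(x, 0, hist=["PAD"]))  # features of the first token, with a padded history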

Task 1: Extend the Viterbi algorithms to n-grams

This is the core part of the assignment. You are to take the existing implementation of the Viterbi algorithm below, which conditions only on the previously predicted label, and extend it to condition on the previous n labels instead.

For example, for the label sequence “O V N”, the states of the Viterbi algorithm with n=1 are ["O", "V", "N"]. With n=2, the states of the Viterbi algorithm would be [("PAD", "O"), ("O", "V"), ("V", "N")].

Concretely, you need to modify the memm_viterbi_search function below to utilize its newly added n argument, which indicates the label n-gram length the algorithm should condition on. The implementation you see below does not yet do anything with this argument; instead, it always conditions on a single label only (n=1). You need to understand this implementation, find the locations in the code that need to be modified, and make the necessary modifications. To do this, you will also need to change a utility function that is called by memm_viterbi_search, and therefore also given below.

At the end, the cell below evaluates the Viterbi algorithm using unigrams, bigrams, and trigrams. Currently, these all return identical accuracy scores, but if your implementation is working correctly, they should get progressively better with n=2 and n=3.

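One way to think about these states is as tuples of the n most recent labels, left-padded with "PAD" when fewer than n labels have been predicted. The helper below is only an illustrative sketch of how such histories could be built; it is not part of the required interface, and you are free to structure your solution differently.

def make_history(prev_labels, n):
    # Return the last n labels as a tuple, left-padded with "PAD".
    padded = ["PAD"] * max(0, n - len(prev_labels)) + list(prev_labels)
    return tuple(padded[-n:])

print(make_history([], 2))          # ('PAD', 'PAD')
print(make_history(["O"], 2))       # ('PAD', 'O')
print(make_history(["O", "V"], 2))  # ('O', 'V')
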
In [4]:
# TODO: extend this function to condition on tag n-grams, based on its "n" argument
# nb: ignore the "width" argument of this function; it is not important for the purposes of this task
def memm_viterbi_search(memm, x, width=2, n=1):
    labels = memm.labels()
    # 'alpha' and 'beta' hold the Viterbi state;
    # 'alpha' stores the best accumulated score for a label sequence,
    # 'beta'  stores the backpointer to the previous state from which the best score was achieved
    # ---For example,
    # 'alpha[2]["N"]' gives the accumulated (global) score at index 2 of the sequence
    #                 when the item at index 2 is labelled as "N";
    # 'beta[2]["N"]'  points to the label at the *previous* index that was used
    #                 to achieve this score (since we are conditioning on the previous label!)
    alpha = [{}] 
    beta = [{}] 
    # ---NB: changing these dicts to condition on n-grams, for example ('PAD', 'N'),
    #        instead of just single labels (like 'N') is the crucial part of your assignment
    # -------------------------------------------------------------------------------------------------
    # Initialisation
    # ---here, we predict the first label of our sequence, using a special "PAD" label
    #    as our history since we don't have any previous predictions yet
    for label_index, label_score in enumerate(memm.predict_scores_hist(x, 0, ["PAD"])):
        label = labels[label_index]
        alpha[0][label] = label_score
        beta[0][label] = "PAD"
    
    # Prune --- you can ignore this step for now
    seq.prune_alpha_beta(alpha[0], beta[0], width)
    
    # Recursion
    # ---here, we step through each item of the sequence to label it
    for i in range(1, len(x)):
        alpha.append(defaultdict(lambda: -math.inf))
        beta.append({})
        # For each possible label at the *previous* index, we predict
        # the scores of all possible labels at the *current* index
        for p in alpha[i-1].keys():
            for label_index, label_score in enumerate(memm.predict_scores_hist(x, i, [p])):
                label = labels[label_index]
                new_score =  alpha[i-1][p] + label_score
                # ...but we only keep predictions if they lead to a higher global score
                if new_score > alpha[i][label]:
                    alpha[i][label] = new_score
                    beta[i][label] = p

        # Prune --- you can ignore this step for now
        seq.prune_alpha_beta(alpha[i], beta[i], width)

    # Finally, we need to reconstruct the best label sequence from 'alpha' and 'beta'
    history = convert_alpha_beta_to_history(x, alpha, beta)
    return history[-1], history

# This utility function takes the 'alpha' and 'beta' from the Viterbi algorithm
# and reconstructs the best label sequence(s) from them.
# ---NB: This function can be a bit confusing, but you should only need to make minor changes.
#        Try looking at/'print'ing the output of this function
#          a) with the original Viterbi function
#          b) with your modified Viterbi function
#        and try to figure out what needs to be changed based on that.  The outputs should
#        have the same structure in both cases.
def convert_alpha_beta_to_history(x, alpha, beta):
    beams = [defaultdict(lambda: [])]
    # Here, we first reconstruct all possible label sequences our algorithm has stored
    for i in range(0, len(x)):
        beam = {}
        for l, s in alpha[i].items():
            beam[l] = beams[-1][beta[i][l]] + [l]
        beams.append(beam)
    del beams[0]
    history = [[([], 0.0)]]
    # Here, we add the scores
    for i in range(0, len(x)):
        beam = []
        for l, s in sorted(alpha[i].items(), key=lambda t: -t[1]):
            beam.append((beams[i][l], s))
        history.append(beam)
    return history

for i in range(1, 4):
    acc = seq.accuracy(dev, batch_predict(dev, lambda x: memm_viterbi_search(memm, x, width=2, n=i)))
    print(f"n={i}: {acc}")
n=1: 0.8140161725067385
n=2: 0.8140161725067385
n=3: 0.8140161725067385
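
When debugging your extension, it can help to inspect the raw output of memm_viterbi_search on a single sentence rather than only the aggregate accuracy. The snippet below is purely illustrative; it relies only on the return structure already used by batch_predict above (the first element of the returned history entry is the best label sequence).

x, y = dev[0]
best, history = memm_viterbi_search(memm, x, width=2, n=1)
print(best[0][0])  # best label sequence for this sentence
print(y)           # gold labels, for comparison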

Assessment 1: Accuracy scores of Viterbi search (50 pts)

We assess the correctness of your implementation by running it on the dev and test data, using different values for n, and checking whether they match the expected values.

In [5]:
for i in range(1, 4):
    acc = seq.accuracy(dev, batch_predict(dev, lambda x: memm_viterbi_search(memm, x, width=2, n=i)))
    print(f"n={i}: {acc}")

for i in range(1, 4):
    acc = seq.accuracy(test, batch_predict(test, lambda x: memm_viterbi_search(memm, x, width=2, n=i)))
    print(f"n={i}: {acc}")
n=1: 0.8140161725067385
n=2: 0.8140161725067385
n=3: 0.8140161725067385
n=1: 0.8137583892617449
n=2: 0.8137583892617449
n=3: 0.8137583892617449

The above solution is marked with 0 points.

Task 2: Describe the algorithms

We learned about greedy decoding, beam search decoding, and Viterbi decoding. In this task, you are to describe the functionality of beam search vs. Viterbi decoding in your own words.

Describe what happens for each algorithm when it processes the n-th token of a sequence. You should address:

  1. what probabilities the algorithm calculates;
  2. what kind of information it stores;
  3. how it arrives at the full predicted label sequence in the end.

Here is an example describing the greedy decoding algorithm:

When processing the n-th token, the algorithm

  • calculates label probabilities conditioned on the label history for tokens (1, …, n−1);
  • picks the label with the highest probability as the prediction for token n.

The full predicted label sequence is then the sequence of the individual predictions made for each token.
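
To make the greedy example concrete, here is a minimal sketch of greedy decoding written against the MEMM interface used in Task 1 (memm.labels() and memm.predict_scores_hist); it is only meant to illustrate the description above, not to be part of your answer.

def greedy_decode(memm, x):
    labels = memm.labels()
    prediction = []
    for i in range(len(x)):
        # condition on the previously predicted label, or "PAD" at the start
        hist = [prediction[-1]] if prediction else ["PAD"]
        scores = memm.predict_scores_hist(x, i, hist)
        # keep the highest-scoring label for position i
        best = max(range(len(scores)), key=lambda j: scores[j])
        prediction.append(labels[best])
    return prediction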

The above solution is marked with 0 points.

Assessment 2: Assess your description (30 pts)

We will mark your description along the following dimensions:

  • Clarity (15pts: very clear, 0pts: we can’t figure out what you meant)
  • Correctness (15pts: you describe the algorithms exactly, 0pts: complete misunderstanding of the algorithms)

The above solution is marked with 0 points.

Task 3: Beam search vs. Viterbi

What are the individual advantages and disadvantages of the beam search algorithm vs. the Viterbi algorithm? When would you use beam search, and when would you use Viterbi?

The above solution is marked with 0 points.

Assessment 3: Assess your answer (20 pts)

We will mark your answer along the following dimensions:

  • Clarity (10pts: very clear, 0pts: we can’t figure out what you meant)
  • Substance (10pts: you provide good arguments, 0pts: you provide no meaningful arguments)