10/20/21, 7:47 PM Project3-instructions
localhost:8888/nbconvert/html/Project3-instructions.ipynb?download=false
Project 3: Training a BiLSTM POS tagger Due: November 4, 2021
The goal of this project is to train a BiLSTM model for sequence labeling.
Task definition:
POS tagging is the task of assigning a POS tag to each word token in a sentence, and it is a typical sequence labeling problem. The provided code has a skeletal implementation of a CRF tagger, complete with a Viterbi decoder and a function for computing the negative log-likelihood of a sentence with the forward algorithm.
You are asked to write a training routine to train a sequence labeling model with the provided training set from the Penn TreeBank. Make sure you include code in the training routine that reports the total loss over the entire training set after each training iteration (epoch), so that you can observe the change in the total loss from iteration to iteration. If the training goes well, the loss should keep going down. If you see drastic upward and downward swings in the training loss, that’s a sign that something is not working properly. You should also include code that reports the accuracy of the model on the development set every 5 or 10 iterations, so that you can observe the trend in prediction accuracy. If the accuracy plateaus or starts to go down, that’s a sign you should stop training.
Additionally, you are asked to write an alternative (and simpler) per-token local softmax decoder that finds the best tag for each word token individually. You also need to write a corresponding negative log-likelihood loss function for such a greedy decoding process. Recall that in this case the negative log-likelihood loss of a sentence is the sum of the losses for the individual word tokens in the sentence. This will allow you to compare the performance of this simpler alternative with the CRF model.
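The per-token loss and greedy decoder described above could look roughly like the sketch below. It is only an illustration, not the reference solution: the function names `local_neg_log_likelihood` and `greedy_decode` are hypothetical, and it assumes the BiLSTM has already produced a `(sentence_length, num_tags)` matrix of unnormalized tag scores.

```python
import torch
import torch.nn.functional as F

def local_neg_log_likelihood(feats: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
    """Sum of per-token negative log-likelihoods for one sentence.

    feats: (sentence_length, num_tags) unnormalized tag scores (assumed input).
    tags:  (sentence_length,) gold tag indices.
    """
    # cross_entropy applies log-softmax over the tag dimension and picks the
    # gold tag; reduction="sum" adds up the losses of the individual tokens.
    return F.cross_entropy(feats, tags, reduction="sum")

def greedy_decode(feats: torch.Tensor) -> list:
    """Pick the highest-scoring tag for each token independently."""
    return feats.argmax(dim=1).tolist()

# Toy example with 3 tokens and 4 tags (the numbers are made up).
feats = torch.tensor([[2.0, 0.1, 0.1, 0.1],
                      [0.1, 0.1, 3.0, 0.1],
                      [0.1, 1.5, 0.1, 0.1]])
gold = torch.tensor([0, 2, 1])
loss = local_neg_log_likelihood(feats, gold)
pred = greedy_decode(feats)
```

Because each token is scored on its own, no transition matrix or dynamic programming is involved, which is why this variant trains faster than the CRF.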
Data sets:
We will be using the standard train / development / test split in the Penn TreeBank for our experiments: Sections 02-21 are used for training, Section 22 is used for development, and Section 23 is used as the final test set. You can use the development set to select the best model architecture and tune the hyperparameters. When you are done training and tuning, you need to run your code on the test set and produce an automatically tagged version of it. The data format is very straightforward: each line of the data file contains one sentence. For the training and dev data, you are provided with the sentences with their gold POS tags. For the test set, you are only given the word tokens. The TAs will run your code on the test set to get the accuracy of your model.
Experiments
As with linear models, having the right feature representation is crucial to the performance of a neural model. In neural models, you can no longer tweak the feature templates directly. However, you can engineer a neural architecture to capture information that is analogous to features in linear models. For instance, the BiLSTM network captures the left and right context, similar to previous and next word features in linear models. To capture affix information that might be helpful for POS tagging, you can experiment with character-level CNNs
with various pooling techniques (e.g., max pooling). To capture previous tag features, you can use a transition matrix between tags. You are asked to run a number of experiments and report results on the provided development set in this assignment.
1. Experiment with using pre-trained word embeddings such as GloVe (https://nlp.stanford.edu/projects/glove/) or fastText (https://fasttext.cc/docs/en/english-vectors.html). These embeddings come in different dimensions; start with a smaller dimension to make sure that your computing environment has sufficient memory for it and that your model can train sufficiently fast. You are advised to “freeze” the word embeddings, i.e., not to update them during your training process. Compare the results from using the best pre-trained embeddings and using randomly initialized embeddings to see if there is any difference in performance.
2. Experiment with using character CNNs. Intuitively, affixes or other parts of a word can be useful information. Typically this information is captured with character-level CNNs in neural models. Add a character-level CNN, concatenate the output of the CNN with the word embeddings, and see if this improves the performance of your BiLSTM model. Compare the results of using vs. not using character CNNs.
3. Do the first two experiments with a local softmax decoder, i.e., making predictions individually for each word in the sentence, as this will train faster. You should be able to train your model with (the sum of) a per-token negative log-likelihood loss on the entire training set within a reasonable amount of time. In the final experiment, you are asked to compare the results for the greedy per-token local loss with the global negative log-likelihood loss on the entire tag sequence with a BiLSTM-CRF model. Training a BiLSTM-CRF model is expensive, so use the first 10,000 sentences to train the CRF model instead of the entire training set. For an apples-to-apples comparison, also train your local greedy softmax model with the same training set so that you can observe which model yields superior performance.
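For experiment 2, one possible shape for a character-level CNN with max pooling is sketched below. The class name `CharCNN`, the layer sizes, and the choice of index 0 for padding are all assumptions made for illustration; the handout does not prescribe them.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Hypothetical character encoder: embed characters, convolve, max-pool."""

    def __init__(self, num_chars, char_dim=16, num_filters=30, kernel_size=3):
        super().__init__()
        # Index 0 is assumed to be the padding character.
        self.embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len), with max_word_len >= kernel_size
        x = self.embed(char_ids).transpose(1, 2)  # (num_words, char_dim, max_word_len)
        x = torch.relu(self.conv(x))              # (num_words, num_filters, positions)
        return x.max(dim=2).values                # max pooling over character positions

# One sentence of 5 words, each padded to 7 characters (random ids for the demo).
char_cnn = CharCNN(num_chars=40)
char_ids = torch.randint(1, 40, (5, 7))
char_feats = char_cnn(char_ids)

# Concatenate with (here random) 50-dimensional word embeddings before the BiLSTM.
word_emb = torch.randn(5, 50)
lstm_input = torch.cat([word_emb, char_feats], dim=1)
```

With this layout the BiLSTM input size becomes the word embedding dimension plus `num_filters`, so the LSTM constructor would need to be adjusted accordingly.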
Some implementation tips:
1. To take advantage of the GPU accelerator, you need to move all PyTorch tensors to the GPU using to(device) or cuda().
2. A common first problem is that you’ll use up all memory quickly, leading to an “out of memory” error. It is important to realize that PyTorch keeps a computation history in order to compute the gradient. For instance, if you add up the losses for individual sentences, you may keep accumulating history and use up the memory. So instead of doing something like “total_loss += sent_loss”, do “total_loss += float(sent_loss)” to strip off the history. Also delete variables you no longer need to free up memory.
3. Dealing with unknown words: add UNK to the training vocabulary so that if there is an out-of-vocabulary (OOV) word in the development set, you can map it to UNK so that it still gets labeled.
4. If the loss (negative log-likelihood) swings up and down, you may need to adjust the learning rate (or choose a different optimizer). We suggest that you use the Adam optimizer, as it adapts the step size for each parameter and is usually less sensitive to the initial learning rate than plain SGD.
5. In a typical deep learning model, there are many hyper-parameters that need to be set manually (learning rate / choice of optimizer, embedding and hidden dimensions, CNN kernel sizes, number of training iterations, etc.). This leads to many different combinations of hyper-parameters, and it is hard to do an exhaustive search to find the best combination. One common technique for searching for the best set of hyper-parameters is grid search, which lets you specify plausible values for each hyper-parameter and try the combinations systematically.
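As a rough illustration of the grid search mentioned in tip 5, the sketch below enumerates every combination of candidate values with `itertools.product`. The helper `grid_search` and the stand-in scoring function `fake_train_and_eval` are hypothetical; in practice the scoring function would train a model with the given configuration and return its development-set accuracy.

```python
import itertools

def grid_search(grid, train_and_eval):
    """Try every combination in `grid` and return the best (config, score)."""
    best_score, best_config = float("-inf"), None
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = train_and_eval(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

def fake_train_and_eval(config):
    # Stand-in for "train a model and return dev accuracy"; it simply rewards
    # configurations close to hidden_dim=100 and learning_rate=0.002.
    return -abs(config["hidden_dim"] - 100) - 1000 * abs(config["learning_rate"] - 0.002)

grid = {
    "learning_rate": [0.001, 0.002, 0.005],
    "hidden_dim": [50, 100, 200],
}
best_config, best_score = grid_search(grid, fake_train_and_eval)
```

The number of runs is the product of the list lengths (9 here), so it helps to keep each list short and to use a reduced training set while searching.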
Write a report that a) briefly describes the structure of your code, b) presents your experimental settings and results, and c) discusses any insights that you have gained from your experiments. Your report should be no longer than 5 pages.
Project evaluation criteria
Your project will be evaluated on the correctness and thoroughness of your implementation (performing all required experiments) and on the performance of your best model, which usually reflects the correctness and thoroughness of your model as well as the proper selection of hyper-parameters. Your project will also be evaluated on creativity (e.g., surprising model components that lead to consistent improvement in
import os
import random

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tqdm

# Hyperparameters
NUM_EPOCHS = 5
LEARNING_RATE = 0.002
EMBED_DIM = 50
HIDDEN_DIM = 50
NUM_LAYERS = 1
BIDIRECTIONAL = True
DROPOUT = 0.2
SEED = 1334
DEVICE_ID = 0

os.environ["CUDA_VISIBLE_DEVICES"] = f"{DEVICE_ID}"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
2. Data Preparation
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
def read_in_gold_data(filename):
    """Read in the labeled gold data into a list"""
    with open(filename) as f:
        for line in f:
            tuples = [tup.split('_') for tup in line.split()]
            tokens = [tup[0] for tup in tuples]
            tags = [tup[1] for tup in tuples]
            yield (tokens, tags)
def batchify(dataset, batch_size=50):
    """Divide the training set into batches for mini-batch training"""
    dataset = list(dataset)
    # Step through the data one batch at a time; the last batch may be
    # shorter when the dataset size is not a multiple of batch_size.
    for start in range(0, len(dataset), batch_size):
        yield dataset[start : start + batch_size]
def read_in_plain_data(filename):
    """Read in plain text data for sequence labeling, assuming a one-sentence-per-line format"""
    with open(filename) as f:
        lines = f.readlines()
        lines = [line.split() for line in lines]
    return lines
def argmax(vec):
    # return the argmax as a python int
    _, idx = torch.max(vec, 1)
    return idx.item()
def prepare_sequence(seq, to_ix, to_chidx):
    UNK = '
    idxs = [to_ix[w] if w in to_ix else to_ix[UNK] for w in seq]
    chseqs = [[c.lower() for c in token] for token in seq]
    ch_padded = pad_seq(chseqs, to_chidx)
    return torch.tensor(idxs, dtype=torch.long), torch.tensor(ch_padded, dtype=torch.long)
def pad_seq(seqs, to_charidx):
    PAD = '
    maxlen = max([len(seq) for seq in seqs])
    if maxlen < 3:
        maxlen = 3
    for seq in seqs:
        seq += ["
    padded = [[to_charidx[ch] for ch in seq] for seq in seqs]
    return padded
def id_to_str(tagidseq, to_str):
    return [to_str[tagid] for tagid in tagidseq]
# Compute log sum exp in a numerically stable way for the forward algorithm
def log_sum_exp(vec):
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
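To see why the maximum is subtracted before exponentiating, the sketch below compares a naive computation with a stable one and with PyTorch's built-in `torch.logsumexp`. The stable function here is a simplified re-statement of the helper above, written so that the example runs on its own.

```python
import torch

def log_sum_exp(vec):
    """Stable log-sum-exp over a (1, n) row vector."""
    max_score = vec.max()
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score)))

vec = torch.tensor([[1000.0, 1001.0, 999.0]])

naive = torch.log(torch.sum(torch.exp(vec)))       # exp(1000) overflows in float32
stable = log_sum_exp(vec)
builtin = torch.logsumexp(vec, dim=1).squeeze(0)   # PyTorch's own stable version
```

Using `torch.logsumexp` along a dimension is also a convenient way to vectorize the forward algorithm over all tags at once instead of looping over them.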
## Code for evaluating the performance of the tagger on a test/development set
def compare_tagseq(goldseq, predicted_seq):
    """Compare two sequences and output the length of the sequence
    and the number of tags they share. A helper function to the evaluation function"""
    pairs = zip(goldseq, predicted_seq)
    correct = len([1 for pair in pairs if pair[0] == pair[1]])
    return len(goldseq), correct

def eval_model(devset, word_to_ix, ix_to_tag, char_to_ix, model):
    """Given a development set and a model, compute the accuracy of the model prediction
    on the development set"""
    total_tokens = 0
    correct = 0
    with torch.no_grad():
        for sent, tags in devset:
            word_tensor, char_tensor = prepare_sequence(sent, word_to_ix, char_to_ix)
            sentence_in = (word_tensor.to(device), char_tensor.to(device))
            scores, tagidseq = model(sentence_in)
            predicted_tags = id_to_str(tagidseq, ix_to_tag)
            sent_length, length_correct = compare_tagseq(tags, predicted_tags)
            correct += length_correct
            total_tokens += sent_length
    return correct / total_tokens

## Code to read in pre-trained word embeddings
def glove2dict(glove_path, emb_dim=50):
    """Read in glove embeddings and create a dictionary"""
    glove_dict = {}
    if emb_dim == 50:
        fname = "glove.6B.50d.txt"
    elif emb_dim == 100:
        fname = "glove.6B.100d.txt"
    elif emb_dim == 200:
        fname = "glove.6B.200d.txt"
    elif emb_dim == 300:
        fname = "glove.6B.300d.txt"
    else:
        print("Inappropriate glove size chosen, using 50")
        fname = "glove.6B.50d.txt"
    with open(f'{glove_path}/{fname}', 'rb') as f:
        for l in f:
            line = l.decode().split()
            word = line[0]
            # np.float was removed from recent NumPy releases; the builtin float is equivalent
            vect = np.array(line[1:]).astype(float)
            glove_dict[word] = vect
    return glove_dict

def create_glove_embeddings(glove_path, target_vocab_index, emb_dim=50):
    """create the glove embeddings for a target dictionary"""
    glove_dict = glove2dict(glove_path, emb_dim)
    matrix_len = len(target_vocab_index)
    weight_matrix = np.zeros((matrix_len, emb_dim))
    words_found = 0

# artificial sentence
test_data = "Alphabet_NN Inc._NN showed_VB new_JJ TV_NN ads_NN yesterday

4. Vectorized Data
src_itos, tgt_itos = set(), set()
for sent in training_data:
    sent = [x.split('_') for x in sent.split()]
    sent_src, sent_tgt = zip(*sent)
    src_itos.update(sent_src)
    tgt_itos.update(sent_tgt)
src_itos, tgt_itos = sorted(src_itos), sorted(tgt_itos)
UNK = '
BOS = '
src_itos = [UNK] + src_itos
src_stoi = {word: i for i, word in enumerate(src_itos)}
src_vocab = (src_itos, src_stoi)
tgt_itos = [BOS, EOS] + tgt_itos
tgt_stoi = {word: i for i, word in enumerate(tgt_itos)}
tgt_vocab = (tgt_itos, tgt_stoi)
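The body of `create_glove_embeddings` is cut off in this handout, so the following is only a guess at how such a weight matrix might be filled and then frozen. The helper `build_weight_matrix`, the toy vectors, the toy vocabulary, and the random initialization for words without a pre-trained vector are all assumptions; the lookup lowercases each word on the assumption that the pre-trained vectors are lowercased.

```python
import numpy as np
import torch
import torch.nn as nn

def build_weight_matrix(vectors, vocab_index, emb_dim, seed=0):
    """Hypothetical helper: one row per vocabulary word, copied from `vectors`
    when available and randomly initialized otherwise."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab_index), emb_dim), dtype=np.float32)
    found = 0
    for word, idx in vocab_index.items():
        vec = vectors.get(word.lower())  # assumes lowercased pre-trained vectors
        if vec is not None:
            matrix[idx] = vec
            found += 1
        else:
            matrix[idx] = rng.normal(scale=0.6, size=emb_dim)
    return matrix, found

# Toy stand-ins for the GloVe dictionary and the word-to-index vocabulary.
toy_vectors = {"the": np.ones(4), "cat": np.full(4, 2.0)}
toy_vocab = {"UNKNOWN": 0, "The": 1, "cat": 2, "zyzzyva": 3}

matrix, found = build_weight_matrix(toy_vectors, toy_vocab, emb_dim=4)

# freeze=True keeps the embeddings fixed during training, as advised in experiment 1.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=True)
```

Reporting `found` relative to the vocabulary size gives a quick check of how much of the training vocabulary the pre-trained vectors actually cover.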
def convert_seq(seq, vocab, is_target=False):
    if type(seq) is str:
        seq = seq.split()
    out_seq = []
    for tok in seq:
        if tok in vocab:
            out_seq.append(vocab[tok])
        else:
            # unknown tags are an error; unknown words are mapped to UNK
            if is_target:
                raise RuntimeError(f"Unknown target token: `{repr(tok)}` from vocab: {', '.join(vocab)}")
            else:
                out_seq.append(vocab[UNK])
    return out_seq
training_vectors = []
for sent in training_data:
    sent = [x.split('_') for x in sent.split()]
    src, tgt = zip(*sent)
    src = torch.tensor([convert_seq(src, src_stoi)], dtype=torch.long)
    tgt = torch.tensor(convert_seq(tgt, tgt_stoi, is_target=True), dtype=torch.long)
    training_vectors.append((src, tgt))
5. BiLSTM-CRF POS Tagger Implementation
test_vector = [x.split('_') for x in test_data.split()]
test_src, test_tgt = zip(*test_vector)
test_vector = [
    torch.tensor([convert_seq(test_src, src_stoi)], dtype=torch.long),
    torch.tensor(convert_seq(test_tgt, tgt_stoi, is_target=True), dtype=torch.long)
]
input_dim = len(src_itos)
output_dim = len(tgt_itos)
input_dim, output_dim
class BiLSTM(nn.Module):
    def __init__(self, input_dim, output_dim, device):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=NUM_LAYERS,
                            batch_first=True, bidirectional=BIDIRECTIONAL)
        # project to vocab space
        self.linear = nn.Linear(HIDDEN_DIM*2 if BIDIRECTIONAL else HIDDEN_DIM, output_dim)
        self.dropout = nn.Dropout(DROPOUT)
        self.device = device
        self.hidden = self.init_hidden()

    def init_hidden(self):
        direction_multiplier = 2 if BIDIRECTIONAL else 1
        return (torch.randn(direction_multiplier * NUM_LAYERS, 1, HIDDEN_DIM, device=self.device),  # h0
                torch.randn(direction_multiplier * NUM_LAYERS, 1, HIDDEN_DIM, device=self.device))  # c0

    def forward(self, x):
        embed = self.embedding(x)
        embed = self.dropout(embed)
        outputs, _ = self.lstm(embed, self.hidden)
        outputs = self.linear(outputs)
        # assumes batch size of 1; only the batch dimension is removed here,
        # so a one-word sentence keeps its (length, tags) shape
        return outputs.squeeze(0)
class CRF(nn.Module):
    """TODO: Implement CRF forward, score and viterbi functions"""
    def __init__(self, tgt_vocab, device):
        super().__init__()
        self.tgt_itos, self.tgt_stoi = tgt_vocab
        self.tag_size = len(self.tgt_itos)
        self.device = device
        # transition matrix
        self.transitions = nn.Parameter(torch.randn(self.tag_size, self.tag_size, device=self.device))
        self.transitions.data[self.tgt_stoi[BOS], :] = -1000.
        self.transitions.data[:, self.tgt_stoi[EOS]] = -1000.

    def forward(self, feats):
        raise NotImplementedError("Implement this")

    def score(self, feats, tags):
        raise NotImplementedError("Implement this")

    def viterbi(self, feats):
        raise NotImplementedError("Implement this")
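For orientation, here is one way the forward algorithm and the gold-path score could be written as standalone functions. This is a sketch rather than the course's reference implementation. It assumes that `transitions[i, j]` holds the score of moving to tag `i` from tag `j`, which is one reading of the two `-1000.` assignments above (nothing may enter BOS and nothing may leave EOS). The function names and the toy sizes are hypothetical. A brute-force enumeration over all tag sequences is included as a sanity check.

```python
import itertools
import torch

def crf_log_partition(feats, transitions, bos, eos):
    """Forward algorithm: log of the summed exp-scores of all tag sequences.

    feats: (length, num_tags) emission scores.
    transitions[i, j]: score of moving TO tag i FROM tag j (assumed convention).
    """
    alpha = transitions[:, bos] + feats[0]
    for feat in feats[1:]:
        # Entry [next, prev] is alpha[prev] + transitions[next, prev];
        # summing over prev in log space gives the new alpha for each next tag.
        alpha = torch.logsumexp(alpha.unsqueeze(0) + transitions, dim=1) + feat
    return torch.logsumexp(alpha + transitions[eos], dim=0)

def path_score(feats, transitions, tags, bos, eos):
    """Score of one specific tag sequence under the same convention."""
    score = transitions[tags[0], bos] + feats[0, tags[0]]
    for t in range(1, len(tags)):
        score = score + transitions[tags[t], tags[t - 1]] + feats[t, tags[t]]
    return score + transitions[eos, tags[-1]]

torch.manual_seed(0)
num_tags, length, bos, eos = 4, 3, 0, 1
feats = torch.randn(length, num_tags)
transitions = torch.randn(num_tags, num_tags)
transitions[bos, :] = -1000.0   # never transition into BOS
transitions[:, eos] = -1000.0   # never transition out of EOS

log_z = crf_log_partition(feats, transitions, bos, eos)

# Brute force over all num_tags ** length sequences, feasible only for toy sizes.
all_scores = torch.stack([
    path_score(feats, transitions, seq, bos, eos)
    for seq in itertools.product(range(num_tags), repeat=length)
])
brute_force = torch.logsumexp(all_scores, dim=0)

gold = (2, 3, 2)
nll = log_z - path_score(feats, transitions, gold, bos, eos)
```

The difference `log_z - path_score(...)` is the sentence-level negative log-likelihood, matching the `forward_score - gold_score` pattern used by the `BiLSTM_CRF` class in this handout.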
class BiLSTM_CRF(nn.Module):
    def __init__(self, input_dim, output_dim, tgt_vocab, device):
        super().__init__()
        self.lstm = BiLSTM(input_dim, output_dim, device=device)
        self.crf = CRF(tgt_vocab, device=device)

    def neg_log_likelihood(self, src, tgt):
        """Compute the negative log likelihood of a sentence given its gold POS tags"""
        feats = self.lstm(src)
        forward_score = self.crf(feats)
        gold_score = self.crf.score(feats, tgt)
        return forward_score - gold_score

    def forward(self, src):
        """Tag a single sentence"""
        feats = self.lstm(src)
        out = self.crf.viterbi(feats)
        return out
class BiLSTM_greedy(nn.Module):
    def __init__(self, input_dim, output_dim, tgt_vocab, device):
        super().__init__()
        self.lstm = BiLSTM(input_dim, output_dim, device)

    def neg_log_likelihood(self, src, tgt):
        """Compute the negative log likelihood of a sentence given its gold POS tags"""
        raise NotImplementedError("Implement this")

    def forward(self, src):
        """Tag a single sentence"""
        raise NotImplementedError("Implement this")
A. Before Training
model = BiLSTM_CRF(input_dim, output_dim, tgt_vocab, device)
model = model.to(device)
test_src, test_tgt = test_vector
" ".join(src_itos[x] for x in test_src.squeeze().tolist())
" ".join(tgt_itos[x] for x in test_tgt.tolist())
score, pred_seq = model(test_src.to(device))
score.item()
" ".join(tgt_itos[x] for x in test_tgt.tolist())
" ".join(tgt_itos[x] for x in torch.cat(pred_seq, 0).tolist())
B. Training
optimizer = optim.Adam(model.parameters(), LEARNING_RATE)
for i in range(NUM_EPOCHS):
    epoch_loss = 0.
    num_correct = 0
    num_tokens = 0
    model.train()
    for src, tgt in tqdm.tqdm(training_vectors, desc=f"[Training {i+1}/{NUM_EPOCHS}]"):
        src = src.to(device)
        tgt = tgt.to(device)
        model.zero_grad()
        loss = model.neg_log_likelihood(src, tgt)
        # accumulate a python float so the computation history is not kept alive
        epoch_loss += loss.item()
        loss.backward()
        optimizer.step()
        # predictions are only used for reporting, so no gradients are needed
        with torch.no_grad():
            _, pred = model(src)
        num_correct += (torch.cat(pred, 0) == tgt).sum().item()
        num_tokens += len(tgt)
    epoch_acc = num_correct / num_tokens
    print("Training Epoch # {} Loss: {:.2f} Acc: {:.2f}".format(i+1, epoch_loss, epoch_acc))
C. After Training
score, pred_seq = model(test_src.to(device))
score.item()
" ".join(src_itos[x] for x in test_src.squeeze().tolist())
" ".join(tgt_itos[x] for x in test_tgt.tolist())
" ".join(tgt_itos[x] for x in torch.cat(pred_seq, 0).tolist())