Introduction¶
In this assignment, we ask you to build neural language models with recurrent neural networks. We provide the starter code for you. You need to implement RNN models here and write a separate report describing your experiments. Simply copying from other sources will be considered as plagiarism.¶
Tasks¶
• [Task 1.1: 10 Pts] Additional data processing (5 Pts) with comments (5 Pts).
• [Task 1.2: 50 Pts] Complete this end-to-end pipeline with RNN and LSTM architectures and save the best model object for autograding (40 Pts). Clearly comment and explain your code using + Text functionality in Colab (10 Pts).
• [Task 2: 20 Pts] Hyper-parameters tuning using the validation set. You need to write a separate report describing your experiments by tuning three hyper-parameters. See more details in Task 3.
• [Task 3: 20 Pts] Submit the best model object and class to Vocareum for grading.
• [Task 4: Extra Credits] Try adding addtional linguistic features or other DL architectures to improve your model: char-RNN, attention mechanisum, etc. You have to implement these models in the framework we provide and clearly comment your code.
Simply copying from other sources will be considered as plagiarism.
Download Data and Tokenizer¶
In [1]:
! git clone https://github.com/rujunhan/CSCI-544.git
# Install Tokenizer
! pip install mosestokenizer
Cloning into ‘CSCI-544’…
remote: Enumerating objects: 16, done.[K
remote: Counting objects: 100% (16/16), done.[K
remote: Compressing objects: 100% (14/14), done.[K
remote: Total 16 (delta 1), reused 12 (delta 0), pack-reused 0[K
Unpacking objects: 100% (16/16), done.
Collecting mosestokenizer
[?25l Downloading https://files.pythonhosted.org/packages/45/c6/913c968e5cbcaff6cdd2a54a1008330c01a573ecadcdf9f526058e3d33a0/mosestokenizer-1.0.0-py3-none-any.whl (51kB)
[K |████████████████████████████████| 51kB 2.5MB/s
[?25hRequirement already satisfied: docopt in /usr/local/lib/python3.6/dist-packages (from mosestokenizer) (0.6.2)
Collecting openfile (from mosestokenizer)
Downloading https://files.pythonhosted.org/packages/93/e6/805db6867faacb488b44ba8e0829ef4de151dd0499f3c5da5f4ad11698a7/openfile-0.0.7-py3-none-any.whl
Collecting toolwrapper (from mosestokenizer)
Downloading https://files.pythonhosted.org/packages/ad/00/dba43b705ecb0d286576e78fc5b5d75c27ba3d1c4b80cf9a047ec3c6ad3f/toolwrapper-1.0.0.tar.gz
Building wheels for collected packages: toolwrapper
Building wheel for toolwrapper (setup.py) … [?25l[?25hdone
Created wheel for toolwrapper: filename=toolwrapper-1.0.0-cp36-none-any.whl size=3225 sha256=6f74f9defcbd9a5fe3a963cab317adb6fbaa80832122fe1a862f523ccc5b6b90
Stored in directory: /root/.cache/pip/wheels/5c/b0/2e/b8c414550c8372586ebaab634cf1f93733349cfe2b1d694fe8
Successfully built toolwrapper
Installing collected packages: openfile, toolwrapper, mosestokenizer
Successfully installed mosestokenizer-1.0.0 openfile-0.0.7 toolwrapper-1.0.0
Data Processing¶
Comments:¶
1. In read_tokenized(dir) make dir to string by using str(dir)
2. Function is_number is used to check if token is a number, if not, make it to lowercase, otherwise, making it $\langle num \rangle$.
3. tok_map function is used to mapping numbers and add $\langle bos \rangle$ and $\langle eos \rangle$ to each sentence.
In [2]:
from pathlib import Path
from tqdm import tqdm
from mosestokenizer import MosesTokenizer
import logging as log
log.basicConfig(level=log.INFO)
tokr = MosesTokenizer()
def read_tokenized(dir):
“””Tokenization wrapper”””
inputfile = open(str(dir))
for sent in inputfile:
yield tokr(sent.strip())
def is_number(token):
try:
float(token)
return ‘<' + token + '>‘
except ValueError:
return token.lower()
def tok_map(toks):
l = list(map(is_number,toks))
l.insert(0,’
l.append(‘
return l
train_file = Path(‘train.txt’)
with train_file.open(‘w’) as w:
for toks in tqdm(read_tokenized(Path(‘CSCI-544/hw2/train.txt’))):
w.write(” “.join(tok_map(toks)) + ‘\n’)
dev_file = Path(‘dev.txt’)
with dev_file.open(‘w’) as w:
for toks in tqdm(read_tokenized(Path(‘CSCI-544/hw2/dev.txt’))):
w.write(” “.join(tok_map(toks)) + ‘\n’)
INFO:mosestokenizer.tokenizer.MosesTokenizer:executing argv [‘perl’, ‘/usr/local/lib/python3.6/dist-packages/mosestokenizer/tokenizer-v1.1.perl’, ‘-q’, ‘-l’, ‘en’, ‘-b’, ‘-a’]
INFO:mosestokenizer.tokenizer.MosesTokenizer:spawned process 317
144526it [00:25, 5667.54it/s]
36131it [00:06, 5164.25it/s]
In [3]:
! ls -l
total 26488
drwxr-xr-x 4 root root 4096 Oct 17 04:33 CSCI-544
-rw-r–r– 1 root root 5418160 Oct 17 04:34 dev.txt
drwxr-xr-x 1 root root 4096 Aug 27 16:17 sample_data
-rw-r–r– 1 root root 21695899 Oct 17 04:34 train.txt
In [4]:
! head train.txt
Task 1.1: additional data processing [5 pts + 5 pts comments]¶
Modify the above data processing code by
1. making all tokens lower case
2. mapping all numbers to a special symbol $\langle num\rangle$
3. adding $\langle bos\rangle$ and $\langle eos\rangle$ to the beginning and the end of a sentence
NOTE¶
MAX_TYPES, MIN_FREQ and BATCH_SIZE are fixed hyper-parameters for data. You are NOT ALLOWED to change these for fair comparison. The auto-grading script on Vocareum also uses these fixed values, so make sure you don’t change them. We will ask you to experiment with other hyper-parameters related to model and report results.
In [7]:
from typing import List, Iterator, Set, Dict, Optional, Tuple
from collections import Counter
from pathlib import Path
import torch
RESERVED = [‘
PAD_IDX = 0
UNK_IDX = 1
MAX_TYPES = 10_000
BATCH_SIZE = 256
MIN_FREQ = 5
class Vocab:
“”” Mapper of words <--> index “””
def __init__(self, types):
# types is list of strings
assert isinstance(types, list)
assert isinstance(types[0], str)
self.idx2word = types
self.word2idx = {word: idx for idx, word in enumerate(types)}
assert len(self.idx2word) == len(self.word2idx) # One-to-One
def __len__(self):
return len(self.idx2word)
def save(self, path: Path):
log.info(f’Saving vocab to {path}’)
with path.open(‘w’) as wr:
for word in self.idx2word:
wr.write(f'{word}\n’)
@staticmethod
def load(path):
log.info(f’loading vocab from {path}’)
types = [line.strip() for line in path.open()]
for idx, tok in enumerate(RESERVED): # check reserved
assert types[idx] == tok
return Vocab(types)
@staticmethod
def from_text(corpus: Iterator[str], max_types: int,
min_freq: int = 5):
“””
corpus: text corpus; iterator of strings
max_types: max size of vocabulary
min_freq: ignore word types that have fewer ferquency than this number
“””
log.info(“building vocabulary; this might take some time”)
term_freqs = Counter(tok for line in corpus for tok in line.split())
for r in RESERVED:
if r in term_freqs:
log.warning(f’Found reserved word {r} in corpus’)
del term_freqs[r]
term_freqs = list(term_freqs.items())
log.info(f”Found {len(term_freqs)} types; given max_types={max_types}”)
term_freqs = {(t, f) for t, f in term_freqs if f >= min_freq}
log.info(f”Found {len(term_freqs)} after dropping freq < {min_freq} terms")
term_freqs = sorted(term_freqs, key=lambda x: x[1], reverse=True)
term_freqs = term_freqs[:max_types]
types = [t for t, f in term_freqs]
types = RESERVED + types # prepend reserved words
return Vocab(types)
train_file = Path('train.txt')
vocab_file = Path('vocab.txt')
if not vocab_file.exists():
train_corpus = (line.strip() for line in train_file.open())
vocab = Vocab.from_text(train_corpus, max_types=MAX_TYPES, min_freq=MIN_FREQ)
vocab.save(vocab_file)
else:
vocab = Vocab.load(vocab_file)
log.info(f'Vocab has {len(vocab)} types')
INFO:root:building vocabulary; this might take some time
INFO:root:Found 46401 types; given max_types=10000
INFO:root:Found 19374 after dropping freq < 5 terms
INFO:root:Saving vocab to vocab.txt
INFO:root:Vocab has 10002 types
In [8]:
import copy
class TextDataset:
def __init__(self, vocab: Vocab, path: Path):
self.vocab = vocab
log.info(f'loading data from {path}')
# for simplicity, loading everything to memory; on large datasets this will cause OOM
text = [line.strip().split() for line in path.open()]
# words to index; out-of-vocab words are replaced with UNK
xs = [[self.vocab.word2idx.get(tok, UNK_IDX) for tok in tokss]
for tokss in text]
self.data = xs
log.info(f"Found {len(self.data)} records in {path}")
def as_batches(self, batch_size, shuffle=False): # data already shuffled
data = self.data
if shuffle:
random.shuffle(data)
for i in range(0, len(data), batch_size): # i incrememt by batch_size
batch = data[i: i + batch_size] # slice
yield self.batch_as_tensors(batch)
@staticmethod
def batch_as_tensors(batch):
n_ex = len(batch)
max_len = max(len(seq) for seq in batch)
seqs_tensor = torch.full(size=(n_ex, max_len), fill_value=PAD_IDX,
dtype=torch.long)
for i, seq in enumerate(batch):
seqs_tensor[i, 0:len(seq)] = torch.tensor(seq)
return seqs_tensor
train_data = TextDataset(vocab=vocab, path=train_file)
dev_data = TextDataset(vocab=vocab, path=Path('dev.txt'))
INFO:root:loading data from train.txt
INFO:root:Found 144526 records in train.txt
INFO:root:loading data from dev.txt
INFO:root:Found 36131 records in dev.txt
In [0]:
import torch.nn as nn
class FNN_LM(nn.Module):
def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, dropout=0.2):
super(FNN_LM, self).__init__()
self.embedding = nn.Embedding(num_embeddings=vocab_size,
embedding_dim=emb_dim,
padding_idx=PAD_IDX)
self.linear1 = nn.Linear(emb_dim, hid)
self.linear2 = nn.Linear(hid, n_class)
self.dropout = nn.Dropout(p=dropout)
def forward(self, seqs, log_probs=True):
"""Return log Probabilities"""
batch_size, max_len = seqs.shape
embs = self.embedding(seqs) # embs[Batch x SeqLen x EmbDim]
embs = self.dropout(embs)
embs = embs.sum(dim=1) # sum over all all steps in seq
hid_activated = torch.relu(self.linear1(embs)) # Non linear
scores = self.linear2(hid_activated)
if log_probs:
return torch.log_softmax(scores, dim=1)
else:
return torch.softmax(scores, dim=1)
In [0]:
def save_model_object(model):
torch.save({'state_dict': model.state_dict()}, "best_model.pt")
return
In [11]:
# Trainer Optimizer
import time
from tqdm import tqdm
import torch.optim as optim
def train(model, n_epochs, batch_size, train_data, valid_data, device=torch.device('cuda')):
log.info(f"Moving model to {device}")
model = model.to(device) # move model to desired device
optimizer = optim.Adam(params=model.parameters())
log.info(f"Device for training {device}")
losses = []
for epoch in range(n_epochs):
start = time.time()
num_toks = 0
train_loss = 0.
n_train_batches = 0
model.train() # switch to train mode
with tqdm(train_data.as_batches(batch_size=BATCH_SIZE), leave=False) as data_bar:
for seqs in data_bar:
seq_loss = torch.zeros(1).to(device)
for i in range(1, seqs.size()[1]-1):
# Move input to desired device
cur_seqs = seqs[:, :i].to(device) # take w0...w_(i-1) python indexing
cur_tars = seqs[:, i].to(device) # predict w_i
log_probs = model(cur_seqs)
seq_loss += loss_func(log_probs, cur_tars).sum() / len(seqs)
seq_loss /= (seqs.shape[1] - 1) # only n-1 toks are predicted
train_loss += seq_loss.item()
n_train_batches += 1
optimizer.zero_grad() # clear grads
seq_loss.backward()
optimizer.step()
pbar_msg = f'Loss:{seq_loss.item():.4f}'
data_bar.set_postfix_str(pbar_msg)
# Run validation
with torch.no_grad():
model.eval() # switch to inference mode -- no grads, dropouts inactive
val_loss = 0
n_val_batches = 0
for seqs in valid_data.as_batches(batch_size=batch_size, shuffle=False):
# Move input to desired device
seq_loss = torch.zeros(1).to(device)
for i in range(1, seqs.size()[1]-1):
# Move input to desired device
cur_seqs = seqs[:, :i].to(device)
cur_tars = seqs[:, i].to(device)
log_probs = model(cur_seqs)
seq_loss += loss_func(log_probs, cur_tars).sum() / len(seqs)
seq_loss /= (seqs.shape[1] - 1)
val_loss += seq_loss.item()
n_val_batches += 1
save_model_object(model)
avg_train_loss = train_loss / n_train_batches
avg_val_loss = val_loss / n_val_batches
losses.append((epoch, avg_train_loss, avg_val_loss))
log.info(f"Epoch {epoch} complete; Losses: Train={avg_train_loss:G} Valid={avg_val_loss:G}")
return losses
model = FNN_LM(vocab_size=len(vocab), n_class=len(vocab))
loss_func = nn.NLLLoss(reduction='none')
losses = train(model, n_epochs=5, batch_size=BATCH_SIZE, train_data=train_data,
valid_data=dev_data)
INFO:root:Moving model to cuda
INFO:root:Device for training cuda
INFO:root:Epoch 0 complete; Losses: Train=2.4707 Valid=2.20372
INFO:root:Epoch 1 complete; Losses: Train=2.15871 Valid=2.14489
INFO:root:Epoch 2 complete; Losses: Train=2.10921 Valid=2.11441
INFO:root:Epoch 3 complete; Losses: Train=2.08006 Valid=2.09363
INFO:root:Epoch 4 complete; Losses: Train=2.05915 Valid=2.07754
In [12]:
! ls -l
total 32496
-rw-r--r-- 1 root root 6062606 Oct 16 02:12 best_model.pt
drwxr-xr-x 4 root root 4096 Oct 16 01:05 CSCI-544
-rw-r--r-- 1 root root 5418160 Oct 16 01:12 dev.txt
drwxr-xr-x 1 root root 4096 Aug 27 16:17 sample_data
-rw-r--r-- 1 root root 21695899 Oct 16 01:12 train.txt
-rw-r--r-- 1 root root 84229 Oct 16 01:55 vocab.txt
Task 1.2: RNNs [50 pts]¶
1. Under the given FNN_LM framework, modify the code to implement RNN model [15 pts + 5 pts comments]
2. Repeat this step for LSTM model [15 pts + 5 pts comments]
3. Write a report comparing your results for these three models [10 pts]
In [0]:
class RNN_LM(nn.Module):
def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, num_layers=1, dropout=0.1):
super(RNN_LM, self).__init__()
self.embedding = nn.Embedding(num_embeddings=vocab_size,
embedding_dim=emb_dim, padding_idx=PAD_IDX)
def forward(self, seqs):
batch_size, max_len = seqs.shap
class FNN_LM(nn.Module):
def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, dropout=0.2):
super(FNN_LM, self).__init__()
self.embedding = nn.Embedding(num_embeddings=vocab_size,
embedding_dim=emb_dim,
padding_idx=PAD_IDX)
self.linear1 = nn.Linear(emb_dim, hid)
self.linear2 = nn.Linear(hid, n_class)
self.dropout = nn.Dropout(p=dropout)
def forward(self, seqs, log_probs=True):
"""Return log Probabilities"""
batch_size, max_len = seqs.shape
embs = self.embedding(seqs) # embs[Batch x SeqLen x EmbDim]
embs = self.dropout(embs)
embs = embs.sum(dim=1) # sum over all all steps in seq
hid_activated = torch.relu(self.linear1(embs)) # Non linear
scores = self.linear2(hid_activated)
if log_probs:
return torch.log_softmax(scores, dim=1)
else:
return torch.softmax(scores, dim=1)
In [0]:
class LSTM_LM(nn.Module):
def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, num_layers=1, dropout=0.1):
super(LSTM_LM, self).__init__()
self.embedding = nn.Embedding(num_embeddings=vocab_size,
embedding_dim=emb_dim, padding_idx=PAD_IDX)
def forward(self, seqs):
batch_size, max_len = seqs.shape
Task 2: Hyper-parameters Tuning [20 pts]¶
You may observe that there are multiple hyper-parameters used in the pipeline. Choose 3 out of 5 following hyper-parameters and try at least 5 different values for each of them and report their corresponding performances on the train / dev datasets. Explain why you think larger or smaller values may cause the differenes you observed.
1. emb_dim: embedding size
2. hid: hidden layer dimension
3. num_layers: number of RNN layers
4. dropout ratio
5. n_epochs
In [0]:
from google.colab import files
files.download("best_model.pt")
files.download("vocab.txt")
Task 3: Submitting your model class and object¶
1. After you find the best model architecture, rename your best model as BEST_MODEL and re-run train() to save your model.
2. Download model object and locate it (best_model.pt file) in your local direcotry and submit it to Vocareum.
3. Copy your BEST_MODEL class into a python script: best_model.py and submit it to Vocareum.
4. Download you vocab.txt file and submit it with your model files.
In summary, you will need a best_model.py file, a best_model.pt object and a vocab.txt file to successfully run the auto-grading on Vocareum.
We made the evaluation code visible (but not editable) to everyone on Vocareum. You can find it here: resource/asnlib/public/evaluation.py
See below for an example. Rename FNN() class as BEST_MODEL. Modify and save the entire script below as best_model.py
In [0]:
import torch
class BEST_MODEL(torch.nn.Module):
def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, dropout=0.2):
super(BEST_MODEL, self).__init__()
self.embedding = torch.nn.Embedding(num_embeddings=vocab_size,
embedding_dim=emb_dim,
padding_idx=0)
self.linear1 = torch.nn.Linear(emb_dim, hid)
self.linear2 = torch.nn.Linear(hid, n_class)
self.dropout = torch.nn.Dropout(p=dropout)
def forward(self, seqs, log_probs=True):
"""Return log Probabilities"""
batch_size, max_len = seqs.shape
embs = self.embedding(seqs) # embs[Batch x SeqLen x EmbDim]
embs = self.dropout(embs)
embs = embs.sum(dim=1) # sum over all all steps in seq \
hid_activated = torch.relu(self.linear1(embs)) # Non linear
scores = self.linear2(hid_activated)
if log_probs:
return torch.log_softmax(scores, dim=1)
else:
return torch.softmax(scores, dim=1)
Task 4: [Extra Credits 5Pts]¶
Enhance the current model with additional linguistic features or DL models
In [0]: