程序代写代做代考 python cuda cache Introduction¶

Introduction¶
In this assignment, we ask you to build neural language models with recurrent neural networks. We provide the starter code for you. You need to implement RNN models here and write a separate report describing your experiments. Simply copying from other sources will be considered as plagiarism.¶

Tasks¶
• [Task 1.1: 10 Pts] Additional data processing (5 Pts) with comments (5 Pts). 
• [Task 1.2: 50 Pts] Complete this end-to-end pipeline with RNN and LSTM architectures and save the best model object for autograding (40 Pts). Clearly comment and explain your code using + Text functionality in Colab (10 Pts). 
• [Task 2: 20 Pts] Hyper-parameters tuning using the validation set. You need to write a separate report describing your experiments by tuning three hyper-parameters. See more details in Task 3.
• [Task 3: 20 Pts] Submit the best model object and class to Vocareum for grading.
• [Task 4: Extra Credits] Try adding addtional linguistic features or other DL architectures to improve your model: char-RNN, attention mechanisum, etc. You have to implement these models in the framework we provide and clearly comment your code.
Simply copying from other sources will be considered as plagiarism.

Download Data and Tokenizer¶
In [1]:
! git clone https://github.com/rujunhan/CSCI-544.git
# Install Tokenizer
! pip install mosestokenizer

Cloning into ‘CSCI-544’…
remote: Enumerating objects: 16, done.[K
remote: Counting objects: 100% (16/16), done.[K
remote: Compressing objects: 100% (14/14), done.[K
remote: Total 16 (delta 1), reused 12 (delta 0), pack-reused 0[K
Unpacking objects: 100% (16/16), done.
Collecting mosestokenizer
[?25l Downloading https://files.pythonhosted.org/packages/45/c6/913c968e5cbcaff6cdd2a54a1008330c01a573ecadcdf9f526058e3d33a0/mosestokenizer-1.0.0-py3-none-any.whl (51kB)
[K |████████████████████████████████| 51kB 2.5MB/s
[?25hRequirement already satisfied: docopt in /usr/local/lib/python3.6/dist-packages (from mosestokenizer) (0.6.2)
Collecting openfile (from mosestokenizer)
Downloading https://files.pythonhosted.org/packages/93/e6/805db6867faacb488b44ba8e0829ef4de151dd0499f3c5da5f4ad11698a7/openfile-0.0.7-py3-none-any.whl
Collecting toolwrapper (from mosestokenizer)
Downloading https://files.pythonhosted.org/packages/ad/00/dba43b705ecb0d286576e78fc5b5d75c27ba3d1c4b80cf9a047ec3c6ad3f/toolwrapper-1.0.0.tar.gz
Building wheels for collected packages: toolwrapper
Building wheel for toolwrapper (setup.py) … [?25l[?25hdone
Created wheel for toolwrapper: filename=toolwrapper-1.0.0-cp36-none-any.whl size=3225 sha256=6f74f9defcbd9a5fe3a963cab317adb6fbaa80832122fe1a862f523ccc5b6b90
Stored in directory: /root/.cache/pip/wheels/5c/b0/2e/b8c414550c8372586ebaab634cf1f93733349cfe2b1d694fe8
Successfully built toolwrapper
Installing collected packages: openfile, toolwrapper, mosestokenizer
Successfully installed mosestokenizer-1.0.0 openfile-0.0.7 toolwrapper-1.0.0

Data Processing¶

Comments:¶
1. In read_tokenized(dir) make dir to string by using str(dir)
2. Function is_number is used to check if token is a number, if not, make it to lowercase, otherwise, making it $\langle num \rangle$.
3. tok_map function is used to mapping numbers and add $\langle bos \rangle$ and $\langle eos \rangle$ to each sentence.
In [2]:
from pathlib import Path
from tqdm import tqdm
from mosestokenizer import MosesTokenizer
import logging as log

log.basicConfig(level=log.INFO)
tokr = MosesTokenizer()

def read_tokenized(dir):
“””Tokenization wrapper”””
inputfile = open(str(dir))
for sent in inputfile:
yield tokr(sent.strip())

def is_number(token):
try:
float(token)
return ‘<' + token + '>‘
except ValueError:
return token.lower()
def tok_map(toks):
l = list(map(is_number,toks))
l.insert(0,’‘)
l.append(‘‘)
return l

train_file = Path(‘train.txt’)
with train_file.open(‘w’) as w:
for toks in tqdm(read_tokenized(Path(‘CSCI-544/hw2/train.txt’))):
w.write(” “.join(tok_map(toks)) + ‘\n’)

dev_file = Path(‘dev.txt’)
with dev_file.open(‘w’) as w:
for toks in tqdm(read_tokenized(Path(‘CSCI-544/hw2/dev.txt’))):
w.write(” “.join(tok_map(toks)) + ‘\n’)

INFO:mosestokenizer.tokenizer.MosesTokenizer:executing argv [‘perl’, ‘/usr/local/lib/python3.6/dist-packages/mosestokenizer/tokenizer-v1.1.perl’, ‘-q’, ‘-l’, ‘en’, ‘-b’, ‘-a’]
INFO:mosestokenizer.tokenizer.MosesTokenizer:spawned process 317
144526it [00:25, 5667.54it/s]
36131it [00:06, 5164.25it/s]
In [3]:
! ls -l

total 26488
drwxr-xr-x 4 root root 4096 Oct 17 04:33 CSCI-544
-rw-r–r– 1 root root 5418160 Oct 17 04:34 dev.txt
drwxr-xr-x 1 root root 4096 Aug 27 16:17 sample_data
-rw-r–r– 1 root root 21695899 Oct 17 04:34 train.txt
In [4]:
! head train.txt

but , while deliberation is taking place on how much europe should retreat from its historical domination of multilateral bodies , there is little vision beyond this .
the kyoto protocol allows countries to meet their target reductions of co2 emissions by substituting bio @-@ fuels for fossil fuels .
this independence was at the core of hegel 's insistence that supporting oneself by earning a living is one of the key ways that we gain a sense of ourselves as individuals .
but it can no longer be taken for granted that other eu countries will automatically ratify the agreements that they reach between themselves as the lodestar for common policies .
there are serious risks .
given their large foreign @-@ exchange reserves , we believe the time to begin such an initiative is now .
iran is believed by many to be trying to develop one .
indeed , the problem today is not excessive capital inflows ; international markets have largely turned against emerging markets .
but this time they are likely to face a third alternative : mayawati , whose bahujan samaj party ( bsp ) may command a bloc of at least <50> seats .
even among the sunnis and shias there are further divisions .

Task 1.1: additional data processing [5 pts + 5 pts comments]¶
Modify the above data processing code by
1. making all tokens lower case
2. mapping all numbers to a special symbol $\langle num\rangle$
3. adding $\langle bos\rangle$ and $\langle eos\rangle$ to the beginning and the end of a sentence

NOTE¶
MAX_TYPES, MIN_FREQ and BATCH_SIZE are fixed hyper-parameters for data. You are NOT ALLOWED to change these for fair comparison. The auto-grading script on Vocareum also uses these fixed values, so make sure you don’t change them. We will ask you to experiment with other hyper-parameters related to model and report results.
In [7]:
from typing import List, Iterator, Set, Dict, Optional, Tuple
from collections import Counter
from pathlib import Path
import torch

RESERVED = [‘‘, ‘‘]

PAD_IDX = 0
UNK_IDX = 1
MAX_TYPES = 10_000
BATCH_SIZE = 256
MIN_FREQ = 5

class Vocab:
“”” Mapper of words <--> index “””

def __init__(self, types):
# types is list of strings
assert isinstance(types, list)
assert isinstance(types[0], str)

self.idx2word = types
self.word2idx = {word: idx for idx, word in enumerate(types)}
assert len(self.idx2word) == len(self.word2idx) # One-to-One

def __len__(self):
return len(self.idx2word)

def save(self, path: Path):
log.info(f’Saving vocab to {path}’)
with path.open(‘w’) as wr:
for word in self.idx2word:
wr.write(f'{word}\n’)

@staticmethod
def load(path):
log.info(f’loading vocab from {path}’)
types = [line.strip() for line in path.open()]
for idx, tok in enumerate(RESERVED): # check reserved
assert types[idx] == tok
return Vocab(types)

@staticmethod
def from_text(corpus: Iterator[str], max_types: int,
min_freq: int = 5):
“””
corpus: text corpus; iterator of strings
max_types: max size of vocabulary
min_freq: ignore word types that have fewer ferquency than this number
“””
log.info(“building vocabulary; this might take some time”)
term_freqs = Counter(tok for line in corpus for tok in line.split())
for r in RESERVED:
if r in term_freqs:
log.warning(f’Found reserved word {r} in corpus’)
del term_freqs[r]
term_freqs = list(term_freqs.items())
log.info(f”Found {len(term_freqs)} types; given max_types={max_types}”)
term_freqs = {(t, f) for t, f in term_freqs if f >= min_freq}
log.info(f”Found {len(term_freqs)} after dropping freq < {min_freq} terms") term_freqs = sorted(term_freqs, key=lambda x: x[1], reverse=True) term_freqs = term_freqs[:max_types] types = [t for t, f in term_freqs] types = RESERVED + types # prepend reserved words return Vocab(types) train_file = Path('train.txt') vocab_file = Path('vocab.txt') if not vocab_file.exists(): train_corpus = (line.strip() for line in train_file.open()) vocab = Vocab.from_text(train_corpus, max_types=MAX_TYPES, min_freq=MIN_FREQ) vocab.save(vocab_file) else: vocab = Vocab.load(vocab_file) log.info(f'Vocab has {len(vocab)} types') INFO:root:building vocabulary; this might take some time INFO:root:Found 46401 types; given max_types=10000 INFO:root:Found 19374 after dropping freq < 5 terms INFO:root:Saving vocab to vocab.txt INFO:root:Vocab has 10002 types In [8]: import copy class TextDataset: def __init__(self, vocab: Vocab, path: Path): self.vocab = vocab log.info(f'loading data from {path}') # for simplicity, loading everything to memory; on large datasets this will cause OOM text = [line.strip().split() for line in path.open()] # words to index; out-of-vocab words are replaced with UNK xs = [[self.vocab.word2idx.get(tok, UNK_IDX) for tok in tokss] for tokss in text] self.data = xs log.info(f"Found {len(self.data)} records in {path}") def as_batches(self, batch_size, shuffle=False): # data already shuffled data = self.data if shuffle: random.shuffle(data) for i in range(0, len(data), batch_size): # i incrememt by batch_size batch = data[i: i + batch_size] # slice yield self.batch_as_tensors(batch) @staticmethod def batch_as_tensors(batch): n_ex = len(batch) max_len = max(len(seq) for seq in batch) seqs_tensor = torch.full(size=(n_ex, max_len), fill_value=PAD_IDX, dtype=torch.long) for i, seq in enumerate(batch): seqs_tensor[i, 0:len(seq)] = torch.tensor(seq) return seqs_tensor train_data = TextDataset(vocab=vocab, path=train_file) dev_data = TextDataset(vocab=vocab, path=Path('dev.txt')) INFO:root:loading data from train.txt INFO:root:Found 144526 records in train.txt INFO:root:loading data from dev.txt INFO:root:Found 36131 records in dev.txt In [0]: import torch.nn as nn class FNN_LM(nn.Module): def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, dropout=0.2): super(FNN_LM, self).__init__() self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=PAD_IDX) self.linear1 = nn.Linear(emb_dim, hid) self.linear2 = nn.Linear(hid, n_class) self.dropout = nn.Dropout(p=dropout) def forward(self, seqs, log_probs=True): """Return log Probabilities""" batch_size, max_len = seqs.shape embs = self.embedding(seqs) # embs[Batch x SeqLen x EmbDim] embs = self.dropout(embs) embs = embs.sum(dim=1) # sum over all all steps in seq hid_activated = torch.relu(self.linear1(embs)) # Non linear scores = self.linear2(hid_activated) if log_probs: return torch.log_softmax(scores, dim=1) else: return torch.softmax(scores, dim=1) In [0]: def save_model_object(model): torch.save({'state_dict': model.state_dict()}, "best_model.pt") return In [11]: # Trainer Optimizer import time from tqdm import tqdm import torch.optim as optim def train(model, n_epochs, batch_size, train_data, valid_data, device=torch.device('cuda')): log.info(f"Moving model to {device}") model = model.to(device) # move model to desired device optimizer = optim.Adam(params=model.parameters()) log.info(f"Device for training {device}") losses = [] for epoch in range(n_epochs): start = time.time() num_toks = 0 train_loss = 0. n_train_batches = 0 model.train() # switch to train mode with tqdm(train_data.as_batches(batch_size=BATCH_SIZE), leave=False) as data_bar: for seqs in data_bar: seq_loss = torch.zeros(1).to(device) for i in range(1, seqs.size()[1]-1): # Move input to desired device cur_seqs = seqs[:, :i].to(device) # take w0...w_(i-1) python indexing cur_tars = seqs[:, i].to(device) # predict w_i log_probs = model(cur_seqs) seq_loss += loss_func(log_probs, cur_tars).sum() / len(seqs) seq_loss /= (seqs.shape[1] - 1) # only n-1 toks are predicted train_loss += seq_loss.item() n_train_batches += 1 optimizer.zero_grad() # clear grads seq_loss.backward() optimizer.step() pbar_msg = f'Loss:{seq_loss.item():.4f}' data_bar.set_postfix_str(pbar_msg) # Run validation with torch.no_grad(): model.eval() # switch to inference mode -- no grads, dropouts inactive val_loss = 0 n_val_batches = 0 for seqs in valid_data.as_batches(batch_size=batch_size, shuffle=False): # Move input to desired device seq_loss = torch.zeros(1).to(device) for i in range(1, seqs.size()[1]-1): # Move input to desired device cur_seqs = seqs[:, :i].to(device) cur_tars = seqs[:, i].to(device) log_probs = model(cur_seqs) seq_loss += loss_func(log_probs, cur_tars).sum() / len(seqs) seq_loss /= (seqs.shape[1] - 1) val_loss += seq_loss.item() n_val_batches += 1 save_model_object(model) avg_train_loss = train_loss / n_train_batches avg_val_loss = val_loss / n_val_batches losses.append((epoch, avg_train_loss, avg_val_loss)) log.info(f"Epoch {epoch} complete; Losses: Train={avg_train_loss:G} Valid={avg_val_loss:G}") return losses model = FNN_LM(vocab_size=len(vocab), n_class=len(vocab)) loss_func = nn.NLLLoss(reduction='none') losses = train(model, n_epochs=5, batch_size=BATCH_SIZE, train_data=train_data, valid_data=dev_data) INFO:root:Moving model to cuda INFO:root:Device for training cuda INFO:root:Epoch 0 complete; Losses: Train=2.4707 Valid=2.20372 INFO:root:Epoch 1 complete; Losses: Train=2.15871 Valid=2.14489 INFO:root:Epoch 2 complete; Losses: Train=2.10921 Valid=2.11441 INFO:root:Epoch 3 complete; Losses: Train=2.08006 Valid=2.09363 INFO:root:Epoch 4 complete; Losses: Train=2.05915 Valid=2.07754 In [12]: ! ls -l total 32496 -rw-r--r-- 1 root root 6062606 Oct 16 02:12 best_model.pt drwxr-xr-x 4 root root 4096 Oct 16 01:05 CSCI-544 -rw-r--r-- 1 root root 5418160 Oct 16 01:12 dev.txt drwxr-xr-x 1 root root 4096 Aug 27 16:17 sample_data -rw-r--r-- 1 root root 21695899 Oct 16 01:12 train.txt -rw-r--r-- 1 root root 84229 Oct 16 01:55 vocab.txt Task 1.2: RNNs [50 pts]¶ 1. Under the given FNN_LM framework, modify the code to implement RNN model [15 pts + 5 pts comments] 2. Repeat this step for LSTM model [15 pts + 5 pts comments] 3. Write a report comparing your results for these three models [10 pts] In [0]: class RNN_LM(nn.Module): def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, num_layers=1, dropout=0.1): super(RNN_LM, self).__init__() self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=PAD_IDX) def forward(self, seqs): batch_size, max_len = seqs.shap class FNN_LM(nn.Module): def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, dropout=0.2): super(FNN_LM, self).__init__() self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=PAD_IDX) self.linear1 = nn.Linear(emb_dim, hid) self.linear2 = nn.Linear(hid, n_class) self.dropout = nn.Dropout(p=dropout) def forward(self, seqs, log_probs=True): """Return log Probabilities""" batch_size, max_len = seqs.shape embs = self.embedding(seqs) # embs[Batch x SeqLen x EmbDim] embs = self.dropout(embs) embs = embs.sum(dim=1) # sum over all all steps in seq hid_activated = torch.relu(self.linear1(embs)) # Non linear scores = self.linear2(hid_activated) if log_probs: return torch.log_softmax(scores, dim=1) else: return torch.softmax(scores, dim=1) In [0]: class LSTM_LM(nn.Module): def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, num_layers=1, dropout=0.1): super(LSTM_LM, self).__init__() self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=PAD_IDX) def forward(self, seqs): batch_size, max_len = seqs.shape Task 2: Hyper-parameters Tuning [20 pts]¶ You may observe that there are multiple hyper-parameters used in the pipeline. Choose 3 out of 5 following hyper-parameters and try at least 5 different values for each of them and report their corresponding performances on the train / dev datasets. Explain why you think larger or smaller values may cause the differenes you observed. 1. emb_dim: embedding size 2. hid: hidden layer dimension 3. num_layers: number of RNN layers 4. dropout ratio 5. n_epochs In [0]: from google.colab import files files.download("best_model.pt") files.download("vocab.txt") Task 3: Submitting your model class and object¶ 1. After you find the best model architecture, rename your best model as BEST_MODEL and re-run train() to save your model. 2. Download model object and locate it (best_model.pt file) in your local direcotry and submit it to Vocareum. 3. Copy your BEST_MODEL class into a python script: best_model.py and submit it to Vocareum. 4. Download you vocab.txt file and submit it with your model files. In summary, you will need a best_model.py file, a best_model.pt object and a vocab.txt file to successfully run the auto-grading on Vocareum. We made the evaluation code visible (but not editable) to everyone on Vocareum. You can find it here: resource/asnlib/public/evaluation.py See below for an example. Rename FNN() class as BEST_MODEL. Modify and save the entire script below as best_model.py In [0]: import torch class BEST_MODEL(torch.nn.Module): def __init__(self, vocab_size, n_class, emb_dim=50, hid=100, dropout=0.2): super(BEST_MODEL, self).__init__() self.embedding = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=0) self.linear1 = torch.nn.Linear(emb_dim, hid) self.linear2 = torch.nn.Linear(hid, n_class) self.dropout = torch.nn.Dropout(p=dropout) def forward(self, seqs, log_probs=True): """Return log Probabilities""" batch_size, max_len = seqs.shape embs = self.embedding(seqs) # embs[Batch x SeqLen x EmbDim] embs = self.dropout(embs) embs = embs.sum(dim=1) # sum over all all steps in seq \ hid_activated = torch.relu(self.linear1(embs)) # Non linear scores = self.linear2(hid_activated) if log_probs: return torch.log_softmax(scores, dim=1) else: return torch.softmax(scores, dim=1) Task 4: [Extra Credits 5Pts]¶ Enhance the current model with additional linguistic features or DL models In [0]:

Related Posts