HW2_Spam_Classification_with_LSTM
You should submit a .ipynb file with your solutions to NYU Brightspace.
In this homework, we will reuse the spam prediction dataset used in HW1.
We will use a word-level BiLSTM sentence encoder to encode each sentence, followed by a neural network classifier.
For reference, you may read this paper.
Lab 3 is especially relevant to this homework.
Points distribution¶
code spam_collate_func: 25 pts
code LSTMClassifier.__init__: 25 pts
code LSTMClassifier.forward: 20 pts
code evaluate: 10 pts
code for training loop: 10 pts
Question on early stopping: 10 pts
How we grade the code:
full points if code works and the underlying logic is correct;
half points if code works but the underlying logic is incorrect;
zero points if code does not work.
Therefore, make sure your code works, i.e., no errors are produced when you execute it.
Data Loading¶
First, reuse the code from HW1 to download and read the data.
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv
--2022-02-16 23:52:19-- https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 74.125.202.101, 74.125.202.138, 74.125.202.139, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.101|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/va2ei70761h7r8rlq63433gnfu6orla0/1645055475000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
Warning: wildcards not supported in HTTP.
--2022-02-16 23:52:19-- https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/va2ei70761h7r8rlq63433gnfu6orla0/1645055475000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 173.194.197.132, 2607:f8b0:4001:c1b::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)|173.194.197.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 503663 (492K) [text/csv]
Saving to: 'spam.csv'
spam.csv 100%[===================>] 491.86K --.-KB/s in 0.009s
2022-02-16 23:52:20 (55.1 MB/s) - 'spam.csv' saved [503663/503663]
import pandas as pd
import numpy as np
df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()
   v1                                                 v2
0   0  Go until jurong point, crazy.. Available only ...
1   0  Ok lar... Joking wif u oni...
2   1  Free entry in 2 a wkly comp to win FA Cup fina...
3   0  U dun say so early hor... U c already then say...
4   0  Nah I don't think he goes to usf, he lives aro...
We will split the data into train, val, and test sets.
train_texts, val_texts, and test_texts should each contain the list of text examples for the corresponding split.
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)
# Shuffle the data
df = df.sample(frac=1)
# Split df to test/val/train
test_df = df[:test_size]
val_df = df[test_size:test_size+val_size]
train_df = df[test_size+val_size:]
train_texts, train_labels = list(train_df.v2), list(train_df.v1)
val_texts, val_labels = list(val_df.v2), list(val_df.v1)
test_texts, test_labels = list(test_df.v2), list(test_df.v1)
# Check that the indices of the splits do not overlap
assert set(train_df.index).intersection(set(val_df.index)) == set({})
assert set(test_df.index).intersection(set(train_df.index)) == set({})
assert set(val_df.index).intersection(set(test_df.index)) == set({})
# Check that all examples are accounted for
assert df.shape[0] == len(train_labels) + len(val_labels) + len(test_labels)
print(
    f"Size of initial data: {df.shape[0]}\n"
    f"Train size: {len(train_labels)}\n"
    f"Val size: {len(val_labels)}\n"
    f"Test size: {len(test_labels)}\n"
)
Size of initial data: 5572
Train size: 3902
Val size: 835
Test size: 835
train_texts[:10] # Just checking the examples in train_texts
["I'll talk to the others and probably just come early tomorrow then",
 'House-Maid is the murderer, coz the man was murdered on <#> th January.. As public holiday all govt.instituitions are closed,including post office..understand?',
 "Sad story of a Man - Last week was my b'day. My Wife did'nt wish me. My Parents forgot n so did my Kids . I went to work. Even my Colleagues did not wish.",
 "Nah I don't think he goes to usf, he lives around here though",
 'Nope... C Ì_ then...',
 'I sent your maga that money yesterday oh.',
 'URGENT This is our 2nd attempt to contact U. Your å£900 prize from YESTERDAY is still awaiting collection. To claim CALL NOW 09061702893. ACL03530150PM',
 'Lol I would but my mom would have a fit and tell the whole family how crazy and terrible I am',
 'Check mail.i have mailed varma and kept copy to you regarding membership.take care.insha allah.',
 '88066 FROM 88066 LOST 3POUND HELP']
Download and Load GloVe¶
We will use GloVe embedding parameters to initialize our layer of word representations (the embedding layer).
Let's download and load GloVe.
This is related to Lab 3 (Deep Learning); please watch the recording and check the lab notebook for details.
Download GloVe word embeddings
# === Download GloVe word embeddings
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# === Unzip word embeddings and use only the top 50000 word embeddings for speed
# !unzip glove.6B.zip
# !head -n 50000 glove.6B.300d.txt > glove.6B.300d__50k.txt
# === Download Preprocessed version
!wget https://docs.google.com/uc?id=1KMJTagaVD9hFHXFTPtNk0u2JjvNlyCAu -O glove_split.aa
!wget https://docs.google.com/uc?id=1LF2yD2jToXriyD-lsYA5hj03f7J3ZKaY -O glove_split.ab
!wget https://docs.google.com/uc?id=1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f -O glove_split.ac
!cat glove_split.?? > 'glove.6B.300d__50k.txt'
--2022-02-16 23:52:32-- https://docs.google.com/uc?id=1KMJTagaVD9hFHXFTPtNk0u2JjvNlyCAu
Resolving docs.google.com (docs.google.com)... 74.125.202.102, 74.125.202.139, 74.125.202.113, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'glove_split.aa'
glove_split.aa [ <=> ] 1.93K --.-KB/s in 0s
2022-02-16 23:52:33 (25.6 MB/s) - 'glove_split.aa' saved [1978]
--2022-02-16 23:52:33-- https://docs.google.com/uc?id=1LF2yD2jToXriyD-lsYA5hj03f7J3ZKaY
Resolving docs.google.com (docs.google.com)... 74.125.202.138, 74.125.202.100, 74.125.202.102, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'glove_split.ab'
glove_split.ab [ <=> ] 1.93K --.-KB/s in 0s
2022-02-16 23:52:36 (25.5 MB/s) - 'glove_split.ab' saved [1978]
--2022-02-16 23:52:37-- https://docs.google.com/uc?id=1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f
Resolving docs.google.com (docs.google.com)... 74.125.202.138, 74.125.202.113, 74.125.202.101, ...
Connecting to docs.google.com (docs.google.com)|74.125.202.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-04-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/rjp7g6lolbttv0mftfnic5fr5oum8q4r/1645055550000/14514704803973256873/*/1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f [following]
Warning: wildcards not supported in HTTP.
--2022-02-16 23:52:37-- https://doc-04-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/rjp7g6lolbttv0mftfnic5fr5oum8q4r/1645055550000/14514704803973256873/*/1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f
Resolving doc-04-0g-docs.googleusercontent.com (doc-04-0g-docs.googleusercontent.com)... 173.194.197.132, 2607:f8b0:4001:c1b::84
Connecting to doc-04-0g-docs.googleusercontent.com (doc-04-0g-docs.googleusercontent.com)|173.194.197.132|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2022-02-16 23:52:37 ERROR 403: Forbidden.
!wget https://campuspro-uploads.s3.us-west-2.amazonaws.com/f14e42f6-0f57-4d3c-bf3c-6eb8982c822b/1447cf92-9ef5-4097-939d-f69337174ded/glove.6B.300d__50k.txt.zip
!unzip glove.6B.300d__50k.txt.zip
--2022-02-16 23:57:18-- https://campuspro-uploads.s3.us-west-2.amazonaws.com/f14e42f6-0f57-4d3c-bf3c-6eb8982c822b/1447cf92-9ef5-4097-939d-f69337174ded/glove.6B.300d__50k.txt.zip
Resolving campuspro-uploads.s3.us-west-2.amazonaws.com (campuspro-uploads.s3.us-west-2.amazonaws.com)... 52.218.246.249
Connecting to campuspro-uploads.s3.us-west-2.amazonaws.com (campuspro-uploads.s3.us-west-2.amazonaws.com)|52.218.246.249|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49335722 (47M) [application/zip]
Saving to: 'glove.6B.300d__50k.txt.zip'
glove.6B.300d__50k. 100%[===================>] 47.05M 22.6MB/s in 2.1s
2022-02-16 23:57:21 (22.6 MB/s) - 'glove.6B.300d__50k.txt.zip' saved [49335722/49335722]
Archive: glove.6B.300d__50k.txt.zip
replace glove.6B.300d__50k.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
inflating: glove.6B.300d__50k.txt
inflating: __MACOSX/._glove.6B.300d__50k.txt
Load GloVe¶
def load_glove(glove_path, embedding_dim):
    with open(glove_path) as f:
        # Row 0 is reserved for the PAD token (all zeros), row 1 for the UNK token (random).
        token_ls = [PAD_TOKEN, UNK_TOKEN]
        embedding_ls = [np.zeros(embedding_dim), np.random.rand(embedding_dim)]
        for line in f:
            # Each line is "word v_1 ... v_300"; split once to separate the word from its vector.
            token, raw_embedding = line.split(maxsplit=1)
            token_ls.append(token)
            embedding = np.array([float(x) for x in raw_embedding.split()])
            embedding_ls.append(embedding)
        embeddings = np.array(embedding_ls)
    print(embedding_ls[-1].size)
    return token_ls, embeddings
PAD_TOKEN = '<pad>'
UNK_TOKEN = '<unk>'
EMBEDDING_DIM = 300  # dimension of GloVe embeddings
glove_path = "glove.6B.300d__50k.txt"
vocab, embeddings = load_glove(glove_path, EMBEDDING_DIM)
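As an optional sanity check (assuming the cells above have been run), you can verify that the vocabulary and the embedding matrix line up; with the 50k-word GloVe file plus the two special tokens you should see 50,002 rows of dimension 300:
# Optional sanity check: vocab and the embedding matrix should have matching sizes.
print(len(vocab), embeddings.shape)   # expected: 50002 (50002, 300)
print(vocab[:2])                      # expected: ['<pad>', '<unk>']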
Import packages¶
!pip install sacremoses
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
import sacremoses
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm
Collecting sacremoses
Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
|████████████████████████████████| 895 kB 5.4 MB/s
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses) (1.1.0)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses) (1.15.0)
Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (from sacremoses) (2019.12.20)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from sacremoses) (4.62.3)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses) (7.1.2)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.0.47
Tokenize text data.¶
We will use the tokenize function to convert the text data into sequences of indices.
def tokenize(data, labels, tokenizer, vocab, max_seq_length=128):
    vocab_to_idx = {word: i for i, word in enumerate(vocab)}
    text_data = []
    for ex in tqdm(data):
        tokenized = tokenizer.tokenize(ex.lower())
        # Map each token to its vocab index; out-of-vocabulary tokens map to 1 (UNK).
        ids = [vocab_to_idx.get(token, 1) for token in tokenized]
        text_data.append(ids)
    return text_data, labels
tokenizer = sacremoses.MosesTokenizer()
train_data_indices, train_labels = tokenize(train_texts, train_labels, tokenizer, vocab)
val_data_indices, val_labels = tokenize(val_texts, val_labels, tokenizer, vocab)
test_data_indices, test_labels = tokenize(test_texts, test_labels, tokenizer, vocab)
print("\nTrain text first 5 examples:\n", train_data_indices[:5])
print("\nTrain labels first 5 examples:\n", train_labels[:5])
Train text first 5 examples:
[[43, 1, 1079, 6, 2, 425, 7, 967, 122, 328, 201, 4004, 129], [1, 16, 2, 12873, 3, 1, 2, 302, 17, 5928, 15, 725, 18811, 91, 2751, 725, 16537, 91, 14360, 452, 1, 21, 200, 2334, 66, 1, 34, 719, 3, 146, 660, 285, 1, 1908, 190], [5281, 525, 5, 9, 302, 13, 78, 149, 17, 194, 1558, 1, 194, 704, 121, 1, 3469, 1, 194, 1110, 15477, 3816, 102, 121, 194, 1815, 4, 43, 390, 6, 1, 153, 194, 3298, 121, 38, 3469, 4], [8978, 43, 3318, 1, 271, 20, 1434, 6, 21590, 3, 20, 975, 206, 189, 415], [43897, 436, 1866, 1, 99, 129, 436]]
Train labels first 5 examples:
[0, 0, 0, 0, 0]
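To double-check the index mapping (optional; assumes the cells above have been run), you can decode one example back into tokens. Any word outside the 50k GloVe vocabulary comes back as the UNK token at index 1:
# Optional: decode the first training example back into tokens (index 1 = UNK).
print([vocab[idx] for idx in train_data_indices[0]])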
Create DataLoaders (25 pts)¶
Now, let's create PyTorch DataLoaders for our train, val, and test data.
The SpamDataset class is based on torch.utils.data.Dataset. It has an additional attribute, self.max_sent_length, and a custom collate function, spam_collate_func.
In order to use batch processing, all the examples in a batch need to effectively be the same length. We'll do this by adding padding tokens. spam_collate_func is supposed to dynamically pad or trim the sentences in the batch based on self.max_sent_length and the length of the longest sequence in the batch.
If self.max_sent_length is less than the length of the longest sequence in the batch, use self.max_sent_length. Otherwise, use the length of the longest sequence in the batch.
We do this because the input sentences in a batch may be much shorter than self.max_sent_length.
Please check the comment block in the code near the TODO for more details; an optional reference sketch of the padding logic is given after the example below.
PAD token id = 0
max_sent_length = 5
input list of sequences:
[1,4,5,3,5,6,7,4,4],
[3,5,3,2],
[2,5,3,5,6,7,4],
then the padded minibatch looks like this:
padded_input =
[[1,4,5,3,5],
[3,5,3,2,0],
[2,5,3,5,6]]
import numpy as np
import torch
from torch.utils.data import Dataset
class SpamDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch.
    Note that this class inherits torch.utils.data.Dataset.
    """
    def __init__(self, data_list, target_list, max_sent_length=128):
        """
        @param data_list: list of data tokens
        @param target_list: list of data targets
        """
        self.data_list = data_list
        self.target_list = target_list
        self.max_sent_length = max_sent_length
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, key, max_sent_length=None):
        """
        Triggered when you call dataset[i]
        """
        if max_sent_length is None:
            max_sent_length = self.max_sent_length
        token_idx = self.data_list[key][:max_sent_length]
        label = self.target_list[key]
        return [token_idx, label]

    def spam_collate_func(self, batch):
        """
        Customized function for DataLoader that dynamically pads the batch so that all
        data have the same length.
        """
        # What is the input `batch`? That's for you to figure out!
        # You can read the DataLoader documentation, or you can use the print
        # function to debug.
        data_list = []   # store padded sequences
        label_list = []
        max_batch_seq_len = None  # set to the length of the longest sequence in the batch
                                  # if it is less than self.max_sent_length,
                                  # else max_batch_seq_len = self.max_sent_length
        # Pad the sequences whose length is less than max_batch_seq_len,
        # and trim the sequences that are longer than self.max_sent_length.
        # Return the padded data_list and label_list.
        # 1. TODO: Your code here
        return [data_list, label_list]
BATCH_SIZE = 64
max_sent_length=128
train_dataset = SpamDataset(train_data_indices, train_labels, max_sent_length)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=BATCH_SIZE,
collate_fn=train_dataset.spam_collate_func,
shuffle=True)
val_dataset = SpamDataset(val_data_indices, val_labels, train_dataset.max_sent_length)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset,
batch_size=BATCH_SIZE,
collate_fn=train_dataset.spam_collate_func,
shuffle=False)
test_dataset = SpamDataset(test_data_indices, test_labels, train_dataset.max_sent_length)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
batch_size=BATCH_SIZE,
collate_fn=train_dataset.spam_collate_func,
shuffle=False)
Let's try to print out a batch from train_loader.
data_batch, labels = next(iter(train_loader))
print("data batch dimension: ", data_batch.size())
print("data_batch: ", data_batch)
print("labels: ", labels)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
      1 data_batch, labels = next(iter(train_loader))
----> 2 print("data batch dimension: ", data_batch.size())
      3 print("data_batch: ", data_batch)
      4 print("labels: ", labels)
AttributeError: 'list' object has no attribute 'size'
(This error is expected until spam_collate_func is implemented: the stub above returns plain Python lists, which have no .size() method. Once the collate function returns tensors, this cell will run.)
Build a BiLSTM Classifier (20 + 25 + 10 pts)¶
Now we are going to build a BiLSTM classifier. Check this blog post and torch.nn.LSTM for reference. Recall that we’ve also seen LSTM in Lab 3.
The hyperparameters for the LSTM are already given, but they are not necessarily optimal. You should get good accuracy with them, but you may tune the hyperparameters to get better performance.
__init__: Class constructor. Here we define the layers / parameters of the LSTM classifier.
forward: This function is used whenever you call your object as model(). It takes the input minibatch and returns the output representation from the LSTM. A toy snippet illustrating the tensor shapes produced by a bidirectional nn.LSTM follows below.
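If torch.nn.LSTM is new to you, this toy snippet (illustrative only; the sizes are arbitrary and this is not the graded model) shows the tensor shapes a bidirectional LSTM produces with batch_first=True:
import torch
import torch.nn as nn

# Illustrative only: a 1-layer bidirectional LSTM over random "embeddings".
lstm = nn.LSTM(input_size=300, hidden_size=64, num_layers=1,
               batch_first=True, bidirectional=True)
x = torch.randn(8, 20, 300)     # (batch, seq_len, embedding_dim)
output, (h_n, c_n) = lstm(x)
print(output.shape)             # torch.Size([8, 20, 128]) -> 2 * hidden_size (both directions)
print(h_n.shape)                # torch.Size([2, 8, 64])   -> num_layers * num_directions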
# First import torch related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
class LSTMClassifier(nn.Module):
    """
    LSTMClassifier classification model
    """
    def __init__(self, embeddings, hidden_size, num_layers, num_classes, bidirectional, dropout_prob=0.3):
        super().__init__()
        self.embedding_layer = self.load_pretrained_embeddings(embeddings)
        self.dropout = None
        self.lstm = None
        self.non_linearity = None  # For example, ReLU
        self.clf = None  # classifier layer
        # Define the components of your BiLSTM Classifier model
        # 2. TODO: Your code here
        raise NotImplementedError  # delete this line

    def load_pretrained_embeddings(self, embeddings):
        """
        The code for loading embeddings from Lab 3 Deep Learning.
        Unlike the lab, we are not setting `embedding_layer.weight.requires_grad = False`
        because we want to finetune the embeddings on our data.
        """
        embedding_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1], padding_idx=0)
        embedding_layer.weight.data = torch.Tensor(embeddings).float()
        return embedding_layer

    def forward(self, inputs):
        """
        Run the model on the input minibatch and return the classifier output.
        """
        # 3. TODO: Your code here
        raise NotImplementedError  # delete this line