
Lab 4: Deep Learning with PyTorch
In this lab you’ll learn practical deep learning skills, including using the Python library Pytorch and its autodifferentiation capabilities to train basic machine learning models. We’ll also learn how to input text to a bag-of-words model using static word embeddings.

0. Lab setup¶


Import packages

!pip install sacremoses

import torch
import torch.nn as nn

import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sacremoses
from torch.utils.data import DataLoader, Dataset
import tqdm

Collecting sacremoses
Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
|████████████████████████████████| 895 kB 4.9 MB/s
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from sacremoses) (4.62.3)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses) (1.15.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses) (1.1.0)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses) (7.1.2)
Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (from sacremoses) (2019.12.20)
Installing collected packages: sacremoses
Successfully installed sacremoses-0.0.47

Portions of this lab taken from DS-GA-1011, Fall 2018

Set up GPU runtime¶
Go to Edit > Notebook Settings, and select “GPU” under the Hardware accelerator dropdown menu. A GPU is a special type of processor that is better suited for deep learning since it has proportionally more transistors dedicated to arithmetic logic than a CPU does. GPUs also typically contain many more cores, making them ideal for running numerous matrix operations in parallel.
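Once the runtime restarts with a GPU attached, a quick sanity check (a small snippet of ours, not part of the original lab cells) is to ask PyTorch whether it can see the device:

import torch  # already imported in the setup cell above

# Should print True, followed by the name of the attached GPU
# (on Colab this is typically a Tesla T4 or similar).
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))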

1. Tensor basics¶
PyTorch is a library for training deep neural networks that is largely based on the Tensor, an array type that is similar to NumPy arrays. The main difference is that tensor operations are designed to run efficiently on both CPUs and GPUs.

# We can create tensors out of normal Python lists.
tensor = torch.tensor([0, 1, 2, 3])

# They can be easily inspected for their contents, shape, and data types.
print(tensor)
print(tensor.shape)
print(tensor.dtype)

# They can also be matrices.
M_data = [[1., 2.], [4., 5.]]
M = torch.tensor(M_data)

tensor([0, 1, 2, 3])
torch.Size([4])
torch.int64
tensor([[1., 2.],
[4., 5.]])

It’s easy to convert between NumPy arrays and PyTorch tensors. Many NumPy operations are also available in PyTorch’s tensor library. (See the documentation.)

# Convert between pytorch and numpy
print(tensor.numpy())
print(torch.tensor(tensor.numpy()))

[0 1 2 3]
tensor([0, 1, 2, 3])

# Populating tensors/array, inspecting their shapes, and reshaping.
print(np.zeros((4, 4)))
print(np.ones((4, 4)))
X_np = np.random.random((4, 4))
print(X_np)
print(X_np.shape)
print(X_np.ndim)
print(X_np.reshape((2, 8))) # note the double parentheses! reshape() takes a
# tuple as an argument.

## pytorch
print(torch.zeros(4, 4))
print(torch.ones(4, 4))
X_th = torch.rand(4, 4)
print(X_th)
print(X_th.shape)
print(X_th.size())
print(X_th.dim())
print(X_th.view(2,8))

[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
[[0.35180516 0.91535666 0.18839055 0.74211075]
[0.90895751 0.8142918 0.95056521 0.89966115]
[0.50783701 0.99807408 0.11904349 0.69109467]
[0.85538517 0.79962773 0.93136154 0.46450384]]
[[0.35180516 0.91535666 0.18839055 0.74211075 0.90895751 0.8142918
0.95056521 0.89966115]
[0.50783701 0.99807408 0.11904349 0.69109467 0.85538517 0.79962773
0.93136154 0.46450384]]
tensor([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
tensor([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
tensor([[0.5519, 0.1044, 0.8973, 0.0933],
[0.1923, 0.4539, 0.0495, 0.7878],
[0.4294, 0.9544, 0.4191, 0.8513],
[0.2700, 0.1536, 0.8204, 0.9725]])
torch.Size([4, 4])
torch.Size([4, 4])
tensor([[0.5519, 0.1044, 0.8973, 0.0933, 0.1923, 0.4539, 0.0495, 0.7878],
[0.4294, 0.9544, 0.4191, 0.8513, 0.2700, 0.1536, 0.8204, 0.9725]])

Both are capable of numerous linear algebra operations. Here are just a few:

M_np = M.numpy()
## numpy indexing
print(M_np[0, 0]) # returns a float
print(M_np[0, :]) # returns the 0th row, an array of shape (2,)
## pytorch indexing
print(M[0, 0]) # returns a tensor of shape [] (indicates a single scalar)
print(M[0, :]) # returns a tensor of shape [2]

1.0
[1. 2.]
tensor(1.)
tensor([1., 2.])

## matrix operations
## numpy matrix multiplication
print(M_np * M_np) # element-wise
print(M_np @ M_np) # matrix multiplication

## pytorch matrix multiplication
print(M * M) # element-wise
print(M @ M) # matrix multiplication

[[ 1. 4.]
[16. 25.]]
[[ 9. 12.]
[24. 33.]]
tensor([[ 1., 4.],
[16., 25.]])
tensor([[ 9., 12.],
[24., 33.]])

2. Using CUDA¶

CUDA and cuDNN are NVIDIA’s GPU computing platform and deep learning library; together they accelerate mathematical operations and neural network primitives on the GPU.

PyTorch natively supports CUDA/cuDNN, but we must explicitly specify whether a tensor operation should run on the CPU or the GPU.

assert torch.cuda.is_available() and torch.has_cudnn

# Tensors start out stored on the CPU by default.
x = torch.Tensor(range(5))
y = torch.Tensor(np.ones(5))

# In order to run tensor operations on the GPU, we need to move our tensors onto
# the GPU.
print(x.cuda()) # Returns a copy of x in CUDA memory.
print(x.cuda() + y.cuda()) # All tensors in a single expression must be located
# on the same device.

# The recommended way to specify tensor device:
gpu = torch.device("cuda:0")
cpu = torch.device("cpu")
device = gpu
z = x.to(device) + y.to(device)  # to() returns a copy of the tensor on the specified device
print(z)

# Move z back to the CPU so that we can use it with numpy:
print(z.cpu().numpy())

tensor([0., 1., 2., 3., 4.], device='cuda:0')
tensor([1., 2., 3., 4., 5.], device='cuda:0')
tensor([1., 2., 3., 4., 5.], device='cuda:0')
[1. 2. 3. 4. 5.]
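A common device-agnostic idiom (our sketch; the variable names are illustrative) is to pick the device once, falling back to the CPU when no GPU is attached, so the same code runs anywhere:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.arange(3.0, device=device)  # create tensors directly on the chosen device
b = torch.ones(3, device=device)
print((a + b).cpu().numpy())          # move back to the CPU before handing data to numpy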

3. Autograd and Gradient Descent¶

Autograd is the PyTorch submodule that handles automatic differentiation and gradient computation. This lets you define a model once, as a forward computation, and the library takes care of computing all gradients in the computational graph.

Here, we create two tensors and tell PyTorch to track gradients with respect to x. By default, PyTorch does not compute gradients for arbitrary computations (e.g. for y).

x = torch.randn(5, requires_grad=True) # pytorch will now keep track of the gradient w.r.t. x
y = torch.arange(5.) # pytorch will *not* keep track of gradients w.r.t. y

print("tensor x: ", x)
print("gradient wrt x:", x.grad)

tensor x: tensor([ 0.1417, -0.3987, 0.6719, 0.3453, -0.3051], requires_grad=True)
gradient wrt x: None

print(y)
print(y.grad)

tensor([0., 1., 2., 3., 4.])
None

Now consider $z = x \cdot y$. Taking the derivative with respect to $x$, we have

$$\frac{\partial z}{\partial x} = y$$

Note that z.grad_fn is not None, which shows that $z$ was computed while capturing its dependencies in the computation graph.

z = (x * y).sum()
print(z)
print(z.grad)
print(z.grad_fn)

tensor(0.7603, grad_fn=<SumBackward0>)
None
<SumBackward0 object at 0x...>

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won’t be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad

At this point, no gradients are computed yet. It is only when we call z.backward() that PyTorch computes the gradients, and backpropagates them to any node in the graph that required gradients (e.g. $x$).

z.backward()

As we can see, $x$ now has gradients associated with it, but $y$ does not.

print(x.grad)
print(y.grad)
print(z.grad)

tensor([0., 1., 2., 3., 4.])
None
None

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won’t be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:417.)
return self._grad

With just this, we can compute a very rudimentary form of gradient descent!

# A very silly case of gradient descent:
learning_rate = 0.01
x = torch.tensor([1000.], requires_grad=True)
x_values = []
for i in range(1000):
    # Our loss function minimizes x^2. What should the optimal value of x be?
    loss = x ** 2
    loss.backward()

    # Now update the value of x using the gradient of the loss w.r.t. x.
    # We have to do something a little convoluted here to subtract the
    # gradient -- don't worry, we'll never do this again.
    x.data.sub_(x.grad.data * learning_rate)

    # Remember to zero out the gradient!
    # PyTorch doesn't do it automatically.
    x.grad.data = torch.Tensor([0])
    x_values.append(x.item())

plt.figure(figsize=(10, 7))
plt.plot(x_values)
plt.xlabel("Steps")
plt.ylabel(r"Value of $x$")

Text(0, 0.5, 'Value of $x$')

Lastly, sometimes you want to run things without computing gradients:

x = torch.tensor([1000.], requires_grad=True)

# With gradient computation:
loss = x ** 2
print(loss.requires_grad)

# Without gradient computation:
with torch.no_grad(): # This temporarily disables gradient tracking inside the block
    loss = x ** 2
    print(loss.requires_grad)

True
False

4. Regression with PyTorch¶

Let’s solve a toy regression problem with PyTorch. Let’s define the following data-generating process:

$$ y = wx + b + \epsilon$$

where $w=2$, $b=5$, and $\epsilon \sim N(0, 1)$.

rng = np.random.RandomState(seed=1234)
torch.manual_seed(1234)
w, b = 2, 5
x = rng.uniform(0, 10, 100)
eps = rng.normal(size=100)
y = w * x + b + eps

Our data looks like this. Now, we want to fit a best-fit line through our data.

plt.scatter(x, y)

PyTorch lets users wrap complex functions in nn.Module objects. nn.Linear is a simple instance of a module that captures a linear (affine) function.
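For reference, more complex models are built by subclassing nn.Module yourself and implementing forward(); here is a minimal sketch (the class name is ours and the model is deliberately trivial):

class TinyRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=1, out_features=1)

    def forward(self, x):
        # forward() defines the computation; autograd handles the backward pass.
        return self.linear(x)

For this lab, the built-in nn.Linear on its own is all we need: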

linear_model = nn.Linear(in_features=1, out_features=1)

The linear_model we just initialized contains two parameters: a weight and a bias, corresponding to the $w$ and $b$ above. The values of these parameters are randomly initialized at the start.

for name, param in linear_model.named_parameters():
    print(name)
    print(param)
    print("======================")

weight
Parameter containing:
tensor([[-0.9420]], requires_grad=True)
======================
bias
Parameter containing:
tensor([-0.1962], requires_grad=True)
======================

# We can directly access the parameter values by their names.
linear_model.weight

Parameter containing:
tensor([[-0.9420]], requires_grad=True)

This is what our untrained model looks like initially: it’s just a random line!

plt.scatter(x, y)
x_space = np.linspace(0, 10, 5)
plt.plot(
    x_space,
    linear_model.weight.item() * x_space + linear_model.bias.item(),
    linestyle="--", color="red",
)

[<matplotlib.lines.Line2D object at 0x...>]

To start training, we’ll do two things:

First, we’ll initialize PyTorch tensors to hold our data.
Second, we’ll initialize an optimizer that will run gradient descent for us. We’ll use a simple stochastic gradient descent (SGD) optimizer with a learning rate of 0.01.

x_tensor = torch.tensor(x).float().view(-1, 1)
y_tensor = torch.tensor(y).float().view(-1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=0.01)

Let’s perform our optimization for 1000 steps, and see how our line of best fit changes over time (we’ll check every 200 steps).

Pay close attention to the steps that happen within the loop:

Compute the forward pass (predict $\hat{y}$)
Compute the loss (how far are we from the true $y$?)
Back-propagate the gradients
Take a step with the optimizer (update the parameters)
Zero out the gradients of the optimizer. (This is an important and often forgotten step: if you don’t do this, gradients from previous backward passes will keep accumulating.)

for t in range(1001):
    # === Optimization === #

    # 1. Compute the forward pass
    y_hat = linear_model(x_tensor)

    # 2. Compute the loss
    loss = F.mse_loss(y_hat, y_tensor)

    # 3. Back-propagate the gradients
    loss.backward()

    # 4. Take a step with the optimizer (update the parameters)
    optimizer.step()

    # 5. Zero out the gradients of the optimizer
    optimizer.zero_grad()

    # === Plotting === #
    if t % 200 == 0:
        current_w = linear_model.weight.item()
        current_b = linear_model.bias.item()
        plt.figure(figsize=(4, 2))
        plt.scatter(x, y)
        x_space = np.linspace(0, 10, 5)
        plt.plot(
            x_space,
            current_w * x_space + current_b,
            linestyle="--", color="red",
        )
        plt.title(f"Step={t}, w={current_w:.3f}, b={current_b:.3f}")
        plt.show()

You should see the values of $w$ and $b$ get closer to $2$ and $5$ respectively.¶
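As a quick check (assuming the training loop above has just run), you can print the learned parameters and compare them to the true values:

print(f"learned w = {linear_model.weight.item():.3f} (true w = 2)")
print(f"learned b = {linear_model.bias.item():.3f} (true b = 5)")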

5. A simple text-classification model with PyTorch¶
Until now, we’ve been encoding all our text input in one-hot vectors, but in this lab we’ll use a more semantically meaningful representation – GloVe word embeddings.

These embeddings are trained with the intuition that words that appear nearby each other in natural language are semantically related. We can quantify this semantic similarity by constructing a co-occurrence matrix; that is, given a corpus of text, a vocabulary list and a fixed window size, we count the number of times that each vocab word occurs in the context of all other words. Then if our window size is $k$, the $ij$-th entry of our co-occurrence matrix $X$ is the number of times that vocab word $i$ occurs within $k$ words of vocab word $j$.

Example corpus:

“The cat will eat the apple but not the orange.”

Question: Given the vocab list {“cat”, “eat”, “apple”, “orange”} and a window size of 4, what does the co-occurrence matrix for this corpus look like?
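If you want to verify your answer programmatically, here is a minimal sketch of a co-occurrence counter (the function name and the naive whitespace tokenization are ours, not part of the lab):

def cooccurrence_matrix(tokens, vocab, window):
    # X[i, j] = number of times vocab[i] appears within `window` tokens of vocab[j].
    index = {word: i for i, word in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=int)
    for i, w_i in enumerate(tokens):
        if w_i not in index:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in index:
                X[index[w_i], index[tokens[j]]] += 1
    return X

vocab = ["cat", "eat", "apple", "orange"]
tokens = "the cat will eat the apple but not the orange".split()  # punctuation stripped
print(cooccurrence_matrix(tokens, vocab, window=4))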

GloVe embeddings are trained such that for the embeddings $w_i$ and $w_j$ of words $i$ and $j$,

$$w_i^T w_j \propto \log(X_{ij})$$

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$ in the training corpus. Note that $w_i^T w_j$ is simply the dot product of $w_i$ and $w_j$, which, up to the two vectors' norms, is proportional to the cosine of the angle between them. As such, this objective encodes the semantic similarity between two words as the cosine similarity between their respective embedding vectors. Below are some examples of pre-trained GloVe embeddings projected onto a 2D plane:

(Image taken from here.)

Download data and pre-trained GloVe word embeddings

# === Download data and GloVe word embeddings
# !wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
# !wget http://nlp.stanford.edu/data/glove.6B.zip

# === Unzip yelp dataset and use a truncated version of the dataset
# !tar zxvf yelp_review_polarity_csv.tgz
# !head -n 10000 yelp_review_polarity_csv/train.csv > train_10000.csv
# !head -n 1000 yelp_review_polarity_csv/test.csv > test_1000.csv

# === Unzip word embeddings and use only the top 50000 word embeddings for speed
# !unzip glove.6B.zip
# !head -n 50000 glove.6B.300d.txt > glove.6B.300d__50k.txt

# === Download Preprocessed version
!wget https://docs.google.com/uc?id=1iBc6Mc8rt1T9Uo6QH9rdVQOL6ZjwvTGf -O train_10000.csv
!wget https://docs.google.com/uc?id=13B-BqMHTyUehUZkyNBaHYByCjGUUZXwd -O test_1000.csv
!wget https://docs.google.com/uc?id=1KMJTagaVD9hFHXFTPtNk0u2JjvNlyCAu -O glove_split.aa
!wget https://docs.google.com/uc?id=1LF2yD2jToXriyD-lsYA5hj03f7J3ZKaY -O glove_split.ab
!wget https://docs.google.com/uc?id=1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f -O glove_split.ac
!cat glove_split.?? > 'glove.6B.300d__50k.txt'

–2022-02-17 20:41:40– https://docs.google.com/uc?id=1iBc6Mc8rt1T9Uo6QH9rdVQOL6ZjwvTGf
Resolving docs.google.com (docs.google.com)… 172.217.219.138, 172.217.219.113, 172.217.219.101, …
Connecting to docs.google.com (docs.google.com)|172.217.219.138|:443… connected.
HTTP request sent, awaiting response… 303 See Other
Location: https://doc-10-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/61a9lam3613km0lfkjs8kvj14mlledce/1645130475000/14514704803973256873/*/1iBc6Mc8rt1T9Uo6QH9rdVQOL6ZjwvTGf [following]
Warning: wildcards not supported in HTTP.
–2022-02-17 20:41:41– https://doc-10-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/61a9lam3613km0lfkjs8kvj14mlledce/1645130475000/14514704803973256873/*/1iBc6Mc8rt1T9Uo6QH9rdVQOL6ZjwvTGf
Resolving doc-10-0g-docs.googleusercontent.com (doc-10-0g-docs.googleusercontent.com)… 172.217.219.132, 2607:f8b0:4001:c13::84
Connecting to doc-10-0g-docs.googleusercontent.com (doc-10-0g-docs.googleusercontent.com)|172.217.219.132|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 7163576 (6.8M) [text/csv]
Saving to: ‘train_10000.csv’

train_10000.csv 100%[===================>] 6.83M 30.1MB/s in 0.2s

2022-02-17 20:41:42 (30.1 MB/s) – ‘train_10000.csv’ saved [7163576/7163576]

–2022-02-17 20:41:42– https://docs.google.com/uc?id=13B-BqMHTyUehUZkyNBaHYByCjGUUZXwd
Resolving docs.google.com (docs.google.com)… 142.250.152.100, 142.250.152.102, 142.250.152.138, …
Connecting to docs.google.com (docs.google.com)|142.250.152.100|:443… connected.
HTTP request sent, awaiting response… 303 See Other
Location: https://doc-00-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/cbfp6l14n59ljcoilc257g8c8iiddad9/1645130475000/14514704803973256873/*/13B-BqMHTyUehUZkyNBaHYByCjGUUZXwd [following]
Warning: wildcards not supported in HTTP.
–2022-02-17 20:41:43– https://doc-00-0g-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/cbfp6l14n59ljcoilc257g8c8iiddad9/1645130475000/14514704803973256873/*/13B-BqMHTyUehUZkyNBaHYByCjGUUZXwd
Resolving doc-00-0g-docs.googleusercontent.com (doc-00-0g-docs.googleusercontent.com)… 172.217.219.132, 2607:f8b0:4001:c13::84
Connecting to doc-00-0g-docs.googleusercontent.c
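Once the downloads above complete, here is a minimal sketch of reading the truncated GloVe file and comparing a few words by cosine similarity (the helper functions are ours; we assume the standard GloVe text format of one word followed by 300 floats per line):

def load_glove(path, dim=300):
    # Read whitespace-separated GloVe vectors into a {word: np.array} dict.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:  # skip any malformed lines
                embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.6B.300d__50k.txt")
# Related words should score noticeably higher than unrelated ones.
print(cosine(glove["cat"], glove["dog"]))
print(cosine(glove["cat"], glove["piano"]))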
