CS代写 COM4509/6509 MLAI2021 @ The University of Sheffield

Solution – Lab 6 – Logistic regression & pytorch for DL

Lab 6: Logistic Regression & PyTorch for Deep Learning¶
A: Logistic Regression ; B: Linear Regression with PyTorch NN¶

Haiping Lu – COM4509/6509 MLAI2021 @ The University of Sheffield

Accompanying lectures: YouTube video lectures recorded in Year 2020/21.

Sources: Part A is based on the one neuron notebooks by . Part B is based on the PyTorch tutorial from CSE446, University of Washington and Lab 1 of my SimplyDeep notebooks.

Reproducibility: This seed module shows how we set seed to ensure reproducibility in our PyKale library.

Machine learning pipeline: Take a look at the PyKale library for pipeline-based APIs.

Note: Try to answer the five questions when you first see them rather than coming back after going through the rest.

A: Logistic Regression¶

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

A1. The sigmoid function¶
The sigmoid or logistic function is essential in binary classification problems. It is expressed as
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
and here is what it looks like in 1D:

# define parameters
# the bias:
# the weight:

def sigmoid(x1):
# z is a linear function of x1
z = w*x1 + b
return 1 / (1+np.exp(-z))

# create an array of evenly spaced values
linx = np.linspace(-10,10,51)
plt.plot(linx, sigmoid(linx))
plt.plot(linx, sigmoid(linx))
plt.plot(linx, sigmoid(linx))

[]

Question 1¶
What are the parameters (w and b) for each of the three curves orange (left), green (middle) and blue (right)?

Orange (left): w=1, b=5

Green (middle): w=5, b=5

Blue (right): w=1, b=0

Let’s look at this function in more details:

when $z$ goes to infinity, $e^{-z}$ goes to zero, and $\sigma (z)$ goes to one.
when $z$ goes to minus infinity, $e^{-z}$ goes to infinity, and $\sigma (z)$ goes to zero.
$\sigma(0) = 0.5$, since $e^0=1$.

It is important to note that the sigmoid is bound between 0 and 1, like a probability. And actually, in binary classification problems, the probability for an example to belong to a given category is produced by a sigmoid function. To classify our examples, we can simply use the output of the sigmoid: A given unknown example with value $x$ will be classified to category 1 if $\sigma(z) > 0.5$, and to category 0 otherwise.

Excercise: Now you can go back to the cell above, and play a bit with the b and w parameters, redoing the plot everytime you change one of these parameters.

$b$ is the bias. Changing the bias simply moves the sigmoid along the horizontal axis. For example, if you choose $b=1$ and $w=1$, then $z = wx + b = 0$ at $x=-1$, and that’s where the sigmoid will be equal to 0.5
$w$ is the weight of variable $x$. If you increase it, the sigmoid evolves faster as a function of $x$ and gets sharper.

A2. Logistic regression as the simplest neural network¶

We will build the simplest neural network to classify our examples:

Each example has one variable, so we need 1 input node on the input layer
We’re not going to use any hidden layer, as that would complicate the network
We have two categories, so the output of the network should be a single value between 0 and 1, which is the estimated probability $p$ for an example to belong to category 1. Then, the probability to belong to category 0 is simply $1-p$. Therefore, we should have a single output neuron, the only neuron in the network.

The sigmoid function can be used in the output neuron. Indeed, it spits out a value between 0 and 1, and can be used as a classification probability as we have seen in the previous section.

We can represent our network in the following way:

In the output neuron:

the first box performs a change of variable and computes the weighted input $z$ of the neuron
the second box applies the activation function to the weighted input. Here, we choose the sigmoid $\sigma (z) = 1/(1+e^{-z})$ as an activation function

This simple network has only 2 tunable parameters, the weight $w$ and the bias $b$, both used in the first box. We see in particular that when the bias is very large, the neuron will always be activated, whatever the input. On the contrary, for very negative biases, the neuron is dead.

We can write the output simply as a function of $x$,

$$f(x) = \sigma(z) = \sigma(wx+b)$$This is exactly the logistic regression classifier.

Question 2¶
How can we rewrit the logistic regression classifier above using a single vectorial parameter (i.e. one vector containing all parameters)?

Assuming w is a (1xn) row vector, then by creating an extra column in the the w, called wbias,
then we will have, e.g. w=[w1 w2 wbias] and append an extra row into x ( our feature vector) and set it to 1, e.g. to be x= [x1; x2; 1].

A3. Classifying 2D dataset with logistic regression¶
Let’s create a sample of examples with two values x1 and x2, with two categories.
For category 0, the underlying probability distribution is a 2D Gaussian centered on (0,0), with width = 1 along both directions. For category 1, the Gaussian is centered on (2,2). We assign label 0 to category 0, and label 1 to category 1.

Dataset creation¶
Let’s create a sample of examples with two values x1 and x2, with two categories.
For category 0, the underlying probability distribution is a 2D Gaussian centered on (0,0), with width = 1 along both directions. For category 1, the Gaussian is centered on (2,2). We assign label 0 to category 0, and label 1 to category 1. Check out the documentation for Gaussian data generation

normal = np.random.multivariate_normal
# Number of samples
nSamples = 500
# (unit) variance:
# below, we provide the coordinates of the mean as
# a first argument, and then the covariance matrix
# we generate nexamples examples for each category
sgx0 = normal([0.,0.], [[s2, 0.], [0.,s2]], nSamples)
sgx1 = normal([2.,2.], [[s2, 0.], [0.,s2]], nSamples)
# setting the labels for each category
sgy0 = np.zeros((nSamples,))
sgy1 = np.ones((nSamples,))

Here is a scatter plot for the examples in the two categories

plt.scatter(sgx0[:,0], sgx0[:,1], alpha=0.5)
plt.scatter(sgx1[:,0], sgx1[:,1], alpha=0.5)
plt.xlabel(‘x1’)
plt.ylabel(‘x2’)

Text(0, 0.5, ‘x2’)

Our goal is to train a logistic regression to classify (x1,x2) points in one of the two categories depending on the values of x1 and x2. We form the dataset by concatenating the arrays of points, and also the arrays of labels for later use:

sgx = np.concatenate((sgx0, sgx1))
sgy = np.concatenate((sgy0, sgy1))

print(sgx.shape[1], sgy.shape[0])

2D sigmoid¶
In 2D, the expression of the sigmoid remains the same, but $z$ is now a function of the two variables $x_1$ and $x_2$,

$$z=w_1 x_1 + w_2 x_2 + b$$And here is the code for the 2D sigmoid and the defined function is called sigmoid_2d:

# define parameters
# x1 weight:
# x2 weight:

def sigmoid_2d(x1, x2):
# z is a linear function of x1 and x2
z = w1*x1 + w2*x2 + b
return 1 / (1+np.exp(-z))

To see what this function looks like, we can make a 2D plot, with x1 on the horizontal axis, x2 on the vertical axis, and the value of the sigmoid represented as a color for each (x1, x2) coordinate. To do that, we will create an array of evenly spaced values along x1, and another array along x2. Taken together, these arrays will allow us to map the (x1,x2) plane.

xmin, xmax, npoints = (-6,6,51)
linx1 = np.linspace(xmin,xmax,npoints)
# no need for a new array, we just reuse the one we have with another name:
linx2 = linx1

Then, we create a meshgrid from these arrays:

gridx1, gridx2 = np.meshgrid(np.linspace(xmin,xmax,npoints), np.linspace(xmin,xmax,npoints))
print(gridx1.shape, gridx2.shape)
print(‘gridx1:’)
print(gridx1)
print(‘gridx2’)
print(gridx2)

(51, 51) (51, 51)
[[-6. -5.76 -5.52 … 5.52 5.76 6. ]
[-6. -5.76 -5.52 … 5.52 5.76 6. ]
[-6. -5.76 -5.52 … 5.52 5.76 6. ]
[-6. -5.76 -5.52 … 5.52 5.76 6. ]
[-6. -5.76 -5.52 … 5.52 5.76 6. ]
[-6. -5.76 -5.52 … 5.52 5.76 6. ]]
[[-6. -6. -6. … -6. -6. -6. ]
[-5.76 -5.76 -5.76 … -5.76 -5.76 -5.76]
[-5.52 -5.52 -5.52 … -5.52 -5.52 -5.52]
[ 5.52 5.52 5.52 … 5.52 5.52 5.52]
[ 5.76 5.76 5.76 … 5.76 5.76 5.76]
[ 6. 6. 6. … 6. 6. 6. ]]

if you take the first line in both arrays, and scan the values on this line, you get: (-6,-6), (-5.76, -6), (-5.52, -6)… So we are scanning the x1 coordinates sequentially at the bottom of the plot. If you take the second line, you get: (-6, -5.76), (-5.76, -5.76), (-5.52, -5.76) … : we are scanning the second line at the bottom of the plot, after moving up in x2 from one step.

Scanning the full grid, you would scan the whole plot sequentially.

Now we need to compute the value of the sigmoid for each pair (x1,x2) in the grid using the sigmoid_2d function defined above (cell 6). That’s very easy to do with the output of meshgrid:

z = sigmoid_2d(gridx1, gridx2)

This calls the sigmoid_2d function to each pair (x1,y2) taken from the gridx1 and gridx2 arrays so that we can plot our sigmoid in 2D:

plt.pcolor(gridx1, gridx2, z)
plt.xlabel(‘x1’)
plt.ylabel(‘x2’)
plt.colorbar()

The 2D sigmoid has the same kind of rising edge as the 1D sigmoid, but in 2D.
With the parameters defined above:

The weight of $x_2$ is twice larger than the weight of $x_1$, so the sigmoid evolves twice faster as a function of $x_2$.
The separation boundary, which occurs for $z=0$, is a straight line with equation $w_1 x_1 + w_2 x_2 + b = 0$. Or equivalently:

$$x_2 = -\frac{w_1}{w_2} x_1 – \frac{b}{w_2} = -0.5 x_1$$

Question 3¶
If you set one of the weights to zero, what will happen? Also, verify on the plot above that the equation above is indeed the one describing the separation boundary.

If we set w2 (w1) to zero then this sigmoid will be a 1D sigmoid of x1 (x2).

The equation above describes a straight line $x_2=-\frac{1}{2}x_1$, which is a straight line passing the origin (0,0) with a slope of $-\frac{1}{2}$ that can be verified on the figure.

Now you can test by editing the function sigmoid_2d, before re-executing the above cells.

Note that if you prefer, you can plot the sigmoid in 3D like this:

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.plot_wireframe(gridx1,gridx2,z)

Exercise: change the parameters to observe how the 2D sigmoid changes.

Logistic regression on the 2D data¶
Let’s now train a logistic regression to separate the two classes of examples. The goal of the training will be to use the existing examples to find the optimal values for the parameters $w_1, w_2, b$.

We take the logistic regression algorithm from scikit-learn.
Here, the logistic regression is used with the lbfgs solver. LBFGS is the minimization method used to find the best parameters. It is similar to Newton’s method. Since there is randomness, setting a seed is a good practice for reproducibility.

from sklearn.linear_model import LogisticRegression
np.random.seed(2020) #set a seed for reproducibility
clf = LogisticRegression(solver=’lbfgs’) #clf: classifier
clf.fit(sgx, sgy)

LogisticRegression()

Note from the above that the default setting for logistic regression in scikit-learn uses L2 regularisation (penalty).

Question 4¶
What is the objective of L2 regularisation (penalty)? Hint: this is not covered in lecture and you need to do some study (search).

This penalty term limits (penalises) the magnitude of the optimal weights from growing too large, typically to avoid overfitting by preferring a simpler model. See Regularization for Simplicity: L₂ Regularization.

Check out the documentation to learn other options for penalty (regularisation) and other settings. In the simplest form, logistic regression does not have any hyperparameters but in practice, regularisation is often used, e.g. to reduce overfitting.

The logistic regression has been fitted (trained) to the data. Now, we can use it to predict the probability for any given (x1,x2) point to belong to category 1.

We would like to plot this probability in 2D as a function of x1 and x2. To do that, we need to use the clf.predict_proba method which takes a 2D array of shape (n_points, 2). The first dimension indexes the points, and the second one contains the values of x1 and x2. Again, we use our grid to map the (x1,x2) plane. But the gridx1 and gridx2 arrays defined above contain disconnected values of x1 and x2:

print(gridx1.shape, gridx2.shape)

(51, 51) (51, 51)

What we want is a 2D array of shape (n_points, 2), not two 2D arrays of shape (51, 51)…
So we need to reshape these arrays. First, we will flatten the gridx1 and gridx2 arrays so that all their values appear sequentially in a 1D array. Here is a small example to show how flatten works:

a = np.array([[0, 1], [2, 3]])
print(‘flat array:’, a.flatten())

flat array: [0 1 2 3]

Then, we will stitch the two 1D arrays together in two columns with np.c_ like this:

b = np.array([[4, 5], [6, 7]])
print(a.flatten())
print(b.flatten())
c = np.c_[a.flatten(), b.flatten()]
print(c.shape)

This array has exactly the shape expected by clf.predict_proba: a list of samples with two values. So let’s do the same with our meshgrid, and let’s compute the probabilities for all (x1,x2) pairs in the grid:

grid = np.c_[gridx1.flatten(), gridx2.flatten()]
prob = clf.predict_proba(grid)
prob.shape

Now, prob does not have the right shape to be plotted. Below, we will use a gridx1 and a gridx2 array with shapes (51,51). The shape of the prob array must also be (51,51), as the plotting method will simply map each (x1,x2) pair to a probability. So we need to reshape our probability array to shape (51,51). Reshaping works like this:

d = np.array([0,1,2,3])
print(‘reshaped to (2,2):’)
print(d.reshape(2,2))

reshaped to (2,2):

Finally (!) we can do our plot:

# note that prob[:,1] returns, for all exemples, the probability p to belong to category 1.
# prob[:,0] would return the probability to belong to category 0 (which is 1-p)
plt.pcolor(gridx1,gridx2,prob[:,1].reshape(npoints,npoints))
plt.colorbar()
plt.scatter(sgx0[:,0], sgx0[:,1], alpha=0.5)
plt.scatter(sgx1[:,0], sgx1[:,1], alpha=0.5)
plt.xlabel(‘x1’)
plt.ylabel(‘x2’)

Text(0, 0.5, ‘x2’)

We see that the logistic regression is able to separate these two classes well and the decision boundary is linear.

B: Linear Regression with PyTorch NN¶

Objective¶
To perform linear regression using PyTorch for understanding the link between linear models and neural networks.

Related Posts