HW5: Neural Networks¶
Introduction¶
In this exercise, you will implement the backpropagation algorithm for neural networks and apply it to the task of hand-written digit recognition.
In [ ]:
# used for manipulating directory paths
import os

# Scientific and vector computation for python
import numpy as np

# Plotting library
import matplotlib.pyplot as plt

# Optimization module in scipy
from scipy import optimize

# will be used to load MATLAB mat datafile format
from scipy.io import loadmat

# tells matplotlib to embed plots within the notebook
%matplotlib inline

import scipy.io as scio

We start the exercise by first loading the dataset.
In [ ]:
# training data stored in arrays X, y
data = loadmat(os.path.join('Data', 'ex5data1.mat'))
X = data['X']
y = data['y'].flatten()

# Number of training examples
m = y.size

1.1 Visualizing the data¶
You will begin by visualizing a subset of the training set, using the function display_data.
There are 5000 training examples in ex5data1.mat, where each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location. The 20 by 20 grid of pixels is “unrolled” into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix $X$. This gives us a 5000 by 400 matrix $X$ where every row is a training example for a handwritten digit image.
$$ X = \begin{bmatrix} - \left(x^{(1)} \right)^T - \\ - \left(x^{(2)} \right)^T - \\ \vdots \\ - \left(x^{(m)} \right)^T - \\ \end{bmatrix} $$
The second part of the training set is a 5000-dimensional vector y that contains labels for the training set. The following cell randomly selects 100 images from the dataset and plots them.
In [ ]:
def display_data(x):
    (m, n) = x.shape

    # Compute the width and height of each example image
    example_width = np.round(np.sqrt(n)).astype(int)
    example_height = (n / example_width).astype(int)

    # Compute the number of items to display
    display_rows = np.floor(np.sqrt(m)).astype(int)
    display_cols = np.ceil(m / display_rows).astype(int)

    # Padding between images
    pad = 1

    # Set up a blank display array
    display_array = -np.ones((pad + display_rows * (example_height + pad),
                              pad + display_cols * (example_width + pad)))

    # Copy each example into a patch on the display array
    curr_ex = 0
    for j in range(display_rows):
        for i in range(display_cols):
            if curr_ex >= m:
                break

            # Copy the patch, normalized by the max absolute value of the example
            max_val = np.max(np.abs(x[curr_ex]))
            display_array[pad + j * (example_height + pad) + np.arange(example_height),
                          pad + i * (example_width + pad) + np.arange(example_width)[:, np.newaxis]] = \
                x[curr_ex].reshape((example_height, example_width)) / max_val
            curr_ex += 1

        if curr_ex >= m:
            break

    # Display the image
    plt.figure()
    plt.imshow(display_array, cmap='gray', extent=[-1, 1, -1, 1])
    plt.axis('off')
In [ ]:
# Randomly select 100 data points to display
rand_indices = np.random.permutation(range(m))
selected = X[rand_indices[0:100], :]

display_data(selected)

1.2 Model representation¶
Our neural network is shown in the following figure.

It has 3 layers – an input layer, a hidden layer and an output layer. Recall that our inputs are pixel values of digit images. Since the images are of size $20 \times 20$, this gives us 400 input layer units (not counting the extra bias unit which always outputs +1). The training data was loaded into the variables X and y above.
You have been provided with a set of network parameters ($\Theta^{(1)}, \Theta^{(2)}$). These are stored in ex5weights.mat and will be loaded in the next cell of this notebook into Theta1 and Theta2. The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes).
In [ ]:
# Setup the parameters you will use for this exercise
input_layer_size = 400 # 20×20 Input Images of Digits
hidden_layer_size = 25 # 25 hidden units
num_labels = 10 # 10 labels, from 0 to 9

# Load the weights into variables Theta1 and Theta2
weights = loadmat(os.path.join('Data', 'ex5weights.mat'))

# Theta1 has size 25 x 401
# Theta2 has size 10 x 26
Theta1, Theta2 = weights['Theta1'], weights['Theta2']

# Unroll parameters
nn_params = np.concatenate([Theta1.ravel(), Theta2.ravel()])

1.3 Feedforward and cost function (15 pts)¶
Now you will implement the cost function and gradient for the neural network. First, complete the code for the function nnCostFunction in the next cell to return the cost.
Recall that the cost function for the neural network (without regularization) is:
$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\sum_{k=1}^{K} \left[ - y_k^{(i)} \log \left( \left( h_\theta \left( x^{(i)} \right) \right)_k \right) - \left( 1 - y_k^{(i)} \right) \log \left( 1 - \left( h_\theta \left( x^{(i)} \right) \right)_k \right) \right]$$
where $h_\theta \left( x^{(i)} \right)$ is computed as shown in the neural network figure above, and K = 10 is the total number of possible labels. Note that $h_\theta(x^{(i)})_k = a_k^{(3)}$ is the activation (output value) of the $k^{th}$ output unit. Also, recall that whereas the original labels (in the variable y) were 0, 1, …, 9, for the purpose of training a neural network, we need to encode the labels as vectors containing only values 0 or 1, so that
$$ y = \begin{bmatrix} 1 \\ 0 \\ 0 \\\vdots \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \cdots \quad \text{or} \qquad \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}. $$
For example, if $x^{(i)}$ is an image of the digit 5, then the corresponding $y^{(i)}$ (that you should use with the cost function) should be a 10-dimensional vector with $y_5 = 1$, and the other elements equal to 0.
You should implement the feedforward computation that computes $h_\theta(x^{(i)})$ for every example $i$ and sum the cost over all examples. Your code should also work for a dataset of any size, with any number of labels (you can assume that there are always at least 3 labels, i.e., $K \ge 3$).

**Implementation Note:** The matrix $X$ contains the examples in rows (i.e., X[i,:] is the i-th training example $x^{(i)}$, expressed as an $n \times 1$ vector). When you complete the code in `nnCostFunction`, you will need to add the column of 1's to the X matrix. The parameters for each unit in the neural network are represented in Theta1 and Theta2 as one row. Specifically, the first row of Theta1 corresponds to the first hidden unit in the second layer. You can use a for-loop over the examples to compute the cost.
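For orientation, here is a minimal vectorized sketch of the unregularized cost. It assumes the sigmoid helper defined in the next cell and labels encoded as 0..K-1; the variable names are illustrative only, and a per-example loop as suggested above is equally valid:

def feedforward_cost_sketch(theta1, theta2, X, y, num_labels):
    # theta1: (hidden x input+1), theta2: (labels x hidden+1), X: (m x input).
    m = X.shape[0]

    # Forward pass, prepending a column of ones for the bias unit at each layer.
    a1 = np.concatenate([np.ones((m, 1)), X], axis=1)
    a2 = sigmoid(a1.dot(theta1.T))
    a2 = np.concatenate([np.ones((m, 1)), a2], axis=1)
    h = sigmoid(a2.dot(theta2.T))

    # One-hot encode the labels (assuming they run from 0 to K-1).
    Y = np.eye(num_labels)[y.astype(int)]

    # Cross-entropy cost summed over classes and averaged over examples.
    return (-1 / m) * np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h))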

In [ ]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
In [ ]:
def nnCostFunction(nn_params,
                   input_layer_size,
                   hidden_layer_size,
                   num_labels,
                   X, y, lmd=0.0):
    """
    Implements the neural network cost function and gradient for a two layer neural
    network which performs classification.

    Parameters
    ----------
    nn_params : array_like
        The parameters for the neural network which are "unrolled" into
        a vector. This needs to be converted back into the weight matrices Theta1
        and Theta2.

    input_layer_size : int
        Number of features for the input layer.

    hidden_layer_size : int
        Number of hidden units in the second layer.

    num_labels : int
        Total number of labels, or equivalently number of units in output layer.

    X : array_like
        Input dataset. A matrix of shape (m x input_layer_size).

    y : array_like
        Dataset labels. A vector of shape (m,).

    lmd : float, optional
        Regularization parameter.

    Returns
    -------
    J : float
        The computed value for the cost function at the current weight values.

    grad : array_like
        An "unrolled" vector of the partial derivatives of the concatenation of
        neural network weights Theta1 and Theta2.

    Instructions
    ------------
    You should complete the code by working through the following parts.

    - Part 1: Feedforward the neural network and return the cost in the
      variable J. After implementing Part 1, you can verify that your
      cost function computation is correct by verifying the cost
      computed in the following cell.

    - Part 2: Implement the backpropagation algorithm to compute the gradients
      Theta1_grad and Theta2_grad. You should return the partial derivatives of
      the cost function with respect to Theta1 and Theta2 in Theta1_grad and
      Theta2_grad, respectively. After implementing Part 2, you can check
      that your implementation is correct by running the check_nn_gradients
      function provided later in this notebook.

      Note: The vector y passed into the function is a vector of labels
      containing values from 0..K-1. You need to map this vector into a
      binary vector of 1's and 0's to be used with the neural network
      cost function.

      Hint: We recommend implementing backpropagation using a for-loop
      over the training examples if you are implementing it for the
      first time.

    - Part 3: Implement regularization with the cost function and gradients.

      Hint: You can implement this around the code for
      backpropagation. That is, you can compute the gradients for
      the regularization separately and then add them to Theta1_grad
      and Theta2_grad from Part 2.

    """
    # Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices
    # for our 2 layer neural network
    theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)],
                        (hidden_layer_size, (input_layer_size + 1)))

    theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):],
                        (num_labels, (hidden_layer_size + 1)))

    # Setup some useful variables
    m = y.size

    # You need to return the following variables correctly
    J = 0
    theta1_grad = np.zeros(theta1.shape)  # 25 x 401
    theta2_grad = np.zeros(theta2.shape)  # 10 x 26

    # ===================== Your Code Here =====================

    # ================================================================
    # Unroll gradients
    grad = np.concatenate([theta1_grad.ravel(), theta2_grad.ravel()])

    return J, grad

Use the following links to go back to the different parts of this exercise that require you to modify the function `nnCostFunction`.
Back to:
- [Feedforward and cost function](#section1)
- [Regularized cost](#section2)
- [Neural Network Gradient (Backpropagation)](#section4)
- [Regularized Gradient](#section5)

Once you are done, call your nnCostFunction using the loaded set of parameters for Theta1 and Theta2.
In [ ]:
lambda_ = 0
J, _ = nnCostFunction(nn_params, input_layer_size, hidden_layer_size,
                      num_labels, X, y, lambda_)
print('Cost at parameters (loaded from ex5weights): %.6f' % J)

1.4 Regularized cost function (10 pts)¶
The cost function for neural networks with regularization is given by:
$$ J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\sum_{k=1}^{K} \left[ - y_k^{(i)} \log \left( \left( h_\theta \left( x^{(i)} \right) \right)_k \right) - \left( 1 - y_k^{(i)} \right) \log \left( 1 - \left( h_\theta \left( x^{(i)} \right) \right)_k \right) \right] + \frac{\lambda}{2 m} \left[ \sum_{j=1}^{25} \sum_{k=1}^{400} \left( \Theta_{j,k}^{(1)} \right)^2 + \sum_{j=1}^{10} \sum_{k=1}^{25} \left( \Theta_{j,k}^{(2)} \right)^2 \right] $$
You can assume that the neural network will only have 3 layers: an input layer, a hidden layer and an output layer. However, your code should work for any number of input units, hidden units and output units. While we have explicitly listed the indices above for $\Theta^{(1)}$ and $\Theta^{(2)}$ for clarity, do note that your code should in general work with $\Theta^{(1)}$ and $\Theta^{(2)}$ of any size. Note that you should not be regularizing the terms that correspond to the bias. For the matrices Theta1 and Theta2, this corresponds to the first column of each matrix. You should now add regularization to your cost function. Notice that you can first compute the unregularized cost function $J$ using your existing nnCostFunction and then later add the cost for the regularization terms.
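As a hedged sketch (not the required solution), the regularization term can be added inside nnCostFunction after the unregularized cost J has been computed, using the lmd and m variables already defined there and excluding the first (bias) column of each weight matrix:

# Inside nnCostFunction, after computing the unregularized cost J:
reg = (lmd / (2 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
J = J + reg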
Click here to go back to nnCostFunction for editing.

Once you are done, the next cell will call your nnCostFunction using the loaded set of parameters for Theta1 and Theta2, and $\lambda = 1$.
In [ ]:
# Weight regularization parameter (we set this to 1 here).
lambda_ = 1
J, _ = nnCostFunction(nn_params, input_layer_size, hidden_layer_size,
                      num_labels, X, y, lambda_)

print('Cost at parameters (loaded from ex5weights): %.6f' % J)

2 Backpropagation¶
In this part of the exercise, you will implement the backpropagation algorithm to compute the gradient for the neural network cost function. You will need to update the function nnCostFunction so that it returns an appropriate value for grad. Once you have computed the gradient, you will be able to train the neural network by minimizing the cost function $J(\theta)$ using an advanced optimizer such as scipy’s optimize.minimize. You will first implement the backpropagation algorithm to compute the gradients for the parameters for the (unregularized) neural network. After you have verified that your gradient computation for the unregularized case is correct, you will implement the gradient for the regularized neural network.

2.1 Sigmoid Gradient (5 pts)¶
To help you get started with this part of the exercise, you will first implement the sigmoid gradient function. The gradient for the sigmoid function can be computed as
$$ g'(z) = \frac{d}{dz} g(z) = g(z)\left(1-g(z)\right) $$
where
$$ \text{sigmoid}(z) = g(z) = \frac{1}{1 + e^{-z}} $$
Now complete the implementation of sigmoidGradient in the next cell.
In [ ]:
def sigmoidGradient(z):
    """
    Computes the gradient of the sigmoid function evaluated at z.
    This should work regardless of whether z is a matrix or a vector.
    In particular, if z is a vector or matrix, you should return
    the gradient for each element.

    Parameters
    ----------
    z : array_like
        A vector or matrix as input to the sigmoid function.

    Returns
    -------
    g : array_like
        Gradient of the sigmoid function. Has the same shape as z.

    Instructions
    ------------
    Compute the gradient of the sigmoid function evaluated at
    each value of z (z can be a matrix, vector or scalar).

    """

    g = np.zeros(z.shape)

    # ====================== YOUR CODE HERE ======================

    g = sigmoid(z) * (1 - sigmoid(z))

    # =============================================================
    return g

When you are done, the following cell calls sigmoidGradient on a given vector z. Try testing a few values by calling sigmoidGradient(z). For large values (both positive and negative) of z, the gradient should be close to 0. When $z = 0$, the gradient should be exactly 0.25. Your code should also work with vectors and matrices. For a matrix, your function should compute the sigmoid gradient for every element.
In [ ]:
z = np.array([-1, -0.5, 0, 0.5, 1])
g = sigmoidGradient(z)
print('Sigmoid gradient evaluated at [-1 -0.5 0 0.5 1]:\n')
print(g)

2.2 Random Initialization (5 pts)¶
When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for $\Theta^{(l)}$ uniformly in the range $[-\epsilon_{init}, \epsilon_{init}]$. You should use $\epsilon_{init} = 0.12$. This range of values ensures that the parameters are kept small and makes the learning more efficient.
One effective strategy for choosing $\epsilon_{init}$ is to base it on the number of units in the network. A good choice of $\epsilon_{init}$ is $\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$ where $L_{in} = s_l$ and $L_{out} = s_{l+1}$ are the number of units in the layers adjacent to $\Theta^{(l)}$.
Your job is to complete the function randInitializeWeights to initialize the weights for $\Theta$. Modify the function by filling in the following code:
# Randomly initialize the weights to small values
W = np.random.rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init
Note that we give the function an argument for $\epsilon$ with default value epsilon_init = 0.12.
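As a quick sanity check of the heuristic above for the first layer of this network ($L_{in} = 400$, $L_{out} = 25$), the following short snippet shows where the suggested value 0.12 comes from:

import numpy as np

# sqrt(6) / sqrt(400 + 25) is approximately 0.119, which is why
# epsilon_init = 0.12 is suggested for this network.
print(np.sqrt(6) / np.sqrt(400 + 25))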
In [ ]:
def randInitializeWeights(L_in, L_out, epsilon_init=0.12):
    """
    Randomly initialize the weights of a layer in a neural network.

    Parameters
    ----------
    L_in : int
        Number of incoming connections.

    L_out : int
        Number of outgoing connections.

    epsilon_init : float, optional
        Range of values which the weights can take from a uniform
        distribution.

    Returns
    -------
    W : array_like
        The weights initialized to random values. Note that W should
        be set to a matrix of size (L_out, 1 + L_in) as
        the first column of W handles the "bias" terms.

    Instructions
    ------------
    Initialize W randomly so that we break the symmetry while training
    the neural network. Note that the first column of W corresponds
    to the parameters for the bias unit.
    """

    # You need to return the following variables correctly
    W = np.zeros((L_out, 1 + L_in))

    # ====================== YOUR CODE HERE ======================

    W = np.random.rand(L_out, 1 + L_in) * (2 * epsilon_init) - epsilon_init

    # ============================================================
    return W

Execute the following cell to initialize the weights for the 2 layers in the neural network using the randInitializeWeights function.
In [ ]:
print('Initializing Neural Network Parameters ...')

initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size)
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels)

# Unroll parameters
initial_nn_params = np.concatenate([initial_Theta1.ravel(), initial_Theta2.ravel()], axis=0)

2.3 Backpropagation (20 pts)¶

Now, you will implement the backpropagation algorithm. Recall that the intuition behind the backpropagation algorithm is as follows. Given a training example $(x^{(t)}, y^{(t)})$, we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_\theta(x)$. Then, for each node $j$ in layer $l$, we would like to compute an “error term” $\delta_j^{(l)}$ that measures how much that node was “responsible” for any errors in our output.
For an output node, we can directly measure the difference between the network’s activation and the true target value, and use that to define $\delta_j^{(3)}$ (since layer 3 is the output layer). For the hidden units, you will compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer $(l+1)$. In detail, here is the backpropagation algorithm (also depicted in the figure above). You should implement steps 1 to 4 in a loop that processes one example at a time. Concretely, you should implement a for-loop for t in range(m) and place steps 1-4 below inside the for-loop, with the $t^{th}$ iteration performing the calculation on the $t^{th}$ training example $(x^{(t)}, y^{(t)})$. Step 5 will divide the accumulated gradients by $m$ to obtain the gradients for the neural network cost function.
1. Set the input layer’s values $(a^{(1)})$ to the $t^{th}$ training example $x^{(t)}$. Perform a feedforward pass, computing the activations $(z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)})$ for layers 2 and 3. Note that you need to add a +1 term to ensure that the vectors of activations for layers $a^{(1)}$ and $a^{(2)}$ also include the bias unit. In numpy, if a_1 is a matrix with examples in rows, adding the column of ones corresponds to a_1 = np.concatenate([np.ones((m, 1)), a_1], axis=1).

2. For each output unit $k$ in layer 3 (the output layer), set $$\delta_k^{(3)} = \left(a_k^{(3)} - y_k \right)$$ where $y_k \in \{0, 1\}$ indicates whether the current training example belongs to class $k$ $(y_k = 1)$, or if it belongs to a different class $(y_k = 0)$. You may find logical arrays helpful for this task (explained in the previous programming exercise).

3. For the hidden layer $l = 2$, set $$ \delta^{(2)} = \left( \Theta^{(2)} \right)^T \delta^{(3)} * g'\left(z^{(2)} \right)$$ Note that the symbol $*$ performs element-wise multiplication in numpy.

4. Accumulate the gradient from this example using the following formula. Note that you should skip or remove $\delta_0^{(2)}$. In numpy, removing $\delta_0^{(2)}$ corresponds to delta_2 = delta_2[1:]. $$ \Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)} \left(a^{(l)}\right)^T $$

5. Obtain the (unregularized) gradient for the neural network cost function by multiplying the accumulated gradients by $\frac{1}{m}$ (a hedged sketch of the whole loop is given after this list): $$ \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}$$
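For orientation, here is a minimal sketch of the loop described in steps 1 to 5, written as a standalone helper. It assumes the sigmoid and sigmoidGradient functions from this notebook and a one-hot label matrix Y; the names are illustrative rather than the required solution:

def backprop_sketch(theta1, theta2, X, Y):
    # Minimal per-example backpropagation loop (steps 1-5 above).
    # theta1: (hidden x input+1), theta2: (labels x hidden+1),
    # X: (m x input), Y: one-hot labels (m x labels).
    m = X.shape[0]
    Delta1 = np.zeros(theta1.shape)
    Delta2 = np.zeros(theta2.shape)

    for t in range(m):
        # Step 1: forward pass, prepending the bias unit at each layer.
        a1 = np.concatenate([[1], X[t]])
        z2 = theta1.dot(a1)
        a2 = np.concatenate([[1], sigmoid(z2)])
        a3 = sigmoid(theta2.dot(a2))

        # Step 2: output-layer error term.
        delta3 = a3 - Y[t]

        # Step 3: hidden-layer error term (bias component removed).
        delta2 = theta2.T.dot(delta3)[1:] * sigmoidGradient(z2)

        # Step 4: accumulate the gradients.
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)

    # Step 5: average the accumulated gradients.
    return Delta1 / m, Delta2 / m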

**Python/Numpy tip**: You should implement the backpropagation algorithm only after you have successfully completed the feedforward and cost functions. While implementing the backpropagation algorithm, it is often useful to print out the `shape` of the variables you are working with if you run into dimension mismatch errors.
Click here to go back and update the function nnCostFunction with the backpropagation algorithm.

After you have implemented the backpropagation algorithm, we will proceed to run gradient checking on your implementation. The gradient check will allow you to increase your confidence that your code is computing the gradients correctly.
2.4 Gradient checking (5 pts)¶
In your neural network, you are minimizing the cost function $J(\Theta)$. To perform gradient checking on your parameters, you can imagine “unrolling” the parameters $\Theta^{(1)}$, $\Theta^{(2)}$ into a long vector $\theta$. By doing so, you can think of the cost function being $J(\Theta)$ instead and use the following gradient checking procedure.
Suppose you have a function $f_i(\theta)$ that purportedly computes $\frac{\partial}{\partial \theta_i} J(\theta)$; you’d like to check if $f_i$ is outputting correct derivative values.
$$ \text{Let } \theta^{(i+)} = \theta + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \epsilon \\ \vdots \\ 0 \end{bmatrix} \quad \text{and} \quad \theta^{(i-)} = \theta - \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \epsilon \\ \vdots \\ 0 \end{bmatrix} $$
So, $\theta^{(i+)}$ is the same as $\theta$, except its $i^{th}$ element has been incremented by $\epsilon$. Similarly, $\theta^{(i−)}$ is the corresponding vector with the $i^{th}$ element decreased by $\epsilon$. You can now numerically verify $f_i(\theta)$’s correctness by checking, for each $i$, that:
$$ f_i\left( \theta \right) \approx \frac{J\left( \theta^{(i+)}\right) - J\left( \theta^{(i-)} \right)}{2\epsilon} $$
The degree to which these two values should approximate each other will depend on the details of $J$. But assuming $\epsilon = 10^{-4}$, you’ll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).
Next, you need to define the function to compute the numerical gradient in compute_numerial_gradient.
In the next cell we will run the provided function check_nn_gradients which will create a small neural network and dataset that will be used for checking your gradients. If your backpropagation implementation is correct, you should see a relative difference that is less than 1e-9.
**Practical Tip**: When performing gradient checking, it is much more efficient to use a small neural network with a relatively small number of input units and hidden units, thus having a relatively small number of parameters. Each dimension of $\theta$ requires two evaluations of the cost function and this can be expensive. In the function `check_nn_gradients`, our code creates a small random model and dataset which is used with `compute_numerial_gradient` for gradient checking. Furthermore, after you are confident that your gradient computations are correct, you should turn off gradient checking before running your learning algorithm.
**Practical Tip:** Gradient checking works for any function where you are computing the cost and the gradient. Concretely, you can use the same `compute_numerial_gradient` function to check if your gradient implementations for the other exercises are correct too (e.g., logistic regression’s cost function).
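For reference, here is a hedged sketch of the central-difference procedure described above, written as a standalone helper rather than the exercise's stub below; it assumes cost_func returns both the cost and the gradient, as the cost_func defined in check_nn_gradients does:

def numerical_gradient_sketch(cost_func, theta, e=1e-4):
    # Approximate each partial derivative with a two-sided difference.
    numgrad = np.zeros(theta.size)
    for i in range(theta.size):
        perturb = np.zeros(theta.size)
        perturb[i] = e
        loss1, _ = cost_func(theta - perturb)   # J(theta^(i-))
        loss2, _ = cost_func(theta + perturb)   # J(theta^(i+))
        numgrad[i] = (loss2 - loss1) / (2 * e)
    return numgrad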
In [ ]:
def compute_numerial_gradient(cost_func, theta):
    numgrad = np.zeros(theta.size)
    perturb = np.zeros(theta.size)

    e = 1e-4

    # ====================== YOUR CODE HERE ======================

    # ============================================================

    return numgrad
In [ ]:
def debug_initialize_weights(fan_out, fan_in):
    # Initialize the weights deterministically using a sine series so that
    # the debugging values are reproducible across runs.
    w = np.zeros((fan_out, 1 + fan_in))

    w = np.sin(np.arange(w.size)).reshape(w.shape) / 10

    return w
In [ ]:
def check_nn_gradients(lmd):

    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5

    # We generate some 'random' test data
    theta1 = debug_initialize_weights(hidden_layer_size, input_layer_size)
    theta2 = debug_initialize_weights(num_labels, hidden_layer_size)

    # Reusing debug_initialize_weights to generate X
    X = debug_initialize_weights(m, input_layer_size - 1)
    y = 1 + np.mod(np.arange(1, m + 1), num_labels)

    # Unroll parameters
    nn_params = np.concatenate([theta1.flatten(), theta2.flatten()])

    def cost_func(p):
        return nnCostFunction(p, input_layer_size, hidden_layer_size, num_labels, X, y, lmd)

    cost, grad = cost_func(nn_params)
    numgrad = compute_numerial_gradient(cost_func, nn_params)

    # Print the analytical and numerical gradients side by side; the two
    # columns should be very similar if backpropagation is correct.
    print(np.c_[grad, numgrad])
In [ ]:
lmd = 0
check_nn_gradients(lmd)

2.5 Regularized Neural Network (10 pts)¶
After you have successfully implemented the backpropagation algorithm, you will add regularization to the gradient. To account for regularization, it turns out that you can add this as an additional term after computing the gradients using backpropagation.
Specifically, after you have computed $\Delta_{ij}^{(l)}$ using backpropagation, you should add regularization using
$$ \begin{align} & \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} & \qquad \text{for } j = 0 \\ & \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} & \qquad \text{for } j \ge 1 \end{align} $$
Note that you should not be regularizing the first column of $\Theta^{(l)}$ which is used for the bias term. Furthermore, in the parameters $\Theta_{ij}^{(l)}$, $i$ is indexed starting from 1, and $j$ is indexed starting from 0. Thus,
$$ \Theta^{(l)} = \begin{bmatrix} \Theta_{1,0}^{(l)} & \Theta_{1,1}^{(l)} & \cdots \\ \Theta_{2,0}^{(l)} & \Theta_{2,1}^{(l)} & \cdots \\ \vdots & ~ & \ddots \end{bmatrix} $$
Now modify your code that computes grad in nnCostFunction to account for regularization.
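As a hedged sketch (again, not the required solution), the regularized gradient can be obtained inside nnCostFunction by adding the penalty to every column except the bias column of the unregularized gradients from Part 2, using the lmd and m variables already defined there:

# Inside nnCostFunction, after computing theta1_grad and theta2_grad in Part 2:
theta1_grad[:, 1:] += (lmd / m) * theta1[:, 1:]
theta2_grad[:, 1:] += (lmd / m) * theta2[:, 1:]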
After you are done, the following cell runs gradient checking on your implementation. If your code is correct, you should expect to see a relative difference that is less than 1e-9.
In [ ]:
# Check gradients by running check_nn_gradients
lambda_ = 3
check_nn_gradients(lambda_)

# Also output the costFunction debugging values
debug_J, _ = nnCostFunction(nn_params, input_layer_size,
                            hidden_layer_size, num_labels, X, y, lambda_)

print('\n\nCost at (fixed) debugging parameters (w/ lambda = %f): %f' % (lambda_, debug_J))

2.6 Learning parameters using scipy.optimize.minimize (10 pts)¶
After you have successfully implemented the neural network cost function and gradient computation, the next step is to use scipy's optimize.minimize to learn a good set of parameters.
In [ ]:
# After you have completed the assignment, change the maxiter to a larger
# value to see how more training helps.
options = {'maxiter': 100}

# You should also try different values of lambda
lambda_ = 1

# Create "short hand" for the cost function to be minimized
costFunction = lambda p: nnCostFunction(p, input_layer_size,
                                        hidden_layer_size,
                                        num_labels, X, y, lambda_)

# Now, costFunction is a function that takes in only one argument
# (the neural network parameters)

# ====================== YOUR CODE HERE ======================

res = optimize.minimize(costFunction,
                        initial_nn_params,
                        jac=True,
                        method='TNC',
                        options=options)

# ============================================================

# get the solution of the optimization
nn_params = res.x

# Obtain Theta1 and Theta2 back from nn_params
Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)],
                    (hidden_layer_size, (input_layer_size + 1)))

Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):],
                    (num_labels, (hidden_layer_size + 1)))

After the training completes, we will proceed to report the training accuracy of your classifier by computing the percentage of examples it got correct. If your implementation is correct, you should see a reported training accuracy of about 95.3% (this may vary by about 1% due to the random initialization). It is possible to get higher training accuracies by training the neural network for more iterations. We encourage you to try training the neural network for more iterations (e.g., set maxiter to 400) and also vary the regularization parameter $\lambda$. With the right learning settings, it is possible to get the neural network to perfectly fit the training set.
In [ ]:
def predict(theta1, theta2, x):
    m = x.shape[0]

    # Feedforward pass through both layers, adding the bias column each time,
    # then pick the class with the highest output activation.
    x = np.c_[np.ones(m), x]
    h1 = sigmoid(np.dot(x, theta1.T))
    h1 = np.c_[np.ones(h1.shape[0]), h1]
    h2 = sigmoid(np.dot(h1, theta2.T))
    p = np.argmax(h2, axis=1) + 1

    return p
In [ ]:
pred = predict(Theta1, Theta2, X)
In [ ]:
print('Training set accuracy: {}'.format(np.mean(pred == y) * 100))

3 Visualizing the Hidden Layer¶
One way to understand what your neural network is learning is to visualize the representations captured by the hidden units. Informally, given a particular hidden unit, one way to visualize what it computes is to find an input $x$ that will cause it to activate (that is, to have an activation value ($a_i^{(l)}$) close to 1). For the neural network you trained, notice that the $i^{th}$ row of $\Theta^{(1)}$ is a 401-dimensional vector that represents the parameters for the $i^{th}$ hidden unit. If we discard the bias term, we get a 400-dimensional vector that represents the weights from each input pixel to the hidden unit.
Thus, one way to visualize the “representation” captured by the hidden unit is to reshape this 400 dimensional vector into a 20 × 20 image and display it (It turns out that this is equivalent to finding the input that gives the highest activation for the hidden unit, given a “norm” constraint on the input (i.e., $||x||_2 \le 1$)).
The next cell does this by using the display_data function, and it will show you an image with 25 units, each corresponding to one hidden unit in the network. In your trained network, you should find that the hidden units correspond roughly to detectors that look for strokes and other patterns in the input.
In [ ]:
display_data(Theta1[:, 1:])

4 Scikit-learn (20 pts)¶
In this part, you need to use scikit-learn to train a neural network model and compare it with the model you trained above. You should also verify the accuracy scores.
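A possible starting point is sketched below, assuming scikit-learn's MLPClassifier and a simple train/test split; the hyperparameter values are illustrative only and you are expected to experiment with them:

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hold out part of the data so the comparison is not based only on training accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A single hidden layer of 25 logistic units mirrors the network used above;
# alpha plays a role similar to the regularization parameter lambda.
clf = MLPClassifier(hidden_layer_sizes=(25,), activation='logistic',
                    alpha=1.0, max_iter=400, random_state=0)
clf.fit(X_train, y_train)

print('scikit-learn training accuracy: {:.2f}%'.format(100 * clf.score(X_train, y_train)))
print('scikit-learn test accuracy: {:.2f}%'.format(100 * clf.score(X_test, y_test)))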
In [ ]:

In [ ]:

In [ ]: