Problem Set 1¶
To run and solve this assignment, one must have a working IPython Notebook installation. The easiest way to set it up for both Windows and Linux is to install Anaconda. Then save this file to your computer, run Anaconda and choose this file in Anaconda’s file explorer. Use Python 3 version. Below statements assume that you have already followed these instructions. If you are new to Python or its scientific library, Numpy, there are some nice tutorials here and here.
Copyright By PowCoder代写 加微信 powcoder
To run code in a cell or to render Markdown+LaTeX press Ctr+Enter or [>|](like “play”) button above. To edit any code or text cell [double]click on its content. To change cell type, choose “Markdown” or “Code” in the drop-down menu above.
If a certain output is given for some cells, that means that you are expected to get similar results in order to receive full points (small deviations are fine). For some parts we have already written the code for you. You should read it closely and understand what it does.
Total: 85 points.
[10pts] Problem 0: Numpy¶
Modify the cell below to return a 5×5 matrix of ones. Put some code there and press Ctrl+Enter to execute contents of the cell. You should see something like the output below. [1] [2]
import numpy as np
import matplotlib.pyplot as plt
raise NotImplementedError(“Replace this raise statement with the code ”
“that prints 5×5 matrix of ones”)
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
Vectorizing your code is very important to get results in a reasonable time. Let A be a 10×10 matrix and x be a 10-element column vector. Your friend writes the following code. How would you vectorize this code to run without any for loops? Compare execution speed for different values of n with %timeit.
def compute_something(A, x):
v = np.zeros((n, 1))
for i in range(n):
for j in range(n):
v[i] += A[i, j] * x[j]
A = np.random.rand(n, n)
x = np.random.rand(n, 1)
print(compute_something(A, x))
[[2.49571008]
[2.61317691]
[1.99759323]
[2.10646689]
[2.58667677]
[2.29693844]
[2.22886351]
[3.19794503]
[3.02237092]
[1.81842383]]
def vectorized(A, x):
raise NotImplementedError(‘Put your vectorized code here!’)
print(vectorized(A, x))
assert np.max(abs(vectorized(A, x) – compute_something(A, x))) < 1e-3
[[2.49571008]
[2.61317691]
[1.99759323]
[2.10646689]
[2.58667677]
[2.29693844]
[2.22886351]
[3.19794503]
[3.02237092]
[1.81842383]]
for n in [5, 10, 100, 500]:
A = np.random.rand(n, n)
x = np.random.rand(n, 1)
%timeit -n 5 compute_something(A, x)
%timeit -n 5 vectorized(A, x)
print('---')
197 µs ± 43.1 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
4 µs ± 2.33 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
715 µs ± 53.1 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
3.78 µs ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
66.4 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
The slowest run took 6292.24 times longer than the fastest. This could mean that an intermediate result is being cached.
5.76 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
1.66 s ± 9.49 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
The slowest run took 199.17 times longer than the fastest. This could mean that an intermediate result is being cached.
1.32 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
[30pts] Problem 1: Gradient Descent and Cost Functions¶
With each step of gradient descent, your parameter $\theta_j$ comes closer to the local minimum of the cost $J(θ)$.
Compute the partial derivative of the regression loss function
$$ J(\theta) = \frac{1}{2m} \sum_{i = 1}^{m} \big(h(x^{(i)}; \theta) - y^{(i)}\big)^2$$
with respect to the parameter $\theta_j$, where the hypothesis $h(x;\theta)$ is given by the linear model
$$ h(x;\theta) = \theta^T x$$
Please show all your steps
[double click here to add a solution]
Write a mathematical formulation of how gradient descent will adjust $\theta_j$ values to minimize the cost $J(θ)$. Then, describe each term in the formulation.
[double click here to add a solution]
If $x^{(i)} \in R^4$, and $m=3$, draw the tensors $x$, $\theta$, $h$, in $ h(x;\theta) = \theta^T x$, together with the ground-truth tensor $y$, specifying the dimensions of each.
[double click here to add a solution]
How do we know that gradient descent has converged?
[double click here to add a solution]
Write an L2 regularized form of $$ J(\theta) = \frac{1}{2m} \sum_{i = 1}^{m} \big(h(x^{(i)}; \theta) - y^{(i)}\big)^2,$$ explain the additional term(s), and explain why the additional term(s) achieve regularization (i.e. help avoid overfitting).
[double click here to add a solution]
1.6 [2pt] Briefly, explain how the computational speed of backpropagation would be affected if it did not include a forward pass
[double click here to add a solution]
1.7 [6pt] Suppose we have a classification problem with inputs $x$ and corresponding labels $y$, where $y^{(i)} \in \{-1,+1\}$. We would like to learn a classifier that computes a linear function of the input $h=\theta^T x$, and predicts $+1$ if $h(x^{(i)}) \ge 0$, or $-1$ otherwise.
A) What can be said about the correctness of the classifier’s prediction if $y^{(i)}h(x^{(i)})>0$?
B) Is $yh(x)$ a good loss function to minimize? Why?
[double click here to add a solution]
[35pts] Problem 2: Logistic Regression¶
In this part of the exercise, you will build a logistic regression model to predict whether a student
gets admitted into a university.
Suppose that you are the administrator of a university department and you want to determine
each applicant’s chance of admission based on their results on two exams. You have historical
data from previous applicants in ex2data1.txt that you can use as a training set for logistic regression. For each
training example, you have the applicant’s scores on two exams and the admissions decision.
Your task is to build a classification model that estimates an applicant’s probability of admission based on the scores from those two exams. This outline and code framework will guide you through the exercise.
2.1 Implementation
import sys
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
print(‘Tested with:’)
print(‘Python’, sys.version)
print({x.__name__: x.__version__ for x in [np, matplotlib]})
Tested with:
Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
{‘numpy’: ‘1.16.5’, ‘matplotlib’: ‘3.1.1’}
2.1.1 Visualizing the data
Before starting to implement any learning algorithm, it is always good to visualize the data if possible. This first part of the code will load the data and display it on a 2-dimensional plot by calling the function plotData. The axes are the two exam scores, and the positive and negative examples are shown with different markers.
def read_classification_csv_data(fn, add_ones=False):
# read comma separated data
data = np.loadtxt(fn, delimiter=’,’)
X_, y_ = data[:, :-1], data[:, -1, None] # a fast way to keep last dim
print(X_.shape, X_.min(), X_.max(), X_.dtype)
print(y_.shape, y_.min(), y_.max(), y_.dtype)
# note that y is float!
# insert the column of 1’s into the “X” matrix (for bias)
X = np.insert(X_, X_.shape[1], 1, axis=1) if add_ones else X_
y = y_.astype(np.int32)
return X, y
X_data, y_data = read_classification_csv_data(‘ex2data1.txt’, add_ones=True)
print(X_data.shape, X_data.min(), X_data.max(), X_data.dtype)
print(y_data.shape, y_data.min(), y_data.max(), y_data.dtype)
(100, 2) 30.05882244669796 99.82785779692128 float64
(100, 1) 0.0 1.0 float64
(100, 3) 1.0 99.82785779692128 float64
(100, 1) 0 1 int32
# how does the *X[y.ravel()==1, :2].T trick work?
# https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists
def plot_data(X, y, labels, markers, xlabel, ylabel, figsize=(10, 6), ax=None):
if figsize is not None:
plt.figure(figsize=figsize)
ax = ax or plt.gca()
for label_id, (label, marker) in enumerate(zip(labels, markers)):
ax.plot(*X[y.ravel()==label_id, :2].T, marker, label=label)
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
plt.legend()
ax.grid(True)
student_plotting_spec = {
‘X’: X_data,
‘y’: y_data,
‘xlabel’: ‘Exam 1 score’,
‘ylabel’: ‘Exam 2 score’,
‘labels’: [‘Not admitted’, ‘Admitted’],
‘markers’: [‘yo’, ‘k+’],
‘figsize’: (10, 6)
plot_data(**student_plotting_spec)
plt.show()
2.1.2 [5pts] Sigmoid function
Before you start with the actual cost function, recall that the logistic regression hypothesis is defined as:
$h_\theta(x) = g(\theta^Tx)$
where function g is the sigmoid function. The sigmoid function is defined as:
$g(z) = \dfrac{1}{1+e^{-z}}$
Your first step is to implement/find a sigmoid function so it can be called by the rest of your program. Your code should also work with vectors and matrices. For a matrix, your function should perform the sigmoid function on every element.
When you are finished, (a) plot the sigmoid function, and (b) test the function with a scalar, a vector, and a matrix. For scalar large positive values of x, the sigmoid should be close to 1, while for scalar large negative values, the sigmoid should be close to 0. Evaluating sigmoid(0) should give you exactly 0.5.
# check out scipy.special for great variaty of vectorized functions
# remember that sigmoid is the inverse of logit function
# maybe worth checking out scipy.special.logit first
sigmoid = raise NotImplementedError(“Replace this raise statement with the sigmoid function from scipy.special”)
def check_that_sigmoid_f(f):
x_test = np.linspace(-10, 10, 50)
sigm_test = f(x_test)
plt.plot(x_test, sigm_test)
plt.title(“Sigmoid function”)
plt.grid(True)
plt.show()
check_that_sigmoid_f(sigmoid)
2.1.3 [15pts] Cost function and gradient
Now you will implement the cost function and gradient for logistic regression. Complete the code
in the functions hyposesis_function and binary_logistic_loss below to return the value of the hypothesis function and the cost, respectively. Recall that the cost function in logistic regression is
$J(\theta) \ = \ \frac{1}{m} \ \sum_{i=1}^{m} \ [ \ -y^{(i)} log(h_\theta(x^{(i)})) \ – \ (1 – y^{(i)})log(1-h_\theta(x^{(i)})) \ ]$
and the gradient of the cost is a vector of the same length as $\theta$ where the $j^{th}$ element (for $j = 0, 1,…,n$) is defined as follows:
$\frac{\partial J(\theta)}{\partial \theta_{j}} \ = \ \frac{1}{m} \ \sum_{i=1}^{m} \ (h_\theta(x^{(i)})-y^{(i)}) x_j^{(i)}$
where $m$ is the number of points and $n$ is the number of features. Note that while this gradient looks identical to the linear regression gradient, the formula is
actually different because linear and logistic regression have different definitions of $h_\theta(x)$.
What should be the value of the loss for $\theta = \bar 0$ regardless of input? Why? Make sure your code also outputs this value.
# we are trying to fit a function that would return a
# “probability of ”
# hyposesis_function describes parametric family of functions that we are
# going to pick our “best fitting function” from. It is parameterized by
# real-valued vector theta, i.e. we are going to pick
# h_best = argmin_{h \in H} logistic_loss_h(x, y, h)
# but because there exist a bijection between theta’s and h’s it is
# eqvivalent to choosing
# theta_best = argmin_{theta \in H} logistic_loss_theta(x, y, theta)
def hyposesis_function(x, theta):
raise NotImplementedError(‘Replace this with your implementation’)
# negative log likelihood of observing sequence of integer
# y’s given probabilities y_pred’s of each Bernoulli trial
# recommentation: convert both variables to floats
# to avoid scenarios like -1*y != -y for uint8
# use np.mean and broadcasting
def binary_logistic_loss(y, y_pred):
assert y_pred.shape == y.shape
y, y_pred = y.astype(np.float64), y_pred.astype(np.float64)
# When y_pred = 0, log(0) = -inf,
# we could add a small constant to avoid this case
CONSTANT = 0.000001
y_pred = np.clip(y_pred, 0+CONSTANT, 1-CONSTANT)
raise NotImplementedError(‘Calculate the log likelihoods’)
def logistic_loss_theta_grad(x, y, h, theta):
Arguments (np arrays of shape):
x : [m, n] ground truth data
y : [m, 1] ground truth prediction
h : [m, n] -> [m, 1] our guess for a prediction function
# reshape theta: n by 1
theta = theta.reshape((-1,1))
y_pred = h(x, theta)
point_wise_grads = (y_pred – y)*x
grad = np.mean(point_wise_grads, axis=0)[:, None]
assert grad.shape == theta.shape
return grad.ravel()
def logistic_loss_theta(x, y, h, theta):
# reshape theta: n by 1
theta = theta.reshape((-1,1))
return binary_logistic_loss(y, h(x, theta))
# Check that with theta as zeros, cost is about 0.693:
theta_init = np.zeros((X_data.shape[1], 1))
print(logistic_loss_theta(X_data, y_data, hyposesis_function, theta_init))
print(logistic_loss_theta_grad(X_data, y_data, hyposesis_function, theta_init))
0.6931471805599453
[-12.00921659 -11.26284221 -0.1 ]
2.1.4 Learning parameters using scipy.optimize
Instead of taking gradient descent steps, use a scipy.optimize built-in function called scipy.optimize.minimize. In this case, we will use the conjugate gradient algorithm.
The final $\theta$ value will then be used to plot the
decision boundary on the training data, as seen in the figure below.
import scipy.optimize
from functools import partial
def optimize(theta_init, loss, loss_grad, max_iter=10000, print_every=1000, optimizer_fn=None, show=False):
theta = theta_init.copy()
opt_args = {‘x0’: theta_init, ‘fun’: loss, ‘jac’: loss_grad, ‘options’: {‘maxiter’: max_iter}}
loss_curve = []
def scipy_callback(theta):
f_value = loss(theta)
loss_curve.append(f_value)
if optimizer_fn is None:
optimizer_fn = partial(scipy.optimize.minimize, method=’CG’, callback=scipy_callback)
opt_result = optimizer_fn(**opt_args)
plt.plot(loss_curve)
plt.show()
return opt_result[‘x’].reshape((-1, 1)), opt_result[‘fun’]
theta_init = np.zeros((3, 1))
loss = partial(logistic_loss_theta, X_data, y_data, hyposesis_function)
loss_grad = partial(logistic_loss_theta_grad, X_data, y_data, hyposesis_function)
theta, best_cost = optimize(theta_init, loss, loss_grad, show=True)
print(best_cost)
0.2034977174052572
# Plotting the decision boundary: two points, draw a line between
# Decision boundary occurs when h = 0, or when
# theta_0*x1 + theta_1*x2 + theta_2 = 0
# y=mx+b is replaced by x2 = (-1/theta1)(theta2 + theta0*x1)
line_xs = np.array([np.min(X_data[:,0]), np.max(X_data[:,0])])
line_ys = (-1./theta[1])*(theta[2] + theta[0]*line_xs)
plot_data(**student_plotting_spec)
plt.plot(line_xs, line_ys, ‘b-‘, lw=10, alpha=0.2, label=’Decision Boundary’)
plt.legend()
plt.show()
2.1.5 [15pts] Evaluating logistic regression
After learning the parameters, you can use the model to predict whether a particular student will
be admitted.
(a) [5 pts] Show that for a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should
expect to see an admission probability of 0.776.
Another way to evaluate the quality of the parameters we have found is to see how well the
learned model predicts on our training set.
(b) [10 pts] In this part, your task is to complete the code in
makePrediction. The predict function will produce “1” or “0” predictions given a dataset and a learned
parameter vector $\theta$. After you have completed the code, the script below will proceed to report the
training accuracy of your classifier by computing the percentage of examples it got correct. You
should also see a Training Accuracy of 89.0.
# For a student with an Exam 1 score of 45 and an Exam 2 score of 85,
# you should expect to see an admission probability of 0.776.
check_data = np.array([[45., 85., 1]])
print(check_data.shape)
print(hyposesis_function(check_data, theta))
[[0.77637608]]
# use hyposesis function and broadcast compare operator
def predict(x, theta):
raise NotImplementedError(‘Replace this with your implementation’)
def accuracy(x, y, theta):
raise NotImplementedError(‘Replace this with your implementation’)
print(accuracy(X_data, y_data, theta))
[10pts] Problem 3: Simple Regularization Methods¶
In learning neural networks, aside from minimizing a loss function $\mathcal{L}(\theta)$ with respect to the network parameters $\theta$, we usually explicitly or implicitly add some regularization term to reduce overfitting. A simple and popular regularization strategy is to penalize some norm of $\theta$.
[5pts] Q3.1: L2 regularization¶
We can penalize the L2 norm of $\theta$: we modify our objective function to be $\mathcal{L}(\theta) + \lambda \|\theta\|^2$ where $\lambda$ is the weight of regularization.
We will minimize this objective using gradient descent with step size $\eta$.
Derive the update rule: at time $t+1$, express the new parameters $\theta_{t+1}$ in terms of the old parameters $\theta_t$, the gradient $g_t=\frac{\partial \mathcal{L}}{\partial \theta_t}$, $\eta$, and $\lambda$.
[Double click here to add your answer]
[5pts] Q3.2: L1 regularization¶
Now let’s consider L1 regularization: our objective in this case is $\mathcal{L}(\theta) + \lambda \|\theta\|_1$. Derive the update rule.
(Technically this becomes Sub-Gradient Descent since the L1 norm is not differentiable at 0. But practically it is usually not an issue.)
[Double click here to add your answer]
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com