University of Toronto Mississauga Department of Mathematical and Computational Sciences CSC311 — Introduction to Machine Learning, Fall 2020
Midterm Test: Questions and Solutions
Open book. 60 minutes. 5 pages. 4 questions. 70 points. Write all answers on blank paper. When you are done, take scans/photos of your answers, combine them into a single pdf file, and upload them to Markus. Clear, concise answers will receive more marks than long, rambling ones. Good luck!
I don’t know policy: If you do not know the answer to a question (or part), and you write “I don’t know”, you will receive 20% of the marks of that question (or part). If you just leave a question blank with no such statement, you will get 0 marks for that question.
1. (20 points total) True or False.
For each of the following statements, state whether it is true or false without giving an
explanation (2 points each):
(a) When used for classification, linear regression is more robust and less sensitive to outliers than logistic regression. FALSE
(b) In classification, we try to learn a mapping from input points to continuous target values. FALSE
(c) If the learning rate is too large, gradient descent can diverge. TRUE
(d) In K nearest-neighbours, a good value of K can be determined by using validation data. TRUE
(e) If a function underfits the data, then both the training and testing error are too large. TRUE
(f) A multi-class classifier for K classes partitions the input space into K regions. TRUE
(g) When training a neural network, gradients are computed using forward propagation, which starts at the input and works forwards towards the output. FALSE
(h) If two classes are linearly separable, then there may be many linear decision boundaries that separate the positive inputs from the negative inputs. TRUE
(i) Linear least-squares regression is equivalent to modelling the data as a linear function plus Gaussian noise and using maximum likelihood estimation to fit this model to the training data. TRUE
(j) Unlike stochastic gradient descent, batch gradient descent does not follow the direction of steepest descent and converges sooner. FALSE
2. (10 points) Hyperparameters.
We plan to use validation data to determine the optimal number of hidden units to use for a neural network with one hidden layer. On a single pair of axes, draw two curves: (1) a plot of training error vs. the number of hidden units, and (2) a plot of test error vs. the number of hidden units. On the graph, identify areas of overfitting, areas of underfitting, and the optimal number of hidden units.
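As an informal illustration only (not the expected hand-drawn answer), the sketch below plots hypothetical error values chosen purely to show the typical shapes: training error keeps decreasing as hidden units are added, while test error first decreases (the underfitting region) and then increases (the overfitting region); the optimal number of hidden units is where the test-error curve is lowest.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical, illustrative error values -- not measurements from a real network.
hidden_units = np.arange(1, 21)
train_error = 1.0 / hidden_units                                  # decreases with model capacity
test_error = 1.0 / hidden_units + 0.001 * (hidden_units - 8)**2   # U-shaped: underfit, then overfit

plt.plot(hidden_units, train_error, label="training error")
plt.plot(hidden_units, test_error, label="test error")
plt.axvline(hidden_units[np.argmin(test_error)], linestyle="--",
            label="optimal number of hidden units")
plt.xlabel("number of hidden units")
plt.ylabel("error")
plt.legend()
plt.show()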
3. (20 points total) Proofs.
(a) (5 points) Let A be a matrix, and let $\delta_{ij}$ be the Kronecker delta, where i and j are integers. That is, $\delta_{ii} = 1$ and $\delta_{ij} = 0$ if $i \neq j$. Prove that $\sum_{nm} A_{nm}\,\delta_{mi} = \sum_n A_{ni}$ for all i. Here, $\sum_{nm}$ is a double sum over all combinations of values of m and n. Justify each step of your proof. Hint: In general, $\sum_m y_m = y_i + \sum_{m \neq i} y_m$ for any indexed value $y_m$.
ANSWER:
\begin{align*}
\sum_{nm} A_{nm}\,\delta_{mi}
  &= \sum_n \sum_m A_{nm}\,\delta_{mi} && \text{by iterating the sums} \\
  &= \sum_n \Big[\, A_{ni}\,\delta_{ii} + \sum_{m \neq i} A_{nm}\,\delta_{mi} \Big] && \text{by the hint, using } A_{nm} \text{ in place of } y_m \\
  &= \sum_n \Big[\, A_{ni} \times 1 + \sum_{m \neq i} A_{nm} \times 0 \Big] && \text{by the definition of the Kronecker delta} \\
  &= \sum_n A_{ni} && \text{trivially}
\end{align*}
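As an optional numerical sanity check of part (a) (not part of the required proof), the identity can be verified with NumPy; the array sizes below are arbitrary.

import numpy as np

A = np.random.randn(4, 5)
delta = np.eye(5)                         # delta[m, i] is the Kronecker delta
lhs = np.einsum('nm,mi->i', A, delta)     # sum over n and m of A[n,m] * delta[m,i], for each i
rhs = A.sum(axis=0)                       # sum over n of A[n,i], for each i
assert np.allclose(lhs, rhs)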
(b) (15 points) Let x be a column vector and A be a square matrix of compatible dimensions. Prove that
$$\frac{\partial\, x^T A x}{\partial x} = Ax + A^T x$$
Justify each step of your proof. You may use part (a) in your answer. Hint: $\partial x_i / \partial x_j = \delta_{ij}$. ANSWER: First note that $x^T A x$ is a real number. In particular,
$$x^T A x = \sum_m x_m [Ax]_m = \sum_m x_m \sum_n A_{mn} x_n = \sum_{mn} x_m A_{mn} x_n \qquad (1)$$
where the first two equalities follow from the definition of matrix-vector multiplication. Second, note that $\partial\, x^T A x / \partial x$ is a vector, so we can prove the result component-wise. In particular, for component i,
\begin{align*}
\left[\frac{\partial\, x^T A x}{\partial x}\right]_i
  &= \frac{\partial\, x^T A x}{\partial x_i} && \text{by the definition of gradient} \\
  &= \frac{\partial}{\partial x_i} \sum_{mn} x_m A_{mn} x_n && \text{by Equation (1)} \\
  &= \sum_{mn} A_{mn} \frac{\partial (x_m x_n)}{\partial x_i} && \text{by the linearity of differentiation} \\
  &= \sum_{mn} A_{mn} \left[ x_n \frac{\partial x_m}{\partial x_i} + x_m \frac{\partial x_n}{\partial x_i} \right] && \text{by the product rule for derivatives} \\
  &= \sum_{mn} A_{mn} \left[ x_n \delta_{mi} + x_m \delta_{ni} \right] && \text{by the hint} \\
  &= \sum_{mn} A_{mn} x_n \delta_{mi} + \sum_{mn} A_{mn} x_m \delta_{ni} && \text{by basic algebra} \\
  &= \sum_n A_{in} x_n + \sum_m A_{mi} x_m && \text{by part (a)} \\
  &= \sum_n A_{in} x_n + \sum_m [A^T]_{im} x_m && \text{by the definition of matrix transpose} \\
  &= [Ax]_i + [A^T x]_i && \text{by the definition of matrix-vector multiplication} \\
  &= [Ax + A^T x]_i && \text{by the definition of vector addition}
\end{align*}
It follows that $\partial\, x^T A x / \partial x = Ax + A^T x$, since all their components are equal.
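As an optional numerical check of this result (again, not part of the proof), the closed-form gradient Ax + A^T x can be compared against a finite-difference approximation; the dimensions below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
x = rng.standard_normal(5)
f = lambda v: v @ A @ v                   # the scalar v^T A v
grad_closed = A @ x + A.T @ x             # the gradient derived above
eps = 1e-6
grad_fd = np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(5)])
assert np.allclose(grad_closed, grad_fd, atol=1e-4)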
4. (20 points total) Numpy Programming
(a) (5 points) Batch gradient descent.
Write a Numpy function gd(X,T,K,lrate) that computes the optimal weight vector, w, for a linear classifier by doing K iterations of batch gradient descent. Here, lrate is the learning rate, and X and T form the training data, where X is the input data matrix (including a column of 1’s), and T is the vector of target values. You should initialize the weight vector by using the function randn in numpy.random. You may assume that you have a library called mylib that contains a function grad(X,T,w) that computes the gradient of the loss function for the linear classifier.
Use vectorized code wherever possible. Do not forget to include all required import statements in your code. Your function may contain 1 loop, and no nested loops. You can easily write the function in 10 lines of highly readable code, including import statements.
import numpy as np
import numpy.random as rnd
import mylib
def gd(X,T,K,lrate):
    N,M = np.shape(X)
    w = rnd.randn(M)             # random initial weight vector
    for k in range(K):
        g = mylib.grad(X,T,w)    # gradient of the loss at the current w
        w = w - lrate*g          # batch gradient-descent update
    return w
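The library mylib is assumed by the question and its implementation is not specified here. As one hypothetical possibility (not the actual course library), its grad function could compute the gradient of a mean squared-error loss:

# Hypothetical stand-in for mylib.grad, assuming a mean squared-error loss;
# the real mylib provided by the question may differ.
def grad(X, T, w):
    # gradient of (1/N) * sum_n (x_n . w - t_n)^2 with respect to w
    N = X.shape[0]
    return (2.0 / N) * X.T @ (X @ w - T)

With such a grad available, a call like w = gd(X, T, 100, 0.01) would perform 100 batch gradient-descent updates and return the learned weight vector.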
(b) (15 points) Stochastic gradient descent.
Write a Numpy function sgd1(X,T,w,lrate) that updates the weight vector for a linear classifier by doing one epoch of stochastic gradient descent using mini-batches of size 100. Here, w is the current value of the weight vector. As in part (a), lrate is the learning rate, and X and T form the training data, where X is the input data matrix (including a column of 1's), and T is the vector of target values. You may again assume that you have a library called mylib that contains a function grad(X,T,w) that computes the gradient of the loss function for the linear classifier.
Using vectorized code wherever possible, your function should perform the following steps:
• First reorder the training data randomly. You may assume that you can do this with the statement X,T = mylib.shuffle(X,T).
• Next, divide the training data into mini batches of size 100. The first mini batch consists of the first 100 training points, the second mini batch consists of the second 100 training points, the third mini batch consists of the third 100 training points, etc. Note that both X and T must be divided into mini batches. You may assume that the number of training points is a multiple of 100.
• Update the weight vector one mini-batch at a time. That is, first update the weight vector using mini-batch 1, then update the weight vector again using mini-batch 2, then again using mini-batch 3, etc. Each mini batch should be used exactly once.
• Finally, return the updated weight vector.
Do not forget to include all required import statements in your code. Your function may contain 1 loop, and no nested loops. You can easily write the function in 13 lines of highly readable code, including import statements. Optional: use numpy.reshape to form the mini batches.
import numpy as np
import mylib
def sgd1(X,T,w,lrate):
    X,T = mylib.shuffle(X,T)          # randomly reorder the training data
    N,M = np.shape(X)
    J = 100                           # mini-batch size
    I = N//J                          # number of mini-batches (integer division)
    X = np.reshape(X,(I,J,M))         # X[i] is the input matrix of mini-batch i
    T = np.reshape(T,(I,J))           # T[i] is the target vector of mini-batch i
    for i in range(I):
        g = mylib.grad(X[i],T[i],w)   # gradient on mini-batch i
        w = w - lrate*g               # update once per mini-batch
    return w
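As a usage sketch (with synthetic data invented purely for illustration, and assuming mylib provides shuffle and grad as described above), repeated calls to sgd1 run multiple epochs:

# Hypothetical synthetic data; N must be a multiple of the mini-batch size 100.
N, M = 1000, 5
X = np.hstack([np.ones((N, 1)), np.random.randn(N, M - 1)])   # include a column of 1's
T = np.random.randn(N)
w = np.random.randn(M)
for epoch in range(10):               # each call to sgd1 performs one epoch
    w = sgd1(X, T, w, lrate=0.01)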