DECEMBER 2021 FINAL EXAMINATION CSC311H5F

NAME (PRINT): STUDENT #:
Last/Surname First /Given Name
SIGNATURE:
I don’t know policy: If you do not know the answer to a question (or part), and you write “I don’t know”, you will receive 20% of the marks of that question (or part). If you just leave a question blank with no such statement, you get 0 marks for that question.

UNIVERSITY OF TORONTO MISSISSAUGA DECEMBER 2021 FINAL EXAMINATION CSC311H5F
Introduction to Machine Learning
Duration – 2 hours
Aids: Open book (Textbook, notes, computer files)
The University of Toronto Mississauga and you, as a student, share a commitment to academic integrity. You are reminded that you may be charged with an academic offence for possessing any unauthorized aids during the writing of an exam. Clear, sealable, plastic bags have been provided for all electronic devices with storage, including but not limited to: cell phones, smart watches, SMART devices, tablets, laptops, and calculators. Please turn off all devices, seal them in the bag provided, and place the bag under your desk for the duration of the examination. You will not be able to touch the bag or its contents until the exam is over.
If, during an exam, any of these items are found on your person or in the area of your desk other than in the clear, sealable, plastic bag, you may be charged with an academic offence. A typical penalty for an academic offence may cause you to fail the course.
Please note, once this exam has begun, you CANNOT re-write it.

1. (26 points total) True or False.
For each of the following statements, state whether it is true or false, without giving an explanation (2 points each):
(a) Training a nearest-neighbour classifier is faster than training a decision tree.
(b) Nearest neighbours is better than decision trees at handling missing attribute values.
(c) The number of neighbours used in K nearest neighbours is a form of regularization.
(d) Deep decision trees tend to have greater test accuracy than shallow decision trees.
(e) Compared to non-linear autoencoders, principal components analysis can achieve greater dimensionality reduction with less information loss.
(f) Reinforcement learning is ideally suited to classifying images of hand-written digits.
(g) Bagging increases the test accuracy of a classifier by increasing the variance of predictions.
(h) Starting at the input, a convolutional neural network often consists of a series of convolutional and pooling layers followed by a number of fully-connected layers.
(i) In a convolutional neural network, each convolutional layer consists of a bank of global feature detectors.
(j) Convolutional neural networks were originally developed for speech recognition.
(k) In Bayes rule, the prior probability is computed from the likelihood and the posterior probability.
(l) In a binary decision tree for continuous vectors, each internal node is associated with an attribute and a threshold.
(m) Bayesian and frequentist approaches tend to agree with each other when there is lots of data.

2. Reinforcement Learning. (10 points)
Consider the familiar grid-world navigation task shown above. The agent can move in any of four directions (left/right/up/down) one cell at a time. All the state transitions are deterministic. If the agent hits the blob at B3, it remains in the same state. A1 and C1 are both goal states. The reward is 40 for entering state A1, and 80 for entering state C1. All other rewards are 0.
When the learning rate is 1, the rule for updating the estimated Q values simplifies to the following:
Q(S_t, A_t) = R_t + γ max_{a′} Q(S_{t+1}, a′)

Consider applying this update rule when all the Q values are initialized to zero, and γ = 0.5. Suppose the agent executes each of the following four episodes once, one after the other:
(a) A4, A3, A2, B2, C2, C1.
(b) C4, B4, A4, A3, A2, A1.
(c) A4, B4, C4, C3, C2, B2, A2, A1.
(d) B4, A4, A3, A2, B2, C2, C1.
Redraw the grid-world figure on blank paper, and indicate on the figure the Q values after all four episodes have been executed. Within a grid cell, use arrows (← → ↑ ↓) to indicate which action a Q value refers to. Q values of 0 can be left unspecified.
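For reference, here is a minimal Python sketch of the simplified update above, which may help in checking the hand computation; the dictionary encoding of states and actions and the name q_update are illustrative assumptions, not part of the question.

def q_update(Q, s, a, r, s_next, gamma=0.5, actions=("left", "right", "up", "down")):
    # Simplified Q-learning update with learning rate 1:
    #   Q(S_t, A_t) = R_t + gamma * max_{a'} Q(S_{t+1}, a').
    # Q is a dict mapping (state, action) pairs to values; missing pairs
    # default to 0, matching the zero initialization in the question.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = r + gamma * best_next
    return Q

Applying this update once per transition, in the order the transitions occur within each episode, produces the Q values the question asks for.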

3. Neural Networks (40 points total)
We want to train a neural network with one hidden layer to do binary classification. Here are
the equations for the network:
y = σ(v^T h + c)        (1)
h = σ(W x + b)        (2)
L = −t log y − (1 − t) log(1 − y)        (3)
Here, x is the input vector, h is a vector of hidden values, y is the output of the neural net, t is the target value (0 or 1), and L is the loss function. In addition, W is a matrix of weights, v is a vector of weights, b is a vector of bias terms, and c is a single bias term. Finally, σ(z) = 1/(1 + exp(−z)) is the logistic function. The above equations are for a single data point, (x, t). Given a set of data points, (x^(1), t^(1)), ..., (x^(N), t^(N)), we use the following equation for total cost:

J = Σ_n L^(n)        (4)

where L^(n) is the loss on input (x^(n), t^(n)).
(a) (1 point) What is the name for the loss function, L?
(b) (4 points) Suppose the neural network has I hidden units and K input units. Write down the dimensions of W , b, v and c.
(c) (3 points) Prove the following equation:
dσ(z)/dz = σ(z)[1 − σ(z)]
(d) (5 points) Use part (c) to prove the following equation: ∂L/∂h = (y − t) v
(e) (5 points) Use parts (c) and (d) to prove the following equation:
∂L/∂W_{ik} = (y − t) v_i h_i (1 − h_i) x_k
(f) (6 points) Let X be a data matrix whose nth row is x(n), and let H be a matrix whose nth row is the vector of hidden values on input x(n). Likewise, let Y and T be vectors whose nth elements are the values of y and t, respectively, on input x(n). Write down (but do not prove) equations for the quantities below. The equations should be in terms of X, H, Y and T and should be simple to translate into Python code in your functions in parts (g) to (j), below. (2 points each)
i. A matrix equation for H.
ii. A matrix equation for Y.
iii. An equation for ∂J/∂W using summation notation, not matrix multiplication.

In the rest of this question, you will use the equations in part (f) to define Python functions for the neural net above. The code should be vectorized and efficient, should not use any loops, and should avoid unnecessary operations. You should import numpy, but do not import any other packages. The functions in any part may use the functions specified in the parts above, even if you did not define them.
(g) (5 points) Define a Python function ff(X,W,v,b,c) that implements the feed-forward phase of the neural net. You should use matrix multiplication when computing Y and H, and the function should return both Y and H. Note that numpy does not have a sigmoid function.
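A minimal sketch of one possible implementation, assuming equations (1)-(2) with x^(n) stored as row n of X (so W has shape I×K, b and v have length I); the helper name sigmoid is an assumption, not something required by the question.

import numpy as np

def sigmoid(Z):
    # Elementwise logistic function; numpy has no built-in sigmoid.
    return 1.0 / (1.0 + np.exp(-Z))

def ff(X, W, v, b, c):
    # Equation (2) for all rows at once: H[n] = sigma(W x^(n) + b) = sigma(X W^T + b)[n].
    H = sigmoid(X @ W.T + b)
    # Equation (1) for all rows at once: Y[n] = sigma(v^T h^(n) + c).
    Y = sigmoid(H @ v + c)
    return Y, H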
(h) (2 points) Define a Python function pred(X,W,v,b,c) that computes and returns pre- dictions for each input vector in X. It should return a binary vector whose nth element is the neural net’s prediction on input vector x(n).
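A minimal sketch, assuming ff from part (g) and the usual 0.5 decision threshold.

def pred(X, W, v, b, c):
    # Predict class 1 when the network output exceeds 0.5, else class 0.
    Y, _ = ff(X, W, v, b, c)
    return (Y > 0.5).astype(int)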
(i) (2 points) Define a Python function acc(X,T,W,v,b,c) that computes and returns the accuracy of the neural net’s predictions on the data in X and T. Recall that accuracy is the proportion of predictions that are correct.
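A minimal sketch, assuming pred from part (h).

def acc(X, T, W, v, b, c):
    # Accuracy = proportion of predictions that match the targets.
    return np.mean(pred(X, W, v, b, c) == T)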
(j) (7 points) Define a Python function gradJ(X,T,W,v,b,c) that computes and returns the gradient ∂J/∂W. Use broadcasting, not matrix multiplication or loops.
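A minimal sketch, assuming the gradient requested is ∂J/∂W and summing the result of part (e) over data points; broadcasting only, no loops or matrix multiplication.

def gradJ(X, T, W, v, b, c):
    # dJ/dW[i,k] = sum_n (y^(n) - t^(n)) v_i H[n,i] (1 - H[n,i]) X[n,k].
    Y, H = ff(X, W, v, b, c)
    D = (Y - T)[:, None] * v[None, :] * H * (1.0 - H)      # shape (N, I)
    return np.sum(D[:, :, None] * X[:, None, :], axis=0)   # shape (I, K)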

4. Clustering. (30 points total)
(a) Gaussian Mixture Models: theory. (10 points)
The expected complete log likelihood of a Gaussian mixture model is given by
ECL = Σ_n Σ_k r_k^(n) log[π_k N(x^(n) | μ_k, Σ_k)]        (5)

where r_k^(n) is the responsibility that cluster k takes for data point x^(n), and

N(x | μ, Σ) = exp[−(x − μ)^T Σ^{-1} (x − μ)/2] / √(2π |Σ|)        (6)

is the probability density of a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ. Recall that |Σ| is the determinant of Σ.
Prove that the M step of the EM algorithm maximizes the expected complete log likelihood. That is, prove that it is maximized by the following value:

μ̂_k = Σ_n r_k^(n) x^(n) / Σ_n r_k^(n)

Hint: you may use the fact that ∂(y^T A y)/∂y = 2Ay, where y is a column vector and A is a symmetric matrix. Recall that covariance matrices are symmetric.
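One possible route to this result (a sketch under the stated hint, not the official solution): only the quadratic term of log N(x^(n) | μ_k, Σ_k) depends on μ_k, so applying the hint with y = x^(n) − μ_k and A = Σ_k^{-1} gives

∂ECL/∂μ_k = Σ_n r_k^(n) Σ_k^{-1} (x^(n) − μ_k).

Setting this to zero and multiplying through by Σ_k gives Σ_n r_k^(n) x^(n) = (Σ_n r_k^(n)) μ_k, which rearranges to the stated value of μ̂_k; since Σ_k^{-1} is positive definite and the responsibilities are non-negative, the objective is concave in μ_k, so this stationary point is a maximum.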
(b) Gaussian Mixture Models: programming. (10 points)
In fitting a Gaussian mixture model to data, it is common to assume that each cluster in
the model has the same covariance matrix. In this case, the M step of the EM algorithm updates the covariance matrix as follows:
Σ_{ij} = Σ_n Σ_k R_{nk} (X_{ni} − μ_{ki})(X_{nj} − μ_{kj}) / N

(where the small Σ on the left is the covariance matrix, and the large Σ's are summation signs over n and k). Here, R is a matrix of responsibilities, X is a data matrix of input vectors, and μ is a matrix of cluster means. Each row of μ is a mean vector, each row of X is an input vector, and N is the number of input vectors.
Define a Python function cov(X,R,Mu) that computes and returns the updated covariance matrix, Σ, where Mu = μ is the matrix of mean vectors. Use broadcasting in your code, and do not use matrix multiplication. The code should be vectorized and efficient, should not use any loops, and should avoid unnecessary operations. You should import numpy, but do not import any other packages.
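A minimal sketch of one way to implement this, assuming X has shape (N, M), R has shape (N, K), and Mu has shape (K, M); broadcasting only, no loops or matrix multiplication.

import numpy as np

def cov(X, R, Mu):
    # Sigma[i,j] = sum_{n,k} R[n,k] (X[n,i] - Mu[k,i]) (X[n,j] - Mu[k,j]) / N.
    N = X.shape[0]
    D = X[:, None, :] - Mu[None, :, :]                    # shape (N, K, M)
    return np.sum(R[:, :, None, None] * D[:, :, :, None] * D[:, :, None, :],
                  axis=(0, 1)) / N                        # shape (M, M)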

(c) K means. (10 points)
Recall that in soft K-means, responsibilities are computed by a softmax function, as
r_k^(n) = exp(β z_k^(n)) / Σ_j exp(β z_j^(n))        (7)

where z_k^(n) = −‖μ_k − x^(n)‖². Here, x^(n) is a data point and μ_k is the current centre of cluster k. The cluster centres are then updated as follows:

μ_k = Σ_n r_k^(n) x^(n) / Σ_n r_k^(n)        (8)

Prove that as β → ∞, soft K-means closely approximates hard K-means. To keep things simple, you may assume that the values z_k^(n) are all different. Hint: consider the solution to question 2(x) in assignment 3.

5. Generative Probabilistic Models and Ensembles. (30 points total)
This question is concerned with making predictions using a naive Bayes classifier based on Quadratic Discriminant Analysis (QDA) for multi-class classification. For this purpose, recall that the likelihood of a multivariate Gaussian variable, x, is given by the following equation,
P(x) = N(x | μ, Σ) = exp[−(x − μ)^T Σ^{-1} (x − μ)/2] / √((2π)^M |Σ|)        (9)
where x is a column vector of dimension M, μ is the mean of the distribution, Σ is the covariance matrix, and |Σ| is the determinant of Σ. Also recall that in Assignment 3, you proved that if Σ is diagonal, then

(x − μ)^T Σ^{-1} (x − μ) = Σ_m (x_m − μ_m)² / σ_m²        (10)

where σ_m² is the mth diagonal element of Σ.
In naive Bayes, we assume that the features of x are conditionally independent given the class. In the Gaussian case, this means that the covariance matrix of each class is diagonal. Thus, we need to evaluate expression (10) for every vector, x^(n), in a data set, and every class, k. For this purpose, we can rewrite the equation as follows:
Z_{nk} = Σ_m (X_{nm} − μ_{km})² / σ_{km}²
Here, X is a data matrix in which row n stores the vector x^(n); μ is a matrix in which row k stores the mean vector for class k; and σ is a matrix in which row k stores the diagonal elements, σ_{km}², of Σ_k, the covariance matrix for class k.
(a) (10 points) Define a Python function prob(X,Mu,Sigma) that computes and returns the class likelihoods for every input vector in a data set, X. Here, X is the data matrix; Mu = μ, the matrix of mean vectors; and Sigma = σ, the matrix of diagonal covariance elements. Your function should return a matrix, P, where P_{nk} is the likelihood that x^(n) is in class k.
Use broadcasting in your code, and do not use matrix multiplication. The code should be vectorized and efficient, should not use any loops, and should avoid unnecessary operations. You should import numpy, but do not import any other packages. In particular, you should not use any built-in functions for computing probabilities, matrix inverses or determinants. Instead, you should use the fact that the determinant of a diagonal matrix is the product of its diagonal elements.
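A minimal sketch of one way to implement this, assuming Sigma[k, m] stores σ_{km}², the mth diagonal element of Σ_k; broadcasting only, with the determinant taken as the product of the diagonal elements.

import numpy as np

def prob(X, Mu, Sigma):
    # P[n,k] = N(x^(n) | mu_k, Sigma_k) for diagonal covariance matrices.
    M = X.shape[1]
    D = X[:, None, :] - Mu[None, :, :]                  # shape (N, K, M)
    Z = np.sum(D ** 2 / Sigma[None, :, :], axis=2)      # Z[n,k] as defined above
    det = np.prod(Sigma, axis=1)                        # |Sigma_k| = product of its diagonal
    return np.exp(-Z / 2.0) / np.sqrt((2.0 * np.pi) ** M * det)[None, :]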
(b) (5 points) Define a Python function predict(X,Mu,Sigma,Prior) that computes and returns the class prediction for each input vector in a data set, X. Here, X, Mu and Sigma are as in part (a). In addition, Prior is a vector whose kth element is the prior probability of class k. Your function should return a vector, Y, of integers, where Y_n is the predicted class, k, of input vector x^(n) according to Bayes rule. Moreover, you should compute Y without computing the posterior probabilities of the input vectors. You may use the function prob specified in part (a) even if you did not define it.
Use broadcasting in your code, and do not use matrix multiplication. The code should be vectorized and efficient, should not use any loops, and should avoid unnecessary operations. You should import numpy, but do not import any other packages.
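A minimal sketch, assuming prob from part (a); since the posterior is proportional to likelihood times prior, the argmax can be taken without normalizing.

def predict(X, Mu, Sigma, Prior):
    # Bayes rule without the normalizing constant: argmax_k P(x^(n) | k) * Prior[k].
    return np.argmax(prob(X, Mu, Sigma) * Prior[None, :], axis=1)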

(c) (4 points) Consider two QDA classifiers, the naive Bayes classifier above, and one based on unconstrained covariance matrices. If the input vectors, x, have M features, and if there are K classes, how many parameters does the naive Bayes classifier have? How many parameters does the unconstrained classifier have?
The remaining parts of this question refer to the two classifiers in part (c).
(d) (2 points) Which classifier is most likely to overfit a training set? Why?
(e) (5 points) Suppose we generate 1000 bootstrap samples of a training set, and use them to train 1000 versions of each classifier. Which classifier is likely to have the largest variance in its predictions on a test set? Under what condition is the difference in variance likely to be large? Justify your answers.
(f) (4 points) Which classifier is likely to see the largest increase in test accuracy due to bagging? Under what conditions is the difference likely to be large? Justify your answers.
Total marks = 136 END OF EXAM
