King’s College London
University of London
This paper is part of an examination of the College counting towards the award of a degree. Examinations are governed by the College Regulations under the authority of the Academic Board.
MSc Examination
7CCMFM18 Machine Learning. MOCK Exam
Summer 2020
Time Allowed: Two Hours
Full marks will be awarded for complete answers to four questions. If more than four questions are attempted, then only the best FOUR questions will count.
Within a given question, the relative weights of the different parts are indicated by a percentage figure.
You may consult lecture notes.
FOLLOW the instructions you have been given on how to upload your solutions.
© 2020 King’s College London
1. DISCLAIMER: In this MOCK exam, Questions 1, 2 and 3 have only parts a+b+c, Question 4 has parts a+b+c+d, and Question 5 has parts a+b only. The questions Q1-Q5 of the MOCK exam serve as indications of (i) the topic and (ii) the style of the respective questions Q1-Q5 of the actual exam. In the actual exam all questions Q1-Q5 have parts a+b+c+d, so be prepared that you may need longer to solve the actual exam than this one. Keep in mind that the exam will be open book, but it is better to prepare as if it were not open book, as time is scarce for looking up questions and solutions. The solution of Question Q5 b (i+ii) of the MOCK exam coincides with Problem 2 of Problem Set 3.
Question 1: Consider the feedforward neural network f ∈ Nr(I, d1, …, dr−1, O; σ1, …, σr) represented in the graph above.
a. Name all the hyperparameters of the network f, explain what they govern, and specify their values. What is Nr(I, d1, …, dr−1, O; σ1, …, σr) for the network in the graph above?
b. Specify the function as precisely as you can from the graph given above: (i) What are the components Li, i = 1, …, r, and the components σi, i = 1, …, r, called? (ii) How are the functions Li and σi defined? Specify in particular the respective domains and image spaces of each of the components for i = 1, …, r.
c. In the graph above there are three edges A, B, C highlighted. Which network parameters do the edges A, B, C correspond to?
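For orientation, a minimal Python sketch of how a network in Nr(I, d1, …, dr−1, O; σ1, …, σr) composes the maps Li and σi; the dimensions below are illustrative assumptions in place of those shown in the graph:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Assumed illustrative hyperparameters: depth r = 2, input dimension I = 3,
# one hidden layer of width d1 = 4, output dimension O = 1.
dims = [3, 4, 1]
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]
sigmas = [relu, lambda z: z]  # sigma_1 = ReLU, sigma_2 = Id

def f(x):
    # a^i = sigma_i(L_i(a^{i-1})) with the affine map L_i(a) = W^i a + b^i.
    a = np.asarray(x, dtype=float)
    for W, b, sigma in zip(Ws, bs, sigmas):
        a = sigma(W @ a + b)
    return a

print(f([1.0, -2.0, 0.5]))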
2. Recall that for D ⊂ Rn the class of functions h : D −→ Rm whose partial derivatives exist and are continuous up to order k on D for some k ∈ N ∪ {∞} is denoted by Ck(D, Rm). We say that the smoothness of a function h is Ck for some k ∈ N ∪ {∞}, if k is the highest number such that h ∈ Ck(D, Rm).
Recall also that we say that a function h : Rn −→ Rm is of at most linear growth if there exists a constant C > 0 such that ||h(x)|| ≤ C(1 + ||x||) for any x ∈ Rn.
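As a numerical illustration (a sketch only, not a proof), the growth bound can be spot-checked for ReLU on a grid of sample points with the assumed constant C = 1:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Spot-check |ReLU(x)| <= C (1 + |x|) with C = 1 on sampled points
# (a numerical illustration only, not a proof of linear growth).
xs = np.linspace(-10.0, 10.0, 201)
assert np.all(np.abs(relu(xs)) <= 1.0 * (1.0 + np.abs(xs)))
print("growth bound holds on all sampled points")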
a. Define the ReLU activation function. What is the regularity of this acti- vation function? Justify your answer. Name at least one of the strengths of this activation function.
b. Let the neural network f be of the form f ∈ N2(1, d1, 1; σ1, Id). Show that if σ1 = ReLU, then f is continuous and has at most linear growth.
c. Consider a neural network of the form N3(3, d1, d2, 1; σ1, σ2, Id), with d1 = d2 = 6 and σ1 = σ2 = ReLU. Show that a network f of this form can represent any linear function g : R3 −→ R, (x1, x2, x3) ↦ a1x1 + a2x2 + a3x3 for a1, a2, a3 ∈ R, and determine the weight matrices and bias vectors representing the linear function g.
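A standard identity that may be a useful starting point here is x = ReLU(x) − ReLU(−x); the following sketch checks it numerically:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Numerical check of x = ReLU(x) - ReLU(-x): a pair of ReLU units can
# pass a signed coordinate through a hidden layer unchanged.
xs = np.linspace(-5.0, 5.0, 101)
assert np.allclose(xs, relu(xs) - relu(-xs))
print("identity verified on all sampled points")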
3. Let K ⊂ RI be a compact set. Recall for any measurable function f : K → R the norms || · ||sup,K and || · ||Lp(K) with p > 0, defined as
||f||sup,K := sup_{x ∈ K} |f(x)| and ||f||Lp(K) := ( ∫K |f(x)|^p dx )^{1/p}.
Further, let g : R → R denote an activation function such that
• g is not a polynomial,
• g is bounded on any finite interval,
• the closure of the set of discontinuities of g has zero Lebesgue measure.
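For concreteness, both norms can be approximated numerically; a minimal sketch, assuming K = [0, 1], f(x) = sin(2πx) and p = 2 as an example, with the integral replaced by a Riemann sum:

import numpy as np

# Approximate ||f||sup,K and ||f||Lp(K) for f(x) = sin(2 pi x) on K = [0, 1]
# with p = 2, replacing the integral by a Riemann sum (illustrative only).
f = lambda x: np.sin(2.0 * np.pi * x)
xs = np.linspace(0.0, 1.0, 10001)
dx = xs[1] - xs[0]
p = 2.0

sup_norm = np.max(np.abs(f(xs)))
lp_norm = (np.sum(np.abs(f(xs)) ** p) * dx) ** (1.0 / p)
print(sup_norm, lp_norm)  # approximately 1.0 and 1/sqrt(2)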
a. Formulate the universal approximation property.
b. Consider a neural network described by f ∈ N2(1, d1, 1; σ1, Id).
(i) Give a formula expressing the network f as a function of d1 and σ1. (ii) Specify the form (as a function of d1) of all weight matrices and bias vectors involved in f.
c. Let us now consider the network f ∈ N2(1, d1, 1; σ1, Id) with the specification σ1 = ReLU. Determine appropriate weights and biases and the appropriate value for d1 such that f(x) = h(x), for all x ∈ R, where h(x) := ReLU(3x) − 2ReLU(2x) for x ∈ R. Summarize the parameters of the network by specifying their values in the form θ = (W1, W2; b1, b2).
Hint for the exam: How would your answer change for more general functions h if we assume that h is of a form that can be expressed as a linear combination of ReLU functions?
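For self-checking, a small Python harness that compares a candidate parametrisation θ = (W1, W2; b1, b2) against h on a grid; the parameter values below are one natural candidate with d1 = 2 and should be replaced by your own answer:

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

h = lambda x: relu(3.0 * x) - 2.0 * relu(2.0 * x)

# One natural candidate for theta = (W1, W2; b1, b2) with d1 = 2;
# replace these placeholders with your own answer before checking.
W1 = np.array([[3.0], [2.0]])   # shape (d1, 1)
b1 = np.array([0.0, 0.0])       # shape (d1,)
W2 = np.array([[1.0, -2.0]])    # shape (1, d1)
b2 = np.array([0.0])            # shape (1,)

def f(x):
    a1 = relu(W1 @ np.atleast_1d(x) + b1)
    return (W2 @ a1 + b2)[0]

xs = np.linspace(-3.0, 3.0, 61)
assert np.allclose([f(x) for x in xs], h(xs))
print("f matches h on all sampled points")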
4. Training of neural networks. Let fθ ∈ Nr(I, d1, …, dr−1, O) denote a feedforward neural network parametrised by θ ∈ Rn for some n ∈ N, and let (x1, y1), …, (xN, yN) be a dataset consisting of input-output pairs for some N ∈ N, where for i = 1, …, N the xi denote the input vectors and the yi denote the output vectors.
a. Explain what a loss function is and what it is used for in the training of neural networks. Specify the general form of a loss function corresponding to an input-output sample above.
b. (i) Recall the shortcoming of the squared loss function and the potential impact of that shortcoming on the training of the neural network.
(ii) What properties of the empirical distribution of the data do the absolute loss and the squared loss target?
(iii) Write down the definition of the Huber loss function given in class that can be applied to some ŷ, y ∈ R, and describe the advantages of this loss function over the squared loss and the absolute loss.
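For reference, a sketch of the three pointwise losses in Python; the Huber threshold δ and the factor 1/2 follow one common convention, which may differ from the exact definition given in class:

import numpy as np

def squared_loss(y_hat, y):
    return (y_hat - y) ** 2

def absolute_loss(y_hat, y):
    return np.abs(y_hat - y)

def huber_loss(y_hat, y, delta=1.0):
    # Quadratic near zero (smooth there), linear in the tails
    # (less sensitive to outliers than the squared loss).
    r = np.abs(y_hat - y)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

print(huber_loss(3.0, 0.0), huber_loss(0.5, 0.0))  # 2.5 and 0.125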
c. Recall the categorical cross-entropy loss function and explain what type of problems it is usually used for.
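A minimal sketch of the categorical cross-entropy for a one-hot label and a vector of predicted class probabilities:

import numpy as np

def categorical_cross_entropy(p_hat, y_one_hot, eps=1e-12):
    # -sum_k y_k log(p_hat_k); eps guards against log(0).
    return -np.sum(y_one_hot * np.log(p_hat + eps))

p_hat = np.array([0.7, 0.2, 0.1])  # predicted class probabilities
y = np.array([1.0, 0.0, 0.0])      # true class, one-hot encoded
print(categorical_cross_entropy(p_hat, y))  # -log(0.7), about 0.357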
d. Let F : Rd → R be differentiable. Recall that gradient descent for some step size η > 0 and an initial condition x0 ∈ Rd is defined as
xn := xn−1 − η∇F(xn−1), for n ∈ N. (1)
Let d = 2 and F(x) := ||x||^2 = x1^2 + x2^2, and let x0 ≠ (0, 0).
(i) Explain how equation (1) relates to the training of a feedforward neural network. In particular, describe the expressions F, x and η in the context of training a feedforward neural network.
(ii) Derive an expression for xn ∈ R2 for arbitrary n ∈ N \ {0} in terms of the initial value x0 ∈ R2.
(iii) Identify four different regimes of values for η > 0 that lead to different asymptotic behaviours of xn as n → ∞. For which of the regimes of η > 0 does xn converge as n → ∞?
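The iteration (1) is straightforward to experiment with numerically; a sketch for F(x) = ||x||^2, where ∇F(x) = 2x, trying a few sample step sizes (the chosen values of η are illustrative assumptions):

import numpy as np

def gradient_descent(x0, eta, n_steps):
    # Iterates x_n = x_{n-1} - eta * grad F(x_{n-1}) with grad F(x) = 2x.
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * 2.0 * x
    return x

x0 = (1.0, 1.0)
for eta in (0.1, 0.5, 1.0, 1.5):  # illustrative step sizes
    print(eta, gradient_descent(x0, eta, 50))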
5. Stochastic gradient descent and backpropagation. Let (x1, y1), …, (xN, yN) be a training dataset consisting of input-output pairs for some N ∈ N, where for i = 1, …, N the xi denote the input vectors and the yi denote the output vectors. Furthermore, let l denote a corresponding loss function.
a. Describe the following concepts: (i) the construction of minibatches from the full training set; (ii) the minibatch risk; (iii) the rationale behind using minibatches for training.
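A minimal sketch, under the usual convention of shuffling and then partitioning the training set, of minibatch construction and the minibatch risk; the toy dataset and model below are assumptions for illustration:

import numpy as np

def minibatches(xs, ys, batch_size, rng):
    # Shuffle the full training set, then partition it into batches.
    idx = rng.permutation(len(xs))
    for start in range(0, len(xs), batch_size):
        b = idx[start:start + batch_size]
        yield xs[b], ys[b]

def minibatch_risk(loss, f, xb, yb):
    # The minibatch risk: average of the sample losses over the batch.
    return np.mean([loss(f(x), y) for x, y in zip(xb, yb)])

# Toy dataset and model (assumed for illustration).
rng = np.random.default_rng(0)
xs, ys = np.arange(10.0), 2.0 * np.arange(10.0)
model = lambda x: 1.9 * x
sq = lambda y_hat, y: (y_hat - y) ** 2
for xb, yb in minibatches(xs, ys, batch_size=4, rng=rng):
    print(minibatch_risk(sq, model, xb, yb))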
b. Consider a neural network f = fθ ∈ N2(1, 2, 1; ReLU, Id), where θ denotes the corresponding network parameters, given by
θ = (W1, W2; b1, b2) = ([4, −4]T, [−2, 3]; (0, 2), 2),
and let l denote the absolute loss function. Furthermore, let x = 1 and
y = 0 be one observation.
Hint for the exam: Be prepared to perform similar calculations for
different parameters, different starting values and different loss functions.
(i) Calculate the value fθ(x) in a forward pass as seen in the lecture and calculate the corresponding sample loss.
Hint: Start by calculating the values a0 and z1 := W1 a0 + b1.
(ii) Recall that the parameter updates are determined by gradient descent as
θnew = θ − η ∇θ l(fθ(x), y) (2)
for some η > 0. Calculate ∇θ l(fθ(x), y) in the above case by backpropagation and determine the updated value θnew of the network parameters in (2) for η = 1.
Hint: You may use z1 = [4, −2]T and z2 = −6 for your calculations.
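As a cross-check, a Python sketch of the forward pass and a manual backward pass for this network; the layout and variable names are illustrative, and the conventions sign(0) = 0 for the absolute loss and ReLU derivative 0 at 0 are assumptions:

import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Parameters theta = (W1, W2; b1, b2) from the question.
W1, b1 = np.array([[4.0], [-4.0]]), np.array([0.0, 2.0])
W2, b2 = np.array([[-2.0, 3.0]]), np.array([2.0])
x, y = np.array([1.0]), 0.0

# Forward pass, storing intermediate values for the backward pass.
a0 = x
z1 = W1 @ a0 + b1      # [4, -2], as in the hint
a1 = relu(z1)          # [4, 0]
z2 = W2 @ a1 + b2      # [-6], so f_theta(x) = -6
loss = abs(z2[0] - y)  # absolute loss: 6

# Backward pass for l(y_hat, y) = |y_hat - y|.
dz2 = np.sign(z2[0] - y)
dW2, db2 = dz2 * a1[None, :], np.array([dz2])
da1 = dz2 * W2[0]
dz1 = da1 * (z1 > 0)   # ReLU derivative taken as 0 at z = 0
dW1, db1 = np.outer(dz1, a0), dz1

eta = 1.0  # update (2) with eta = 1
print(W1 - eta * dW1, b1 - eta * db1)
print(W2 - eta * dW2, b2 - eta * db2)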