BME/ECE 695 Deep Learning Midterm II Solution, April 23, Spring 2020
Q1.
2 Points
Rules: I understand that this is an open book exam that must be completed within the allotted time of 120 minutes. I may use my notes and web resources. However, I will not communicate with anyone other than the official exam proctors during the exam, and I will not seek or accept help from anyone other than the official proctors.
Upload a scan of your signature here:
Name: (4 pt)
Q2 Back Propagation
32 Points
Consider the inference function
$$f_\theta(y) = f_{1,\theta_1}\left( f_{0,\theta_0}(y) \right) ,$$
and the associated loss function given by
$$L(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2 ,$$
where $[x_k, y_k]$ for $k = 0, \dots, K-1$ are training pairs, and $\theta = [\theta_0, \theta_1]$ is the associated full parameter vector.
The three figures below illustrate: a) a 2 Layer Network; b) the Forward Gradient Propagation network; and c) the Back Propagation network.
Q2.1
8 Points
For the forward gradient propagation with the training pair $[x_k, y_k]$, give expressions for $A_0$, $B_0$, $A_1$, and $B_1$ in terms of the functions $f_{0,\theta_0}(z_0)$ and $f_{1,\theta_1}(z_1)$.
Q2.2
8 Points
For the forward gradient propagation with the training pair $[x_k, y_k]$, give an expression for $\delta\hat{x}$ when $\delta\theta_1 = 0$, $\delta z_0 = 0$, and $\delta\theta_0$ is small. Express your result in terms of the matrices $A_0$, $B_0$, $A_1$, and $B_1$.
Q2.3
8 Points
For the back propagation, give an expression for $\varepsilon_k$ so that
$$g_{0,k} = \nabla_\theta \left\| x_k - f_\theta(y_k) \right\|^2 .$$
Q2.4
8 Points
For the back propagation, give an expression for the gradient of the total loss function,
$$g_0 = \nabla_\theta L(\theta) ,$$
in terms of the vectors $g_{0,k}$.
Solution:
Q2.1
$$A_0 = \nabla_{z_0} f_{0,\theta_0}(z_0), \quad B_0 = \nabla_{\theta_0} f_{0,\theta_0}(z_0), \quad A_1 = \nabla_{z_1} f_{1,\theta_1}(z_1), \quad B_1 = \nabla_{\theta_1} f_{1,\theta_1}(z_1)$$
Q2.2
$$\delta\hat{x} = A_1 \, \delta z_1 = A_1 B_0 \, \delta\theta_0$$
Q2.3
$$\varepsilon_k = -2\left( x_k - f_\theta(y_k) \right)$$
Q2.4
Due to a typo in the exam, the accepted answer was
$$g_0 = \nabla_{\theta_0} L(\theta) = \sum_{k=0}^{K-1} g_{0,k} .$$
However, based on the class notes, the answer would be
$$g_0 = \nabla_{\theta_0} L(\theta) = \frac{1}{K} \sum_{k=0}^{K-1} g_{0,k} .$$
Either answer received full credit.
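As a numerical sanity check on these expressions, the NumPy sketch below builds a hypothetical two-layer network (a tanh layer followed by a linear layer; these specific choices are illustrative assumptions, not the exam's figures), forms $A_0$, $B_0$, $A_1$, $B_1$ as numerical Jacobians, and verifies both the Q2.2 relation $\delta\hat{x} = A_1 B_0 \delta\theta_0$ and the Q2.3 back-propagated gradient $B_0^T A_1^T \varepsilon_k$ against finite differences.

```python
import numpy as np

# Illustrative two-layer network (tanh/linear choices are assumptions):
# f0(z0; th0) = tanh(W0 z0), f1(z1; th1) = W1 z1.
rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W0 = rng.normal(size=(n_hid, n_in))
W1 = rng.normal(size=(n_out, n_hid))
y = rng.normal(size=n_in)

def f0(z0, W0): return np.tanh(W0 @ z0)
def f1(z1, W1): return W1 @ z1
def f(y, W0, W1): return f1(f0(y, W0), W1)

def num_jac(fun, v, eps=1e-6):
    """Numerical Jacobian of fun at the flat vector v, by central differences."""
    J = np.zeros((fun(v).size, v.size))
    for i in range(v.size):
        dv = np.zeros_like(v); dv[i] = eps
        J[:, i] = (fun(v + dv) - fun(v - dv)) / (2 * eps)
    return J

z0 = y
z1 = f0(z0, W0)
A0 = num_jac(lambda z: f0(z, W0), z0)                            # A0 = grad_z0 f0
B0 = num_jac(lambda t: f0(z0, t.reshape(W0.shape)), W0.ravel())  # B0 = grad_th0 f0
A1 = num_jac(lambda z: f1(z, W1), z1)                            # A1 = grad_z1 f1
B1 = num_jac(lambda t: f1(z1, t.reshape(W1.shape)), W1.ravel())  # B1 = grad_th1 f1

# Q2.2: with d_th1 = 0 and d_z0 = 0, a small d_th0 gives d_xhat = A1 B0 d_th0.
d_th0 = 1e-7 * rng.normal(size=W0.size)
dx_direct = f(y, (W0.ravel() + d_th0).reshape(W0.shape), W1) - f(y, W0, W1)
print(np.allclose(dx_direct, A1 @ B0 @ d_th0, atol=1e-10))       # True

# Q2.3: injecting eps_k = -2 (x_k - f(y_k)) and back propagating through the
# th0 path gives the th0 block of g_{0,k} = grad_th ||x_k - f(y_k)||^2.
x = rng.normal(size=n_out)
eps_k = -2 * (x - f(y, W0, W1))
g_backprop = B0.T @ A1.T @ eps_k
loss = lambda t: np.atleast_1d(np.sum((x - f(y, t.reshape(W0.shape), W1))**2))
print(np.allclose(g_backprop, num_jac(loss, W0.ravel()).ravel(), atol=1e-5))  # True
```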
Q3 Estimation
34 Points
Consider the forward model of the form
$$X_k = f_\theta(Y_k) + W_k ,$$
where $Y_k \sim p(y)$ are i.i.d. vectors for $k = 0, \dots, K-1$, $f_\theta(\cdot)$ is the inference function parameterized by the vector $\theta$, and $W_k \sim N(0, \sigma^2 I)$ are i.i.d. noise vectors.
Q3.1
5 Points
Give an expression for $p_\theta(x|y)$, the conditional density of $X_k$ given $Y_k$ for known $\theta$.
Q3.2
5 Points
Calculate an expression for the maximum likelihood (ML) estimate of $\theta$ given $(X_k, Y_k)$ for $k = 0, \dots, K-1$.
Q3.3
8 Points
In this subquestion, take a Bayesian approach and assume that $\theta$ is a random vector composed of i.i.d. components, each having an exponential density given by
$$\theta_i \sim g(t) = \frac{1}{2\alpha} \exp\left( -\frac{|t|}{\alpha} \right) .$$
Making this new assumption, give an expression for the MAP estimate of $\theta$ given $(X_k, Y_k)$ for $k = 0, \dots, K-1$.
Q3.4
8 Points
Explain in words: a) the advantages of the MAP estimate over the ML estimate; b) the advantages of the ML estimate over the MAP estimate.
Solution:
Q3.1
Conditioned on knowledge of $Y_k$, we know that
$$p_\theta(x_k | y_k) = \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\left\{ -\frac{1}{2\sigma^2} \left\| x_k - f_\theta(y_k) \right\|^2 \right\} ,$$
where $p$ is the dimension of $x_k$.
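As a quick check of this closed form, the sketch below compares it against SciPy's Gaussian pdf; the tanh inference function and the constants are stand-ins chosen only for this check.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Verify the closed-form conditional density against scipy's Gaussian pdf.
rng = np.random.default_rng(1)
p, sigma2 = 3, 0.5
f_theta = lambda y: np.tanh(y)                  # hypothetical f_theta
y = rng.normal(size=p)
x = f_theta(y) + np.sqrt(sigma2) * rng.normal(size=p)

closed_form = (2 * np.pi * sigma2) ** (-p / 2) * np.exp(
    -np.sum((x - f_theta(y)) ** 2) / (2 * sigma2))
reference = multivariate_normal.pdf(x, mean=f_theta(y), cov=sigma2 * np.eye(p))
print(np.isclose(closed_form, reference))       # True
```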
Q3.2
From the previous problem, and using the fact that both $Y_k$ and $W_k$ are i.i.d., we know that
$$p_\theta(x_0, \dots, x_{K-1} | y_0, \dots, y_{K-1}) = \prod_{k=0}^{K-1} p_\theta(x_k | y_k) = \prod_{k=0}^{K-1} \frac{1}{(2\pi\sigma^2)^{p/2}} \exp\left\{ -\frac{1}{2\sigma^2} \left\| x_k - f_\theta(y_k) \right\|^2 \right\} = \frac{1}{(2\pi\sigma^2)^{Kp/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2 \right\} .$$
So then the maximum likelihood (ML) estimate is given by
$$\begin{aligned}
\hat{\theta}_{ML} &= \arg\min_\theta \left\{ -\log p_\theta(x|y) \right\} \\
&= \arg\min_\theta \left\{ \frac{1}{2\sigma^2} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2 + \frac{Kp}{2} \log 2\pi\sigma^2 \right\} \\
&= \arg\min_\theta \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2 \\
&= \arg\min_\theta \left\{ L_{MSE}(\theta) \right\} .
\end{aligned}$$
Q3.3
In this case, the MAP estimate is given by
$$\begin{aligned}
\hat{\theta}_{MAP} &= \arg\min_\theta \left\{ -\log p_\theta(x|y) - \log g(\theta) \right\} \\
&= \arg\min_\theta \left\{ \frac{1}{2\sigma^2} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2 + \frac{1}{\alpha} \left\| \theta \right\|_1 \right\} \\
&= \arg\min_\theta \left\{ \frac{1}{K} \sum_{k=0}^{K-1} \left\| x_k - f_\theta(y_k) \right\|^2 + \frac{2\sigma^2}{\alpha K} \left\| \theta \right\|_1 \right\} \\
&= \arg\min_\theta \left\{ L_{MSE}(\theta) + \frac{2\sigma^2}{\alpha K} \left\| \theta \right\|_1 \right\} .
\end{aligned}$$
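To make the effect of the prior concrete, here is a small sketch that minimizes both objectives numerically; the linear model $f_\theta(y) = \theta^T y$, the sparse ground truth, and all constants are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy comparison of the ML and MAP objectives derived above.
rng = np.random.default_rng(2)
K, p, sigma2, alpha = 50, 5, 0.1, 0.05
theta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])   # assumed sparse ground truth
Y = rng.normal(size=(K, p))
X = Y @ theta_true + np.sqrt(sigma2) * rng.normal(size=K)

def L_mse(theta):
    # (1/K) sum_k ||x_k - f_theta(y_k)||^2 with f_theta(y) = theta . y
    return np.mean((X - Y @ theta) ** 2)

def L_map(theta):
    # L_MSE plus the L1 penalty (2 sigma^2 / (alpha K)) ||theta||_1 from the prior
    return L_mse(theta) + (2 * sigma2 / (alpha * K)) * np.sum(np.abs(theta))

theta_ml = minimize(L_mse, np.zeros(p), method="Nelder-Mead").x
theta_map = minimize(L_map, np.zeros(p), method="Nelder-Mead").x
print("ML: ", np.round(theta_ml, 3))    # near theta_true, no shrinkage
print("MAP:", np.round(theta_map, 3))   # coefficients pulled slightly toward zero
```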
Q3.4
a) The advantages of MAP:
• It may result in lower-variance estimates if the prior model is accurate.
• It can generate a more accurate result when the amount of data is small and the number of unknown parameters is large.
b) The advantages of ML:
• It does not require the selection of a prior model.
• It is (mostly) unbiased.
• It is asymptotically efficient.
Q4 Interpretation of Loss Functions
32 Points
While training a deep neural network (DNN), you decide to partition your training data into three subsets: the training data – ST ; the validation data – SV ; and the testing data – SE . These three data subsets are associated with three separate loss functions given by
$$L_T(\theta) = \frac{1}{|S_T|} \sum_{k \in S_T} \left\| y_k - f_\theta(x_k) \right\|^2$$
$$L_V(\theta) = \frac{1}{|S_V|} \sum_{k \in S_V} \left\| y_k - f_\theta(x_k) \right\|^2$$
$$L_E(\theta) = \frac{1}{|S_E|} \sum_{k \in S_E} \left\| y_k - f_\theta(x_k) \right\|^2 ,$$
where $|S_T|$, $|S_V|$, and $|S_E|$ denote the number of training pairs in each subset.
Q4.1
8 Points
For this subproblem, assume that after training, $L_V \gg L_T$. a) What does this tell you about the capacity of the model? b) What does this tell you about the amount of training data?
Q4.2
8 Points
For this subproblem, assume that after training, $L_V \gg L_T$. What options do you have in this case to improve the accuracy of the DNN?
Q4.3
8 Points
For this subproblem, assume that after training, $L_V \approx L_T$. a) What does this tell you about the capacity of the model? b) What does this tell you about the amount of training data?
Q4.4
8 Points
For this subproblem, assume that after training, $L_V \approx L_T$. What options do you have in this case to improve the accuracy of the DNN?
Solution:
Q4.1
a) The capacity of the model is high.
b) The amount of training data is insufficient.
Q4.2
The approaches to improve the accuracy of DNN: • Early termination of training.
• Regularization.
• Drop-out method.
• Reduce model order.
• Increase the amount of data used to train the model. Q4.3
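As one concrete instance of the first option, here is a minimal early-termination loop; the `step` and `val_loss` callbacks and the `patience` threshold are assumed placeholders for whatever training code is in use.

```python
# Minimal early-termination loop: stop when L_V stops improving, which guards
# against the overfitting regime of Q4.1 where L_V >> L_T.
def train_with_early_stopping(step, val_loss, max_epochs=1000, patience=10):
    """Run training epochs until the validation loss L_V stops improving."""
    best_LV, best_epoch, stale = float("inf"), 0, 0
    for epoch in range(max_epochs):
        step()                  # one epoch of gradient updates on S_T
        LV = val_loss()         # evaluate L_V(theta) on S_V
        if LV < best_LV:
            best_LV, best_epoch, stale = LV, epoch, 0
        else:
            stale += 1          # no improvement on the validation set
        if stale >= patience:
            break               # L_V has plateaued; stop before overfitting grows
    return best_epoch, best_LV
```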
Q4.3
a) It is likely that the capacity of the model is low.
b) The amount of data is sufficient for the capacity of the model.
Q4.4
Approaches to improve the accuracy of the DNN:
• Increase the model order.
• Increase the training time to see if the training loss function, $L_T(\theta)$, can be further minimized.