SOLUTIONS FOR THE EXAM IN ARTIFICIAL NEURAL NETWORKS
Maximum score on this exam: 12 points.
Maximum score for homework problems: 12 points.
To pass the course it is necessary to score at least 5 points on this written exam.
1. Feature map.
Local fields of the feature map of pattern x^{(1)}:
\begin{pmatrix} 6 & 5 \\ 2 & 1 \\ 1 & 2 \\ 5 & 6 \end{pmatrix}.   (1)
Local fields of the feature map of pattern x^{(2)}:
\begin{pmatrix} 3 & 3 \\ 2 & 2 \\ 2 & 2 \\ 2 & 2 \end{pmatrix}.   (2)
The ReLU activation function does not have any effect, since all local fields are positive. The feature maps are therefore equal to the local fields above. Max-pooling layer of pattern x^{(1)}:
\begin{pmatrix} 6 \\ 2 \\ 6 \end{pmatrix}.   (3)
Max-pooling layer of pattern x^{(2)}:
\begin{pmatrix} 3 \\ 2 \\ 2 \end{pmatrix}.   (4)
With W_k = −δ_{k1} and Θ = −4 we have
\sum_{k=1}^{3} W_k \begin{pmatrix} 6 \\ 2 \\ 6 \end{pmatrix}_k − Θ = −2   (5)
and
\sum_{k=1}^{3} W_k \begin{pmatrix} 3 \\ 2 \\ 2 \end{pmatrix}_k − Θ = 1.   (6)
Applying the Heaviside activation function results in the requested outputs: 0 for pattern x^{(1)} and 1 for pattern x^{(2)}.
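These steps can be checked numerically. The following Python/NumPy sketch starts from the local-field matrices of Eqs. (1)-(2) and assumes 2x2 max pooling with stride 1 (an assumption consistent with Eqs. (3)-(4)); it then evaluates the output unit with W_k = −δ_{k1}, Θ = −4 and a Heaviside activation:

    import numpy as np

    # Local fields of the feature maps, Eqs. (1)-(2).
    b1 = np.array([[6, 5], [2, 1], [1, 2], [5, 6]], dtype=float)
    b2 = np.array([[3, 3], [2, 2], [2, 2], [2, 2]], dtype=float)

    relu = lambda b: np.maximum(b, 0.0)            # no effect here: all local fields positive

    def max_pool(V):
        """2x2 max pooling with stride 1 (assumed), mapping 4x2 -> 3 values."""
        return np.array([V[i:i + 2, :].max() for i in range(V.shape[0] - 1)])

    W = np.array([-1.0, 0.0, 0.0])                 # W_k = -delta_{k1}
    Theta = -4.0

    for b in (b1, b2):
        V = max_pool(relu(b))                      # Eqs. (3)-(4)
        local_field = W @ V - Theta                # Eqs. (5)-(6)
        print(V, local_field, np.heaviside(local_field, 0.0))

Running the sketch prints the pooled vectors (6, 2, 6) and (3, 2, 2), the local fields −2 and 1, and the outputs 0 and 1.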
2. Hopfield network with hidden units
Denote the value of hidden neuron i after the update by h′_i, and suppose that the kth hidden neuron changes sign. We then have
h′_i = h_i − 2h_iδ_{ik}.   (7)
The energy after the update is
H′ = −\sum_{i=1}^{M}\sum_{j=1}^{N} w_{ij} h′_i v_j   (8)
= −\sum_{j=1}^{N}\sum_{i=1}^{M} w_{ij} (h_i − 2h_iδ_{ik}) v_j   (9)
= −\sum_{j=1}^{N}\sum_{i=1}^{M} w_{ij} h_i v_j + 2\sum_{j=1}^{N}\sum_{i=1}^{M} w_{ij} h_iδ_{ik} v_j   (10)
= −\sum_{j=1}^{N}\sum_{i=1}^{M} w_{ij} h_i v_j + 2h_k\sum_{j=1}^{N} w_{kj} v_j   (11)
= H + 2h_k\sum_{j=1}^{N} w_{kj} v_j   (12)
= H + 2h_k b^{(h)}_k.   (13)
If the kth hidden neuron changes sign, then h_k b^{(h)}_k < 0, and hence H′ < H: the energy cannot increase under the update.
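The sign argument can be verified numerically. Below is a minimal Python/NumPy sketch with hypothetical random weights and units, assuming the usual deterministic update h′_k = sgn(b^{(h)}_k) with b^{(h)}_k = \sum_j w_{kj} v_j: whenever a hidden unit flips, the energy change equals the term 2h_k b^{(h)}_k of Eq. (13) and is negative:

    import numpy as np

    rng = np.random.default_rng(0)
    M, N = 5, 7                                    # hypothetical numbers of hidden/visible units
    w = rng.normal(size=(M, N))                    # weights w_ij
    v = rng.choice([-1.0, 1.0], size=N)            # visible units (held fixed)
    h = rng.choice([-1.0, 1.0], size=M)            # hidden units

    energy = lambda h_: -h_ @ w @ v                # H = -sum_ij w_ij h_i v_j

    for k in range(M):
        b_k = w[k] @ v                             # local field b_k^(h) of hidden unit k
        if np.sign(b_k) != h[k]:                   # the update flips unit k
            h_new = h.copy()
            h_new[k] = -h[k]
            dH = energy(h_new) - energy(h)
            assert np.isclose(dH, 2 * h[k] * b_k) and dH < 0
            print(f"flip k={k}: dH = {dH:.3f} < 0")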
3. Backpropagation
Solution: The weight update rule for L = 3 reads
δw^{(3)}_{pr} = −η \frac{∂H}{∂w^{(3)}_{pr}}   (14)
= η \sum_{i,μ} (y^{(μ)}_i − O^{(μ)}_i) \frac{∂V^{(3,μ)}_i}{∂w^{(3)}_{pr}}   (15)
= η \sum_{i,μ} (y^{(μ)}_i − O^{(μ)}_i) g′(b^{(3,μ)}_i) \sum_j \frac{∂w^{(3)}_{ij}}{∂w^{(3)}_{pr}} V^{(2,μ)}_j   (16)
= η \sum_{i,μ} (y^{(μ)}_i − O^{(μ)}_i) g′(b^{(3,μ)}_i) \sum_j δ_{ip}δ_{jr} V^{(2,μ)}_j   (17)
= η \sum_{μ} (y^{(μ)}_p − O^{(μ)}_p) g′(b^{(3,μ)}_p) V^{(2,μ)}_r   (18)
= η \sum_{μ} ∆^{(3,μ)}_p V^{(2,μ)}_r,   (19)
where ∆^{(3,μ)}_p = (y^{(μ)}_p − O^{(μ)}_p) g′(b^{(3,μ)}_p). Similarly,
δw^{(2)}_{pr} = −η \frac{∂H}{∂w^{(2)}_{pr}}   (20)
= η \sum_{i,μ} (y^{(μ)}_i − O^{(μ)}_i) g′(b^{(3,μ)}_i) w^{(3)}_{ip} g′(b^{(2,μ)}_p) V^{(1,μ)}_r   (21)
= η \sum_{i,μ} ∆^{(3,μ)}_i w^{(3)}_{ip} g′(b^{(2,μ)}_p) V^{(1,μ)}_r   (22)
= η \sum_{μ} ∆^{(2,μ)}_p V^{(1,μ)}_r,   (23)
where ∆^{(2,μ)}_p = \sum_i ∆^{(3,μ)}_i w^{(3)}_{ip} g′(b^{(2,μ)}_p). Similarly,
δw^{(1)}_{pr} = η \sum_{μ} ∆^{(1,μ)}_p V^{(0,μ)}_r,   (24)
where ∆^{(1,μ)}_p = \sum_i ∆^{(2,μ)}_i w^{(2)}_{ip} g′(b^{(1,μ)}_p).
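As a sanity check of these formulas, the sketch below builds a small three-layer network with hypothetical random weights and data, uses g = tanh and H = \frac{1}{2}\sum_{i,μ}(y^{(μ)}_i − O^{(μ)}_i)^2, and omits thresholds for brevity (all of these choices are illustrative assumptions). It evaluates the recursions for ∆^{(3)}, ∆^{(2)}, ∆^{(1)} of Eqs. (19), (23), (24) and compares one component of δw^{(1)} with a numerical derivative:

    import numpy as np

    rng = np.random.default_rng(1)
    N0, N1, N2, N3, p = 4, 5, 5, 3, 10             # illustrative layer sizes and number of patterns
    W1, W2, W3 = (0.5 * rng.normal(size=s) for s in [(N1, N0), (N2, N1), (N3, N2)])
    x = rng.normal(size=(N0, p))                   # inputs V^(0,mu), one column per pattern
    y = rng.normal(size=(N3, p))                   # targets y^(mu)

    g = np.tanh
    gprime = lambda b: 1.0 - np.tanh(b) ** 2

    def forward(W1_):
        b1 = W1_ @ x;  V1 = g(b1)
        b2 = W2 @ V1;  V2 = g(b2)
        b3 = W3 @ V2;  O = g(b3)                   # outputs O^(mu)
        return b1, V1, b2, V2, b3, O

    b1, V1, b2, V2, b3, O = forward(W1)
    eta = 1e-2

    # Backpropagated errors and weight updates, Eqs. (19), (23), (24).
    D3 = (y - O) * gprime(b3)                      # Delta^(3,mu)
    D2 = (W3.T @ D3) * gprime(b2)                  # Delta^(2,mu)
    D1 = (W2.T @ D2) * gprime(b1)                  # Delta^(1,mu)
    dW3, dW2, dW1 = eta * D3 @ V2.T, eta * D2 @ V1.T, eta * D1 @ x.T

    # Numerical check of one component: delta w^(1)_{00} = -eta dH/dw^(1)_{00}.
    H = lambda W1_: 0.5 * np.sum((y - forward(W1_)[-1]) ** 2)
    eps = 1e-6
    E = np.zeros_like(W1)
    E[0, 0] = eps
    print(np.isclose(dW1[0, 0], -eta * (H(W1 + E) - H(W1 - E)) / (2 * eps)))   # True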
4. XNOR function.
3w = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} + \begin{pmatrix} 1 & 1 & −1 \\ 1 & 1 & −1 \\ −1 & −1 & 1 \end{pmatrix}   (25)
+ \begin{pmatrix} 1 & −1 & −1 \\ −1 & 1 & 1 \\ −1 & 1 & 1 \end{pmatrix} + \begin{pmatrix} 1 & −1 & 1 \\ −1 & 1 & −1 \\ 1 & −1 & 1 \end{pmatrix}   (26)
= \begin{pmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{pmatrix}.   (27)
The weight matrix is proportional to the identity matrix, and the network does not recognise the XNOR function.
H = −\frac{1}{2} \sum_{ij} w_{ij} x_i x_j   (28)
= −\frac{1}{2}\,\frac{4}{3} \sum_{ij} δ_{ij} x_i x_j   (29)
= −\frac{2}{3} \sum_{i} x_i^2   (30)
= −\frac{2}{3}\,3   (31)
= −2.   (32)
Thus the energy function is constant for all states and the network cannot learn. Further, even the modified Hopfield rule (with the diagonal weights set to zero) would not work in this case, because the weight matrix would then be 0 and the energy would always be 0 as well.
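A short Python/NumPy sketch of this calculation: it forms the Hebbian weight matrix from the four XNOR patterns, as in Eqs. (25)-(27), and evaluates the energy of Eq. (28) for every state in {−1, +1}^3, confirming that it is constant:

    import numpy as np
    from itertools import product

    # The four XNOR patterns (x1, x2, x1 XNOR x2) in +/-1 coding.
    patterns = np.array([[ 1,  1,  1],
                         [ 1, -1, -1],
                         [-1,  1, -1],
                         [-1, -1,  1]], dtype=float)

    N = 3
    w = sum(np.outer(x, x) for x in patterns) / N              # Hebb's rule: w = (4/3) * identity
    H = lambda x: -0.5 * x @ w @ x                             # Eq. (28)

    energies = [H(np.array(s, dtype=float)) for s in product([-1, 1], repeat=N)]
    print(w)                                                   # diag(4/3, 4/3, 4/3)
    print(set(np.round(energies, 12)))                         # -2.0 for all 8 states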
Storing only 3 out of the 4 patterns makes the problem linearly separable instead of linearly inseparable. Thus the network can now recognise the patterns.
5. Gradient descent and momentum
Consider the given energy function H as a function of weight w, as shown in Fig. 1. Use the following gradient descent update rule,
δw_{n+1} = −η\,\frac{∂H}{∂w} + α\,δw_n.   (33)
Assume that the system is initially at point A, and that ηs = 1/2. The slope of the segment AB in Fig. 1 is −s and the slope of the segment BC is 0. The system starts at time step 1, and assume that δw_0 = 0.
1. Find the number of time steps required to travel from point A to point B for α = 0.
2. Repeat the previous calculation for the case α = 1/2, and graphically find the solution of the final equation you obtain.
3. Indicate the results of the previous two parts on the same graph. Which of the two cases, α = 0 or α = 1/2, converges faster?
4. What is the fate of the two systems α = 0 and α = 1/2 once they cross point B?
[Figure: energy H as a function of weight w; the descending segment AB has slope −s and horizontal extent L, and the flat segment BC has slope 0 and horizontal extent M.]
Figure 1: Energy as a function of weight for problem: Gradient descent and momentum.
Solution: 1 and 2: We calculate the total change in weight after n time steps, ∆w_n = \sum_{i=1}^{n} δw_i, equate ∆w_n to L and solve for n. We proceed by first solving for δw_n. On segment AB we have ∂H/∂w = −s, so the update rule reads δw_{n+1} = ηs + α\,δw_n. Iterating this equation we find
δw_{i+1} = ηs \sum_{j=0}^{i} α^{j} + α^{i+1} δw_0   (34)
= ηs\,\frac{1−α^{i+1}}{1−α}.   (35)
Next compute ∆w_n:
∆w_n = \sum_{i=1}^{n} δw_i   (36)
= ηs \sum_{i=1}^{n} \frac{1−α^{i}}{1−α}   (37)
= \frac{ηs}{1−α}\left( n − α\,\frac{1−α^{n}}{1−α} \right).   (38)
Thus, using ηs = 1/2 we obtain ∆w_n(α = 0) = n/2 for α = 0, and ∆w_n(α = 1/2) = n − 1 + 2^{−n} for α = 1/2. Equating ∆w_n = L we obtain
n_{α=0} = 2L,   (39)
n_{α=1/2} − 1 + 2^{−n_{α=1/2}} = L.   (40)
Graphing the above equations (Fig. 2), we see that n_{α=1/2} < n_{α=0}; thus α = 1/2 converges faster.
[Figure: graphical solution of Eq. (40); the curves f(n) = L + 1 − n and f(n) = 2^{−n} intersect at n = n_{α=1/2}, with n_{α=0} = 2L marked on the n-axis.]
Figure 2: Graphical solution of problem: gradient descent and momentum.
After crossing point B the slope is zero, so δw(α = 0) = 0 and this system stays stationary; however, δw(α = 1/2) > 0 because of the momentum term, so this system keeps on moving.
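The step counts can also be obtained by simply iterating the update rule. A small Python sketch, with an illustrative value L = 10 for the horizontal extent of segment AB (an assumption; any L > 1 shows the same behaviour) and ηs = 1/2:

    eta_s = 0.5                    # eta * s = 1/2, as given
    L = 10.0                       # illustrative length of segment AB (assumption)

    def steps_to_B(alpha):
        """Iterate delta_w_{n+1} = eta*s + alpha*delta_w_n on segment AB until the total reaches L."""
        dw, total, n = 0.0, 0.0, 0
        while total < L:
            dw = eta_s + alpha * dw
            total += dw
            n += 1
        return n

    print(steps_to_B(0.0))         # 20 steps = 2L, Eq. (39)
    print(steps_to_B(0.5))         # 11 steps, the solution of Eq. (40)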
6. Linear activation function
a)
\frac{∂H}{∂w_i} = \frac{∂}{∂w_i}\,\frac{1}{2}\sum_{μ=1}^{p} \left( O^{(μ)} − t^{(μ)} \right)^2   (41)
= \sum_{μ=1}^{p} \left( O^{(μ)} − t^{(μ)} \right) \frac{∂O^{(μ)}}{∂w_i}   (42)
= \sum_{μ=1}^{p} \left( O^{(μ)} − t^{(μ)} \right) x^{(μ)}_i   (43)
= \sum_{μ=1}^{p} \left( \sum_{j=1}^{N} w_j x^{(μ)}_j − θ − t^{(μ)} \right) x^{(μ)}_i   (44)
= \sum_{μ=1}^{p} \left( \sum_{j=1}^{N} w_j x^{(μ)}_j x^{(μ)}_i − θ x^{(μ)}_i − t^{(μ)} x^{(μ)}_i \right)   (45)
= \sum_{j=1}^{N} w_j \sum_{μ=1}^{p} x^{(μ)}_j x^{(μ)}_i − θ \sum_{μ=1}^{p} x^{(μ)}_i − \sum_{μ=1}^{p} t^{(μ)} x^{(μ)}_i   (46)
= \sum_{j=1}^{N} w_j\,p\,G_{ji} − θ\,p\,β_i − p\,α_i   (47)
= p \left( \sum_{j=1}^{N} G_{ij} w_j − θ β_i − α_i \right)   (48)
= p \left( Gw − θβ − α \right)_i,   (49)
where we have identified G_{ji} = \frac{1}{p}\sum_{μ} x^{(μ)}_j x^{(μ)}_i, β_i = \frac{1}{p}\sum_{μ} x^{(μ)}_i and α_i = \frac{1}{p}\sum_{μ} t^{(μ)} x^{(μ)}_i, and used that G is symmetric. Setting the derivative to zero gives
\frac{∂H}{∂w_i} = 0 \;⇒\; Gw = α + θβ.   (50)
\frac{∂H}{∂θ} = \frac{∂}{∂θ}\,\frac{1}{2}\sum_{μ=1}^{p} \left( O^{(μ)} − t^{(μ)} \right)^2   (51)
= \sum_{μ=1}^{p} \left( O^{(μ)} − t^{(μ)} \right) \frac{∂O^{(μ)}}{∂θ}   (52)
= \sum_{μ=1}^{p} \left( O^{(μ)} − t^{(μ)} \right)(−1)   (53)
= −\sum_{μ=1}^{p} \left( \sum_{j=1}^{N} w_j x^{(μ)}_j − θ − t^{(μ)} \right)   (54)
= \sum_{μ=1}^{p} \left( −\sum_{j=1}^{N} w_j x^{(μ)}_j + θ + t^{(μ)} \right)   (55)
= −\sum_{j=1}^{N} w_j \sum_{μ=1}^{p} x^{(μ)}_j + p\,θ + \sum_{μ=1}^{p} t^{(μ)}   (56)
= −p\sum_{j=1}^{N} w_j β_j + p\,θ + p\,γ   (57)
= −p \left( w^{T}β − θ − γ \right),   (58)
where γ = \frac{1}{p}\sum_{μ} t^{(μ)}. Setting the derivative to zero gives
\frac{∂H}{∂θ} = 0 \;⇒\; w^{T}β = θ + γ.   (59)
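Both gradients can be checked against finite differences. The sketch below uses hypothetical random data (sizes and seed are arbitrary) and compares the closed forms of Eqs. (49) and (58) with numerical derivatives of H:

    import numpy as np

    rng = np.random.default_rng(4)
    N, p = 3, 20                                   # hypothetical sizes
    x = rng.normal(size=(N, p))                    # inputs x^(mu), one column per pattern
    t = rng.normal(size=p)                         # targets t^(mu)
    w, theta = rng.normal(size=N), 0.7             # arbitrary trial weights and threshold

    G = x @ x.T / p                                # G_ij = (1/p) sum_mu x_i x_j
    alpha = x @ t / p                              # alpha_i = (1/p) sum_mu t x_i
    beta, gamma = x.mean(axis=1), t.mean()         # beta_i and gamma

    H = lambda w_, th_: 0.5 * np.sum((w_ @ x - th_ - t) ** 2)

    grad_w = p * (G @ w - theta * beta - alpha)    # Eq. (49)
    grad_th = -p * (w @ beta - theta - gamma)      # Eq. (58)

    eps, e0 = 1e-6, np.eye(N)[0]
    num_w0 = (H(w + eps * e0, theta) - H(w - eps * e0, theta)) / (2 * eps)
    num_th = (H(w, theta + eps) - H(w, theta - eps)) / (2 * eps)
    print(np.isclose(grad_w[0], num_w0), np.isclose(grad_th, num_th))   # True True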
b) The first equation gives
w = G^{−1}α + θ\,G^{−1}β.
Insert this into the second, and use that w^{T}β = β^{T}w:
β^{T}\left( G^{−1}α + θ\,G^{−1}β \right) = θ + γ   (60)
⇒ β^{T}G^{−1}α + θ\,β^{T}G^{−1}β = θ + γ   (61)
⇒ θ \left( β^{T}G^{−1}β − 1 \right) = γ − β^{T}G^{−1}α   (62)
⇒ θ = \frac{γ − β^{T}G^{−1}α}{β^{T}G^{−1}β − 1}.   (63)
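As a check, the sketch below builds G, α, β and γ from hypothetical random data, computes θ from Eq. (63) and w from Eq. (50), and verifies that the result coincides with an ordinary least-squares fit of t ≈ w·x − θ:

    import numpy as np

    rng = np.random.default_rng(2)
    N, p = 3, 50                                   # hypothetical sizes
    x = rng.normal(size=(N, p))                    # inputs x^(mu), one column per pattern
    t = rng.normal(size=p)                         # targets t^(mu)

    G = x @ x.T / p                                # G_ij = (1/p) sum_mu x_i x_j
    alpha = x @ t / p                              # alpha_i = (1/p) sum_mu t x_i
    beta = x.mean(axis=1)                          # beta_i = (1/p) sum_mu x_i
    gamma = t.mean()                               # gamma = (1/p) sum_mu t

    Ginv = np.linalg.inv(G)
    theta = (gamma - beta @ Ginv @ alpha) / (beta @ Ginv @ beta - 1)   # Eq. (63)
    w = Ginv @ alpha + theta * Ginv @ beta                             # from Eq. (50)

    # The stationary point should coincide with a least-squares fit of t ~ w.x - theta.
    A = np.vstack([x, -np.ones(p)]).T
    sol, *_ = np.linalg.lstsq(A, t, rcond=None)
    print(np.allclose(sol, np.append(w, theta)))   # True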
c) The equation can be written as
V^{(μ,l)} = W V^{(μ,l−2)} − Θ,   (64)
where
W = w^{(l)} w^{(l−1)}   (65)
and
Θ = w^{(l)} θ^{(l−1)} + θ^{(l)}.   (66)
The two layers can therefore be collapsed into one single layer, and with a
linear activation function in all layers the whole perceptron collapses into a simple perceptron with linear activation function. Such a perceptron can only solve linearly separable problems.
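A short numerical illustration of the collapse (with hypothetical layer widths and random weights): applying the two linear layers in sequence gives the same result as the single collapsed layer of Eqs. (64)-(66):

    import numpy as np

    rng = np.random.default_rng(3)
    n0, n1, n2 = 4, 6, 3                                        # hypothetical layer widths
    w1, th1 = rng.normal(size=(n1, n0)), rng.normal(size=n1)    # layer l-1
    w2, th2 = rng.normal(size=(n2, n1)), rng.normal(size=n2)    # layer l
    V0 = rng.normal(size=n0)                                    # V^(mu,l-2)

    V2 = w2 @ (w1 @ V0 - th1) - th2                             # two linear layers in sequence
    W, Theta = w2 @ w1, w2 @ th1 + th2                          # collapsed layer, Eqs. (65)-(66)
    print(np.allclose(V2, W @ V0 - Theta))                      # True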