Solutions for Artificial Neural Networks, September 21, 2020
Chapter 2
2.3 Cross-talk term
(a) Apply x^(ν) to the network: S_i = sgn( Σ_{j=1}^N w_ij x_j^(ν) ). For x^(ν) to remain unchanged under this update we must require that S_i = x_i^(ν). In other words,

x_i^(ν) = sgn( Σ_{j=1}^N w_ij x_j^(ν) ).   (1)

With Hebb's rule, w_ij = (1/N) Σ_{μ=1}^p x_i^(μ) x_j^(μ) for j ≠ i (and w_ii = 0), this condition evaluates to:

x_i^(ν) = sgn( (1/N) Σ_{j≠i} Σ_{μ=1}^p x_i^(μ) x_j^(μ) x_j^(ν) )
        = sgn( (1/N) Σ_{j≠i} x_i^(ν) x_j^(ν) x_j^(ν) + (1/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν) )
        = sgn( ((N−1)/N) x_i^(ν) + (1/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν) )
        = sgn( x_i^(ν) − (1/N) x_i^(ν) + (1/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν) )
        = sgn( x_i^(ν) + corrections ).

The corrections are given by:

corrections = −(1/N) x_i^(ν) + (1/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν).

Condition (1) is satisfied if the corrections do not flip the sign of the argument. To make this precise, define C_i^(ν) as follows:

C_i^(ν) ≡ 1/N − (1/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν) x_i^(ν) = corrections × (−x_i^(ν)).

Multiply both sides of Equation (1) by −x_i^(ν) (using that x sgn(a) = sgn(x a) for x = ±1) and rewrite this condition as

−1 = sgn( −1 + C_i^(ν) ).

This condition is satisfied provided that C_i^(ν) < 1. No limits have been taken so far.
(b) Assume random patterns x^(μ) with independent bits

x_i^(μ) = +1 with probability 1/2,
          −1 with probability 1/2.

The bit x_i^(ν) is unchanged after a single asynchronous update if C_i^(ν) < 1, as derived in task (a). Thus, the probability that the bit x_i^(ν) is changed (although it shouldn't be) is given by

P_error^(t=1) = Prob( C_i^(ν) > 1 ).

To evaluate P_error^(t=1), consider C_i^(ν) in the limit of N ≫ 1:

C_i^(ν) ≈ −(1/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν) x_i^(ν) = −(1/N) Σ_{k=1}^{(N−1)(p−1)} z_k,

where the z_k are independent random variables with

z_k = +1 with probability 1/2,
      −1 with probability 1/2.

Since the variables z_k are independently and identically distributed (with zero mean and unit variance), we can use the central limit theorem. It follows that the distribution of C_i^(ν) has the following properties:

• C_i^(ν) is approximately Gaussian distributed in the limit N ≫ 1.
• The mean of C_i^(ν) is equal to zero (since the mean of the z_k vanishes).
• The variance σ² of C_i^(ν) is σ² = (1/N²)(N−1)(p−1) ≈ p/N in the limit N ≫ 1 and p ≫ 1.

Thus it follows that

P_error^(t=1) = Prob( C_i^(ν) > 1 ) = ∫_1^∞ dx (1/√(2πσ²)) e^{−x²/(2σ²)} = (1/2)[ 1 − erf( 1/√(2σ²) ) ],

where the error function erf(z) is defined as

erf(z) = (2/√π) ∫_0^z dy e^{−y²}.

We conclude that

P_error^(t=1) = (1/2)[ 1 − erf( 1/√(2α) ) ].

Here α = p/N is the storage capacity.
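As a quick numerical cross-check of this result, the sketch below estimates the one-step error probability by sampling random patterns and comparing with (1/2)[1 − erf(1/√(2α))]. It is a minimal simulation written for this solution (the values of N, p and the number of trials are arbitrary choices), not part of the original exercise.

```python
import numpy as np
from math import erf, sqrt

def one_step_error_rate(N, p, trials=20, rng=np.random.default_rng(1)):
    """Estimate the probability that a single bit of a stored pattern
    flips in one update of a Hopfield net with Hebb's rule."""
    errors, total = 0, 0
    for _ in range(trials):
        x = rng.choice([-1, 1], size=(p, N))     # p random patterns
        W = (x.T @ x) / N                        # Hebb's rule
        np.fill_diagonal(W, 0.0)                 # w_ii = 0
        b = W @ x[0]                             # local fields for pattern nu = 1
        errors += np.sum(np.sign(b) != x[0])
        total += N
    return errors / total

N, p = 200, 40                                   # alpha = p/N = 0.2
alpha = p / N
print("simulation :", one_step_error_rate(N, p))
print("theory     :", 0.5 * (1 - erf(1 / sqrt(2 * alpha))))
```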
2.11 Hopfield net with four neurons
One stored pattern in the network: x^(1) = (1, −1, −1, −1)^T. Weight matrix:

W = (1/N) x^(1) x^(1)T = (1/4) (  1 −1 −1 −1
                                 −1  1  1  1
                                 −1  1  1  1
                                 −1  1  1  1 ).

Feeding the 2^4 = 16 patterns gives:

1. Take S(t=0) = x^(1) = (1, −1, −1, −1)^T. To compute S(t=1) we must first evaluate W S(t=0) = (1/4) x^(1) x^(1)T x^(1) = x^(1), and then take the sgn function componentwise. We obtain S(t=1) = x^(1).

2. S(t=0) = (−1, −1, −1, −1)^T, so W S(t=0) = (1/4) x^(1) [x^(1)T S(t=0)] = (1/2) x^(1) = (1/2, −1/2, −1/2, −1/2)^T. Taking the sgn function componentwise gives S(t=1) = x^(1). The other cases are analogous; the results are given below.

3. S(t=0) = (1, 1, −1, −1)^T: S(t=1) = x^(1).

4. S(t=0) = (1, −1, 1, −1)^T: S(t=1) = x^(1).

5. S(t=0) = (1, −1, −1, 1)^T: S(t=1) = x^(1).

6.–11. The six patterns S(t=0) that differ from x^(1) in exactly two bits: here x^(1)T S(t=0) = 0, so W S(t=0) = 0 and S(t=1) = 0.

12.–15. The four patterns S(t=0) that differ from x^(1) in exactly three bits: S(t=1) = −x^(1).

16. S(t=0) = −x^(1) = (−1, 1, 1, 1)^T: S(t=1) = −x^(1).

In summary:

1. When we feed one of the first five patterns, the stored pattern is retrieved. Pattern no. 1 is the stored pattern itself, and patterns 2 to 5 differ from it in only one bit.
2. When we feed a pattern that differs from the stored pattern in more than two bits, the network retrieves the inverted version of the stored pattern (patterns 12–16).
3. When exactly two bits are distorted, the network fails (patterns 6 to 11).
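The enumeration above is easy to reproduce numerically. The following sketch (written for this solution; the synchronous update and variable names are my own choices) builds W from the single stored pattern and applies one sgn update to all 16 possible states.

```python
import itertools
import numpy as np

x1 = np.array([1, -1, -1, -1])
W = np.outer(x1, x1) / 4                       # Hebb's rule, one stored pattern

for S0 in itertools.product([-1, 1], repeat=4):
    S0 = np.array(S0)
    b = W @ S0                                 # local fields
    S1 = np.sign(b).astype(int)                # sgn(0) = 0: the "failure" case
    flipped = np.sum(S0 != x1)                 # Hamming distance to stored pattern
    print(f"distance {flipped}: S(t=0)={S0} -> S(t=1)={S1}")
```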
2.12 Recognising letters with a Hopfield net

(a) Define

Q^(μ,ν) = Σ_{j=1}^N x_j^(μ) x_j^(ν).   (2)

Bit j contributes +1 to Q^(μ,ν) if x_j^(μ) = x_j^(ν), and −1 if x_j^(μ) ≠ x_j^(ν). Since the number of bits is N = 32, we have Q^(μ,ν) = 32 − 2H^(μ,ν), where H^(μ,ν) is the number of bits that differ between patterns μ and ν (the Hamming distance). By looking at the figures, we find the following:

• H^(1,1) = 0 ⇒ Q^(1,1) = 32
• H^(1,2) = 13 ⇒ Q^(1,2) = 6
• H^(1,3) = 2 ⇒ Q^(1,3) = 28
• H^(1,4) = 32 ⇒ Q^(1,4) = −32
• H^(1,5) = 16 ⇒ Q^(1,5) = 0
• H^(2,1) = H^(1,2) = 13 ⇒ Q^(2,1) = 6
• H^(2,2) = 0 ⇒ Q^(2,2) = 32
• H^(2,3) = 15 ⇒ Q^(2,3) = 2
• H^(2,4) = 32 − H^(2,1) = 19 ⇒ Q^(2,4) = −6
• H^(2,5) = 19 ⇒ Q^(2,5) = −6
(b) We have

b_i^(ν) = Σ_j w_ij x_j^(ν) = (1/32) Σ_j [ x_i^(1) x_j^(1) + x_i^(2) x_j^(2) ] x_j^(ν)
        = (1/32) x_i^(1) Σ_j x_j^(1) x_j^(ν) + (1/32) x_i^(2) Σ_j x_j^(2) x_j^(ν)
        = (1/32) x_i^(1) Q^(1,ν) + (1/32) x_i^(2) Q^(2,ν).

From (a) we have:

b_i^(1) = (Q^(1,1)/32) x_i^(1) + (Q^(2,1)/32) x_i^(2) = x_i^(1) + (6/32) x_i^(2),
b_i^(2) = (Q^(1,2)/32) x_i^(1) + (Q^(2,2)/32) x_i^(2) = (6/32) x_i^(1) + x_i^(2),
b_i^(3) = (Q^(1,3)/32) x_i^(1) + (Q^(2,3)/32) x_i^(2) = (28/32) x_i^(1) + (2/32) x_i^(2),
b_i^(4) = (Q^(1,4)/32) x_i^(1) + (Q^(2,4)/32) x_i^(2) = −x_i^(1) − (6/32) x_i^(2),
b_i^(5) = (Q^(1,5)/32) x_i^(1) + (Q^(2,5)/32) x_i^(2) = −(6/32) x_i^(2).
(c) From (b), we find that:

x^(1) → sgn(b_i^(1)) = x_i^(1),
x^(2) → sgn(b_i^(2)) = x_i^(2),
x^(3) → sgn(b_i^(3)) = x_i^(1),
x^(4) → sgn(b_i^(4)) = −x_i^(1) = x_i^(4),
x^(5) → sgn(b_i^(5)) = −x_i^(2).

Thus, patterns x^(1), x^(2) and x^(4) are stable.
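The sign of each b_i^(ν) can be checked directly from the overlaps found in (a), since b_i^(ν) depends only on the two bits x_i^(1) and x_i^(2). The small helper below is written for this solution (not part of the original text): it evaluates sgn(b_i^(ν)) for all four bit combinations and reports with which stored pattern the update agrees.

```python
import itertools
import numpy as np

# Overlaps from part (a): Q[nu] = (Q^(1,nu), Q^(2,nu))
Q = {1: (32, 6), 2: (6, 32), 3: (28, 2), 4: (-32, -6), 5: (0, -6)}
combos = list(itertools.product([-1, 1], repeat=2))       # possible (x_i^(1), x_i^(2))

for nu, (q1, q2) in Q.items():
    signs = {(a, b): np.sign((q1 * a + q2 * b) / 32) for a, b in combos}
    labels = []
    if all(signs[(a, b)] == a for a, b in combos):  labels.append("x^(1)")
    if all(signs[(a, b)] == -a for a, b in combos): labels.append("-x^(1)")
    if all(signs[(a, b)] == b for a, b in combos):  labels.append("x^(2)")
    if all(signs[(a, b)] == -b for a, b in combos): labels.append("-x^(2)")
    print(f"feeding x^({nu}) gives", labels or ["neither"])
```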
2.13 Diluted Hopfield net

In this exercise only a small fraction of the connections is active; the other connections are inactive. Even so, the network is able to learn, as will be shown here. The dynamics and the weights are

S_i = sgn(b_i),   b_i = Σ_{j=1}^N w_ij S_j,   w_ij = (K_ij/K) Σ_{μ=1}^p x_i^(μ) x_j^(μ),

where

K_ij = 1 with probability K/N,
       0 with probability 1 − K/N.

Limit considered: N ≫ 1, p ≫ 1, K ≫ 1, N ≫ p, N ≫ K.

(a) Set S_j = x_j^(ν). This gives

b_i = Σ_{j=1}^N w_ij x_j^(ν) = Σ_{j=1}^N (K_ij/K) Σ_{μ=1}^p x_i^(μ) x_j^(μ) x_j^(ν)
    = Σ_{j=1}^N (K_ij/K) x_i^(ν) x_j^(ν) x_j^(ν) + (K_ii/K) Σ_{μ≠ν} x_i^(μ) x_i^(μ) x_i^(ν) + Σ_{j≠i} (K_ij/K) Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν)
    = ( Σ_{j=1}^N K_ij/K ) x_i^(ν) + ( K_ii (p−1)/K ) x_i^(ν) + Σ_{j≠i} (K_ij/K) Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν).

Averaging over the dilution, using ⟨K_ij⟩_c = 1·(K/N) + 0·(1 − K/N) = K/N, gives

⟨b_i⟩_c = (N/K)(K/N) x_i^(ν) + ((p−1)/K)(K/N) x_i^(ν) + (1/K)(K/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν)
        = x_i^(ν) + ((p−1)/N) x_i^(ν) + (1/N) Σ_{j≠i} Σ_{μ≠ν} x_i^(μ) x_j^(μ) x_j^(ν)
        ≈ x_i^(ν) + ⟨cross-talk term⟩.

In the last line, the limit N ≫ 1, N ≫ p was used: in this limit, the contribution of the second term in the penultimate line is negligible, i.e. ((p−1)/N) x_i^(ν) ≈ 0.

(b) ⟨C_i^(ν)⟩_c = ⟨cross-talk term⟩ does not depend on N because it is the sum of (N−1)(p−1)·(K/N) ≈ pK random numbers that are ±1 with equal probability, divided by K (the remaining terms are zero). Thus ⟨C_i^(ν)⟩_c is the sum of ≈ Kp random numbers that are ±1 with equal probability, divided by K. Note that here we used that in the limit N ≫ 1, p ≫ 1 it follows that N − 1 ≈ N and p − 1 ≈ p. In the limit of K ≫ 1 and p ≫ 1, we use the central limit theorem to conclude that ⟨C_i^(ν)⟩_c is Gaussian distributed with mean 0. The variance is computed as follows. Since each term is ±1 divided by K, the variance of each term is 1/K². There are approximately Kp terms in the sum, and so the variance of ⟨C_i^(ν)⟩_c is approximately equal to Kp/K² = p/K. Note that the variance depends on K because, as explained above, the number of terms that enter the sum depends on K.

(c) m_ν = (1/N) Σ_{i=1}^N x_i^(ν) ⟨⟨S_i⟩_c⟩_time. In the limit of N ≫ 1, we use the approximation ⟨⟨S_i⟩_c⟩_time ≈ ⟨S_i⟩_c ≈ sgn(⟨b_i⟩_c) to obtain

m_ν = (1/N) Σ_i x_i^(ν) sgn(⟨b_i⟩_c) = (1/N) Σ_i sgn( x_i^(ν) ⟨ Σ_j w_ij S_j ⟩_c )
    = (1/N) Σ_i sgn( x_i^(ν) ⟨ Σ_j (K_ij/K) Σ_μ x_i^(μ) x_j^(μ) S_j ⟩_c )
    = (1/N) Σ_i sgn( ⟨ Σ_j (K_ij/K) x_j^(ν) S_j ⟩_c + x_i^(ν) ⟨ Σ_j (K_ij/K) Σ_{μ≠ν} x_i^(μ) x_j^(μ) S_j ⟩_c ),

where the second term in the argument of sgn is x_i^(ν) ⟨C_i^(ν)⟩_c. We have that Σ_j (K_ij/K) x_j^(ν) S_j ≈ m_ν in the limit of K ≫ 1. It follows that

m_ν ≈ (1/N) Σ_{i=1}^N sgn( m_ν + x_i^(ν) ⟨C_i^(ν)⟩_c ),

where ⟨C_i^(ν)⟩_c is Gaussian distributed with mean 0 and variance approximately equal to p/K. Therefore

m_ν ≈ ∫ dC p(C) sgn( m_ν + x_i^(ν) C ),   where   p(C) = (1/√(2πp/K)) e^{−C² K/(2p)}.

Now we need to split the integral to account for the fact that x_i^(ν) = ±1 with equal probability:

m_ν ≈ (1/2) ∫_{−∞}^{∞} dC p(C) sgn(m_ν + C) + (1/2) ∫_{−∞}^{∞} dC p(C) sgn(m_ν − C)

   [ assuming m_ν ≥ 0 ]

    = (1/2)[ 1 − 2 ∫_{−∞}^{−m_ν} dC p(C) ] + (1/2)[ 1 − 2 ∫_{m_ν}^{∞} dC p(C) ]
    = (1/2) erf( m_ν/√(2p/K) ) + (1/2) erf( m_ν/√(2p/K) ).

It follows that m_ν ≈ erf( m_ν/√(2p/K) ). The same conclusion holds if we instead assume that m_ν < 0.

This result is essentially the same as for the fully connected Hopfield network obtained under the mean-field approximation, but here the argument of the error function contains p/K instead of the storage capacity p/N.
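The steady-state order parameter can be obtained by iterating this self-consistency equation. The sketch below (my own illustration; the parameter values are arbitrary) iterates m ← erf(m/√(2p/K)) and shows how the retrieval quality deteriorates as p/K grows.

```python
from math import erf, sqrt

def solve_m(p_over_K, m0=1.0, iters=200):
    """Fixed-point iteration of m = erf(m / sqrt(2 p/K))."""
    m = m0
    for _ in range(iters):
        m = erf(m / sqrt(2 * p_over_K))
    return m

for p_over_K in [0.05, 0.1, 0.3, 0.5, 0.7]:
    print(f"p/K = {p_over_K:4.2f}  ->  m = {solve_m(p_over_K):.4f}")
```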
2.14 Mixed states
(a) The signs s_μ = x_j^(μ) x_j^(mix) for the eight possible bit combinations are listed in Table 1 (see also Table 2.1 in the lecture notes).

Table 1: Signs of s_μ = x_j^(μ) x_j^(mix).

x_j^(1)  x_j^(2)  x_j^(3)  x_j^(mix)   s_1  s_2  s_3
  -1       -1       -1       -1          1    1    1
  -1       -1        1       -1          1    1   -1
  -1        1       -1       -1          1   -1    1
  -1        1        1        1         -1    1    1
   1       -1       -1       -1         -1    1    1
   1       -1        1        1          1   -1    1
   1        1       -1        1          1    1   -1
   1        1        1        1          1    1    1

(b) We find that ⟨s_1⟩, ⟨s_2⟩ and ⟨s_3⟩ are averages of independent random variables X such that p(X = −1) = 1/4 and p(X = 1) = 3/4. Thus, the expected value of X is E[X] = −1/4 + 3/4 = 1/2, and the variance of X is Var[X] = E[(X − 1/2)²] = (1/4)(−3/2)² + (3/4)(1/2)² = 12/16 = 3/4. Thus, E[⟨s_μ⟩] = 1/2 for μ = 1, 2, 3 and Var[⟨s_μ⟩] = (1/N²)·N·(3/4) = 3/(4N) → 0 as N → ∞. Thus ⟨s_μ⟩ = 1/2 for μ = 1, 2, 3.

For μ > 3, we find that ⟨s_μ⟩ is 1/N times the sum of N independent random variables X with p(X = −1) = 1/2 and p(X = 1) = 1/2. Thus E[X] = 0 and Var[X] = 1, and we conclude that, for μ > 3, E[⟨s_μ⟩] = 0 and Var[⟨s_μ⟩] = (1/N²)·N = 1/N → 0 as N → ∞, and thus ⟨s_μ⟩ = 0 for μ > 3.
(c) We have

Σ_{j=1}^N w_ij x_j^(mix) = (1/N) Σ_{j=1}^N Σ_{μ=1}^p x_i^(μ) x_j^(μ) x_j^(mix) = Σ_{μ=1}^p x_i^(μ) (1/N) Σ_{j=1}^N x_j^(μ) x_j^(mix) = Σ_{μ=1}^p x_i^(μ) ⟨s_μ⟩.

(d) We apply the update rule and use (c) to conclude that

Σ_{j=1}^N w_ij x_j^(mix) = Σ_{μ=1}^3 x_i^(μ) ⟨s_μ⟩ + Σ_{μ>3} x_i^(μ) ⟨s_μ⟩ = (1/2) Σ_{μ=1}^3 x_i^(μ) = (1/2)( x_i^(1) + x_i^(2) + x_i^(3) ),

since ⟨s_μ⟩ = 1/2 for μ = 1, 2, 3. The cross-talk term can be neglected since ⟨s_μ⟩ = 0 for μ > 3 (from the previous part of this exercise). Therefore

sgn( Σ_{j=1}^N w_ij x_j^(mix) ) = sgn( (1/2)( x_i^(1) + x_i^(2) + x_i^(3) ) ) = x_i^(mix).
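A quick numerical check of (b) and (d): the sketch below (written for this solution; N and p are arbitrary) draws random patterns, forms the mixed state, measures the overlaps ⟨s_μ⟩ and verifies that one sgn update leaves x^(mix) unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 2000, 10
x = rng.choice([-1, 1], size=(p, N))                 # random stored patterns
x_mix = np.sign(x[0] + x[1] + x[2])                  # mixed state of the first three

s = (x @ x_mix) / N                                  # overlaps <s_mu>
print("first three overlaps :", np.round(s[:3], 3))  # close to 1/2
print("remaining overlaps   :", np.round(s[3:], 3))  # close to 0

W = (x.T @ x) / N                                    # Hebb's rule
updated = np.sign(W @ x_mix)
print("mixed state unchanged:", np.array_equal(updated, x_mix))
```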
Chapter 3
Chapter 4
4.10 McCulloch-Pitts dynamics for restricted Boltzmann machine
The deterministic analogue is

v_j' = sgn( b_j^(v) )   with   b_j^(v) = Σ_{i=1}^M h_i w_ij,   (3)

for the visible neurons, and

h_i' = sgn( b_i^(h) )   with   b_i^(h) = Σ_{j=1}^N w_ij v_j,   (4)

for the hidden neurons.
Say that the hidden neurons h_{m1}, h_{m2}, … change sign when they update. We have

h_i' = h_i − 2 h_i δ_{i,m1} − 2 h_i δ_{i,m2} − …   (5)

and

H' = −Σ_{ij} w_ij h_i' v_j = H + 2 Σ_{ij} w_ij h_i δ_{i,m1} v_j + 2 Σ_{ij} w_ij h_i δ_{i,m2} v_j + …   (6)
   = H + 2 Σ_j w_{m1,j} h_{m1} v_j + 2 Σ_j w_{m2,j} h_{m2} v_j + …   (7)
   = H + 2 h_{m1} b_{m1}^(h) + 2 h_{m2} b_{m2}^(h) + …   (8)
   = H − 2 sgn( b_{m1}^(h) ) b_{m1}^(h) − 2 sgn( b_{m2}^(h) ) b_{m2}^(h) − …   (9)

In the last step we used that a neuron changes sign only if h_m = −sgn(b_m^(h)). Each subtracted term sgn(b_m^(h)) b_m^(h) = |b_m^(h)| is non-negative, so H' ≤ H.
Say that the visible neurons v_{m1}, v_{m2}, … change sign when they update. We have

v_j' = v_j − 2 v_j δ_{j,m1} − 2 v_j δ_{j,m2} − …   (10)

and

H' = −Σ_{ij} w_ij h_i v_j' = H + 2 Σ_{ij} w_ij h_i δ_{j,m1} v_j + 2 Σ_{ij} w_ij h_i δ_{j,m2} v_j + …   (11)
   = H + 2 Σ_i w_{i,m1} h_i v_{m1} + 2 Σ_i w_{i,m2} h_i v_{m2} + …   (12)
   = H + 2 b_{m1}^(v) v_{m1} + 2 b_{m2}^(v) v_{m2} + …   (13)
   = H − 2 b_{m1}^(v) sgn( b_{m1}^(v) ) − 2 b_{m2}^(v) sgn( b_{m2}^(v) ) − …   (14)

Again each subtracted term is non-negative, so the energy cannot increase under either half of the update.
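A small numerical illustration of this argument (my own sketch, with arbitrary sizes and random weights): alternately update the hidden and visible layers with the deterministic rule and check that H = −Σ_{ij} w_ij h_i v_j never increases.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 5, 8                                   # hidden and visible units
w = rng.normal(size=(M, N))                   # weights w_ij
h = rng.choice([-1, 1], size=M)
v = rng.choice([-1, 1], size=N)

def energy(h, v):
    return -h @ w @ v                         # H = -sum_ij w_ij h_i v_j

for step in range(6):
    h = np.sign(w @ v)                        # hidden update, eq. (4)
    print(f"after hidden update : H = {energy(h, v):+.3f}")
    v = np.sign(h @ w)                        # visible update, eq. (3)
    print(f"after visible update: H = {energy(h, v):+.3f}")
```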
Chapter 5
5.2 Boolean functions
(a)
In order for the problem to be solved by a simple perceptron with two input terminals and one output unit (no hidden layer), it must be linearly separable: there must exist a plane such that all inputs mapping to the target output +1 lie on one side of the plane and all inputs mapping to the target output −1 lie on the other side. The figure above shows that this is impossible for the Boolean XOR problem.
(b) 3D Boolean functions. Consider a function that has k input patterns that map to the target output +1 and 8 − k input patterns that map to −1.
5.5 Boolean functions
(a) Here, N = 3. Using that θ_i = 2 and w_ik = x_k^(i), we conclude that

−θ_i + Σ_k w_ik x_k^(μ) = −2 + Σ_k x_k^(i) x_k^(μ)   { > 0 if i = μ,
                                                       ≤ 0 if i ≠ μ. }

Thus

v^(i,μ) = { 1 if i = μ,
            0 if i ≠ μ. }

From figure 5.16 in the lecture notes, we can see that there are exactly 4 of the 8 possible inputs x^(μ) that are to be mapped to O^(μ) = 1. These are x^(μ) for μ = 2, 4, 5 and 7. These inputs give, respectively,

v^(2) = (0,1,0,0,0,0,0,0)^T,  v^(4) = (0,0,0,1,0,0,0,0)^T,  v^(5) = (0,0,0,0,1,0,0,0)^T,  and  v^(7) = (0,0,0,0,0,0,1,0)^T.

The weights W must satisfy the following:

O^(μ) = W^T v^(μ) = {  1 for μ ∈ {2, 4, 5, 7},
                      −1 for μ ∈ {1, 3, 6, 8}. }

This is achieved by letting

W = (−1, 1, −1, 1, 1, −1, 1, −1)^T.

Note that this solution only works for N = 3. If N = 2, there is no solution with θ_i = 2, since −2 + Σ_k x_k^(i) x_k^(μ) ≤ 0 for all i and μ, i.e. v^(i,μ) = 0 for all i and μ.
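The construction in (a) is easy to verify in code. The sketch below (my own illustration) assumes a particular ordering of the eight input corners x^(1), …, x^(8) — the actual ordering is fixed by figure 5.16 in the lecture notes, so the index set {2, 4, 5, 7} is used here only as an example target assignment — and checks that the two-layer network with θ_i = 2, w_ik = x_k^(i) and the output weights W reproduces the chosen Boolean function.

```python
import itertools
import numpy as np

# assumed ordering of the 2^3 input corners (the book's figure fixes the real one)
patterns = np.array(list(itertools.product([-1, 1], repeat=3)))   # x^(1) ... x^(8)
targets = np.array([-1, 1, -1, 1, 1, -1, 1, -1])                  # +1 for mu in {2,4,5,7}

w_hidden = patterns.copy()          # w_ik = x_k^(i)
theta = 2.0                         # theta_i = 2
W_out = targets.astype(float)       # W_mu = target of pattern mu

def heaviside(b):
    return (b > 0).astype(float)    # v = 1 only if the argument is positive

for x, t in zip(patterns, targets):
    v = heaviside(w_hidden @ x - theta)       # one hidden unit per corner
    O = np.sign(W_out @ v)                    # output unit
    print(x, "->", int(O), "(target", t, ")")
```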
(b) The solution in (a) amounts to separating each corner x^(μ) of the cube of input patterns by letting

v^(i,μ) = { 1 if i = μ,
            0 if i ≠ μ. }

Thus, the solution requires 2³ = 8 hidden neurons. The analogous solution in 2D separates each corner of a square, and it requires 2^N = 2² = 4 neurons. The decision boundaries of the hidden neurons are shown here:
5.6 Linearly inseparable problem
(a)
In the figure above, x^(A) and x^(B) are to have output 1 and x^(C) and x^(D) are to have output 0. There is no straight line that can separate patterns x^(A) and x^(B) from patterns x^(C) and x^(D).
(b) The triangle corners are:

x^(1) = (−4, 0)^T,   x^(2) = (4, −1)^T,   x^(3) = (0, 3)^T.

Let b_1 = 0 at x^(1) and x^(2). Here, b_1 is the argument of the activation function for hidden neuron one. This implies

0 = w_11 x_1^(1) + w_12 x_2^(1) − θ_1 = −4 w_11 − θ_1   ⇒   θ_1 = −4 w_11,
0 = w_11 x_1^(2) + w_12 x_2^(2) − θ_1 = 4 w_11 − w_12 − θ_1   ⇒   w_12 = 4 w_11 − θ_1 = 8 w_11.

We choose w_11 = 1, w_12 = 8 and θ_1 = −4. Now, let b_2 = 0 at x^(2) and x^(3). This implies

0 = w_21 x_1^(2) + w_22 x_2^(2) − θ_2 = 4 w_21 − w_22 − θ_2   ⇒   w_22 = 4 w_21 − θ_2 = 4 w_21 − 3 w_22   ⇒   w_22 = w_21,
0 = w_21 x_1^(3) + w_22 x_2^(3) − θ_2 = 3 w_22 − θ_2   ⇒   θ_2 = 3 w_22.

We choose w_21 = w_22 = 1 and θ_2 = 3. Now, let b_3 = 0 at x^(3) and x^(1). This implies

0 = w_31 x_1^(3) + w_32 x_2^(3) − θ_3 = 3 w_32 − θ_3   ⇒   θ_3 = 3 w_32,
0 = w_31 x_1^(1) + w_32 x_2^(1) − θ_3 = −4 w_31 − θ_3   ⇒   3 w_32 = −4 w_31.

We choose w_32 = 4, w_31 = −3 and θ_3 = 12.
In summary:

w = (  1  8
       1  1
      −3  4 ),     θ = ( −4
                          3
                         12 ).

The origin maps to

V = H( w (0, 0)^T − θ ) = H( (4, −3, −12)^T ) = (1, 0, 0)^T.

We know that the origin maps to V = (1, 0, 0)^T and that the hidden neurons change values at the dashed lines in the figure. Thus, we can conclude which regions in input space map to which regions in the hidden space. We want V = (1, 0, 0)^T to map to 1 and all other possible values of V to map to 0. The hidden space can be illustrated like this:
W must be normal to the plane passing through the crosses in this picture. Also, W points to V = (1, 0, 0)^T from V = (0, 1, 1)^T. We may choose

W = (1, 0, 0)^T − (0, 1, 1)^T = (1, −1, −1)^T.

We know that the point V = (1/2, 0, 0)^T lies on the decision boundary we are looking for. So

W^T (1/2, 0, 0)^T − Θ = 0   ⇒   Θ = 1/2.
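The complete two-layer network for this problem can be checked numerically. The sketch below (my own test points and code, using the weights derived above) classifies a few points inside and outside the triangle with corners (−4, 0), (4, −1), (0, 3).

```python
import numpy as np

w = np.array([[1.0, 8.0], [1.0, 1.0], [-3.0, 4.0]])   # hidden weights
theta = np.array([-4.0, 3.0, 12.0])                   # hidden thresholds
W = np.array([1.0, -1.0, -1.0])                       # output weights
Theta = 0.5                                           # output threshold

def classify(x):
    x = np.asarray(x, dtype=float)
    V = ((w @ x - theta) > 0).astype(float)           # Heaviside hidden units
    return int(W @ V - Theta > 0)                     # 1 inside, 0 outside

# hand-picked test points: three inside the triangle, three outside
for x in [(0, 0), (1, 1), (-3, 0.2), (4, 3), (-5, 0), (0, -2)]:
    print(x, "->", classify(x))
```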
5.7 Perceptron with one hidden layer

(a) W = (1, −1, …)^T, Θ = 1/2. Evaluating W^T V − Θ for the hidden representation V of each input pattern (see the illustration below) gives a positive value, and hence output O = 1, exactly for the patterns with target output 1, and a negative value (output O = 0) for the remaining patterns.

Illustration:
(b) Consider the sequence of patterns x^(1) = (1.5, −0.5)^T, x^(2) = (1.5, 0.5)^T, and x^(3) = (1.5, 1.5)^T. The corresponding sequence of v_3^(μ) is v_3^(1) = 0, v_3^(2) = 1 and v_3^(3) = 0. For a sequence of patterns where x_1^(μ) stays constant and x_2^(μ) increases monotonically, the function −θ_3 + Σ_j w_3j x_j^(μ) must increase (decrease) monotonically if w_32 is positive (negative), or stay constant if w_32 = 0. This is not the case for the sequence of v_3^(μ) values required for x^(1), x^(2) and x^(3), so no choice of w_3j and θ_3 works.
(c) The solution is shown in the following figure:
5.8 Multilayer perceptron
The solid line passes through the points (0, 1) and (1, 0):

(1, 0)  ⇒  w_11·1 + w_12·0 − θ_1 = 0  ⇒  w_11 = θ_1,
(0, 1)  ⇒  w_11·0 + w_12·1 − θ_1 = 0  ⇒  w_12 = θ_1.

So w_11 = w_12 = θ_1. The dash-dotted line passes through the points (0.5, 1) and (0.5, 0):

(0.5, 0)  ⇒  0.5 w_21 + 0·w_22 − θ_2 = 0  ⇒  θ_2 = 0.5 w_21,
(0.5, 1)  ⇒  0.5 w_21 + w_22 − θ_2 = 0  ⇒  w_22 = 0.

So w_22 = 0 and θ_2 = 0.5 w_21. The dashed line passes through the points (0, 0.8) and (1, 0.8):

(0, 0.8)  ⇒  0·w_31 + 0.8 w_32 − θ_3 = 0  ⇒  θ_3 = 0.8 w_32,
(1, 0.8)  ⇒  1·w_31 + 0.8 w_32 − θ_3 = 0  ⇒  w_31 = 0.

So w_31 = 0 and θ_3 = 0.8 w_32. These conditions are realized by, for instance,

w = ( 1  1
      1  0
      0  1 ),     θ = ( 1
                        0.5
                        0.8 ).

Evaluate the output at (x_1, x_2) = (0, 0):

V_1 = H( 1·0 + 1·0 − 1 ) = 0,
V_2 = H( 1·0 + 0·0 − 0.5 ) = 0,
V_3 = H( 0·0 + 1·0 − 0.8 ) = 0.

So the origin in the input space maps to V = (0, 0, 0)^T. Note that V_1 flips at the solid line, V_2 flips at the dash-dotted line and V_3 flips at the dashed line. Given that the origin maps to (0, 0, 0), and with the input patterns located according to the figure, the coordinates in the transformed space are:

μ         V
1, 2, 3   (1, 0, 1)
4, 5, 6   (1, 0, 0)
7, 8      (1, 1, 0)
9, 10     (0, 0, 0)
11        (0, 1, 0)

Graphical illustration of the hidden space:

Yes, the problem is linearly separable, since we can separate the two classes with the plane passing through the crosses in the figure.
(e) The crosses drawn in the last figure of part (d) are P_1 = (1, 1/2, 0), P_2 = (1/2, 1, 0) and P_3 = (0, 0, 1/2). Placing the decision boundary in the plane containing these points implies W_1 V_1 + W_2 V_2 + W_3 V_3 − Θ = 0 at each of them:

P_1  ⇒  W_1 + (1/2) W_2 = Θ,   (15)
P_2  ⇒  (1/2) W_1 + W_2 = Θ,   (16)
P_3  ⇒  (1/2) W_3 = Θ.   (17)

(15), (16)  ⇒  W_1 + (1/2) W_2 = (1/2) W_1 + W_2  ⇒  W_1 = W_2,   (18)
(15), (18)  ⇒  (3/2) W_1 = (3/2) W_2 = Θ,   (19)
(17), (19)  ⇒  (1/2) W_3 = (3/2) W_1 = (3/2) W_2 = Θ.   (20)

Note that we want the origin of the transformed space to be mapped to +1 (from patterns 9 and 10). Thus 0·W_1 + 0·W_2 + 0·W_3 − Θ > 1, i.e. Θ < −1. We may choose

Θ = −2,   W = ( −4/3
                −4/3
                −4 ).
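As a check (my own sketch, using only the hidden-space table and the weights chosen above), the output unit with W = (−4/3, −4/3, −4)^T and Θ = −2 can be evaluated for the five distinct hidden representations:

```python
import numpy as np

W = np.array([-4/3, -4/3, -4.0])
Theta = -2.0

hidden_reps = {                       # from the table in part (d)
    "patterns 1-3": (1, 0, 1),
    "patterns 4-6": (1, 0, 0),
    "patterns 7-8": (1, 1, 0),
    "patterns 9-10": (0, 0, 0),
    "pattern 11": (0, 1, 0),
}

for label, V in hidden_reps.items():
    b = W @ np.array(V, dtype=float) - Theta
    print(f"{label}: V={V}, W.V - Theta = {b:+.2f}, output = {int(np.sign(b))}")
```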
Chapter 6

6.2 Principal-component analysis

Check that ⟨x_j⟩ = 0:

Σ_{μ=1}^5 x_1^(μ) = −6 − 2 + 2 + 1 + 5 = 0,
Σ_{μ=1}^5 x_2^(μ) = −5 − 4 + 2 + 3 + 4 = 0.

So the data has zero mean in all components. This means that the matrix C′ equals the data-covariance matrix C. Therefore drop the prime in the following. We have

C = (1/5) Σ_{μ=1}^5 x^(μ) x^(μ)T   (because the data has zero mean).

We compute the elements of C:

x^(1) x^(1)T = ( 36 30
                 30 25 ),
x^(2) x^(2)T = (  4  8
                  8 16 ),
x^(3) x^(3)T = (  4  4
                  4  4 ),
x^(4) x^(4)T = (  1  3
                  3  9 ),
x^(5) x^(5)T = ( 25 20
                 20 16 ),

so that

5 C_11 = 36 + 4 + 4 + 1 + 25 = 70,
5 C_12 = 5 C_21 = 30 + 8 + 4 + 3 + 20 = 65,
5 C_22 = 25 + 16 + 4 + 9 + 16 = 70.

We find

C = ( 14 13
      13 14 ).
Eigenvalues:

0 = det( C − λ1 ) = (14 − λ)² − 13² = λ² − 28λ + 27   ⇒   λ = 14 ± √(14² − 27) = 14 ± √169 = 14 ± 13.

This gives the maximal eigenvalue λ_max = 27. Eigenvector u:

( C − λ_max 1 ) u = ( −13  13
                       13 −13 ) ( u_1
                                  u_2 ) = 0   ⇒   u_1 = u_2.

So the normalised eigenvector corresponding to the largest eigenvalue of C is given by

u = ±(1/√2) ( 1
              1 ).

This is the principal component. Discussion: refer to Section 6 in the lecture notes.
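A short numerical check of this eigen-decomposition (my own sketch; it starts directly from the covariance matrix computed above):

```python
import numpy as np

C = np.array([[14.0, 13.0],
              [13.0, 14.0]])

eigvals, eigvecs = np.linalg.eigh(C)            # eigenvalues in ascending order
print("eigenvalues        :", eigvals)          # 1 and 27
print("principal component:", eigvecs[:, -1])   # proportional to (1, 1)/sqrt(2)
```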
6.7 Backpropagation

6.8 Stochastic gradient descent

Solution starts on the next page.

6.9 Multi-layer perceptron
Let N_m denote the number of weights w_ij^(m) in the m-th layer. Let n_m denote the number of hidden units v_i^(m,μ) for m = 1, …, L−1, let n_0 denote the number of input units and let n_L denote the number of output units. The number of weights is Σ_{m=1}^L N_m = Σ_{m=1}^L n_{m−1} n_m and the number of thresholds is Σ_{m=1}^L n_m.

We have

∂v_i^(m,μ)/∂w_qr^(p) = ∂/∂w_qr^(p) g( b_i^(m,μ) ) = g′( b_i^(m,μ) ) ∂b_i^(m,μ)/∂w_qr^(p)
                     = g′( b_i^(m,μ) ) ∂/∂w_qr^(p) [ −θ_i^(m) + Σ_j w_ij^(m) v_j^(m−1,μ) ].

Using that p < m, we obtain

∂v_i^(m,μ)/∂w_qr^(p) = g′( b_i^(m,μ) ) Σ_j w_ij^(m) ∂v_j^(m−1,μ)/∂w_qr^(p).   (21)

But for p = m, we find:

∂v_i^(m,μ)/∂w_qr^(p) = g′( b_i^(m,μ) ) Σ_j δ_qi δ_rj v_j^(m−1,μ) = g′( b_i^(m,μ) ) δ_qi v_r^(m−1,μ).   (22)

We have w_qr^(L−2) ← w_qr^(L−2) + δw_qr^(L−2), where

δw_qr^(L−2) = −η ∂H/∂w_qr^(L−2).

From the definition of the energy function, we have

∂H/∂w_qr^(L−2) = ∂/∂w_qr^(L−2) (1/2) Σ_{μ,i} ( O_i^(μ) − t_i^(μ) )²   (23)
               = Σ_{μ,i} ( O_i^(μ) − t_i^(μ) ) ∂O_i^(μ)/∂w_qr^(L−2).   (24)

Define v_i^(L,μ) = O_i^(μ). We have

∂O_i^(μ)/∂w_qr^(L−2) = ∂v_i^(L,μ)/∂w_qr^(L−2),

which is evaluated by applying the recursion

∂v_i^(m,μ)/∂w_qr^(p) = { g′( b_i^(m,μ) ) Σ_j w_ij^(m) ∂v_j^(m−1,μ)/∂w_qr^(p)   if p < m,
                         g′( b_i^(m,μ) ) δ_qi v_r^(m−1,μ)                      if p = m. }
(c) See Oja’s M-rule in the lecture notes. Deduce normalisation from the steady-state condition 〈δwi j 〉 = 0.
9.5 Kohonen net
The parameter σ regulates the width of the neighbourhood function Λ(i, i_0). This function is centred at the winning neuron i_0 (Λ(i_0, i_0) = 1), and it decreases as the distance from this neuron increases. The update rule ensures that the weight vector assigned to the winning neuron i_0 is moved by a fraction η towards the fed pattern x. Moreover, the neighbourhood function ensures that the weights assigned to the remaining neurons are also moved towards x, by a fraction determined by the width of Λ(i, i_0): the fraction is larger the closer the neuron i is to the winning neuron i_0. In the limit σ → 0, the update rule reduces to simple competitive learning, because only the winning neuron for pattern x is updated in this case. In other words, the rule becomes:

δw_ij = { η (x_j − w_ij)   for i = i_0,
          0                otherwise. }

M output neurons are arranged in a d-dimensional lattice. The position of the i-th output neuron in the output space is a d-dimensional vector r_i (each r_i is fixed). The weight matrix W is an M × N matrix with elements w_ij, i = 1, …, M, j = 1, …, N. The weight vector w_i assigned to output neuron i is N-dimensional (as are the input patterns x): w_i = (w_i1, …, w_iN)^T. Learning consists of two phases: an ordering phase and a convergence phase. Usually, the ordering phase lasts for T_order ≈ 1000 updates. The convergence phase starts after the ordering phase is finished. Initialisation: the weights are initialised to random numbers from an interval relevant for the given problem, e.g. from [−1, 1]. Set the current update number t to t = 0. Learning algorithm:
1. Update t ← t + 1.
2. Draw a random pattern x from the set of input patterns.
3. Competition: find the neuron i_0 such that w_{i_0} is closest to x in the Euclidean sense, i.e. |x − w_{i_0}| ≤ |x − w_i| for all i.
4. Update the weight vectors: w_i ← w_i + δw_i, where
   δw_ij = η(t) Λ(i, i_0, t) (x_j − w_ij)   and   Λ(i, i_0, t) = exp( −|r_i − r_{i_0}|² / (2σ(t)²) ).
   The dependence on the update number t is as follows: if t < T_order then σ(t) = σ_0 exp(−t/τ) and η(t) = η_0 exp(−t/τ), where τ ≈ T_order/log(σ_0), σ_0 is set to the largest distance in the output lattice, and η_0 is set to a small (but not too small) value, usually η_0 = 0.1. When η_0 is not too small, the network is likely to get rid of kinks during the ordering phase. If, however, t > T_order (convergence phase, fine tuning), then both η and σ are kept constant and small (usually η = 0.01 and σ ≈ 1).
5. Repeat steps 1–4 until convergence.
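A minimal sketch of this learning algorithm (my own illustration, using a 1D output lattice, 2D inputs and the parameter choices described above; it is not part of the original solution):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 2                                  # output neurons on a 1D lattice, 2D inputs
r = np.arange(M, dtype=float)                 # lattice positions r_i
w = rng.uniform(-1, 1, size=(M, N))           # random initial weights
patterns = rng.uniform(-1, 1, size=(500, N))  # some input data

T_order, eta0 = 1000, 0.1
sigma0 = r.max() - r.min()                    # largest distance in the lattice
tau = T_order / np.log(sigma0)

for t in range(1, 5000):
    x = patterns[rng.integers(len(patterns))]             # step 2: draw a pattern
    i0 = np.argmin(np.linalg.norm(x - w, axis=1))         # step 3: competition
    if t < T_order:                                       # ordering phase
        sigma, eta = sigma0 * np.exp(-t / tau), eta0 * np.exp(-t / tau)
    else:                                                 # convergence phase
        sigma, eta = 1.0, 0.01
    Lam = np.exp(-(r - r[i0]) ** 2 / (2 * sigma ** 2))    # neighbourhood function
    w += eta * Lam[:, None] * (x - w)                     # step 4: update
```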
9.10
See next page.
Chapter 10
10.3 Radial basis functions for XOR
(a) A problem is solvable by a simple perceptron if it is linearly separable. For the two-dimensional XOR problem, this means one must find a plane separating the two types of target outputs. Look at this figure:
From this figure, we see that no such plane can be found: there will always be at least one point that is misclassified (a point on the "wrong side" of the plane). An example is shown by the dashed line in the figure: on the right-hand side of this plane there is only one pattern, with target output 1. On the opposite side there are two patterns with target output 0, but also one pattern with target output 1. The latter pattern would be misclassified as a pattern with output 0.
(b) Applying the radial-basis functions suggested in the task, we find that the input patterns have the following coordinates in the transformed space (g_1, g_2)^T:

g_1(x^(1)) = exp(−|(1,1)^T − (1,1)^T|²) = exp(0) = 1,
g_2(x^(1)) = exp(−|(1,1)^T − (0,0)^T|²) = exp(−2) ≈ 0.14,
g_1(x^(2)) = exp(−|(1,0)^T − (1,1)^T|²) = exp(−1) ≈ 0.37,
g_2(x^(2)) = exp(−|(1,0)^T − (0,0)^T|²) = exp(−1) ≈ 0.37,
g_1(x^(3)) = exp(−|(0,1)^T − (1,1)^T|²) = exp(−1) ≈ 0.37,
g_2(x^(3)) = exp(−|(0,1)^T − (0,0)^T|²) = exp(−1) ≈ 0.37,
g_1(x^(4)) = exp(−|(0,0)^T − (1,1)^T|²) = exp(−2) ≈ 0.14,
g_2(x^(4)) = exp(−|(0,0)^T − (0,0)^T|²) = exp(0) = 1.
Note that both input patterns x^(2) and x^(3) are transformed to the same vector in the transformed input space (g_1, g_2)^T. Applying a simple perceptron to the transformed input patterns (in the space (g_1, g_2)^T), the problem is linearly separable. Indeed, there is a decision boundary separating the two classes of data: in this case, the network output at the decision boundary is equal to 0.5 (because the activation function is linear):

w^T g − θ = 0.5   for g on the decision boundary,

where w = (w_1, w_2)^T. [Note: if the target outputs are +1 or −1, then we would instead use the sigmoid activation function and require the network output at the boundary to be equal to zero.] The weights and the threshold for the simple perceptron corresponding to the decision boundary shown in the following figure are w_1 = w_2 = −1 and θ = −1.5.

Note that there are other possible solutions. In particular, the weight vector −w with the corresponding threshold is also a possible solution, but one must check the network output for the given problem to decide which solution to take. In this task, one requires that the network outputs for the two blue points are smaller than 0.5, whereas for the red point the network output is required to be larger than 0.5.
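These numbers are easy to verify. The short sketch below (my own check, not part of the original solution) transforms the four XOR patterns with the two radial basis functions and evaluates the linear output unit with w_1 = w_2 = −1, θ = −1.5:

```python
import numpy as np

patterns = np.array([(1, 1), (1, 0), (0, 1), (0, 0)], dtype=float)
targets = [0, 1, 1, 0]                                   # XOR
centres = np.array([(1, 1), (0, 0)], dtype=float)        # centres of g1 and g2

w, theta = np.array([-1.0, -1.0]), -1.5

for x, t in zip(patterns, targets):
    g = np.exp(-np.sum((x - centres) ** 2, axis=1))      # (g1, g2)
    O = w @ g - theta                                    # linear output unit
    print(f"x={x}, g={np.round(g, 2)}, output={O:.2f}, class={int(O > 0.5)} (target {t})")
```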
10.3 Radial basis functions
(a) Is the problem linearly separable? Graphical representation:
Clearly, no plane separates the white balls from the black balls, i.e. the problem isn’t linearly separable.
(b) w_1 = (−1, 1, 1)^T, w_2 = (1, 1, −1)^T, and g_i^(μ) = exp( −(1/4) |x^(μ) − w_i|² ) for i = 1, 2.

μ    |x^(μ) − w_1|²   |x^(μ) − w_2|²   g_1^(μ)    g_2^(μ)    t^(μ)
1          8                8          exp(−2)    exp(−2)      1
2          4               12          exp(−1)    exp(−3)      1
3          4                4          exp(−1)    exp(−1)      1
4          0                8          exp(0)     exp(−2)     −1
5         12                4          exp(−3)    exp(−1)      1
6          8                8          exp(−2)    exp(−2)      1
7          8                0          exp(−2)    exp(0)      −1
8          4                4          exp(−1)    exp(−1)      1

Here exp(0) = 1, exp(−1) ≈ 0.37, exp(−2) ≈ 0.14, exp(−3) ≈ 0.05. Graphical presentation in the (g_1, g_2)^T space:
(c) Weight vector W and threshold Θ: since the activation function is sgn(·), the decision boundary satisfies W_1 g_1 + W_2 g_2 − Θ = 0. Choosing W = (−1, −1)^T and letting the boundary pass through (g_1, g_2) = (1, 0) gives Θ = −1. Answer: W = (−1, −1)^T, Θ = −1.
Chapter 11