King’s College London
Department of Mathematics
Academic year 2019–2020, Spring term
Dr Blanka Horvath
7CCMFM18 Machine Learning
Problem sheet 1 — Solutions
Problem 1
(a) Brief inspection suggests that for $x_1, x_2 \in \{0,1\}$,
\[
\mathrm{OR}(x_1,x_2) = \begin{cases} 0, & x_1+x_2 < 1, \\ 1, & x_1+x_2 \geq 1, \end{cases}
\qquad
\mathrm{AND}(x_1,x_2) = \begin{cases} 0, & x_1+x_2 < 2, \\ 1, & x_1+x_2 \geq 2, \end{cases}
\]
\[
\mathrm{NAND}(x_1,x_2) = \begin{cases} 0, & x_1+x_2 \geq 2, \text{ i.e. } -x_1-x_2 \leq -2, \\ 1, & x_1+x_2 < 2, \text{ i.e. } -x_1-x_2 > -2. \end{cases}
\]
Therefore, say,
\[
\mathrm{OR}(x_1,x_2) = f(x_1,x_2;1,1,-1), \qquad
\mathrm{NAND}(x_1,x_2) = f(x_1,x_2;-1,-1,1.5), \qquad
\mathrm{AND}(x_1,x_2) = f(x_1,x_2;1,1,-2),
\]
for $x_1,x_2 \in \{0,1\}$.
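These parameter choices are easily verified numerically. Below is a minimal sketch, assuming the convention used throughout this sheet that $f(x_1,x_2;w_1,w_2,b) = \mathbf{1}\{w_1 x_1 + w_2 x_2 + b \geq 0\}$; it checks all three gates on the four binary inputs.

```python
from itertools import product

def f(x1, x2, w1, w2, b):
    """Perceptron: returns 1 if w1*x1 + w2*x2 + b >= 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

gates = {
    "OR":   ((1, 1, -1),    lambda x1, x2: x1 | x2),
    "NAND": ((-1, -1, 1.5), lambda x1, x2: 1 - (x1 & x2)),
    "AND":  ((1, 1, -2),    lambda x1, x2: x1 & x2),
}

for name, ((w1, w2, b), truth) in gates.items():
    assert all(f(x1, x2, w1, w2, b) == truth(x1, x2)
               for x1, x2 in product((0, 1), repeat=2)), name
```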
(b) Suppose that there exist $w_1, w_2 \in \mathbb{R}$ and $b \in \mathbb{R}$ such that $\mathrm{XOR}(x_1,x_2) = f(x_1,x_2;w_1,w_2,b)$ for $x_1,x_2 \in \{0,1\}$. By the definition of XOR,
\[
\begin{aligned}
w_1 \cdot 0 + w_2 \cdot 0 + b < 0 &\iff b < 0, \\
w_1 \cdot 1 + w_2 \cdot 0 + b \geq 0 &\iff w_1 + b \geq 0, \\
w_1 \cdot 0 + w_2 \cdot 1 + b \geq 0 &\iff w_2 + b \geq 0, \\
w_1 \cdot 1 + w_2 \cdot 1 + b < 0 &\iff w_1 + w_2 + b < 0.
\end{aligned}
\]
The first and second inequalities imply that $w_1 \geq -b > 0$. Thus, by the third inequality, $w_1 + (w_2 + b) \geq w_1 > 0$, which contradicts the fourth inequality. This shows that we cannot represent XOR with the perceptron.
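As a complementary (and necessarily non-exhaustive) numerical sanity check, a grid search over parameter triples finds no perceptron reproducing XOR, in line with the proof above; a quick sketch:

```python
import itertools
import numpy as np

def f(x1, x2, w1, w2, b):
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = np.linspace(-3.0, 3.0, 25)
hits = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if all(f(x1, x2, w1, w2, b) == y for (x1, x2), y in xor.items())
]
assert not hits  # no parameter triple in the grid reproduces XOR
```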
Problem 2
(a) A function $f \in \mathcal{N}_r(I,d_1,\ldots,d_{r-1},O;\sigma_1,\ldots,\sigma_r)$ is expressed as
\[
f = \sigma_r \circ L_r \circ \cdots \circ \sigma_1 \circ L_1,
\]
where the affine functions $L_1,\ldots,L_r$ characterise $f$. For each $i = 1,\ldots,r$, the function $L_i$ is parameterised by a $d_i \times d_{i-1}$ weight matrix and a $d_i$-dimensional bias vector (with $d_0 = I$ and $d_r = O$), that is, by $d_{i-1}d_i + d_i = (d_{i-1}+1)d_i$ parameters. Summing up, we get
\[
\sum_{i=1}^{r} (d_{i-1}+1)\,d_i
\]
parameters in total, as claimed.
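The formula is easily cross-checked by counting the entries of explicit weight matrices and bias vectors; a small sketch with an illustrative architecture (the helper name is ours):

```python
import numpy as np

def n_params(dims):
    """Parameter count of a network with widths dims = (I, d_1, ..., d_{r-1}, O):
    the sum of (d_{i-1} + 1) * d_i over the r affine layers."""
    return sum((d_prev + 1) * d for d_prev, d in zip(dims[:-1], dims[1:]))

dims = (3, 5, 4, 2)  # illustrative: I = 3, hidden widths 5 and 4, O = 2
weights = [np.zeros((d, d_prev)) for d_prev, d in zip(dims[:-1], dims[1:])]
biases = [np.zeros(d) for d in dims[1:]]
assert n_params(dims) == sum(W.size for W in weights) + sum(b.size for b in biases)  # 54
```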
(b) Following the hint, consider $f \in \mathcal{N}_r(I,d_1,\ldots,d_{r-1},O;g_1,\ldots,g_{r-1},\sigma_r)$ and let $d_i' = d_i + 1$ and $d_j' = d_j$ for $j \neq i$. Let us look into
\[
L_i(x) = W_i x + b_i, \quad W_i \in \mathbb{R}^{d_i \times d_{i-1}},\; b_i \in \mathbb{R}^{d_i},
\qquad
L_{i+1}(x) = W_{i+1} x + b_{i+1}, \quad W_{i+1} \in \mathbb{R}^{d_{i+1} \times d_i},\; b_{i+1} \in \mathbb{R}^{d_{i+1}}.
\]
Define first, using $0_d = (0,\ldots,0) \in \mathbb{R}^d$,
\[
W_i' := \begin{bmatrix} W_i \\ 0_{d_{i-1}}^\top \end{bmatrix} \in \mathbb{R}^{d_i' \times d_{i-1}},
\qquad
b_i' := \begin{bmatrix} b_i \\ 0 \end{bmatrix} \in \mathbb{R}^{d_i'},
\qquad
W_{i+1}' := \begin{bmatrix} W_{i+1} & 0_{d_{i+1}} \end{bmatrix} \in \mathbb{R}^{d_{i+1} \times d_i'},
\]
and then $L_i' : \mathbb{R}^{d_{i-1}} \to \mathbb{R}^{d_i'}$ and $L_{i+1}' : \mathbb{R}^{d_i'} \to \mathbb{R}^{d_{i+1}}$ by
\[
L_i'(x) := W_i' x + b_i', \qquad L_{i+1}'(x) := W_{i+1}' x + b_{i+1}.
\]
By construction, the extra unit created by $L_i'$ outputs $g_i(0)$, which the zero column of $W_{i+1}'$ then annihilates, so
\[
L_{i+1}' \circ \underbrace{(g_i,\ldots,g_i)}_{d_i' \text{ times}} \circ\, L_i'
= L_{i+1} \circ \underbrace{(g_i,\ldots,g_i)}_{d_i \text{ times}} \circ\, L_i,
\]
whereby
\[
f = \sigma_r \circ L_r \circ g_{r-1} \circ L_{r-1} \circ \cdots \circ L_{i+1} \circ g_i \circ L_i \circ \cdots \circ g_1 \circ L_1
= \sigma_r \circ L_r \circ g_{r-1} \circ L_{r-1} \circ \cdots \circ L_{i+1}' \circ g_i \circ L_i' \circ \cdots \circ g_1 \circ L_1 =: f',
\]
while $f' \in \mathcal{N}_r(I,d_1',\ldots,d_{r-1}',O;g_1,\ldots,g_{r-1},\sigma_r)$. In the general case we simply apply the result just proved repeatedly, adding one unit at a time, until each hidden layer reaches the specified width.
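The single-unit widening step can be checked numerically; below is a minimal sketch with illustrative dimensions, taking $g_i = \mathrm{ReLU}$ for concreteness (any activation works, since the appended unit is multiplied by a zero column).

```python
import numpy as np

rng = np.random.default_rng(1)
d_prev, d_i, d_next = 3, 4, 2
W_i, b_i = rng.normal(size=(d_i, d_prev)), rng.normal(size=d_i)
W_next, b_next = rng.normal(size=(d_next, d_i)), rng.normal(size=d_next)

# Widened layer: zero row and zero bias entry into L_i', zero column out of L_{i+1}'.
W_i_wide = np.vstack([W_i, np.zeros((1, d_prev))])
b_i_wide = np.append(b_i, 0.0)
W_next_wide = np.hstack([W_next, np.zeros((d_next, 1))])

g = lambda z: np.maximum(z, 0.0)  # g_i = ReLU for illustration
x = rng.normal(size=d_prev)
assert np.allclose(
    W_next @ g(W_i @ x + b_i) + b_next,
    W_next_wide @ g(W_i_wide @ x + b_i_wide) + b_next,
)
```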
Problem 3
As suggested in the hint, the crucial observation is that
\[
\mathrm{Id}(x) = x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x), \quad x \in \mathbb{R}. \tag{1}
\]
To make use of the identity (1), define
\[
M_d := \begin{bmatrix}
1 & & & \\
-1 & & & \\
& 1 & & \\
& -1 & & \\
& & \ddots & \\
& & & 1 \\
& & & -1
\end{bmatrix} \in \mathbb{R}^{2d \times d}
\]
and, for any $W = [W_{i,j}]_{1 \leq i \leq d',\, 1 \leq j \leq d} \in \mathbb{R}^{d' \times d}$,
\[
\overline{W} := \begin{bmatrix}
W_{1,1} & -W_{1,1} & W_{1,2} & -W_{1,2} & \cdots & W_{1,d} & -W_{1,d} \\
W_{2,1} & -W_{2,1} & W_{2,2} & -W_{2,2} & \cdots & W_{2,d} & -W_{2,d} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
W_{d',1} & -W_{d',1} & W_{d',2} & -W_{d',2} & \cdots & W_{d',d} & -W_{d',d}
\end{bmatrix} \in \mathbb{R}^{d' \times 2d}.
\]
Moreover, write $\mathrm{ReLU}_d(x) := (\mathrm{ReLU}(x_1),\ldots,\mathrm{ReLU}(x_d)) \in \mathbb{R}^d$ for $x = (x_1,\ldots,x_d) \in \mathbb{R}^d$. Now note that $\mathrm{ReLU}_{2d}(M_d x) = (\mathrm{ReLU}(x_1), \mathrm{ReLU}(-x_1), \ldots, \mathrm{ReLU}(x_d), \mathrm{ReLU}(-x_d))$, whence
\[
\overline{W}\, \mathrm{ReLU}_{2d}(M_d x)
= W\, \mathrm{ReLU}_d(x) - W\, \mathrm{ReLU}_d(-x)
= W \bigl( \mathrm{ReLU}_d(x) - \mathrm{ReLU}_d(-x) \bigr)
= W x \tag{2}
\]
by (1).
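Identity (2) is easy to verify numerically. A minimal sketch, where `M` and `W_bar` mirror $M_d$ and $\overline{W}$ above (the Kronecker-product construction is merely one convenient way to build them):

```python
import numpy as np

def M(d):
    """2d x d matrix interleaving rows e_i and -e_i, so that
    M(d) @ x = (x_1, -x_1, ..., x_d, -x_d)."""
    return np.kron(np.eye(d), np.array([[1.0], [-1.0]]))

def W_bar(W):
    """d' x 2d matrix interleaving the columns W_{.,j} and -W_{.,j}."""
    return np.kron(W, np.array([1.0, -1.0]))

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
assert np.allclose(W_bar(W) @ relu(M(5) @ x), W @ x)  # identity (2)
```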
Consider now
\[
f = \sigma \circ L_r \circ \mathrm{ReLU}_{d_{r-1}} \circ L_{r-1} \circ \cdots \circ \mathrm{ReLU}_{d_1} \circ L_1 \in \mathcal{N}_r(I,d_1,\ldots,d_{r-1},O;\mathrm{ReLU},\ldots,\mathrm{ReLU},\sigma),
\]
where $L_r(x) = W_r x + b_r$, $x \in \mathbb{R}^{d_{r-1}}$. Define $L_r' : \mathbb{R}^{d_{r-1}} \to \mathbb{R}^{2d_{r-1}}$ by
\[
L_r'(x) := M_{d_{r-1}} x
\]
and $L_{r+1}' : \mathbb{R}^{2d_{r-1}} \to \mathbb{R}^{d_r}$ by
\[
L_{r+1}'(x) := \overline{W_r}\, x + b_r.
\]
Thanks to (2),
\[
(L_{r+1}' \circ \mathrm{ReLU}_{2d_{r-1}} \circ L_r')(x) = W_r x + b_r = L_r(x), \quad x \in \mathbb{R}^{d_{r-1}}.
\]
Consequently,
\[
f = \sigma \circ \underbrace{L_{r+1}' \circ \mathrm{ReLU}_{2d_{r-1}} \circ L_r'}_{=L_r} \circ\, \mathrm{ReLU}_{d_{r-1}} \circ L_{r-1} \circ \cdots \circ \mathrm{ReLU}_{d_1} \circ L_1,
\]
whereby
\[
f \in \mathcal{N}_{r+1}(I,d_1,\ldots,d_{r-1},2d_{r-1},O;\mathrm{ReLU},\ldots,\mathrm{ReLU},\sigma).
\]
Problem 4
(a) Note first that
\[
\frac{e^{\lambda x_i}}{\sum_{j=1}^d e^{\lambda x_j}}
= \frac{e^{\lambda x_i}}{e^{\lambda x_i} \sum_{j=1}^d e^{\lambda(x_j - x_i)}}
= \frac{1}{1 + \sum_{j \neq i} e^{\lambda(x_j - x_i)}},
\]
where $x_j - x_i < 0$ for any $j \neq i$, since $x_i$ is the unique largest coordinate, whereby
\[
\lim_{\lambda \to \infty} \sum_{j \neq i} e^{\lambda(x_j - x_i)} = 0,
\]
so that
\[
\lim_{\lambda \to \infty} \frac{e^{\lambda x_i}}{\sum_{j=1}^d e^{\lambda x_j}} = 1.
\]
Since the components of softmax sum up to one, it follows that
\[
\lim_{\lambda \to \infty} \frac{e^{\lambda x_k}}{\sum_{j=1}^d e^{\lambda x_j}} = 0 \quad \text{for } k \neq i.
\]
Thus $\lim_{\lambda \to \infty} \mathrm{Softmax}_d(\lambda x_1, \ldots, \lambda x_d)$ equals the argmax function of $(x_1, \ldots, x_d)$, which returns one for the largest coordinate and zero for the others. Softmax can thus be seen as a smoothed version of argmax, the latter being the limit of the former when the scaling of the inputs is blown up.
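Numerically, the convergence to the one-hot argmax vector is visible already for moderate $\lambda$; a small sketch (the stabilising shift by the maximum leaves the softmax value unchanged):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilising shift before exponentiating
    e = np.exp(z)
    return e / e.sum()

x = np.array([0.2, 1.3, 0.9, -0.4])  # unique maximum at index 1
for lam in (1.0, 10.0, 100.0):
    print(lam, np.round(softmax(lam * x), 4))
# The output approaches the argmax indicator (0, 1, 0, 0) as lam grows.
assert np.allclose(softmax(100.0 * x), np.eye(4)[1], atol=1e-4)
```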
(b) The layer $C_k$ with $k = (k_{-l'}, \ldots, k_l)$ can be represented with
\[
W := \begin{bmatrix}
k_{-l'} & k_{-l'+1} & \cdots & k_{l-1} & k_l & & & 0 \\
& k_{-l'} & k_{-l'+1} & \cdots & k_{l-1} & k_l & & \\
& & \ddots & & & \ddots & \ddots & \\
0 & & & k_{-l'} & k_{-l'+1} & \cdots & k_{l-1} & k_l
\end{bmatrix} \in \mathbb{R}^{(d-l-l') \times d}.
\]
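This banded structure is precisely the matrix of a sliding-window ("valid") correlation: by the rows of $W$ above, the $i$-th output is $\sum_m k_{-l'+m}\, x_{i+m}$. A short sketch checking this against numpy, with the kernel indices flattened to a plain array:

```python
import numpy as np

def conv_matrix(k, d):
    """(d - m + 1) x d banded matrix for a length-m kernel k = (k_{-l'}, ..., k_l),
    here flattened to a plain array; note d - m + 1 = d - l - l'."""
    m = len(k)
    W = np.zeros((d - m + 1, d))
    for i in range(d - m + 1):
        W[i, i:i + m] = k  # each row holds k, shifted one column right
    return W

rng = np.random.default_rng(3)
k, x = rng.normal(size=4), rng.normal(size=10)
assert np.allclose(conv_matrix(k, 10) @ x, np.correlate(x, k, mode="valid"))
```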