King’s College London
Department of Mathematics
Dr Blanka Horvath
Academic year 2019–2020, Spring term
MATH97231 Machine Learning Problem sheet 3
Problem 1
Show that any $f \in N_r(I, d_1, \ldots, d_{r-1}, O; \mathrm{Id}, \ldots, \mathrm{Id})$ is an affine function.
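The key step, assuming the lecture-notes convention that layer $k$ acts as $x \mapsto W_k x + b_k$, is that identity-activation layers compose to a single affine map:
\[
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2),
\]
so iterating this over the $r$ layers writes $f$ in the form $x \mapsto W x + b$ for some matrix $W$ and vector $b$.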
Problem 2
Let $f = f_\theta \in N_2(1,2,1;\mathrm{ReLU},\mathrm{Id})$ with parameters
\[
\theta = \Biggl(\ \underbrace{\begin{pmatrix} 4 \\ -4 \end{pmatrix}}_{=W_1},\ \underbrace{\begin{pmatrix} -2 \\ 3 \end{pmatrix}}_{=b_1};\ \underbrace{(0,\,2)}_{=W_2},\ \underbrace{2}_{=b_2}\ \Biggr).
\]
Let also $l(\hat{y}, y) = |\hat{y} - y|$, $\hat{y}, y \in \mathbb{R}$. Compute first $f_\theta(1)$ in a forward pass. Compute then $\nabla_\theta l(f_\theta(1), 0)$ by backpropagation. If necessary, set $\mathrm{ReLU}'(x) := 1$ and $\frac{\mathrm{d}}{\mathrm{d}x}|x| := 0$ for $x = 0$.
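The hand computation can be cross-checked numerically; the sketch below uses PyTorch's autograd in place of the required hand-written backpropagation and assumes the parameter values as reconstructed above.

```python
# Numerical cross-check of Problem 2, assuming the reconstructed parameters
# W1 = (4, -4)^T, b1 = (-2, 3)^T, W2 = (0, 2), b2 = 2.
# Autograd stands in for the hand-written backpropagation.
import torch

W1 = torch.tensor([[4.0], [-4.0]], requires_grad=True)   # shape (2, 1)
b1 = torch.tensor([-2.0, 3.0], requires_grad=True)       # shape (2,)
W2 = torch.tensor([[0.0, 2.0]], requires_grad=True)      # shape (1, 2)
b2 = torch.tensor([2.0], requires_grad=True)              # shape (1,)

x = torch.tensor([1.0])
hidden = torch.relu(W1 @ x + b1)      # hidden layer with ReLU activation
f_x = W2 @ hidden + b2                # output layer with identity activation
loss = torch.abs(f_x - 0.0).sum()     # l(f_theta(1), 0) = |f_theta(1) - 0|

loss.backward()                       # backpropagation
# The conventions ReLU'(0) := 1 and d|x|/dx := 0 at 0 are not needed here,
# since no intermediate value is exactly 0 in this forward pass.
print("f_theta(1) =", f_x.item())
print("grad W1:", W1.grad, " grad b1:", b1.grad)
print("grad W2:", W2.grad, " grad b2:", b2.grad)
```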
Problem 3
Definition. Let $P = (P(j))_{j=1}^{d}$ and $Q = (Q(j))_{j=1}^{d}$ be probability distributions on $\{1, \ldots, d\}$ for some $d \geq 2$. The Kullback–Leibler divergence of $Q$ from $P$ is defined as
\[
D_{\mathrm{KL}}(P \,\|\, Q) := \sum_{j=1}^{d} P(j) \log \frac{P(j)}{Q(j)}.
\]
(Any summand with $P(j) = 0$ is interpreted as zero.) It is an information-theoretic measure of how much information is lost when $Q$ is used to approximate $P$.

Let $y^i = (y^i_1, \ldots, y^i_d) \in \{0,1\}^d$, $i = 1, \ldots, N$, be one-hot encoded categorical labels in $d \geq 2$ categories; that is, we have $y^i_j = 1$ and $y^i_k = 0$ for $k \neq j$ if sample $i$ is in category $j = 1, \ldots, d$, for any $i = 1, \ldots, N$. Define the empirical distribution $P = (P(j))_{j=1}^{d}$ of the labels by
\[
P(j) := \frac{\#\{i : y^i_j = 1\}}{N}, \qquad j = 1, \ldots, d.
\]
Show that minimising
\[
D_{\mathrm{KL}}(P \,\|\, Q)
\]
with respect to the probability distribution $Q = (Q(j))_{j=1}^{d} \in (0,1)^d$ is equivalent to minimising the empirical risk
\[
L(Q) := \frac{1}{N} \sum_{i=1}^{N} l(Q, y^i),
\]
where $l$ is the $d$-dimensional categorical cross-entropy (Equation (3.7) on p. 34 of the lecture notes).
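A quick numerical sanity check of the claimed equivalence (an illustration, not a proof): for the empirical distribution $P$ of randomly generated one-hot labels, $L(Q) - D_{\mathrm{KL}}(P \,\|\, Q)$ should be the same constant (the entropy of $P$) for every $Q$, so both objectives have the same minimisers. The values of $d$ and $N$ and the variable names below are illustrative assumptions.

```python
# Sanity check for Problem 3: L(Q) - D_KL(P || Q) is independent of Q.
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 1_000                               # illustrative choices

labels = rng.integers(0, d, size=N)           # category j of each sample i
P = np.bincount(labels, minlength=d) / N      # empirical distribution P(j)

def kl(P, Q):
    mask = P > 0                              # summands with P(j) = 0 count as zero
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def risk(Q):
    # categorical cross-entropy with one-hot labels:
    # l(Q, y^i) = -sum_j y^i_j log Q(j) = -log Q(j_i), j_i the category of sample i
    return -np.mean(np.log(Q[labels]))

for _ in range(3):
    Q = rng.dirichlet(np.ones(d))             # a random probability distribution on {1,...,d}
    print(risk(Q) - kl(P, Q))                 # same value each time, independent of Q
```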
Problem 4
This is a coding exercise.
Generate first an artificial data set as follows:
⋆ Sample independent and identically distributed realisations $X_1^1, \ldots, X_1^N \sim \mathrm{Uniform}(-1,1)$ and $X_2^1, \ldots, X_2^N \sim \mathrm{Uniform}(-1,1)$ for $N = 1\,000\,000$.
⋆ Sample then $Y^i \sim \mathrm{Bernoulli}(p^i)$, for any $i = 1, \ldots, N$, where
\[
p^i := p(X_1^i, X_2^i), \qquad p(x_1, x_2) := \sigma\bigl(3 \sin(10 x_1^2) + x_2\bigr), \quad (x_1, x_2) \in \mathbb{R}^2,
\]
and the sigmoid function $\sigma(x) := \frac{1}{1 + e^{-x}}$, $x \in \mathbb{R}$.
The aim is to predict the label $Y^i$ using the features $X_1^i$ and $X_2^i$. Specify a network $\hat{p}$ with two inputs and one output, so that the output is a probability in $(0,1)$. A suggested architecture is
\[
\hat{p} \in N_4(2, 200, 200, 200, 1; \mathrm{ReLU}, \mathrm{ReLU}, \mathrm{ReLU}, \sigma),
\]
but feel free to try other architectures as well. Train $\hat{p}$ using $Y^i$ and $(X_1^i, X_2^i)$ for all $i = 1, \ldots, N$ with binary cross-entropy as the loss function, Adam as the optimisation algorithm and minibatch size 500, say. Plot both $p$ and $\hat{p}$ on $[-1,1]^2$ and assess also the error $|p - \hat{p}|$.
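A minimal sketch of one possible implementation in Python with Keras, assuming the target function $p(x_1, x_2) = \sigma(3\sin(10 x_1^2) + x_2)$ as reconstructed above; the architecture and hyperparameters follow the suggestions in the exercise, and only a grid evaluation of $p$, $\hat{p}$ and $|p - \hat{p}|$ is shown in place of the plots.

```python
# Sketch of Problem 4: artificial data, suggested network, training, grid evaluation.
import numpy as np
import tensorflow as tf

N = 1_000_000
rng = np.random.default_rng(0)

def p_true(x1, x2):
    # target probability p(x1, x2) = sigma(3 sin(10 x1^2) + x2)
    return 1.0 / (1.0 + np.exp(-(3.0 * np.sin(10.0 * x1 ** 2) + x2)))

# Artificial data set: features uniform on (-1, 1)^2, Bernoulli labels
X = rng.uniform(-1.0, 1.0, size=(N, 2)).astype(np.float32)
Y = rng.binomial(1, p_true(X[:, 0], X[:, 1])).astype(np.float32)

# Suggested architecture N_4(2, 200, 200, 200, 1; ReLU, ReLU, ReLU, sigmoid)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, Y, batch_size=500, epochs=5)

# Evaluate p and p_hat on a grid over [-1, 1]^2 (e.g. for contour plots)
# and assess the error |p - p_hat|
grid = np.linspace(-1.0, 1.0, 201, dtype=np.float32)
G1, G2 = np.meshgrid(grid, grid)
points = np.column_stack([G1.ravel(), G2.ravel()])
p_hat = model.predict(points, batch_size=10_000).ravel()
print("max |p - p_hat| on the grid:",
      np.abs(p_true(points[:, 0], points[:, 1]) - p_hat).max())
```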