
King’s College London
Department of Mathematics
Dr Blanka Horvath
Academic year 2019–2020, Spring term
MATH97231 Machine Learning
Problem sheet 3
Problem 1
Show that every f ∈ N_r(I, d_1, …, d_{r−1}, O; Id, …, Id) is an affine function.
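Although the proof itself is a short pen-and-paper argument, the claim is easy to check numerically: with identity activations the layers compose into a single affine map x ↦ Ax + c. The sketch below is a minimal illustration in NumPy with arbitrarily chosen layer sizes; nothing in it (the sizes, the random weights) is prescribed by the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: I = 3 inputs, hidden widths d1 = 5, d2 = 4, output O = 2.
dims = [3, 5, 4, 2]

# Random affine layers (W_k, b_k); identity activations throughout.
layers = [(rng.standard_normal((m, n)), rng.standard_normal(m))
          for n, m in zip(dims[:-1], dims[1:])]

def network(x):
    """Forward pass of the identity-activation network."""
    for W, b in layers:
        x = W @ x + b
    return x

# Claim to check: the network equals a single affine map x -> A x + c,
# where A is the product of the weight matrices and c = network(0).
A = np.linalg.multi_dot([W for W, _ in reversed(layers)])
c = network(np.zeros(dims[0]))

x = rng.standard_normal(dims[0])
print(np.allclose(network(x), A @ x + c))  # expected: True
```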
Problem 2
Let f = fθ ∈ N2(1,2,1;ReLU,Id) with parameters
θ = (W1, b1; W2, b2), where W1 = (4, −4)^T, b1 = (−2, 3)^T, W2 = (0, 2) and b2 = 2.
Let also l(ŷ, y) = |ŷ − y|, ŷ, y ∈ R. Compute first f_θ(1) in a forward pass. Compute then ∇_θ l(f_θ(1), 0) by backpropagation. If necessary, set ReLU′(x) := 1 and (d/dx)|x| := 0 for x = 0.
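The hand computation can be cross-checked with automatic differentiation. The sketch below is only a verification aid: it assumes the parameter values as reconstructed above and uses PyTorch as a matter of convenience, not because the problem asks for it. For these values the hidden pre-activations and the residual ŷ − y turn out to be nonzero, so the boundary conventions ReLU′(0) := 1 and (d/dx)|x| := 0 at x = 0 should not come into play.

```python
import torch

# Parameters of f in N_2(1, 2, 1; ReLU, Id), as reconstructed above (assumption).
W1 = torch.tensor([[4.0], [-4.0]], requires_grad=True)   # shape (2, 1)
b1 = torch.tensor([-2.0, 3.0], requires_grad=True)       # shape (2,)
W2 = torch.tensor([[0.0, 2.0]], requires_grad=True)      # shape (1, 2)
b2 = torch.tensor([2.0], requires_grad=True)             # shape (1,)

x = torch.tensor([1.0])
y = torch.tensor([0.0])

# Forward pass: f(x) = W2 ReLU(W1 x + b1) + b2.
hidden = torch.relu(W1 @ x + b1)
y_hat = W2 @ hidden + b2
print("f_theta(1) =", y_hat.item())

# Backward pass: gradient of l(y_hat, y) = |y_hat - y| w.r.t. all parameters.
loss = (y_hat - y).abs().sum()   # sum over the single output to get a scalar
loss.backward()
for name, p in [("W1", W1), ("b1", b1), ("W2", W2), ("b2", b2)]:
    print(name, "grad:", p.grad)
```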
Problem 3
Definition. Let P = (P(j))_{j=1}^d and Q = (Q(j))_{j=1}^d be probability distributions on {1, …, d} for some d ≥ 2. The Kullback–Leibler divergence of Q from P is defined as

D_KL(P ∥ Q) := ∑_{j=1}^{d} P(j) log( P(j) / Q(j) ).

(Any summand with P(j) = 0 is interpreted as zero.) It is an information-theoretic measure of how much information is lost when Q is used to approximate P.

Let y^i = (y^i_1, …, y^i_d) ∈ {0, 1}^d, i = 1, …, N, be one-hot encoded categorical labels in d ≥ 2 categories; that is, we have y^i_j = 1 and y^i_k = 0 for k ≠ j if sample i is in category j = 1, …, d, for any i = 1, …, N. Define the empirical distribution P̂ = (P̂(j))_{j=1}^d of labels by

P̂(j) := #{ i : y^i_j = 1 } / N,   j = 1, …, d.
Show that minimising

D_KL(P̂ ∥ Q)

with respect to a probability distribution Q = (Q(j))_{j=1}^d ∈ (0, 1)^d is equivalent to minimising the empirical risk

L(Q) := (1/N) ∑_{i=1}^{N} l(Q, y^i),

where l is the d-dimensional categorical cross-entropy (Equation (3.7) on p. 34 of the lecture notes).
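Before proving the equivalence it can be sanity-checked numerically: D_KL(P̂ ∥ Q) and L(Q) should differ by a constant that does not depend on Q. The sketch below assumes the standard form of the categorical cross-entropy, l(Q, y) = −∑_j y_j log Q(j), which is presumed (but should be checked) to agree with Equation (3.7) of the lecture notes; the dimensions, sample size and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 1000

# Random one-hot labels y^i in d categories and their empirical distribution P_hat.
categories = rng.integers(0, d, size=N)
Y = np.eye(d)[categories]          # shape (N, d), one-hot rows
P_hat = Y.mean(axis=0)             # P_hat(j) = #{i : y^i_j = 1} / N

def kl(P, Q):
    """Kullback-Leibler divergence, with any summand having P(j) = 0 read as zero."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def empirical_risk(Q):
    """Average categorical cross-entropy l(Q, y^i) = -sum_j y^i_j log Q(j)."""
    return -np.mean(Y @ np.log(Q))

# For any probability vector Q, the difference D_KL(P_hat || Q) - L(Q)
# should be the same constant (the entropy term coming from P_hat alone).
for _ in range(3):
    Q = rng.dirichlet(np.ones(d))
    print(kl(P_hat, Q) - empirical_risk(Q))
```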
Problem 4
This is a coding exercise.
Generate first an artificial data set as follows:
⋆ Sample independent and identically distributed realisations X_1^1, …, X_1^N ∼ Uniform(−1, 1) and X_2^1, …, X_2^N ∼ Uniform(−1, 1) for N = 1 000 000.
⋆ Sample then Y^i ∼ Bernoulli(p_i), for any i = 1, …, N, where

p_i := p(X_1^i, X_2^i),   p(x_1, x_2) := σ( 3 sin( 10 √(x_1^2 + x_2^2) ) ),   (x_1, x_2) ∈ R^2,

and the sigmoid function is defined as σ(x) := 1 / (1 + e^{−x}), x ∈ R (see the sketch after this list).
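A possible NumPy sketch of the two sampling steps above; the seed, the variable names and the vectorised sampling are choices of convenience, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1_000_000

# Features: X_1^i, X_2^i ~ Uniform(-1, 1), i = 1, ..., N.
X = rng.uniform(-1.0, 1.0, size=(N, 2))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Success probabilities p_i = sigmoid(3 sin(10 sqrt(x1^2 + x2^2))).
p = sigmoid(3.0 * np.sin(10.0 * np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)))

# Labels: Y^i ~ Bernoulli(p_i).
Y = rng.binomial(1, p).astype(np.float32)
```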
The aim is to predict the label Y^i using the features X_1^i and X_2^i. Specify a network p̂ with two inputs and one output, so that the output is a probability in (0, 1). A suggested architecture is

p̂ ∈ N_4(2, 200, 200, 200, 1; ReLU, ReLU, ReLU, σ),

but feel free to try other architectures as well. Train p̂ using Y^i and (X_1^i, X_2^i) for all i = 1, …, N with binary cross-entropy as the loss function, Adam as the optimisation algorithm and minibatch size 500, say. Plot both p and p̂ on [−1, 1]^2 and assess also the error |p̂ − p|.
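One possible PyTorch realisation of the suggested architecture and training loop; the sheet does not prescribe a library, and the number of epochs and the plotting style below are arbitrary choices. The snippet assumes the arrays X, Y and the function sigmoid from the data-generation sketch above.

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Suggested architecture: N_4(2, 200, 200, 200, 1; ReLU, ReLU, ReLU, sigmoid).
model = nn.Sequential(
    nn.Linear(2, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()                               # binary cross-entropy
optimiser = torch.optim.Adam(model.parameters())     # Adam with default settings

dataset = TensorDataset(
    torch.as_tensor(X, dtype=torch.float32),
    torch.as_tensor(Y, dtype=torch.float32).reshape(-1, 1),
)
loader = DataLoader(dataset, batch_size=500, shuffle=True)

for epoch in range(5):                               # number of epochs is arbitrary
    for xb, yb in loader:
        optimiser.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimiser.step()
    print(f"epoch {epoch}: last minibatch loss {loss.item():.4f}")

# Evaluate p and p_hat on a grid over [-1, 1]^2 and compare.
grid = np.linspace(-1.0, 1.0, 200)
G1, G2 = np.meshgrid(grid, grid)
grid_points = np.stack([G1.ravel(), G2.ravel()], axis=1)

with torch.no_grad():
    p_hat = model(torch.as_tensor(grid_points, dtype=torch.float32)).numpy().reshape(G1.shape)
p_true = sigmoid(3.0 * np.sin(10.0 * np.sqrt(G1 ** 2 + G2 ** 2)))

for title, Z in [("p", p_true), ("p_hat", p_hat), ("|p_hat - p|", np.abs(p_hat - p_true))]:
    plt.figure()
    plt.pcolormesh(G1, G2, Z)
    plt.colorbar()
    plt.title(title)
plt.show()
```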