MAST90083 Computational Statistics & Data Mining SVM and ANNs
Tutorial & Practical 11: Solutions
Question 1
1. The kernel density estimator has the form
\[
p(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h_p} k(x, x_i),
\]
where $h_p$ is the normalization constant for the kernel. Using the above we have
\[
p(x \mid t = +1) = \frac{1}{N_+} \sum_{i=1}^{N_+} \frac{1}{h_p} k(x, x_i),
\qquad
p(x \mid t = -1) = \frac{1}{N_-} \sum_{i=1}^{N_-} \frac{1}{h_p} k(x, x_i),
\]
where $N_+$ and $N_-$ are the numbers of input vectors with labels $+1$ and $-1$ respectively, and $N = N_+ + N_-$.
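As a concrete illustration, the class-conditional estimates can be computed directly. The sketch below is not part of the required solution; it assumes a Gaussian kernel with bandwidth $h$, for which the normalization constant is $h_p = (2\pi h^2)^{p/2}$, and hypothetical arrays X (inputs, one row per observation) and t (labels in $\{-1,+1\}$).

```python
import numpy as np

def gaussian_kernel(x, Xi, h=1.0):
    """Unnormalized Gaussian kernel k(x, x_i) for each row x_i of Xi."""
    return np.exp(-np.sum((x - Xi) ** 2, axis=-1) / (2 * h ** 2))

def class_conditional(x, X, t, label, h=1.0):
    """Kernel density estimate of p(x | t = label)."""
    Xc = X[t == label]                    # inputs with the given label
    p = X.shape[1]                        # input dimension
    hp = (2 * np.pi * h ** 2) ** (p / 2)  # normalization constant h_p (assumed Gaussian)
    return np.mean(gaussian_kernel(x, Xc, h) / hp)
```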
2. The minimum misclassification rate is achieved if, for a new input feature vector $\tilde{x}$, the selected $\tilde{t}$ maximizes $p(\tilde{t} \mid \tilde{x})$. From Bayes' theorem we have
\[
p(t \mid x) \propto p(x \mid t)\, p(t).
\]
Assuming the two classes have equal prior probability,
\[
p(t) = \begin{cases} 0.5 & \text{if } t = +1, \\ 0.5 & \text{if } t = -1, \end{cases}
\]
this leads to
\[
p(t \mid x) \propto \begin{cases} \dfrac{1}{h_p} \dfrac{1}{N_+} \displaystyle\sum_{i=1}^{N_+} k(x, x_i) & \text{if } t = +1, \\[2ex] \dfrac{1}{h_p} \dfrac{1}{N_-} \displaystyle\sum_{i=1}^{N_-} k(x, x_i) & \text{if } t = -1. \end{cases}
\]
Maximizing the posterior therefore corresponds to maximizing $p(\tilde{x} \mid \tilde{t})$, giving
\[
\tilde{t} = \begin{cases} +1 & \text{if } \dfrac{1}{N_+} \displaystyle\sum_{i=1}^{N_+} k(\tilde{x}, x_i) \ge \dfrac{1}{N_-} \displaystyle\sum_{i=1}^{N_-} k(\tilde{x}, x_i), \\[2ex] -1 & \text{otherwise.} \end{cases}
\]
The factor $\frac{1}{h_p}$ has been dropped since it is a common factor, and the classification rule can therefore be written compactly as
\[
\tilde{t} = \operatorname{sign}\left( \sum_{i=1}^{N} \frac{t_i}{N_{t_i}} k(\tilde{x}, x_i) \right).
\]
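The rule is straightforward to implement. A minimal sketch, reusing the hypothetical X, t, and Gaussian kernel from the sketch above:

```python
import numpy as np

def classify(x_new, X, t, kernel):
    """Kernel classifier: sign of sum_i (t_i / N_{t_i}) k(x_new, x_i)."""
    n_pos = np.sum(t == +1)
    n_neg = np.sum(t == -1)
    weights = np.where(t == +1, 1.0 / n_pos, -1.0 / n_neg)  # t_i / N_{t_i}
    score = np.sum(weights * kernel(x_new, X))
    return +1 if score >= 0 else -1
```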
3. If we choose the kernel function as $k(x, x_i) = x^\top x_i$, this results in the kernel density
\[
p(x \mid t = +1) = \frac{1}{N_+} \sum_{i=1}^{N_+} k(x, x_i) = \frac{1}{N_+} \sum_{i=1}^{N_+} x^\top x_i = x^\top \tilde{x}_+,
\]
where
\[
\tilde{x}_+ = \frac{1}{N_+} \sum_{i=1}^{N_+} x_i
\]
is the mean of the inputs with label $+1$. Similarly,
\[
p(x \mid t = -1) = \frac{1}{N_-} \sum_{i=1}^{N_-} x^\top x_i = x^\top \tilde{x}_-.
\]
(Strictly speaking this is not a proper density, since $x^\top x_i$ can be negative, but the resulting decision rule is still well defined.) The resulting classification rule is
\[
\tilde{t} = \begin{cases} +1 & \text{if } x^\top \tilde{x}_+ \ge x^\top \tilde{x}_-, \\ -1 & \text{otherwise.} \end{cases}
\]
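With the linear kernel the rule therefore reduces to comparing inner products with the two class means, which can be checked on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+1.0, size=(50, 2))   # hypothetical class +1 inputs
X_neg = rng.normal(loc=-1.0, size=(50, 2))   # hypothetical class -1 inputs
mean_pos, mean_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)

x_new = np.array([0.5, 0.2])
t_hat = +1 if x_new @ mean_pos >= x_new @ mean_neg else -1
print(t_hat)   # +1 for this configuration
```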
4. If we choose the kernel function as $k(x, x') = \phi(x)^\top \phi(x')$, we similarly obtain the classification rule in the feature space $\phi(x)$:
\[
\tilde{t} = \begin{cases} +1 & \text{if } \phi(x)^\top \tilde{\phi}(x_+) \ge \phi(x)^\top \tilde{\phi}(x_-), \\ -1 & \text{otherwise,} \end{cases}
\]
where
\[
\tilde{\phi}(x_+) = \frac{1}{N_+} \sum_{i=1}^{N_+} \phi(x_i), \qquad \tilde{\phi}(x_-) = \frac{1}{N_-} \sum_{i=1}^{N_-} \phi(x_i).
\]
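The same comparison can be carried out after an explicit feature map. The sketch below uses a hypothetical quadratic map $\phi(x) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$ purely for illustration:

```python
import numpy as np

def phi(X):
    """Explicit quadratic feature map (illustrative choice)."""
    X = np.atleast_2d(X)
    return np.column_stack([X[:, 0], X[:, 1],
                            X[:, 0] ** 2, X[:, 1] ** 2,
                            X[:, 0] * X[:, 1]])

def classify_feature_space(x_new, X_pos, X_neg):
    """Assign x_new to the class whose mean feature vector it aligns with."""
    phi_mean_pos = phi(X_pos).mean(axis=0)   # mean of phi over class +1
    phi_mean_neg = phi(X_neg).mean(axis=0)   # mean of phi over class -1
    phi_x = phi(x_new)[0]
    return +1 if phi_x @ phi_mean_pos >= phi_x @ phi_mean_neg else -1
```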
Question 2
We have (see slide 29)
\[
d = \frac{2}{\|\beta\|},
\]
therefore
\[
M = \frac{1}{\|\beta\|} \quad \text{and} \quad M^2 = \frac{1}{\|\beta\|^2}.
\]
Therefore we only need to show that
\[
\|\beta\|^2 = \sum_{i=1}^{n} \mu_i.
\]
For the maximum-margin solution the second term of
\[
L_p = \frac{1}{2} \|\beta\|^2 - \sum_{i=1}^{n} \mu_i \left\{ y_i \left( \beta_0 + x_i^\top \beta \right) - 1 \right\}
\]
vanishes, since by complementary slackness $\mu_i > 0$ only for support vectors, for which $y_i(\beta_0 + x_i^\top \beta) = 1$; hence
\[
L_p = \frac{1}{2} \|\beta\|^2.
\]
At the maximum we also have
\[
L_D = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mu_i \mu_j y_i y_j (x_i^\top x_j);
\]
using the constraint
\[
\beta = \sum_{i=1}^{n} \mu_i y_i x_i,
\]
this leads to
\[
L_D = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \|\beta\|^2.
\]
Using the equality of $L_p$ and $L_D$ at the solution gives
\[
\frac{1}{2} \|\beta\|^2 = \sum_{i=1}^{n} \mu_i - \frac{1}{2} \|\beta\|^2,
\]
and therefore
\[
\|\beta\|^2 = \sum_{i=1}^{n} \mu_i.
\]
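The identity can also be verified numerically. A sketch, assuming scikit-learn is available; a large C approximates the hard-margin solution on separable data, and dual_coef_ stores $\mu_i y_i$ for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+3, 1, size=(40, 2)),   # well-separated classes
               rng.normal(-3, 1, size=(40, 2))])
y = np.array([+1] * 40 + [-1] * 40)

svm = SVC(kernel="linear", C=1e6).fit(X, y)       # large C ~ hard margin
mu_sum = np.abs(svm.dual_coef_).sum()             # sum_i mu_i
beta_sq = np.sum(svm.coef_ ** 2)                  # ||beta||^2
print(mu_sum, beta_sq)                            # the two should agree closely
```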
Question 3
1. From the definition we have
\[
f(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}} = -1 + \frac{2e^u}{e^u + e^{-u}} = -1 + \frac{2}{1 + e^{-2u}} = 2g(2u) - 1.
\]
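A quick numeric check of the identity:

```python
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))             # logistic sigmoid
u = np.linspace(-5, 5, 101)
print(np.allclose(np.tanh(u), 2 * g(2 * u) - 1))   # True
```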
2. The output of node $k$ for the network with hidden-layer activation function $f(\cdot)$ has the form
\[
y_k^f = \sum_{j=1}^{M} \alpha_{jk}^f f\!\left(h_j^f\right) + \alpha_{k0}^f
      = \sum_{j=1}^{M} \alpha_{jk}^f \left[ 2g\!\left(2h_j^f\right) - 1 \right] + \alpha_{k0}^f
      = \sum_{j=1}^{M} 2\alpha_{jk}^f\, g\!\left(2h_j^f\right) + \left[ \alpha_{k0}^f - \sum_{j=1}^{M} \alpha_{jk}^f \right].
\]
On the other hand, the output of node $k$ for the network with hidden-layer activation function $g(\cdot)$ has the form
\[
y_k^g = \sum_{j=1}^{M} \alpha_{jk}^g\, g\!\left(h_j^g\right) + \alpha_{k0}^g.
\]
For the outputs of the two networks to be equal, $y_k^f = y_k^g$, we should have
\[
h_j^g = 2h_j^f, \qquad \alpha_{jk}^g = 2\alpha_{jk}^f, \qquad \alpha_{k0}^g = \alpha_{k0}^f - \sum_{j=1}^{M} \alpha_{jk}^f,
\]
where the first condition can be achieved by enforcing
\[
\beta_{ji}^g = 2\beta_{ji}^f \quad \text{and} \quad \beta_{j0}^g = 2\beta_{j0}^f.
\]
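The weight correspondence can be verified on a tiny network with hypothetical random weights ($\beta$ denotes input-to-hidden weights, $\alpha$ hidden-to-output weights, one output node):

```python
import numpy as np

rng = np.random.default_rng(2)
g = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic sigmoid
f = np.tanh                              # tanh activation

M, p = 4, 3                              # hidden units, input dimension
beta_f = rng.normal(size=(M, p))         # beta^f_{ji}
beta0_f = rng.normal(size=M)             # beta^f_{j0}
alpha_f = rng.normal(size=M)             # alpha^f_{jk}
alpha0_f = rng.normal()                  # alpha^f_{k0}

# transformed weights for the equivalent sigmoid network
beta_g, beta0_g = 2 * beta_f, 2 * beta0_f
alpha_g = 2 * alpha_f
alpha0_g = alpha0_f - alpha_f.sum()

x = rng.normal(size=p)
y_f = alpha_f @ f(beta_f @ x + beta0_f) + alpha0_f
y_g = alpha_g @ g(beta_g @ x + beta0_g) + alpha0_g
print(np.isclose(y_f, y_g))              # True
```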
Question 4
The hidden units $k$ compute
\[
g(\beta_{0k} + \beta_k^\top x) = \frac{1}{1 + \exp\left(-\beta_{0k} - \beta_k^\top x\right)}
= \frac{\exp(\beta_{0k})}{\exp(\beta_{0k}) + \exp\left(-\beta_k^\top x\right)}
= \left( \frac{\exp(\beta_{0k})}{1 + \exp(\beta_{0k})} \right) \frac{1}{\dfrac{\exp\left(-\beta_k^\top x\right) - 1}{1 + \exp(\beta_{0k})} + 1}.
\]
If $\beta_k^\top x \to 0$ then $\exp\left(-\beta_k^\top x\right) - 1 \to -\beta_k^\top x$, and using $\frac{1}{1+\epsilon} \approx 1 - \epsilon$ for small $\epsilon$, we obtain
\[
g(\beta_{0k} + \beta_k^\top x) \to \left( \frac{\exp(\beta_{0k})}{1 + \exp(\beta_{0k})} \right) \left( 1 + \frac{\beta_k^\top x}{1 + \exp(\beta_{0k})} \right).
\]
Each hidden unit is therefore a linear function of $x$ in this regime, and hence the output nodes, being linear combinations of the hidden units, are also linear functions of $x$.
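The near-linear behaviour for small weights is easy to confirm numerically. A sketch with a hypothetical small weight vector $\beta_k$ (scaled so that $\beta_k^\top x$ is close to zero):

```python
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic sigmoid

beta0 = 0.7
beta = 1e-3 * np.array([0.4, -0.2])      # small weights: near-linear regime
x = np.array([1.5, 2.0])

exact = g(beta0 + beta @ x)
c = np.exp(beta0) / (1 + np.exp(beta0))
approx = c * (1 + (beta @ x) / (1 + np.exp(beta0)))
print(exact, approx)                     # nearly identical values
```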