MAST90083 Computational Statistics & Data Mining SVM and ANNs

Tutorial & Practical 11: Solutions

Question 1

1. The kernel density estimator has the form

$$p(x) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{h^p}\,k(x, x_i)$$

where $h^p$ is the normalization constant for the kernel.

Using the above we have

$$p(x \mid t = +1) = \frac{1}{N_+}\sum_{i=1}^{N_+}\frac{1}{h^p}\,k(x, x_i)$$

$$p(x \mid t = -1) = \frac{1}{N_-}\sum_{i=1}^{N_-}\frac{1}{h^p}\,k(x, x_i)$$

where $N_+$ and $N_-$ are the numbers of input vectors with labels $+1$ and $-1$ respectively, so that $N = N_+ + N_-$, and each sum runs over the inputs in the corresponding class.

2. The minimum misclassification rate is achieved if, for a new input feature vector $\tilde{x}$, the selected $\tilde{t}$ maximizes $p(\tilde{t} \mid \tilde{x})$.

From Bayes' theorem we have

$$p(t \mid x) \propto p(x \mid t)\,p(t).$$

Assuming the two classes have equal prior probability,

$$p(t) = \begin{cases} 0.5 & \text{if } t = +1 \\ 0.5 & \text{if } t = -1 \end{cases}$$

this leads to

$$p(t \mid x) \propto \begin{cases} \dfrac{1}{h^p}\dfrac{1}{N_+}\displaystyle\sum_{i=1}^{N_+} k(x, x_i) & \text{if } t = +1 \\[10pt] \dfrac{1}{h^p}\dfrac{1}{N_-}\displaystyle\sum_{i=1}^{N_-} k(x, x_i) & \text{if } t = -1. \end{cases}$$

This therefore corresponds to maximizing $p(\tilde{x} \mid \tilde{t})$, i.e.

$$\tilde{t} = \begin{cases} +1 & \text{if } \dfrac{1}{N_+}\displaystyle\sum_{i=1}^{N_+} k(\tilde{x}, x_i) \ge \dfrac{1}{N_-}\displaystyle\sum_{i=1}^{N_-} k(\tilde{x}, x_i) \\[10pt] -1 & \text{otherwise.} \end{cases}$$

The factor $\frac{1}{h^p}$ has been removed since it is a common factor, and the classification rule can therefore be written compactly as

$$\tilde{t} = \operatorname{sign}\left(\sum_{i=1}^{N} \frac{t_i}{N_{t_i}}\, k(\tilde{x}, x_i)\right)$$

where $N_{t_i}$ equals $N_+$ if $t_i = +1$ and $N_-$ if $t_i = -1$.
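As a concrete illustration, here is a minimal Python sketch of this rule using a Gaussian kernel; the function names, the kernel choice, and the bandwidth are illustrative assumptions, not part of the solution.

```python
import numpy as np

def gaussian_kernel(x, xi, h=1.0):
    """Unnormalized Gaussian kernel; the common 1/h^p factor cancels in the rule."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * h ** 2))

def parzen_classify(x_new, X, t, h=1.0):
    """t~ = sign( sum_i (t_i / N_{t_i}) k(x_new, x_i) )."""
    N_pos = np.sum(t == 1)
    N_neg = np.sum(t == -1)
    score = sum(
        ti / (N_pos if ti == 1 else N_neg) * gaussian_kernel(x_new, xi, h)
        for xi, ti in zip(X, t)
    )
    return 1 if score >= 0 else -1

# Toy usage: two clusters labelled +1 and -1
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
t = np.array([1, 1, -1, -1])
print(parzen_classify(np.array([0.1, 0.1]), X, t))  # expected: 1
```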

3. If we choose the kernel function as $k(x, x_i) = x^\top x_i$, this results in the kernel density

$$p(x \mid t = +1) = \frac{1}{N_+}\sum_{i=1}^{N_+} k(x, x_i) = \frac{1}{N_+}\sum_{i=1}^{N_+} x^\top x_i = x^\top \tilde{x}_+$$

where

$$\tilde{x}_+ = \frac{1}{N_+}\sum_{i=1}^{N_+} x_i.$$

Similarly,

$$p(x \mid t = -1) = \frac{1}{N_-}\sum_{i=1}^{N_-} x^\top x_i = x^\top \tilde{x}_-.$$

The resulting classification rule is

$$\tilde{t} = \begin{cases} +1 & \text{if } x^\top \tilde{x}_+ \ge x^\top \tilde{x}_- \\ -1 & \text{otherwise.} \end{cases}$$
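In this special case the rule simply compares inner products with the two class means. A minimal sketch (function name illustrative):

```python
import numpy as np

def linear_kernel_classify(x_new, X, t):
    """With k(x, xi) = x^T xi the rule reduces to comparing
    x_new^T x_mean_plus against x_new^T x_mean_minus."""
    x_mean_plus = X[t == 1].mean(axis=0)    # class mean for t = +1
    x_mean_minus = X[t == -1].mean(axis=0)  # class mean for t = -1
    return 1 if x_new @ x_mean_plus >= x_new @ x_mean_minus else -1
```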

4. If we choose the kernel function as $k(x, x') = \phi(x)^\top \phi(x')$, we similarly obtain the classification rule in the feature space $\phi(x)$:

$$\tilde{t} = \begin{cases} +1 & \text{if } \phi(x)^\top \tilde{\phi}_+ \ge \phi(x)^\top \tilde{\phi}_- \\ -1 & \text{otherwise} \end{cases}$$

where

$$\tilde{\phi}_+ = \frac{1}{N_+}\sum_{i=1}^{N_+} \phi(x_i), \qquad \tilde{\phi}_- = \frac{1}{N_-}\sum_{i=1}^{N_-} \phi(x_i).$$


Question 2

We have (see slide 29)

$$d = \frac{2}{\|\beta\|}, \qquad \text{therefore} \qquad M = \frac{1}{\|\beta\|} \quad \text{and} \quad M^2 = \frac{1}{\|\beta\|^2}.$$

Therefore we only need to show that

$$\|\beta\|^2 = \sum_{i=1}^{n} \mu_i.$$

For the maximum margin solution the second term of

$$L_p = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{n} \mu_i \left\{ y_i \left( \beta_0 + x_i^\top \beta \right) - 1 \right\}$$

vanishes (by complementary slackness, each term has either $\mu_i = 0$ or $y_i(\beta_0 + x_i^\top \beta) = 1$), so that

$$L_p = \frac{1}{2}\|\beta\|^2.$$

At the maximum we also have

$$L_D = \sum_{i=1}^{n} \mu_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \mu_i \mu_j y_i y_j \left(x_i^\top x_j\right)$$

and using the constraint

$$\beta = \sum_{i=1}^{n} \mu_i y_i x_i$$

this leads to

$$L_D = \sum_{i=1}^{n} \mu_i - \frac{1}{2}\|\beta\|^2.$$

Using the equality of $L_p$ and $L_D$ gives

$$\frac{1}{2}\|\beta\|^2 = \sum_{i=1}^{n} \mu_i - \frac{1}{2}\|\beta\|^2$$

and therefore

$$\|\beta\|^2 = \sum_{i=1}^{n} \mu_i.$$
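This identity can also be checked numerically. Below is a sketch using scikit-learn, where a very large C approximates the hard-margin SVM; the toy data and variable names are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# Large C approximates the hard-margin solution
model = SVC(kernel="linear", C=1e6).fit(X, y)

beta_sq = np.sum(model.coef_ ** 2)       # ||beta||^2
sum_mu = np.abs(model.dual_coef_).sum()  # sum_i mu_i, since dual_coef_ = mu_i * y_i
print(beta_sq, sum_mu)                   # the two values should agree closely
```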


Question 3

1. From the definition we have

$$f(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}} = -1 + \frac{2e^u}{e^u + e^{-u}} = -1 + \frac{2}{1 + e^{-2u}} = 2g(2u) - 1.$$
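A one-line numerical sanity check of this identity:

```python
import numpy as np

u = np.linspace(-5, 5, 101)
g = lambda a: 1.0 / (1.0 + np.exp(-a))            # logistic sigmoid
assert np.allclose(np.tanh(u), 2 * g(2 * u) - 1)  # f(u) = 2 g(2u) - 1
```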

2. The output of node $k$ for the network with hidden-layer activation function $f(\cdot)$ has the form

$$y_k^f = \sum_{j=1}^{M} \alpha_{jk}^f\, f\!\left(h_j^f\right) + \alpha_{k0}^f = \sum_{j=1}^{M} \alpha_{jk}^f \left[ 2g\!\left(2h_j^f\right) - 1 \right] + \alpha_{k0}^f$$

$$= \sum_{j=1}^{M} 2\alpha_{jk}^f\, g\!\left(2h_j^f\right) + \left[ \alpha_{k0}^f - \sum_{j=1}^{M} \alpha_{jk}^f \right].$$

On the other hand, the output of node $k$ for the network with hidden-layer activation function $g(\cdot)$ has the form

$$y_k^g = \sum_{j=1}^{M} \alpha_{jk}^g\, g\!\left(h_j^g\right) + \alpha_{k0}^g.$$

For the outputs of the two networks to be equal, $y_k^f = y_k^g$, we should have

$$h_j^g = 2h_j^f, \qquad \alpha_{jk}^g = 2\alpha_{jk}^f, \qquad \alpha_{k0}^g = \alpha_{k0}^f - \sum_{j=1}^{M} \alpha_{jk}^f$$

where the first condition can be achieved by enforcing

$$\beta_{ji}^g = 2\beta_{ji}^f \quad \text{and} \quad \beta_{j0}^g = 2\beta_{j0}^f.$$
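A small numerical sketch verifying this equivalence for a random single-hidden-layer network (the sizes and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda a: 1.0 / (1.0 + np.exp(-a))  # logistic sigmoid

p, M, K = 3, 5, 2                                        # input, hidden, output dimensions
x = rng.normal(size=p)
B_f, b0_f = rng.normal(size=(M, p)), rng.normal(size=M)  # hidden weights / biases
A_f, a0_f = rng.normal(size=(K, M)), rng.normal(size=K)  # output weights / biases

# tanh network
y_f = A_f @ np.tanh(B_f @ x + b0_f) + a0_f

# equivalent sigmoid network: double the hidden-layer weights and biases,
# double the output weights, and shift each output bias by -sum_j alpha_{jk}
y_g = (2 * A_f) @ g(2 * (B_f @ x + b0_f)) + (a0_f - A_f.sum(axis=1))

assert np.allclose(y_f, y_g)
```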

Question 4

Hidden unit $k$ computes

$$g(\beta_{0k} + \beta_k^\top x) = \frac{1}{1 + \exp\left(-\beta_{0k} - \beta_k^\top x\right)} = \frac{\exp(\beta_{0k})}{\exp(\beta_{0k}) + \exp\left(-\beta_k^\top x\right)}$$

$$= \left(\frac{\exp(\beta_{0k})}{1 + \exp(\beta_{0k})}\right) \frac{1}{\dfrac{\exp\left(-\beta_k^\top x\right) - 1}{1 + \exp(\beta_{0k})} + 1}.$$


If $-\beta_k^\top x \to 0$ (for instance when the weights $\beta_k$ are small), then $\exp\left(-\beta_k^\top x\right) - 1 \to -\beta_k^\top x$, so that

$$g(\beta_{0k} + \beta_k^\top x) \to \left(\frac{\exp(\beta_{0k})}{1 + \exp(\beta_{0k})}\right)\left(1 + \frac{\beta_k^\top x}{1 + \exp(\beta_{0k})}\right).$$

Each hidden unit, and hence the output node, is therefore approximately a linear function of $x$.
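A quick numerical sketch of this linearization, using small random weights (the names and values are illustrative):

```python
import numpy as np

g = lambda a: 1.0 / (1.0 + np.exp(-a))  # logistic sigmoid

rng = np.random.default_rng(2)
beta0 = 0.7
beta = rng.normal(scale=0.01, size=3)   # small weights: sigmoid is near-linear
x = rng.normal(size=3)

exact = g(beta0 + beta @ x)
s = np.exp(beta0) / (1 + np.exp(beta0))
approx = s * (1 + (beta @ x) / (1 + np.exp(beta0)))
print(exact, approx)  # the two values agree to several decimal places
```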
