5 Feature Extraction
1. Briefly define the following terms: a. Feature Engineering
modifying measured values to make them suitable for classification.
b. Feature Selection
choosing a subset of measured/possible features to use for classification.
c. Feature Extraction
projecting the chosen feature vectors into a new feature space.
d. Dimensionality Reduction
selecting/extracting feature vectors of lower dimensionality (length) than the original feature vectors.
e. Deep Learning
the process of training a neural network that has many layers (> 3, typically ≫ 3), or performing multiple stages of (nonlinear) feature extraction prior to classification.
2. List 5 methods that can be used to perform feature extraction.
Any five from:
• Principal Component Analysis (PCA)
• Whitening
• Linear Discriminant Analysis (LDA)
• Independent Component Analysis (ICA)
• Random Projections
• Sparse Coding
3. Write pseudo-code for the Karhunen-Loève Transform method for performing Principal Component Analysis (PCA).
1. Subtract the mean from all data vectors.
2. Calculate the covariance matrix of the zero-mean data.
3. Find the eigenvalues and eigenvectors of the covariance matrix.
4. Order eigenvalues from large to small, and discard small eigenvalues and their respective vectors. Form a matrix (Vˆ ) of the remaining eigenvectors.
5. Project the zero-mean data onto the PCA subspace by multiplying by the transpose of Vˆ .
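As a rough illustration of these steps (not part of the original notes), a minimal NumPy sketch is given below; the function name pca_klt is a choice made here, and the covariance is normalised by 1/N to match the worked examples that follow (MATLAB's cov uses 1/(N-1) by default).

import numpy as np

def pca_klt(X, k):
    # X: (n_samples, n_features) data matrix; k: number of principal components to keep.
    mu = X.mean(axis=0)                      # step 1: subtract the mean
    Xc = X - mu
    C = Xc.T @ Xc / X.shape[0]               # step 2: covariance of the zero-mean data (1/N)
    evals, evecs = np.linalg.eigh(C)         # step 3: eigen-decomposition (C is symmetric)
    order = np.argsort(evals)[::-1]          # step 4: sort eigenvalues from large to small
    V_hat = evecs[:, order[:k]]              # keep the k leading eigenvectors
    Y = Xc @ V_hat                           # step 5: project onto the PCA subspace
    return Y, V_hat, evals[order]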
4. Use the Karhunen-Loève Transform to project the following 3-dimensional data onto the first two principal components (the MATLAB command eig can be used to find eigenvectors and eigenvalues).
x1 = (1, 2, 1)^T, x2 = (2, 3, 1)^T, x3 = (3, 5, 1)^T, x4 = (2, 2, 1)^T.
The mean of the data is μ = (2, 3, 1)^T, hence, the zero-mean data is:
x′1 = (−1, −1, 0)^T, x′2 = (0, 0, 0)^T, x′3 = (1, 2, 0)^T, x′4 = (0, −1, 0)^T.
The covariance matrix is
C = (1/4) Σi (xi − μ)(xi − μ)^T = (1/4) Σi x′i (x′i)^T
= (1/4) ([1 1 0; 1 1 0; 0 0 0] + [0 0 0; 0 0 0; 0 0 0] + [1 2 0; 2 4 0; 0 0 0] + [0 0 0; 0 1 0; 0 0 0])
= (1/4) [2 3 0; 3 6 0; 0 0 0] = [0.5 0.75 0; 0.75 1.5 0; 0 0 0]
Calculating eigenvectors and eigenvalues of C (using MATLAB command eig):
V = [0 −0.8817 0.4719; 0 0.4719 0.8817; 1 0 0],   E = diag(0, 0.0986, 1.9014)
We need to choose the two eigenvectors corresponding to the two largest eigenvalues:
Vˆ = [0.4719 −0.8817; 0.8817 0.4719; 0 0]
Projection of the data onto the subspace spanned by the 1st two principal components is given by: yi = Vˆ^T(xi − μ).
Hence, yi = [0.4719 0.8817 0; −0.8817 0.4719 0] x′i.
Therefore,
y1 = [0.4719 0.8817 0; −0.8817 0.4719 0] (−1, −1, 0)^T = (−1.3536, 0.4098)^T
y2 = [0.4719 0.8817 0; −0.8817 0.4719 0] (0, 0, 0)^T = (0, 0)^T
y3 = [0.4719 0.8817 0; −0.8817 0.4719 0] (1, 2, 0)^T = (2.2353, 0.0621)^T
y4 = [0.4719 0.8817 0; −0.8817 0.4719 0] (0, −1, 0)^T = (−0.8817, −0.4719)^T
Note: the new data, y, has zero mean and its covariance matrix is [1.9014 0; 0 0.0986] (the eigenvalues of the original covariance matrix, which measure the variance along each principal component).
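These numbers can also be checked numerically; a possible NumPy check is sketched below (eig/eigh return eigenvectors only up to sign, so the projected values may come out with flipped signs).

import numpy as np

X = np.array([[1, 2, 1],
              [2, 3, 1],
              [3, 5, 1],
              [2, 2, 1]], dtype=float)       # x1..x4 as rows
Xc = X - X.mean(axis=0)                      # zero-mean data
C = Xc.T @ Xc / 4                            # [[0.5, 0.75, 0], [0.75, 1.5, 0], [0, 0, 0]]
evals, evecs = np.linalg.eigh(C)             # ascending eigenvalues: 0, 0.0986, 1.9014
V_hat = evecs[:, [2, 1]]                     # the two largest eigenvalues, in descending order
Y = Xc @ V_hat                               # rows match y1..y4 above, up to column sign flips
print(evals, Y, sep="\n")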
5. What is the proportion of the variance explained by the 1st two principal components in the preceding question?
The proportion of the variance is given by the sum of the eigenvalues of the selected components divided by the sum of all eigenvalues.
For the preceding question this is (1.9014 + 0.0986) / (1.9014 + 0.0986 + 0) = 1.
Note: the 1st principal component alone would explain 1.9014 / (1.9014 + 0.0986 + 0) = 0.95 of the variance, so we could project the data onto only the 1st PC without losing too much information.
6. Use the Karhunen-Loève Transform to project the following 2-dimensional dataset onto the first principal component (the MATLAB command eig can be used to find eigenvectors and eigenvalues).
x1 = (0, 1)^T, x2 = (3, 5)^T, x3 = (5, 4)^T, x4 = (5, 6)^T, x5 = (8, 7)^T, x6 = (9, 7)^T.
The mean of the data is μ = (5, 5)^T, hence, the zero-mean data is:
x′1 = (−5, −4)^T, x′2 = (−2, 0)^T, x′3 = (0, −1)^T, x′4 = (0, 1)^T, x′5 = (3, 2)^T, x′6 = (4, 2)^T.
The covariance matrix is
C = (1/6) Σi (xi − μ)(xi − μ)^T = (1/6) Σi x′i (x′i)^T
= (1/6) ([25 20; 20 16] + [4 0; 0 0] + [0 0; 0 1] + [0 0; 0 1] + [9 6; 6 4] + [16 8; 8 4])
= (1/6) [54 34; 34 26] = [9 5.67; 5.67 4.33]
Calculating eigenvectors and eigenvalues of C (using MATLAB command eig):
V = [0.5564 −0.8309; −0.8309 −0.5564],   E = diag(0.5384, 12.7949)
We need to choose the eigenvector corresponding to the largest eigenvalue:
Vˆ = (−0.8309, −0.5564)^T
Projection of the data onto the subspace spanned by the 1st principal component is given by: yi = Vˆ^T(xi − μ).
Hence, yi = [−0.8309 −0.5564] (xi − (5, 5)^T).
Therefore, the new dataset is: 6.3801, 1.6618, 0.5564, −0.5564, −3.6055, −4.4364.
The projection looks like this: [figure omitted: three panels showing the original data, the zero-mean data, and the projection onto the first principal component]
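As with question 4, these numbers can be checked with a short NumPy script (again, the sign of the projection may be flipped, since eigenvectors are only defined up to sign).

import numpy as np

X = np.array([[0, 1], [3, 5], [5, 4], [5, 6], [8, 7], [9, 7]], dtype=float)
Xc = X - X.mean(axis=0)                  # zero-mean data
C = Xc.T @ Xc / 6                        # [[9, 5.67], [5.67, 4.33]]
evals, evecs = np.linalg.eigh(C)         # 0.5384, 12.7949
v1 = evecs[:, -1]                        # eigenvector of the largest eigenvalue
y = Xc @ v1                              # approx ±(6.38, 1.66, 0.56, -0.56, -3.61, -4.44)
print(y)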
7. Apply two epochs of a batch version of Oja’s learning rule to the same data used in the previous question. Use a learning rate of 0.01 and an initial weight vector of [-1,0].
For the batch version of Oja's learning rule, the weights are updated once per epoch such that: w ← w + η Σi yi(xi^t − yi w), where yi = w xi. Here, η = 0.01 and the exemplars xi^t are the zero-mean data from the previous question.

Epoch 1, initial w = (−1, 0):

x^t         y = wx    x^t − yw        ηy(x^t − yw)
(−5, −4)    5         (0, −4)         (0, −0.2)
(−2, 0)     2         (0, 0)          (0, 0)
(0, −1)     0         (0, −1)         (0, 0)
(0, 1)      0         (0, 1)          (0, 0)
(3, 2)      −3        (0, 2)          (0, −0.06)
(4, 2)      −4        (0, 2)          (0, −0.08)
                      total weight change: (0, −0.34)

after 1 epoch: w = (−1, −0.34)

Epoch 2, initial w = (−1, −0.34):

x^t         y = wx    x^t − yw          ηy(x^t − yw)
(−5, −4)    6.36      (1.36, −1.84)     (0.087, −0.117)
(−2, 0)     2         (0, 0.68)         (0, 0.0136)
(0, −1)     0.34      (0.34, −0.88)     (0.001, −0.003)
(0, 1)      −0.34     (−0.34, 0.88)     (0.001, −0.003)
(3, 2)      −3.68     (−0.68, 0.75)     (0.025, −0.028)
(4, 2)      −4.68     (−0.68, 0.41)     (0.032, −0.019)
                      total weight change: (0.146, −0.156)

after 2 epochs: w = (−0.854, −0.496)
after 3 epochs: w = (−0.8468, −0.5453)
after 4 epochs: w = (−0.8302, −0.5505)
after 5 epochs: w = (−0.8333, −0.5565)
after 6 epochs: w = (−0.8302, −0.5556)
cf. the result for the previous question, where the first PC found via the Karhunen-Loève Transform was Vˆ = (−0.8309, −0.5564)^T.
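A minimal NumPy sketch of this batch update is given below (not part of the original solution; the six-epoch loop and the variable names are choices made here). Each epoch applies one accumulated update over all exemplars.

import numpy as np

X = np.array([[-5, -4], [-2, 0], [0, -1], [0, 1], [3, 2], [4, 2]], dtype=float)  # zero-mean data
w = np.array([-1.0, 0.0])                    # initial weight vector
eta = 0.01                                   # learning rate

for epoch in range(6):
    y = X @ w                                # output for every exemplar
    dw = eta * (y[:, None] * (X - np.outer(y, w))).sum(axis=0)   # batch Oja update
    w = w + dw
    print(epoch + 1, w)                      # epoch 1: (-1, -0.34); epoch 2: (-0.854, -0.496); ...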
8. The graph below shows a two-dimensional dataset in which examplars come from two classes. Exemplars from one class are plotted using triangular markers, and exemplars from the other class are plotted using square markers.
• Draw the approximate direction of the first principal component of this data.
• Draw the approximate direction of the axis onto which the data would be projected using LDA.
9. Briefly describe the optimisation performed by Fisher’s Linear Discriminant Analysis to find a projection of the original data onto a subspace.
LDA searches for a discriminative subspace in which exemplars belonging to the same class are as close together as possible while patterns belonging to different classes are as far apart as possible.
10. For the data in the Table below use Fisher’s method to determine which of the following projection weights is more effective at performing Linear Discriminant Analysis (LDA).
• w^T = [−1, 5]
• w^T = [2, −3]
Class   Feature vector x^T
1       [1, 2]
1       [2, 1]
1       [3, 3]
2       [6, 5]
2       [7, 8]
Fisher's Linear Discriminant Analysis maximises the cost function J(w) = sb / sw, where:
sb = |w^T(m1 − m2)|^2
sw = Σ_{x∈ω1} (w^T(x − m1))^2 + Σ_{x∈ω2} (w^T(x − m2))^2
mi = (1/ni) Σ_{x∈ωi} x
Sample mean for class 1 is m1^T = (1/3)([1, 2] + [2, 1] + [3, 3]) = [2, 2].
Sample mean for class 2 is m2^T = (1/2)([6, 5] + [7, 8]) = [6.5, 6.5].
For w^T = [−1, 5]:
Between-class scatter: sb = |w^T(m1 − m2)|^2 = |[−1, 5] × ([2, 2] − [6.5, 6.5])^T|^2 = |[−1, 5] × [−4.5, −4.5]^T|^2 = |−18|^2 = 324
Within-class scatter: sw = Σ_{x∈ω1} (w^T(x − m1))^2 + Σ_{x∈ω2} (w^T(x − m2))^2
= ([−1, 5] × ([1, 2] − [2, 2])^T)^2 + ([−1, 5] × ([2, 1] − [2, 2])^T)^2 + ([−1, 5] × ([3, 3] − [2, 2])^T)^2 + ([−1, 5] × ([6, 5] − [6.5, 6.5])^T)^2 + ([−1, 5] × ([7, 8] − [6.5, 6.5])^T)^2
= 1 + 25 + 16 + 49 + 49 = 140
Cost: J(w) = sb / sw = 324 / 140 = 2.3143
For w^T = [2, −3]:
Between-class scatter: sb = |w^T(m1 − m2)|^2 = |[2, −3] × ([2, 2] − [6.5, 6.5])^T|^2 = |[2, −3] × [−4.5, −4.5]^T|^2 = |4.5|^2 = 20.25
Within-class scatter: sw = Σ_{x∈ω1} (w^T(x − m1))^2 + Σ_{x∈ω2} (w^T(x − m2))^2
= ([2, −3] × ([1, 2] − [2, 2])^T)^2 + ([2, −3] × ([2, 1] − [2, 2])^T)^2 + ([2, −3] × ([3, 3] − [2, 2])^T)^2 + ([2, −3] × ([6, 5] − [6.5, 6.5])^T)^2 + ([2, −3] × ([7, 8] − [6.5, 6.5])^T)^2
= 4 + 9 + 1 + 12.25 + 12.25 = 38.5
Cost: J(w) = sb / sw = 20.25 / 38.5 = 0.526
As the J(w) obtained with w^T = [−1, 5] is higher, it is the more effective projection weight.
Note, projection of the data into the new feature space defined by the two projection weights is:

Feature vector x^T    after projection by w^T = [−1, 5]    after projection by w^T = [2, −3]
[1, 2]                [−1, 5] × [1, 2]^T = 9               [2, −3] × [1, 2]^T = −4
[2, 1]                [−1, 5] × [2, 1]^T = 3               [2, −3] × [2, 1]^T = 1
[3, 3]                [−1, 5] × [3, 3]^T = 12              [2, −3] × [3, 3]^T = −3
[6, 5]                [−1, 5] × [6, 5]^T = 19              [2, −3] × [6, 5]^T = −3
[7, 8]                [−1, 5] × [7, 8]^T = 33              [2, −3] × [7, 8]^T = −10

It can be seen that after projection by w^T = [−1, 5] the data is linearly separable, while after projection by w^T = [2, −3] it is not.
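The same comparison can be scripted; the short sketch below (the helper name fisher_J is ours, not from the notes) simply evaluates J(w) = sb / sw for both candidate weight vectors.

import numpy as np

X1 = np.array([[1, 2], [2, 1], [3, 3]], dtype=float)   # class 1 exemplars
X2 = np.array([[6, 5], [7, 8]], dtype=float)           # class 2 exemplars

def fisher_J(w, X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s_b = (w @ (m1 - m2)) ** 2                                         # between-class scatter
    s_w = (((X1 - m1) @ w) ** 2).sum() + (((X2 - m2) @ w) ** 2).sum()  # within-class scatter
    return s_b / s_w

print(fisher_J(np.array([-1.0, 5.0]), X1, X2))   # 324 / 140   = 2.3143
print(fisher_J(np.array([2.0, -3.0]), X1, X2))   # 20.25 / 38.5 = 0.526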
11. An Extreme Learning Machine consists of a hidden layer with six neurons, and an output layer with one neuron. The weights to the hidden neurons have been assigned the following random values:
V = [−0.62 0.44 −0.91; −0.81 −0.09 0.02; 0.74 −0.91 −0.60; −0.82 −0.92 0.71; −0.26 0.68 0.15; 0.80 −0.94 −0.83]
The weights to the output neuron are: w = (0, 0, 0, −1, 0, 0, 2). All weights are defined using augmented vector
notation. Hidden neurons are Linear Threshold units, while the output neuron is linear. Calculate the response of
the output neuron to each of the following input vectors: (0, 0)^T, (0, 1)^T, (1, 0)^T, (1, 1)^T.
If we place all the augmented input patterns into a matrix we have the following dataset:
X = [1 1 1 1; 0 0 1 1; 0 1 0 1]
The response of each hidden neuron to a single exemplar is defined as y = H(vx), where H is the Heaviside function. The response of all six hidden neurons to all four input patterns is given by:
Y = H[VX]
= H([−0.62 0.44 −0.91; −0.81 −0.09 0.02; 0.74 −0.91 −0.60; −0.82 −0.92 0.71; −0.26 0.68 0.15; 0.80 −0.94 −0.83] [1 1 1 1; 0 0 1 1; 0 1 0 1])
= H([−0.62 −1.53 −0.18 −1.09; −0.81 −0.79 −0.90 −0.88; 0.74 0.14 −0.17 −0.77; −0.82 −0.11 −1.74 −1.03; −0.26 −0.11 0.42 0.57; 0.80 −0.03 −0.14 −0.97])
= [0 0 0 0; 0 0 0 0; 1 1 0 0; 0 0 0 0; 0 0 1 1; 1 0 0 0]
The response of the output neuron to a single exemplar is defined as z = wy, where y is augmented with a leading 1 to match the bias weight. The response of the output neuron to all four input patterns is given by:
Z = wY = (0 0 0 −1 0 0 2) [1 1 1 1; 0 0 0 0; 0 0 0 0; 1 1 0 0; 0 0 0 0; 0 0 1 1; 1 0 0 0] = (1 −1 0 0)
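A quick NumPy check of this forward pass could look as follows (the Heaviside step is written as a comparison with H(0) = 1, which makes no difference here because no net input is exactly zero).

import numpy as np

V = np.array([[-0.62,  0.44, -0.91],
              [-0.81, -0.09,  0.02],
              [ 0.74, -0.91, -0.60],
              [-0.82, -0.92,  0.71],
              [-0.26,  0.68,  0.15],
              [ 0.80, -0.94, -0.83]])               # hidden-layer weights (augmented)
w = np.array([0, 0, 0, -1, 0, 0, 2], dtype=float)   # output weights (augmented)

X = np.array([[1, 1, 1, 1],                         # augmented input patterns as columns
              [0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)

Y = (V @ X >= 0).astype(float)                      # Heaviside response of the hidden LTUs
Y_aug = np.vstack([np.ones((1, 4)), Y])             # add a bias row for the output neuron
Z = w @ Y_aug                                       # [ 1, -1,  0,  0]
print(Z)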
12. Given a dictionary, Vt, what is the best sparse code for the signal x out of the following two alternatives: i) y1t = (1,0,0,0,1,0,0,0)
ii) y2t = (0,0,1,0,0,0,−1,0)
where V^T = [0.4 0.55 0.5 −0.1 −0.5 0.9 0.5 0.45; −0.6 −0.45 −0.5 0.9 −0.5 0.1 0.5 0.55], and x = (−0.05, −0.95)^T. Assume that sparsity is measured as the count of elements that are non-zero.
Both alternatives are equally sparse (2 non-zero elements each), so the best will be the one with the lowest reconstruction error: ∥x − V^T y∥_2.
For (i), error = ∥x − V^T y1∥_2
= ∥(−0.05, −0.95)^T − [(0.4, −0.6)^T + (−0.5, −0.5)^T]∥_2
= ∥(−0.05, −0.95)^T − (−0.1, −1.1)^T∥_2
= ∥(0.05, 0.15)^T∥_2 = √(0.05^2 + 0.15^2) = 0.158
For (ii), error = ∥x − V^T y2∥_2
= ∥(−0.05, −0.95)^T − [(0.5, −0.5)^T − (0.5, 0.5)^T]∥_2
= ∥(−0.05, −0.95)^T − (0, −1)^T∥_2
= ∥(−0.05, 0.05)^T∥_2 = √(0.05^2 + 0.05^2) = 0.071
Therefore, solution (ii) is the better sparse code.
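These reconstruction errors can be verified with a few lines of NumPy; replacing y2 with the code from question 13 reproduces that comparison as well.

import numpy as np

Vt = np.array([[ 0.4,  0.55,  0.5, -0.1, -0.5,  0.9,  0.5,  0.45],
               [-0.6, -0.45, -0.5,  0.9, -0.5,  0.1,  0.5,  0.55]])   # the dictionary V^T
x  = np.array([-0.05, -0.95])

y1 = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
y2 = np.array([0, 0, 1, 0, 0, 0, -1, 0], dtype=float)

for y in (y1, y2):
    err = np.linalg.norm(x - Vt @ y)           # ||x - V^T y||_2
    print(np.count_nonzero(y), err)            # sparsity (non-zero count) and reconstruction error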
13. Repeat the previous question when the two alternatives are: i) y1t = (1,0,0,0,1,0,0,0)
ii) y2t = (0,0,0,−1,0,0,0,0)
(i) is the same as in the previous question, therefore for (i) the error is 0.158.
For (ii), error = ∥x − V^T y2∥_2
= ∥(−0.05, −0.95)^T − (−1) × (−0.1, 0.9)^T∥_2
= ∥(−0.05, −0.95)^T − (0.1, −0.9)^T∥_2
= ∥(−0.15, −0.05)^T∥_2 = √(0.15^2 + 0.05^2) = 0.158
Hence, error is the same in both cases. We should therefore prefer the sparser solution, which is solution (ii).