Data Mining and Machine Learning
Introduction to Data Mining, Vector Data Analysis and Principal Components Analysis (PCA)
Slide 1

Objectives
• To introduce Data Mining
• To outline the techniques that we will study in this part of the course – a Data Mining ‘Toolkit’
• To review basic data analysis and the notions of mean, variance and covariance
• To explain Principal Components Analysis (PCA)
• To present an example of PCA
Slide 2

What is Data Mining?
• Mining
– Digging deep into the earth to find hidden, valuable materials
• Data Mining
– Analysis of large data corpora (biomedical, acoustic, video, text, …) to discover structure, patterns and relationships
– Corpora which are too large for human inspection
– Patterns and structure may be hidden
Slide 3

Data Mining
• Structure and patterns in large, abstract data sets:
– Is the data homogeneous or does it consist of several separately identifiable subsets?
– Are there patterns in the data?
– If so, do these patterns have an intuitive interpretation?
– Are there correlations in the data?
– Is there redundancy in the data?
Slide 4

Data Mining
• In this part of the course we will develop a basic ‘data mining toolkit’:
– Subspace projection methods (PCA)
– Clustering
– Statistical modelling
– Sequence analysis
– Dynamic Programming (DP)
Slide 5

Some example data
Fig 1: Single, spherical cluster centred at origin
Fig 2: Single, arbitrary elliptical cluster
Fig 3: Multiple, arbitrary elliptical clusters
Slide 6

Objectives
• Fig 3 shows “multiple source” data. The data is arranged in a set of “clusters”.
• How do we discover the number and locations of the clusters?
• Remember, in real applications there will be many points in a high-dimensional vector space, which is difficult to visualise
Slide 7

Objectives
• Fig 1 shows the simplest type of data – single-source data centred at the origin, with equal variance in both dimensions and no covariance.
• Fig 2 is again single-source, but the data is correlated and skewed and not centred at the origin.
• How do we convert Fig 2 into Fig 1?
• We will start with this problem
• The solution is a technique called Principal Components Analysis (PCA)
Slide 8

Example from speech processing

[Figure: plot of high-frequency energy vs low-frequency energy, for 25 ms speech segments, sampled every 10 ms]

Slide 9

Basic statistics

[Figure: the same scatter plot, annotated with the sample mean, the sample variances in ‘x’ and ‘y’, and the minimum and maximum values of ‘x’ and ‘y’]

Slide 10

Basic statistics
• Denote the samples by X = x_1, x_2, …, x_T, where x_t = (x_{t1}, x_{t2}, …, x_{tN})
• The sample mean vector μ (more correctly, μ(X)) is given by:

\mu_n = \frac{1}{T} \sum_{t=1}^{T} x_{tn}, \qquad \mu = (\mu_1, \mu_2, \ldots, \mu_n, \ldots, \mu_N)
Slide 11

More basic statistics
• The sample variance vector σ² (more correctly, σ²(X)) is given by:

\sigma_n^2 = \frac{1}{T} \sum_{t=1}^{T} (x_{tn} - \mu_n)^2, \qquad \sigma^2 = (\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2)
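A minimal MATLAB sketch of these two statistics on hypothetical data (the matrix X and its sizes are illustrative assumptions, not from the slides):

X = randn(100, 3);   % each row of X is one sample x_t (T = 100, N = 3)
mu = mean(X);        % sample mean vector (1 x N)
v  = var(X, 1);      % sample variance vector with the 1/T normalisation
                     % used above (var(X) alone would use 1/(T-1))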
Slide 12

Covariance
• In this data, as the x value increases, the y value also increases

[Figure: scatter plot in which y increases with x]

• This is (positive) covariance
• If y decreases as x increases, the result is negative covariance
Slide 13

Definition of covariance
• The covariance between the m-th and n-th components of the sample data is defined by:

\sigma_{mn} = \frac{1}{T} \sum_{t=1}^{T} (x_{tm} - \mu_m)(x_{tn} - \mu_n)

• In practice it is useful to subtract the mean μ from each of the data points x_t. The sample mean is then 0 and

\sigma_{mn} = \frac{1}{T} \sum_{t=1}^{T} x_{tm} x_{tn}
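A short MATLAB sketch of this computation, carrying over the hypothetical matrix X from the earlier sketch and assuming a MATLAB with implicit expansion (R2016b or later):

X0 = X - mean(X);       % subtract the mean; mean(X0) is now ~0
T  = size(X0, 1);
C  = (X0' * X0) / T;    % N x N matrix of all the covariances sigma_mn
sigma_12 = C(1, 2);     % covariance between components 1 and 2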
Slide 14

The covariance matrix

\Sigma =
\begin{pmatrix}
\sigma_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,n} & \cdots & \sigma_{1,N} \\
\sigma_{2,1} & \sigma_{2,2} & \cdots & \sigma_{2,n} & \cdots & \sigma_{2,N} \\
\vdots       &              &        & \vdots       &        & \vdots       \\
\sigma_{m,1} & \cdots       & \cdots & \sigma_{m,n} & \cdots & \sigma_{m,N} \\
\vdots       &              &        & \vdots       &        & \vdots       \\
\sigma_{N,1} & \cdots       & \cdots & \cdots       & \cdots & \sigma_{N,N}
\end{pmatrix}
Slide 15

Data with mean subtracted

[Figure: scatter plot of the mean-subtracted data]

Sample covariance matrix:

\Sigma = \begin{pmatrix} 2.96 & 1.9 \\ 1.9 & 1.97 \end{pmatrix}

The positive off-diagonal entries imply positive covariance.
Slide 16

Sample data rotated

[Figure: the same data rotated, so that y now decreases as x increases]

Sample covariance matrix:

\Sigma = \begin{pmatrix} 2.96 & -1.9 \\ -1.9 & 1.97 \end{pmatrix}

The negative off-diagonal entries imply negative covariance.
Slide 17

Data with covariance removed

[Figure: the decorrelated data, elongated along the x axis]

Sample covariance matrix:

\Sigma = \begin{pmatrix} 4.51 & 0 \\ 0 & 0.48 \end{pmatrix}
Slide 18

Principal Components Analysis
• PCA is the technique used above to diagonalise the sample covariance matrix
• The first step is to write the covariance matrix in the form:

\Sigma = U D U^T

where D is diagonal and U is a matrix corresponding to a rotation
• You can do this using SVD (see the lecture on LSI) or an eigenvalue decomposition (a MATLAB sketch follows)
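A minimal sketch of the factorisation in MATLAB, using the sample covariance matrix from the slide above:

S = [2.96 1.9; 1.9 1.97];   % a sample covariance matrix
[U, D] = eig(S);            % eigenvalue decomposition: S = U*D*U'
err = norm(S - U*D*U');     % should be ~0
% svd(S) yields an equivalent factorisation here, since a covariance
% matrix is symmetric and positive semi-definite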
Slide 19

PCA continued

[Figure: the mean-subtracted data with the eigenvector directions e1 and e2 overlaid]

• e1 is the first column of U, e1 = (u_{11}, u_{21})^T, and d_{11} is the variance in the direction e1
• e2 is the second column of U, and d_{22} is the variance in the direction e2
• U implements a rotation through an angle θ

\Sigma = U D U^T =
\begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix}
\begin{pmatrix} d_{11} & 0 \\ 0 & d_{22} \end{pmatrix}
\begin{pmatrix} u_{11} & u_{21} \\ u_{12} & u_{22} \end{pmatrix}
Slide 20

PCA Example
• Abstract data set
Slide 21

PCA Example (continued)
• Step 1: load the data into MATLAB:
– A=load('data4');
• Step 2: calculate the mean and subtract it from each sample:
– M=ones(size(A));
– N=mean(A);
– M(:,1)=M(:,1)*N(1);
– M(:,2)=M(:,2)*N(2);
– B=A-M;
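Equivalently, on MATLAB R2016b or later (where implicit expansion is available), the whole of Step 2 collapses to a one-line sketch:

B = A - mean(A);   % subtract the column-mean vector from every row of A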
• Plot B
Slide 22

PCA Example (continued)

[Figure: plot of the mean-subtracted data B]
Slide 23

PCA Example (continued)
• Calculate the covariance matrix of B (or A):
– S=(B'*B)/size(B,1);
– or
– S=cov(B);
(note that cov(B) normalises by T−1 rather than T; cov(B,1) matches the first formula exactly)

S \approx \begin{pmatrix} 6.78 & 3.27 \\ 3.27 & 2.76 \end{pmatrix}

• It is difficult to deduce much about the data from this covariance matrix
Slide 24

PCA Example (continued)
• Calculate the eigenvalue decomposition of S:
– [U,E]=eig(S);

U \approx \begin{pmatrix} 0.4884 & 0.8726 \\ -0.8726 & 0.4884 \end{pmatrix}, \qquad
E \approx \begin{pmatrix} 0.9307 & 0 \\ 0 & 8.6079 \end{pmatrix}

• After transforming the data using U, its covariance matrix becomes E. You can confirm this by plotting the transformed data:
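A sketch of that check in MATLAB, continuing from the B and U computed above:

C = B * U;                   % rotate the data into the eigenvector basis
cov(C, 1)                    % approximately the diagonal matrix E
plot(C(:,1), C(:,2), '.')    % plot of the transformed data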
Slide 25

PCA Example (continued)

[Figure: plot of the transformed data; most of the variation lies along the new y axis]
Slide 26

PCA Example (continued)
• After transformation by the matrix U, the covariance matrix has been diagonalised and is now equal to E
– variance in the x direction is 0.93
– variance in the y direction is 8.61
• This tells us that most of the variation in the data is contained in the (new) y direction
• There is much less variation in the new x direction, and we could get a 1-dimensional approximation to the data by discarding this dimension
• None of this is obvious from the original covariance matrix
Slide 27

Final notes
• Each column of U is a principal vector (an eigenvector of the covariance matrix)
• The corresponding eigenvalue indicates the variance of the data along that dimension
– Large eigenvalues indicate significant components of the data
– Small eigenvalues indicate that the variation along the corresponding eigenvectors may be noise
• It may be advantageous to ignore the dimensions which correspond to small eigenvalues and only consider the projection of the data onto the most significant eigenvectors – this way the dimension of the data can be reduced (see the sketch below)
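A minimal sketch of such a projection for the two-dimensional example above, where the large eigenvalue (8.61) belongs to the second column of U (the column ordering is whatever eig returned, so check E first):

y = B * U(:, 2);   % 1-D projection onto the most significant eigenvector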
Slide 28

Eigenvalues

[Figure: the eigenvalues of a 90-dimensional data set plotted in decreasing order, from the 1st to the 90th principal component; the few large eigenvalues on the left are the more significant components, and the long tail of small eigenvalues are the insignificant components]
Slide 29

Visualising PCA

[Diagram: the original pattern (blue) is mapped into eigenspace by U; coordinates n → 90 are set to zero; the result is mapped back by U⁻¹ to give the reduced pattern (red)]
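A sketch of this round trip in MATLAB, under the diagram's assumptions: each row of B is a 90-dimensional pattern, U holds the 90 eigenvectors as columns (ordered by decreasing eigenvalue), and n is a chosen cut-off index (all of these names are illustrative):

n = 10;            % hypothetical cut-off: keep the first n-1 coordinates
Y = B * U;         % map each pattern into eigenspace
Y(:, n:end) = 0;   % set coordinates n ... 90 to zero
Bred = Y * U';     % map back: U' = inv(U), since U is orthogonal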
Slide 30
Summary
• Review of basic data analysis (mean, variance and covariance)
• Introduction to Principal Components Analysis (PCA)
• Example of PCA
Slide 31