ECE 657A: Data and Knowledge Modelling and Analysis
Lecture 3: Parameter Estimation, Dimensionality Reduction,
Feature Extraction/Selection
Mark Crowley
January 18, 2016
Opening Data Example: Guess the Dataset
??
Today’s Class
Announcements
Parameter Estimation
Dimensionality Reduction
Feature Extraction
Today’s Class
Announcements
Project Description
Lecture 2 slides
Part IV: Parameter Estimation
Lecture 3 slides: Feature Extraction
Linear Methods: PCA, LDA
Nonlinear Methods
Announcements
Volunteer note taker needed
Project Description is out. What do you think?
Back to Lecture 2: Parameter Estimation!
Outline of Dimensionality Reduction Methods
Data Reduction Overview
Feature Extraction
Feature Selection
Motivation: Data Reduction
Principal Component Analysis
Overview of PCA
Implementing PCA
Linear Discriminant Analysis
Separation Measures
Fisher Linear Discriminant
Independent Component Analysis
Nonlinear Methods For Dimensionality Reduction
Global Methods – Multidimensional Scaling
Isomap
Locally Linear Embedding (LLE)
Feature Extraction vs. Feature Selection
Given a dataset X: an N × D matrix.
We can extract or transform new features F to describe the variation, distances, proximity, etc., in the data, such that |F| < D.
Or we can select from the existing features the ones that are the most representative set F to describe X, with |F| < D.
Feature Extraction Methods
Feature Extraction combines the original features into a new set of features by transformation or projection.
The projection is chosen according to some objective function or concept:
the objective should find the most significant set of new features,
which are used as a lower-dimensional subspace of the original feature space.
Interpretation:
Transformation can produce new features that have an interpretable meaning for the data,
but many projections produce features that are difficult to relate to the original features.
Methods of Feature Extraction
Linear methods:
Unsupervised (no class label information, e.g. PCA, ICA)
Supervised (class labels known, e.g. LDA)
Nonlinear methods:
Global (preserves global properties, e.g. MDS, Isomap, Kernel PCA)
Local (preserves properties within a local neighborhood, e.g. LLE)
Feature Selection
Feature Selection is an attempt to find a subset of the original features that satisfies some criteria.
Ideally the selected subset includes the significant features and eliminates irrelevant and redundant features.
The selected features, being a subset of the original ones, are interpretable and keep their original meaning.
Approaches include:
Feature Ranking
Subset Selection, which includes:
the Wrapper approach and the Filter approach
Both use search strategies such as:
Exhaustive
Sequential (Forward or Backward)
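As a rough, hedged illustration of the wrapper approach with a sequential forward search (not part of the original slides), the sketch below wraps a k-NN classifier with scikit-learn's SequentialFeatureSelector; the dataset and parameter values are placeholders chosen for illustration only.

```python
# Sketch: wrapper-style sequential forward selection (assumed example,
# not from the lecture). A classifier is wrapped by a forward search that
# greedily adds the feature giving the best cross-validated score.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)          # 178 samples, 13 features

knn = KNeighborsClassifier(n_neighbors=5)
sfs = SequentialFeatureSelector(
    knn,
    n_features_to_select=5,   # target |F| < D
    direction="forward",      # or "backward" for backward elimination
    cv=5,
)
sfs.fit(X, y)

print("Selected feature indices:", sfs.get_support(indices=True))
X_reduced = sfs.transform(X)               # N x 5 reduced data matrix
```

A filter approach would instead rank features by a classifier-independent score before any model is trained, trading accuracy of the criterion for a much cheaper search.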
Dimensionality Reduction
Can be seen as a preprocessing step that may be required for:
Visualization of the data.
Finding the intrinsic dimensionality (features)
Many applications have a large number of features that may be redundant, irrelevant or sparse (e.g. text documents where the features are words).
If the number of samples is much smaller than the number of dimensions, then learning at the desired resolution becomes difficult (curse of dimensionality).
Reducing costs of training and learning algorithms
Reducing storage and future measurement costs
Motivation: The Curse of Dimensionality
A term introduced by Richard Bellman to illustrate the explosion in the number of samples required as the number of dimensions grows.
The number of samples required to fully represent a 10-valued function of one dimension is 10.
For 20 dimensions with the same number of values (10 data points in each dimension) we would need $10^{20}$ samples.
The number of samples grows exponentially with the number of dimensions.
This poses a problem for non-parametric algorithms such as most
classification algorithms that require many training samples.
Motivation: The Curse of Dimensionality
Learn a 1-D function from data: how many data points are needed? N points.
What if the data is better explained by a 2-D function? Then N² points are needed.
Motivation: Peaking Phenomenon
For a fixed number of samples, increasing the number of features may degrade the performance (e.g. genetics data: very high dimensional, small data sample).
The classic solution is to assume a fixed ordering of the features and only choose the most important ones.
A newer approach: use classification algorithms to reduce the feature space (see http://ift.tt/2iN7cA8 [From [7]]).
[Figure: classification accuracy and error plotted against the number of features; performance peaks at the desired number of features and degrades beyond it.]
Principal Component Analysis
A way to linearly transform a set of d-dimensional vectors x̄1, x̄2, . . . , x̄n into another set of m-dimensional vectors.
The transform has the property that most of the information content is stored in the first few dimensions, so we can choose m < d.
The main idea is that high information corresponds to high variance (more discriminating).
The direction of max variance is parallel to the eigenvector
corresponding to the largest eigenvalue of the covariance matrix of
the sample matrix A.
[From [4] ]
Intuitive View
Find new axes which are better, in terms of variance and errors, than the original ones.
PCA is equivalent to minimizing the mean-square error. It also maximizes the scatter of the projected data.
See the Jain and Dubes book [From [4]] for the derivation.
Visual Explanation:
http://setosa.io/ev/principal-component-analysis/
Implementing PCA
Let R be the d × d covariance matrix of A.
A is normalized by subtracting the mean:
$x'_{ij} = x_{ij} - \bar{x}_j, \quad i = 1, \ldots, n; \; j = 1, \ldots, d$
R is symmetric positive definite, so its eigenvalues are real and positive.
Now we apply an orthogonal transformation to R to diagonalize it:
$C R C^T = \Lambda_d \qquad (1)$
where $\Lambda_d$ is a diagonal matrix of the d eigenvalues of R and C is the orthogonal matrix whose rows are the eigenvectors of R.
Implementing PCA
We can sort the eigenvalues of R such that
$\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_d \ge 0 \qquad (2)$
and $\hat{c}_1, \hat{c}_2, \hat{c}_3, \ldots, \hat{c}_d$ are the corresponding eigenvectors, called the Principal Components (axes).
To reduce the dimensions and keep a large percentage of the variance we select the first m eigenvalues and eigenvectors.
Let
$H_m = \begin{bmatrix} \hat{c}_1^T \\ \hat{c}_2^T \\ \vdots \\ \hat{c}_m^T \end{bmatrix}$
be an m × d matrix.
Implementing PCA
Then each sample is projected as
$\bar{y}_i = H_m \bar{x}_i, \quad i = 1, 2, \ldots, n$
with dimensions $(m \times 1) = (m \times d)(d \times 1)$.
The projected matrix $B_m$ is
$B_m = \begin{bmatrix} \bar{y}_1^T \\ \bar{y}_2^T \\ \vdots \\ \bar{y}_n^T \end{bmatrix} = \begin{bmatrix} \bar{x}_1^T \\ \bar{x}_2^T \\ \vdots \\ \bar{x}_n^T \end{bmatrix} H_m^T = A H_m^T$
where
$\bar{x}_k^T = [x_{k1}, x_{k2}, \ldots, x_{kd}], \qquad \bar{y}_k^T = [y_{k1}, y_{k2}, \ldots, y_{km}]$
Implementing PCA
The covariance matrix in the new space can be written as
$\frac{1}{n} B_m^T B_m = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i \bar{y}_i^T = H_m R H_m^T = H_m (C^T \Lambda C) H_m^T = (H_m C^T) \Lambda (H_m C^T)^T = \Lambda_m = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$
since
$H_m C^T = \begin{bmatrix} \hat{c}_1^T \\ \hat{c}_2^T \\ \vdots \\ \hat{c}_m^T \end{bmatrix} \begin{bmatrix} \hat{c}_1 & \hat{c}_2 & \cdots & \hat{c}_d \end{bmatrix} = \begin{bmatrix} I_m & 0 \end{bmatrix}$
which means the m new features are uncorrelated.
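A minimal numpy sketch of the steps above, assuming A is an n × d data matrix with samples as rows; the toy data, variable names and the 1/n convention for R are my own choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy n x d data

# 1. Normalize by subtracting the feature means (zero-mean columns).
A = A - A.mean(axis=0)

# 2. Covariance matrix R (d x d) and its eigendecomposition.
n, d = A.shape
R = (A.T @ A) / n
eigvals, eigvecs = np.linalg.eigh(R)       # eigh: ascending order for symmetric R

# 3. Sort eigenvalues/eigenvectors in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the first m principal components: H_m has the eigenvectors as rows.
m = 2
H_m = eigvecs[:, :m].T                     # m x d
B_m = A @ H_m.T                            # n x m projected data

# 5. The covariance in the new space is diag(lambda_1, ..., lambda_m),
#    i.e. the new features are uncorrelated.
print(np.round(B_m.T @ B_m / n, 6))
print(np.round(eigvals[:m], 6))
```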
Sum of Eigenvalues
The sum of the eigenvalues of R is the sample variance in the new space.
One would choose m such that
$r_m = \left( \sum_{i=1}^{m} \lambda_i \right) \Big/ \left( \sum_{i=1}^{d} \lambda_i \right) \ge \tau$
for some threshold $\tau < 1$.
e.g. Choosing τ = 0.95 ensures that 95% of the variance is retained in the new space.
One way to find the right value of m is to use a scree plot.
The scree plot can be done in different ways:
simply plot the eigenvalues of the components (in descending order) and look for a gap or a knee in the plot;
or plot rm as a function of m and look for a knee in the curve;
or plot the cumulative variance.
[From [8] ]
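Continuing the previous sketch (it assumes eigvals is the descending eigenvalue vector computed there), one way to pick m from the retained-variance ratio rm and a threshold τ:

```python
import numpy as np

# eigvals: eigenvalues of R sorted in descending order (see the previous sketch)
tau = 0.95
r = np.cumsum(eigvals) / np.sum(eigvals)   # r_m for m = 1..d
m = int(np.searchsorted(r, tau) + 1)       # smallest m with r_m >= tau
print("retained-variance ratios:", np.round(r, 3))
print("choose m =", m)

# Scree-plot style inspection: plot eigvals (or r) against the component index
# and look for a knee or gap; a matplotlib call is sketched here.
# import matplotlib.pyplot as plt
# plt.plot(range(1, len(eigvals) + 1), eigvals, marker="o"); plt.show()
```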
Scree Plot (Descending Eigenvalues)
[Figure: scree plot of the eigenvalues of components PC1–PC7, in descending order.]
Scree Plot (Normalized Eigenvalues)
[Figure: normalized scree plot for components PC1–PC7, showing eigenvalue/sum and the cumulative ratio rm.]
Cost of Computation
The covariance matrix $R = A^T A$ is a d × d matrix, and d (features) may be much larger than n (samples).
Using this approach directly could then be too computationally expensive.
Singular Value Decomposition
The matrix A can be decomposed using singular value decomposition into
$A_{n \times d} = U S V^T$, where:
U is an n × d matrix of orthonormal columns ($U^T U = I$), the left singular vectors;
V is a d × d matrix of orthonormal columns ($V^T V = I$), the right singular vectors;
S is a d × d diagonal matrix of singular values.
PCA Using Singular Value Decomposition
SVD can be used to obtain PCA.
Now $A A^T = (U S V^T)(V S U^T) = U S^2 U^T$
and $A^T A = (V S U^T)(U S V^T) = V S^2 V^T$
which leads to the following facts:
The singular values are the square roots of the eigenvalues of the covariance matrix.
The right singular vectors are the eigenvectors of the covariance matrix.
So the SVD gives us the d eigenvalues (ordered) as well as the principal components.
Now we can reduce the dimensions by selecting the largest m.
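A hedged sketch of the same reduction via SVD, reusing the centred matrix A and the eigenvalues eigvals from the earlier sketch; the 1/n scaling matches the covariance convention used there.

```python
import numpy as np

# A: n x d, already zero-mean in each column (see the earlier sketch).
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T

# The rows of Vt (right singular vectors) are the eigenvectors of A^T A,
# and s**2 are its eigenvalues; dividing by n matches R = A^T A / n above.
eigvals_from_svd = s**2 / A.shape[0]
print(np.allclose(eigvals_from_svd, eigvals))      # True up to numerics

m = 2
H_m = Vt[:m]          # m x d projection matrix of the top principal components
B_m = A @ H_m.T       # n x m scores, same as before up to a sign per axis
```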
Interpretation of PCA
PCA is also called Karhunen-Loeve transform (KLT) or the Hotelling
transform.
The transformed points ȳ1, ȳ2, . . . , ȳn are sometimes called scores
The eigenvectors or principal components c̄1, c̄2, . . . , c̄d are called the loadings and are represented by coefficients.
The eigenvalues of the covariance matrix are sometimes called the latent values.
The Hotelling T² statistic measures the distance from the mean.
Interpretation of PCA
PCA is optimal in the sense of minimizing the sum of squared errors.
It obtains the maximum-variance projection by finding orthogonal linear combinations of the original variables.
It mainly rotates the coordinates (for zero-mean data) to find new axes that have the maximum variance.
It de-correlates the axes. For uni-modal Gaussian data this amounts to finding independent axes.
For data that doesn't follow this distribution, the axes may not necessarily be independent.
The principal components may not be the best for discriminating between classes.
Whitening (Sphering)
The goal is to have features that are
less correlated with each other, so that each represents something independently important, and
all of the same variance (a kind of normalization).
A Whitening Transformation:
given a vector of random variables with a known covariance matrix,
we want a linear transformation into a set of new variables such that
the covariance matrix is the identity matrix (i.e. they are uncorrelated)
and all have variance 1.
The transformation changes the input vector into a "white noise" vector.
Performing Whitening
Shift the data to zero mean (subtract the mean from each feature).
Compute the eigenvectors U and the corresponding eigenvalues (the diagonal of S) using SVD.
Project the data using U and S to obtain the whitened data points.
See http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/ for an example using MATLAB.
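A small numpy sketch of these three steps; the eps regularizer, the function name and the toy data are my additions (see the UFLDL tutorial above for the original MATLAB version).

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Whiten X (n x d): zero mean and (approximately) identity covariance."""
    X = X - X.mean(axis=0)                            # 1. shift to zero mean
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # 2. principal axes via SVD
    var = s**2 / X.shape[0]                           # per-component variance
    X_pca = X @ Vt.T                                  # 3a. rotate (decorrelate)
    X_white = X_pca / np.sqrt(var + eps)              # 3b. equalize the variances
    return X_white

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
Xw = pca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))               # close to the identity
```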
Linear Discriminant Analysis (LDA)
If we know the labels of the data (classes) we can use a supervised
approach such as discriminant analysis.
Emphasis on finding projections that best discriminate the data in
lower dimensions.
Intuition: Maximize the between-class scatter while holding fixed the
within-class scatter.
[From [3] ]
PCA vs. LDA
[Figure: two-class data in the (x1, x2) plane projected onto two different directions: one axis is better in terms of variance and errors (PCA), the other is better for discrimination (LDA). From Duda, Hart and Stork's book [3].]
Fisher Linear Discriminant
FLD is a special case of LDA.
Consider the 2-class problem: we have n samples x, each of d dimensions, divided into two classes, where
C1 has n1 samples and
C2 has n2 samples.
We want to project these onto a line w such that the projected n points y (each a scalar, of one dimension) are divided into two groups, D1 and D2, where
$y = w^T x$
Our goal is to have a projection that well separates the projected points into two classes.
Separation Measures
Use the difference between the means of the projected points of each group. The means of the original sample points in each class are
$m_i = \frac{1}{n_i} \sum_{x \in C_i} x, \quad i = 1, 2$
and the means of the projected points are
$m'_i = \frac{1}{n_i} \sum_{y \in D_i} y = \frac{1}{n_i} \sum_{x \in C_i} w^T x = w^T m_i$
Simple Mean Difference
$m_i = \frac{1}{n_i} \sum_{x \in C_i} x, \quad i = 1, 2 \qquad\qquad m'_i = w^T m_i$
Note that the original sample means are vectors of the means of the features for the samples in each class.
$|m'_1 - m'_2| = |w^T (m_1 - m_2)|$
We can find the w that maximizes this difference.
However, this difference can be made large simply by scaling w, so we need to normalize it.
Within-Class Scatter
Fisher suggested normalizing the difference by the within-class scatter.
The scatter of the projected points is defined as
$s_i^2 = \sum_{y \in D_i} (y - m'_i)^2$
$s_1^2 + s_2^2$ is called the within-class scatter of the projected points (n times the variance).
Define the sample within-class scatter of the original samples:
$S_W = S_1 + S_2, \qquad S_i = \sum_{x \in C_i} (x - m_i)(x - m_i)^T$
Between-Class Scatter
Substituting for y and m':
$s_1^2 + s_2^2 = w^T S_W w$
and similarly
$(m'_1 - m'_2)^2 = w^T S_B w$
where $S_B = (m_1 - m_2)(m_1 - m_2)^T$ is the between-class scatter.
So the problem can now be stated as finding the w that maximizes
$\frac{w^T S_B w}{w^T S_W w}$
Between-Class Scatter
Maximize
$\frac{w^T S_B w}{w^T S_W w}$
The solution w needs to satisfy $S_W^{-1} S_B w = \lambda w$.
In general this is an eigenvalue problem, but in the case of 2 classes $S_B w$ is always in the direction of $(m_1 - m_2)$, so the solution reduces to $w \propto S_W^{-1}(m_1 - m_2)$.
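A small numpy sketch of the two-class case, computing $w \propto S_W^{-1}(m_1 - m_2)$ on made-up Gaussian data (the data and class parameters are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(loc=[0, 0], scale=[2.0, 0.5], size=(100, 2))   # class C1
X2 = rng.normal(loc=[3, 1], scale=[2.0, 0.5], size=(120, 2))   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)          # within-class scatter of C1
S2 = (X2 - m2).T @ (X2 - m2)          # within-class scatter of C2
S_W = S1 + S2

# For 2 classes S_B w is parallel to (m1 - m2), so the solution reduces to:
w = np.linalg.solve(S_W, m1 - m2)
w /= np.linalg.norm(w)                # the scale of w is irrelevant, so normalize

y1, y2 = X1 @ w, X2 @ w               # projected (1-D) points
print("projected class means:", y1.mean(), y2.mean())
```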
Generalization of FLD to K classes
Given K classes, we find a projection into a (K − 1)-dimensional subspace.
So the problem reduces to finding a (K − 1) × d projection matrix W, whose rows are the projection directions, such that the projected sample points
$y = W x$
are well separated.
Find the projection matrix W by maximizing the between-group
scatter while holding the within-group scatter constant.
[From [4] ]
Generalization of FLD to K classes
Let there be K classes, each of size $n_\ell$, for $\ell \in [1, K]$.
Let the points in the $\ell$-th group be the vectors
$\bar{x}_1^{(\ell)}, \ldots, \bar{x}_{n_\ell}^{(\ell)}$
where
$\bar{x}_j^{(\ell)} = [x_{j1}^{(\ell)}, \ldots, x_{jd}^{(\ell)}]^T$
The mean of the $i$-th feature for the $\ell$-th group is
$m_i^{(\ell)} = \frac{1}{n_\ell} \sum_{j=1}^{n_\ell} x_{ji}^{(\ell)}$
Generalization to K classes
The vector of feature means for the $\ell$-th group is then
$\bar{m}^{(\ell)} = [m_1^{(\ell)}, m_2^{(\ell)}, \ldots, m_d^{(\ell)}]^T$
and the pooled mean m is the mean vector over all samples:
$m = \frac{1}{n} \sum_{\ell=1}^{K} n_\ell \, \bar{m}^{(\ell)}, \qquad n = \sum_{\ell=1}^{K} n_\ell$
Scatter Matrix
The scatter matrix is
$S = \sum_{\ell=1}^{K} \sum_{j=1}^{n_\ell} (\bar{x}_j^{(\ell)} - m)(\bar{x}_j^{(\ell)} - m)^T$
The scatter matrix of the $\ell$-th group is
$S^{(\ell)} = \sum_{j=1}^{n_\ell} (\bar{x}_j^{(\ell)} - \bar{m}^{(\ell)})(\bar{x}_j^{(\ell)} - \bar{m}^{(\ell)})^T$
Within-Group and Between-Group Scatter
The within-group scatter matrix is then
$S_W = \sum_{\ell=1}^{K} S^{(\ell)}$
and the between-group scatter matrix $S_B$ is
$S_B = \sum_{\ell=1}^{K} \sum_{j=1}^{n_\ell} (\bar{m}^{(\ell)} - m)(\bar{m}^{(\ell)} - m)^T = \sum_{\ell=1}^{K} n_\ell \, \bar{m}^{(\ell)} (\bar{m}^{(\ell)})^T - n \, m \, m^T$
Combining Scatter Matrices
$S = S_B + S_W$
The projection is chosen to maximize $S_B$ while keeping $S_W$ constant.
The solution gives the rows of the projection matrix W as the K − 1 eigenvectors of $S_W^{-1} S_B$ whose eigenvalues are non-zero.
Reducing dimensions: to project to t ≤ (K − 1) dimensions, use the eigenvectors of the largest t eigenvalues.
LDA vs PCA: LDA is a supervised method. It is fast, and eigenvector-based like PCA, but generally better than PCA for classification. It is limited to K − 1 components.
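A hedged sketch of the K-class recipe above: build $S_W$ and $S_B$ as defined on the previous slides and take the top K − 1 generalized eigenvectors (scipy's eigh is used here; in practice scikit-learn's LinearDiscriminantAnalysis wraps the same idea). The toy data and function name are my own.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, t=None):
    """Rows of the returned W (t x d) project X onto the top LDA axes."""
    classes = np.unique(y)
    K, d = len(classes), X.shape[1]
    t = (K - 1) if t is None else t                # at most K-1 useful axes
    m = X.mean(axis=0)                             # pooled mean
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)             # within-group scatter
        S_B += len(Xc) * np.outer(mc - m, mc - m)  # between-group scatter
    # Generalized eigenproblem S_B w = lambda S_W w; keep the largest eigenvalues.
    vals, vecs = eigh(S_B, S_W)
    order = np.argsort(vals)[::-1][:t]
    return vecs[:, order].T                        # t x d projection matrix W

# Usage sketch with a toy 3-class dataset (illustrative only):
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(k, 1.0, size=(50, 4)) for k in range(3)])
y = np.repeat([0, 1, 2], 50)
W = lda_projection(X, y)
Y = X @ W.T                                        # n x (K-1) projected samples
```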
Independent Component Analysis (ICA)
PCA uses second-order statistics (i.e. mean and covariance).
Linear projection methods that use higher-order statistics can handle non-Gaussian distributed data, e.g. ICA and Projection Pursuit.
ICA was proposed for blind source separation of signals ("the cocktail party problem").
It finds projections that are independent but may not be orthogonal.
Each source can have any non-Gaussian distribution.
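A hedged sketch of blind source separation in the cocktail-party spirit, using scikit-learn's FastICA on synthetic mixtures (the signals and mixing matrix are invented for illustration); running PCA on the same mixtures gives the kind of "PCA estimate" panels shown on the following slides.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(4)
t = np.linspace(0, 8, 500)

# Three non-Gaussian sources: sinusoid, square wave, sawtooth-like signal.
S = np.c_[np.sin(2 * t),
          np.sign(np.sin(3 * t)),
          2 * (t % 1) - 1]
S += 0.05 * rng.normal(size=S.shape)       # a little observation noise

A = rng.normal(size=(3, 3))                # unknown mixing matrix
X = S @ A.T                                # observed mixtures (n_samples x n_signals)

ica = FastICA(n_components=3, random_state=0)
S_ica = ica.fit_transform(X)               # independent components (up to order/scale)

pca = PCA(n_components=3)
S_pca = pca.fit_transform(X)               # orthogonal, max-variance components
# Plotting S, X, S_ica and S_pca reproduces the truth / observed / estimate panels.
```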
PCA vs. ICA
[Figure: the three source signals ("truth", top panels) and the observed mixed signals (bottom panels), each plotted over 500 samples.]
Figure: Truth vs. Observed
PCA vs. ICA
[Figure: the three source signals ("truth") and the corresponding PCA estimates, each plotted over 500 samples.]
Figure: Truth vs. PCA
PCA vs. ICA
[Figure: the three source signals ("truth") and the corresponding ICA estimates, each plotted over 500 samples.]
Figure: Truth vs. ICA
Projection Pursuit
How do we choose each projection dimension?
Projection pursuit uses a measure of interestingness of a projection.
Interestingness is a measure of some aspect of not being Gaussian (such as entropy).
It tries to find a projection that maximizes this measure.
Data is reduced by removing components along this projection.
Projection pursuit extracts projections one at a time such that each extracted signal is as non-Gaussian as possible.
The Gaussian distribution is the maximum-entropy distribution, so we can maximize the negative entropy (negentropy).
This is equivalent to MLE up to a sign change and the addition of a constant.
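To make the interestingness measure concrete, here is a hedged sketch that scores candidate unit directions by an approximation to negentropy, J(y) ≈ (E[G(y)] − E[G(ν)])² with G(u) = log cosh(u) and ν standard normal (the contrast function popularized by FastICA), and keeps the best direction found by a crude random search; the search strategy and names are mine, for illustration only.

```python
import numpy as np

def negentropy_score(y, rng):
    """Approximate negentropy of a roughly zero-mean, unit-variance signal y:
    J(y) ~ (E[G(y)] - E[G(nu)])^2 with G(u) = log cosh(u), nu ~ N(0, 1)."""
    g_y = np.log(np.cosh(y)).mean()
    g_gauss = np.log(np.cosh(rng.standard_normal(y.size))).mean()  # Monte-Carlo E[G(nu)]
    return (g_y - g_gauss) ** 2

def most_interesting_projection(X, n_candidates=2000, seed=0):
    """Crude projection-pursuit search: among random unit directions, keep the
    one whose 1-D projection of (already whitened) X looks least Gaussian."""
    rng = np.random.default_rng(seed)
    best_w, best_j = None, -np.inf
    for _ in range(n_candidates):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        j = negentropy_score(X @ w, rng)
        if j > best_j:
            best_w, best_j = w, j
    return best_w, best_j

# Usage (illustrative): w, j = most_interesting_projection(Xw)  # Xw: whitened data
```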
Implementing ICA
Maximum Likelihood Estimation formulation:
$x_t = W z_t + \epsilon_t$
where xt =