ECE 657A: Data and Knowledge Modelling and Analysis
Lecture 4: Dimensionality Reduction, Probability,
Feature Selection
Mark Crowley
January 24, 2016
Opening Data Example : Guess the Dataset
??
Class Admin Announcements
Today’s Class
Announcements
Independent Component Analysis
Non-linear Dimensionality Reduction / Manifold Learning
Probability and Statistics Review
Feature Selection
Class Admin Announcements
Announcements
Project Description is still out. Do you have a group yet?
Today: group members’ names emailed to the prof and TA.
Next week (Feb 1): Pitch Session.
Class Admin Pitch Session
Pitch Session : Structure
Your group will be matched with two other groups (yes 9 people per
peer-group).
Your group will have 5 minutes to pitch your idea to two other
groups.
You can use any materials you want: I’ll have flip chart paper, you
can use the chalkboard or whiteboard, you could use a laptop, or you could just
talk.
Then 5-10 minutes of discussion and feedback.
Then the next group pitches.
Class Admin Pitch Session
Pitch Session : Grading
You will not be graded on your pitch quality or your topic.
You will be graded on your feedback for others.
Peer review forms:
Explain the project idea: you need to provide a concise, fairly accurate
explanation of their idea (shows you listened)
Do you think the project is too easy, too hard or just right?
Give some substantial, constructive suggestion for their project.
Explain how your suggestion would improve the project’s ability to
teach the class about something new.
Hand in form by end of class.
Class Admin Updates from last time
Fixing our bias problem
Last week my explanation for bias was a bit confused:
Class Admin Updates from last time
Back to Lecture 3: ICA!
Outline of Probability and Statistics Review
Probability Definitions
Bayes Theorem
Entropy
Probabilistic Distance Metrics
Hypothesis Testing
Probability Review
Factoring Probability Distributions
The joint probability distribution decomposes into a product of individual
probabilities if the variables are independent
Entropy
KL-divergence, Mahalanobis distance
Probability Definitions
Joint and Conditional Probability
Given a random variable X (binary or multi-valued), p(X = x) = p(x) is the probability
of the event that X takes on the value x.
Probability of A or B occurring:
p(A ∨ B) = p(A) + p(B)− p(A ∧ B)
= p(A) + p(B) if A and B are mutually exclusive
Joint Probabilities: Product Rule and Chain Rule
p(A,B) = p(A ∧ B) = p(A|B)p(B)
p(X1,X2, . . . ,XD) = p(X1)p(X2|X1)p(X3|X2,X1) . . . p(XD |X1:D−1)
Probability Definitions
Marginal and Conditional Probability
Marginal Distribution:
p(A) = ∑_b p(A, B = b) = ∑_b p(A|B = b) p(B = b)
Conditional Probability:
p(A|B) = p(A, B) / p(B), if p(B) > 0
“Probability of A given B”
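A minimal numpy sketch of marginalization and conditioning; the joint table below is made up purely for illustration:
```python
import numpy as np

# Hypothetical joint distribution p(A, B) over binary A (rows) and B (columns).
p_AB = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Marginal: p(A) = sum_b p(A, B = b)
p_A = p_AB.sum(axis=1)

# Conditional: p(A | B = b) = p(A, B = b) / p(B = b), defined when p(B = b) > 0
p_B = p_AB.sum(axis=0)
p_A_given_B = p_AB / p_B            # broadcasts over the B columns

print(p_A)                # [0.4 0.6]
print(p_A_given_B[:, 0])  # p(A | B = 0) = [0.6 0.4]
```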
Probability Definitions Bayes Theorem
Bayes Theorem
Given a hypothesis h and observed evidence e:
posterior = (likelihood × prior) / evidence
p(h|e) = p(e|h) p(h) / p(e)
p(cancer|test result) = p(cancer) p(test result|cancer) / p(test result)
Bayesian Statistics vs Frequentist Statistics
Very important for knowing how to update a model based on new
evidence; it also tells you how to turn a p(X|Y) into a p(Y|X).
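A minimal sketch of the cancer-test posterior; the prevalence, sensitivity, and false-positive rate below are made-up numbers purely for illustration:
```python
# Hypothetical numbers: 1% prevalence, 90% sensitivity, 8% false-positive rate.
p_cancer = 0.01
p_pos_given_cancer = 0.90
p_pos_given_healthy = 0.08

# Evidence: p(positive test) via the law of total probability.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Bayes theorem: posterior = likelihood * prior / evidence
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # ~0.102
```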
Probability Definitions Bayes Theorem
Unconditional Independence
If two random variables X and Y are independent, we denote this as X⊥Y
X⊥Y iff p(X ,Y ) = p(X )p(Y )
Entropy
Outline of Probability and Statistics Review
Probability Definitions
Bayes Theorem
Entropy
Probabilistic Distance Metrics
Hypothesis Testing
Entropy
Entropy
H(X) = − ∑_{x∈X} p(x) log2 p(x)
The higher the entropy, the higher the uncertainty about the variable.
It also measures the expected surprise of seeing an observation,
i.e. how much information an observation carries on average.
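A minimal numpy sketch of the entropy formula:
```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: H(X) = -sum_x p(x) log2 p(x)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 log 0 is taken to be 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0 bit (maximal uncertainty for 2 outcomes)
print(entropy([0.9, 0.1]))  # ~0.469 bits
print(entropy([1.0, 0.0]))  # 0.0 bits (no uncertainty)
```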
Entropy
KL-Divergence or Relative Entropy
Kullback-Leibler Divergence (KL-Divergence) is a common method for
measuring the dissimilarity between two probability distributions P and Q.
KL(P||Q) = ∑_{i=1}^{N} P(i) log ( P(i) / Q(i) )
KL(P||Q) ≥ 0, and equals zero iff P = Q
How much information you’d lose by approximating P with Q
In general KL(P||Q) ≠ KL(Q||P)
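A minimal numpy sketch of the KL divergence; the two distributions are made up for illustration:
```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) log(P(i) / Q(i)), in nats.

    Assumes p and q are discrete distributions on the same support
    and that q(i) > 0 wherever p(i) > 0.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(i) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.6]
q = [0.5, 0.5]
print(kl_divergence(p, q))   # ~0.0201
print(kl_divergence(q, p))   # ~0.0204 -- KL is not symmetric
print(kl_divergence(p, p))   # 0.0     -- zero iff P = Q
```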
Probabilistic Distance Metrics
Mahalanobis Distance
Another way to measure the difference between two vectors that accounts for their
distribution:
d(x, y) = √( (x − y)^T S^{-1} (x − y) )
Where x and y share the same distribution and covariance matrix S
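A minimal numpy sketch, assuming the covariance S is estimated from sample data (the toy data here is synthetic):
```python
import numpy as np

# Toy data assumed to share one distribution; S is its sample covariance.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=500)
S_inv = np.linalg.inv(np.cov(data, rowvar=False))

def mahalanobis(x, y, S_inv):
    """d(x, y) = sqrt((x - y)^T S^{-1} (x - y))."""
    d = x - y
    return float(np.sqrt(d @ S_inv @ d))

x, y = data[0], data[1]
print(mahalanobis(x, y, S_inv))
# scipy.spatial.distance.mahalanobis(x, y, S_inv) gives the same value.
```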
Probabilistic Distance Metrics
Mutual Information (MI)
The mutual information (MI) between two vectors X ,Y measures how
similar the joint distribution p(X ,Y ) is to the factored distribution
p(X )p(Y ):
MI(X, Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log ( p(x, y) / (p(x) p(y)) )
MI(X,Y) is always nonnegative
Equals 0 iff X ,Y are independent
Notice this is just the KL-Divergence between the distributions
p(X ,Y ) and p(X )p(Y )
[From [?] ]
Probabilistic Distance Metrics
Relation of MI to Entropy
The entropy H(X) and mutual information are related:
H(X) = − ∑_{x∈X} p(x) log p(x)    (1)
MI(X, Y) = H(X) + H(Y) − H(X, Y)    (2)
MI can be seen as the reduction in entropy of the class labels that results
from observing the feature value x_j
Some measures use MI normalized by the entropy H(X )
[From [?] ]
Probabilistic Distance Metrics
Mutual Information For Feature Ranking
In our case, mutual information is:
MI(X_f, Y) = ∑_{x_fi ∈ X_f} ∑_{y_k ∈ Y} p(x_fi, y_k) log ( p(x_fi, y_k) / (p(x_fi) p(y_k)) )
where x_fi ∈ X_f is the value of feature (column) f for datapoint i
and y_k ∈ Y are the class labels
A feature with high MI may not itself have high probability, but it is better
at identifying the classes.
[From [?] ]
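As a sketch, scikit-learn’s mutual_info_classif estimates MI(X_f, Y) for each feature column; it uses a nearest-neighbour estimator rather than the exact sum above, and the iris data is just for illustration:
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Estimate MI between each feature column and the class labels.
X, y = load_iris(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)

# Rank features by MI (largest first).
ranking = np.argsort(mi)[::-1]
for f in ranking:
    print(f"feature {f}: MI = {mi[f]:.3f}")
```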
Probabilistic Distance Metrics
Information Gain Measure
Another measure you could use is information gain.
IG (Y ,X ) = H(Y )− H(Y |X )
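A small worked sketch of IG(Y, X) = H(Y) − H(Y|X) on a made-up joint count table:
```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint counts of a binary feature X (rows) vs class Y (columns).
counts = np.array([[30, 10],
                   [10, 50]], dtype=float)
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# H(Y) and the conditional entropy H(Y|X) = sum_x p(x) H(Y | X = x)
H_Y = entropy(p_y)
H_Y_given_X = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

print(H_Y - H_Y_given_X)   # IG(Y, X) = H(Y) - H(Y|X), here ~0.26 bits
```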
Hypothesis Testing
Outline of Probability and Statistics Review
Probability Definitions
Bayes Theorem
Entropy
Probabilistic Distance Metrics
Hypothesis Testing
Hypothesis Testing
Hypothesis Testing
Given a known distribution D0 we think produced the data, call this
our null hypothesis (often denoted H0)
Want to ask whether we can reject the null hypothesis given some
observed data.
Say D0 is N(0, 1) a standardized Gaussian and the sample is
x = 2.576.
p(|u| ≤ 2.576) = 0.99: the probability of a sample u taken from
N(0, 1) having absolute value at most 2.576 is 99%.
So we say the difference of the sample x from the assumed
distribution is statistically significant.
We also say that the sample x lets us “reject the null hypothesis at the
0.01 significance level”.
There are many methods for doing this; for discrete data one is the Chi-squared
(χ2) test.
Hypothesis Testing
Chi-squared (χ2) Test
The χ2 statistic can be used to test whether a feature is statistically
significant in predicting a class. For a feature x_f and class y_k we can
formulate the Chi-squared test
χ2(x_f, y) = ∑_{x_fi ∈ X_f} ∑_{y_k ∈ Y} (O_ik − E_ik)^2 / E_ik = N ∑_{x_fi ∈ X_f} ∑_{y_k ∈ Y} p_i p_k ( (O_ik/N − p_i p_k) / (p_i p_k) )^2
O are the observed counts of joint events and E are their expected counts.
χ2 tests the hypothesis that the features and the classes are assigned
randomly and independently.
The higher the value of χ2, the more likely we are to reject the null hypothesis of
independent, random assignment of classes.
Thus, the higher the value of χ2 the more likely this feature f gives a
statistically significant discrimination between the classes.
Hypothesis Testing
Contingency Table
The counts for χ2 can be obtained using a contingency table
           y_k = 1   y_k = 0
x_fi = 1     o11       o12
x_fi = 0     o21       o22
o11 is the number of samples in the class that have the feature
o21 is the number of samples in the class that don’t have the feature
o12 is the number of samples in other classes that have the feature
o22 is the number of samples in other classes that don’t have the feature
Hypothesis Testing
Contingency Table
Observed counts O_ik (rows: x_fi = 1, 0; columns: y_k = 1, 0):
100   70   (row sum = 170)
 60   30   (row sum =  90)
(column sums = 160, 100; N = 260)
Expected counts E_ik:
104.6   65.4
 55.4   34.6
E11 = (o11 + o21)(o11 + o12)/N    E12 = (o12 + o22)(o11 + o12)/N    (3)
E21 = (o11 + o21)(o21 + o22)/N    E22 = (o12 + o22)(o21 + o22)/N    (4)
Hypothesis Testing
Contingency Table
Observed counts O_ik (rows: x_fi = 1, 0; columns: y_k = 1, 0):
100   70
 60   30
Expected counts E_ik:
104.6   65.4
 55.4   34.6
χ2 = (100 − 104.6)^2/104.6 + (70 − 65.4)^2/65.4 + (60 − 55.4)^2/55.4 + (30 − 34.6)^2/34.6 = 1.51    (5)
The number of degrees of freedom here is 1. Now we can use a
Chi-squared lookup table to find the critical value for this number at a
desired significance level. We see that for p = 0.05 we need χ2 > 3.8 to
reject the null hypothesis and claim that our feature is significant. In this
case we can only claim significance at the p = 0.30 level.
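The same test can be checked with scipy (assuming scipy is available); correction=False turns off Yates’ continuity correction so the statistic matches the plain formula above:
```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table from the slide: rows x_fi = 1/0, columns y_k = 1/0.
observed = np.array([[100, 70],
                     [ 60, 30]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)  # ~1.53 (matches the slide's 1.51 up to rounding), p ~0.22, dof = 1
print(expected)            # [[104.6, 65.4], [55.4, 34.6]]
```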
Hypothesis Testing
χ2 Lookup Table
Figure: From wikipedia:Chi-squared Distribution
Feature Selection – Definition
Outline of Feature Selection
Feature Selection – Definition
Definition
General Approach
Feature Ranking
Quality Measures for Feature Ranking
Choosing Feature Subsets
Objective Functions and Distance Measures
Information-Theoretic Measures
Search Strategies
Sequential Forward Selection
Sequential Backward Selection
Sequential Floating Selection
Oscillating Search
Feature Selection – Definition Definition
Dimensionality Reduction and Feature Extraction Methods
Feature Extraction and Feature Selection
Feature Extraction combines original features into a new set of
features by transformation or projection.
Ideally the transformation or projection is chosen according to an
objective that helps find a significant set of new features of lower
dimension than the original set of features.
Sometimes the transformation produces new features that have
some relevant meaning for the data, but in many cases the projection produces
features that are difficult to interpret or relate to the meanings of the
original features.
Feature Selection – Definition Definition
Feature Selection
Material in this section is based on the following references:
1 Petr Somol, Jana Novovicova and Pavel Pudil (2010). Efficient Feature
Subset Selection and Subset Size Optimization, Pattern Recognition Recent
Advances, Adam Herout (Ed.), InTech, Available from: http://www.
intechopen.com/books/pattern-recognition-recent-advances/
efficient-feature-subset-selection-and-subset-size-optimization
2 Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature
selection. J. Mach. Learn. Res., 3, 1157-1182.
3 Luis Carlos Molina, Lluis Belanche, Angela Nebot: Feature Selection
Algorithms: A Survey and Experimental Evaluation. ICDM 2002: 306-313.
4 Jain, A. K.; Duin, R. P. W., and Mao, J. (2000). Statistical pattern
recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell., 22(1), 4-37.
5 Pudil, P.; Novovicova, J., and Kittler, J. (1994). Floating search methods in
feature selection. Pattern Recogn. Lett., 15(11), 1119-1125.
Feature Selection – Definition Definition
Feature Extraction vs. Feature Selection
Feature Extraction: we talked about ways to transform the original
features in the data into new features which have some advantage
such as
lower dimensionality
better description of the variance in the data
better ability to discriminate (distinguish) data points or clusters of
data points
Now we focus on finding a smaller subset of the original features to use.
We still want to reduce dimensionality,
but we also want features that remain interpretable.
Feature Selection – Definition General Approach
General Approach
Search the feature space for a subset of features that optimizes a
selection criterion, an objective function (ie. a quality index or a
classifier output).
Challenge: If we have n samples and d features then there are 2^d
possible subsets of the d features.
Components:
1 Selection Criteria
2 Search Strategy
Feature Selection – Definition Feature Ranking
Feature Ranking
Need to use more feasible methods. Idea: Treat the problem as a search
problem and find ways to reduce the search space.
Feature Ranking
(also called Best Individual Features (BIF) and Naive Selection)
1 Evaluate the individual features according to a quality measure (how
good the feature is for discriminating the class)
2 Sort the features according to their value of the quality measure
3 Select the best m features
Advantage: search complexity is O(m) after sorting; it can be used for a
very large number of features
Problem: features are considered in isolation
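For example, feature ranking can be done with scikit-learn’s SelectKBest, here scoring each feature by mutual information (the dataset is chosen purely for illustration):
```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Score every feature independently, sort, keep the best m (Best Individual Features).
X, y = load_breast_cancer(return_X_y=True)
m = 5
selector = SelectKBest(score_func=mutual_info_classif, k=m).fit(X, y)

print(selector.scores_.round(3))           # per-feature quality scores
print(selector.get_support(indices=True))  # indices of the m selected features
X_reduced = selector.transform(X)          # data restricted to those features
```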
Feature Selection – Definition Feature Ranking
Limitations of Correlation
Figure: From (murphy2012) Fig 2.12
Feature Selection – Definition Quality Measures for Feature Ranking
Quality Measures for Feature Ranking
Some quality measures include:
Mutual Information: between a feature and class
Information Gain
Chi-squared test (χ2)
Choosing Feature Subsets
Outline of Feature Selection
Feature Selection – Definition
Definition
General Approach
Feature Ranking
Quality Measures for Feature Ranking
Choosing Feature Subsets
Objective Functions and Distance Measures
Information-Theoretic Measures
Search Strategies
Sequential Forward Selection
Sequential Backward Selection
Sequential Floating Selection
Oscillating Search
Choosing Feature Subsets
Objective Functions for Feature Subset Selection
Filter Approach: Evaluate the subsets in terms of their information
content, relevance, statistical dependence, their interclass
distance or their estimate of classification error
Wrapper Approach: Choose a learning algorithm to fit a classifier and test it on
different subsets of features. Select the subset that
maximizes the accuracy
Choosing Feature Subsets
Objective Functions for Feature Subset Selection
Embedded Approach: The FS process is integrated with the classification
process. Examples include Decision Tree construction
(which evaluates each feature and selects the highest-scoring feature to
start the tree). More recently, Random Forests also provide an effective
measure of the relevance of a feature.
Hybrid Approach: A recent approach is to combine some of the above
approaches. For example, combining filter and wrapper
approaches to handle high-dimensional data: first use the
filter approach to select several candidate subsets, then use a wrapper
approach to select the best subset
Choosing Feature Subsets
Pros and Cons of Each Approach
Filters:
+ independent of the learning algorithm
+ low complexity and therefore good for large number of features
– selects larger subsets
Wrappers:
+ better accuracy as they are tied to the classifier
+ subset selection is tuned to the learning algorithm (which can limit
generalization unless good training practice is used)
– more complex and can be slow
– dependent on the learning algorithm
Choosing Feature Subsets
Pros and Cons of Each Approach
Embedded:
+ performance is competitive with wrapper
+ fast learning
– tied to the learning process
Choosing Feature Subsets Objective Functions and Distance Measures
Objective Functions for Filter Approach
Distance Based Functions: A good set of features is one that increases
the inter-class distance and reduces the intra-class distance
(Euclidean, Mahalanobis, etc.)
Correlation Based Functions: A good set of features should be correlated
with the relevant class but uncorrelated with each
other
Information-Theoretic Functions: Good features should share maximum
information with the relevant classes
Choosing Feature Subsets Objective Functions and Distance Measures
Distance Based Objective Functions
Goal: Maximize distance between classes.
Suppose we have K classes c1, c2, . . . , cK.
In general, the distance between classes ca and cb is
class-dist(ca, cb) = (1 / (na nb)) ∑_{i=1}^{na} ∑_{j=1}^{nb} dist(X^T_{i,a}, X^T_{j,b})
Choosing Feature Subsets Objective Functions and Distance Measures
Distance Based Objective Functions
The distance could be any distance between two vectors: Euclidean,
absolute, etc.
Vector i from one class to all
vectors j of the other
Sum over all classes
Normalize by size
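A sketch of the inter-class distance criterion above; the two-class data here is synthetic and purely for illustration:
```python
import numpy as np
from scipy.spatial.distance import cdist

def class_distance(Xa, Xb, metric="euclidean"):
    """Average pairwise distance between samples of class a and class b:
    (1 / (na * nb)) * sum_i sum_j dist(x_i_a, x_j_b)."""
    return cdist(Xa, Xb, metric=metric).mean()

def total_interclass_distance(X, y, metric="euclidean"):
    """Sum the class-to-class distances over all pairs of classes."""
    labels = np.unique(y)
    total = 0.0
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            total += class_distance(X[y == a], X[y == b], metric)
    return total

# A feature subset scores well if it makes this quantity large
# (typically traded off against intra-class spread).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(3, 1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
print(total_interclass_distance(X, y))
```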
Choosing Feature Subsets Information-Theoretic Measures
Information-Theoretic Measures
Mutual Information measure:
Measures whether a set of features Fs shares information with the class.
MIc(Fs, c) = (1 / |Fs|) ∑_{fi ∈ Fs} MI(fi, c)
For multi-class problems we can sum over all classes.
A good measure also uses mutual information to capture relevance to the
class, as above, while penalizing redundancy among the features
themselves. Redundancy can be measured by:
MI_Fs = (1 / |Fs|^2) ∑_{fi, fj ∈ Fs} MI(fi, fj)
Then the quality measure becomes:
max_Fs ( MIc(Fs, c) − MI_Fs )
Other measures include information gain based and entropy based
measures
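A rough sketch of this relevance-minus-redundancy score, using sklearn’s MI estimators as stand-ins for the true MI values and restricting the redundancy sum to distinct feature pairs (an assumption on my part):
```python
import numpy as np
from itertools import combinations
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def subset_quality(X, y, subset, random_state=0):
    """Score a feature subset Fs by relevance minus redundancy:
    MIc(Fs, c) - MI_Fs, estimated with sklearn's MI estimators."""
    subset = list(subset)
    # Relevance: average MI between each selected feature and the class.
    relevance = mutual_info_classif(X[:, subset], y,
                                    random_state=random_state).mean()
    if len(subset) < 2:
        return relevance
    # Redundancy: average MI between pairs of selected features
    # (the slide's double sum, restricted here to distinct pairs).
    pair_mi = [
        mutual_info_regression(X[:, [i]], X[:, j], random_state=random_state)[0]
        for i, j in combinations(subset, 2)
    ]
    redundancy = float(np.mean(pair_mi))
    return relevance - redundancy
```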
Search Strategies
Outline of Feature Selection
Feature Selection – Definition
Definition
General Approach
Feature Ranking
Quality Measures for Feature Ranking
Choosing Feature Subsets
Objective Functions and Distance Measures
Information-Theoretic Measures
Search Strategies
Sequential Forward Selection
Sequential Backward Selection
Sequential Floating Selection
Oscillating Search
Search Strategies
Search Strategies
Once you have some metric for ranking or comparing features you still
need a search strategy for finding the best subset of features to use.
Exhaustive Search:
Optimal but prohibitively expensive as the number of features grows.
Example: given 4 features f1, f2, f3, f4, selecting 2 without repetition
gives C(4, 2) = 6 candidate subsets.
In general the complexity is exponential in the number of features.
Search Strategies
Simple Sequential Search Strategies
Strategy: Add or remove features sequentially, no backtracking
Sequential forward selection (SFS)
Sequential backward selection (SBS)
Search Strategies Sequential Forward Selection
Sequential Forward Selection (SFS)
Can use with wrapper (classifier) or filter (objective distance)
1 Start with empty feature set
2 Select the next best feature that maximizes the objective function
3 Update your feature set
4 Terminate if you reached the required number of features, otherwise
repeat the search.
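A minimal SFS sketch; the objective is any user-supplied criterion to maximize (a wrapper’s cross-validated accuracy or a filter score), so the names here are placeholders:
```python
import numpy as np

def sfs(X, y, objective, n_select):
    """Sequential Forward Selection sketch.

    objective(X_subset, y) -> float to maximize; it could wrap a classifier's
    cross-validated accuracy (wrapper) or a filter criterion such as
    inter-class distance or mutual information.
    """
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        # Greedily pick the feature whose addition maximizes the objective.
        scores = {f: objective(X[:, selected + [f]], y) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example filter objective (hypothetical): distance between the two class centroids.
def centroid_gap(X_sub, y):
    c0, c1 = X_sub[y == 0].mean(axis=0), X_sub[y == 1].mean(axis=0)
    return float(np.linalg.norm(c0 - c1))
```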
Search Strategies Sequential Forward Selection
Sequential Forward Selection Example
Search Tree: Values on edges are cumulative total of objective function
for using that feature, larger is better.
Search Algorithm: Use Depth first
or hill climbing or greedy search to
select top three features
1 Select f3 → S = {f3}
2 best remaining feature is
f4 → S = {f3, f4}
3 S = {f3, f4, f1}
Search Strategies Sequential Backward Selection
Sequential Backward Selection (SBS)
Can use with wrapper (classifier) or filter (objective distance)
Start with full set of features
Remove the worst feature (one that when removed will have least
effect)
Update the set of features.
Terminate if the required number of features is reached, otherwise
repeat
Search Strategies Sequential Backward Selection
Sequential Backward Selection Example
Search Tree: Values on edges for feature fi are the total of the objective
function of the subset S − {fi}; larger is better.
Search Algorithm: Use Depth first
or hill climbing or greedy search to
select top three features
1 Start with S = {f1, f2, f3, f4}
2 Removing f2 has least impact so
S = {f1, f3, f4}
3 S = {f3, f4, f1}
Search Strategies Sequential Backward Selection
Pros and Cons of SBS and SFS
Both: once you add or delete a feature you can’t go back, so you can get
stuck in suboptimal answers. This is called nesting
SBS is better if the sought subset is large relative to the original set
(not removing too many features)
SFS is better if subset is much smaller than original set of features
Search Strategies Sequential Floating Selection
Sequential Forward Floating Selection
Avoids nesting problem, does backtracking.
Combines SFS and SBS; it is a generalization of the (l, r) method
Algorithm: Assume that we have already selected k features and
our current set is FS with objective value obj(FS)
1 Inclusion step: using SFS, select the next best feature fnew, giving the new set
FS = FS ∪ {fnew}, with |FS| = k + 1
2 Conditional removal step:
find the least significant feature flow ∈ FS
If flow is fnew then keep it, and go to step 1.
Otherwise:
remove flow: FS = FS − {flow}, so that |FS| = k again.
Go to step 2.
Search Strategies Sequential Floating Selection
SBFS
Start the algorithm with an empty set of features and perform 2 steps of
SFS.
Needs a termination criterion and a way to avoid loops
The SBFS algorithm is the dual of the SFFS algorithm: step 1 is
removal, step 2 is conditional addition, and step 3 is continuation of
conditional addition.
SFFS and SBFS give good performance compared to SFS and SBS
Search Strategies Oscillating Search
Oscillating Search (OS)
Oscillating Search (OS) starts from a set of features of the desired
size and then oscillates between cycles of down-swing and up-swing.
A down-swing removes and then adds back features to improve the
objective.
An up-swing adds and then removes features.
The algorithm uses an oscillating depth parameter (the number of
features to be replaced in each cycle).
The depth is increased after an unsuccessful cycle, reset to 1 after
a successful cycle, and the search terminates when the depth exceeds some
threshold.
The search oscillates around the desired number of features until it
finds a better set, but doesn’t waste time evaluating subsets with fewer
features.
Search Strategies Oscillating Search
Other Search Methods
These are often ‘meta-heuristic’ methods
Branch and Bound (optimal if objective is monotonic, but in worst
case cost is exponential)
A* with admissible heuristic (optimal, worst case still exponential)
Genetic algorithms (very flexible but slow to converge; can get stuck in
local minima)
Simulated annealing
Ant Colony Optimization
Search Strategies Oscillating Search
Review and Guidelines
Feature Extraction:
Feature Selection:
Selection criteria: independent scoring and ranking, Mutual
Information, Chi-squared, filter/wrapper based evaluation
Search strategy: SFS, SBS, SFFS, Oscillating Search
How many features to keep? A good rule of thumb is to have at least
10F samples, where F is the number of features.
References
[Dunham, Data Mining Intro and Advanced Topics, 2003]
Margaret Dunham, Data Mining Introductory and Advanced Topics,
ISBN:0130888923, Prentice Hall, 2003.
[Han,Kamber and Pei. Data Mining, 2011]
Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts
and Techniques, 3rd ed, Morgan Kaufmann Publishers, May 2011.
[Duda, Pattern Classification, 2001]
R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (2nd
ed.), John Wiley and Sons, 2001.
[Jain and Dubes. Algs for Clustering Data, 1988]
A. K. Jain and R.C. Dubes, Algorithms for Clustering Data, ISBN:
0-13-022278-x, Prentice Hall, 1988.
[Cohen,Empirical Methods for Artificial Intelligence, 1995]
P. Cohen, Empirical Methods for Artificial Intelligence,
ISBN:0-262-03225-2, MIT Press, 1995.
[Ackoff, From Data to Wisdom, 1989]
Ackoff, From Data to Wisdom, Journal of Applied Systems Analysis,
1989.
[Sima and Dougherty, 2008]
Sima, C. and Dougherty, E. R. The Peaking Phenomenon in the
Presence of Feature Selection. Pattern Recognition Letters, 29,
1667-1674, 2008.
[Zhu and Ghodsi, 2006]
Mu Zhu, Ali Ghodsi, Automatic dimensionality selection from the
scree plot via the use of profile likelihood, Computational Statistics &
Data Analysis, 51, 918-930, 2006.
[Cox, 2000]
Trevor Cox and M.A.A Cox, Multidimensional Scaling, Chapman and
Hall/CRC, Second Edition, 2000.
[Murphy, 2012]
Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT
Press, 2012.
Summary
Independent Component Analysis
Non-linear Dimensionality Reduction / Manifold Learning
Probability and Statistics Review
Feature Selection
Naive Bayes Classification
Closing Data Example
?