

ECE 657A: Data and Knowledge Modelling and Analysis
Lecture 4: Dimensionality Reduction, Probability,

Feature Selection

Mark Crowley

January 24, 2016


Opening Data Example : Guess the Dataset

??


Class Admin Announcements

Today’s Class

Announcements

Independent Component Analysis

Non-linear Dimensionality Reduction / Manifold Learning

Probability and Statistics Review

Feature Selection


Class Admin Announcements

Announcements

The Project Description is out. Do you have a group yet?

Today: Group members should be emailed to the prof and TA.

Next week (Feb 1): Pitch Session.


Class Admin Pitch Session

Pitch Session : Structure

Your group will be matched with two other groups (yes, 9 people per
peer-group).

Your group will have 5 minutes to pitch your idea to the two other
groups.

You can use any materials you want: I'll have flip chart paper, some
groups can use the chalkboard or whiteboard, you could use a laptop, or you
could just talk.

Then 5-10 minutes of discussion and feedback.

Then the next group pitches.


Class Admin Pitch Session

Pitch Session : Grading

You will not be graded on your pitch quality or your topic.

You will be graded on your feedback for others.

Peer review forms:

Explain the project idea: you need to provide a concise, fairly accurate
explanation of their idea (shows you listened)
Do you think the project is too easy, too hard or just right?
Give some substantial, constructive suggestion for their project.
Explain how your suggestion would improve the project’s ability to
teach the class about something new.
Hand in the form by the end of class.


Class Admin Updates from last time

Fixing our bias problem

Last week my explanation of bias was a bit confused.


Class Admin Updates from last time

Back to Lecture 3: ICA!


Outline of Probability and Statistics Review

Probability Definitions
Bayes Theorem

Entropy

Probabilistic Distance Metrics

Hypothesis Testing


Probability Review

Factoring Probability Distributions

The joint probability decomposes into a product of probabilities if the
variables are independent

Entropy

KL-divergence, Mahalanobis distance


Probability Definitions

Joint and Conditional Probability

Given a random variable X (binary or multi-valued), p(X = x) = p(x) is the
probability of the event that X takes on the value x.
Probability of A or B occurring:

p(A ∨ B) = p(A) + p(B)− p(A ∧ B)
= p(A) + p(B) if A and B are mutually exclusive

Joint Probabilities: Product Rule and Chain Rule

p(A,B) = p(A ∧ B) = p(A|B)p(B)
p(X1,X2, . . . ,XD) = p(X1)p(X2|X1)p(X3|X2,X1) . . . p(XD |X1:D−1)


Probability Definitions

Marginal and Conditional Probability

Marginal Distribution:

p(A) = Σ_b p(A, B = b) = Σ_b p(A | B = b) p(B = b)

Conditional Probability:

p(A | B) = p(A, B) / p(B)   if p(B) > 0

“Probability of A given B”


Probability Definitions Bayes Theorem

Bayes Theorem

Given a hypothesis h and observed evidence e:

posterior = (likelihood × prior) / evidence

p(h | e) = p(e | h) p(h) / p(e)

p(cancer | test result) = p(test result | cancer) p(cancer) / p(test result)

Bayesian Statistics vs Frequentist Statistics

Bayes Theorem is very important: it tells you how to update a model based on new
evidence, and also how to turn a p(X | Y) around into a p(Y | X)
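As a quick numeric illustration (not from the slides), here is a minimal Python sketch of the cancer/test example; the prevalence and test accuracies are made-up numbers:

```python
# Bayes' theorem on the cancer/test example with made-up illustrative numbers.
p_cancer = 0.01             # prior p(h)
p_pos_given_cancer = 0.90   # likelihood p(e|h): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# Evidence p(e) by marginalizing over the hypothesis: sum_h p(e|h) p(h)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)

# Posterior p(h|e) = p(e|h) p(h) / p(e)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))  # ~0.154: still low, despite the positive test
```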


Probability Definitions Bayes Theorem

Unconditional Independence

If two random variables X and Y are independent we denote it as
X ⊥ Y

X⊥Y iff p(X ,Y ) = p(X )p(Y )


Entropy

Outline of Probability and Statistics Review

Probability Definitions
Bayes Theorem

Entropy

Probabilistic Distance Metrics

Hypothesis Testing


Entropy

Entropy

H(X) = − Σ_{x∈X} p(x) log_2 p(x)

The higher the entropy, the higher the uncertainty about the variable.

It also measures the surprise of seeing an observation.

Equivalently, it measures how much information is represented by an observation.
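A minimal Python sketch of this definition (the two example distributions are made up):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 log 0 = 0
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin, maximum uncertainty
print(entropy([0.99, 0.01]))  # ~0.08 bits: almost no uncertainty or surprise
```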


Entropy

KL-Divergence or Relative Entropy

Kullback-Leibler Divergence (KL-Divergence) is a common method for
measuring the dissimilarity between two probability distributions P and Q.

KL(P||Q) = Σ_{i=1}^{N} P(i) log ( P(i) / Q(i) )

KL(P||Q) ≥ 0 and equals zero iff P = Q

It measures how much information you lose when Q is used to approximate P

In general KL(P||Q) ≠ KL(Q||P)
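A small Python sketch of the definition, showing both properties on made-up distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) = sum_i P(i) log( P(i) / Q(i) ); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                        # terms with p(i) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, p))                       # 0.0: identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))  # ~0.37 vs ~0.51: not symmetric
```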


Probabilistic Distance Metrics

Mahalanobis Distance

Another way to measure the difference between two vectors that accounts for
their distribution:

d(x, y) = sqrt( (x − y)^T S^{-1} (x − y) )

where x and y share the same distribution, with covariance matrix S
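A minimal numpy sketch with an assumed 2-D covariance matrix, showing how the covariance rescales the distance compared to the Euclidean one:

```python
import numpy as np

def mahalanobis(x, y, S):
    """Mahalanobis distance sqrt( (x - y)^T S^-1 (x - y) ) for covariance S."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

S = np.array([[4.0, 0.0],      # variance 4 along the first axis,
              [0.0, 1.0]])     # variance 1 along the second
x, y = np.array([2.0, 0.0]), np.array([0.0, 0.0])

print(np.linalg.norm(x - y))   # Euclidean distance: 2.0
print(mahalanobis(x, y, S))    # Mahalanobis distance: 1.0, since the gap lies
                               # along a direction with high variance
```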


Probabilistic Distance Metrics

Mutual Information (MI)

The mutual information (MI) between two vectors X ,Y measures how
similar the joint distribution p(X ,Y ) is to the factored distribution
p(X )p(Y ):

MI(X, Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log ( p(x, y) / ( p(x) p(y) ) )

MI(X,Y) is always nonnegative

Equals 0 iff X ,Y are independent

Notice this is just the KL-Divergence between the distributions
p(X ,Y ) and p(X )p(Y )

[From Murphy, 2012]
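A small Python sketch that computes MI directly from a joint probability table; the two example tables (independent vs. perfectly dependent binary variables) are made up:

```python
import numpy as np

def mutual_information(joint, base=2):
    """MI(X,Y) from a joint table p(x,y); equals KL( p(X,Y) || p(X)p(Y) )."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)    # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)    # marginal p(y)
    mask = joint > 0                         # zero cells contribute 0
    ratio = joint / (px * py)
    return float(np.sum(joint[mask] * np.log(ratio[mask])) / np.log(base))

print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))  # 0.0: independent
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))          # 1.0 bit: fully dependent
```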


Probabilistic Distance Metrics

Relation of MI to Entropy

The entropy H(X) and mutual information are related:

H(X) = − Σ_{x∈X} p(x) log p(x)   (1)

MI(X, Y) = H(X) + H(Y) − H(X, Y)   (2)

MI can be seen as the reduction in entropy of the labels that results
from observing the feature value x_j

Some measures use MI normalized by the entropy H(X)

[From Murphy, 2012]


Probabilistic Distance Metrics

Mutual Information For Feature Ranking

In our case, the mutual information is:

MI(X_f, Y) = Σ_{x_fi ∈ X_f} Σ_{y_k ∈ Y} p(x_fi, y_k) log ( p(x_fi, y_k) / ( p(x_fi) p(y_k) ) )

where x_fi ∈ X_f is the value of feature (column) f for datapoint i,
and y_k ∈ Y are the class labels.

A feature with high MI may not have high probability, but it is better
at identifying the classes.

[From Murphy, 2012]


Probabilistic Distance Metrics

Information Gain Measure

Another measure you could use is information gain.

IG (Y ,X ) = H(Y )− H(Y |X )


Hypothesis Testing

Outline of Probability and Statistics Review

Probability Definitions
Bayes Theorem

Entropy

Probabilistic Distance Metrics

Hypothesis Testing


Hypothesis Testing

Hypothesis Testing

Given a known distribution D_0 that we think produced the data; call this
our null hypothesis (often denoted H_0).

We want to ask whether we can reject the null hypothesis given some
observed data.

Say D_0 is N(0, 1), a standardized Gaussian, and the observed sample is
x = 2.576.

p(|u| ≤ 2.576) = 0.99: the probability that a sample u taken from
N(0, 1) is less than 2.576 in magnitude is 99%.

So we say the difference of the sample x from the assumed
distribution is statistically significant.

We also say that the sample x lets us “reject the null hypothesis at the
0.01 significance level”.

There are many methods for doing this; for discrete data one is the Chi-squared
(χ2) Test.
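A minimal Python sketch of this example, assuming scipy is available:

```python
from scipy.stats import norm

# Two-sided test of the observed sample x = 2.576 against H0: u ~ N(0, 1).
x = 2.576
p_value = 2 * (1 - norm.cdf(abs(x)))   # P(|u| >= 2.576) under the null
print(round(p_value, 3))               # ~0.01

alpha = 0.01
print(p_value <= alpha)                # True: reject H0 at the 0.01 level
```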


Hypothesis Testing

Chi-squared (χ2) Test

χ2 statistics can be used to test whether a feature is statistically
significant in predicting a class. For a feature x_f and class y_k we can
formulate the Chi-squared test:

χ2(X_f, Y) = Σ_{x_fi ∈ X_f} Σ_{y_k ∈ Y} (O_ik − E_ik)^2 / E_ik
           = N Σ_{x_fi ∈ X_f} Σ_{y_k ∈ Y} p_i p_k ( (O_ik / N − p_i p_k) / (p_i p_k) )^2
O are the observed counts of joint events and E are their expected counts.

χ2 tests the hypothesis that the features and the classes are assigned
randomly and independently.

The higher the value of χ2, the more likely we are to reject the null hypothesis of
independent, random assignment of classes.

Thus, the higher the value of χ2, the more likely this feature f gives a
statistically significant discrimination between the classes.


Hypothesis Testing

Contingency Table

The counts for χ2 can be obtained using a contingency table:

           y_k = 1   y_k = 0
x_fi = 1     o11       o12
x_fi = 0     o21       o22

o11 is the number of samples in the class that have the feature

o21 is the number of samples in the class that don't have the feature

o12 is the number of samples in other classes that have the feature

o22 is the number of samples in other classes that don't have the feature


Hypothesis Testing

Contingency Table

Observed counts O_ik with their marginal totals:

              y_k = 1   y_k = 0   row total
x_fi = 1        100        70        170
x_fi = 0         60        30         90
column total    160       100      N = 260

Expected counts E_ik:

              y_k = 1   y_k = 0
x_fi = 1       104.6      65.4
x_fi = 0        55.4      34.6

E11 = (o11 + o21)(o11 + o12)/N    E12 = (o12 + o22)(o11 + o12)/N   (3)

E21 = (o11 + o21)(o21 + o22)/N    E22 = (o12 + o22)(o21 + o22)/N   (4)


Hypothesis Testing

Contingency Table

Using the observed counts O_ik and expected counts E_ik from the table above:

χ2 = (100 − 104.6)^2 / 104.6 + (70 − 65.4)^2 / 65.4 + (60 − 55.4)^2 / 55.4 + (30 − 34.6)^2 / 34.6 ≈ 1.52   (5)

The number of degrees of freedom here is 1. Now we can use a
Chi-squared lookup table to find the critical value for this statistic at a
desired significance level. We see that for p = 0.05 we need χ2 > 3.84 to
reject the null hypothesis and claim that our feature is significant. In this
case we can only claim significance at the p = 0.30 level.
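A small Python sketch that reproduces the table above; with unrounded expected counts the statistic comes out at about 1.53 (the value above used rounded E_ik), and scipy (assumed available) gives the exact p-value:

```python
import numpy as np
from scipy.stats import chi2

# Observed counts O_ik from the contingency table above (rows: x_fi, cols: y_k).
O = np.array([[100.0, 70.0],
              [ 60.0, 30.0]])
N = O.sum()                                         # 260 samples in total
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N      # expected counts E_ik

chi2_stat = ((O - E) ** 2 / E).sum()
dof = (O.shape[0] - 1) * (O.shape[1] - 1)           # 1 degree of freedom

print(np.round(E, 1))                      # [[104.6  65.4] [ 55.4  34.6]]
print(round(chi2_stat, 2))                 # 1.53
print(round(chi2.sf(3.84, dof), 2))        # 0.05: 3.84 is the p=0.05 critical value
print(round(chi2.sf(chi2_stat, dof), 2))   # 0.22: cannot reject at the 0.05 level
```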

Hypothesis Testing

χ2 Lookup Table

Figure: From wikipedia:Chi-squared Distribution


Feature Selection – Definition

Outline of Feature Selection

Feature Selection – Definition
Definition
General Approach
Feature Ranking
Quality Measures for Feature Ranking

Choosing Feature Subsets
Objective Functions and Distance Measures
Information-Theoretic Measures

Search Strategies
Sequential Forward Selection
Sequential Backward Selection
Sequential Floating Selection
Oscillating Search


Feature Selection – Definition Definition

Dimensionality Reduction and Feature Extraction Methods

Feature Extraction and Feature Selection

Feature Extraction combines original features into a new set of
features by transformation or projection.

Ideally the transformation or projection follows an objective that helps
find a significant set of new features of lower dimension than the
original set of features.

Sometimes the transformation produces new features that have some
relevant meaning for the data, but in many cases the projection produces
features that are difficult to interpret or to relate to the meanings of the
original features.


Feature Selection – Definition Definition

Feature Selection

Material in this section is based on the following references:

1 Petr Somol, Jana Novovicova and Pavel Pudil (2010). Efficient Feature
Subset Selection and Subset Size Optimization, Pattern Recognition Recent
Advances, Adam Herout (Ed.), InTech, Available from: http://www.
intechopen.com/books/pattern-recognition-recent-advances/

efficient-feature-subset-selection-and-subset-size-optimization

2 Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature
selection. J. Mach. Learn. Res., 3, 1157–1182.

3 Luis Carlos Molina, Lluis Belanche, Angela Nebot: Feature Selection
Algorithms: A Survey and Experimental Evaluation. ICDM 2002: 306-313

4 Jain, A. K.; Duin, R. P. W., and Mao, J. (2000). Statistical pattern
recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell., 22(1), 4–37.

5 Pudil, P.; Novovicova, J., and Kittler, J. (1994). Floating search methods in
feature selection. Pattern Recogn. Lett., 15(11), 1119–1125.



Feature Selection – Definition Definition

Feature Extraction vs. Feature Selection

Feature Extraction: we talked about ways to transform the original
features in the data into new features which have some advantage
such as

lower dimensionality
better description of the variance in the data
better ability to discriminate (distinguish) data points or clusters of
data points

Now we focus on finding a smaller set of the original features for use.

still want to reduce dimensionality
but also want to keep interpretable features


Feature Selection – Definition General Approach

General Approach

Search the feature space for a subset of features that optimizes a
selection criterion, an objective function (i.e. a quality index or a
classifier output).

Challenge: If we have n samples and d features then there are 2^d
possible subsets of the d features.

Components:
1 Selection Criteria
2 Search Strategy


Feature Selection – Definition Feature Ranking

Feature Ranking

Need to use more feasible methods. Idea: Treat the problem as a search
problem and find ways to reduce the search space.
Feature Ranking
(also called Best Individual Features (BIF) and Naive Selection)

1 Evaluate the individual features according to a quality measure (how
good the feature is for discriminating the class)

2 Sort the features according to their value of the quality measure

3 Select the best m features

Advantage: search complexity is O(m) after sorting, so it can be used for a
very large number of features.

Problem: features are considered in isolation.
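A minimal Python sketch of the ranking procedure; X, y and score_fn are placeholders for your data matrix, labels, and chosen quality measure:

```python
import numpy as np

def rank_features(X, y, score):
    """Best Individual Features: score each column on its own and sort."""
    scores = np.array([score(X[:, f], y) for f in range(X.shape[1])])
    return np.argsort(scores)[::-1]      # feature indices, best first

# Usage sketch: keep the m best columns according to score_fn
# (e.g. mutual information or the chi-squared statistic).
# top_m = rank_features(X, y, score_fn)[:m]
# X_reduced = X[:, top_m]
```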


Feature Selection – Definition Feature Ranking

Limitations of Correlation

Figure: From [Murphy, 2012], Fig. 2.12


Feature Selection – Definition Quality Measures for Feature Ranking

Quality Measures for Feature Ranking

Some quality measures include:

Mutual Information: between a feature and class

Information Gain

Chi-squared test (χ2)


Choosing Feature Subsets

Outline of Feature Selection

Feature Selection – Definition
Definition
General Approach
Feature Ranking
Quality Measures for Feature Ranking

Choosing Feature Subsets
Objective Functions and Distance Measures
Information-Theoretic Measures

Search Strategies
Sequential Forward Selection
Sequential Backward Selection
Sequential Floating Selection
Oscillating Search


Choosing Feature Subsets

Objective Functions for Feature Subset Selection

Filter Approach: Evaluate the subsets in terms of their information
content, relevance, statistical dependence, their interclass
distance or their estimate of classification error

Wrapper Approach: Choose a learning algorithm to fit a classifier and test it on
different subsets of features. Select the subset that
maximizes the accuracy.


Choosing Feature Subsets

Objective Functions for Feature Subset Selection

Embedded Approach: The FS process is integrated with the classification
process. Examples include decision tree construction
(each feature is evaluated and the highest-scoring feature is selected to
start the tree). Random Forests also provide an effective
measure of the relevance of a feature.

Hybrid Approach: A recent approach is to combine some of the above
approaches. For example, combining filter and wrapper
approaches to handle high-dimensional data by first using the
filter approach to select candidate subsets, then using a wrapper
approach to select the best subset.


Choosing Feature Subsets

Pros and Cons of Each Approach

Filters:

+ independent of the learning algorithm

+ low complexity and therefore good for large number of features

– selects larger subsets

Wrappers:

+ better accuracy as they are tied to the classifier

+ subset selection is tuned to the learning algorithm (which can limit
generalization unless good training practice is used)

– more complex and can be slow

– dependent on the learning algorithm


Choosing Feature Subsets

Pros and Cons of Each Approach

Embedded:

+ performance is competitive with wrapper

+ fast learning

– tied to the learning process


Choosing Feature Subsets Objective Functions and Distance Measures

Objective Functions for Filter Approach

Distance Based Functions: A good set of features is one that increases
the inter-class distance and reduces the intra-class distance
(Euclidean, Mahalanobis, etc.)

Correlation Based Functions: Good set of features should be correlated
with relevant class but should be uncorrelated with each
other

Information-Theoretic Functions: Good features should share maximum
information with the relevant classes


Choosing Feature Subsets Objective Functions and Distance Measures

Distance Based Objective Functions

Goal: Maximize the distance between classes.

Suppose we have K classes c_1, c_2, . . . , c_K.

In general, the distance between classes c_a and c_b is

class-dist(c_a, c_b) = (1 / (n_a n_b)) Σ_{i=1}^{n_a} Σ_{j=1}^{n_b} dist(X^T_{i,a}, X^T_{j,b})
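A minimal numpy sketch of this average inter-class distance, using Euclidean distance and made-up 2-D data for the two classes:

```python
import numpy as np

def class_distance(Xa, Xb):
    """Average pairwise distance between the samples of classes a and b:
    (1 / (n_a n_b)) * sum_i sum_j dist(x_i, x_j), with Euclidean dist here."""
    Xa, Xb = np.asarray(Xa, dtype=float), np.asarray(Xb, dtype=float)
    diffs = Xa[:, None, :] - Xb[None, :, :]      # all n_a x n_b difference vectors
    return float(np.linalg.norm(diffs, axis=2).mean())

Xa = np.array([[0.0, 0.0], [1.0, 0.0]])          # samples of class a
Xb = np.array([[5.0, 0.0], [6.0, 0.0]])          # samples of class b
print(class_distance(Xa, Xb))                    # 5.0
```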


Choosing Feature Subsets Objective Functions and Distance Measures

Distance Based Objective Functions

The distance could be any distance
between two vectors: Euclidean,
absolute value, etc.

Vector i from one class to all
vectors j of the other

Sum over all classes

Normalize by size


Choosing Feature Subsets Information-Theoretic Measures

Information-Theoretic Measures

Mutual Information measure:

Measures whether a set of features F_s shares information with the class:

MI_c(F_s, c) = (1 / |F_s|) Σ_{f_i ∈ F_s} MI(f_i, c)

For multi-class problems we can sum over all classes.

A good measure is also to use mutual information to measure
relevance to the class as above, but also to penalize redundancy of the
features with each other. This can be measured by:

MI_{F_s} = (1 / |F_s|^2) Σ_{f_i, f_j ∈ F_s} MI(f_i, f_j)

Then the quality measure becomes:

max_{F_s} ( MI_c(F_s, c) − MI_{F_s} )

Other measures include information-gain-based and entropy-based
measures.
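A minimal Python sketch of this relevance-minus-redundancy measure; Fs (the candidate feature columns), the labels c, and the mi estimator are all placeholders:

```python
import numpy as np

def subset_quality(Fs, c, mi):
    """MI_c(Fs, c) - MI_Fs, where mi(a, b) is any mutual-information estimate."""
    relevance = np.mean([mi(f, c) for f in Fs])                        # MI_c(Fs, c)
    redundancy = np.sum([[mi(fi, fj) for fj in Fs] for fi in Fs]) / len(Fs) ** 2
    return relevance - redundancy

# Usage sketch: keep the candidate subset with the largest quality value.
# best = max(candidate_subsets,
#            key=lambda Fs: subset_quality(Fs, labels, mi_estimate))
```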


Search Strategies

Outline of Feature Selection

Feature Selection – Definition
Definition
General Approach
Feature Ranking
Quality Measures for Feature Ranking

Choosing Feature Subsets
Objective Functions and Distance Measures
Information-Theoretic Measures

Search Strategies
Sequential Forward Selection
Sequential Backward Selection
Sequential Floating Selection
Oscillating Search


Search Strategies

Search Strategies

Once you have some metric for ranking or comparing features you still
need a search strategy for finding the best subset of features to use.
Exhaustive Search:

Optimal but prohibitively expensive as the number of features grows.

Example: given 4 features f1, f2, f3, f4, selecting 2 without repetition
means evaluating C(4, 2) = 6 subsets.

In general the number of subsets, and hence the search complexity, grows
exponentially with the number of features.


Search Strategies

Search Strategies Sequential Simple Strategy

Strategy: Add or remove features sequentially, no backtracking

Sequential forward selection (SFS)

Sequential backward selection (SBS)


Search Strategies Sequential Forward Selection

Sequential Forward Selection (SFS)

Can use with wrapper (classifier) or filter (objective distance)

1 Start with empty feature set

2 Select the next best feature that maximizes the objective function

3 Update your feature set

4 Terminate if you reached the required number of features, otherwise
repeat the search.
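A minimal Python sketch of SFS with a generic, set-valued objective to maximize; the objective is a placeholder for whichever filter or wrapper criterion you choose:

```python
def sfs(all_features, objective, m):
    """Sequential Forward Selection: greedily add the feature whose addition
    gives the largest objective(subset), until m features are selected."""
    selected = []
    while len(selected) < m:
        remaining = [f for f in all_features if f not in selected]
        best = max(remaining, key=lambda f: objective(selected + [f]))
        selected.append(best)
    return selected

# Usage sketch: `objective` could be a filter criterion (e.g. inter-class
# distance on those columns) or a wrapper (cross-validated classifier accuracy).
# chosen = sfs(list(range(num_features)), my_objective, m=3)
```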


Search Strategies Sequential Forward Selection

Sequential Forward Selection Example

Search Tree: Values on edges are cumulative total of objective function
for using that feature, larger is better.

Search Algorithm: Use depth-first,
hill-climbing, or greedy search to
select the top three features:

1 Select f3 → S = {f3}

2 The best remaining feature is
f4 → S = {f3, f4}

3 Then f1 → S = {f3, f4, f1}


Search Strategies Sequential Backward Selection

Sequential Backward Selection (SBS)

Can use with wrapper (classifier) or filter (objective distance)

Start with full set of features

Remove the worst feature (the one whose removal has the least
effect)

Update the set of features.

Terminate if the required number of features is reached, otherwise
repeat


Search Strategies Sequential Backward Selection

Sequential Backward Selection Example

Search Tree: Values on edges for feature fi are the value of the objective
function on the subset S − {fi}; larger is better.

Search Algorithm: Use Depth first
or hill climbing or greedy search to
select top three features

1 Start with S = {f1, f2, f3, f4}
2 Removing f2 has least impact so

S = {f1, f3, f4}
3 S = {f3, f4, f1}


Search Strategies Sequential Backward Selection

Pros and Cons of SBS and SFS

Both: once you add or delete a feature you can't go back, so you can get
stuck in suboptimal answers. This is called nesting.

SBS is better if the sought subset is large relative to the original set
(not removing too many features)

SFS is better if subset is much smaller than original set of features


Search Strategies Sequential Floating Selection

Sequential Forward Floating Selection

Avoids the nesting problem by allowing backtracking.

Combines SFS and SBS; it is a generalization of the (ℓ, r) method.

Algorithm: Input: assume that we have already selected k features and
our current set is F_S with objective value obj(F_S).

1 Inclusion step: using one SFS step, select the next feature f_new, so the new
set is F_S = F_S ∪ {f_new}, with |F_S| = k + 1.

2 Conditional removal step:
find the least significant feature f_low ∈ F_S.
If f_low is f_new then keep it, and go to step 1.
Otherwise:

remove f_low, F_S = F_S − {f_low}, so that |F_S| = k again.
Go to step 2.
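A minimal Python sketch along these lines; the objective is again a placeholder, and the bookkeeping of the best subset seen at each size is one way (an assumption, not stated on the slides) to decide when a conditional removal counts as an improvement:

```python
def sffs(all_features, objective, m):
    """Sequential Forward Floating Selection sketch: inclusion (one SFS step)
    followed by conditional exclusion of the least significant feature."""
    selected, best = [], {}            # best objective value seen per subset size
    while len(selected) < m:
        # Step 1 -- inclusion: ordinary SFS step.
        remaining = [f for f in all_features if f not in selected]
        f_new = max(remaining, key=lambda f: objective(selected + [f]))
        selected.append(f_new)
        k = len(selected)
        best[k] = max(best.get(k, float("-inf")), objective(selected))
        # Step 2 -- conditional exclusion: drop the least significant feature
        # while that beats the best subset of the smaller size seen so far.
        while len(selected) > 2:
            f_low = max(selected,
                        key=lambda f: objective([g for g in selected if g != f]))
            if f_low == f_new:
                break                  # newest feature is least significant: keep it
            reduced = [g for g in selected if g != f_low]
            if objective(reduced) <= best.get(len(reduced), float("-inf")):
                break                  # removal would not improve that subset size
            selected = reduced
            best[len(selected)] = objective(selected)
    return selected
```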


Search Strategies Sequential Floating Selection

SBFS

Start the algorithm with an empty set of features and perform 2 steps of
SFS.

It needs a termination criterion and a way to avoid loops.

The SBFS algorithm is the dual of the SFFS algorithm: step 1 becomes
removal, step 2 becomes conditional addition, and step 3 is the continuation
of conditional addition.

SFFS and SBFS give good performance compared to SFS and SBS


Search Strategies Oscillating Search

Oscillating Search (OS)

Oscillating Search (OS) starts from a set of features of the desired
size and then oscillates between cycles of down-swings and up-swings.

A down-swing removes and then adds back features to improve the
objective.

An up-swing adds and then removes features.

The algorithm uses an oscillating depth parameter (the number of
features to be replaced in the cycles).

The depth is increased after an unsuccessful cycle, reset to 1 after
a successful cycle, and the search terminates when the depth exceeds some
threshold.

The search oscillates around the desired number of features until it
finds a better set, but doesn't waste time evaluating subsets with fewer
features.


Search Strategies Oscillating Search

Other Search Methods

These are often ’meta-heuristic’ methods

Branch and Bound (optimal if objective is monotonic, but in worst
case cost is exponential)

A* with admissible heuristic (optimal, worst case still exponential)

Genetic algorithms (very flexible but slow to converge, gets stuck in
local minima)

Simulated annealing

Ant Colony Optimization


Search Strategies Oscillating Search

Review and Guidelines

Feature Extraction:

Feature Selection:

Selection criteria: independent scoring and ranking, Mutual
Information, Chi-squared, filter/wrapper based evaluation

Search strategy: SFS, SBS, SFFS, Oscillating Search

How many features to keep? A good rule of thumb is to have
samples ≥ 10F, where F is the number of features.


Search Strategies Oscillating Search

[Dunham, Data Mining Intro and Advanced Topics, 2003]
Margaret Dunham, Data Mining Introductory and Advanced Topics,
ISBN:0130888923, Prentice Hall, 2003.

[Han,Kamber and Pei. Data Mining, 2011]
Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts
and Techniques, 3rd ed, Morgan Kaufmann Publishers, May 2011.

[Duda, Pattern Classification, 2001]
R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (2nd
ed.), John Wiley and Sons, 2001.

[Jain and Dubes. Algs for Clustering Data, 1988]
A. K. Jain and R.C. Dubes, Algorithms for Clustering Data, ISBN:
0-13-022278-x, Prentice Hall, 1988.

[Cohen,Empirical Methods for Artificial Intelligence, 1995]
P. Cohen, Empirical Methods for Artificial Intelligence,
ISBN:0-262-03225-2, MIT Press, 1995.

[Ackoff, From Data to Wisdom, 1989]

Ackoff, From Data to Wisdom, Journal of Applied Systems Analysis,
1989.

[Sima and Dougherty, 2008]
Sima, C. and Dougherty, E. R. The Peaking Phenomenon in the
Presence of Feature Selection. Pattern Recognition Letters, 29,
1667–1674, 2008.

[Zhu and Ghodsi, 2006]
Mu Zhu, Ali Ghodsi, Automatic dimensionality selection from the
scree plot via the use of profile likelihood, Computational Statistics &
Data Analysis, 51, 918–930, 2006.

[Cox, 2000]
Trevor Cox and M.A.A Cox, Multidimensional Scaling, Chapman and
Hall/CRC, Second Edition, 2000.

[Murphy, 2012]
Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT
Press, 2012.


Search Strategies Oscillating Search

Summary

Independent Component Analysis

Non-linear Dimensionality Reduction / Manifold Learning

Probability and Statistics Review

Feature Selection

Naive Bayes Classification


Search Strategies Oscillating Search

Closing Data Example

?

