
CS373 Data Mining and
Machine Learning

Lecture 7
Jean Honorio

Purdue University

(originally prepared by Tommi Jaakkola, MIT CSAIL)

Today’s topics
• Preface: regression

-  linear regression, kernel regression
• Feature selection

-  information ranking, regularization, subset selection

Linear regression
• We seek to learn a mapping from inputs to continuous-valued outputs (e.g., price, temperature)

• The mapping is assumed to be linear in the feature space, so that the predicted output is given by ŷ = θᵀφ(x), where φ(x) is the feature vector

• Assuming that the noise in the observed outputs is additive zero-mean Gaussian, we can obtain the parameters from the training samples by minimizing

  J(θ) = Σ_{i=1}^n (y_i − θᵀφ(x_i))² + λ‖θ‖²

  (the sum over the training examples of the squared prediction loss on each example, plus the regularization term)

• The regularization term guarantees that any unused parameter dimensions are set to zero
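As an illustration (not from the original slides), here is a minimal NumPy sketch of this estimator: with a feature matrix Phi whose rows are φ(x_i), the minimizer of the objective above has the closed form θ̂ = (ΦᵀΦ + λI)⁻¹Φᵀy. The function name ridge_fit and the synthetic data are illustrative assumptions.

import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize sum_i (y_i - theta^T phi(x_i))^2 + lam * ||theta||^2.

    Closed form: theta = (Phi^T Phi + lam * I)^{-1} Phi^T y.
    """
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Tiny synthetic check (illustrative only)
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))                       # 50 examples, 3 feature coordinates
theta_true = np.array([1.0, -2.0, 0.5])
y = Phi @ theta_true + 0.1 * rng.normal(size=50)     # additive zero-mean Gaussian noise
theta_hat = ridge_fit(Phi, y, lam=0.1)
print(theta_hat)                                     # close to theta_true for small noise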

Linear regression
• We can easily obtain non-linear regression functions by considering different feature mappings

[Figure: regression fits obtained with linear, 3rd order, 5th order, and 7th order polynomial feature mappings]
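A minimal sketch of such polynomial feature mappings (the helper poly_features, the ridge penalty value, and the synthetic data are illustrative assumptions, not from the slides): the same regularized least-squares fit is reused on the mapped features.

import numpy as np

def poly_features(x, degree):
    """Map a scalar input x to the feature vector (x, x^2, ..., x^degree)."""
    return np.vander(x, N=degree + 1, increasing=True)[:, 1:]  # drop the constant column

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)             # non-linear target (illustrative)

for degree in (1, 3, 5, 7):                                    # linear, 3rd, 5th, 7th order fits
    Phi = poly_features(x, degree)
    theta = np.linalg.solve(Phi.T @ Phi + 0.01 * np.eye(degree), Phi.T @ y)
    mse = np.mean((Phi @ theta - y) ** 2)
    print(degree, round(mse, 4))                               # training error shrinks with degree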

Linear regression solution

• The unique solution lies in the span of the feature vectors (this is due to the regularization term), i.e., θ̂ = Σ_{i=1}^n α_i φ(x_i) for some coefficients α_i (the dual parameters)

Dual linear regression
• The dual parameters α are obtained as the solution to a linear equation involving the Gram (kernel) matrix K, where K_{ij} = φ(x_i)ᵀφ(x_j):

  (K + λI) α = y

• The predicted output for a new input x is given by a kernel expansion:

  ŷ(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x) = Σ_{i=1}^n α_i K(x_i, x)
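A minimal NumPy sketch of this dual (kernel) regression. The RBF kernel, its parameter gamma, the penalty lam, and the function names are assumptions made purely for illustration.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix with entries K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))                     # illustrative training inputs
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)     # dual parameters: (K + lam I) alpha = y

X_new = np.array([[0.3], [-0.7]])
y_pred = rbf_kernel(X_new, X) @ alpha                    # prediction: sum_i alpha_i K(x, x_i)
print(y_pred)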

Today’s topics
• Preface: regression

-  linear regression, kernel regression
• Feature selection

-  information ranking, regularization, subset selection

Coordinate selection
• Linear models

  classification:  ŷ = sign(θᵀφ(x)) = sign( Σ_j θ_j φ_j(x) )
  regression:      ŷ = θᵀφ(x) = Σ_j θ_j φ_j(x)
  etc.; each parameter θ_j weights a single feature coordinate φ_j(x)

• We seek to identify a few feature coordinates that the class label or regression output primarily depends on

• This is often advantageous in order to improve generalization (as feature selection exerts additional complexity control) or to gain interpretability

Simple coordinate selection
• There are a number of different approaches to coordinate selection

• Information analysis
  - rank individual features according to their mutual information with the class label
  - limited to discrete variables

• 1-norm regularization
  - replaces the two-norm regularizer with a 1-norm regularizer that encourages some of the coordinates to be set exactly to zero

• Iterative subset selection
  - iteratively add (or prune) coordinates based on their impact on the (training) error

Information analysis
• Suppose the feature vector is just the input vector x, whose coordinates take values in {1,…,k}

• Given a training set of size n, we can evaluate an empirical estimate of the mutual information between the coordinate values and the binary label:

  Î(x_j; y) = Σ_{v=1}^k Σ_{y∈{−1,+1}} P̂(x_j = v, y) log [ P̂(x_j = v, y) / ( P̂(x_j = v) P̂(y) ) ]

  where the P̂ are empirical estimates (relative frequencies) computed from the training set

- provides a ranking of the features to include
- weights redundant features equally (would include neither or both)
- not tied to the linear classifier (may select features that the linear classifier cannot use, or omit combinations of features particularly useful in a linear classifier)
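A sketch of this ranking step, assuming discrete feature values in {1,…,k} and binary labels in {−1,+1}; the function name empirical_mi and the synthetic data are illustrative, not from the slides.

import numpy as np

def empirical_mi(xj, y):
    """Empirical mutual information between a discrete feature column and a binary label."""
    mi = 0.0
    for v in np.unique(xj):
        for c in np.unique(y):
            p_vc = np.mean((xj == v) & (y == c))             # joint empirical estimate
            if p_vc > 0:
                p_v, p_c = np.mean(xj == v), np.mean(y == c) # marginal empirical estimates
                mi += p_vc * np.log(p_vc / (p_v * p_c))
    return mi

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=200)
X = rng.integers(1, 4, size=(200, 5))                        # 5 discrete features, values in {1,2,3}
X[:, 2] = np.where(y == 1, 3, rng.integers(1, 3, size=200))  # make feature 2 informative about y

scores = [empirical_mi(X[:, j], y) for j in range(X.shape[1])]
print(np.argsort(scores)[::-1])                              # ranking: most informative features first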

Regularization approach (Lasso)
• By using a 1-norm regularizer we will cause some of the parameters to be set exactly to zero

[Figure: loss contours together with the 1-norm constraint set; the best setting of the parameters subject to the norm constraint differs from the solution based on the loss alone, and at the constrained optimum some coordinates are exactly zero]

Regularization example
• n=100 samples, d=10, training outputs generated from a linear model with true parameters θ* plus noise

• If we increase the regularization penalty λ, we get fewer non-zero parameters in the solution to

  minimize over θ:  Σ_{i=1}^n (y_i − θᵀx_i)² + λ‖θ‖₁

[Figure: coordinate values of the estimated θ̂, and the number of non-zero coordinates in θ̂, as λ increases]

Regularization example
• The 1-norm penalty term controls the number of non-zero coordinates but also reduces the magnitude of the coordinates

[Figure: true parameters vs. estimated parameters, illustrating the shrinkage in magnitude]
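A minimal sketch of this 1-norm regularized fit using proximal gradient descent (soft-thresholding). The n=100, d=10 setup mirrors the example above, but the true parameters, step size, iteration count, and function names are illustrative assumptions.

import numpy as np

def lasso_ista(X, y, lam, n_iters=2000):
    """Minimize sum_i (y_i - theta^T x_i)^2 + lam * ||theta||_1 via ISTA."""
    d = X.shape[1]
    theta = np.zeros(d)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)             # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ theta - y)
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return theta

rng = np.random.default_rng(0)
n, d = 100, 10
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.5, 1.0]                            # illustrative sparse true parameters
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)

for lam in (0.1, 1.0, 10.0, 100.0):
    theta_hat = lasso_ista(X, y, lam)
    print(lam, np.sum(np.abs(theta_hat) > 1e-8))             # larger penalty -> fewer non-zeros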

Subset selection
• An exhaustive search for the best subset of k feature coordinates (feature functions) to rely on
  - k is used for (statistical) complexity control
  - computationally hard (exponential in k)

  Ŝ_k = argmin over subsets S of size k of  [ min over θ of  Σ_{i=1}^n Loss(y_i, Σ_{j∈S} θ_j φ_j(x_i)) ]
  (each feature subset is assessed on the basis of the resulting training error; we then find the best subset)
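A sketch of this exhaustive search for small d and k, assuming squared loss and a least-squares refit per candidate subset; the function name and synthetic data are illustrative.

import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Try every subset of k coordinates and keep the one with the lowest training error."""
    best_err, best_S = np.inf, None
    for S in combinations(range(X.shape[1]), k):
        cols = list(S)
        theta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)   # refit using only columns in S
        err = np.sum((X[:, cols] @ theta - y) ** 2)
        if err < best_err:
            best_err, best_S = err, S
    return best_S, best_err

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=100)
print(best_subset(X, y, k=2))                                    # should recover coordinates (1, 4)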

Greedy subset selection
• A greedy algorithm for finding a good subset of feature coordinates (feature functions) to rely on
  - each new feature is assessed in the context of those already included
  - the method is not guaranteed to find the optimal subset

  j* = argmin over j ∉ S of  [ min over θ of  Σ_{i=1}^n Loss(y_i, Σ_{l∈S∪{j}} θ_l φ_l(x_i)) ],   then S ← S ∪ {j*}
  (try each new coordinate; re-estimate all the parameters in the context of the new coordinate; find the best coordinate to include)
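A minimal sketch of the greedy procedure, assuming squared loss and a full least-squares refit for every candidate coordinate; names and data are illustrative.

import numpy as np

def greedy_subset(X, y, k):
    """Greedily add the coordinate that most reduces training error, refitting all parameters."""
    S = []
    for _ in range(k):
        best_err, best_j = np.inf, None
        for j in range(X.shape[1]):
            if j in S:
                continue
            cols = S + [j]
            theta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)  # re-estimate all parameters
            err = np.sum((X[:, cols] @ theta - y) ** 2)
            if err < best_err:
                best_err, best_j = err, j
        S.append(best_j)
    return S

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 1] - X[:, 4] + 0.5 * X[:, 6] + 0.1 * rng.normal(size=100)
print(greedy_subset(X, y, k=3))                                    # should recover 1, 4, 6 in some order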

Forward-fitting
• We can also choose feature coordinates without re-estimating the parameters associated with the already included coordinates
  - the same feature may be included more than once

  (j*, θ̂_{j*}) = argmin over j and θ_j of  Σ_{i=1}^n Loss(y_i, f_S(x_i) + θ_j φ_j(x_i))
  where the current fit f_S(x) = Σ_{l∈S} θ̂_l φ_l(x) is fixed at this stage: we re-estimate only the parameter associated with the new coordinate, and select the best new feature
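A sketch of forward-fitting under these assumptions: squared loss, only the new coefficient is fit while the accumulated fit stays fixed, and the same coordinate may be picked again. Function names and data are illustrative.

import numpy as np

def forward_fit(X, y, n_rounds):
    """At each round, fit only one new coefficient; previously fitted ones stay fixed."""
    theta = np.zeros(X.shape[1])
    f = np.zeros(len(y))                                   # current predictions, fixed at this stage
    for _ in range(n_rounds):
        best_err, best_j, best_c = np.inf, None, 0.0
        for j in range(X.shape[1]):
            xj = X[:, j]
            c = xj @ (y - f) / (xj @ xj)                   # 1-D least squares for the new coefficient
            err = np.sum((y - f - c * xj) ** 2)
            if err < best_err:
                best_err, best_j, best_c = err, j, c
        theta[best_j] += best_c                            # same feature may be included more than once
        f += best_c * X[:, best_j]
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=100)
print(np.round(forward_fit(X, y, n_rounds=5), 2))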

Myopic forward fitting
• We can also identify new features in a more limited way by focusing on how much “potential” each feature has for reducing the training error:

  j* = argmax over j of  | d/dθ_j  Σ_{i=1}^n Loss(y_i, f_S(x_i) + θ_j φ_j(x_i)) |  evaluated at θ_j = 0

  (the |derivative| is large if the corresponding feature would have a large effect on the error)

Myopic forward fitting
• We can identify new features with minimal fitting
  - the selection of the best coordinate uses the derivative criterion above, which does not involve any parameter fitting
  - we then estimate only the parameter associated with the selected feature,
      θ̂_{j*} = argmin over θ_{j*} of  Σ_{i=1}^n Loss(y_i, f_S(x_i) + θ_{j*} φ_{j*}(x_i)),
    with the current fit f_S fixed at this stage
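A sketch of the myopic variant, again assuming squared loss: the coordinate is selected by the magnitude of the loss derivative at the current fit, and only the selected coefficient is then estimated. Names and data are illustrative.

import numpy as np

def myopic_forward_fit(X, y, n_rounds):
    """Select the coordinate with the largest |derivative| of the squared loss, then fit only it."""
    theta = np.zeros(X.shape[1])
    f = np.zeros(len(y))                               # current predictions, fixed at this stage
    for _ in range(n_rounds):
        resid = y - f
        grads = -2 * X.T @ resid                       # derivative of sum_i (y_i - f_i - theta_j x_ij)^2 at theta_j = 0
        j = np.argmax(np.abs(grads))                   # selection involves no parameter fitting
        c = X[:, j] @ resid / (X[:, j] @ X[:, j])      # estimate only the selected coefficient
        theta[j] += c
        f += c * X[:, j]
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 1] - X[:, 4] + 0.1 * rng.normal(size=100)
print(np.round(myopic_forward_fit(X, y, n_rounds=5), 2))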

Forward-fitting example
• 1-dimensional polynomial regression

[Figure: fits obtained by forward-fitting vs. myopic forward-fitting]