CS373 Data Mining and Machine Learning
Lecture 7
Jean Honorio
Purdue University
(originally prepared by Tommi Jaakkola, MIT CSAIL)
Today’s topics
• Preface: regression
- linear regression, kernel regression
• Feature selection
- information ranking, regularization, subset selection
Linear regression
• We seek to learn a mapping from inputs to continuous valued outputs (e.g., price, temperature)
• The mapping is assumed to be linear in the feature space, so that the predicted output is given by $f(x; \theta) = \theta^\top \phi(x)$
• Assuming that the noise in the observed outputs is additive zero-mean Gaussian, we can obtain the parameters from the training samples by minimizing the sum, over the training examples, of the squared prediction loss on each example, plus a regularization term:

  $J(\theta) = \sum_t \bigl( y_t - \theta^\top \phi(x_t) \bigr)^2 + \lambda \|\theta\|^2$

• The regularization term guarantees that any unused parameter dimensions are set to zero
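To make the minimization concrete, here is a minimal numerical sketch of regularized least squares with its closed-form solution; the toy data, feature matrix, and penalty value lam are invented for illustration and are not from the lecture.

import numpy as np

# toy data: n examples, d feature dimensions (illustrative only)
rng = np.random.default_rng(0)
n, d = 50, 3
Phi = rng.normal(size=(n, d))                     # rows are feature vectors phi(x_t)
theta_true = np.array([1.0, -2.0, 0.5])
y = Phi @ theta_true + 0.1 * rng.normal(size=n)   # additive zero-mean Gaussian noise

# minimize sum_t (y_t - theta . phi(x_t))^2 + lam * ||theta||^2
lam = 0.1
theta_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)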
Linear regression
• We can easily obtain non-linear regression functions by considering different feature mappings
[Figure: regression fits using linear, 3rd order, 5th order, and 7th order polynomial feature mappings]
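As a small sketch of such a feature mapping (my own example, not from the slides), a scalar input can be expanded into polynomial features so that the same linear-regression machinery yields a non-linear fit:

import numpy as np

def poly_features(x, degree):
    # map scalar inputs x to (1, x, x^2, ..., x^degree)
    return np.vander(x, N=degree + 1, increasing=True)

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)

Phi = poly_features(x, degree=5)                  # 5th order polynomial feature mapping
lam = 1e-3
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
y_fit = Phi @ theta                               # non-linear fit from a linear model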
Linear regression solution
• The unique solution lies in the span of the feature
vectors (this is due to the regularization term)
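A short restatement of this fact in formulas, as a sketch of the standard argument (notation mine, with $\lambda > 0$): any component of $\theta$ orthogonal to the span of the feature vectors leaves every prediction $\theta^\top \phi(x_t)$ unchanged but strictly increases $\|\theta\|^2$, so the minimizer must lie in the span.

\hat{\theta}
  = \arg\min_{\theta} \; \sum_{t=1}^{n} \bigl( y_t - \theta^\top \phi(x_t) \bigr)^2
      + \lambda \,\|\theta\|^2
  \quad\Longrightarrow\quad
\hat{\theta} = \sum_{t=1}^{n} \alpha_t \, \phi(x_t)
  \;\; \text{for some dual parameters } \alpha_1, \dots, \alpha_n .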
Dual linear regression
• The dual parameters $\alpha$ are obtained as the solution to a linear equation involving the Gram matrix $K$ of kernel values, $K_{ij} = k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$:

  $(K + \lambda I)\,\alpha = y$

• The predicted output for a new input $x$ is given in terms of the kernel alone:

  $\hat{y}(x) = \sum_t \alpha_t \, k(x_t, x)$
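A minimal sketch of these dual computations in code, assuming a radial-basis-function kernel and a bandwidth chosen only for illustration:

import numpy as np

def rbf_kernel(A, B, gamma=5.0):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 0.1
K = rbf_kernel(X, X)                                   # Gram matrix of kernel values
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual parameters

x_new = np.array([[0.3]])
y_pred = rbf_kernel(x_new, X) @ alpha                  # prediction uses only kernel evaluations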
Today’s topics
• Preface: regression
- linear regression, kernel regression
• Feature selection
- information ranking, regularization, subset selection
Coordinate selection
• Linear models: the prediction depends on the input only through the feature coordinates $\phi_j(x)$, e.g.

  regression: $f(x) = \sum_j \theta_j \phi_j(x)$
  classification: $f(x) = \mathrm{sign}\bigl( \sum_j \theta_j \phi_j(x) \bigr)$
  etc.

• We seek to identify a few feature coordinates that the class label or regression output primarily depends on
• This is often advantageous in order to improve generalization (as feature selection exerts additional complexity control) or to gain interpretability
Simple coordinate selection
• There are a number of different approaches to
coordinate selection
• Information analysis
- rank individual features according to their mutual information
with the class label
- limited to discrete variables
• 1-norm regularization
- replaces the two-norm regularizer with a 1-norm regularizer, which encourages some of the coordinates to be set exactly to zero
• Iterative subset selection
- iteratively add (or prune) coordinates based on their impact
on the (training) error
Information analysis
• Suppose the feature vector is just the input vector x, whose coordinates take values in {1,…,k}
• Given a training set of size n, we can evaluate an empirical estimate of the mutual information between the coordinate values and the binary label (a computational sketch follows after the list below):

  $\hat{I}(x_j; y) = \sum_{x_j = 1}^{k} \sum_{y} \hat{P}(x_j, y) \log \frac{\hat{P}(x_j, y)}{\hat{P}(x_j)\,\hat{P}(y)}$

  where the $\hat{P}$'s are empirical estimates of the probabilities computed from the training set
- provides a ranking of the features to include
- weights redundant features equally (would include neither or
both)
- not tied to the linear classifier (may select features that the
linear classifier cannot use, or omit combinations of features
particularly useful in a linear classifier)
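The sketch referenced above: an empirical mutual-information ranking for discrete features and binary labels, on synthetic data invented purely to show the computation.

import numpy as np

def empirical_mutual_information(xj, y):
    # plug-in estimate of I(x_j; y) from empirical probabilities
    mi = 0.0
    for v in np.unique(xj):
        for c in np.unique(y):
            p_joint = np.mean((xj == v) & (y == c))
            if p_joint > 0:
                p_v, p_c = np.mean(xj == v), np.mean(y == c)
                mi += p_joint * np.log(p_joint / (p_v * p_c))
    return mi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                   # binary labels
X = rng.integers(1, 4, size=(200, 5))              # 5 discrete features with values in {1,2,3}
X[:, 2] = np.where(y == 1, 3, X[:, 2])             # make feature 2 informative about y

scores = [empirical_mutual_information(X[:, j], y) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]                 # features ranked by mutual information
print(ranking)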
Regularization approach (Lasso)
• By using a 1-norm regularizer we will cause some of the parameters to be set exactly to zero
[Figure: the best setting of the parameters subject to the 1-norm constraint, compared with the solution based on the loss alone; at the constrained solution some coordinates are exactly zero]
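A small sketch of this effect, using scikit-learn's Lasso as a stand-in for a 1-norm-regularized linear regression (the data and the penalty value alpha are made up for illustration):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.5, 1.0]                  # only a few coordinates matter
y = X @ theta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)                 # 1-norm regularized least squares
print(lasso.coef_)                                 # several coordinates are exactly zero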
Regularization example
• n = 100 samples, d = 10; the training outputs were generated from a linear model with additive noise
• If we increase the regularization penalty, we get fewer non-zero parameters in the solution
[Figure: coordinate values of the estimated parameters, and the number of non-zero coordinates, as a function of the regularization penalty]
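A sketch reproducing the qualitative behaviour of this example (my own synthetic data with n = 100 and d = 10; the sequence of penalty values is arbitrary):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, d = 100, 10
X = rng.normal(size=(n, d))
theta_true = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ theta_true + 0.5 * rng.normal(size=n)

for alpha in [0.01, 0.1, 0.5, 1.0, 2.0]:           # increasing regularization penalty
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(alpha, np.count_nonzero(coef))           # fewer non-zero coordinates as alpha grows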
Regularization example
• The 1-norm penalty term controls the number of non-zero coordinates, but it also reduces the magnitude of the coordinates
[Figure: true parameters vs. estimated parameters]
Subset selection
• We would like to find the best subset of k feature coordinates (feature functions) to rely on
- k is used for (statistical) complexity control
- exhaustive search over subsets is computationally hard (exponential in k)
• Each candidate feature subset is assessed on the basis of the resulting training error, and we then find the best subset:

  $\hat{S} = \arg\min_{|S| = k} \; \min_{\theta} \; (\text{training error using only the feature coordinates in } S)$
Greedy subset selection
• A greedy algorithm for finding a good subset of feature coordinates (feature functions) to rely on
- each new feature is assessed in the context of those already included
- the method is not guaranteed to find the optimal subset
• At each step we try each new coordinate $j$ not yet in the current subset $S$, re-estimate all the parameters in the context of the new coordinate, and find the best coordinate to include:

  $j^* = \arg\min_{j \notin S} \; \min_{\theta} \; (\text{training error using the feature coordinates in } S \cup \{j\}), \qquad S \leftarrow S \cup \{j^*\}$
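A minimal sketch of greedy forward subset selection for the regression case; this is my own implementation, assuming squared-error loss and a tiny ridge term for numerical stability.

import numpy as np

def training_error(Phi, y, cols, lam=1e-8):
    # refit all parameters using only the selected columns, return squared error
    A = Phi[:, cols]
    theta = np.linalg.solve(A.T @ A + lam * np.eye(len(cols)), A.T @ y)
    return np.sum((y - A @ theta) ** 2)

def greedy_subset_selection(Phi, y, k):
    selected = []
    for _ in range(k):
        candidates = [j for j in range(Phi.shape[1]) if j not in selected]
        # try each new coordinate, re-estimating all parameters each time
        errors = [training_error(Phi, y, selected + [j]) for j in candidates]
        selected.append(candidates[int(np.argmin(errors))])   # best coordinate to include
    return selected

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 10))
y = 2 * Phi[:, 1] - 1.5 * Phi[:, 4] + 0.1 * rng.normal(size=100)
print(greedy_subset_selection(Phi, y, k=2))        # likely picks coordinates 1 and 4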
Forward-fitting
• We can also choose feature coordinates without re-estimating the parameters associated with already included coordinates
- the same feature may be included more than once
• At each step the previously estimated parameters are held fixed; we select the best new feature and re-estimate only the parameter associated with the new coordinate
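A sketch of forward-fitting for squared-error regression under my own assumptions: previously fitted parameters stay fixed, and only the coefficient of the newly selected coordinate is estimated, via a one-dimensional least-squares fit to the current residual.

import numpy as np

def forward_fitting(Phi, y, steps):
    theta = np.zeros(Phi.shape[1])
    residual = y.copy()
    for _ in range(steps):
        best_j, best_err, best_w = None, np.inf, 0.0
        for j in range(Phi.shape[1]):
            phi_j = Phi[:, j]
            w = phi_j @ residual / (phi_j @ phi_j)   # fit only this coordinate's parameter
            err = np.sum((residual - w * phi_j) ** 2)
            if err < best_err:
                best_j, best_err, best_w = j, err, w
        theta[best_j] += best_w          # same feature may be selected more than once
        residual -= best_w * Phi[:, best_j]
    return theta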
Myopic forward fitting
• We can also identify new features in a more limited way, by focusing on how much “potential” each feature has for reducing the training error
- the magnitude of the derivative of the training error with respect to a coordinate's parameter, $|\partial (\text{training error}) / \partial \theta_j|$, is large if the corresponding feature would have a large effect on the error
• We can therefore identify new features with minimal fitting
- the selection of the best coordinate is based on this derivative criterion, which does not involve any parameter fitting
- the previously estimated parameters are held fixed at this stage; once a coordinate is selected, we estimate only the parameter associated with it
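A sketch of the myopic variant under the same squared-error assumptions as above: the coordinate with the largest gradient magnitude is selected without any fitting, and only its parameter is then estimated.

import numpy as np

def myopic_forward_fitting(Phi, y, steps):
    theta = np.zeros(Phi.shape[1])
    residual = y.copy()
    for _ in range(steps):
        # the derivative of the squared training error w.r.t. theta_j is -2 * phi_j . residual,
        # so the selection criterion requires no parameter fitting at all
        grads = Phi.T @ residual
        j = int(np.argmax(np.abs(grads)))
        w = Phi[:, j] @ residual / (Phi[:, j] @ Phi[:, j])   # fit only the selected parameter
        theta[j] += w
        residual -= w * Phi[:, j]
    return theta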
Forward-fitting example
• 1-dimensional polynomial regression
[Figure: fits obtained with forward-fitting and with myopic forward-fitting]