Lecture 2: Linear Regression
Instructor: Xiaobai Liu
Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization (a.k.a. Learning or Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices
Regression: Single Feature
Housing Price
Supervised learning: past transaction history is available
Regression: predict the real-valued price
[Figure: scatter plot of house Size (feet²) vs. Price (K$)]
Training Set (92115)
Size (feet²)   Price (K$)
856            399.5
1512           449
865            350
1044           345
…              …
Notations
$x$: input feature (e.g., size)
$y$: output result (e.g., price)
Linear Regression: Prediction Functions
Goal: learn a function y=f(x) that maps from x to y
$f(x)$ is a linear function in linear regression
e.g., Size of House → Housing Price
Formally, $f(x) = \theta_0 + \theta_1 x$
$(\theta_0, \theta_1)$: parameters to learn (intercept and slope)
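To make the prediction function concrete, here is a minimal Python sketch; the function name and the example parameter values are illustrative assumptions, not taken from the lecture:

```python
def predict(x, theta0, theta1):
    """Linear prediction function f(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Illustrative values only: intercept 50 (K$), slope 0.25 (K$ per square foot)
print(predict(1000, theta0=50.0, theta1=0.25))  # -> 300.0 (K$)
```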
Quiz
E.g., given $(\theta_0 = 50,\ \theta_1 = \dots)$, what is the price of a house of 895 square feet?
(Answer: $268,550)
Illustration of Linear Regression Models
Parameters: $(\theta_0, \theta_1)$
[Figure: regression lines for different parameter values over the Price (K$) scatter]
How to learn parameters?
Idea: choose parameters $(\theta_0, \theta_1)$ so that $f(x)$ is close to $y$ for the five training samples $(x, y)$
[Figure: five training points and several candidate regression lines, Size (feet²) vs. Price (K$)]
Which one is the best?
Linear Regression: Cost function
Choose parameters $(\theta_0, \theta_1)$ so that $f(x)$ is close to $y$ for all five training samples $(x, y)$
Minimize $F(\theta_0, \theta_1) = \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2$ ($m$: number of training samples)
Residual Sum of Squares (RSS)
Least Squares Loss
Note that:
For fixed parameters, $f(x)$ is a function of $x$
For a fixed set of samples $(x, y)$, $F(\theta_0, \theta_1)$ is a function of the parameters
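A short NumPy sketch of the RSS cost under the definitions above; the variable names are illustrative, and the toy numbers echo the training-set table:

```python
import numpy as np

def rss(theta0, theta1, x, y):
    """Residual sum of squares for f(x) = theta0 + theta1 * x."""
    residuals = (theta0 + theta1 * x) - y
    return np.sum(residuals ** 2)

# Toy data resembling the lecture's table (size in feet^2, price in K$)
x = np.array([856, 1512, 865, 1044], dtype=float)
y = np.array([399.5, 449, 350, 345], dtype=float)
print(rss(50.0, 0.25, x, y))
```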
Residual of Cost Function
Indicates whether the training procedure has converged
Can be used to estimate the confidence of the outputs
[Figure: fitted line and residuals on the Size (feet²) vs. Price (K$) scatter]
How to optimize the cost function
Minimize the function $F(\theta_0, \theta_1)$ w.r.t. $(\theta_0, \theta_1)$
Iterative method
Start with some initial values $(\theta_0, \theta_1)$
Keep changing $(\theta_0, \theta_1)$ to reduce $F(\theta_0, \theta_1)$
Stop when certain conditions are satisfied
A Computer-based solution
Iterative method
Start with some initial values $(\theta_0, \theta_1)$
Randomly generate new values for $(\theta_0, \theta_1)$; keep them if the cost is the smallest seen so far
Stop when certain conditions are satisfied
Pros:
Easy to implement
Cons:
Difficult to converge
Performs poorly on highly complicated loss functions
A smart solution
Review: Quadratic Functions
Zero, one, or two real roots.
One extreme, called the vertex.
No inflection points.
Line symmetry through the vertex. (Axis of symmetry.)
Rises or falls at both ends.
Can be constructed from three non-collinear points or three pieces of information.
One fundamental shape.
Roots are solvable by radicals. (Quadratic Formula.)
Review: gradient at a point
Quiz: How to tell whether the gradient at a point is negative or positive?
Iterative methods
To minimize $F(\theta_0, \theta_1)$, we use an iterative method
Start with some initial values $(\theta_0, \theta_1)$
Keep changing $(\theta_0, \theta_1)$ to reduce $F(\theta_0, \theta_1)$
Stop when certain conditions are satisfied
[Figure: cost curve with its minimum marked]
Iterative methods
To minimize $F(\theta_0, \theta_1)$, we use an iterative method
Start with some initial values $(\theta_0, \theta_1)$
Keep changing $(\theta_0, \theta_1)$ to reduce $F(\theta_0, \theta_1)$
Stop when certain conditions are satisfied
Question: how should we change $(\theta_0, \theta_1)$?
Iterative methods
If the current $\theta$ is to the left of the minimum, the change should be positive
If it is to the right, the change should be negative
Question: how do we set the change? E.g., $\Delta\theta = -\alpha\, \dfrac{dF(\theta)}{d\theta}$
Gradient-based methods?
Solution: gradient-based method
To solve $\min_{\theta_1} F(\theta_1)$:
Initialize $\theta_1$
Repeat until convergence: $\theta_1 := \theta_1 - \alpha \dfrac{d F(\theta_1)}{d \theta_1}$
To solve $\min_{\theta_0, \theta_1} F(\theta_0, \theta_1)$:
Initialize $\theta_0, \theta_1$
Repeat until convergence: $\theta_j := \theta_j - \alpha \dfrac{\partial F(\theta_0, \theta_1)}{\partial \theta_j}$, for $j = 0, 1$
Review: First-order Derivative
$y = ax + b$ → $dy/dx = a$
$y = -ax^2 + b$ → $dy/dx = -2ax$
$y = -\log x$ → $dy/dx = -1/x$
$y = e^{-ax}$ → $dy/dx = -a\, e^{-ax}$
$f(x, y) = ax + by + c$ → $\partial f/\partial x = a$, $\partial f/\partial y = b$
Method: Gradient Descent
Initialize $\theta_0, \theta_1$
Repeat until convergence: $\theta_j := \theta_j - \alpha \dfrac{\partial F(\theta_0, \theta_1)}{\partial \theta_j}$, for $j = 0, 1$ ($\alpha$: learning rate)
Gradient Descent for RSS
Cost Function: $F(\theta_0, \theta_1) = \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$
What are the derivatives of $F(\theta_0, \theta_1)$ w.r.t. $\theta_0$ and $\theta_1$?
$\dfrac{\partial F}{\partial \theta_0} = 2 \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)$
$\dfrac{\partial F}{\partial \theta_1} = 2 \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}$
Gradient Descent for RSS
Repeat until convergence {
  $\theta_0 := \theta_0 - \alpha \cdot 2 \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)$
  $\theta_1 := \theta_1 - \alpha \cdot 2 \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}$
}
At each iteration, update $\theta_0$ and $\theta_1$ simultaneously
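A minimal NumPy sketch of this update loop; the learning rate, iteration count, and variable names are illustrative assumptions (with unscaled house sizes a very small learning rate is needed, which previews the feature-scaling practice later in the lecture):

```python
import numpy as np

def gradient_descent(x, y, alpha=1e-7, n_iters=1000):
    """Batch gradient descent for f(x) = theta0 + theta1 * x with the RSS cost."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        residuals = theta0 + theta1 * x - y          # f(x^(i)) - y^(i)
        grad0 = 2.0 * np.sum(residuals)              # dF/dtheta0
        grad1 = 2.0 * np.sum(residuals * x)          # dF/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([856, 1512, 865, 1044], dtype=float)
y = np.array([399.5, 449, 350, 345], dtype=float)
print(gradient_descent(x, y))
```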
Understanding GD
In the gradient term $\left( f(x^{(i)}) - y^{(i)} \right) x^{(i)}$:
$f(x^{(i)})$: prediction
$y^{(i)}$: true label
$x^{(i)}$: feature
Understanding GD
The learning rate $\alpha$ should be set empirically
Too small: slow convergence
Too large: fails to converge, or diverges
$\alpha$ can be fixed over time, or
adaptive (step size changes across iterations)
Understanding GD
Batch Gradient Descent
At each step, access all the training samples
How to improve convergence
Access more training samples
Learning rate: smaller or bigger
Better initialization: use the closed-form solution
Normal equation
Variants of GD
Full Batch: access all training samples at each iteration
Mini Batch: access a portion of the training samples at each iteration
Online Learning: access a single training sample at each iteration
Recap
Prediction Function: Linear Function
Cost Function: Residual Sum of Squares (RSS)
Optimization: Gradient Descent Method
Testing: Measurement Error
Once a regression model is trained, apply the prediction function to each testing sample and compare its prediction to the true label
Both labels are real-valued
L2 error: $(f(x) - y)^2$
L1 error: $|f(x) - y|$
With multiple testing samples, report both the mean and the standard deviation of the errors
Popular metric: coefficient of determination, $R^2$
A measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model
Regression Measure: $R^2$
Let $y_i$ denote the true label of sample $i$, $\bar{y}$ the mean of the labels in the testing dataset, and $\hat{y}_i$ the predicted label of sample $i$.
We have $R^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}$,
where $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_i (y_i - \bar{y})^2$
$R^2 = 1$ indicates that the predicted labels exactly match the true labels;
$R^2 = 0$ indicates that the model does no better than always predicting the mean $\bar{y}$;
$R^2 = 0.49$ indicates that 49% of the variability of the dependent variable (true labels) is accounted for by the model.
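A small NumPy sketch of this formula (names and toy numbers are illustrative); scikit-learn's sklearn.metrics.r2_score computes the same quantity:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([399.5, 449.0, 350.0, 345.0])
y_pred = np.array([380.0, 460.0, 360.0, 350.0])
print(r_squared(y_true, y_pred))
```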
Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization (Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices
Regression: Multiple Features
Example: Housing Price
Size   Bedroom  Bathroom  Built year  Stories  Price (K$)
1024   3        2         1978        1        375
1329   3        2         1992        1        425
1893   4        2         1980        2        465
…      …        …         …           …        …
Single feature: $f(x) = \theta_0 + \theta_1 x$
Multiple features: $f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
Review: Linear Algebra
Matrix/Vector
Addition and Scalar Multiplication
Matrix-vector/matrix-matrix multiplication
$A \times B$ is not equal to $B \times A$ (matrix multiplication is not commutative)
$(A B) C = A (B C)$ (associative)
Identity matrix
Inverse and Transpose
Notations
Let $x = [1, x_1, x_2, \dots, x_n]^T$ and $\theta = [\theta_0, \theta_1, \dots, \theta_n]^T$,
then we have $f(x) = \theta^T x$
Let $x^{(i)}$ represent the $i$-th training sample and $y^{(i)}$ its label.
Gradient Descent for Multiple Features
Gradient Descent
Repeat {
  $\theta_j := \theta_j - \alpha \dfrac{\partial F(\theta)}{\partial \theta_j}$, for $j = 0, 1, \dots, n$
}
Cost Function: $F(\theta) = \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2$
Gradient Descent for Multiple Features
Gradient Descent
Repeat {
  $\theta_j := \theta_j - \alpha \cdot 2 \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}$, for $j = 0, 1, \dots, n$
}
Update all $\theta_j$ simultaneously
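A vectorized NumPy sketch of this multi-feature update; the hyperparameters and toy data are illustrative, and X is assumed to carry a leading column of ones for the intercept:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=1e-2, n_iters=5000):
    """Batch gradient descent for f(x) = X @ theta with the RSS cost.
    X: (m, n+1) design matrix whose first column is all ones (intercept)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residuals = X @ theta - y        # f(x^(i)) - y^(i) for all samples
        grad = 2.0 * X.T @ residuals     # partial derivatives w.r.t. every theta_j
        theta -= alpha * grad            # simultaneous update of all theta_j
    return theta

# Tiny illustrative example: one feature plus the intercept column
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])
y = np.array([1.0, 1.5, 2.0])
print(gradient_descent_multi(X, y))   # approaches [0.5, 1.0]
```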
Dealing with Qualitative Features
Some predictors/features are not quantitative but are qualitative, taking a discrete set of values
Categorical predictors
Factor variable
E.g., house type, short sale, gender, student status
Consider a feature, house type:
$x = 1$ if single family
$x = 0$ otherwise
Continued
The resulting model is:
$f(x) = \theta_0 + \theta_1$ if single family
$f(x) = \theta_0$ otherwise
Additional Dummy Variables
With more than two categories, introduce one dummy variable per additional category, e.g., $x_1 = 1$ if single family and 0 otherwise, and similarly for the other house types
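A hedged sketch of dummy encoding with pandas; the column names and categories are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: house_type is a qualitative feature
df = pd.DataFrame({
    "size": [1024, 1329, 1893],
    "house_type": ["single_family", "condo", "single_family"],
})

# One dummy column per category; drop_first=True avoids a redundant (collinear) column
encoded = pd.get_dummies(df, columns=["house_type"], drop_first=True)
print(encoded)
```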
Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization (Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices
Case Study: Linear Regression
Project 1: Synthetic Dataset
Project 2: Housing Dataset
Project 1: Synthetic Dataset
A single input feature and a continuous target variable (or output)
[Figure: scatter plot with Feature on the x-axis and Output on the y-axis]
Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations
Load dataset
Generate a set of data points (x, y), where x is the input and y is the output
Built-in functions: rnd.uniform()
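A minimal sketch of this data-generation step; the slope, intercept, noise level, and sample count are illustrative assumptions, not the lecture's exact values:

```python
import numpy as np

rnd = np.random.RandomState(42)           # reproducible random generator
x = rnd.uniform(-3, 3, size=100)          # single input feature
noise = rnd.normal(0, 0.5, size=100)      # additive noise
y = 0.5 * x + 1.0 + noise                 # continuous target
```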
Visualization
Plotting all the data points in a 2-D figure
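A short matplotlib sketch of this step, assuming x and y come from the generation step above:

```python
import matplotlib.pyplot as plt

plt.scatter(x, y, marker="o")   # x, y from the data-generation step above
plt.xlabel("Feature")
plt.ylabel("Output")
plt.show()
```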
Data Splitting
Partition the dataset into two subsets: used for training and testing, respectively.
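A sketch of the split with scikit-learn; the 75/25 ratio and random seed are illustrative assumptions, and the resulting arrays are reused in the later steps:

```python
from sklearn.model_selection import train_test_split

X = x.reshape(-1, 1)   # scikit-learn expects a 2-D feature array (n_samples, n_features)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```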
Training
Use the LinearRegression class from the sklearn.linear_model library
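A minimal training sketch with this class, assuming X_train and y_train come from the split above:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
print(model.intercept_, model.coef_)   # learned intercept (theta_0) and slope (theta_1)
```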
Predictions
Predictions over individual data points
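A one-line prediction sketch, assuming model and X_test come from the previous steps:

```python
y_pred = model.predict(X_test)   # predictions for the held-out test points
print(y_pred[:5])
```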
Evaluations
Make a prediction for every testing data point, compare it to the ground-truth value, and calculate the $R^2$ score:
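A sketch of the evaluation, assuming y_test and y_pred come from the previous steps:

```python
from sklearn.metrics import r2_score

print("R^2 on the test set:", r2_score(y_test, y_pred))
# Equivalent shortcut: model.score(X_test, y_test)
```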
Project 1: summary
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations
Project 2: Housing Dataset
Boston Housing dataset
To predict the median value of homes in several Boston neighborhoods
Feature predictors: crime rate, proximity to the Charles River, highway accessibility, etc.
506 data points, 13 features
Project 2: Outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations
Load and split dataset
scikit-learn provides built-in access to this dataset
Training and evaluations
Measurement: $R^2$
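An end-to-end sketch of this case study. Note that the original load_boston loader has been removed from recent scikit-learn releases; as a clearly labeled stand-in, this sketch uses the California housing dataset, which poses the same kind of task (numeric features, real-valued median home value target). The split ratio and seed are also assumptions:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Stand-in for the Boston data: numeric features, median home value target
data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Train R^2:", model.score(X_train, y_train))
print("Test  R^2:", model.score(X_test, y_test))
```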
Summary
Project 1: Synthetic Dataset
Project 2: Housing Dataset
Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization (Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices
Regression: Overfitting
Regularization
A popular approach to reducing overfitting is to regularize the values of the coefficients (the column vector $\theta$), which works well in practice.
$\min_\theta \; F(\theta) + \lambda R(\theta)$
where $F(\theta)$ is the cost function (e.g., the least-squares loss for linear regression), $R(\theta)$ is an extra penalty term on the parameters, and $\lambda$ is a constant
Ridge regularization: $R(\theta) = \|\theta\|_2^2 = \sum_j \theta_j^2$
Lasso regularization: $R(\theta) = \|\theta\|_1 = \sum_j |\theta_j|$
Elastic Net regularization: a weighted combination of the L1 and L2 penalties, $R(\theta) = \rho\,\|\theta\|_1 + (1-\rho)\,\|\theta\|_2^2$
Regularized Regression: Learning
Gradient Descent
Repeat until convergence {
  $\theta_j := \theta_j - \alpha \left( \dfrac{\partial F(\theta)}{\partial \theta_j} + 2\lambda\,\theta_j \right)$, for $j = 0, 1, \dots, n$ (shown for the ridge penalty)
}
Update all $\theta_j$ simultaneously
Cost Function: $F(\theta) + \lambda \|\theta\|_2^2$
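A NumPy sketch of ridge-regularized gradient descent under these definitions; the hyperparameters and toy data are illustrative, and regularizing the intercept term is a simplification that some formulations avoid:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=1.0, alpha=1e-2, n_iters=5000):
    """Gradient descent on RSS(theta) + lam * ||theta||_2^2.
    For simplicity the intercept column of X is regularized too."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad_rss = 2.0 * X.T @ (X @ theta - y)   # gradient of the RSS term
        grad_reg = 2.0 * lam * theta             # gradient of the ridge penalty
        theta -= alpha * (grad_rss + grad_reg)   # simultaneous update
    return theta

# Tiny illustrative check (intercept column of ones plus one feature)
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])
y = np.array([1.0, 1.5, 2.0])
print(ridge_gradient_descent(X, y, lam=0.1))
```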
Outline of the rest
Case study: Regularization
Boston Housing dataset
To predict the median value of homes in several Boston neighborhoods
Feature predictors: crime rate, proximity to the Charles River, highway accessibility, etc.
506 data points, 13 features
Case A: Linear Regression
Load dataset
Import libraries
Training
Case B: Ridge Regression (L2 regularization)
Ridge Regression
New results (worse training score, better testing score)
Previous results (without regularization)
Case B: Ridge Regression (L2 regularization)
Training with different regularization weights (alpha below)
Source code available in Lecture-LinearRegression.ipynb
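A sketch of such an alpha sweep with scikit-learn's Ridge class; the alpha values are illustrative, and X_train, X_test, y_train, y_test are assumed from the earlier case-study split:

```python
from sklearn.linear_model import Ridge

for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:6.1f}  "
          f"train R^2={ridge.score(X_train, y_train):.3f}  "
          f"test R^2={ridge.score(X_test, y_test):.3f}")
```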
LASSO: L1 Regularization
Still use Gradient Descent during training
Sub-gradients (introduced in later sections)
Encourages sparsity in the learned coefficients
Code available in Lecture-LinearRegression.ipynb as well
Instead of the L2 term, use the L1 norm to regularize the model parameters
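A brief scikit-learn sketch that shows the sparsity effect; the alpha value is illustrative, and the train/test arrays are assumed from the earlier case-study split:

```python
import numpy as np
from sklearn.linear_model import Lasso

# In practice, scale the features first so the L1 penalty treats them comparably
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("test R^2:", lasso.score(X_test, y_test))
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)), "of", lasso.coef_.size)
```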
Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization (Learning/Training)
Linear Regression With Multiple Features
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices
LR: practices
Feature Processing
Learning rate
Polynomial Regression
Normal equation
Feature selection
Practice 1: Feature Processing
Much processing needs to be done before even coding a machine learning system
Processing 1: Feature Scaling
To ensure features are on a similar scale
E.g., two features in the house-pricing problem have very different scales: Size (100-1000 square feet) and Bedrooms (1-5)
Scale features to be, e.g., between -1 and 1
Do the same scaling for both training and testing data
Processing 2: Mean Normalization
Replace $x_j$ with $x_j - \mu_j$ (where $\mu_j$ is the mean of feature $j$) so that features have approximately zero mean
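A small sketch of both processing steps, following the fit-on-training, apply-to-testing advice above; the array names are assumptions:

```python
from sklearn.preprocessing import StandardScaler

# Manual version (X_train, X_test assumed to be numeric NumPy arrays)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_norm = (X_train - mu) / sigma     # zero mean, unit variance
X_test_norm = (X_test - mu) / sigma       # reuse the training statistics on the test set

# Equivalent with scikit-learn
scaler = StandardScaler().fit(X_train)    # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```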
Practice 2: Learning Rate
How to choose the learning rate $\alpha$?
[Figure: cost $F(\theta)$ vs. number of iterations for different learning rates]
Too small: slow convergence
Too large: no convergence
To choose the learning rate, try, e.g., 0.001, 0.05, 0.01 (see the sketch below)
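A self-contained sketch of this trial procedure on toy data; the feature is standardized first so moderate learning rates behave reasonably, and the iteration count is illustrative:

```python
import numpy as np

# Toy single-feature data (size in feet^2, price in K$), standardized feature
x = np.array([856.0, 1512.0, 865.0, 1044.0])
x = (x - x.mean()) / x.std()
y = np.array([399.5, 449.0, 350.0, 345.0])

for alpha in [0.001, 0.05, 0.01]:
    theta0, theta1 = 0.0, 0.0
    for _ in range(200):
        residuals = theta0 + theta1 * x - y
        theta0, theta1 = (theta0 - alpha * 2 * np.sum(residuals),
                          theta1 - alpha * 2 * np.sum(residuals * x))
    cost = np.sum((theta0 + theta1 * x - y) ** 2)
    print(f"alpha={alpha}: cost after 200 iterations = {cost:.2f}")
```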
Practice 3: Polynomial Regression
Case: housing price
$x_1$: size of the living area, $x_2$: size of the yard
The total area $x_1 + x_2$ can serve as a single combined feature
Good practice: avoid redundant features
Practice 3: Polynomial Regression
Polynomial regression: $f(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots$
[Figure: polynomial fit of Price (y) vs. Size (x)]
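A scikit-learn sketch of polynomial regression; the degree is an illustrative choice, and x, y are assumed to be the single-feature synthetic data from the case study:

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Fit f(x) = theta_0 + theta_1 x + theta_2 x^2 by expanding the single feature
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x.reshape(-1, 1), y)
print("R^2 on the training data:", poly_model.score(x.reshape(-1, 1), y))
```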
Practice 4: Normal Equation
Iterative method: gradient descent
Normal equation: solve for the parameters analytically
Practice 4: Normal Equation
Basic idea:
Set $\dfrac{\partial F(\theta)}{\partial \theta_j} = 0$ for every $j$
Solve for $\theta$
Example: least-squares loss
Let $F(\theta) = \|X\theta - y\|_2^2$, so $\nabla_\theta F(\theta) = 2 X^T (X\theta - y)$
Setting the gradient to zero and solving gives $\theta = (X^T X)^{-1} X^T y$ (see the NumPy sketch below)
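A NumPy sketch of this closed-form solution; np.linalg.solve is used instead of an explicit inverse for numerical stability, and X is assumed to include the leading column of ones:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny check on the same toy data used in the gradient-descent sketches
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])
y = np.array([1.0, 1.5, 2.0])
print(normal_equation(X, y))   # -> approximately [0.5, 1.0]
```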
Example 4: Normal equation
Size   Bedroom  Bathroom  Built year  Stories  Price (K$)
1024   3        2         1978        1        375
1329   3        2         1992        1        425
1893   4        2         1980        2        465
…      …        …         …           …        …

$y = \begin{bmatrix} 375 \\ 425 \\ 465 \\ \vdots \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & 1024 & 3 & 2 & 1978 & 1 \\ 1 & 1329 & 3 & 2 & 1992 & 1 \\ 1 & 1893 & 4 & 2 & 1980 & 2 \\ & & \vdots & & & \end{bmatrix}$
Practice 4: Normal Equation
Normal Equation
No need to choose the learning rate $\alpha$
No iterations
Needs to compute the inverse of $X^T X$, which can be too expensive when the matrix is large
Gradient Descent
Need to choose the learning rate $\alpha$
Needs many iterations
Scales to large problems (i.e., many training samples)
Normal Equation
What if $X^T X$ is non-invertible?
Too many features
Delete some redundant features
Use Regularization
Practice 5: selecting important features
Basic observation: some features are more important than others
Direct Approach: subset methods
For all possible subsets, compute the least-squares fit and choose the subset that balances training error and model size
Exploring all subsets is infeasible: with $d$ features there are $2^d$ subsets
Two alternative approaches
Forward Selection
Backward Selection
Continued
Forward selection
Begin with the null model: only the intercept, no predictors (features)
Train multiple single-variable models and add to the null model the variable that results in the lowest RSS (one-variable model)
Expand the one-variable model into two-variable models and keep the one with the lowest RSS (two-variable model)
Continue until some stopping condition is satisfied (see the sketch after this list)
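A hedged sketch of forward selection using training RSS; the function and variable names are illustrative, X is assumed to be a 2-D NumPy array, and the fits use scikit-learn's LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rss_of_fit(X_cols, y):
    """Training RSS of a least-squares fit on the selected feature columns."""
    model = LinearRegression().fit(X_cols, y)
    return float(np.sum((model.predict(X_cols) - y) ** 2))

def forward_selection(X, y, max_features=5):
    """Greedily add the feature that lowers the training RSS the most."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = {j: rss_of_fit(X[:, selected + [j]], y) for j in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Example usage (array names assumed): forward_selection(X_train, y_train, max_features=3)
```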
Continued
Backward approach
Start with all variables in the model (let d denote the number of variables)
Remove one variable from the model to get a (d-1)-variable model; Test all (d-1)-variable models and keep the one with least training error.
Continue the above removal process until a stopping rule is reached.
LR: practices
Feature Processing
Learning rate
Polynomial Regression
Normal equation
Feature selection
Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization (Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices