
Lecture 2: Linear Regression
Instructor: Xiaobai Liu

Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization(aka: Learning or training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices

Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization(Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices

Regression: single feature
Housing Price

Supervised Learning: historical sales (trading) data is available
Regression: predict a real-valued price

[Figure: scatter plot of Size (feet²) vs. Price (K$)]

Training Set (92115)
Size (feet²) | Price (K$)
856 | 399.5
1512 | 449
865 | 350
1044 | 345
… | …

Notations

x: input features (i.e., size)
y: output results (i.e., price)

Linear Regression: Prediction Functions
Goal: learn a function y=f(x) that maps from x to y
a linear function in linear regression
e.g., size of house → housing price
Formally,

f(x) = θ_0 + θ_1 x

(θ_0, θ_1): parameters to learn (intercept and slope)

Quiz

E.g., given (θ_0 = 50, θ_1 = 300), what's the price of a house of 895 square feet?

(Answer: 50 + 300 × 895 = $268,550)
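A two-line Python check of this arithmetic (the parameter values are the ones reconstructed above from the stated answer):

    # linear prediction f(x) = theta0 + theta1 * x
    theta0, theta1 = 50.0, 300.0   # parameters taken from the quiz above
    print(theta0 + theta1 * 895)   # 268550.0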

Illustration of Linear Regression Models
Parameters: (θ_0, θ_1)

[Figure: Size (feet²) vs. Price (K$) with regression lines for different values of (θ_0, θ_1)]

How to learn parameters?
Idea: choose parameters (θ_0, θ_1) so that f(x) is close to y for the training samples (x, y)

[Figure: Size (feet²) vs. Price (K$) with several candidate regression lines]

Which one is the best?

Linear Regression: Cost Function
Choose parameters (θ_0, θ_1) so that f(x) is close to y for all the training samples (x, y)

Minimize
F(θ_0, θ_1) = Σ_i (f(x_i) - y_i)^2 = Σ_i (θ_0 + θ_1 x_i - y_i)^2
Residual Sum of Squares (RSS)
Least-squares loss

Note that:
For fixed parameters, f(x) is a function of x
For a fixed set of (x, y), F(θ_0, θ_1) is a function of the parameters
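As an illustration only (not code from the lecture notebook), a minimal NumPy sketch of this RSS cost on a few (x, y) pairs; the toy values are taken from the training-set table earlier in the slides:

    import numpy as np

    def rss(theta0, theta1, x, y):
        """Residual Sum of Squares: sum_i (theta0 + theta1 * x_i - y_i)^2."""
        residuals = theta0 + theta1 * x - y
        return np.sum(residuals ** 2)

    # toy data (sizes in feet^2, prices in K$) from the training-set table
    x = np.array([856.0, 1512.0, 865.0, 1044.0])
    y = np.array([399.5, 449.0, 350.0, 345.0])
    print(rss(100.0, 0.25, x, y))   # cost of one particular parameter choice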

Residual of Cost Function
Indicates whether the training procedure has converged;
Used to estimate the confidence of the outputs

[Figure: Size (feet²) vs. Price (K$) with the residuals between the fitted line and the data points]

How to optimize the cost function
Minimize F(θ_0, θ_1) w.r.t. θ_0, θ_1

Iterative method
Start with some initial values of θ_0, θ_1
Keep changing θ_0, θ_1 to reduce F(θ_0, θ_1)
Stop when certain conditions are satisfied

A computer-based solution
Iterative method
Start with some initial values of θ_0, θ_1
Randomly generate new values for (θ_0, θ_1); keep them if they give the smallest F(θ_0, θ_1) seen so far
Stop when certain conditions are satisfied (a minimal sketch follows below)

Pros:
Easy to implement

Cons:
Difficult to converge
Performs poorly on highly complicated loss functions
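A minimal sketch of this random-search procedure (for illustration; the search ranges and iteration budget are assumptions):

    import numpy as np

    def random_search(x, y, n_iters=10000, seed=0):
        """Keep the randomly drawn (theta0, theta1) with the smallest RSS seen so far."""
        rng = np.random.default_rng(seed)
        best, best_cost = (0.0, 0.0), np.inf
        for _ in range(n_iters):
            theta0 = rng.uniform(-500.0, 500.0)   # assumed search range for the intercept
            theta1 = rng.uniform(-1.0, 1.0)       # assumed search range for the slope
            cost = np.sum((theta0 + theta1 * x - y) ** 2)
            if cost < best_cost:
                best, best_cost = (theta0, theta1), cost
        return best, best_cost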

A smart solution

Review: Quadratic Functions

Zero, one, or two real roots.
One extreme, called the vertex.
No inflection points.
Line symmetry through the vertex. (Axis of symmetry.)
Rises or falls at both ends.
Can be constructed from three non-collinear points or three pieces of information.
One fundamental shape.
Roots are solvable by radicals. (Quadratic Formula.)

Review: Quadratic Functions

Review: gradient at a point
Quiz: How to tell whether the gradient at a point is negative or positive?

Iterative methods
To minimize F(θ_0, θ_1), we use an iterative method
Start with some initial values of θ_0, θ_1
Keep changing θ_0, θ_1 to reduce F(θ_0, θ_1)
Stop when certain conditions are satisfied

[Figure: a convex cost curve with its minimum marked]

Iterative methods
To minimize F(θ_0, θ_1), we use an iterative method
Start with some initial values of θ_0, θ_1
Keep changing θ_0, θ_1 to reduce F(θ_0, θ_1)
Stop when certain conditions are satisfied

Question: how to change θ_0, θ_1?

Iterative methods
To the left of the minimum, the change to θ should be positive (the gradient there is negative)
To the right of the minimum, the change should be negative (the gradient there is positive)

Question: how to set the change automatically?
Δθ = -α · dF(θ)/dθ

Gradient-based methods

Solution: gradient-based method
To solve for θ_0
Initialize θ_0
Repeat until convergence: θ_0 := θ_0 - α ∂F(θ_0, θ_1)/∂θ_0

To solve for θ_1
Initialize θ_1
Repeat until convergence: θ_1 := θ_1 - α ∂F(θ_0, θ_1)/∂θ_1

Review: First-order Derivatives

Function              | Derivative
y = ax + b            | dy/dx = a
y = -ax^2 + b         | dy/dx = -2ax
y = -log x            | dy/dx = -1/x
y = exp(-ax)          | dy/dx = -a exp(-ax)
f(x, y) = ax + by + c | ∂f/∂x = a, ∂f/∂y = b

Method: Gradient Descent

Initialize θ_0, θ_1
Repeat until convergence:
θ_j := θ_j - α ∂F(θ_0, θ_1)/∂θ_j   (for j = 0, 1)

Gradient Descent for RSS
Cost Function
F(θ_0, θ_1) = Σ_i (θ_0 + θ_1 x_i - y_i)^2

What are the derivatives of F(θ_0, θ_1) w.r.t. θ_0 and θ_1?
∂F/∂θ_0 = 2 Σ_i (θ_0 + θ_1 x_i - y_i)
∂F/∂θ_1 = 2 Σ_i (θ_0 + θ_1 x_i - y_i) x_i

Gradient Descent for RSS
Repeat until convergence {
θ_0 := θ_0 - 2α Σ_i (θ_0 + θ_1 x_i - y_i)
θ_1 := θ_1 - 2α Σ_i (θ_0 + θ_1 x_i - y_i) x_i
}

At each iteration, update θ_0 and θ_1 simultaneously
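A minimal NumPy sketch of this loop (for illustration, not the lecture's notebook code; the 1/m averaging of the gradient, the learning rate, the iteration count, and the standardization of x are assumptions chosen so that a fixed step size behaves well):

    import numpy as np

    def gd_linear_regression(x, y, alpha=0.01, n_iters=1000):
        """Batch gradient descent for f(x) = theta0 + theta1*x with the RSS cost."""
        theta0, theta1 = 0.0, 0.0
        m = len(x)
        for _ in range(n_iters):
            residuals = theta0 + theta1 * x - y        # f(x_i) - y_i for all i
            grad0 = 2.0 * np.sum(residuals) / m        # averaged dF/dtheta0
            grad1 = 2.0 * np.sum(residuals * x) / m    # averaged dF/dtheta1
            theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
        return theta0, theta1

    # usage on toy data, with x standardized so a fixed step size converges
    x = np.array([856.0, 1512.0, 865.0, 1044.0])
    y = np.array([399.5, 449.0, 350.0, 345.0])
    xs = (x - x.mean()) / x.std()
    print(gd_linear_regression(xs, y))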

Understanding GD

In the update θ_1 := θ_1 - 2α Σ_i (f(x_i) - y_i) x_i:
f(x_i) is the prediction, y_i is the true label, and x_i is the feature.

Understanding GD
The learning rate α should be set empirically
Too small: slow convergence
Too large: may fail to converge, or diverge

α can be kept fixed over time, or adapted (adaptive step sizes)

Understanding GD
Batch Gradient Descent
At each step, access all the training samples

How to improve convergence
Access more training samples

Learning rate
Smaller or bigger

Better initialization: use the closed-form solution
Normal equation

Variants of GD
Access all training samples at each iteration
Full Batch

Access a portion of training samples at each iteration
Mini Batch

Access a single training sample at each iteration
Online Learning

Recap
Prediction Function: Linear Function

Cost Function: Residual Sum of Squares

Optimization: Gradient Descent Method

Testing: Measurement Error
Once a regression model is trained, apply the prediction function to each testing sample and compare its prediction to the true label
Both labels are real-valued
L2 error: (f(x) - y)^2
L1 error: |f(x) - y|
With multiple testing samples, report both the mean and the standard deviation of the error
Popular metric: the coefficient of determination, R^2
A measure of how well observed outcomes are replicated by the model, based on the proportion of the total variation of outcomes explained by the model

Regression Measure: R^2
Let y_i denote the true label of sample i, ȳ the mean of the labels in the testing dataset, and ŷ_i the predicted label of sample i.
We have

R^2 = 1 - SS_res / SS_tot

where SS_res = Σ_i (y_i - ŷ_i)^2 and SS_tot = Σ_i (y_i - ȳ)^2.

R^2 = 1 indicates that the predicted labels exactly match the true labels;
R^2 = 0 indicates that the model does no better than always predicting the mean ȳ;
R^2 = 0.49 indicates that 49% of the variability of the dependent variable (the true labels) is accounted for
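A minimal sketch of this computation (sklearn.metrics.r2_score computes the same quantity):

    import numpy as np

    def r_squared(y_true, y_pred):
        """R^2 = 1 - SS_res / SS_tot."""
        ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
        return 1.0 - ss_res / ss_tot

    # tiny example with made-up predictions
    y_true = np.array([399.5, 449.0, 350.0, 345.0])
    y_pred = np.array([390.0, 455.0, 360.0, 350.0])
    print(r_squared(y_true, y_pred))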

Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization(Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices

Regression: multiple features
Example: Housing Price

Size | Bedroom | Bathroom | Built year | Stories | Price (K$)
1024 | 3 | 2 | 1978 | 1 | 375
1329 | 3 | 2 | 1992 | 1 | 425
1893 | 4 | 2 | 1980 | 2 | 465
… | … | … | … | … | …

Single feature: f(x) = θ_0 + θ_1 x
Multiple features: f(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + …

Review: Linear Algebra
Matrix/Vector
Addition and Scalar Multiplication
Matrix-vector/matrix-matrix multiplication
Matrix multiplication is not commutative: A × B ≠ B × A in general
Matrix multiplication is associative: (A B) C = A (B C)
Identity matrix
Inverse and Transpose

Notations
Let x = [1, x_1, x_2, …, x_n]^T and θ = [θ_0, θ_1, …, θ_n]^T,

then we have f(x) = θ^T x.

Let x^(i) represent the i-th training sample and y^(i) its label.

Gradient Descent for Multiple Features

Gradient Descent
Repeat {
θ_j := θ_j - α ∂F(θ)/∂θ_j   (for j = 0, 1, …, n)
}

Cost Function
F(θ) = Σ_i (θ^T x^(i) - y^(i))^2

∂F(θ)/∂θ_j = 2 Σ_i (θ^T x^(i) - y^(i)) x_j^(i)

Gradient Descent for Multiple Features
Gradient Descent
Repeat {
θ_j := θ_j - 2α Σ_i (θ^T x^(i) - y^(i)) x_j^(i)   (for all j)
}

Update all θ_j simultaneously
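A vectorized NumPy sketch of this loop (for illustration; X is assumed to already include a leading column of ones, and the 1/m gradient averaging, learning rate, and iteration count are assumptions):

    import numpy as np

    def gd_multifeature(X, y, alpha=0.1, n_iters=1000):
        """Batch gradient descent for f(x) = theta^T x with the RSS cost.

        X is an (m, n+1) design matrix whose first column is all ones.
        """
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(n_iters):
            residuals = X @ theta - y            # predictions minus labels
            grad = 2.0 * (X.T @ residuals) / m   # averaged dF/dtheta_j for all j at once
            theta = theta - alpha * grad         # simultaneous update of every theta_j
        return theta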

Dealing with Qualitative Features
Some predictors/features are not quantitative but qualitative, taking values in a discrete set
Categorical predictors
Factor variables
E.g. house type, short sale, gender, student status
Consider a feature, house type:

x = 1 if single family, x = 0 otherwise

Conti.
The resulting model is:

f(x) = θ_0 + θ_1 if single family, f(x) = θ_0 otherwise

Additional dummy variables can be introduced when a qualitative feature has more than two levels, e.g.

x_1 = 1 if single family, 0 otherwise (with one dummy variable per remaining level)
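A minimal sketch of such an encoding with pandas.get_dummies (for illustration; the DataFrame, its column names, and the category values are hypothetical):

    import pandas as pd

    # hypothetical toy frame with one categorical feature
    df = pd.DataFrame({
        "size": [1024, 1329, 1893],
        "house_type": ["single_family", "condo", "single_family"],
    })
    # one dummy (0/1) column per category value; drop_first avoids a redundant column
    encoded = pd.get_dummies(df, columns=["house_type"], drop_first=True)
    print(encoded)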

Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization(Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices

Case Study: Linear Regression
Project 1: Synthetic DataSet

Project 2: Housing Dataset

Project 1: Synthetic Dataset
A single input feature and a continuous target variable (or output)

[Figure: scatter plot of Feature (x-axis) vs. Output (y-axis)]

Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Load dataset
Generate a set of data points (x, y), where x is the input and y is the output

Built-in functions: rnd.uniform()
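A minimal sketch of this step (for illustration; the range, sample size, noise level, and the true slope/intercept of the synthetic line are assumptions):

    import numpy as np

    rnd = np.random.RandomState(42)
    x = rnd.uniform(-3, 3, size=60)                       # single input feature
    y = 0.5 * x + 1.0 + rnd.normal(scale=0.3, size=60)    # noisy linear target
    X = x.reshape(-1, 1)                                  # scikit-learn expects a 2-D feature array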

Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Visualization
Plotting all the data points in a 2-D figure

Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Data Splitting
Partition the dataset into two subsets: used for training and testing, respectively.
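For example, with scikit-learn's train_test_split (the 25% test fraction and the random seed are assumptions; X and y are the arrays generated above):

    from sklearn.model_selection import train_test_split

    # hold out 25% of the (X, y) points generated above for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)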

Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Training
Use the LinearRegression class from the sklearn.linear_model module
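A minimal sketch of the training step, continuing from the split above:

    from sklearn.linear_model import LinearRegression

    lr = LinearRegression()
    lr.fit(X_train, y_train)           # learn the intercept (theta_0) and slope (theta_1)
    print(lr.intercept_, lr.coef_)     # learned parameters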

Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Predictions
Predictions over individual data points
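For example, continuing with the fitted model above:

    y_pred = lr.predict(X_test)        # predictions for the held-out points
    print(y_pred[:5])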

Project 1: outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Evaluations
Make a prediction for every testing data point, compare it to its ground-truth value, and calculate the R^2:
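For example, using the model's score method, which returns R^2 (sklearn.metrics.r2_score gives the same value):

    from sklearn.metrics import r2_score

    print(lr.score(X_test, y_test))               # R^2 on the test split
    print(r2_score(y_test, lr.predict(X_test)))   # same value, computed explicitly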

Project 1: summary
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Project 2: Housing Dataset
Boston Housing dataset
To predict the median value of homes in several Boston neighborhoods
Feature predictors: crime rate, proximity to the Charles River, highway accessibility, etc.
506 data points, 13 features

Project 2: Outline
Load dataset
Visualization
Data splitting
Training
Predictions
Evaluations

Load and split dataset
scikit-learn provides built-in access to this dataset
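A minimal sketch of loading and splitting the data. Note that load_boston was removed from recent scikit-learn releases, so this sketch falls back to fetching the same data from OpenML; that fallback (and its column encoding) is an assumption:

    from sklearn.model_selection import train_test_split

    try:
        from sklearn.datasets import load_boston   # available only in older scikit-learn versions
        data = load_boston()
        X, y = data.data, data.target
    except ImportError:
        from sklearn.datasets import fetch_openml  # fallback: same data hosted on OpenML
        boston = fetch_openml(name="boston", version=1, as_frame=False)
        X, y = boston.data, boston.target.astype(float)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)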

Training and evaluations
Measurement: R^2

Summary
Project 1: Synthetic DataSet

Project 2: Housing Dataset

Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization(Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices

Regression: Overfitting

Regularization
A popular approach to reduce overfitting is to regularize the values of the coefficients (column vector θ), which works well in practice.

min_θ  F(θ) + λ R(θ)

where F(θ) is the cost function (e.g., the least-squares loss for linear regression), R(θ) is an extra term over the parameters, and λ is a constant.

Ridge regularization: R(θ) = Σ_j θ_j^2 (squared L2 norm)
Lasso regularization: R(θ) = Σ_j |θ_j| (L1 norm)
Elastic Net regularization: a weighted combination of the L1 and L2 terms

Regularized Regression: Learning

Gradient Descent
Repeat {
θ_j := θ_j - α ( 2 Σ_i (θ^T x^(i) - y^(i)) x_j^(i) + 2 λ θ_j )
}

Update all θ_j simultaneously

Cost Function (here with the ridge / L2 penalty)
F(θ) = Σ_i (θ^T x^(i) - y^(i))^2 + λ Σ_j θ_j^2
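A minimal NumPy sketch of one ridge-regularized gradient step (for illustration; the 1/m averaging, the learning rate, and penalizing the intercept along with the other coefficients are assumptions):

    import numpy as np

    def ridge_gd_step(theta, X, y, alpha=0.1, lam=1.0):
        """One gradient step on F(theta) = ||X theta - y||^2 + lam * ||theta||^2."""
        m = X.shape[0]
        grad = 2.0 * (X.T @ (X @ theta - y)) / m   # gradient of the least-squares term
        grad += 2.0 * lam * theta / m              # gradient of the L2 (ridge) penalty
        return theta - alpha * grad                # simultaneous update of all theta_j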

Outline of the rest
Case study: Regularization

Boston Housing dataset
To predict the median value of homes in several Boston neighborhoods
Feature predictors: crime rate, proximity to the Charles River, highway accessibility, etc.
506 data points, 13 features

Case A: Linear Regression
Load dataset

Import libraries

Training
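A minimal sketch of Case A (for illustration; the training/testing splits are the ones created above, and the printed scores are R^2 values):

    from sklearn.linear_model import LinearRegression

    lr = LinearRegression().fit(X_train, y_train)
    print("training score:", lr.score(X_train, y_train))   # R^2 on the training split
    print("testing score:", lr.score(X_test, y_test))      # R^2 on the testing split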

Case B: Ridge Regression (L2 regularization)
Ridge Regression

New results (worse training score, better testing score)

Previous Results (without regularization)

Case B: Ridge Regression (L2 regularization)

Training with different regularization weights (alpha below)
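A minimal sketch of such an experiment with scikit-learn's Ridge class (the particular alpha values are assumptions; X_train, X_test, y_train, y_test come from the split above):

    from sklearn.linear_model import Ridge

    for alpha in [0.1, 1.0, 10.0]:                         # assumed regularization weights
        ridge = Ridge(alpha=alpha).fit(X_train, y_train)
        print(alpha,
              round(ridge.score(X_train, y_train), 3),     # training R^2
              round(ridge.score(X_test, y_test), 3))       # testing R^2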

Source codes available
Lecture-LinearRegression.ipynb

LASSO: L1 Regularization

Instead of an L2 term, use the L1 norm to regularize the model parameters

Still uses gradient descent during training
Sub-gradients (introduced in later sections)

Encourages sparsity in the learned coefficients

Code available in Lecture-LinearRegression.ipynb as well
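A minimal sketch with scikit-learn's Lasso class, illustrating the sparsity of the learned coefficients (the alpha value is an assumption; the data splits come from above):

    import numpy as np
    from sklearn.linear_model import Lasso

    lasso = Lasso(alpha=1.0).fit(X_train, y_train)
    print(lasso.score(X_test, y_test))   # test R^2
    print(np.sum(lasso.coef_ != 0), "nonzero coefficients out of", lasso.coef_.size)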

Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization(Learning/Training)
Linear Regression With Multiple Features
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices

LR: practices
Feature Processing
Learning rate
Polynomial Regression
Normal equation
Feature selection

Practice 1: Feature Processing
A lot of processing needs to be done before even coding a machine learning system

Processing 1: Feature Scaling
Ensure features are on a similar scale
E.g., two features for the house-pricing problem, Size (100-1000 square feet) and Bedrooms (1-5), have different scales
Scale features to lie, e.g., between -1 and 1
Apply the same scaling to both training and testing data
Processing 2: Mean Normalization
Replace x_j with x_j - μ_j to make features have approximately zero mean
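A minimal sketch of these two steps with scikit-learn (for illustration; StandardScaler performs mean normalization and additionally divides by the per-feature standard deviation, and it must be fit on the training split only):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(X_train)     # learn per-feature mean and std on the training split
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)   # apply the same transformation to the test split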

Practice 2: Learning Rate
How to choose the learning rate α?

[Figure: cost F(θ) vs. number of iterations for different learning rates]

Too small: slow convergence
Too large: no convergence
To choose the learning rate, try, e.g., 0.001, 0.05, 0.01

Practice 3: Polynomial Regression
Case: housing price

x: area
x_1: size of living area
x_2: size of yard
f(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + …

Good practice: avoid redundant features

Practice 3: Polynomial Regression
Polynomial regression: f(x) = θ_0 + θ_1 x + θ_2 x^2 + …

[Figure: Price (y) vs. Size (x) with a polynomial fit]
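A minimal sketch using scikit-learn's PolynomialFeatures together with LinearRegression (the polynomial degree is an assumption; the data splits come from the case study above):

    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # expand the features into polynomial terms, then fit an ordinary linear model on them
    poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    poly_model.fit(X_train, y_train)
    print(poly_model.score(X_test, y_test))    # test R^2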

Practice 4: Normal Equation
Iterative method

Normal Equation: solve parameters analytically

Practice 4: Normal Equation
Basic idea:
Set ∂F(θ)/∂θ = 0
Solve for θ

Example: least-squares loss
F(θ) = ||Xθ - y||^2

Let X be the design matrix whose i-th row is (x^(i))^T, and y the vector of labels.

Solve as follows:
θ = (X^T X)^(-1) X^T y
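A minimal NumPy sketch of the normal equation (for illustration; np.linalg.solve is used instead of an explicit inverse for numerical stability, and the data split comes from the case study above):

    import numpy as np

    def normal_equation(X, y):
        """Solve theta = (X^T X)^{-1} X^T y; X should include a leading column of ones."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    # usage: prepend a column of ones (the intercept term) to the raw features
    X_design = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    theta = normal_equation(X_design, y_train)
    print(theta[:3])   # intercept and first two coefficients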


Example 4: Normal equation

Size | Bedroom | Bathroom | Built year | Stories | Price (K$)
1024 | 3 | 2 | 1978 | 1 | 375
1329 | 3 | 2 | 1992 | 1 | 425
1893 | 4 | 2 | 1980 | 2 | 465
… | … | … | … | … | …

y = [375, 425, 465, …]^T

X =
[1 1024 3 2 1978 1]
[1 1329 3 2 1992 1]
[1 1893 4 2 1980 2]
…

Practice 4: Normal Equation
Normal Equation
No need to choose a learning rate α
No iterations
Needs to compute the matrix inverse (X^T X)^(-1), which may be too expensive when the matrix is large

Gradient Descent
Need to choose α
Needs many iterations
Scales to large problems (many training samples)

Normal Equation
What if X^T X is non-invertible?

Too many features
Delete some redundant features
Use regularization

Practice 5: Selecting Important Features
Basic observation: some features are more important than others

Direct approach: subset methods
For all possible subsets, compute the least-squares fit and choose the subset that balances training error and model size
Exploring all subsets is infeasible: there are 2^d subsets of d features
Two alternative approaches
Forward selection
Backward selection

Conti.
Forward selection (a minimal sketch follows below)
Begin with the null model: only an intercept, no predictors (features)
Train d single-variable models and add to the null model the variable that results in the lowest RSS (one-variable model)
Expand the one-variable model to all candidate two-variable models and keep the one with the lowest RSS (two-variable model)
Continue until some stopping condition is satisfied
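A minimal sketch of this forward-selection loop (for illustration; it uses training RSS as the selection criterion and a fixed number of selected features as the stopping rule, both of which are assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_selection(X, y, n_select=5):
        """Greedily add, one at a time, the feature that most reduces training RSS."""
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(min(n_select, len(remaining))):
            rss_by_feature = {}
            for j in remaining:
                cols = selected + [j]
                pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
                rss_by_feature[j] = np.sum((y - pred) ** 2)
            best = min(rss_by_feature, key=rss_by_feature.get)   # feature with lowest RSS
            selected.append(best)
            remaining.remove(best)
        return selected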

Conti.
Backward selection
Start with all variables in the model (let d denote the number of variables)
Remove one variable from the model to get a (d-1)-variable model; test all (d-1)-variable models and keep the one with the lowest training error
Continue the removal process until a stopping rule is reached

LR: practices
Feature Processing
Learning rate
Polynomial Regression
Normal equation
Feature selection

Outline of This Lecture
Linear Regression (With One Feature)
Prediction Function
Cost Function
Optimization(Learning/Training)
Linear Regression With Multiple Features
Case Study
Regularized Regression (Ridge, Lasso and Elastic Net)
Best Practices
