Lecture 2: Linear Regression
Instructor:
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning or Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning or Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Regression: single feature
} Housing Price
[Scatter plot: Size (feet²) on the x-axis, Price (K$) on the y-axis]
} Supervised Learning: trading history available
} Regression: predict a real-valued price
Training Set (92115): Size (feet²) → Price (K$)
} Notations
x: x₁, x₂, x₃, …, input features (i.e., size)
y: y₁, y₂, y₃, …, output results (i.e., price)
Linear Regression: Prediction Functions
} Goal: learn a function y = f(x) that maps from x to y
o a linear function in linear regression
o e.g., size of house → housing price
} Formally,
y = f_θ(x) = θ₀ + θ₁x
(θ₀, θ₁): parameters to learn (intercept and slope)
y = f_θ(x) = θ₀ + θ₁x
E.g., with (θ₀ = 50, θ₁ = 300), what's the price of a house of 895 square feet?
Illustration of linear Regression Models
} Parameters: (θ₀, θ₁)
y = 0 + 0.177x
y = 200 + 0x
y = 100 + 0.11x
[Plot: the three lines over the Size (feet²) vs. Price (K$) data]
How to learn parameters?
Which one is the best?
[Scatter plot: Size (feet²) vs. Price (K$) with candidate lines]
} Idea: choose parameters θ = (θ₀, θ₁) so that f_θ(x) is close to y for the five training samples (x, y)
Linear Regression: Cost function
} Choose parameters so that f(x) is close to y for all the five training samples (x, y)
Minimize F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
} Residual Sum of Squares (RSS)
} Least Square Loss
} Note that:
} For fixed parameters, y = f_θ(x) is a function of x
} For a fixed set of (x, y), F(θ₀, θ₁) is a function of the parameters
Residual of Cost Function
[Plot: residuals between the data points and the fitted line, Size (feet²) vs. Price (K$)]
} Residuals indicate whether the training procedure has converged
} Residuals can be used to estimate confidence of outputs
How to optimize cost function
} Minimize the function F(θ₀, θ₁) w.r.t. θ₀, θ₁
F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
} Iterative method
i. Start with some initial values θ₀ = 0, θ₁ = 0
ii. Keep changing θ₀, θ₁ to reduce F(θ₀, θ₁)
iii. Stop when certain conditions are satisfied
A Computer-based solution
Iterative method
i. Start with some initial values θ₀ = 0, θ₁ = 0
ii. Randomly generate new values for θ₀, θ₁ and keep them if F(θ₀, θ₁) is the lowest seen so far
iii. Stop when certain conditions are satisfied
} Easy to implement
} Difficult to converge
} Performs poorly on highly complicated loss functions
(a minimal random-search sketch is given below)
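A minimal sketch of this random-search idea; the toy house-size/price arrays and the sampling ranges are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Toy data: house sizes (feet^2) and prices (K$); values are illustrative only.
x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([100.0, 180.0, 260.0, 350.0, 440.0])

def rss(theta0, theta1):
    """Cost F(theta0, theta1) = (1/2m) * sum_i (theta0 + theta1*x_i - y_i)^2."""
    m = len(x)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

rng = np.random.default_rng(0)
best = (0.0, 0.0)                  # step i: start from theta0 = 0, theta1 = 0
best_cost = rss(*best)
for _ in range(10000):             # step ii: propose random parameters, keep the best so far
    cand = (rng.uniform(-500, 500), rng.uniform(-1, 1))
    cost = rss(*cand)
    if cost < best_cost:
        best, best_cost = cand, cost
print(best, best_cost)             # step iii: stop after a fixed iteration budget
```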
A smart solution
Review: Quadratic Functions
• Zero, one, or two real roots.
• One extreme point, called the vertex.
• No inflection points.
• Line symmetry through the vertex (axis of symmetry).
• Rises or falls at both ends.
• Can be constructed from three non-collinear points or three pieces of information.
• One fundamental shape.
• Roots are solvable by radicals (quadratic formula).
Review: Quadratic Functions
Review: gradient at a point
F = func(θ)
Quiz: How to tell whether the gradient at a point 𝜃 is negative or positive?
Iterative methods
} To minimize F(θ₀), we use the iterative method
i. Start with some initial value θ₀ = rand
ii. Keep changing θ₀ to reduce F(θ₀)
iii. Stop when certain conditions are satisfied
[Plots: F(θ₀) vs. θ₀] Question: how to change θ₀?
Iterative methods
θ₀ = θ₀ − C
Left: C should be positive; Right: C should be negative
Question: how to select C?
Solution: gradient based method
To solve θ₀
} Initialize θ₀
} Repeat until convergence
} θ₀ = θ₀ − α ∂F(θ₀)/∂θ₀
To solve θ₁
} Initialize θ₁
} Repeat until convergence
} θ₁ = θ₁ − α ∂F(θ₁)/∂θ₁
Review: First-order Derivative
y = ax + b, y = −ax² + b, y = −log x, y = exp(−ax), y = ax + by + c
Method : Gradient Descent
F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
} Initialize θ₀, θ₁
} Repeat until convergence
} θⱼ = θⱼ − α ∂F/∂θⱼ, j = 0, 1
Gradient Descent for RSS
} Cost Function
F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
         = (1/2m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ)²
} What are the derivatives of F() w.r.t. θ₀ and θ₁?
∂F(θ₀, θ₁)/∂θ₀ = (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ)
∂F(θ₀, θ₁)/∂θ₁ = (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
Gradient Descent for RSS
} Repeat until convergence {
θ₀ = θ₀ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ)
θ₁ = θ₁ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
}
} At each iteration, update θ₀ and θ₁ simultaneously (a NumPy sketch is given below)
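A minimal NumPy sketch of these update rules, reusing the illustrative toy data from the random-search example; the learning rate, iteration count, and feature scaling are assumptions, not values from the slides:

```python
import numpy as np

x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])   # sizes (feet^2), illustrative
y = np.array([100.0, 180.0, 260.0, 350.0, 440.0])        # prices (K$), illustrative

# Scale the feature so a single fixed learning rate works (see Practice 1 later).
x_scaled = (x - x.mean()) / x.std()

theta0, theta1 = 0.0, 0.0
alpha, m = 0.1, len(x)
for _ in range(1000):
    err = theta0 + theta1 * x_scaled - y            # (f(x_i) - y_i) for all i
    grad0 = err.mean()                              # (1/m) * sum_i (f(x_i) - y_i)
    grad1 = (err * x_scaled).mean()                 # (1/m) * sum_i (f(x_i) - y_i) * x_i
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update
print(theta0, theta1)
```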
Understanding GD
θ₁ = θ₁ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
where θ₀ + θ₁xᵢ is the prediction, yᵢ the true label, and xᵢ the feature
Understanding GD
} Learning rate α should be set empirically
} Too small: slow convergence
} Too large: fails to converge, or diverges
} Fixed α over time, or adaptive step sizes
Understanding GD
} Batch Gradient Descent
} At each step, access all the training samples
θ₁ = θ₁ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
How to improve convergence
Access more training samples
Learning rate: smaller or bigger
Better initialization: use the closed-form solution (normal equation)
Variants of GD
Access all training samples at each iteration
• Full Batch
Access a portion of training samples at each iteration
• Mini Batch
Access a single training sample at each iteration
• Online Learning
(a sketch contrasting these access patterns is given below)
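A minimal sketch of the three access patterns, reusing the illustrative toy arrays from the earlier gradient-descent example; the batch size, epoch count, and learning rate are arbitrary assumptions:

```python
import numpy as np

def gradient(theta0, theta1, xb, yb):
    """RSS gradient computed on whichever subset (xb, yb) we are given."""
    err = theta0 + theta1 * xb - yb
    return err.mean(), (err * xb).mean()

def run(x, y, batch_size, alpha=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta0 = theta1 = 0.0
    m = len(x)
    for _ in range(epochs):
        idx = rng.permutation(m)                       # shuffle once per epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            g0, g1 = gradient(theta0, theta1, x[b], y[b])
            theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    return theta0, theta1

x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
x = (x - x.mean()) / x.std()                            # scaled, as before
y = np.array([100.0, 180.0, 260.0, 350.0, 440.0])

print(run(x, y, batch_size=len(x)))   # full batch
print(run(x, y, batch_size=2))        # mini batch
print(run(x, y, batch_size=1))        # a single sample at a time (online style)
```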
} Prediction Function: Linear Function
} Cost Function: Residual Sum of Squares
} Optimization: Gradient Descent Method
Testing: Measurement Error
} Once a regression model is trained, apply the prediction function to each testing sample and compare its prediction ŷᵢ to the true label yᵢ
} Both labels are real-valued
} With multiple testing samples, report both mean and std
} Popular metric: coefficient of determination,
R² or R-squared
} A measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model
} L2 error: (yᵢ − ŷᵢ)²
} L1 error: |yᵢ − ŷᵢ|
Regression Measure: R²
} Let yᵢ denote the true label of sample i, ȳ the mean of the labels in the testing dataset, and ŷᵢ the predicted label of sample i.
} We have
R² = 1 − E_R / E_T
where
E_R = Σᵢ (yᵢ − ŷᵢ)²
E_T = Σᵢ (yᵢ − ȳ)²
} R² = 1 indicates that the predicted labels exactly match the true labels
} R² = 0 indicates that the model does no better than always predicting ȳ
} R² = 0.49 indicates that 49% of the variability of the dependent variable (true labels) is accounted for by the model
(a short computation sketch is given below)
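A small sketch of this metric, computed both by hand and with scikit-learn's r2_score; the label values are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([250.0, 300.0, 410.0, 520.0])   # illustrative true prices
y_pred = np.array([265.0, 290.0, 400.0, 500.0])   # illustrative predictions

e_r = np.sum((y_true - y_pred) ** 2)              # residual sum of squares
e_t = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
print(1 - e_r / e_t)                              # manual R^2
print(r2_score(y_true, y_pred))                   # same value from sklearn
```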
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Regression: multiple feature
} Example: Housing Price, with features such as Size (feet²), Built year, …, and target Price (K$)
Single feature: y = f_θ(x) = θ₀ + θ₁x
Multiple features: y = f_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + …
Let θ = [θ₀, θ₁, θ₂, θ₃, …], x = [1, x₁, x₂, x₃, …]
then we have
y = f_θ(x) = θᵀx
Let xᵢ = [1, xᵢ₁, xᵢ₂, xᵢ₃, …] represent the i-th training sample and yᵢ its label.
Gradient Descent for Multiple Features
} Cost Function
F(θ) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
     = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)²
} Gradient Descent
Repeat {
Gradient Descent for Multiple Features
} Gradient Descent
θⱼ = θⱼ − α (1/m) Σᵢ (θᵀxᵢ − yᵢ) xᵢⱼ
} Update θ₀, θ₁, θ₂, … simultaneously (a vectorized sketch is given below)
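A vectorized NumPy sketch of this multi-feature update; the random design matrix and true coefficients are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])   # prepend x0 = 1 for the intercept
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)                # synthetic targets

theta = np.zeros(d + 1)
alpha = 0.1
for _ in range(2000):
    err = X @ theta - y                 # predictions minus labels, shape (m,)
    grad = X.T @ err / m                # (1/m) * sum_i (theta^T x_i - y_i) x_i, for every j at once
    theta -= alpha * grad               # simultaneous update of every theta_j
print(theta)                            # should approach true_theta
```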
Dealing with Qualitative Features
} Some predictors/features are not quantitative but qualitative, taking a discrete set of values
} Also called categorical predictors or factor variables
} E.g., house type, short sale, gender, student status
Consider a feature, house type:
x₁ = 1 if single family, 0 otherwise
Resulting model is:
y = β₀ + β₁x₁ = β₀ + β₁ if single family, β₀ otherwise
} Additional dummy variables:
y = β₀ + β₁x₁ + β₂x₂ = β₀ + β₁ if single family, β₀ + β₂ otherwise
(a small encoding sketch is given below)
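A brief sketch of turning such a categorical feature into dummy variables; the 'house_type' column and its categories are made up for illustration, and pandas' get_dummies is one common way to do the encoding:

```python
import pandas as pd

df = pd.DataFrame({
    "size": [1024, 1329, 1893],
    "house_type": ["single_family", "condo", "townhouse"],   # illustrative categories
    "price": [375, 425, 465],
})

# One dummy column per category; drop_first avoids a redundant column
# (the dropped category becomes the baseline absorbed by the intercept).
encoded = pd.get_dummies(df, columns=["house_type"], drop_first=True)
print(encoded)
```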
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Case Study: Linear Regression
} Project 1: Synthetic Dataset
} Project 2: Housing Dataset
Project 1: Synthetic Dataset
} A single input feature and a continuous target variable (or output)
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Load dataset
} Generate a set of data points (x, y), where x is the input and y is the output
Built-in functions: rnd.uniform()
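A possible sketch of such a generator, assuming a linear relation plus Gaussian noise; the coefficients, ranges, and noise level are arbitrary illustrative choices:

```python
import numpy as np

rnd = np.random.default_rng(42)
n = 60
x = rnd.uniform(-3, 3, size=n)                    # single input feature
y = 0.5 * x + 1.0 + rnd.normal(0, 0.3, size=n)    # linear target with Gaussian noise
```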
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Visualization
} Plotting all the data points in a 2-D figure
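A matplotlib sketch of this plot, reusing the x and y arrays generated in the previous step:

```python
import matplotlib.pyplot as plt

plt.scatter(x, y, marker="o")    # x, y from the data-generation step above
plt.xlabel("Feature")
plt.ylabel("Target")
plt.show()
```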
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Data Splitting
} Partition the dataset into two subsets: used for training and testing, respectively.
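A minimal splitting sketch with scikit-learn's train_test_split; the 75/25 split ratio is an assumption for illustration:

```python
from sklearn.model_selection import train_test_split

X = x.reshape(-1, 1)   # sklearn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```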
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
} Utilize the LinearRegression class from sklearn.linear_model
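A minimal training sketch with this class, continuing from the split above:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)            # learns the intercept (theta_0) and slope (theta_1)
print(lr.intercept_, lr.coef_)
```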
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Predictions
} Predictions over individual data points
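Continuing the sketch, predictions over the held-out points (y_pred is reused in the evaluation step below):

```python
y_pred = lr.predict(X_test)   # predictions for every test sample
print(y_pred[:5])             # predictions for a few individual data points
```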
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Evaluations
} Make a prediction on every testing data point, compare it to its ground-truth value, and calculate the R²:
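A sketch of the R² evaluation, assuming the fitted model lr, the split, and y_pred from the previous steps:

```python
from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))     # R^2 from the predictions
print(lr.score(X_test, y_test))     # equivalent value via the estimator's score method
```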
Project 1: summary
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Project 2: Housing Dataset
} Boston Housing dataset
} To predict the median value of homes in several Boston neighborhoods
} Feature predictors: crime rate, proximity to the , highway accessibility, etc.
} 506 data points, 13 features
Project 2: Outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Load and split dataset
} Sklearn provides a built-in access to this dataset
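The slides rely on scikit-learn's built-in loader; note that load_boston was deprecated and removed in recent scikit-learn releases, so a sketch that instead pulls the same 506-sample, 13-feature dataset from OpenML (an assumed alternative, dataset name "boston") might look like this, also covering the training and R² evaluation mentioned on the next slide:

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# load_boston() existed in older scikit-learn versions but was removed in 1.2+;
# OpenML hosts a copy of the same data, fetched here as plain numeric arrays.
boston = fetch_openml(name="boston", version=1, as_frame=False)
X, y = boston.data, boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))     # R^2 on the held-out split
```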
Training and evaluations
} Measurement: R²
} Project 1: Synthetic Dataset
} Project 2: Housing Dataset
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Regression: Overfitting
Regularization
} A popular approach to reduce overfitting is to regularize the values of the coefficients θ = [θ₀, θ₁, θ₂, …]ᵀ (a column vector), which works well in practice.
min_θ F(θ) + λ G(θ)
where F(θ) is the cost function, e.g., the least-square loss for linear regression, G(θ) is an extra term over the parameters, and λ is a constant
} Ridge regularization: G(θ) = G_ridge(θ) = θᵀθ = Σⱼ θⱼ²
} Lasso regularization: G(θ) = G_lasso(θ) = ‖θ‖₁ = Σⱼ |θⱼ|
} Elastic Net regularization: G(θ) = G_ridge(θ) + G_lasso(θ)
Regularized Regression: Learning
} Cost Function
F(θ) = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)² + λ θᵀθ
} Gradient Descent
} Update θ₀, θ₁, θ₂, … simultaneously
θⱼ = θⱼ − α [ (1/m) Σᵢ (θᵀxᵢ − yᵢ) xᵢⱼ + 2λθⱼ ]
Outline of the rest
} Case study: Regularization
} Boston Housing dataset
} To predict the median value of homes in several Boston neighborhoods
} Feature predictors: crime rate, proximity to the , highway accessibility, etc.
} 506 data points, 13 features
Case A: Linear Regression
Import libraries
Load dataset
Case B: Ridge Regression (L2 regularization)
Previous Results (without regularization)
Ridge Regression
( worse training score, better testing score)
Case B: Ridge Regression (L2 regularization)
Training with different regularization weights (alpha below)
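A sketch of such a sweep with scikit-learn's Ridge, reusing X_train/X_test/y_train/y_test from the Boston sketch above; the alpha grid is an illustrative choice:

```python
from sklearn.linear_model import Ridge

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha,
          round(ridge.score(X_train, y_train), 3),   # training R^2
          round(ridge.score(X_test, y_test), 3))     # testing R^2
```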
Source codes available
} Lecture02-LinearRegression.ipynb
} Ordinary linear regression
} Ridge Regression (L2)
} Lasso (L1)
} Elastic Net (L1+L2)
LASSO: L1 Regularization
} Using the L1 norm to regularize model parameters:
F(θ) = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)² + λ Σⱼ |θⱼ|
} Still use Gradient Descent during training
} Sub-gradients (introduced in later sections)
} Encourage sparsity over the learned coefficients
} Codes available in Lecture02-LinearRegression.ipynb as well; a brief scikit-learn sketch is given below
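A brief scikit-learn sketch illustrating the sparsity effect, again reusing the Boston train/test split from above; the alpha value and iteration cap are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0, max_iter=10000).fit(X_train, y_train)
print(lasso.score(X_test, y_test))                    # R^2 on the test split
print(np.sum(lasso.coef_ != 0), "of", lasso.coef_.size, "coefficients are non-zero")
```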
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Regularized Regression (Ridge, Lasso and Elastic Net)
LR: practices
1. Feature Processing
2. Learning rate
3. Polynomial Regression
4. Normal equation
5. Feature selection
Practice 1: Feature Processing
} Much processing needs to be done before even coding a machine learning system
} Processing 1: Feature Scaling
} to guarantee features are on a similar scale
} E.g., two features for the house-pricing problem, Size (100–1000 square feet) and Bedrooms (1–5), have different scales
} Scale features to be, e.g., between -1 and 1
} Do the same scaling for both training and testing data
} Processing 2: Mean Normalization
} Replace each feature xⱼ with xⱼ − x̄ⱼ (subtract the feature mean) to make features have approximately zero mean (a small sketch is given below)
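A small sketch of both steps, done by hand and with scikit-learn's StandardScaler; the feature values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[2104.0, 3], [1416.0, 2], [852.0, 1]])   # [size, bedrooms], illustrative

# By hand: mean normalization plus scaling by the standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same idea via sklearn; fit on training data, then reuse the same transform on test data.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(np.allclose(X_manual, X_scaled))   # True
```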
Practice 2: Learning Rate
θⱼ = θⱼ − α ∂J(θ)/∂θⱼ
} How to choose learning rate
Too small: slow convergence; too large: no convergence
[Plot: cost F(θ) vs. number of iterations for different learning rates]
• To choose the learning rate, try values such as 0.001, 0.05, 0.01
Practice 3: Polynomial Regression
} Case: Housing price
f_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃
x₁: size of living area, x₂: size of yard
x = x₁ + x₂
} Good practice: combine redundant features
Practice 3: Polynomial Regression
} Polynomial regression
f_θ(area) = θ₀ + θ₁·area + θ₂·area² + θ₃·area³
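A sketch of fitting such a cubic model with scikit-learn's PolynomialFeatures pipeline; the area/price values are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

area = np.array([[850.0], [1200.0], [1700.0], [2300.0], [3000.0]])   # illustrative sizes
price = np.array([180.0, 245.0, 320.0, 410.0, 490.0])                # illustrative prices (K$)

# degree=3 generates area, area^2, area^3; the linear model then fits theta_0..theta_3.
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
model.fit(area, price)
print(model.predict(np.array([[2000.0]])))
```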
Practice 4: Normal Equation
} Iterative method
} Normal Equation: solve parameters analytically
Practice 4: Normal Equation
} Basic idea:
Set ∂F(θ)/∂θⱼ = 0
Solve for θ₀, θ₁, …
} Example: least-square loss
F(θ) = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)²
Let y = [y₁, y₂, y₃, …]ᵀ ∈ ℝᵐ
X ∈ ℝ^(m×n), where row i is [1, xᵢ₁, xᵢ₂, xᵢ₃, …]
F(θ) = (1/2m) ‖y − Xθ‖²
Practice 4: Normal Equation
F(θ) = (1/2m) ‖y − Xθ‖²
Solve for θ by setting ∂F(θ)/∂θ = 0, which gives
θ = (XᵀX)⁻¹ Xᵀ y
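A NumPy sketch of this closed-form solution on randomly generated, purely illustrative size/price data:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
X = np.hstack([np.ones((m, 1)), rng.uniform(500, 3000, size=(m, 1))])  # columns: [1, size]
y = 50 + 0.15 * X[:, 1] + rng.normal(0, 10, size=m)                     # illustrative prices

# theta = (X^T X)^{-1} X^T y; np.linalg.solve avoids forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # approximately [50, 0.15]
```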
Example 4: Normal equation
Features: Size (feet²), …, Built year, …; target: Price (K$)
X = [[1, 1024, 3, 2, 1978, 1],
     [1, 1329, 3, 2, 1992, 1],
     [1, 1893, 4, 2, 1980, 2]],   y = [375, 425, 465, …]
Practice 4: Normal Equation
} Normal Equation
} No need to choose 𝛼
} No iteration
} Needs to compute a matrix inverse, which may be too expensive when the matrix is large
} Gradient Descent
} Need to choose 𝛼
} Need many iterations
} Fits large-scale problems (e.g., 10⁶ training samples)
Normal Equation
} What if XᵀX is non-invertible?
} Too many features
} Delete some redundant features
} Use regularization
Practice 5: selecting important features
} Basic observation: some features are more important than others
} Direct Approach: subset methods
} For all possible subsets, compute the least-square fit and choose the one that balances training error and model size
} Exploring all subsets is infeasible: 2^p subsets, e.g., p = 40 gives about 10¹² subsets
} Two alternative approaches:
} Forward Selection
} Backward Selection
} Forward selection
} Begin with the null model: only the intercept, no predictors (features)
} Train multiple single-variable models and add to the null model the variable that results in the lowest RSS (one variable)
} Expand the one-variable model to two-variable models and keep the one with the lowest RSS (two variables)
} Continue until some stopping condition is satisfied
} Backward approach
} Start with all variables in the model (let d denote the number of variables)
} Remove one variable from the model to get a (d−1)-variable model; test all (d−1)-variable models and keep the one with the least training error
} Continue the removal process until a stopping rule is reached
(a small forward-selection sketch is given below)
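A small greedy forward-selection sketch, scoring candidate subsets by training RSS; the random dataset and the stopping rule (a fixed number of selected features) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                        # 8 candidate features, illustrative
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)

def rss(features):
    """Training RSS of a linear model restricted to the given feature indices."""
    model = LinearRegression().fit(X[:, features], y)
    resid = y - model.predict(X[:, features])
    return np.sum(resid ** 2)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                                   # stop after 3 features (arbitrary rule)
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
print(selected)                                      # indices 0 and 3 should appear early
```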
LR: practices
1. Feature Processing
2. Learning rate
3. Polynomial Regression
4. Normal equation
5. Feature selection
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)