Lecture 2: Linear Regression
Instructor:
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning or Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning or Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Regression: single feature
} Housing Price
[Scatter plot: Size (feet²) on the x-axis, Price (K$) on the y-axis]
} Supervised Learning: trading history available
} Regression: predict a real-valued price
Training Set (92115): Size (feet²) → Price (K$)
} Notations
x: x₁, x₂, x₃, …, input features (i.e., size)
y: y₁, y₂, y₃, …, output results (i.e., price)
Linear Regression: Prediction Functions
} Goal: learn a function y = f(x) that maps from x to y
o a linear function in linear regression
o e.g., size of house → housing price
} Formally,
y = f_θ(x) = θ₀ + θ₁x
(θ₀, θ₁): parameters to learn (intercept and slope)
y = f_θ(x) = θ₀ + θ₁x
E.g., with (θ₀ = 50, θ₁ = 300), what's the price of a house of 895 square feet?
Illustration of linear Regression Models
} Parameters: (θ₀, θ₁)
y = 0 + 0.177x
y = 200 + 0x
y = 100 + 0.11x
[Plot: the three lines over the Size (feet²) vs. Price (K$) data]
How to learn parameters?
Which one is the best?
[Scatter plot: Size (feet²) vs. Price (K$) with candidate lines]
} Idea: choose parameters θ = (θ₀, θ₁) so that f_θ(x) is close to y for the five training samples (x, y)
Linear Regression: Cost function
} Choose parameters so that f(x) is close to y for all the five training samples (x, y)
Minimize F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
} Residual Sum of Squares (RSS)
} Least Square Loss
} Note that:
} For fixed parameters, y = f_θ(x) is a function of x
} For a fixed set of (x, y), F(θ₀, θ₁) is a function of the parameters
Residual of Cost Function
[Plot: residuals between the data points and the fitted line, Size (feet²) vs. Price (K$)]
} Residuals indicate whether the training procedure has converged
} Residuals can be used to estimate confidence of outputs
How to optimize cost function
} Minimize the function F(θ₀, θ₁) w.r.t. θ₀, θ₁
F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
} Iterative method
i. Start with some initial values θ₀ = 0, θ₁ = 0
ii. Keep changing θ₀, θ₁ to reduce F(θ₀, θ₁)
iii. Stop when certain conditions are satisfied
A Computer-based solution
Iterative method
i. Start with some initial values θ₀ = 0, θ₁ = 0
ii. Randomly generate new values for θ₀, θ₁ and keep them if F(θ₀, θ₁) is the lowest seen so far
iii. Stop when certain conditions are satisfied
} Easy to implement
} Difficult to converge
} Performs poorly on highly complicated loss functions
(a minimal random-search sketch is given below)
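A minimal sketch of this random-search idea; the toy house-size/price arrays and the sampling ranges are illustrative assumptions, not values from the slides:

```python
import numpy as np

# Toy data: house sizes (feet^2) and prices (K$); values are illustrative only.
x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([100.0, 180.0, 260.0, 350.0, 440.0])

def rss(theta0, theta1):
    """Cost F(theta0, theta1) = (1/2m) * sum_i (theta0 + theta1*x_i - y_i)^2."""
    m = len(x)
    return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

rng = np.random.default_rng(0)
best = (0.0, 0.0)                  # step i: start from theta0 = 0, theta1 = 0
best_cost = rss(*best)
for _ in range(10000):             # step ii: propose random parameters, keep the best so far
    cand = (rng.uniform(-500, 500), rng.uniform(-1, 1))
    cost = rss(*cand)
    if cost < best_cost:
        best, best_cost = cand, cost
print(best, best_cost)             # step iii: stop after a fixed iteration budget
```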
A smart solution
Review: Quadratic Functions
• Zero, one, or two real roots.
• One extreme point, called the vertex.
• No inflection points.
• Line symmetry through the vertex (axis of symmetry).
• Rises or falls at both ends.
• Can be constructed from three non-collinear points or three pieces of information.
• One fundamental shape.
• Roots are solvable by radicals (quadratic formula).
Review: Quadratic Functions
Review: gradient at a point
F = func(θ)
Quiz: How to tell whether the gradient at a point 𝜃 is negative or positive?
Iterative methods
} To minimize F(θ₀), we use the iterative method
i. Start with some initial value θ₀ = rand
ii. Keep changing θ₀ to reduce F(θ₀)
iii. Stop when certain conditions are satisfied
[Plots: F(θ₀) vs. θ₀] Question: how to change θ₀?
Iterative methods
θ₀ = θ₀ − C
Left: C should be positive; Right: C should be negative
Question: how to select C?
Solution: gradient based method
To solve θ₀
} Initialize θ₀
} Repeat until convergence
} θ₀ = θ₀ − α ∂F(θ₀)/∂θ₀
To solve θ₁
} Initialize θ₁
} Repeat until convergence
} θ₁ = θ₁ − α ∂F(θ₁)/∂θ₁
Review: First-order Derivative
y = ax + b, y = −ax² + b, y = −log x, y = exp(−ax), y = ax + by + c
Method : Gradient Descent
F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
} Initialize θ₀, θ₁
} Repeat until convergence
} θⱼ = θⱼ − α ∂F/∂θⱼ, j = 0, 1
Gradient Descent for RSS
} Cost Function
F(θ₀, θ₁) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
         = (1/2m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ)²
} What are the derivatives of F() w.r.t. θ₀ and θ₁?
∂F(θ₀, θ₁)/∂θ₀ = (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ)
∂F(θ₀, θ₁)/∂θ₁ = (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
Gradient Descent for RSS
} Repeat until convergence {
θ₀ = θ₀ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ)
θ₁ = θ₁ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
}
} At each iteration, update θ₀ and θ₁ simultaneously (a NumPy sketch is given below)
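A minimal NumPy sketch of these update rules, reusing the illustrative toy data from the random-search example; the learning rate, iteration count, and feature scaling are assumptions, not values from the slides:

```python
import numpy as np

x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])   # sizes (feet^2), illustrative
y = np.array([100.0, 180.0, 260.0, 350.0, 440.0])        # prices (K$), illustrative

# Scale the feature so a single fixed learning rate works (see Practice 1 later).
x_scaled = (x - x.mean()) / x.std()

theta0, theta1 = 0.0, 0.0
alpha, m = 0.1, len(x)
for _ in range(1000):
    err = theta0 + theta1 * x_scaled - y            # (f(x_i) - y_i) for all i
    grad0 = err.mean()                              # (1/m) * sum_i (f(x_i) - y_i)
    grad1 = (err * x_scaled).mean()                 # (1/m) * sum_i (f(x_i) - y_i) * x_i
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update
print(theta0, theta1)
```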
Understanding GD
θ₁ = θ₁ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
where θ₀ + θ₁xᵢ is the prediction, yᵢ the true label, and xᵢ the feature
Understanding GD
} Learning rate α should be set empirically
} Too small: slow convergence
} Too large: fails to converge, or diverges
} Fixed α over time, or adaptive step sizes
Understanding GD
} Batch Gradient Descent
} At each step, access all the training samples
θ₁ = θ₁ − α (1/m) Σᵢ (θ₀ + θ₁xᵢ − yᵢ) xᵢ
How to improve convergence
Access more training samples
Learning rate: smaller or bigger
Better initialization: use the closed-form solution (normal equation)
Variants of GD
Access all training samples at each iteration
• Full Batch
Access a portion of training samples at each iteration
• Mini Batch
Access a single training sample at each iteration
• Online Learning
(a sketch contrasting these access patterns is given below)
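A minimal sketch of the three access patterns, reusing the illustrative toy arrays from the earlier gradient-descent example; the batch size, epoch count, and learning rate are arbitrary assumptions:

```python
import numpy as np

def gradient(theta0, theta1, xb, yb):
    """RSS gradient computed on whichever subset (xb, yb) we are given."""
    err = theta0 + theta1 * xb - yb
    return err.mean(), (err * xb).mean()

def run(x, y, batch_size, alpha=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    theta0 = theta1 = 0.0
    m = len(x)
    for _ in range(epochs):
        idx = rng.permutation(m)                       # shuffle once per epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            g0, g1 = gradient(theta0, theta1, x[b], y[b])
            theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    return theta0, theta1

x = np.array([500.0, 1000.0, 1500.0, 2000.0, 2500.0])
x = (x - x.mean()) / x.std()                            # scaled, as before
y = np.array([100.0, 180.0, 260.0, 350.0, 440.0])

print(run(x, y, batch_size=len(x)))   # full batch
print(run(x, y, batch_size=2))        # mini batch
print(run(x, y, batch_size=1))        # a single sample at a time (online style)
```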
} Prediction Function: Linear Function
} Cost Function: Residual Sum of Squares
} Optimization: Gradient Descent Method
Testing: Measurement Error
} Once a regression model is trained, apply the prediction function to each testing sample and compare its prediction ŷᵢ to the true label yᵢ
} Both labels are real-valued
} With multiple testing samples, report both mean and std
} Popular metric: coefficient of determination,
R² or R-squared
} A measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model
} L2 error: (yᵢ − ŷᵢ)²
} L1 error: |yᵢ − ŷᵢ|
Regression Measure: R²
} Let yᵢ denote the true label of sample i, ȳ the mean of the labels in the testing dataset, and ŷᵢ the predicted label of sample i.
} We have
R² = 1 − E_R / E_T
where
E_R = Σᵢ (yᵢ − ŷᵢ)²
E_T = Σᵢ (yᵢ − ȳ)²
} R² = 1 indicates that the predicted labels exactly match the true labels
} R² = 0 indicates that the model does no better than always predicting ȳ
} R² = 0.49 indicates that 49% of the variability of the dependent variable (true labels) is accounted for by the model
(a short computation sketch is given below)
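A small sketch of this metric, computed both by hand and with scikit-learn's r2_score; the label values are made up for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([250.0, 300.0, 410.0, 520.0])   # illustrative true prices
y_pred = np.array([265.0, 290.0, 400.0, 500.0])   # illustrative predictions

e_r = np.sum((y_true - y_pred) ** 2)              # residual sum of squares
e_t = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
print(1 - e_r / e_t)                              # manual R^2
print(r2_score(y_true, y_pred))                   # same value from sklearn
```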
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Regression: multiple feature
} Example: Housing Price, with features such as Size (feet²), Built year, …, and target Price (K$)
Single feature: y = f_θ(x) = θ₀ + θ₁x
Multiple features: y = f_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + …
Let θ = [θ₀, θ₁, θ₂, θ₃, …], x = [1, x₁, x₂, x₃, …]
then we have
y = f_θ(x) = θᵀx
Let xᵢ = [1, xᵢ₁, xᵢ₂, xᵢ₃, …] represent the i-th training sample and yᵢ its label.
Gradient Descent for Multiple Features
} Cost Function
F(θ) = (1/2m) Σᵢ (yᵢ − f_θ(xᵢ))²
     = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)²
} Gradient Descent
Repeat {
Gradient Descent for Multiple Features
} Gradient Descent
θⱼ = θⱼ − α (1/m) Σᵢ (θᵀxᵢ − yᵢ) xᵢⱼ
} Update θ₀, θ₁, θ₂, … simultaneously (a vectorized sketch is given below)
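A vectorized NumPy sketch of this multi-feature update; the random design matrix and true coefficients are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])   # prepend x0 = 1 for the intercept
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=m)                # synthetic targets

theta = np.zeros(d + 1)
alpha = 0.1
for _ in range(2000):
    err = X @ theta - y                 # predictions minus labels, shape (m,)
    grad = X.T @ err / m                # (1/m) * sum_i (theta^T x_i - y_i) x_i, for every j at once
    theta -= alpha * grad               # simultaneous update of every theta_j
print(theta)                            # should approach true_theta
```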
Dealing with Qualitative Features
} Some predictors/features are not quantitative but qualitative, taking a discrete set of values
} Also called categorical predictors or factor variables
} E.g., house type, short sale, gender, student status
Consider a feature, house type:
x₁ = 1 if single family, 0 otherwise
Resulting model is:
y = β₀ + β₁x₁ = β₀ + β₁ if single family, β₀ otherwise
} Additional dummy variables:
y = β₀ + β₁x₁ + β₂x₂ = β₀ + β₁ if single family, β₀ + β₂ otherwise
(a small encoding sketch is given below)
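A brief sketch of turning such a categorical feature into dummy variables; the 'house_type' column and its categories are made up for illustration, and pandas' get_dummies is one common way to do the encoding:

```python
import pandas as pd

df = pd.DataFrame({
    "size": [1024, 1329, 1893],
    "house_type": ["single_family", "condo", "townhouse"],   # illustrative categories
    "price": [375, 425, 465],
})

# One dummy column per category; drop_first avoids a redundant column
# (the dropped category becomes the baseline absorbed by the intercept).
encoded = pd.get_dummies(df, columns=["house_type"], drop_first=True)
print(encoded)
```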
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Case Study: Linear Regression
} Project 1: Synthetic Dataset
} Project 2: Housing Dataset
Project 1: Synthetic Dataset
} A single input feature and a continuous target variable (or output)
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Load dataset
} Generate a set of data points (x, y), where x is the input and y is the output
Built-in functions: rnd.uniform()
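A possible sketch of such a generator, assuming a linear relation plus Gaussian noise; the coefficients, ranges, and noise level are arbitrary illustrative choices:

```python
import numpy as np

rnd = np.random.default_rng(42)
n = 60
x = rnd.uniform(-3, 3, size=n)                    # single input feature
y = 0.5 * x + 1.0 + rnd.normal(0, 0.3, size=n)    # linear target with Gaussian noise
```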
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Visualization
} Plotting all the data points in a 2-D figure
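A matplotlib sketch of this plot, reusing the x and y arrays generated in the previous step:

```python
import matplotlib.pyplot as plt

plt.scatter(x, y, marker="o")    # x, y from the data-generation step above
plt.xlabel("Feature")
plt.ylabel("Target")
plt.show()
```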
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Data Splitting
} Partition the dataset into two subsets: used for training and testing, respectively.
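A minimal splitting sketch with scikit-learn's train_test_split; the 75/25 split ratio is an assumption for illustration:

```python
from sklearn.model_selection import train_test_split

X = x.reshape(-1, 1)   # sklearn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```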
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
} Utilize the LinearRegression class from sklearn.linear_model
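A minimal training sketch with this class, continuing from the split above:

```python
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)            # learns the intercept (theta_0) and slope (theta_1)
print(lr.intercept_, lr.coef_)
```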
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Predictions
} Predictions over individual data points
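Continuing the sketch, predictions over the held-out points (y_pred is reused in the evaluation step below):

```python
y_pred = lr.predict(X_test)   # predictions for every test sample
print(y_pred[:5])             # predictions for a few individual data points
```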
Project 1: outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Evaluations
} Make a prediction on every testing data point, compare it to its ground-truth value, and calculate the R²:
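A sketch of the R² evaluation, assuming the fitted model lr, the split, and y_pred from the previous steps:

```python
from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))     # R^2 from the predictions
print(lr.score(X_test, y_test))     # equivalent value via the estimator's score method
```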
Project 1: summary
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Project 2: Housing Dataset
} Boston Housing dataset
} To predict the median value of homes in several Boston neighborhoods
} Feature predictors: crime rate, proximity to the , highway accessibility, etc.
} 506 data points, 13 features
Project 2: Outline
} Load dataset
} Visualization
} Data splitting
} Training
} Predictions
} Evaluations
Load and split dataset
} Sklearn provides a built-in access to this dataset
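The slides rely on scikit-learn's built-in loader; note that load_boston was deprecated and removed in recent scikit-learn releases, so a sketch that instead pulls the same 506-sample, 13-feature dataset from OpenML (an assumed alternative, dataset name "boston") might look like this, also covering the training and R² evaluation mentioned on the next slide:

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# load_boston() existed in older scikit-learn versions but was removed in 1.2+;
# OpenML hosts a copy of the same data, fetched here as plain numeric arrays.
boston = fetch_openml(name="boston", version=1, as_frame=False)
X, y = boston.data, boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))     # R^2 on the held-out split
```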
Training and evaluations
} Measurement: R²
} Project 1: Synthetic Dataset
} Project 2: Housing Dataset
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)
Regression: Overfitting
Regularization
} A popular approach to reduce overfitting is to regularize the values of the coefficients θ = [θ₀, θ₁, θ₂, …]ᵀ (a column vector), which works well in practice.
min_θ F(θ) + λ G(θ)
where F(θ) is the cost function, e.g., the least-square loss for linear regression, G(θ) is an extra term over the parameters, and λ is a constant
} Ridge regularization: G(θ) = G_ridge(θ) = θᵀθ = Σⱼ θⱼ²
} Lasso regularization: G(θ) = G_lasso(θ) = ‖θ‖₁ = Σⱼ |θⱼ|
} Elastic Net regularization: G(θ) = G_ridge(θ) + G_lasso(θ)
Regularized Regression: Learning
} Cost Function
F(θ) = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)² + λ θᵀθ
} Gradient Descent
} Update θ₀, θ₁, θ₂, … simultaneously
θⱼ = θⱼ − α [ (1/m) Σᵢ (θᵀxᵢ − yᵢ) xᵢⱼ + 2λθⱼ ]
Outline of the rest
} Case study: Regularization
} Boston Housing dataset
} To predict the median value of homes in several Boston neighborhoods
} Feature predictors: crime rate, proximity to the , highway accessibility, etc.
} 506 data points, 13 features
Case A: Linear Regression
Import libraries
Load dataset
Case B: Ridge Regression (L2 regularization)
Previous Results (without regularization)
Ridge Regression
( worse training score, better testing score)
Case B: Ridge Regression (L2 regularization)
Training with different regularization weights (alpha below)
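A sketch of such a sweep with scikit-learn's Ridge, reusing X_train/X_test/y_train/y_test from the Boston sketch above; the alpha grid is an illustrative choice:

```python
from sklearn.linear_model import Ridge

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha,
          round(ridge.score(X_train, y_train), 3),   # training R^2
          round(ridge.score(X_test, y_test), 3))     # testing R^2
```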
Source codes available
} Lecture02-LinearRegression.ipynb
} Ordinary linear regression
} Ridge Regression (L2)
} Lasso (L1)
} Elastic Net (L1+L2)
LASSO: L1 Regularization
} Using the L1 norm to regularize model parameters:
F(θ) = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)² + λ Σⱼ |θⱼ|
} Still use Gradient Descent during training
} Sub-gradients (introduced in later sections)
} Encourage sparsity over the learned coefficients
} Codes available in Lecture02-LinearRegression.ipynb as well; a brief scikit-learn sketch is given below
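A brief scikit-learn sketch illustrating the sparsity effect, again reusing the Boston train/test split from above; the alpha value and iteration cap are arbitrary assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0, max_iter=10000).fit(X_train, y_train)
print(lasso.score(X_test, y_test))                    # R^2 on the test split
print(np.sum(lasso.coef_ != 0), "of", lasso.coef_.size, "coefficients are non-zero")
```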
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Regularized Regression (Ridge, Lasso and Elastic Net)
LR: practices
1. Feature Processing
2. Learning rate
3. Polynomial Regression
4. Normal equation
5. Feature selection
Practice 1: Feature Processing
} Much processing needs to be done before even coding a machine learning system
} Processing 1: Feature Scaling
} to guarantee features are on a similar scale
} E.g., two features for the house-pricing problem, Size (100–1000 square feet) and Bedrooms (1–5), have different scales
} Scale features to be, e.g., between -1 and 1
} Do the same scaling for both training and testing data
} Processing 2: Mean Normalization
} Replace each feature xⱼ with xⱼ − x̄ⱼ (subtract the feature mean) to make features have approximately zero mean (a small sketch is given below)
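A small sketch of both steps, done by hand and with scikit-learn's StandardScaler; the feature values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[2104.0, 3], [1416.0, 2], [852.0, 1]])   # [size, bedrooms], illustrative

# By hand: mean normalization plus scaling by the standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same idea via sklearn; fit on training data, then reuse the same transform on test data.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(np.allclose(X_manual, X_scaled))   # True
```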
Practice 2: Learning Rate
θⱼ = θⱼ − α ∂J(θ)/∂θⱼ
} How to choose learning rate
Too small: slow convergence; too large: no convergence
[Plot: cost F(θ) vs. number of iterations for different learning rates]
• To choose the learning rate, try values such as 0.001, 0.05, 0.01
Practice 3: Polynomial Regression
} Case: Housing price
f_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃
x₁: size of living area, x₂: size of yard
x = x₁ + x₂
} Good practice: combine redundant features
Practice 3: Polynomial Regression
} Polynomial regression
f_θ(area) = θ₀ + θ₁·area + θ₂·area² + θ₃·area³
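A sketch of fitting such a cubic model with scikit-learn's PolynomialFeatures pipeline; the area/price values are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

area = np.array([[850.0], [1200.0], [1700.0], [2300.0], [3000.0]])   # illustrative sizes
price = np.array([180.0, 245.0, 320.0, 410.0, 490.0])                # illustrative prices (K$)

# degree=3 generates area, area^2, area^3; the linear model then fits theta_0..theta_3.
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
model.fit(area, price)
print(model.predict(np.array([[2000.0]])))
```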
Practice 4: Normal Equation
} Iterative method
} Normal Equation: solve parameters analytically
Practice 4: Normal Equation
} Basic idea:
Set ∂F(θ)/∂θⱼ = 0
Solve for θ₀, θ₁, …
} Example: least-square loss
F(θ) = (1/2m) Σᵢ (yᵢ − θᵀxᵢ)²
Let y = [y₁, y₂, y₃, …]ᵀ ∈ ℝᵐ
X ∈ ℝ^(m×n), where row i is [1, xᵢ₁, xᵢ₂, xᵢ₃, …]
F(θ) = (1/2m) ‖y − Xθ‖²
Practice 4: Normal Equation
F(θ) = (1/2m) ‖y − Xθ‖²
Solve for θ by setting ∂F(θ)/∂θ = 0, which gives
θ = (XᵀX)⁻¹ Xᵀ y
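A NumPy sketch of this closed-form solution on randomly generated, purely illustrative size/price data:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
X = np.hstack([np.ones((m, 1)), rng.uniform(500, 3000, size=(m, 1))])  # columns: [1, size]
y = 50 + 0.15 * X[:, 1] + rng.normal(0, 10, size=m)                     # illustrative prices

# theta = (X^T X)^{-1} X^T y; np.linalg.solve avoids forming the inverse explicitly.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # approximately [50, 0.15]
```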
Example 4: Normal equation
Features: Size (feet²), …, Built year, …; target: Price (K$)
X = [[1, 1024, 3, 2, 1978, 1],
     [1, 1329, 3, 2, 1992, 1],
     [1, 1893, 4, 2, 1980, 2]],   y = [375, 425, 465, …]
Practice 4: Normal Equation
} Normal Equation
} No need to choose 𝛼
} No iteration
} Needs to compute a matrix inverse, which may be too expensive when the matrix is large
} Gradient Descent
} Need to choose 𝛼
} Need many iterations
} Fits large-scale problems (e.g., 10⁶ training samples)
Normal Equation
} What if XᵀX is non-invertible?
} Too many features
} Delete some redundant features
} Use regularization
Practice 5: selecting important features
} Basic observation: some features are more important than others
} Direct Approach: subset methods
} For all possible subsets, compute the least-square fit and choose the one that balances training error and model size
} Exploring all subsets is infeasible: 2^p subsets, e.g., p = 40 gives about 10¹² subsets
} Two alternative approaches:
} Forward Selection
} Backward Selection
} Forward selection
} Begin with the null model: only the intercept, no predictors (features)
} Train multiple single-variable models and add to the null model the variable that results in the lowest RSS (one variable)
} Expand the one-variable model to two-variable models and keep the one with the lowest RSS (two variables)
} Continue until some stopping condition is satisfied
} Backward approach
} Start with all variables in the model (let d denote the number of variables)
} Remove one variable from the model to get a (d−1)-variable model; test all (d−1)-variable models and keep the one with the least training error
} Continue the removal process until a stopping rule is reached
(a small forward-selection sketch is given below)
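A small greedy forward-selection sketch, scoring candidate subsets by training RSS; the random dataset and the stopping rule (a fixed number of selected features) are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                        # 8 candidate features, illustrative
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)

def rss(features):
    """Training RSS of a linear model restricted to the given feature indices."""
    model = LinearRegression().fit(X[:, features], y)
    resid = y - model.predict(X[:, features])
    return np.sum(resid ** 2)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                                   # stop after 3 features (arbitrary rule)
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
print(selected)                                      # indices 0 and 3 should appear early
```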
LR: practices
1. Feature Processing
2. Learning rate
3. Polynomial Regression
4. Normal equation
5. Feature selection
Outline of This Lecture
} Linear Regression (With One Feature)
  } Prediction Function
  } Cost Function
  } Optimization (Learning/Training)
} Linear Regression With Multiple Features
} Case Study
} Regularized Regression (Ridge, Lasso and Elastic Net)