Regression
COSC 2673-2793 | Semester 1 2021 (Computational) Machine Learning
Agenda
Regression:
• Univariate Linear regression
• Multivariate Linear regression
• Polynomial Regression
• Practical Issues
By the end of the lecture you will:
• Understand the main concepts of regression including: hypothesis spaces, cost functions and optimization.
Quick Recap of Last Week
The Task is an unknown target function: 𝐲 = 𝑓(𝐱)
Attributes of the task: 𝐱 Unknown function: 𝑓(𝐱) Output of the function: 𝐲
ML finds a Hypothesis, h ∈ 𝐻, which is a function (or model) which approximates the unknown target function
h∗(𝐱) ≈ 𝑓(𝐱)
The hypothesis is often called a model
Quick Recap of Last Week
In supervised learning, the output is known: 𝑦 = 𝑓(𝐱)
Experience: Examples of input-output pairs
Task: Learns a model that maps input to desired output
Predict the output for new “unseen” inputs.
Performance: Error measure of how closely the hypothesis predicts the target output
Most typical of learning tasks
Two main types of supervised learning:
Ø Classification
Ø Regression
Quick Recap of Last Week
The Experience is typically a data set, D, of values:
D = { ( x^(i), f(x^(i)) ) }
Attributes of the task: x^(i)
Output of the unknown function: f(x^(i))
The experience is often called the training data, or training examples.
Revision: Performance
The Performance is typically a numerical measure that determines how well the hypothesis matches the experience.
Ø Note, the performance is measured against the experience
Ø NOT the unknown target function!
Regression
Definition and examples
Regression
A form of supervised learning
y = f(x)
y ∈ ℝ
• The output is continuous.
Examples:
• Stock price prediction – predict Google stock prices from yesterday's Google, Amazon, Microsoft, etc. prices.
• Predict distance for each pixel of a 2D image.
• Predicting hospital length-of-stay at time of admission.
WeiQin Chuah, Ruwan Tennakoon, Reza Hoseinnezhad, Alireza Bab-Hadiashar and David Suter. “Adjusting Bias in Long Range Stereo Matching: A semantics guided approach”.
Example: Power Consumption
[Figure: a Regressor (ML algorithm) with tuneable parameters. Training: input temperatures (13.39 C, 34.19 C, 24.81 C) paired with their observed energy consumption (5.46 kWh, 7.37 kWh, 4.02 kWh). Testing & prediction: for a new input (18.1 C) the trained regressor outputs a predicted consumption ŷ (??? kWh).]
From Last week:
Nearly all ML algorithms can be described with a fairly simple recipe:
Ø Dataset (experience)
Ø Model (hypothesis space)
Ø Cost function (objective, loss)
Ø Optimization procedure
Let's gradually figure out what each element in the above recipe represents for regression.
Experience
The experience is a data set (instances) of inputs and outputs:
D = { ( x^(i), f(x^(i)) ) }
Attributes of the task: x
• Real-valued, continuous
• Independent variables
Outputs: y
• Real-valued, continuous
• Dependent variable
Data Representation
D = { ( x^(i), f(x^(i)) ) }
Instance | Temperature (C) | Energy Consumption (kWh)
1 | 13.39 | 5.46
2 | 34.19 | 7.37
3 | 24.81 | 4.02
Univariate
Instance | Temp (C) | Humidity | Pressure (hPa) | Consumption (kWh)
1 | 13.39 | 0.81 | 1231 | 5.46
2 | 34.19 | 0.23 | 1638 | 7.37
3 | 24.81 | 0.45 | 1348 | 4.02
Multivariate
Hypothesis space: Linear Regression
The simplest type of hypothesis space (from many possible hypothesis spaces).
Linear Regression
In linear regression the Hypothesis is a linear** equation:
h_θ(x): y = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₘxₘ
Ø Features/attributes: x = (x₁, x₂, …, xₘ)
Ø Weights: θ = (θ₀, θ₁, …, θₘ)
Ø Hypothesis h_θ, with respect to weights θ
**No higher-order terms in the hypothesis (e.g. x₁², x₁·x₂)
“Regression is the ML problem class and is related to the task.
However, representing the hypothesis as a linear function (of attributes) is a design choice made by the ML engineer.”
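As an aside (not from the slides), a minimal Python/NumPy sketch of evaluating such a linear hypothesis; the weight and feature values below are made up purely for illustration:

import numpy as np

theta = np.array([2.0, 0.5, -1.2])   # made-up weights: theta_0, theta_1, theta_2
x = np.array([13.39, 0.81])          # one example with features x_1, x_2

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_m*x_m."""
    return theta[0] + np.dot(theta[1:], x)

print(h(theta, x))   # the predicted (continuous) output y for this example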
Simple Univariate Linear Regression
To discuss, consider univariate linear regression:
h_θ(x): y = θ₀ + θ₁x₁
"intercept": θ₀
"gradient" (slope): θ₁
This is a parametric model. The weights (or parameters) completely define the hypothesis (or model)
Instance | Temperature (C) | Energy Consumption (kWh)
1 | 13.39 | 5.46
2 | 34.19 | 7.37
3 | 24.81 | 4.02
Understanding h(x) for linear regression
Let's first understand h(x) and the effect of its parameters
h_θ(x): y = θ₀ + θ₁x₁
[Figure: plots of h_θ(x) for several settings of θ₀ and θ₁, showing that θ₀ shifts the line up or down and θ₁ changes its slope.]
Revision Questions
q What is the purpose of including θ₀?
q For the univariate case, the hypothesis can be visualized using a line. What geometric shape does the hypothesis take if we have two attributes?
q A particular problem has 5 input attributes. How many weights does the corresponding linear regression hypothesis have?
Performance Measure:
Linear Regression
Quantifying the “goodness” of hypothesis
Linear Regression Goal
Learning goal is to find the “best” Regression Line.
Ø Need a measure of performance
• That is, the "line of best fit"
Ø Minimise the sum (total) of the distance between the hypothesis and the training examples
h_θ(x) = θ₀ + θ₁x₁
Which hypothesis would you pick? Green or red? Why?
Loss Function
The measure of performance (Loss Function):
J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
Find a hypothesis that minimises the sum of squared differences between the:
• Predicted output: h_θ(x^(i)), and
• Actual output: y^(i)
• For each training example, {x^(i), y^(i)}
[Figure: the training examples with a candidate regression line h_θ(x) = θ₀ + θ₁x₁.]
Loss Function
The Loss Function is the measure of performance
J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
Training Goal: Find θ₀, θ₁ that minimise J(θ₀, θ₁):
θ₀*, θ₁* = min_{θ₀, θ₁} J(θ₀, θ₁)
[Figure: the training data D = {(x, y)} and the fitted line h_θ(x) = θ₀ + θ₁x₁.]
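As a concrete sketch of this loss (using the three temperature/consumption examples from the earlier table; NumPy is assumed, and the chosen weights are arbitrary):

import numpy as np

x1 = np.array([13.39, 34.19, 24.81])   # temperature (C)
y  = np.array([5.46, 7.37, 4.02])      # energy consumption (kWh)

def J(theta0, theta1):
    """Loss J(theta0, theta1): mean of squared differences over the n training examples."""
    predictions = theta0 + theta1 * x1          # h_theta(x^(i)) for every example
    return np.mean((predictions - y) ** 2)      # (1/n) * sum of squared differences

print(J(0.0, 0.2))   # the loss for one arbitrary choice of theta_0, theta_1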
Aside: Notation
Summation: ∑ᵢ₌₁ⁿ
Difference: h_θ(x₁^(i)) − y^(i)
Squared difference: ( h_θ(x₁^(i)) − y^(i) )²
Sum of squared differences: ∑ᵢ₌₁ⁿ ( h_θ(x₁^(i)) − y^(i) )²
(Simplified) Loss Function Intuition
What exactly does the goal mean? How does it relate to finding the best hypothesis?
First consider a simplified version of the loss function, forcing θ₀ = 0.
Thus, the hypothesis is effectively: h_θ(x) = θ₁x₁
Minimise: J(θ₁)
• That is, J(θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( θ₁x₁^(i) − y^(i) )²
Goal: min_{θ₁} J(θ₁)
(Simplified) Loss Function Intuition
h_θ(x) = θ₁x₁    J(θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( θ₁x₁^(i) − y^(i) )²
[Figure: left, the data with candidate lines h_θ(x) = θ₁x₁ for different values of θ₁; right, the resulting loss curve J(θ₁).]
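One way to see this curve numerically is to sweep candidate values of θ₁ and evaluate the simplified loss at each; a small illustrative sketch (NumPy assumed, and the grid of values is arbitrary):

import numpy as np

x1 = np.array([13.39, 34.19, 24.81])
y  = np.array([5.46, 7.37, 4.02])

thetas = np.linspace(0.0, 0.5, 51)                        # candidate theta_1 values
losses = [np.mean((t * x1 - y) ** 2) for t in thetas]     # J(theta_1) with theta_0 = 0

print(thetas[int(np.argmin(losses))])   # theta_1 with the smallest loss on this grid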
Complete Loss Function Intuition
Ø Hypothesis: h_θ(x) = θ₀ + θ₁x₁
Ø Loss function: J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
Ø Parameters: θ₀, θ₁
Ø Goal: min_{θ₀, θ₁} J(θ₀, θ₁)
Complete Loss Function Intuition
h_θ(x) = θ₀ + θ₁x₁    J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
[Figure: the loss J(θ₀, θ₁) plotted as a surface over the (θ₀, θ₁) plane.]
Complete Loss Function Intuition
h_θ(x) = θ₀ + θ₁x₁    J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
[Figure: the same loss viewed over the (θ₀, θ₁) plane.]
Optimization Procedure:
Gradient Descent
Finding the optimal hypothesis automatically & efficiently
Minimising the Loss Function
Reconsider the simplified problem
How can we search for the parameter(s), that minimize the loss function?
Ø We can do random search: Pick random parameter values and evaluate loss function. Choose the parameter with the minimum cost as the optimal.
Ø Not a good idea: we don't know where to search, or how many evaluations to do.
Minimising the Loss Function
Reconsider the simplified problem
[Figure: the loss curve J(θ₁) with a red X marking a starting point and a red rectangle marking a small visible region around it.]
Question 1: If I started at the red X point on the above loss function, should I increase or decrease θ₁ to find the minimum (where is the minimum)?
Question 2: If you can only see the region marked by the red rectangle, what would you do?
Using Gradients
We can use the information in the “slope” (gradient) to do the search cleverly.
h_θ(x) = θ₀ + θ₁x₁    J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
[Figure: left, the data; right, a contour map of the loss J(θ₀, θ₁) over the (θ₀, θ₁) plane.]
Using Gradients
Recall we want to minimise the loss function: min_{θ₀, θ₁} J(θ₀, θ₁)
Gradient Descent approach:
Start with some θ₀, θ₁
• Could be random
• Could be based on some heuristics or rules, or another ML approach
Update θ₀, θ₁ such that it reduces J(θ₀, θ₁)
• Use gradients, that is the derivative of J(θ₀, θ₁)
Repeat until the minimum is found
• Hopefully the global minimum
Differentiating the Loss Function
Require the derivative of the loss function: J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
Derivative with respect to which variable(s)?
• For minimising the loss function, this is the weights
• That is, θ₀, θ₁
As the attributes are independent, so are the weights
• Can take partial derivatives!
Partial Derivatives (analytical)
Partial derivatives of the loss function: J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²

∂J(θ₀, θ₁)/∂θ₀ = (1/n) ∑ᵢ₌₁ⁿ ∂/∂θ₀ ( h_θ(x^(i)) − y^(i) )²
∂J(θ₀, θ₁)/∂θ₀ = (1/n) ∑ᵢ₌₁ⁿ 2 · ( h_θ(x^(i)) − y^(i) ) · ∂/∂θ₀ ( h_θ(x^(i)) − y^(i) )
∂J(θ₀, θ₁)/∂θ₀ = (1/n) ∑ᵢ₌₁ⁿ 2 · ( h_θ(x^(i)) − y^(i) ) · ∂/∂θ₀ ( θ₀ + θ₁x₁^(i) − y^(i) )
∂J(θ₀, θ₁)/∂θ₀ = (1/n) ∑ᵢ₌₁ⁿ 2 · ( h_θ(x^(i)) − y^(i) ) · 1

How about ∂J(θ₀, θ₁)/∂θ₁?
Partial Derivatives (Numerical)
Partial derivatives of the loss function: J(θ₀, θ₁) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
∂J(θ₀, θ₁)/∂θ₀ ≈ [ J(θ₀ + δ, θ₁) − J(θ₀ − δ, θ₁) ] / (2δ)
δ is a small offset.
**Most ML packages use numerical derivatives.
How about ∂J(θ₀, θ₁)/∂θ₁?
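A minimal sketch of the central-difference approximation above (the loss passed in is any Python function of θ₀ and θ₁; the toy loss at the bottom is made up so the answer is easy to check):

def numerical_partials(J, theta0, theta1, delta=1e-5):
    """Central-difference approximations of dJ/dtheta0 and dJ/dtheta1."""
    d_theta0 = (J(theta0 + delta, theta1) - J(theta0 - delta, theta1)) / (2 * delta)
    d_theta1 = (J(theta0, theta1 + delta) - J(theta0, theta1 - delta)) / (2 * delta)
    return d_theta0, d_theta1

# Check on a toy loss with known gradients: J = theta0**2 + 3*theta1**2
print(numerical_partials(lambda t0, t1: t0**2 + 3*t1**2, 1.0, 2.0))   # approx (2.0, 12.0)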
Gradient Descent – Univariate Linear
1. Initialize θ₀ and θ₁ (can be randomly selected)
2. Until converged:
   1. Compute gradients: ∇_θ J(θ₀, θ₁) = [ ∂J(θ₀, θ₁)/∂θ₀ , ∂J(θ₀, θ₁)/∂θ₁ ]
   2. Update parameters (weights):
      • θ₀ ← θ₀ − α · ∂J(θ₀, θ₁)/∂θ₀
      • θ₁ ← θ₁ − α · ∂J(θ₀, θ₁)/∂θ₁
Step size (learning rate): α
How can we determine when the algorithm has converged?
Ø When the change in the loss function due to the update is small.
Ø Pre-determined number of iterations.
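A minimal sketch of this loop for the univariate case, using the analytical gradients derived earlier and the small temperature/consumption data from the tables (the learning rate and iteration count are illustrative choices, not prescribed values):

import numpy as np

x1 = np.array([13.39, 34.19, 24.81])
y  = np.array([5.46, 7.37, 4.02])
n  = len(y)

theta0, theta1 = 0.0, 0.0     # 1. initialise the weights
alpha = 0.001                 # step size (learning rate); small because x1 is unscaled

for _ in range(50000):        # 2. stop after a pre-determined number of iterations
    error = (theta0 + theta1 * x1) - y          # h_theta(x^(i)) - y^(i)
    grad0 = (2.0 / n) * np.sum(error)           # dJ/dtheta0
    grad1 = (2.0 / n) * np.sum(error * x1)      # dJ/dtheta1
    theta0 -= alpha * grad0                     # simultaneous update of both weights
    theta1 -= alpha * grad1

print(theta0, theta1)   # roughly theta0 ≈ 3.6, theta1 ≈ 0.08 for this tiny data set

Note how small the step size has to be here because the temperature feature is unscaled; this connects to the feature-scaling discussion later in the lecture.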
Selecting Step Size
This is what we call a hyperparameter: a parameter that is part of the learning algorithm (not the model or hypothesis).
Ø If step size is set too low, it will take a long time to get to the optimal.
Ø If the step size is set too large, it will overshoot the optimal and oscillate
without converging.
How to set? Trial and error, prior experience or guidelines. There are also adaptive methods that can determine the step size value automatically.
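To make the effect concrete, here is a tiny self-contained sketch (not from the slides) on the toy loss J(θ) = θ², whose gradient is 2θ, with one step size that converges and one that overshoots:

def run(alpha, steps=20, theta=5.0):
    """Gradient descent on J(theta) = theta**2 with a fixed step size alpha."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # gradient of theta**2 is 2*theta
    return theta

print(run(alpha=0.1))   # moves steadily towards the minimum at theta = 0
print(run(alpha=1.1))   # overshoots each step and oscillates with growing magnitude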
Global vs Local Minima
Even with an appropriate learning rate we may not converge to the best optimum:
Ø In general gradient descent will find a local optimum.
Ø For linear regression we WILL find an optimal solution with gradient descent. This is because the linear regression loss function is convex and has only one minimum (the global minimum)
Multi-Variate Regression
Multiple Features (variables)
Instance | Temp (Deg C) | No. of people | Size of house | Area of windows | Energy Consumption (kWh)
1 | 15 | 2 | 70 | 40 | 3400
2 | 8 | 3 | 160 | 78 | 5000
3 | 26 | 4 | 300 | 50 | 8000
… | … | … | … | … | …
n | 35 | 1 | 250 | 100 | 7000
Multi-Variate Hypothesis
Single-variate hypothesis: h_θ(x) = θ₀ + θ₁x₁
Multi-variate hypothesis: h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₘxₘ, for m attributes
(n is typically used for the size of the data set)
Gradient Descent
Hypothesis: h_θ(x) = θᵀx = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₘxₘ (taking x₀ = 1)
Parameters: θ₀, θ₁, …, θₘ
Loss function: J(θ) = (1/n) ∑ᵢ₌₁ⁿ ( h_θ(x^(i)) − y^(i) )²
Gradient descent algorithm update:
Repeat {
    θⱼ = θⱼ − α · ∂J(θ)/∂θⱼ
}
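A vectorised sketch of this update (NumPy assumed; the tiny data set below is made up so that the true weights are known, and x₀ = 1 is handled by prepending a column of ones):

import numpy as np

# Made-up data generated from y = 1 + 2*x1 + 1*x2, so the expected weights are [1, 2, 1]
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = np.array([5.0, 6.0, 10.0, 14.0])

Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 for the intercept term
theta = np.zeros(Xb.shape[1])                   # [theta_0, theta_1, theta_2]
alpha, n = 0.02, len(y)

for _ in range(5000):
    error = Xb @ theta - y                  # h_theta(x^(i)) - y^(i) for all examples at once
    grad = (2.0 / n) * (Xb.T @ error)       # the whole vector of partial derivatives dJ/dtheta_j
    theta -= alpha * grad                   # update every theta_j simultaneously

print(theta)   # approaches roughly [1, 2, 1] for this made-up data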
Issues
Issues: Hypothesis
We can ask questions about the nature of the hypothesis: Is the regression model (choice of weights) good?
• That is, is the hypothesis h(x) a good choice from the Hypothesis Space H?
Is Linear Regression suitable?
• That is, does the hypothesis actually approximate the unknown target function?
h(𝐱) ≈ 𝑓(𝐱)
Issues: Prediction
We can also ask questions about what the hypothesis predicts: Is a prediction for an unseen example any good?
• That is, how close are h(x_unseen) and f(x_unseen)?
[Figure: two panels showing the fitted hypothesis on "Train data" and on "Train data + unseen data".]
Issues: Practical Matters
As with many ML methods, Linear Regression “works in theory”
However, in practice many issues arise:
• Limited computation power
• Limited time
• Properties (eccentricities) of the data set
For regression, feature scaling and outliers are such issues
Practical Issue: Feature Scaling
What happens if two features (attributes) have very different domains?
x₁ = temperature (0 to 40 degrees Celsius)
x₂ = number of people (1 to 12 people)
[Figure: contours of the loss over the weights, elongated and skewed when the features have very different scales.]
Practical Issue: Feature Scaling
Regression works, but is slower.
Prefer to scale all features to the same proportion.
Typically use normalisation.
[Figure: after scaling, the loss contours are more circular.]
Practical Issue: Mean Normalisation
Replace each feature xᵢ with xᵢ − μᵢ to make the features have approximately zero mean
Do not apply to x₀ = 1
E.g.,
x₁ = (temperature − 20) / 40,  −0.5 ≤ x₁ ≤ 0.5
x₂ = (no. of people − 6) / 12,  −0.5 ≤ x₂ ≤ 0.5
Further adjust by the standard deviation: xᵢ = (xᵢ − μᵢ) / σᵢ
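A small NumPy sketch of this normalisation, applied column-wise to a made-up feature matrix (the mean and standard deviation must be remembered so that new, unseen inputs can be scaled the same way):

import numpy as np

X = np.array([[15.0, 2.0], [8.0, 3.0], [26.0, 4.0], [35.0, 1.0]])   # e.g. temperature, no. of people

mu = X.mean(axis=0)          # per-feature mean mu_i
sigma = X.std(axis=0)        # per-feature standard deviation sigma_i
X_norm = (X - mu) / sigma    # x_i <- (x_i - mu_i) / sigma_i ; never applied to x_0 = 1

print(X_norm.mean(axis=0))   # approximately zero mean for every feature
print(X_norm.std(axis=0))    # unit standard deviation for every feature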
Issues: Data & Loss function
What happens if there are data points that do not agree with the other data points?
How can we overcome this?
Ø Clean the data
Ø Use robust cost functions – outside the scope of this course.
Polynomial Regression
Polynomial Regression
The linear regression model is too simple and does not have enough capacity to represent the data.
How can we increase capacity of the model?
Polynomial Regression
Assumes that the target function (learning task) is a polynomial equation.
Hypothesis: h(x) = θ₀ + θ₁x₁ + θ₂x₁² + ⋯ + θᵢxₘ + θⱼxₘ² + ⋯
Add higher-order terms to the hypothesis.
Polynomial Regression
Polynomial regression is implemented using a simple trick.
Ø Change the features to polynomial features.
Ø Fit linear regression model to new features.
e.g.: Polynomial feature of degree 2 for our univariate case:
h(x) = θ₀ + θ₁x₁   →   h(x) = θ₀ + θ₁x₁ + θ₂x₁²
Instance | x | y
1 | 13.39 | 5.46
2 | 34.19 | 7.37
3 | 24.81 | 4.02

Instance | x | x^2 | y
1 | 13.39 | 179.3 | 5.46
2 | 34.19 | 1168.9 | 7.37
3 | 24.81 | 615.5 | 4.02
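A minimal sketch of that feature transformation (using the x values from the table; NumPy assumed):

import numpy as np

x = np.array([13.39, 34.19, 24.81])      # original univariate feature
X_poly = np.column_stack([x, x ** 2])    # degree-2 polynomial features: columns [x, x^2]

print(X_poly)
# Ordinary (multivariate) linear regression can now be fitted on the columns [x, x^2],
# giving h(x) = theta_0 + theta_1*x + theta_2*x^2.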
Gradient Descent
‘) (,) (,) Thelossfunctionisthesame:𝐽(𝜃)=) ∑ h% 𝐱 −𝑦
,-‘
In theory the same gradient descent approach can be used
However!
Feature normalization becomes very important.
(
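As an illustration of that point (not from the slides), one common way to combine polynomial features, normalisation and linear regression is a scikit-learn pipeline, assuming that library is available:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

x = np.array([[13.39], [34.19], [24.81]])   # univariate input as a single column
y = np.array([5.46, 7.37, 4.02])

# Expand to polynomial features, normalise them, then fit a linear regression on the result.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      StandardScaler(),
                      LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[18.1]])))    # prediction for an unseen temperature

(Note that scikit-learn's LinearRegression solves the least-squares problem directly rather than by gradient descent; the sketch only illustrates the transform-then-normalise workflow.)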
Summary
Linear Regression
Representation
Hypothesis
Model
Parameter fitting (intuition)
Gradient Descent
Next week:
Logistic Regression
Regularisation