Regression
COMP9417: Machine Learning & Data Mining Term 1, 2021
Adapted from slides by Dr Michael Bain
Administration
• Lecturer in Charge:
o Dr. Gelareh Mohammadi
• Course Admin:
o Omar Ghattas
• Teaching Assistant:
o Anant Mathur
• Tutors: Omar Ghattas, Peng Yi, Anant Mathur, Sidney Tandjiria, Daniel Woolnough, Jiaxi Zhao
COMP9417 T1, 2021 1
Communication
• Course website: Moodle
• Email: cs9417@cse.unsw.edu.au (use your UNSW email for all correspondence)
• Moodle forums
• Consult with your tutor in the time slots
• Consultation
COMP9417 T1, 2021 2
Lectures & Tutorials
• Tutorials start in week 2 (No tutorials this week)
• There are no lectures or tutorials in week 6 (term break)
• Attend your allocated tutorial session
• The assignments and homework are in Python
• References: a list is provided in the course outline
• Note: Lecture slides do not cover every detail we go through in the classroom
COMP9417 T1, 2021 3
Tutorials
• Most weeks will have both lab and tutorial activities.
• We provide pre-recorded videos for each tutorial.
• You are supposed to go through the tutorial and lab activities before attending the tutorial class and use the scheduled time for asking questions.
• There will be 4 quizzes during your tutorial time (Weeks 3, 5, 7, 9).
COMP9417 T1, 2021 4
Assessments
• Homework 1 (5%) – Due in week 3
• Homework 2 (5%) – Due in week 6
• Quizzes (5%)
• Assignment (group project) (30%) – Due in week 10
• Final Exam (55%)
Late submission penalties will apply to all assessable works. All submissions will be checked for plagiarism.
COMP9417 T1, 2021 5
A Brief Course Introduction
COMP9417 T1, 2021 6
Overview
This course will introduce you to machine learning, covering some of the core ideas, methods and theory currently used and understood by practitioners, including, but not limited to:
o Categories of learning (supervised / unsupervised learning, etc.)
o Widely-used machine learning techniques and algorithms
o Parametric vs. non-parametric approaches
o Generalisation in machine learning
o Training, validation and testing phases in applications
o Learning Theory
COMP9417 T1, 2021 7
What we will cover
• Core algorithms and model types in machine learning
• Foundational concepts regarding learning from data
• Relevant theory to inform and generalise understanding
• Practical applications
COMP9417 T1, 2021 8
What we will NOT cover
• Probability and statistics
• Lots of neural nets and deep learning
• Reinforcement learning
• Big data
• Ethical aspects of AI and ML
although all of these are interesting and important topics!
COMP9417 T1, 2021 9
Some definitions
The field of machine learning is concerned with the question of how to construct computer programs that automatically improve from experience.
“Machine Learning”. T. Mitchell (1997)
Machine learning, then, is about making computers modify or adapt their actions (whether these actions are making predictions, or controlling a robot) so that these actions get more accurate, where accuracy is measured by how well the chosen actions reflect the correct ones.
“Machine Learning”. S. Marsland (2015)
COMP9417 T1, 2021 10
Some definitions
The term machine learning refers to the automated detection of meaningful patterns in data.
“Understanding Machine Learning”. S. Shwartz and S. David (2014)
Trying to get programs to work in a reasonable way to predict stuff. R. Kohn (2015)
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.
“Data Mining”. I. Witten et al. (2016)
COMP9417 T1, 2021 11
Examples
• Object recognition / object classification
• Text classification (e.g. spam/non-spam emails)
• Speech recognition
• Event detection
• Recommender systems
• Human behaviour recognition (emotions, state of mind, etc.)
• Automatic medical diagnosis
COMP9417 T1, 2021 12
Example
Simple face recognition pipeline: [diagram: image → features → learning algorithm → predicted label ("a man", "Bob")]
COMP9417 T1, 2021 13
Machine learning pipeline
COMP9417 T1, 2021 14
Supervised and unsupervised learning
COMP9417 T1, 2021 15
Supervised and unsupervised learning
The most widely used categories of machine learning algorithms are:
o Supervised learning – output class (or label) is given
o Unsupervised learning – no output class is given
There are also hybrids, such as semi-supervised learning, and alternative strategies to acquire data, such as reinforcement learning and active learning.
Note: output class can be real-valued or discrete, scalar, vector, or other structure . . .
COMP9417 T1, 2021 16
Supervised and unsupervised learning
Supervised learning tends to dominate in applications. Why ?
COMP9417 T1, 2021 17
Supervised and unsupervised learning
Supervised learning tends to dominate in applications because generally, it is much easier to define the problem and develop an error measure (loss function) to evaluate different algorithms, parameter settings, data transformations, etc. for supervised learning than for unsupervised learning.
COMP9417 T1, 2021 18
Supervised and unsupervised learning
Unfortunately . . .
In the real world it is often difficult to obtain good labelled data in sufficient quantities
So, in such cases unsupervised learning is really what you want . . .
but currently, finding good unsupervised learning algorithms for complex machine learning tasks remains a research challenge.
COMP9417 T1, 2021 19
Aims
This lecture (which is on regression) will introduce you to machine learning approaches to the problem of numerical prediction. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
– the supervised learning task of numeric prediction
– how linear regression solves the problem of numeric prediction
– fitting linear regression by least squares error criterion
– non-linear regression via linear-in-the-parameters models
– parameter estimation for regression
– local (nearest-neighbour) regression
COMP9417 T1, 2021 20
Introduction to Regression
COMP9417 T1, 2021 21
Introduction to Regression
The “most typical” machine learning approach is to apply supervised learning methods for classification, where the task is to learn a model to predict a discrete value for data instances . . .
. . . however, we often find tasks where the most natural representation is that of prediction of numeric values. So, we need regression in those tasks
COMP9417 T1, 2021 22
Introduction to Regression
Example – task is to learn a model to predict CPU performance from a dataset of examples of 209 different computer configurations:
[table of computer configurations omitted]
COMP9417 T1, 2021 23
Introduction to Regression
Result: a linear regression equation fitted to the CPU dataset.
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX
COMP9417 T1, 2021 24
Introduction to Regression
For the class of category representations (or concept learning), machine learning is viewed as:
searching a space of hypotheses . . .
represented in a formal hypothesis language (trees, rules, graphs . . . ).
COMP9417 T1, 2021 25
Introduction to Regression
For the class of numeric representations, machine learning is viewed as:
"searching" a space of functions . . .
represented as mathematical models (linear equations, neural nets, . . . ).
Note: in both settings, the models may be probabilistic . . .
COMP9417 T1, 2021 26
Learning Linear Regression Models
COMP9417 T1, 2021 27
Linear Regression
In linear regression, we assume there is a linear relationship between the output and feature(s)/input variable(s); this means the expected value of the output given an input, 𝐸[𝑦|𝑥] is linear in input 𝑥
Example with one variable: ŷ = bx + c
Problem: given the data, estimate b and c
COMP9417 T1, 2021 28
Linear regression
• Training instances: (x_j, y_j), j = 1, …, m
• x_j is an input vector
• y_j is the output to be predicted
• Each input has n attributes/features: x_j = [x_j1, x_j2, …, x_jn]^T
• x and y are a general observation, with x = [x_1, x_2, …, x_n]^T
• θ = [θ_0, θ_1, …, θ_n]^T
• 𝕪 is the vector of outputs: 𝕪 = [y_1, …, y_m]^T
• X is the matrix of inputs where each row is one observation
COMP9417 T1, 2021 29
Linear regression
• Linear regression can be used for numeric prediction from numeric attributes
• In linear models, the outcome is a linear combination of attributes:
ŷ = θ_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_n x_n = h(x)
• Weights are estimated from the observed data (training data)
• Predicted value for the first training instance x_1 is:
ŷ_1 = θ_0 + θ_1 x_11 + θ_2 x_12 + ⋯ + θ_n x_1n = Σ_{i=0}^{n} θ_i x_1i = x_1^T θ = h(x_1)
To simplify the notation, we set x_0 = 1 and define x = [x_0, x_1, …, x_n]^T
COMP9417 T1, 2021 30
Linear regression
• Univariate regression: one input variable/feature/attribute is used
to predict one output variable
• Multiple regression / multivariable regression: more than one variable/feature/attribute is used to predict one output variable
COMP9417 T1, 2021 31
Linear Regression
An infinite number of lines can be fitted, depending on how we define the best-fit criterion
…but the most popular estimation model is “Least Squares”, also known as “Ordinary Least Squares” (OLS) regression
COMP9417 T1, 2021 32
Minimizing Error
• Error = Difference between the predicted value and the actual value
• We want to minimize the error over all samples!
• We define the total error as the sum of squared errors and search for the n+1 parameters that minimize it
• Sum of squared errors for m samples/observations:
J(θ) = Σ_{j=1}^{m} (y_j − Σ_{i=0}^{n} θ_i x_ji)² = Σ_{j=1}^{m} (y_j − x_j^T θ)² = (𝕪 − Xθ)^T (𝕪 − Xθ)
J(θ) is called the cost function or loss function. This concept generalizes to all functions that measure the distance between the predicted and true output. (In the OLS framework, it is also called the Residual Sum of Squares (RSS).)
COMP9417 T1, 2021 33
Minimizing Squared Error (normal equations)
Here is what matrix X and vector 𝕪 look like:
X = [ 1  x_11  x_12  …  x_1n ]        𝕪 = [ y_1 ]
    [ 1  x_21  x_22  …  x_2n ]            [ y_2 ]
    [ ⋮                      ]            [ ⋮  ]
    [ 1  x_m1  x_m2  …  x_mn ]            [ y_m ]
You have to be careful to add a column of ones in X to account for the intercept parameter.
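A minimal NumPy sketch of building such a design matrix; the feature values below are made up purely for illustration:

import numpy as np

# three observations, two features each (made-up numbers)
X_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 2.0]])
y = np.array([14.0, 17.0, 12.0])

# prepend a column of ones so that theta_0 acts as the intercept
X = np.column_stack([np.ones(X_raw.shape[0]), X_raw])
print(X.shape)   # (3, 3): m rows, n + 1 columns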
COMP9417 T1, 2021 34
Minimizing Squared Error
This cost function can also be written as:
J(θ) = (1/m) Σ_{j=1}^{m} (y_j − x_j^T θ)²
which is the average of the squared errors; minimizing it leads to the same θ as minimizing J(θ) without taking the average.
The quantity (1/m) Σ_{j=1}^{m} (y_j − x_j^T θ)² is called the mean squared error, or MSE for short.
COMP9417 T1, 2021 35
Multiple Regression
Multiple regression example: given 2 variables/attributes with real values x_1, x_2 ∈ R, labelled with a real-valued variable y, find the "plane of best fit" that captures the dependency of y on x_1 and x_2.
Again we can use MSE to find the best plane that fits the data.
For more than two variables, we find the best hyper-plane.
Learning here is by minimizing MSE, i.e., the average of squared vertical distances of the actual values y from the learned function ŷ = f(x).
COMP9417 T1, 2021 36
Least Squares Regression
COMP9417 T1, 2021 37
Gradient Descent
Gradient descent starts with some initial 𝜃, and repeatedly performs an update:
θ_i := θ_i − α ∂J(θ)/∂θ_i
𝛼 is the learning rate (a.k.a. step size). This algorithm takes a step in the direction of the steepest decrease in 𝐽(𝜃)
To implement this algorithm, we have to work out the partial derivative term for J(θ) = (1/m) Σ_{j=1}^{m} (y_j − x_j^T θ)²
COMP9417 T1, 2021 38
Gradient Descent
From: Gradient Descent: All you need to know, by S. Suryansh
COMP9417 T1, 2021 39
Gradient Descent
If we focus on only one sample out of m samples, then the cost function is:
J(θ) = (y_j − h_θ(x_j))² = (y_j − Σ_{i=0}^{n} x_ji θ_i)²,   where h_θ(x_j) = Σ_{i=0}^{n} x_ji θ_i = x_j^T θ
Taking the derivative will give:
∂J(θ)/∂θ_i = −2(y_j − h_θ(x_j)) x_ji
So, for a single training sample, the update rule is:
θ_i := θ_i + 2α(y_j − h_θ(x_j)) x_ji    (for every i)
The update rule for squared distance is called "Least Mean Squares" (LMS), also known as Widrow-Hoff
COMP9417 T1, 2021 40
Gradient Descent
We looked at the LMS rule for only a single sample. But what about the other samples? There are two ways: Batch Gradient Descent and Stochastic Gradient Descent. For the following cost/loss function:
J(θ) = (1/m) Σ_{j=1}^{m} (y_j − x_j^T θ)²
1. Batch Gradient Descent:
θ_i := θ_i + α (2/m) Σ_{j=1}^{m} (y_j − h_θ(x_j)) x_ji    (for every i)
Replace the single-sample gradient with the sum of gradients over all samples and continue until convergence.
Convergence means that the estimated θ has stabilized.
COMP9417 T1, 2021 41
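A small NumPy sketch of batch gradient descent for this cost function; the learning rate, iteration count and toy data are illustrative choices, not values prescribed here:

import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Minimise J(theta) = (1/m) * sum_j (y_j - x_j^T theta)^2 over all samples."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)                    # arbitrary initialisation
    for _ in range(n_iters):
        residuals = y - X @ theta                 # (y_j - h_theta(x_j)) for every j
        theta = theta + alpha * (2.0 / m) * (X.T @ residuals)   # update all theta_i at once
    return theta

# tiny synthetic example: y = 1 + 2x, with the column of ones already added
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1.0 + 2.0 * np.arange(5.0)
print(batch_gradient_descent(X, y))               # should approach [1, 2]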
Gradient Descent
2. Stochastic Gradient Descent:
for j = 1 to m {
    θ_i := θ_i + 2α(y_j − h_θ(x_j)) x_ji    (for every i)
}
Repeat this algorithm until convergence. What does this algorithm do?
COMP9417 T1, 2021 42
Gradient Descent
In stochastic gradient descent, θ gets updated after each individual sample. This algorithm is much less costly than batch gradient descent; however, it may never converge to the minimum.
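For comparison, a stochastic gradient descent sketch in the same style; the learning rate, epoch count and toy data are again illustrative assumptions:

import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=200):
    """Update theta after each individual sample (the LMS / Widrow-Hoff rule)."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_epochs):
        for j in np.random.permutation(m):        # visit samples in random order
            error = y[j] - X[j] @ theta           # residual for this single sample
            theta = theta + 2 * alpha * error * X[j]
    return theta

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 1.0 + 2.0 * np.arange(5.0)
print(stochastic_gradient_descent(X, y))          # should come close to [1, 2]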
COMP9417 T1, 2021 43
Gradient Descent
In both algorithms:
– You can start with arbitrary (random) values for your parameters θ_i (initialization)
– You have to be careful with the selection of the learning rate, α, as a small value can make the algorithm very slow, and a large value may stop the algorithm from converging
COMP9417 T1, 2021 44
Minimizing Squared Error (normal equations)
Gradient descent is one way of minimizing our cost function 𝐽(𝜃) which is an iterative algorithm.
But maybe you can find the minimum of J(θ) explicitly by taking its derivatives and setting them to zero. This is also called the exact or closed-form solution:
∂J(θ)/∂θ = 0
J(θ) = (𝕪 − Xθ)^T (𝕪 − Xθ)
∂J(θ)/∂θ = −2X^T(𝕪 − Xθ) = 0
X^T(𝕪 − Xθ) = 0
θ = (X^T X)^{-1} X^T 𝕪
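A NumPy sketch of this closed-form solution on made-up data; solving the linear system (or using np.linalg.lstsq) is numerically preferable to forming the explicit inverse:

import numpy as np

# made-up data: m = 5 observations, one feature plus the intercept column
X = np.column_stack([np.ones(5), np.array([0.0, 1.0, 2.0, 3.0, 4.0])])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# theta = (X^T X)^{-1} X^T y, written as a linear solve
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)        # [intercept, slope]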
COMP9417 T1, 2021 45
Minimizing Squared Error (probabilistic interpretation)
We can write the relationship between input variable x and output variable y as:
𝑦’ =𝑥’*𝜃+𝜀’
And 𝜀’ is an error term which might be unmodeled effect or random noise. Let’s assume 𝜀’s are independent and identically distributed (𝑖. 𝑖. 𝑑.) according to a Gaussian distribution:
𝜀’~𝑁(0,𝜎”)
1 𝜀'” 𝑝𝜀’ = 2𝜋𝜎”exp(−2𝜎”)
COMP9417 T1, 2021 46
Minimizing Squared Error (probabilistic interpretation)
This implies that:
p(ε_j) = p(y_j | x_j; θ) = (1/√(2πσ²)) exp(−(y_j − x_j^T θ)²/(2σ²))
So we want to estimate θ such that we maximize the probability of output y given input x over all m training samples:
L(θ) = p(𝕪 | X; θ)    (this is called the likelihood function)
COMP9417 T1, 2021 47
Minimizing Squared Error (probabilistic interpretation)
Since we assumed independence over the ε_j, that implicitly means that our training samples are also independent of each other. Based on this assumption, we can write L(θ) as follows:
L(θ) = Π_{j=1}^{m} p(y_j | x_j; θ) = Π_{j=1}^{m} (1/√(2πσ²)) exp(−(y_j − x_j^T θ)²/(2σ²))
Now, we have to estimate θ such that it maximizes L(θ), i.e., to have as high a probability as possible. This is called maximum likelihood.
COMP9417 T1, 2021 48
Minimizing Squared Error (probabilistic interpretation)
We know (from math) that to find the θ that maximizes L(θ), we can also maximize any strictly increasing function of L(θ). In this case it is easier to maximize the log-likelihood ℓ(θ):
ℓ(θ) = log L(θ) = log Π_{j=1}^{m} (1/√(2πσ²)) exp(−(y_j − x_j^T θ)²/(2σ²))
     = m log(1/√(2πσ²)) − (1/σ²)(1/2) Σ_{j=1}^{m} (y_j − x_j^T θ)²
So, maximizing ℓ(θ) is equal to minimizing Σ_{j=1}^{m} (y_j − x_j^T θ)²
Do you know what Σ_{j=1}^{m} (y_j − x_j^T θ)² is?
COMP9417 T1, 2021 49
Minimizing Squared Error (probabilistic interpretation)
• This simply shows that under certain assumptions (ε_j ~ N(0, σ²) and i.i.d.), least-squares regression is equivalent to finding the maximum likelihood estimate of θ.
• Note that the value of σ² does not affect the choice of θ
COMP9417 T1, 2021 50
Linear Regression Assumptions
• Linearity: The relationship between 𝑥 and the mean of 𝑦 is linear.
• Homoscedasticity: The variance of the residuals is the same for any value of 𝑥.
• Independence: Observations are independent of each other.
• Normality: For any fixed value of 𝑥, 𝑦 is normally distributed.
COMP9417 T1, 2021 51
Step back: Statistical Techniques for Data Analysis
COMP9417 T1, 2021 52
Probability versus Statistics
• Probability: reasons from populations to samples
– This is deductive reasoning, and is usually sound (in the logical sense of the word)
• Statistics: reasons from samples to populations
– This is inductive reasoning, and is usually unsound (in the logical
sense of the word)
COMP9417 T1, 2021 53
Sampling
COMP9417 T1, 2021 54
Where do the Data come from? (Sampling)
• For groups (populations) that are fairly homogeneous, we do not need to collect a lot of data. (We do not need to sip a cup of tea several times to decide that it is too hot.)
• For populations which have irregularities, we will need to either take measurements of the entire group, or find some way of getting a good idea of the population without having to do so
• Sampling is a way to draw conclusions about the population without having to measure all of the population. The conclusions need not be completely accurate.
• All this is possible if the sample closely resembles the population about which we are trying to draw some conclusions
COMP9417 T1, 2021 55
What We Want From a Sampling Method
• No systematic bias, or at least no bias that we cannot account for in our calculations
• The chance of obtaining an unrepresentative sample can be calculated. (So, if this chance is high, we can choose not to draw any conclusions.)
• The chance of obtaining an unrepresentative sample decreases with the size of the sample
COMP9417 T1, 2021 56
Estimation
COMP9417 T1, 2021 57
Estimation from a Sample
• In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.
• Estimating some aspect of the population using a sample is a common task, e.g. sample means are used to estimate population means
• Along with the estimate, we also want to have some idea of the accuracy of the estimate (usually expressed in terms of confidence limits)
• Some measures calculated from the sample are very good estimates of corresponding population values. For example, the sample mean 𝑚 is a very good estimate of the population mean 𝜇. But this is not always the case. For example, the range of a sample usually underestimates the range of the population
COMP9417 T1, 2021 58
Estimation from a Sample
• We will have to clarify what is meant by a “good estimate”. One meaning is that an estimator is correct on average. For example, on average, the mean of a sample is a good estimator of the mean of the population
• For example, when a number of sample sets are drawn and the mean of each is found, then average of these means is equal to the population mean
• Such an estimator is said to be statistically unbiased
COMP9417 T1, 2021 59
Estimates of the Mean and Variance
Mean:
• The arithmetic mean of a sample is m = (1/N) Σ_{i=1}^{N} x_i, where the observations are x_1, x_2, …, x_N
• This works very well when the data follow a "normal" distribution
• If we can group the data so that observation x_1 occurs f_1 times, x_2 occurs f_2 times and so on, then the mean is calculated even more easily as m = (1/N) Σ_i x_i f_i
• If we define relative frequencies as p_i = f_i / N, then the mean is simply the observations weighted by relative frequency, that is m = Σ_i x_i p_i
• So, the expected value of a discrete random variable X is:
E[X] = Σ_i x_i p(X = x_i)
COMP9417 T1, 2021 60
Estimates of the Mean and Variance
Variance: measures how far a set of random numbers is spread out from their average value
• The standard deviation (square root of the variance) is estimated as:
s = √( (1/(N−1)) Σ_i (x_i − m)² )
• Again this is a very good estimate when the data are modeled by a Normal distribution.
• For grouped data, this is modified to:
s = √( (1/(N−1)) Σ_i (x_i − m)² f_i )
• Again, we can write this in terms of expected values, which estimates the scatter (spread) of values of a random variable X around a mean value E(X):
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
Variance is the "mean of the squares minus the square of the mean"
COMP9417 T1, 2021 61
Covariance and Correlation
Covariance is a measure of relationship between two random variables:
cov(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / (N − 1) = ( Σ_i x_i y_i − N x̄ ȳ ) / (N − 1)
Correlation is a measure to show how strongly a pair of random variables are related
• The formula for computing the correlation between x and y is:
r = cov(x, y) / √( var(x) · var(y) )
This is also called Pearson's correlation coefficient
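A quick NumPy check of these formulas on made-up data; note that np.cov and np.corrcoef use the same N − 1 denominator by default:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
N = len(x)

cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (N - 1)    # sample covariance
r = cov_xy / np.sqrt(x.var(ddof=1) * y.var(ddof=1))           # Pearson correlation

print(cov_xy, np.cov(x, y)[0, 1])         # the two covariance values should agree
print(r, np.corrcoef(x, y)[0, 1])         # and so should the correlations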
COMP9417 T1, 2021 62
Correlation
Pearson correlation in some sets of (x,y). [Wikipedia]
COMP9417 T1, 2021 63
Correlation
• The correlation coefficient is a number between −1 and +1 that shows whether a pair of variables 𝑥 and 𝑦 are associated or not and whether the scatter in the association is high or low:
– A value close to 1 shows that high values of 𝑥 are associated with high values of 𝑦 and low values of 𝑥 are associated with low values of 𝑦, and the scatter is low
– A value near 0 indicates that there is no association and there is a large scatter
– A value close to -1 suggests a strong inverse association between 𝑥 and 𝑦
• Correlation is only appropriate when 𝑥 and 𝑦 are roughly linearly associated (does not work well when the association is curved)
COMP9417 T1, 2021 64
What Does Correlation Mean?
COMP9417 T1, 2021 65
Regression
Univariate linear regression
Linear Relationship Between 2 Variables I
• GOAL: fit a line whose equation is of the form Ŷ = a + bX
• HOW: minimise Σ_i d_i² = Σ_i (Y_i − Ŷ_i)² (the "least squares" criterion)
COMP9417 T1, 2021 66
Meaning of the coefficients
COMP9417 T1, 2021 67
Finding the parameters for univariate linear regression
COMP9417 T1, 2021 68
Univariate linear regression
In univariate regression we aim to find the relationship between 𝑦 and one independent variable 𝑥.
Example:
Suppose we want to investigate the relationship between people’s height and weight. We collect 𝑚 height and weight measurements:
h’,𝑤’ , 𝑗=1,…,𝑚
Univariate linear regression assumes a linear relation 𝑤a = 𝜃6 + 𝜃!h, with
parameters 𝜃 , 𝜃 chosen such that the sum of squared residuals ∑% b𝑤 − 6 ! ‘)! ‘
𝜃6 + 𝜃!h’c” is minimized
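In this univariate case the minimiser has a simple closed form: θ_1 = Σ_j (h_j − h̄)(w_j − w̄) / Σ_j (h_j − h̄)² and θ_0 = w̄ − θ_1 h̄. A small NumPy sketch with made-up height/weight pairs:

import numpy as np

h = np.array([155.0, 162.0, 170.0, 178.0, 185.0])    # heights (made up)
w = np.array([52.0, 58.0, 66.0, 74.0, 82.0])         # weights (made up)

theta1 = np.sum((h - h.mean()) * (w - w.mean())) / np.sum((h - h.mean()) ** 2)
theta0 = w.mean() - theta1 * h.mean()
print(theta0, theta1)     # intercept and slope of w_hat = theta0 + theta1 * h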
COMP9417 T1, 2021 69
Univariate linear regression
COMP9417 T1, 2021 70
Linear regression: intuitions
COMP9417 T1, 2021 71
Linear regression: intuitions
Another important point to note is that the sum of the residuals of the least-squares (or mean squared error) solution is zero:
Σ_{j=1}^{m} (y_j − x_j^T θ) = 0
While this property is intuitively appealing it is worth keeping in mind that it also makes linear regression susceptible to outliers: points that are far removed from the regression line, often because of the measurement error.
COMP9417 T1, 2021 72
The effect of outliers
[scatter plot: weight (roughly 40–90) against height (roughly 140–200), showing how a single outlier changes the least-squares regression line]
COMP9417 T1, 2021 73
The effect of outliers
Shown on previous slide:
• Suppose that, as the result of a transcription error, one of the weight values from the previous example of univariate regression is increased by 10 kg. The diagram shows that this has a considerable effect on the least-squares regression line.
• Specifically, we see that one of the blue points got moved up 10 units to the green point, changing the red regression line to the green line.
COMP9417 T1, 2021 74
Multiple linear regression
COMP9417 T1, 2021 75
Multiple Regression
Often, we are interested in modelling the relationship of 𝑦 to several other variables. In observational studies the value of 𝑦 may be affected by the value of several variables.
Example:
Suppose we want to predict people's weight from their height and body frame size (usually measured by wrist size). We collect m height, weight and body frame measurements:
(h_j, f_j, w_j), j = 1, …, m
COMP9417 T1, 2021 76
Multiple Regression
Similar to before the linear regression model is:
ŵ = θ_0 + θ_1 h + θ_2 f,
And similar to univariate linear regression, we have to minimize the sum of squared residuals
Σ_{j=1}^{m} (w_j − (θ_0 + θ_1 h_j + θ_2 f_j))²
• Including more variables can give a narrower confidence interval on the prediction being made
• With many variables, the regression equations and the expressions for the θ are expressed better using a matrix representation (as we used before) for sets of equations
COMP9417 T1, 2021 77
Linear Regression for curve shapes
You may think that linear regression produces straight lines or planes, and that nonlinear models are needed to produce curvature!
Well, that’s not completely correct. With some tricks we can produce curves with linear regression
COMP9417 T1, 2021 78
Linear Regression for curve shapes Example
COMP9417 T1, 2021 79
Linear Regression for curve shapes
• In this example we can predict the output with different models:
• ŷ = θ_0 + θ_1 x_1
• ŷ = θ_0 + θ_1 x_1 + θ_2 x_1²; setting x_2 = x_1² gives ŷ = θ_0 + θ_1 x_1 + θ_2 x_2
• ŷ = θ_0 + θ_1 x_1 + θ_2 x_1² + θ_3 x_1³; setting x_2 = x_1², x_3 = x_1³ gives ŷ = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3
As you can see, these nonlinear models can still be treated like linear regression and they can fit curvature. They are still linear in the parameters.
(Nonlinear regression is not linear in the parameters, e.g., ŷ = θ_1 x / (θ_2 + x))
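A NumPy sketch of this trick: expand x into powers and fit ordinary least squares on the expanded features; the cubic ground truth, noise level and random seed are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
y = 1.0 + 0.5 * x - 0.8 * x**2 + 0.3 * x**3 + rng.normal(0, 0.5, size=x.shape)

# linear-in-the-parameters model: the columns are 1, x, x^2, x^3
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
theta = np.linalg.solve(X.T @ X, X.T @ y)    # ordinary least squares
print(theta)                                 # roughly [1.0, 0.5, -0.8, 0.3]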
COMP9417 T1, 2021 80
Linear Regression for curve shapes
Image from AWS website
COMP9417 T1, 2021 81
Linear Regression for curve shapes
Question:
How to control for degree of complexity of the model to avoid overfitting?
COMP9417 T1, 2021 82
Regularisation
COMP9417 T1, 2021 83
Regularisation
Regularisation is a general method to avoid overfitting by applying additional constraints to the weight vector. A common approach is to make sure the weights are, on average, small in magnitude: this is referred to as shrinkage.
Recall the setting for regression in terms of cost minimization:
Y = f_{θ_0, θ_1, …, θ_n}(X_1, X_2, …, X_n) = f_θ(X)
• Can add penalty terms to a cost function, forcing coefficients to shrink to zero
COMP9417 T1, 2021 84
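One common instance of shrinkage (named here only as an example) is ridge regression, which adds λ‖θ‖² to the squared-error cost. A minimal NumPy sketch of its closed form; the function name ridge_fit, the choice of λ and the convention of not penalising the intercept are illustrative assumptions:

import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression: minimise ||y - X theta||^2 + lam * ||theta||^2."""
    n = X.shape[1]
    penalty = lam * np.eye(n)
    penalty[0, 0] = 0.0                      # conventionally, do not shrink the intercept
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# made-up data with an intercept column
X = np.column_stack([np.ones(6), np.linspace(0, 5, 6)])
y = np.array([0.9, 2.1, 2.9, 4.2, 4.8, 6.1])
print(ridge_fit(X, y, lam=0.0))              # lam = 0 recovers ordinary least squares
print(ridge_fit(X, y, lam=10.0))             # a larger lam shrinks the slope coefficient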
Regularisation
COMP9417 T1, 2021 85
Regularisation
COMP9417 T1, 2021 86
Regularisation
COMP9417 T1, 2021 87
Train, Validation & Test Data
• Train Data: the data that we use to learn our model and its
parameters
• Validation Data: The data (unseen by the model) used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
• Test Data: data unseen by the model that we use to test the model and that shows how well our model generalizes. In practice, we don't know the ground truth (label) for this data
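A small sketch of carving a dataset into these three sets with scikit-learn's train_test_split; the 60/20/20 proportions, random seed and random data are arbitrary choices for illustration:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)     # made-up inputs
y = np.random.rand(100)        # made-up outputs

# first split off the test set, then split the remainder into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60, 20, 20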
COMP9417 T1, 2021 88
Train, Validation & Test Data
Ideally, we would like to get the same performance on the test set as we get on the validation/training set. This is called Generalization.
Generalization is the model ability to adapt properly to new, previously unseen data drawn from the same distribution as the one used to create the model.
COMP9417 T1, 2021 89
Model Evaluation
COMP9417 T1, 2021 90
Model Evaluation
– R-squared represents the proportion of variance in the output that has been explained by the model
COMP9417 T1, 2021 91
Model Evaluation Example:
• As can be seen, adjusted R-squared does a better job of penalizing cases 2 & 3 for having more variables without adding any extra information
COMP9417 T1, 2021 92 [By Alvin Swalin, Data Institute]
Model Evaluation
• The absolute value of RMSE does not actually tell how bad a model is; it can only be used to compare across two models.
• Adjusted R-squared, on the other hand, speaks directly to the quality of the model. For example, if a model has adjusted R-squared equal to 0.05 then it is definitely poor.
• However, if you care only about prediction accuracy then RMSE is best: it is computationally simple, easily differentiable and present as the default metric for most models.
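A NumPy sketch of the metrics discussed here (RMSE, R-squared, and adjusted R-squared for m samples and n features); the example predictions and the value of n are made up:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6, 13.2])
m, n = len(y_true), 2                        # pretend the model used n = 2 features

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
ss_res = np.sum((y_true - y_pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                     # proportion of variance explained
adj_r2 = 1 - (1 - r2) * (m - 1) / (m - n - 1)

print(rmse, r2, adj_r2)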
COMP9417 T1, 2021 93
Some further issues in learning linear regression models
COMP9417 T1, 2021 94
Categorical Variables
• “Indicator” variables are those that take on the values 0 or 1.
• They are used to include the effect of categorical variables, for example, if 𝐷 is a variable that takes value 1 if a patient takes a drug and 0 if the patient does not. Suppose we want the effect of drug 𝐷 on blood pressure 𝑦, keeping 𝑎𝑔𝑒 constant
𝑦 = 70 + 5𝐷 + 0.44 𝐴𝑔𝑒
• So taking drug 𝐷 (a unit change in 𝐷) makes a difference of 5 units
in blood pressure, provided age is held constant.
COMP9417 T1, 2021 95
Categorical Variables
How do we capture any interaction effect between age and drug intake?
Introduce a new indicator variable z = D × Age:
ŷ = 70 + 5D + 0.44 Age + 0.21 z
COMP9417 T1, 2021 96
Model Selection
Suppose there are a lot of variables/features (𝑥), some of which may be
representing products, powers, etc.
• Taking all the features will lead to an overly complex model. There are 3 ways to reduce complexity:
1. Subset-selection, by search over subset lattice. Each subset results in a new model, and the problem is to select one of the models.
2. Shrinkage, or regularization of coefficients to zero, by optimization. There is a single model, and unimportant variables have near-zero coefficients.
3. Dimensionality-reduction, by projecting points into a lower dimensional space (this is different to subset-selection, and we will look at it later)
COMP9417 T1, 2021 97
Model Selection
Subset-selection: we want to find a subset of variables/features which performs well and get rid of redundant features. This is also called stepwise regression.
• Historically, model-selection for regression has been done using “forward-selection”, “backward-elimination”, or “bidirectional” methods
– These are greedy search techniques that either:
o (a) start with no variable and at each step add one variable whose addition gives the most improvement to the fit
o (b) start with all variables and at each step remove one whose loss gives the most insignificant deterioration of the model fit;
o (c) a combination of the above, testing at each step for variables to be included or excluded.
COMP9417 T1, 2021 98
Model Selection
• This greedy selection is done on the basis of calculating the fit quality (often using R-squared, which denotes the proportion of total variation in the dependent variable 𝑦 that is explained by the model)
• Given a model formed with a subset of variables 𝑥, it is possible to compute the observed change in R-squared due to the addition or deletion of some variable 𝑥
• This is used to greedily select the next best move in the graph-search
To set other hyper-parameters, such as shrinkage parameter in regularization, we can use grid search
COMP9417 T1, 2021 99
Local (nearest-neighbor) regression
COMP9417 T1, 2021 100
Local Learning
• Relates to the simplest form of learning: rote learning, or memorization
• Training instances are searched for instance that most closely resembles query or test instance
• The instances themselves represent the knowledge
• Called: nearest-neighbor, instance-based, memory-based or case- based learning; all forms of local learning
• The similarity or distance function defines “learning”, i.e., how to go beyond simple memorization
• Intuition — classify an instance similarly to examples “close by” — neighbors or exemplars
• A form of lazy learning – don't need to build a model!
COMP9417 T1, 2021 101
Nearest neighbor for numeric prediction
• Store all training examples (x_i, f(x_i))
Nearest neighbour
• Given query instance x_q,
• First locate the nearest neighbour x_n,
• Then estimate ŷ_q = f(x_n)
K-nearest neighbour
• Given x_q, take the mean of the f values of the k nearest neighbours:
ŷ_q = (1/k) Σ_{i=1}^{k} f(x_i)
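A compact NumPy sketch of k-nearest-neighbour numeric prediction using Euclidean distance; the function name knn_predict, the value of k and the toy data are illustrative assumptions:

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Average the targets of the k training points closest to x_query."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                             # indices of the k closest points
    return y_train[nearest].mean()

# made-up 2-feature training data
X_train = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 4.0], [6.0, 5.0], [7.0, 8.0]])
y_train = np.array([1.2, 1.9, 3.5, 6.1, 7.8])
print(knn_predict(X_train, y_train, np.array([2.5, 2.0]), k=3))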
COMP9417 T1, 2021 102
Distance function
The distance function defines what is “learned”, i.e., predicted
• Most commonly distance function used is Euclidean distance
• If instance x_i is described by n feature values x_i1, …, x_in
• The Euclidean distance between two instances x_i, x_j is defined as
d(x_i, x_j) = √( Σ_{k=1}^{n} (x_ik − x_jk)² )
COMP9417 T1, 2021 103
Local regression
Use KNN to form a local approximation to 𝑓 for each query point x_q, using a linear function of the form:
ŷ = f̂(x) = θ_0 + θ_1 x_1 + ⋯ + θ_n x_n
• Fit the linear function to the k nearest neighbours of the query point
• Linear, quadratic or higher-order polynomial can be used
• Produces “piecewise approximation” so can be useful for non-linear functions, or data with changing distributions
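A sketch of this idea for one input variable: fit an ordinary least-squares line only to the k training points nearest the query and evaluate it there; the function name local_linear_predict, the value of k and the synthetic sine data are illustrative assumptions:

import numpy as np

def local_linear_predict(x_train, y_train, x_query, k=5):
    """Fit a straight line to the k nearest neighbours of x_query and evaluate it there."""
    nearest = np.argsort(np.abs(x_train - x_query))[:k]
    X = np.column_stack([np.ones(k), x_train[nearest]])    # local design matrix
    theta = np.linalg.solve(X.T @ X, X.T @ y_train[nearest])
    return theta[0] + theta[1] * x_query

x_train = np.linspace(0, 2 * np.pi, 50)
y_train = np.sin(x_train)                    # a non-linear target, approximated piecewise
print(local_linear_predict(x_train, y_train, 1.5))   # approximately sin(1.5)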
COMP9417 T1, 2021 104
Local regression
[From, Locally-weighted homographies for calibration of imaging systems, by P. Ranganathan & E. Olson]
COMP9417 T1, 2021 105
Bias-Variance Tradeoff
COMP9417 T1, 2021 106
Bias-Variance Tradeoff
[figure: fits to the same data illustrating underfitting, a good fit, and overfitting relative to the truth]
Source: Scott Fortmann-Roe, Understanding the Bias-Variance Tradeoff
COMP9417 T1, 2021 107
Bias-Variance Tradeoff
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
Variance is the variability of model prediction for a given data point, i.e., it tells us the spread of our predictions. A model with high variance pays a lot of attention to training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
COMP9417 T1, 2021 108
Bias-Variance Tradeoff
• In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance.
• In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in data.
COMP9417 T1, 2021 109
Bias-Variance Decomposition
With y = f(x) + ε, where ε is noise with zero mean and variance Var(ε), the expected squared error of a learned model f̂ at a point x decomposes as:
E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + Var(ε)
COMP9417 T1, 2021 110
Bias-Variance Tradeoff
• When comparing unbiased estimators, we would like to select the one with minimum variance
• In general, we would be comparing estimators that have some bias and some variance
• If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model
has a large number of parameters then it's going to have high variance and low bias. So, we need to find the right balance without overfitting or underfitting the data.
• We need to choose a complexity that makes a good tradeoff between bias and variance.
COMP9417 T1, 2021 111
Bias-Variance Tradeoff
• Often, the goal is to find the balance between bias and variance such that we minimize the overall (total) error.
• There is not a quantitative way to find this balanced error point. Instead, you will need to evaluate your model (preferably on an unseen validation data set) and adjust its complexity until you find the configuration that minimizes the overall error.
COMP9417 T1, 2021 112
Acknowledgements
• Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book "Machine Learning" by T. Mitchell McGraw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Ng, Stanford University, USA (2015)
• Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 T1, 2021 113