Supervised Learning – Regression
Supervised Learning – Regression
COMP9417 Machine Learning and Data Mining
Last revision: 7 Mar 2018
COMP9417 ML & DM Regression Semester 1, 2018 1 / 99
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course
“Machine Learning” by A. Srinivasan
BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Regression Semester 1, 2018 2 / 99
Aims
Aims
This lecture will introduce you to machine learning approaches to the
problem of numerical prediction. Following it you should be able to
reproduce theoretical results, outline algorithmic techniques and describe
practical applications for the topics:
• the supervised learning task of numeric prediction
• how linear regression solves the problem of numeric prediction
• fitting linear regression by least squares error criterion
• non-linear regression via linear-in-the-parameters models
• parameter estimation for regression
• local (nearest-neighbour) regression
Note: slides with titles marked ∗ are for background only.
COMP9417 ML & DM Regression Semester 1, 2018 3 / 99
Introduction
Introduction
In the introductory lecture we saw supervised learning methods mostly for
classification, where the task is prediction of a discrete value for data
instances . . .
. . . however, we often find tasks where the most natural representation is
that of prediction of numeric values
COMP9417 ML & DM Regression Semester 1, 2018 4 / 99
Introduction
Introduction
Task: learn a model to predict CPU performance from a dataset of 209 examples of different computer configurations.
COMP9417 ML & DM Regression Semester 1, 2018 5 / 99
Introduction
Introduction
Result: a linear regression equation fitted to the CPU dataset.
PRP = −56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX
COMP9417 ML & DM Regression Semester 1, 2018 6 / 99
Introduction
Introduction
For the class of symbolic representations, machine learning is viewed as:
searching a space of hypotheses . . .
represented in a formal hypothesis language (trees, rules, graphs . . . ).
COMP9417 ML & DM Regression Semester 1, 2018 7 / 99
Introduction
Introduction
For the class of numeric representations, machine learning is viewed as:
“searching” a space of functions . . .
represented as mathematical models (linear equations, neural nets, . . . ).
Note: in both settings, the models may be probabilistic . . .
COMP9417 ML & DM Regression Semester 1, 2018 8 / 99
Introduction
Introduction
Methods to predict a numeric output from statistics and machine learning:
• linear regression (statistics) determining the “line of best fit” using
the least squares criterion
• linear models (machine learning) learning a predictive model from
data under the assumption of a linear relationship between predictor
and target variables
Very widely-used, many applications
Ideas that are generalised in Artificial Neural Networks
COMP9417 ML & DM Regression Semester 1, 2018 9 / 99
Introduction
Introduction
• non-linear regression by adding non-linear basis functions
• multi-layer neural networks (machine learning) learning non-linear
predictors via hidden nodes between input and output
• regression trees (statistics / machine learning) tree where each leaf
predicts a numeric quantity
• local (nearest-neighbour) regression
COMP9417 ML & DM Regression Semester 1, 2018 10 / 99
Learning Linear Regression Models
Learning Linear Regression Models
COMP9417 ML & DM Regression Semester 1, 2018 11 / 99
Learning Linear Regression Models
Regression
We will look at the simplest model for numerical prediction:
a regression equation
Outcome will be a linear sum of feature values with appropriate weights.
Note: the term regression is overloaded – it can refer to:
• the process of determining the weights for the regression equation, or
• the regression equation itself.
COMP9417 ML & DM Regression Semester 1, 2018 12 / 99
Learning Linear Regression Models
Linear Regression
Assumes: expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = bx for some unknown b.
Learning problem: given the data, estimate b (i.e., b̂).
COMP9417 ML & DM Regression Semester 1, 2018 13 / 99
Learning Linear Regression Models
Linear Models
• Numeric attributes and numeric prediction, i.e., regression
• Linear models, i.e. outcome is linear combination of attributes
$y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n$
• Weights are calculated from the training data
• Predicted value for the first training instance $x^{(1)}$ is:
$b_0 x^{(1)}_0 + b_1 x^{(1)}_1 + b_2 x^{(1)}_2 + \ldots + b_n x^{(1)}_n = \sum_{i=0}^{n} b_i x^{(1)}_i$
COMP9417 ML & DM Regression Semester 1, 2018 14 / 99
Learning Linear Regression Models
Minimizing Squared Error
Difference between predicted and actual values is the error !
The $n + 1$ coefficients are chosen so that the sum of squared error on all instances in the training data is minimized.
Squared error:
$\sum_{j=1}^{m} \left( y^{(j)} - \sum_{i=0}^{n} b_i x^{(j)}_i \right)^2$
Coefficients can be derived using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking).
Known as “Ordinary Least Squares” (OLS) regression – minimizing the
sum of squared distances of data points to the estimated regression line.
COMP9417 ML & DM Regression Semester 1, 2018 15 / 99
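As an illustration of the matrix computation, here is a minimal NumPy sketch (with made-up data, not the CPU dataset) that computes the OLS coefficients via the normal equations; in practice a library routine such as numpy.linalg.lstsq would normally be preferred.

```python
import numpy as np

# Toy training data: m = 5 instances, n = 2 attributes (made-up numbers).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

# Prepend a column of ones so that b0 (the intercept) is handled like any other weight.
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary Least Squares: choose b to minimise sum_j (y_j - sum_i b_i x_i^(j))^2.
# Closed form (normal equations): b = (X^T X)^{-1} X^T y.
b = np.linalg.solve(X1.T @ X1, X1.T @ y)

predictions = X1 @ b
squared_error = np.sum((y - predictions) ** 2)
print("coefficients b0..bn:", b)
print("sum of squared errors:", squared_error)
```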
Learning Linear Regression Models
Multiple Regression
Example: linear least squares fitting with 2 input variables.
COMP9417 ML & DM Regression Semester 1, 2018 16 / 99
Statistical Techniques for Data Analysis
Step back: Statistical Techniques for Data Analysis
COMP9417 ML & DM Regression Semester 1, 2018 17 / 99
Statistical Techniques for Data Analysis
Probability vs Statistics: The Difference
• Probability versus Statistics
• Probability: reasons from populations to samples
• This is deductive reasoning, and is usually sound (in the logical sense
of the word)
• Statistics: reasons from samples to populations
• This is inductive reasoning, and is usually unsound (in the logical sense
of the word)
COMP9417 ML & DM Regression Semester 1, 2018 18 / 99
Statistical Techniques for Data Analysis
Statistical Analyses
• Statistical analyses usually involve one of 3 things:
1 The study of populations;
2 The study of variation; and
3 Techniques for data abstraction and data reduction
• Statistical analysis is more than statistical computation:
1 What is the question to be answered?
2 Can it be quantitative (i.e., can we make measurements about it)?
3 How do we collect data?
4 What can the data tell us?
COMP9417 ML & DM Regression Semester 1, 2018 19 / 99
Statistical Techniques for Data Analysis Sampling
Sampling
COMP9417 ML & DM Regression Semester 1, 2018 20 / 99
Statistical Techniques for Data Analysis Sampling
Where do the Data come from? (Sampling)
• For groups (populations) that are fairly homogeneous, we do not need
to collect a lot of data. (We do not need to sip a cup of tea several
times to decide that it is too hot.)
• For populations which have irregularities, we will need to either take measurements of the entire group, or find some way of getting a good idea of the population without having to do so
• Sampling is a way to draw conclusions about the population without
having to measure all of the population. The conclusions need not be
completely accurate
• All this is possible if the sample closely resembles the population
about which we are trying to draw some conclusions
COMP9417 ML & DM Regression Semester 1, 2018 21 / 99
Statistical Techniques for Data Analysis Sampling
What We Want From a Sampling Method
• No systematic bias, or at least no bias that we cannot account for in
our calculations
• The chance of obtaining an unrepresentative sample can be
calculated. (So, if this chance is high, we can choose not to draw any
conclusions.)
• The chance of obtaining an unrepresentative sample decreases with
the size of the sample
COMP9417 ML & DM Regression Semester 1, 2018 22 / 99
Statistical Techniques for Data Analysis Sampling
Simple Random Sampling
• Each element of the population is associated with a number
• Shuffle all the numbers and put them into a hat
• Draw a sample of n numbers from the hat and get the corresponding
elements of the population
Usually, there are no hats 🙂 and we will be using a computer to generate
n numbers that are approximately random.
In addition, the computer will use a mathematical relationship between
elements of the population and the set of numbers. Inverting this
relationship using the n random numbers will then give the elements of
the population.
COMP9417 ML & DM Regression Semester 1, 2018 23 / 99
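A minimal sketch of simple random sampling in NumPy, under the assumption that the population is simply a set of numbered elements; the pseudo-random number generator plays the role of the hat.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # pseudo-random numbers stand in for the hat

population = np.arange(1, 1001)      # elements numbered 1..1000 (made-up population)
n = 20                               # sample size

# Draw n distinct numbers uniformly at random (sampling without replacement).
sample = rng.choice(population, size=n, replace=False)
print(sample)
```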
Statistical Techniques for Data Analysis Sampling
Probability Sampling
• In effect, numbers drawn using simple random sampling (in a single
stage or more) use a uniform probability distribution over the
numbers. That is, the probability of getting any number from 1 . . . n
from the hat is 1/n.
• A more general form of this is to use any kind of probability
distribution over 1 . . . n. For example, a distribution could make larger numbers more likely than smaller numbers. This is a skewed distribution
• For example, take a 2-stage sampling procedure in which households
are grouped according to size, and the probability of selecting larger
households is higher. A household is selected and then an individual is
selected from that household. This gives a greater chance of selecting
individuals from larger households
• Once again, it is relatively straightforward to do this form of
probability-based sampling using a computer
COMP9417 ML & DM Regression Semester 1, 2018 24 / 99
Statistical Techniques for Data Analysis Estimation
Estimation
COMP9417 ML & DM Regression Semester 1, 2018 25 / 99
Statistical Techniques for Data Analysis Estimation
Estimation from a Sample
• Estimating some aspect of the population using a sample is a
common task. Along with the estimate, we also want to have some
idea of the accuracy of the estimate (usually expressed in terms of
confidence limits)
• Some measures calculated from the sample are very good estimates of
corresponding population values. For example, the sample mean m is
a very good estimate of the population mean µ. But this is not
always the case. For example, the range of a sample usually
under-estimates the range of the population
• We will have to clarify what is meant by a “good estimate”. One
meaning is that an estimator is correct on average. For example, on
average, the mean of a sample is a good estimator of the mean of the
population
COMP9417 ML & DM Regression Semester 1, 2018 26 / 99
Statistical Techniques for Data Analysis Estimation
Estimation from a Sample
• For example, when a number of samples are drawn and the mean of each is found, then the average of these means is equal to the population mean
• Such an estimator is said to be statistically unbiased
COMP9417 ML & DM Regression Semester 1, 2018 27 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread I
Mean. This is calculated as follows.
• Find the total T of N observations. Estimate the
(arithmetic) mean by m = T/N .
• This works very well when the data follow a symmetric
bell-shaped frequency distribution (of the kind modelled
by “normal” distribution)
• A simple mathematical expression of this is $m = \frac{1}{N}\sum_i x_i$, where the observations are $x_1, x_2, \ldots, x_N$
• If we can group the data so that the observation $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times and so on, then the mean can be calculated even more easily as $m = \frac{1}{N}\sum_i x_i f_i$
COMP9417 ML & DM Regression Semester 1, 2018 28 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread II
• If, instead of frequencies, you had relative frequencies (i.e. instead of $f_i$ you had $p_i = f_i/N$), then the mean is simply the observations weighted by relative frequency. That is, $m = \sum_i x_i p_i$
• We want to connect this up to computing the mean value of observations modelled by some theoretical probability distribution function. That is, we want a similar counting method for calculating the mean of random variables modelled using some known distribution
COMP9417 ML & DM Regression Semester 1, 2018 29 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread III
• Correctly, this is the mean value of the values of the
random variable function. But this is a bit cumbersome,
so we will just say the “mean value of the r.v.” For
discrete r.v.’s this is:
$E(X) = \sum_i x_i\, p(X = x_i)$
Variance. This is calculated as follows:
• Calculate the total T and the sum of squares of N
observations. The estimate of the standard deviation is
$s = \sqrt{\frac{1}{N-1}\sum_i (x_i - m)^2}$
• Again, this is a very good estimate when the data are
modelled by a normal distribution
COMP9417 ML & DM Regression Semester 1, 2018 30 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread IV
• For grouped data, this is modified to
$s = \sqrt{\frac{1}{N-1}\sum_i (x_i - m)^2 f_i}$
• Again, we have a similar formula in terms of expected
values, for the scatter (spread) of values of a r.v. X
around a mean value E(X):
$\mathrm{Var}(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$
• You can remember this as “the mean of the squares
minus the square of the mean”
COMP9417 ML & DM Regression Semester 1, 2018 31 / 99
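A small sketch, on made-up observations, checking the sample mean and standard deviation formulas above against NumPy's built-ins (ddof=1 gives the $N-1$ denominator).

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])   # made-up observations

N = len(x)
m = x.sum() / N                                 # sample mean, m = T / N
s = np.sqrt(((x - m) ** 2).sum() / (N - 1))     # sample standard deviation (N-1 denominator)

# The same quantities via NumPy built-ins (ddof=1 gives the N-1 denominator).
assert np.isclose(m, np.mean(x))
assert np.isclose(s, np.std(x, ddof=1))
print("mean:", m, "std:", s)
```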
Statistical Techniques for Data Analysis Bias-Variance Decomposition
Bias-Variance Decomposition
COMP9417 ML & DM Regression Semester 1, 2018 32 / 99
Statistical Techniques for Data Analysis Bias-Variance Decomposition
The Bias-Variance Tradeoff
• When comparing unbiased estimators, we would like to select the one
with minimum variance
• In general, we would be comparing estimators that have some bias
and some variance
• We can combine the bias and variance of an estimator by obtaining
the mean square error of the estimator, or MSE. This is the average
value of squared deviations of an estimated value V from the true
value of the parameter θ. That is:
$\mathrm{MSE} = \text{average value of } (V - \theta)^2$
• Now, it can be shown that:
$\mathrm{MSE} = \text{variance} + \text{bias}^2$
• If, as sample size increases, the bias and the variance of an estimator approach 0, then the estimator is said to be consistent.
COMP9417 ML & DM Regression Semester 1, 2018 33 / 99
Statistical Techniques for Data Analysis Bias-Variance Decomposition
The Bias-Variance Tradeoff
• Since
$\mathrm{MSE} = \text{variance} + \text{bias}^2$
the lowest possible value of MSE is 0
• In general, we may not be able to get to the ideal MSE of 0.
Sampling theory tells us the minimum value of the variance of an
estimator. This value is known as the Cramer-Rao bound. So, given
an estimator with bias b, we can calculate the minimum value of the
variance of the estimator using the CR bound (say, vmin). Then:
$\mathrm{MSE} \ge v_{min} + b^2$
The value of $v_{min}$ depends on whether the estimator is biased or unbiased (that is, $b = 0$ or $b \neq 0$)
• It is not the case that vmin for an unbiased (b = 0) estimator is less
than vmin for a biased estimator. So, the MSE of a biased estimator
can end up being lower than the MSE of an unbiased estimator.
COMP9417 ML & DM Regression Semester 1, 2018 34 / 99
Statistical Techniques for Data Analysis Bias-Variance Decomposition
Decomposition of MSE
$\mathrm{MSE} = \text{variance} + \text{bias}^2$
Imagine testing the prediction of our estimator ŷ on many samples of the
same size drawn at random from the same distribution. We compute error
based on the squared difference between predicted and actual values.
Then the MSE can be decomposed like this:
$\mathrm{MSE} = E[(\hat{y} - f(x))^2] = E[(\hat{y} - E(\hat{y}))^2] + [E(\hat{y}) - f(x)]^2$
Note that the first term in the error decomposition (variance) does not
refer to the actual value at all, although the second term (bias) does.
COMP9417 ML & DM Regression Semester 1, 2018 35 / 99
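The decomposition can be checked empirically. The sketch below, with a made-up population and a deliberately biased estimator of its mean, repeatedly draws samples of the same size and compares the estimator's MSE with its variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 5.0            # the parameter theta we are estimating (made-up)
sample_size = 20
n_repeats = 100_000

# A deliberately biased estimator of the mean: shrink the sample mean towards 0.
estimates = np.array([0.9 * rng.normal(true_mean, 2.0, sample_size).mean()
                      for _ in range(n_repeats)])

mse = np.mean((estimates - true_mean) ** 2)
variance = np.var(estimates)
bias = np.mean(estimates) - true_mean

print("MSE:", mse)
print("variance + bias^2:", variance + bias ** 2)  # should closely match MSE
```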
Statistical Techniques for Data Analysis Correlation
Correlation
COMP9417 ML & DM Regression Semester 1, 2018 36 / 99
Statistical Techniques for Data Analysis Correlation
Correlation I
• The correlation coefficient is a number between -1 and +1 that
indicates whether a pair of variables x and y are associated or not,
and whether the scatter in the association is high or low
• A value close to +1 indicates that high values of x are associated with high values of y and low values of x with low values of y, and scatter is low
• A value near 0 indicates that there is no particular association and that
there is a large scatter associated with the values
• A value close to -1 suggests an inverse association between x and y
• Only appropriate when x and y are roughly linearly associated
(doesn’t work well when the association is curved)
• The formula for computing correlation between x and y is:
$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)}\sqrt{\mathrm{var}(y)}}$
This is sometimes also called Pearson’s correlation coefficient
COMP9417 ML & DM Regression Semester 1, 2018 37 / 99
Statistical Techniques for Data Analysis Correlation
Correlation II
• The terms in the denominator are simply the standard deviations of x
and y. But the numerator is different. This is calculated as the
average of the product of deviations from the mean:
$\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• What does "covariance" mean? Consider
1 Case 1: $x_i > \bar{x}$, $y_i > \bar{y}$
2 Case 2: $x_i < \bar{x}$, $y_i < \bar{y}$
3 Case 3: $x_i < \bar{x}$, $y_i > \bar{y}$
4 Case 4: $x_i > \bar{x}$, $y_i < \bar{y}$
In the first two cases, xi and yi vary together, both being high or low
relative to their means. In the other two cases, they vary in different
directions
COMP9417 ML & DM Regression Semester 1, 2018 38 / 99
Statistical Techniques for Data Analysis Correlation
Correlation III
• If the positive products dominate in the calculation of cov(x, y), then
the value of r will be positive. If the negative products dominate, then
r will be negative. If 0 products dominate, then r will be close to 0.
• You should be able to show that:
$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}$
• Computers generally use a short-cut formula for the covariance:
$\mathrm{cov}(x, y) = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{n - 1}$
• The same kinds of calculations can be done if the data were not
actual values but ranks instead (i.e. ranks for the x’s and the y’s).
• This is called Spearman’s rank correlation, but we won’t do these
calculations here.
COMP9417 ML & DM Regression Semester 1, 2018 39 / 99
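A short sketch computing Pearson's correlation coefficient from the covariance and standard deviations on made-up data, and checking it against NumPy's np.corrcoef.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # made-up, roughly linear in x

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print("r (from definition):", r)
print("r (np.corrcoef):   ", np.corrcoef(x, y)[0, 1])   # should agree
```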
Statistical Techniques for Data Analysis Correlation
What Happens If You Sample? I
• Suppose you have a sample of 〈x, y〉 pairs and you calculate r = 0.3. Is this really the correlation in the population?
• Sampling theory tells us something. If: (a) the relative frequencies
observed are well modelled by a special kind of mathematical function
(a “Normal” or Gaussian distribution); (b) the true correlation is 0;
and (c) the number of samples is large
• Then:
• The sampling distribution of the correlation coefficient (that is, how r
varies from sample to sample) is also approximately distributed
according to the Normal distribution with mean 0 and standard error (s.e.) of approximately $1/\sqrt{n}$
• We can use this to calculate the (approximate) probability of
obtaining the sample if the assumptions were true
COMP9417 ML & DM Regression Semester 1, 2018 40 / 99
Statistical Techniques for Data Analysis Correlation
What Happens If You Sample? II
• Suppose we calculate r = 0.3 from the sample, and that the s.e. is 0.1,
say. Then if the sample came from a population with true correlation
0, this would be quite unusual (less than 1% chance)
• We would say instead that the sample was probably from a population
with correlation 0.3, with a 95% confidence interval of ±2× 0.1
COMP9417 ML & DM Regression Semester 1, 2018 41 / 99
Statistical Techniques for Data Analysis Correlation
What Does Correlation Mean? I
• r is a quick way of checking whether there is some linear association
between x and y
• The sign of the value tells you the direction of the association
• All that the numerical value tells you is about the scatter in the data
• The correlation coefficient does not model any relationship. That is,
given a particular x you cannot use the r value to calculate a y value
• It is possible for two datasets to have the same correlation, but
different relationships
• It is possible for two datasets to have different correlations but the
same relationship
• MORAL: Do not use correlations to compare datasets. All you can
derive is whether there is a positive or negative relationship between x
and y
• ANOTHER MORAL: Do not use correlation to imply x causes y or
the other way around
COMP9417 ML & DM Regression Semester 1, 2018 42 / 99
Statistical Techniques for Data Analysis Regression
Regression
COMP9417 ML & DM Regression Semester 1, 2018 43 / 99
Statistical Techniques for Data Analysis Regression
Regression
• Given a set of data points xi, yi, what is the relationship between
them? (We can generalise this to the “multivariate” case later)
• One kind of question is to ask: are these linearly related in some
manner? That is, can we draw a straight line that describes
reasonably well the relationship between X and Y
• Remember, the correlation coefficient can tell us if there is a case for
such a relationship
• In real life, even if such a relationship held, it would be unreasonable to expect all pairs $x_i, y_i$ to lie precisely on a straight line. Instead, we can probably draw some reasonably well-fitting line. But which one?
COMP9417 ML & DM Regression Semester 1, 2018 44 / 99
Univariate linear regression
Univariate linear regression
COMP9417 ML & DM Regression Semester 1, 2018 45 / 99
Univariate linear regression
Linear Relationship Between 2 Variables I
• GOAL: fit a line whose equation is of the form $\hat{Y} = a + bX$
• HOW: minimise $\sum_i d_i^2 = \sum_i (Y_i - \hat{Y}_i)^2$ (the "least squares estimator")
COMP9417 ML & DM Regression Semester 1, 2018 46 / 99
Univariate linear regression
Linear Relationship Between 2 Variables II
• The calculation for b is given by:
$b = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$
where $\mathrm{cov}(x, y)$ is the covariance of $x$ and $y$, given by $\sum_i (x_i - \bar{x})(y_i - \bar{y})$ as before
• This can be simplified to:
$b = \sum (xy) / \sum x^2$
where $x = (X_i - \bar{X})$ and $y = (Y_i - \bar{Y})$
• $a = \bar{Y} - b\bar{X}$
COMP9417 ML & DM Regression Semester 1, 2018 47 / 99
Univariate linear regression
Meaning of the Coefficients a and b
• b: change in Y that accompanies a unit change in X
• If the values of X were assigned at random, then b estimates the unit
change in Y caused by a unit change in X
• If the values of X were not assigned at random (for example, they were data somebody observed), then the change in Y will include the change in X and any other confounding variables that may have changed as a result of changing X by 1 unit. So, you cannot say, for example, that a change of X by 1 unit causes b units of change in Y
• b = 0 means there is no linear relationship between X and Y, and then the best we can do is simply say $\hat{Y} = a = \bar{Y}$. Estimating the sample mean is therefore a special case of the MSE criterion
COMP9417 ML & DM Regression Semester 1, 2018 48 / 99
Univariate linear regression
The Regression Model I
• The least-square estimator fits a line using sample data
• To draw inferences about the population requires us to have a
(statistical) model about what this line means
• What is being assumed is actually this:
COMP9417 ML & DM Regression Semester 1, 2018 49 / 99
Univariate linear regression
The Regression Model II
• That is: Obtain Y values for many instances of X1. This will result in
a distribution of Y values P (Y |X1); and so on for
P (Y |X2), P (Y |X3), etc.. The regression model makes the following
assumptions:
• All the Y distributions are the same, and have the same spread
• For each P (Y |Xi) distribution, the true mean value µi lies on a
straight line (this is the “true regression line”)
• The Yi are independent
• In standard terminology, the $Y_i$ are independent and identically distributed (i.i.d.) random variables with mean $\mu_i = \alpha + \beta X_i$ and variance $\sigma^2$
• Or: $Y_i = \alpha + \beta X_i + e_i$ where the $e_i$ are independent errors with mean 0 and variance $\sigma^2$
COMP9417 ML & DM Regression Semester 1, 2018 50 / 99
Univariate linear regression
How Good is the Least-Squares Estimator I
• The line fitted using the least-squares criterion is a sample-based
estimate of the true regression line
• To know how good this estimate is, we are really asking questions
about the bias and variance of the estimates of a and b
• It can be shown that under some assumptions, the least-square
estimates of a and b will be unbiased and that they will have the
lowest variance
• The proof of this is called the Gauss-Markov theorem. The
Gauss-Markov theorem makes the following assumptions:
1 The expected (average) value of the residuals is 0 ($E(e_i) = 0$)
2 The spread of the residuals is constant for all $X_i$ ($\mathrm{Var}(e_i) = \sigma^2$)
3 There is no relationship amongst the residuals ($\mathrm{cov}(e_i, e_j) = 0$ for $i \neq j$)
4 There is no relationship between the residuals and the $X_i$ ($\mathrm{cov}(X_i, e_i) = 0$)
COMP9417 ML & DM Regression Semester 1, 2018 51 / 99
Univariate linear regression
How Good is the Least-Squares Estimator II
• If these assumptions hold, then the Gauss-Markov theorem shows that $E(a) = \alpha$, $E(b) = \beta$, and that these estimates have the lowest variance among all linear unbiased estimators
• There is a special case of the assumptions that arises when the
residuals are assumed to be distributed according to the Normal
distribution, with mean 0
• In this case, minimising least-squares is equivalent to maximising the
probability of the Yi, given the Xi (that is, least-squares is equivalent
to maximum likelihood estimation)
• More on this in a later lecture
COMP9417 ML & DM Regression Semester 1, 2018 52 / 99
Univariate linear regression
Univariate linear regression
Example:
Suppose we want to investigate the relationship between people’s height
and weight. We collect n height and weight measurements
(hi, wi), 1 ≤ i ≤ n.
Univariate linear regression assumes a linear equation $w = a + bh$, with parameters $a$ and $b$ chosen such that the sum of squared residuals $\sum_{i=1}^{n}(w_i - (a + bh_i))^2$ is minimised.
COMP9417 ML & DM Regression Semester 1, 2018 53 / 99
Univariate linear regression
Univariate linear regression
In order to find the parameters we take partial derivatives, set the partial
derivatives to 0 and solve for a and b:
$\frac{\partial}{\partial a}\sum_{i=1}^{n}(w_i - (a + bh_i))^2 = -2\sum_{i=1}^{n}(w_i - (a + bh_i)) = 0 \;\Rightarrow\; \hat{a} = \bar{w} - \hat{b}\bar{h}$

$\frac{\partial}{\partial b}\sum_{i=1}^{n}(w_i - (a + bh_i))^2 = -2\sum_{i=1}^{n}(w_i - (a + bh_i))h_i = 0 \;\Rightarrow\; \hat{b} = \frac{\sum_{i=1}^{n}(h_i - \bar{h})(w_i - \bar{w})}{\sum_{i=1}^{n}(h_i - \bar{h})^2}$

So the solution found by linear regression is $w = \hat{a} + \hat{b}h = \bar{w} + \hat{b}(h - \bar{h})$.
COMP9417 ML & DM Regression Semester 1, 2018 54 / 99
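A minimal sketch of the closed-form solution just derived, applied to simulated height/weight data (the numbers are generated here and are not the course's example data).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate height (cm) and weight (kg) roughly like the slide's example:
# a true linear model plus normally distributed noise (numbers are illustrative).
h = rng.uniform(150, 200, size=10)
w = -75 + 0.83 * h + rng.normal(0, 5, size=10)

# Closed-form least-squares estimates from the derivation above.
b_hat = np.sum((h - h.mean()) * (w - w.mean())) / np.sum((h - h.mean()) ** 2)
a_hat = w.mean() - b_hat * h.mean()

print("estimated slope b:", b_hat)
print("estimated intercept a:", a_hat)
```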
Univariate linear regression
Univariate linear regression
[Figure: scatter plot of body weight (kg, 40–90 on the y-axis) against body height (cm, 140–200 on the x-axis) with the fitted regression line; described on the next slide.]
COMP9417 ML & DM Regression Semester 1, 2018 55 / 99
Univariate linear regression
Univariate linear regression
Shown on previous slide:
The red solid line indicates the result of applying linear regression to 10 measurements of body weight (on the y-axis, in kilograms) against body height (on the x-axis, in centimetres). The orange dotted lines indicate the average height $\bar{h} = 181$ and the average weight $\bar{w} = 74.5$; the regression coefficient $\hat{b} = 0.78$. The measurements were simulated by adding normally distributed noise with mean 0 and variance 5 to the true model indicated by the blue dashed line ($b = 0.83$).
COMP9417 ML & DM Regression Semester 1, 2018 56 / 99
Univariate linear regression
Linear regression: intuitions
For a feature x and a target variable y, the regression coefficient is the
covariance between x and y in proportion to the variance of x:
$\hat{b} = \frac{\sigma_{xy}}{\sigma_{xx}}$
(Here we use $\sigma_{xx}$ as an alternative notation for $\sigma_x^2$.)
This can be understood by noting that the covariance is measured in units
of x times units of y (e.g., metres times kilograms above) and the variance
in units of x squared (e.g., metres squared), so their quotient is measured
in units of y per unit of x (e.g., kilograms per metre).
COMP9417 ML & DM Regression Semester 1, 2018 57 / 99
Univariate linear regression
Linear regression: intuitions
The intercept $\hat{a}$ is such that the regression line goes through $(\bar{x}, \bar{y})$.
Adding a constant to all x-values (a translation) will affect only the intercept but not the regression coefficient (since it is defined in terms of deviations from the mean, which are unaffected by a translation).
So we could zero-centre the x-values by subtracting $\bar{x}$, in which case the intercept is equal to $\bar{y}$.
We could even subtract $\bar{y}$ from all y-values to achieve a zero intercept, without changing the problem in an essential way.
COMP9417 ML & DM Regression Semester 1, 2018 58 / 99
Univariate linear regression
Linear regression: intuitions
Suppose we replace $x_i$ with $x'_i = x_i/\sigma_{xx}$ and likewise $\bar{x}$ with $\bar{x}' = \bar{x}/\sigma_{xx}$; then we have that $\hat{b} = \frac{1}{n}\sum_{i=1}^{n}(x'_i - \bar{x}')(y_i - \bar{y}) = \sigma_{x'y}$.
In other words, if we normalise $x$ by dividing all its values by $x$'s variance, we can take the covariance between the normalised feature and the target variable as the regression coefficient.
This demonstrates that univariate linear regression can be understood as
consisting of two steps:
1 normalisation of the feature by dividing its values by the feature’s
variance;
2 calculating the covariance of the target variable and the normalised
feature.
COMP9417 ML & DM Regression Semester 1, 2018 59 / 99
Univariate linear regression
Linear regression: intuitions
Another important point to note is that the sum of the residuals of the
least-squares solution is zero:
$\sum_{i=1}^{n}\left(y_i - (\hat{a} + \hat{b}x_i)\right) = n(\bar{y} - \hat{a} - \hat{b}\bar{x}) = 0$
The result follows because $\hat{a} = \bar{y} - \hat{b}\bar{x}$, as derived above.
While this property is intuitively appealing, it is worth keeping in mind
that it also makes linear regression susceptible to outliers: points that are
far removed from the regression line, often because of measurement errors.
COMP9417 ML & DM Regression Semester 1, 2018 60 / 99
Univariate linear regression
The effect of outliers
[Figure: the same weight-against-height scatter plot, now with one point moved up by 10 kg and the resulting shift in the least-squares regression line; described on the next slide.]
COMP9417 ML & DM Regression Semester 1, 2018 61 / 99
Univariate linear regression
The effect of outliers
Shown on previous slide:
Suppose that, as the result of a transcription error, one of the weight
values from the previous example of univariate regression is increased by
10 kg. The diagram shows that this has a considerable effect on the
least-squares regression line.
Specifically, we see that one of the blue points got moved up 10 units to
the green point, changing the red regression line to the green line.
COMP9417 ML & DM Regression Semester 1, 2018 62 / 99
Univariate linear regression
Least-Squares as Cost Minimization I
• Finding the least-squares solution is in effect finding the values of $a$ and $b$ that minimize $\sum_i d_i^2 = \sum_i (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i = a + bX_i$
• This minimum value was obtained analytically by the usual process of differentiating and equating to 0
• A numerical alternative to the analytical approach is to take (small) steps that decrease the value of the function to be minimised, and stop when we reach a minimum
• Recall that at a point the gradient vector points in the direction of
greatest increase of a function. So, the opposite direction to the
gradient vector gives the direction of greatest decrease
• $b_{i+1} = b_i - \eta \times g_b$
• $a_{i+1} = a_i - \eta \times g_a$
• Stop when $b_{i+1} \approx b_i$ and $a_{i+1} \approx a_i$
• More on this in a later lecture
COMP9417 ML & DM Regression Semester 1, 2018 63 / 99
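A minimal gradient-descent sketch for the univariate case, following the update rules above; the data, learning rate η and stopping tolerance are illustrative choices, not prescribed values.

```python
import numpy as np

# Made-up data roughly following y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.2, 7.9, 11.1, 14.0])

a, b = 0.0, 0.0          # initial guesses
eta = 0.01               # learning rate (hypothetical choice)
tol = 1e-8

for _ in range(100_000):
    residuals = y - (a + b * x)
    # Gradients of the sum of squared residuals w.r.t. a and b.
    g_a = -2.0 * residuals.sum()
    g_b = -2.0 * (residuals * x).sum()
    a_new, b_new = a - eta * g_a, b - eta * g_b
    if abs(a_new - a) < tol and abs(b_new - b) < tol:   # stop when updates are tiny
        a, b = a_new, b_new
        break
    a, b = a_new, b_new

print("a:", a, "b:", b)   # should be close to the analytical least-squares solution
```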
Multivariate linear regression
Multivariate linear regression
COMP9417 ML & DM Regression Semester 1, 2018 64 / 99
Multivariate linear regression
Many variables
• Often, we are interested in modelling the relationship of Y to several other variables
• In observational studies, the value of Y may be affected by the values
of several variables. For example, carcinogenicity may be
gender-specific. A regression model that ignores gender may find carcinogenicity to be related to some surrogate variable (height, for example)
• Including more variables can give a narrower confidence interval on
the prediction being made
COMP9417 ML & DM Regression Semester 1, 2018 65 / 99
Multivariate linear regression
Multivariate linear model
• The $Y_i$ are identically distributed independent variables with mean $\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$ and variance $\sigma^2$
• Or: $Y_i = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + e_i$ where the $e_i$ are independent errors with mean 0 and variance $\sigma^2$
• As before, this linear model is estimated from a sample by the equation $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$
• With many variables, the regression equation and expressions for the
bi are expressed better using a matrix representation for sets of
equations.
COMP9417 ML & DM Regression Semester 1, 2018 66 / 99
Multivariate linear regression
Multivariate linear regression *
First, we need the covariances between every feature and the target
variable:
$(\mathbf{X}^T\mathbf{y})_j = \sum_{i=1}^{n} x_{ij} y_i = \sum_{i=1}^{n}(x_{ij} - \mu_j)(y_i - \bar{y}) + n\mu_j\bar{y} = n(\sigma_{jy} + \mu_j\bar{y})$
Assuming for the moment that every feature is zero-centred, we have $\mu_j = 0$ and thus $\mathbf{X}^T\mathbf{y}$ is a d-vector holding all the required covariances (times $n$).
We can normalise the features by means of a d-by-d scaling matrix: a diagonal matrix with diagonal entries $1/(n\sigma_{jj})$. If $\mathbf{S}$ is a diagonal matrix with diagonal entries $n\sigma_{jj}$, we can get the required scaling matrix by simply inverting $\mathbf{S}$.
So our first stab at a solution for the multivariate regression problem is $\hat{\mathbf{w}} = \mathbf{S}^{-1}\mathbf{X}^T\mathbf{y}$
COMP9417 ML & DM Regression Semester 1, 2018 67 / 99
Multivariate linear regression
Multivariate linear regression *
The general case requires a more elaborate matrix instead of $\mathbf{S}$:
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad (1)$
Let us try to understand the term $(\mathbf{X}^T\mathbf{X})^{-1}$ a bit better.
• Assuming the features are uncorrelated, the covariance matrix $\mathbf{\Sigma}$ is diagonal with entries $\sigma_{jj}$.
• Assuming the features are zero-centred, $\mathbf{X}^T\mathbf{X} = n\mathbf{\Sigma}$ is also diagonal with entries $n\sigma_{jj}$.
• In other words, assuming zero-centred and uncorrelated features, $(\mathbf{X}^T\mathbf{X})^{-1}$ reduces to our scaling matrix $\mathbf{S}^{-1}$.
In the general case we cannot make any assumptions about the features, and $(\mathbf{X}^T\mathbf{X})^{-1}$ acts as a transformation that decorrelates, centres and normalises the features.
COMP9417 ML & DM Regression Semester 1, 2018 68 / 99
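A small NumPy sketch evaluating equation (1) on made-up data; it solves the linear system rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent.

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 100, 3
X = rng.normal(size=(n, d))                       # made-up feature matrix
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=n)    # targets from a known linear model

# Equation (1): w_hat = (X^T X)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated weights:", w_hat)                # close to [1.5, -2.0, 0.5]
```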
Multivariate linear regression
Bivariate linear regression *
First, we derive the basic expressions.

$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{n1} \\ x_{12} & \cdots & x_{n2} \end{pmatrix} \begin{pmatrix} x_{11} & x_{12} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix} = n \begin{pmatrix} \sigma_{11} + \bar{x}_1^2 & \sigma_{12} + \bar{x}_1\bar{x}_2 \\ \sigma_{12} + \bar{x}_1\bar{x}_2 & \sigma_{22} + \bar{x}_2^2 \end{pmatrix}$

$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{nD} \begin{pmatrix} \sigma_{22} + \bar{x}_2^2 & -\sigma_{12} - \bar{x}_1\bar{x}_2 \\ -\sigma_{12} - \bar{x}_1\bar{x}_2 & \sigma_{11} + \bar{x}_1^2 \end{pmatrix}, \qquad D = (\sigma_{11} + \bar{x}_1^2)(\sigma_{22} + \bar{x}_2^2) - (\sigma_{12} + \bar{x}_1\bar{x}_2)^2$

$\mathbf{X}^T\mathbf{y} = \begin{pmatrix} x_{11} & \cdots & x_{n1} \\ x_{12} & \cdots & x_{n2} \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = n \begin{pmatrix} \sigma_{1y} + \bar{x}_1\bar{y} \\ \sigma_{2y} + \bar{x}_2\bar{y} \end{pmatrix}$
COMP9417 ML & DM Regression Semester 1, 2018 69 / 99
Multivariate linear regression
Bivariate linear regression *
We now consider two special cases. The first is that $\mathbf{X}$ is in homogeneous coordinates, i.e., we are really dealing with a univariate problem. In that case we have $x_{i1} = 1$ for $1 \le i \le n$; $\bar{x}_1 = 1$; and $\sigma_{11} = \sigma_{12} = \sigma_{1y} = 0$. We then obtain (we write $x$ instead of $x_2$, $\sigma_{xx}$ instead of $\sigma_{22}$ and $\sigma_{xy}$ instead of $\sigma_{2y}$):

$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{n\sigma_{xx}} \begin{pmatrix} \sigma_{xx} + \bar{x}^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix} \qquad \mathbf{X}^T\mathbf{y} = n \begin{pmatrix} \bar{y} \\ \sigma_{xy} + \bar{x}\bar{y} \end{pmatrix}$

$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \frac{1}{\sigma_{xx}} \begin{pmatrix} \sigma_{xx}\bar{y} - \sigma_{xy}\bar{x} \\ \sigma_{xy} \end{pmatrix}$

This is the same result as obtained for the univariate case.
COMP9417 ML & DM Regression Semester 1, 2018 70 / 99
Multivariate linear regression
Bivariate linear regression *
The second special case we consider is where we assume $x_1$, $x_2$ and $y$ to be zero-centred, which means that the intercept is zero and $\mathbf{w}$ contains the two regression coefficients. In this case we obtain

$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{n(\sigma_{11}\sigma_{22} - \sigma_{12}^2)} \begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{pmatrix} \qquad \mathbf{X}^T\mathbf{y} = n \begin{pmatrix} \sigma_{1y} \\ \sigma_{2y} \end{pmatrix}$

$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^2} \begin{pmatrix} \sigma_{22}\sigma_{1y} - \sigma_{12}\sigma_{2y} \\ \sigma_{11}\sigma_{2y} - \sigma_{12}\sigma_{1y} \end{pmatrix}$

The last expression shows, e.g., that the regression coefficient for $x_1$ may be non-zero even if $x_1$ doesn't correlate with the target variable ($\sigma_{1y} = 0$), on account of the correlation between $x_1$ and $x_2$ ($\sigma_{12} \neq 0$).
COMP9417 ML & DM Regression Semester 1, 2018 71 / 99
Regularisation
Regularisation
COMP9417 ML & DM Regression Semester 1, 2018 72 / 99
Regularisation
Parameter Estimation by Optimization I
Regularisation is a general method to avoid overfitting by applying
additional constraints to the weight vector. A common approach is to
make sure the weights are, on average, small in magnitude: this is referred
to as shrinkage.
Recall the setting for regression in terms of cost minimization.
• Can add penalty terms to a cost function, forcing coefficients to shrink towards zero
$Y = f_{\theta_0, \theta_1, \ldots, \theta_n}(X_1, X_2, \ldots, X_n) = f_\theta(X)$
COMP9417 ML & DM Regression Semester 1, 2018 73 / 99
Regularisation
Parameter Estimation by Optimization II
• MSE as a cost function, given data $(x_1, y_1), \ldots, (x_n, y_n)$:
$\mathrm{Cost}(\theta) = \frac{1}{n}\sum_i (f_\theta(x_i) - y_i)^2$
and with a penalty function:
$\mathrm{Cost}(\theta) = \frac{1}{n}\sum_i (f_\theta(x_i) - y_i)^2 + \frac{\lambda}{n}\sum_{i=1}^{n} \theta_i^2$
• Parameter estimation by optimisation will attempt to find values for $\theta_0, \theta_1, \ldots, \theta_n$ such that $\mathrm{Cost}(\theta)$ is a minimum
• It will be easier to take the $\frac{1}{n}$ term as $\frac{1}{2n}$, which will not affect the minimisation
COMP9417 ML & DM Regression Semester 1, 2018 74 / 99
Regularisation
Parameter Estimation by Optimization III
• Using gradient descent with the penalty function will do two things: (a) we will move each $\theta_j$ in a direction that minimises the cost; and (b) each value of $\theta_j$ will also get "shrunk" on each iteration by multiplying the old value by an amount $< 1$:
$\theta_j^{(i+1)} = \alpha\theta_j^{(i)} - \eta\nabla_{\theta_j}$, where $\alpha < 1$
COMP9417 ML & DM Regression Semester 1, 2018 75 / 99
Regularisation
Regularised regression
The multivariate least-squares regression problem can be written as an optimisation problem:
$\mathbf{w}^* = \arg\min_{\mathbf{w}} (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})$
The regularised version of this optimisation is then as follows:
$\mathbf{w}^* = \arg\min_{\mathbf{w}} (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda\|\mathbf{w}\|^2$
where $\|\mathbf{w}\|^2 = \sum_i w_i^2$ is the squared norm of the vector $\mathbf{w}$, or, equivalently, the dot product $\mathbf{w}^T\mathbf{w}$; $\lambda$ is a scalar determining the amount of regularisation.
COMP9417 ML & DM Regression Semester 1, 2018 76 / 99
Regularisation
Regularised regression
This regularised problem still has a closed-form solution:
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$
where $\mathbf{I}$ denotes the identity matrix. Regularisation amounts to adding $\lambda$ to the diagonal of $\mathbf{X}^T\mathbf{X}$, a well-known trick to improve the numerical stability of matrix inversion. This form of least-squares regression is known as ridge regression.
An interesting alternative form of regularised regression is provided by the lasso, which stands for 'least absolute shrinkage and selection operator'. It replaces the ridge regularisation term $\sum_i w_i^2$ with the sum of absolute weights $\sum_i |w_i|$. The result is that some weights are shrunk, but others are set to 0, and so lasso regression favours sparse solutions.
COMP9417 ML & DM Regression Semester 1, 2018 77 / 99
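A sketch of the ridge closed form on made-up data, with an illustrative value of λ; the lasso has no closed-form solution and is usually fitted by iterative methods, so it is not shown here.

```python
import numpy as np

rng = np.random.default_rng(4)

n, d = 50, 5
X = rng.normal(size=(n, d))                         # made-up, zero-centred features
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=n)

lam = 1.0                                           # illustrative regularisation strength

# Ridge regression: w_hat = (X^T X + lambda I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares for comparison (lambda = 0).
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("OLS:  ", w_ols)
print("ridge:", w_ridge)   # shrunk towards zero relative to OLS
```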
Some further issues in learning linear regression models
Some further issues in learning linear regression models
COMP9417 ML & DM Regression Semester 1, 2018 78 / 99
Some further issues in learning linear regression models
What do the Coefficients bi Mean?
• Consider the two equations:
$\hat{Y} = a + bX$
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
• b: change in Y that accompanies a unit change in X
• b1: change in Y that accompanies a unit change in X1 provided X2
remains constant
• More generally, bi (i > 0) is the change in Y that accompanies a unit
change in Xi provided all other X’s are constant
• So: if all relevant variables are included, then we can assess the effect
of each one in a controlled manner
COMP9417 ML & DM Regression Semester 1, 2018 79 / 99
Some further issues in learning linear regression models
Categoric Variables: X’s I
• “Indicator” variables are those that take on the values 0 or 1
• They are used to include the effects of categoric variables. For example, let D be a variable that takes the value 1 if a patient takes a drug and 0 if the patient does not. Suppose you want to know the effect of drug D on blood pressure Y, keeping age (X) constant:
$\hat{Y} = 70 + 5D + 0.44X$
So, taking the drug (a unit change in D) makes a difference of 5
units, provided age is held constant
COMP9417 ML & DM Regression Semester 1, 2018 80 / 99
Some further issues in learning linear regression models
Categoric Variables: X’s II
• How do we capture any interaction effect between age and drug
intake? Introduce a new interaction variable $DX = D \times X$:
$\hat{Y} = 70 + 5D + 0.44X + 0.21DX$
COMP9417 ML & DM Regression Semester 1, 2018 81 / 99
Some further issues in learning linear regression models
Categoric Values: Y values
• Sometimes, Y values are simply one of two values (let’s call them 0
and 1)
• We can’t use the regression model as we described earlier, in which
the Y ’s can take any real value
• But we can define a new linear regression model which predicts not the value of $Y$, but what are called the log odds of $Y$:
$\text{log odds of } Y = \mathrm{Odds} = b_0 + b_1 X_1 + \cdots + b_n X_n$
• Once the Odds are estimated, they can be used to calculate the probability of $Y$:
$\Pr(Y = 1) = \frac{e^{\mathrm{Odds}}}{1 + e^{\mathrm{Odds}}}$
We can then use the value of $\Pr(Y = 1)$ to decide if $Y = 1$
• This procedure is called logistic regression (we’ll see this again)
COMP9417 ML & DM Regression Semester 1, 2018 82 / 99
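A small sketch of the second step only: converting estimated log odds to Pr(Y = 1) and thresholding. The coefficients are made up for illustration; estimating them requires maximum likelihood, which is covered later.

```python
import numpy as np

# Hypothetical fitted coefficients for the log-odds model:
# log-odds = b0 + b1*X1 + b2*X2  (numbers are made up for illustration).
b0, b1, b2 = -1.0, 0.8, 0.05

def prob_y_equals_1(x1, x2):
    log_odds = b0 + b1 * x1 + b2 * x2
    return np.exp(log_odds) / (1.0 + np.exp(log_odds))   # the logistic transform

p = prob_y_equals_1(x1=1.0, x2=10.0)
print("Pr(Y=1):", p)
print("predict Y=1" if p >= 0.5 else "predict Y=0")
```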
Some further issues in learning linear regression models
Is the Model Appropriate ? * I
COMP9417 ML & DM Regression Semester 1, 2018 83 / 99
Some further issues in learning linear regression models
Is the Model Appropriate ? * II
• The residuals from the regression line can be calculated numerically,
along with their mean, variance and standard deviation. It can be
shown that the residual standard deviation is related to the standard
deviation of the Y values in the following manner:
$\mathrm{rsd} = s_y\sqrt{1 - r^2}$
• This helps us understand how much the regression line helped reduce
the scatter of the y values (sy gives a measure of the scatter of y
values about the mean y; and rsd gives a measure of the scatter of y
values about the regression line)
• This also gives you another way of understanding the correlation
coefficient. With r = 0.9, the scatter about the regression line is still
almost 45% of the original scatter about the mean
COMP9417 ML & DM Regression Semester 1, 2018 84 / 99
Some further issues in learning linear regression models
Is the Model Appropriate ? * III
• If there is no systematic pattern to the residuals—that is, there are
approximately half of them that are positive and half that are
negative, then the line is a good fit
• It should also be the case that there should be no pattern to the
residual scatter all along the line. If the average size of the residuals
varies along the line (this condition is called heteroscedasticity) then
the relationship is probably more complex than a straight line
• Residuals from a well-fitting line should show an approximate
symmetric, bell-shaped frequency distribution with a mean of 0
COMP9417 ML & DM Regression Semester 1, 2018 85 / 99
Some further issues in learning linear regression models
Non-linear Relationships
• Sometimes, the linear model may be inappropriate
• Some non-linear relationships can be captured in a linear model by a transformation ("trick"). For example, the curved model $\hat{Y} = b_0 + b_1 X_1 + b_2 X_1^2$ can be transformed by $X_2 = X_1^2$ into a linear model. This works for polynomial relationships.
• Some other non-linear relationships may require more complicated transformations. For example, the relationship $Y = b_0 X_1^{b_1} X_2^{b_2}$ can be transformed into the linear relationship $\log(Y) = \log b_0 + b_1\log X_1 + b_2\log X_2$
• Other relationships cannot be transformed quite so easily, and will
require full non-linear estimation (in subsequent topics in the ML
course we will find out more about these)
COMP9417 ML & DM Regression Semester 1, 2018 86 / 99
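A sketch of the polynomial "trick" on made-up data: the squared feature is added as an extra column and the ordinary linear least-squares machinery is reused unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data following a curved relationship Y = 1 + 2*X - 0.5*X^2 plus noise.
x1 = rng.uniform(-3, 3, size=40)
y = 1 + 2 * x1 - 0.5 * x1**2 + rng.normal(scale=0.2, size=40)

# The "trick": introduce X2 = X1^2 as an extra column, then fit an ordinary linear model.
X = np.column_stack([np.ones_like(x1), x1, x1**2])
b = np.linalg.solve(X.T @ X, X.T @ y)

print("b0, b1, b2:", b)   # should be close to 1, 2, -0.5
```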
Some further issues in learning linear regression models
Non-Linear Relationships (contd.)
• Main difficulty with non-linear relationships is choice of function
• How to learn ?
• Can use a form of gradient descent to estimate the parameters
• After a point, almost any sufficiently complex mathematical function
will do the job in a sufficiently small range
• Some kind of prior knowledge or theory is the only way to help here.
• Otherwise, it becomes a process of trial-and-error, in which case,
beware of conclusions that can be drawn
COMP9417 ML & DM Regression Semester 1, 2018 87 / 99
Some further issues in learning linear regression models
Model Selection
• Suppose there are a lot of variables $X_i$, some of which may represent products, powers, etc.
• Taking all the Xi will lead to an overly complex model. There are 3
ways to reduce complexity:
1 Subset-selection, by search over subset lattice. Each subset results in a
new model, and the problem is one of model-selection
2 Shrinkage, or regularization of coefficients to zero, by optimization.
There is a single model, and unimportant variables have near-zero
coefficients.
3 Dimensionality-reduction, by projecting points into a lower dimensional
space (this is different to subset-selection, and we will look at it later)
COMP9417 ML & DM Regression Semester 1, 2018 88 / 99
Some further issues in learning linear regression models
Model Selection as Search I
• The subsets of the set of possible variables form a lattice with S1 ∩ S2
as the g.l.b. or meet and S1 ∪ S2 as the l.u.b. or join
• Each subset refers to a model, and a pair of subsets are connected if
they differ by just 1 element
• A lattice is a graph, and we know how to search a graph
• A∗, greedy, randomised, etc.
• "Cost" of a node in the graph: MSE of the model. The parameters (coefficients) of each candidate model can be found as before (e.g., by least squares)
• Historically, model-selection for regression has been done using
“forward-selection”, “backward-elimination”, or “stepwise” methods
• These are greedy search techniques that either: (a) start at the top of
the subset lattice, and add variables; (b) start at the bottom of the
subset lattice and remove variables; or (c) start at some interior point
and proceed by adding or removing single variables (examining nodes
connected to the node above or below)
COMP9417 ML & DM Regression Semester 1, 2018 89 / 99
Some further issues in learning linear regression models
Model Selection as Search II
• Greedy selection is done on the basis of calculating the coefficient of determination (often denoted by $R^2$), which denotes the proportion of total variation in the dependent variable $Y$ that is explained by the model
• Given a model formed with a subset of variables $X$, it is possible to compute the observed change in $R^2$ due to the addition or deletion of some variable $x$
• This is used to select greedily the next best move in the graph-search
To set other hyper-parameters, such as the shrinkage parameter $\lambda$, we can use grid search
COMP9417 ML & DM Regression Semester 1, 2018 90 / 99
Some further issues in learning linear regression models
Prediction I
• It is possible to quantify what happens if the regression line is used
for prediction:
• The intuition is this:
• Recall the regression line goes through the mean $(\bar{X}, \bar{Y})$
COMP9417 ML & DM Regression Semester 1, 2018 91 / 99
Some further issues in learning linear regression models
Prediction II
• If the $X_i$ are slightly different, then the mean is not going to change much. So, the regression line stays somewhat "fixed" at $(\bar{X}, \bar{Y})$ but with a different slope
• With each different sample of the $X_i$ we will get a slightly different regression line
• The variation in $Y$ values is greater the further we move from $(\bar{X}, \bar{Y})$
• MORAL: Be careful when predicting far away from the centre value
• ANOTHER MORAL: The model only works under approximately the same conditions that held when collecting the data
COMP9417 ML & DM Regression Semester 1, 2018 92 / 99
Local (nearest-neighbour) regression
Local (nearest-neighbour) regression
COMP9417 ML & DM Regression Semester 1, 2018 93 / 99
Local (nearest-neighbour) regression
Local learning
• Related to the simplest form of learning: rote learning or
memorization
• Training instances are searched for instance that most closely
resembles query or test instance
• The instances themselves represent the knowledge
• Called: nearest-neighbour, instance-based, memory-based or
case-based learning; all forms of local learning
• The similarity or distance function defines “learning”, i.e., how to go
beyond simple memorization
• Intuition — classify an instance similarly to examples “close by” —
neighbours or exemplars
• A form of lazy learning – don’t need to build a model!
COMP9417 ML & DM Regression Semester 1, 2018 94 / 99
Local (nearest-neighbour) regression
Nearest neighbour for numeric prediction
Store all training examples $\langle x_i, f(x_i)\rangle$.
Nearest neighbour:
• Given query instance $x_q$,
• first locate nearest training example $x_n$,
• then estimate $\hat{y} = \hat{f}(x_q) = f(x_n)$
k-Nearest neighbour:
• Given $x_q$, take mean of $f$ values of $k$ nearest neighbours:
$\hat{y} = \hat{f}(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}$
COMP9417 ML & DM Regression Semester 1, 2018 95 / 99
Local (nearest-neighbour) regression
Distance function
The distance function defines what is “learned”, i.e., predicted.
Instance xi is described by an m-vector of feature values:
〈xi1, xi2, . . . xim〉
where xik denotes the value of the kth feature of xi.
Most commonly used distance function is Euclidean distance, where the
distance between two instances xi and xj is defined to be:
$d(x_i, x_j) = \sqrt{\sum_{k=1}^{m}(x_{ik} - x_{jk})^2}$
COMP9417 ML & DM Regression Semester 1, 2018 96 / 99
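A minimal sketch of k-nearest-neighbour numeric prediction using the Euclidean distance above; the training data and the choice k = 3 are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """k-nearest-neighbour numeric prediction with Euclidean distance."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # d(x_i, x_q)
    nearest = np.argsort(dists)[:k]                           # indices of k closest examples
    return y_train[nearest].mean()                            # mean of their target values

# Made-up training data: y is roughly the sum of the two features.
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 3.0], [4.0, 3.5], [5.0, 5.0]])
y_train = np.array([2.1, 3.4, 6.1, 7.4, 10.2])

print(knn_predict(X_train, y_train, np.array([2.5, 2.0]), k=3))
```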
Local (nearest-neighbour) regression
Local regression
Use kNN to form a local approximation to f for each query point xq using
a linear function of the form
$\hat{f}(x) = b_0 + b_1 x_1 + \ldots + b_m x_m$
where xi denotes the value of the ith feature of instance x.
Where does this linear regression model come from ?
• fit linear function to k nearest neighbours
• or quadratic or higher-order polynomial . . .
• produces “piecewise approximation” to f
COMP9417 ML & DM Regression Semester 1, 2018 97 / 99
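A sketch of a simple (unweighted) variant of local regression: for each query point, fit a linear model to its k nearest neighbours and evaluate it at the query. The data and the value of k are illustrative.

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_query, k=5):
    """Fit an (unweighted) linear model to the k nearest neighbours of x_query."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # Local design matrix with an intercept column.
    A = np.column_stack([np.ones(k), X_train[nearest]])
    b, *_ = np.linalg.lstsq(A, y_train[nearest], rcond=None)  # local least squares
    return np.concatenate([[1.0], x_query]) @ b               # evaluate local model at query

rng = np.random.default_rng(6)
X_train = rng.uniform(0, 10, size=(50, 1))                    # made-up 1-D data
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=50)

print(local_linear_predict(X_train, y_train, np.array([3.0]), k=7))
```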
Summary
Summary
COMP9417 ML & DM Regression Semester 1, 2018 98 / 99
Summary
Summary
• Linear models give us a glimpse into many aspects of Machine
Learning
Terminology. Training data, test data, resubstitution error, prediction
error.
Conceptual. Learning as search, learning as optimisation,
assumptions underlying a technique
Implementation. Approximate alternatives to analytical methods
Application. Overfitting, problems of prediction
Each of these aspects will have counterparts in other kinds of
machine learning
• Linear models are one way to predict numerical quantities
• Ordinal regression: predicting ranks (not in the lectures)
• Neural networks: non-linear regression models (later)
• Regression trees: piecewise regression models (later)
• Class-probability trees: predicting probabilities (later)
• Model trees: piecewise non-linear models (later)
COMP9417 ML & DM Regression Semester 1, 2018 99 / 99
Aims
Introduction
Learning Linear Regression Models
Statistical Techniques for Data Analysis
Sampling
Estimation
Bias-Variance Decomposition
Correlation
Regression
Univariate linear regression
Multivariate linear regression
Regularisation
Some further issues in learning linear regression models
Local (nearest-neighbour) regression
Summary