Supervised Learning – Regression
Supervised Learning – Regression
COMP9417 Machine Learning and Data Mining
Last revision: 7 Mar 2018
COMP9417 ML & DM Regression Semester 1, 2018 1 / 99
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie,
R. Tibshirani & J. Friedman. Springer (2009)
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy
MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book
“Machine Learning” by P. Flach
Cambridge University Press (2012)
http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber
Cambridge University Press (2012)
http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book
“Machine Learning” by T. Mitchell
McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course
“Machine Learning” by A. Srinivasan
BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Regression Semester 1, 2018 2 / 99
Aims
Aims
This lecture will introduce you to machine learning approaches to the
problem of numerical prediction. Following it you should be able to
reproduce theoretical results, outline algorithmic techniques and describe
practical applications for the topics:
• the supervised learning task of numeric prediction
• how linear regression solves the problem of numeric prediction
• fitting linear regression by least squares error criterion
• non-linear regression via linear-in-the-parameters models
• parameter estimation for regression
• local (nearest-neighbour) regression
Note: slides with titles marked ∗ are for background only.
COMP9417 ML & DM Regression Semester 1, 2018 3 / 99
Introduction
Introduction
In the introductory lecture we saw supervised learning methods mostly for
classification, where the task is prediction of a discrete value for data
instances . . .
. . . however, we often find tasks where the most natural representation is
that of prediction of numeric values
COMP9417 ML & DM Regression Semester 1, 2018 4 / 99
Introduction
Introduction
Task: learn a model to predict CPU performance from a dataset of 209 examples of different computer configurations.
COMP9417 ML & DM Regression Semester 1, 2018 5 / 99
Introduction
Introduction
Result: a linear regression equation fitted to the CPU dataset.
PRP = −56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX
COMP9417 ML & DM Regression Semester 1, 2018 6 / 99
Introduction
Introduction
For the class of symbolic representations, machine learning is viewed as:
searching a space of hypotheses . . .
represented in a formal hypothesis language (trees, rules, graphs . . . ).
COMP9417 ML & DM Regression Semester 1, 2018 7 / 99
Introduction
Introduction
For the class of numeric representations, machine learning is viewed as:
“searching” a space of functions . . .
represented as mathematical models (linear equations, neural nets, . . . ).
Note: in both settings, the models may be probabilistic . . .
COMP9417 ML & DM Regression Semester 1, 2018 8 / 99
Introduction
Introduction
Methods to predict a numeric output from statistics and machine learning:
• linear regression (statistics) determining the “line of best fit” using
the least squares criterion
• linear models (machine learning) learning a predictive model from
data under the assumption of a linear relationship between predictor
and target variables
Very widely-used, many applications
Ideas that are generalised in Artificial Neural Networks
COMP9417 ML & DM Regression Semester 1, 2018 9 / 99
Introduction
Introduction
• non-linear regression by adding non-linear basis functions
• multi-layer neural networks (machine learning) learning non-linear
predictors via hidden nodes between input and output
• regression trees (statistics / machine learning) tree where each leaf
predicts a numeric quantity
• local (nearest-neighbour) regression
COMP9417 ML & DM Regression Semester 1, 2018 10 / 99
Learning Linear Regression Models
Learning Linear Regression Models
COMP9417 ML & DM Regression Semester 1, 2018 11 / 99
Learning Linear Regression Models
Regression
We will look at the simplest model for numerical prediction:
a regression equation
Outcome will be a linear sum of feature values with appropriate weights.
Note: the term regression is overloaded – it can refer to:
• the process of determining the weights for the regression equation, or
• the regression equation itself.
COMP9417 ML & DM Regression Semester 1, 2018 12 / 99
Learning Linear Regression Models
Linear Regression
Assumes: expected value of the output given an input, E[y|x], is linear.
Simplest case: Out(x) = bx for some unknown b.
Learning problem: given the data, estimate b (i.e., b̂).
COMP9417 ML & DM Regression Semester 1, 2018 13 / 99
Learning Linear Regression Models
Linear Models
• Numeric attributes and numeric prediction, i.e., regression
• Linear models, i.e. outcome is linear combination of attributes
$y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n$
• Weights are calculated from the training data
• Predicted value for the first training instance $x^{(1)}$ is:
$b_0 x^{(1)}_0 + b_1 x^{(1)}_1 + b_2 x^{(1)}_2 + \ldots + b_n x^{(1)}_n = \sum_{i=0}^{n} b_i x^{(1)}_i$
COMP9417 ML & DM Regression Semester 1, 2018 14 / 99
Learning Linear Regression Models
Minimizing Squared Error
Difference between predicted and actual values is the error !
The $n + 1$ coefficients are chosen so that the sum of squared error on all instances in the training data is minimized.
Squared error:
$\sum_{j=1}^{m} \left( y^{(j)} - \sum_{i=0}^{n} b_i x^{(j)}_i \right)^2$
Coefficients can be derived using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking).
Known as “Ordinary Least Squares” (OLS) regression – minimizing the
sum of squared distances of data points to the estimated regression line.
COMP9417 ML & DM Regression Semester 1, 2018 15 / 99
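As an illustration of the matrix computation, here is a minimal NumPy sketch (with made-up data, not the CPU dataset) that computes the OLS coefficients via the normal equations; in practice a library routine such as numpy.linalg.lstsq would normally be preferred.

```python
import numpy as np

# Toy training data: m = 5 instances, n = 2 attributes (made-up numbers).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

# Prepend a column of ones so that b0 (the intercept) is handled like any other weight.
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary Least Squares: choose b to minimise sum_j (y_j - sum_i b_i x_i^(j))^2.
# Closed form (normal equations): b = (X^T X)^{-1} X^T y.
b = np.linalg.solve(X1.T @ X1, X1.T @ y)

predictions = X1 @ b
squared_error = np.sum((y - predictions) ** 2)
print("coefficients b0..bn:", b)
print("sum of squared errors:", squared_error)
```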
Learning Linear Regression Models
Multiple Regression
Example: linear least squares fitting with 2 input variables.
COMP9417 ML & DM Regression Semester 1, 2018 16 / 99
Statistical Techniques for Data Analysis
Step back: Statistical Techniques for Data Analysis
COMP9417 ML & DM Regression Semester 1, 2018 17 / 99
Statistical Techniques for Data Analysis
Probability vs Statistics: The Difference
• Probability versus Statistics
• Probability: reasons from populations to samples
• This is deductive reasoning, and is usually sound (in the logical sense
of the word)
• Statistics: reasons from samples to populations
• This is inductive reasoning, and is usually unsound (in the logical sense
of the word)
COMP9417 ML & DM Regression Semester 1, 2018 18 / 99
Statistical Techniques for Data Analysis
Statistical Analyses
• Statistical analyses usually involve one of 3 things:
1 The study of populations;
2 The study of variation; and
3 Techniques for data abstraction and data reduction
• Statistical analysis is more than statistical computation:
1 What is the question to be answered?
2 Can it be quantitative (i.e., can we make measurements about it)?
3 How do we collect data?
4 What can the data tell us?
COMP9417 ML & DM Regression Semester 1, 2018 19 / 99
Statistical Techniques for Data Analysis Sampling
Sampling
COMP9417 ML & DM Regression Semester 1, 2018 20 / 99
Statistical Techniques for Data Analysis Sampling
Where do the Data come from? (Sampling)
• For groups (populations) that are fairly homogeneous, we do not need
to collect a lot of data. (We do not need to sip a cup of tea several
times to decide that it is too hot.)
• For populations which have irregularities, we will need to either take measurements of the entire group, or find some way of getting a good idea of the population without having to do so
• Sampling is a way to draw conclusions about the population without
having to measure all of the population. The conclusions need not be
completely accurate
• All this is possible if the sample closely resembles the population
about which we are trying to draw some conclusions
COMP9417 ML & DM Regression Semester 1, 2018 21 / 99
Statistical Techniques for Data Analysis Sampling
What We Want From a Sampling Method
• No systematic bias, or at least no bias that we cannot account for in
our calculations
• The chance of obtaining an unrepresentative sample can be
calculated. (So, if this chance is high, we can choose not to draw any
conclusions.)
• The chance of obtaining an unrepresentative sample decreases with
the size of the sample
COMP9417 ML & DM Regression Semester 1, 2018 22 / 99
Statistical Techniques for Data Analysis Sampling
Simple Random Sampling
• Each element of the population is associated with a number
• Shuffle all the numbers and put them into a hat
• Draw a sample of n numbers from the hat and get the corresponding
elements of the population
Usually, there are no hats 🙂 and we will be using a computer to generate
n numbers that are approximately random.
In addition, the computer will use a mathematical relationship between
elements of the population and the set of numbers. Inverting this
relationship using the n random numbers will then give the elements of
the population.
COMP9417 ML & DM Regression Semester 1, 2018 23 / 99
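A minimal sketch of simple random sampling in NumPy, under the assumption that the population is simply a set of numbered elements; the pseudo-random number generator plays the role of the hat.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # pseudo-random numbers stand in for the hat

population = np.arange(1, 1001)      # elements numbered 1..1000 (made-up population)
n = 20                               # sample size

# Draw n distinct numbers uniformly at random (sampling without replacement).
sample = rng.choice(population, size=n, replace=False)
print(sample)
```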
Statistical Techniques for Data Analysis Sampling
Probability Sampling
• In effect, numbers drawn using simple random sampling (in a single
stage or more) use a uniform probability distribution over the
numbers. That is, the probability of getting any number from 1 . . . n
from the hat is 1/n.
• A more general form of this is to use any kind of probability
distribution over 1 . . . n. For example, a distribution could make larger numbers more likely than smaller numbers. This is a skewed distribution
• For example, take a 2-stage sampling procedure in which households
are grouped according to size, and the probability of selecting larger
households is higher. A household is selected and then an individual is
selected from that household. This gives a greater chance of selecting
individuals from larger households
• Once again, it is relatively straightforward to do this form of
probability-based sampling using a computer
COMP9417 ML & DM Regression Semester 1, 2018 24 / 99
Statistical Techniques for Data Analysis Estimation
Estimation
COMP9417 ML & DM Regression Semester 1, 2018 25 / 99
Statistical Techniques for Data Analysis Estimation
Estimation from a Sample
• Estimating some aspect of the population using a sample is a
common task. Along with the estimate, we also want to have some
idea of the accuracy of the estimate (usually expressed in terms of
confidence limits)
• Some measures calculated from the sample are very good estimates of
corresponding population values. For example, the sample mean m is
a very good estimate of the population mean µ. But this is not
always the case. For example, the range of a sample usually
under-estimates the range of the population
• We will have to clarify what is meant by a “good estimate”. One
meaning is that an estimator is correct on average. For example, on
average, the mean of a sample is a good estimator of the mean of the
population
COMP9417 ML & DM Regression Semester 1, 2018 26 / 99
Statistical Techniques for Data Analysis Estimation
Estimation from a Sample
• For example, when a number of samples are drawn and the mean of each is found, then the average of these means is equal to the population mean
• Such an estimator is said to be statistically unbiased
COMP9417 ML & DM Regression Semester 1, 2018 27 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread I
Mean. This is calculated as follows.
• Find the total T of N observations. Estimate the
(arithmetic) mean by m = T/N .
• This works very well when the data follow a symmetric
bell-shaped frequency distribution (of the kind modelled
by “normal” distribution)
• A simple mathematical expression of this is $m = \frac{1}{N}\sum_i x_i$, where the observations are $x_1, x_2, \ldots, x_N$
• If we can group the data so that the observation $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times and so on, then the mean can be calculated even more easily as $m = \frac{1}{N}\sum_i x_i f_i$
COMP9417 ML & DM Regression Semester 1, 2018 28 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread II
• If, instead of frequencies, you had relative frequencies (i.e. instead of $f_i$ you had $p_i = f_i/N$), then the mean is simply the observations weighted by relative frequency. That is, $m = \sum_i x_i p_i$
• We want to connect this up to computing the mean value of observations modelled by some theoretical probability distribution function. That is, we want a similar counting method for calculating the mean of random variables modelled using some known distribution
COMP9417 ML & DM Regression Semester 1, 2018 29 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread III
• Correctly, this is the mean value of the values of the
random variable function. But this is a bit cumbersome,
so we will just say the “mean value of the r.v.” For
discrete r.v.’s this is:
$E(X) = \sum_i x_i\, p(X = x_i)$
Variance. This is calculated as follows:
• Calculate the total T and the sum of squares of N
observations. The estimate of the standard deviation is
$s = \sqrt{\frac{1}{N-1}\sum_i (x_i - m)^2}$
• Again, this is a very good estimate when the data are
modelled by a normal distribution
COMP9417 ML & DM Regression Semester 1, 2018 30 / 99
Statistical Techniques for Data Analysis Estimation
Sample Estimates of the Mean and the Spread IV
• For grouped data, this is modified to
$s = \sqrt{\frac{1}{N-1}\sum_i (x_i - m)^2 f_i}$
• Again, we have a similar formula in terms of expected
values, for the scatter (spread) of values of a r.v. X
around a mean value E(X):
$\mathrm{Var}(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$
• You can remember this as “the mean of the squares
minus the square of the mean”
COMP9417 ML & DM Regression Semester 1, 2018 31 / 99
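A small sketch, on made-up observations, checking the sample mean and standard deviation formulas above against NumPy's built-ins (ddof=1 gives the $N-1$ denominator).

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])   # made-up observations

N = len(x)
m = x.sum() / N                                 # sample mean, m = T / N
s = np.sqrt(((x - m) ** 2).sum() / (N - 1))     # sample standard deviation (N-1 denominator)

# The same quantities via NumPy built-ins (ddof=1 gives the N-1 denominator).
assert np.isclose(m, np.mean(x))
assert np.isclose(s, np.std(x, ddof=1))
print("mean:", m, "std:", s)
```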
Statistical Techniques for Data Analysis Bias-Variance Decomposition
Bias-Variance Decomposition
COMP9417 ML & DM Regression Semester 1, 2018 32 / 99
Statistical Techniques for Data Analysis Bias-Variance Decomposition
The Bias-Variance Tradeoff
• When comparing unbiased estimators, we would like to select the one
with minimum variance
• In general, we would be comparing estimators that have some bias
and some variance
• We can combine the bias and variance of an estimator by obtaining
the mean square error of the estimator, or MSE. This is the average
value of squared deviations of an estimated value V from the true
value of the parameter θ. That is:
$\mathrm{MSE} = \text{average value of } (V - \theta)^2$
• Now, it can be shown that:
$\mathrm{MSE} = \text{variance} + \text{bias}^2$
• If, as sample size increases, the bias and the variance of an estimator approach 0, then the estimator is said to be consistent.
COMP9417 ML & DM Regression Semester 1, 2018 33 / 99
Statistical Techniques for Data Analysis Bias-Variance Decomposition
The Bias-Variance Tradeoff
• Since
$\mathrm{MSE} = \text{variance} + \text{bias}^2$
the lowest possible value of MSE is 0
• In general, we may not be able to get to the ideal MSE of 0.
Sampling theory tells us the minimum value of the variance of an
estimator. This value is known as the Cramer-Rao bound. So, given
an estimator with bias b, we can calculate the minimum value of the
variance of the estimator using the CR bound (say, vmin). Then:
$\mathrm{MSE} \ge v_{min} + b^2$
The value of $v_{min}$ depends on whether the estimator is biased or unbiased (that is, $b = 0$ or $b \neq 0$)
• It is not the case that vmin for an unbiased (b = 0) estimator is less
than vmin for a biased estimator. So, the MSE of a biased estimator
can end up being lower than the MSE of an unbiased estimator.
COMP9417 ML & DM Regression Semester 1, 2018 34 / 99
Statistical Techniques for Data Analysis Bias-Variance Decomposition
Decomposition of MSE
$\mathrm{MSE} = \text{variance} + \text{bias}^2$
Imagine testing the prediction of our estimator ŷ on many samples of the
same size drawn at random from the same distribution. We compute error
based on the squared difference between predicted and actual values.
Then the MSE can be decomposed like this:
$\mathrm{MSE} = E[(\hat{y} - f(x))^2] = E[(\hat{y} - E(\hat{y}))^2] + [E(\hat{y}) - f(x)]^2$
Note that the first term in the error decomposition (variance) does not
refer to the actual value at all, although the second term (bias) does.
COMP9417 ML & DM Regression Semester 1, 2018 35 / 99
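The decomposition can be checked empirically. The sketch below, with a made-up population and a deliberately biased estimator of its mean, repeatedly draws samples of the same size and compares the estimator's MSE with its variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 5.0            # the parameter theta we are estimating (made-up)
sample_size = 20
n_repeats = 100_000

# A deliberately biased estimator of the mean: shrink the sample mean towards 0.
estimates = np.array([0.9 * rng.normal(true_mean, 2.0, sample_size).mean()
                      for _ in range(n_repeats)])

mse = np.mean((estimates - true_mean) ** 2)
variance = np.var(estimates)
bias = np.mean(estimates) - true_mean

print("MSE:", mse)
print("variance + bias^2:", variance + bias ** 2)  # should closely match MSE
```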
Statistical Techniques for Data Analysis Correlation
Correlation
COMP9417 ML & DM Regression Semester 1, 2018 36 / 99
Statistical Techniques for Data Analysis Correlation
Correlation I
• The correlation coefficient is a number between -1 and +1 that
indicates whether a pair of variables x and y are associated or not,
and whether the scatter in the association is high or low
• A value close to +1 indicates that high values of x are associated with high values of y and low values of x with low values of y, and scatter is low
• A value near 0 indicates that there is no particular association and that
there is a large scatter associated with the values
• A value close to -1 suggests an inverse association between x and y
• Only appropriate when x and y are roughly linearly associated
(doesn’t work well when the association is curved)
• The formula for computing correlation between x and y is:
$r = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)}\sqrt{\mathrm{var}(y)}}$
This is sometimes also called Pearson’s correlation coefficient
COMP9417 ML & DM Regression Semester 1, 2018 37 / 99
Statistical Techniques for Data Analysis Correlation
Correlation II
• The terms in the denominator are simply the standard deviations of x
and y. But the numerator is different. This is calculated as the
average of the product of deviations from the mean:
$\mathrm{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• What does "covariance" mean? Consider
1 Case 1: $x_i > \bar{x}$, $y_i > \bar{y}$
2 Case 2: $x_i < \bar{x}$, $y_i < \bar{y}$
3 Case 3: $x_i < \bar{x}$, $y_i > \bar{y}$
4 Case 4: $x_i > \bar{x}$, $y_i < \bar{y}$
In the first two cases, xi and yi vary together, both being high or low
relative to their means. In the other two cases, they vary in different
directions
COMP9417 ML & DM Regression Semester 1, 2018 38 / 99
Statistical Techniques for Data Analysis Correlation
Correlation III
• If the positive products dominate in the calculation of cov(x, y), then
the value of r will be positive. If the negative products dominate, then
r will be negative. If 0 products dominate, then r will be close to 0.
• You should be able to show that:
$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}$
• Computers generally use a short-cut formula for the covariance:
$\mathrm{cov}(x, y) = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{n - 1}$
• The same kinds of calculations can be done if the data were not
actual values but ranks instead (i.e. ranks for the x’s and the y’s).
• This is called Spearman’s rank correlation, but we won’t do these
calculations here.
COMP9417 ML & DM Regression Semester 1, 2018 39 / 99
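A short sketch computing Pearson's correlation coefficient from the covariance and standard deviations on made-up data, and checking it against NumPy's np.corrcoef.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # made-up, roughly linear in x

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print("r (from definition):", r)
print("r (np.corrcoef):   ", np.corrcoef(x, y)[0, 1])   # should agree
```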
Statistical Techniques for Data Analysis Correlation
What Happens If You Sample? I
• Suppose you have a sample of 〈x, y〉 pairs and you calculate r = 0.3. Is this really the correlation in the population?
• Sampling theory tells us something. If: (a) the relative frequencies
observed are well modelled by a special kind of mathematical function
(a “Normal” or Gaussian distribution); (b) the true correlation is 0;
and (c) the number of samples is large
• Then:
• The sampling distribution of the correlation coefficient (that is, how r
varies from sample to sample) is also approximately distributed
according to the Normal distribution with mean 0 and standard error (s.e.) of approximately $1/\sqrt{n}$
• We can use this to calculate the (approximate) probability of
obtaining the sample if the assumptions were true
COMP9417 ML & DM Regression Semester 1, 2018 40 / 99
Statistical Techniques for Data Analysis Correlation
What Happens If You Sample? II
• Suppose we calculate r = 0.3 from the sample, and that the s.e. is 0.1,
say. Then if the sample came from a population with true correlation
0, this would be quite unusual (less than 1% chance)
• We would say instead that the sample was probably from a population
with correlation 0.3, with a 95% confidence interval of ±2× 0.1
COMP9417 ML & DM Regression Semester 1, 2018 41 / 99
Statistical Techniques for Data Analysis Correlation
What Does Correlation Mean? I
• r is a quick way of checking whether there is some linear association
between x and y
• The sign of the value tells you the direction of the association
• All that the numerical value tells you is about the scatter in the data
• The correlation coefficient does not model any relationship. That is,
given a particular x you cannot use the r value to calculate a y value
• It is possible for two datasets to have the same correlation, but
different relationships
• It is possible for two datasets to have different correlations but the
same relationship
• MORAL: Do not use correlations to compare datasets. All you can
derive is whether there is a positive or negative relationship between x
and y
• ANOTHER MORAL: Do not use correlation to imply x causes y or
the other way around
COMP9417 ML & DM Regression Semester 1, 2018 42 / 99
Statistical Techniques for Data Analysis Regression
Regression
COMP9417 ML & DM Regression Semester 1, 2018 43 / 99
Statistical Techniques for Data Analysis Regression
Regression
• Given a set of data points xi, yi, what is the relationship between
them? (We can generalise this to the “multivariate” case later)
• One kind of question is to ask: are these linearly related in some
manner? That is, can we draw a straight line that describes
reasonably well the relationship between X and Y
• Remember, the correlation coefficient can tell us if there is a case for
such a relationship
• In real life, even if such a relationship held, it would be unreasonable to expect all pairs $x_i, y_i$ to lie precisely on a straight line. Instead, we can probably draw some reasonably well-fitting line. But which one?
COMP9417 ML & DM Regression Semester 1, 2018 44 / 99
Univariate linear regression
Univariate linear regression
COMP9417 ML & DM Regression Semester 1, 2018 45 / 99
Univariate linear regression
Linear Relationship Between 2 Variables I
• GOAL: fit a line whose equation is of the form $\hat{Y} = a + bX$
• HOW: minimise $\sum_i d_i^2 = \sum_i (Y_i - \hat{Y}_i)^2$ (the "least squares estimator")
COMP9417 ML & DM Regression Semester 1, 2018 46 / 99
Univariate linear regression
Linear Relationship Between 2 Variables II
• The calculation for b is given by:
$b = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$
where $\mathrm{cov}(x, y)$ is the covariance of $x$ and $y$, given by $\sum_i (x_i - \bar{x})(y_i - \bar{y})$ as before
• This can be simplified to:
$b = \sum (xy) / \sum x^2$
where $x = (X_i - \bar{X})$ and $y = (Y_i - \bar{Y})$
• $a = \bar{Y} - b\bar{X}$
COMP9417 ML & DM Regression Semester 1, 2018 47 / 99
Univariate linear regression
Meaning of the Coefficients a and b
• b: change in Y that accompanies a unit change in X
• If the values of X were assigned at random, then b estimates the unit
change in Y caused by a unit change in X
• If the values of X were not assigned at random (for example, they were data somebody observed), then the change in Y will include the change in X and any other confounding variables that may have changed as a result of changing X by 1 unit. So, you cannot say, for example, that a change of X by 1 unit causes b units of change in Y
• b = 0 means there is no linear relationship between X and Y, and then the best we can do is simply say $\hat{Y} = a = \bar{Y}$. Estimating the sample mean is therefore a special case of the MSE criterion
COMP9417 ML & DM Regression Semester 1, 2018 48 / 99
Univariate linear regression
The Regression Model I
• The least-square estimator fits a line using sample data
• To draw inferences about the population requires us to have a
(statistical) model about what this line means
• What is being assumed is actually this:
COMP9417 ML & DM Regression Semester 1, 2018 49 / 99
Univariate linear regression
The Regression Model II
• That is: Obtain Y values for many instances of X1. This will result in
a distribution of Y values P (Y |X1); and so on for
P (Y |X2), P (Y |X3), etc.. The regression model makes the following
assumptions:
• All the Y distributions are the same, and have the same spread
• For each P (Y |Xi) distribution, the true mean value µi lies on a
straight line (this is the “true regression line”)
• The Yi are independent
• In standard terminology, the $Y_i$ are independent and identically distributed (i.i.d.) random variables with mean $\mu_i = \alpha + \beta X_i$ and variance $\sigma^2$
• Or: $Y_i = \alpha + \beta X_i + e_i$ where the $e_i$ are independent errors with mean 0 and variance $\sigma^2$
COMP9417 ML & DM Regression Semester 1, 2018 50 / 99
Univariate linear regression
How Good is the Least-Squares Estimator I
• The line fitted using the least-squares criterion is a sample-based
estimate of the true regression line
• To know how good this estimate is, we are really asking questions
about the bias and variance of the estimates of a and b
• It can be shown that under some assumptions, the least-square
estimates of a and b will be unbiased and that they will have the
lowest variance
• The proof of this is called the Gauss-Markov theorem. The
Gauss-Markov theorem makes the following assumptions:
1 The expected (average) value of the residuals is 0 ($E(e_i) = 0$)
2 The spread of the residuals is constant for all $X_i$ ($\mathrm{Var}(e_i) = \sigma^2$)
3 There is no relationship amongst the residuals ($\mathrm{cov}(e_i, e_j) = 0$ for $i \neq j$)
4 There is no relationship between the residuals and the $X_i$ ($\mathrm{cov}(X_i, e_i) = 0$)
COMP9417 ML & DM Regression Semester 1, 2018 51 / 99
Univariate linear regression
How Good is the Least-Squares Estimator II
• If these assumptions hold, then the Gauss-Markov theorem shows that $E(a) = \alpha$, $E(b) = \beta$, and that these estimates have the lowest variance among all linear unbiased estimators
• There is a special case of the assumptions that arises when the
residuals are assumed to be distributed according to the Normal
distribution, with mean 0
• In this case, minimising least-squares is equivalent to maximising the
probability of the Yi, given the Xi (that is, least-squares is equivalent
to maximum likelihood estimation)
• More on this in a later lecture
COMP9417 ML & DM Regression Semester 1, 2018 52 / 99
Univariate linear regression
Univariate linear regression
Example:
Suppose we want to investigate the relationship between people’s height
and weight. We collect n height and weight measurements
(hi, wi), 1 ≤ i ≤ n.
Univariate linear regression assumes a linear equation $w = a + bh$, with parameters $a$ and $b$ chosen such that the sum of squared residuals $\sum_{i=1}^{n}(w_i - (a + bh_i))^2$ is minimised.
COMP9417 ML & DM Regression Semester 1, 2018 53 / 99
Univariate linear regression
Univariate linear regression
In order to find the parameters we take partial derivatives, set the partial
derivatives to 0 and solve for a and b:
$\frac{\partial}{\partial a}\sum_{i=1}^{n}(w_i - (a + bh_i))^2 = -2\sum_{i=1}^{n}(w_i - (a + bh_i)) = 0 \;\Rightarrow\; \hat{a} = \bar{w} - \hat{b}\bar{h}$

$\frac{\partial}{\partial b}\sum_{i=1}^{n}(w_i - (a + bh_i))^2 = -2\sum_{i=1}^{n}(w_i - (a + bh_i))h_i = 0 \;\Rightarrow\; \hat{b} = \frac{\sum_{i=1}^{n}(h_i - \bar{h})(w_i - \bar{w})}{\sum_{i=1}^{n}(h_i - \bar{h})^2}$

So the solution found by linear regression is $w = \hat{a} + \hat{b}h = \bar{w} + \hat{b}(h - \bar{h})$.
COMP9417 ML & DM Regression Semester 1, 2018 54 / 99
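A minimal sketch of the closed-form solution just derived, applied to simulated height/weight data (the numbers are generated here and are not the course's example data).

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate height (cm) and weight (kg) roughly like the slide's example:
# a true linear model plus normally distributed noise (numbers are illustrative).
h = rng.uniform(150, 200, size=10)
w = -75 + 0.83 * h + rng.normal(0, 5, size=10)

# Closed-form least-squares estimates from the derivation above.
b_hat = np.sum((h - h.mean()) * (w - w.mean())) / np.sum((h - h.mean()) ** 2)
a_hat = w.mean() - b_hat * h.mean()

print("estimated slope b:", b_hat)
print("estimated intercept a:", a_hat)
```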
Univariate linear regression
Univariate linear regression
[Figure: scatter plot of body weight (kg, 40–90 on the y-axis) against body height (cm, 140–200 on the x-axis) with the fitted regression line; described on the next slide.]
COMP9417 ML & DM Regression Semester 1, 2018 55 / 99
Univariate linear regression
Univariate linear regression
Shown on previous slide:
The red solid line indicates the result of applying linear regression to 10 measurements of body weight (on the y-axis, in kilograms) against body height (on the x-axis, in centimetres). The orange dotted lines indicate the average height $\bar{h} = 181$ and the average weight $\bar{w} = 74.5$; the regression coefficient $\hat{b} = 0.78$. The measurements were simulated by adding normally distributed noise with mean 0 and variance 5 to the true model indicated by the blue dashed line ($b = 0.83$).
COMP9417 ML & DM Regression Semester 1, 2018 56 / 99
Univariate linear regression
Linear regression: intuitions
For a feature x and a target variable y, the regression coefficient is the
covariance between x and y in proportion to the variance of x:
$\hat{b} = \frac{\sigma_{xy}}{\sigma_{xx}}$
(Here we use $\sigma_{xx}$ as an alternative notation for $\sigma_x^2$.)
This can be understood by noting that the covariance is measured in units
of x times units of y (e.g., metres times kilograms above) and the variance
in units of x squared (e.g., metres squared), so their quotient is measured
in units of y per unit of x (e.g., kilograms per metre).
COMP9417 ML & DM Regression Semester 1, 2018 57 / 99
Univariate linear regression
Linear regression: intuitions
The intercept $\hat{a}$ is such that the regression line goes through $(\bar{x}, \bar{y})$.
Adding a constant to all x-values (a translation) will affect only the intercept but not the regression coefficient (since it is defined in terms of deviations from the mean, which are unaffected by a translation).
So we could zero-centre the x-values by subtracting $\bar{x}$, in which case the intercept is equal to $\bar{y}$.
We could even subtract $\bar{y}$ from all y-values to achieve a zero intercept, without changing the problem in an essential way.
COMP9417 ML & DM Regression Semester 1, 2018 58 / 99
Univariate linear regression
Linear regression: intuitions
Suppose we replace $x_i$ with $x'_i = x_i/\sigma_{xx}$ and likewise $\bar{x}$ with $\bar{x}' = \bar{x}/\sigma_{xx}$; then we have that $\hat{b} = \frac{1}{n}\sum_{i=1}^{n}(x'_i - \bar{x}')(y_i - \bar{y}) = \sigma_{x'y}$.
In other words, if we normalise $x$ by dividing all its values by $x$'s variance, we can take the covariance between the normalised feature and the target variable as the regression coefficient.
This demonstrates that univariate linear regression can be understood as
consisting of two steps:
1 normalisation of the feature by dividing its values by the feature’s
variance;
2 calculating the covariance of the target variable and the normalised
feature.
COMP9417 ML & DM Regression Semester 1, 2018 59 / 99
Univariate linear regression
Linear regression: intuitions
Another important point to note is that the sum of the residuals of the
least-squares solution is zero:
$\sum_{i=1}^{n}\left(y_i - (\hat{a} + \hat{b}x_i)\right) = n(\bar{y} - \hat{a} - \hat{b}\bar{x}) = 0$
The result follows because $\hat{a} = \bar{y} - \hat{b}\bar{x}$, as derived above.
While this property is intuitively appealing, it is worth keeping in mind
that it also makes linear regression susceptible to outliers: points that are
far removed from the regression line, often because of measurement errors.
COMP9417 ML & DM Regression Semester 1, 2018 60 / 99
Univariate linear regression
The effect of outliers
[Figure: the same weight-against-height scatter plot, now with one point moved up by 10 kg and the resulting shift in the least-squares regression line; described on the next slide.]
COMP9417 ML & DM Regression Semester 1, 2018 61 / 99
Univariate linear regression
The effect of outliers
Shown on previous slide:
Suppose that, as the result of a transcription error, one of the weight
values from the previous example of univariate regression is increased by
10 kg. The diagram shows that this has a considerable effect on the
least-squares regression line.
Specifically, we see that one of the blue points got moved up 10 units to
the green point, changing the red regression line to the green line.
COMP9417 ML & DM Regression Semester 1, 2018 62 / 99
Univariate linear regression
Least-Squares as Cost Minimization I
• Finding the least-squares solution is in effect finding the values of $a$ and $b$ that minimize $\sum_i d_i^2 = \sum_i (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i = a + bX_i$
• This minimum value was obtained analytically by the usual process of differentiating and equating to 0
• A numerical alternative to the analytical approach is to take (small) steps that decrease the value of the function to be minimised, and stop when we reach a minimum
• Recall that at a point the gradient vector points in the direction of
greatest increase of a function. So, the opposite direction to the
gradient vector gives the direction of greatest decrease
• $b_{i+1} = b_i - \eta \times g_b$
• $a_{i+1} = a_i - \eta \times g_a$
• Stop when $b_{i+1} \approx b_i$ and $a_{i+1} \approx a_i$
• More on this in a later lecture
COMP9417 ML & DM Regression Semester 1, 2018 63 / 99
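A minimal gradient-descent sketch for the univariate case, following the update rules above; the data, learning rate η and stopping tolerance are illustrative choices, not prescribed values.

```python
import numpy as np

# Made-up data roughly following y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.2, 7.9, 11.1, 14.0])

a, b = 0.0, 0.0          # initial guesses
eta = 0.01               # learning rate (hypothetical choice)
tol = 1e-8

for _ in range(100_000):
    residuals = y - (a + b * x)
    # Gradients of the sum of squared residuals w.r.t. a and b.
    g_a = -2.0 * residuals.sum()
    g_b = -2.0 * (residuals * x).sum()
    a_new, b_new = a - eta * g_a, b - eta * g_b
    if abs(a_new - a) < tol and abs(b_new - b) < tol:   # stop when updates are tiny
        a, b = a_new, b_new
        break
    a, b = a_new, b_new

print("a:", a, "b:", b)   # should be close to the analytical least-squares solution
```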
Multivariate linear regression
Multivariate linear regression
COMP9417 ML & DM Regression Semester 1, 2018 64 / 99
Multivariate linear regression
Many variables
• Often, we are interested in modelling the relationship of Y to several other variables
• In observational studies, the value of Y may be affected by the values
of several variables. For example, carcinogenicity may be
gender-specific. A regression model that ignores gender may find carcinogenicity to be related to some surrogate variable (height, for example)
• Including more variables can give a narrower confidence interval on
the prediction being made
COMP9417 ML & DM Regression Semester 1, 2018 65 / 99
Multivariate linear regression
Multivariate linear model
• The $Y_i$ are identically distributed independent variables with mean $\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$ and variance $\sigma^2$
• Or: $Y_i = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + e_i$ where the $e_i$ are independent errors with mean 0 and variance $\sigma^2$
• As before, this linear model is estimated from a sample by the equation $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$
• With many variables, the regression equation and expressions for the
bi are expressed better using a matrix representation for sets of
equations.
COMP9417 ML & DM Regression Semester 1, 2018 66 / 99
Multivariate linear regression
Multivariate linear regression *
First, we need the covariances between every feature and the target
variable:
$(\mathbf{X}^T\mathbf{y})_j = \sum_{i=1}^{n} x_{ij} y_i = \sum_{i=1}^{n}(x_{ij} - \mu_j)(y_i - \bar{y}) + n\mu_j\bar{y} = n(\sigma_{jy} + \mu_j\bar{y})$
Assuming for the moment that every feature is zero-centred, we have $\mu_j = 0$ and thus $\mathbf{X}^T\mathbf{y}$ is a d-vector holding all the required covariances (times $n$).
We can normalise the features by means of a d-by-d scaling matrix: a diagonal matrix with diagonal entries $1/(n\sigma_{jj})$. If $\mathbf{S}$ is a diagonal matrix with diagonal entries $n\sigma_{jj}$, we can get the required scaling matrix by simply inverting $\mathbf{S}$.
So our first stab at a solution for the multivariate regression problem is $\hat{\mathbf{w}} = \mathbf{S}^{-1}\mathbf{X}^T\mathbf{y}$
COMP9417 ML & DM Regression Semester 1, 2018 67 / 99
Multivariate linear regression
Multivariate linear regression *
The general case requires a more elaborate matrix instead of $\mathbf{S}$:
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad (1)$
Let us try to understand the term $(\mathbf{X}^T\mathbf{X})^{-1}$ a bit better.
• Assuming the features are uncorrelated, the covariance matrix $\mathbf{\Sigma}$ is diagonal with entries $\sigma_{jj}$.
• Assuming the features are zero-centred, $\mathbf{X}^T\mathbf{X} = n\mathbf{\Sigma}$ is also diagonal with entries $n\sigma_{jj}$.
• In other words, assuming zero-centred and uncorrelated features, $(\mathbf{X}^T\mathbf{X})^{-1}$ reduces to our scaling matrix $\mathbf{S}^{-1}$.
In the general case we cannot make any assumptions about the features, and $(\mathbf{X}^T\mathbf{X})^{-1}$ acts as a transformation that decorrelates, centres and normalises the features.
COMP9417 ML & DM Regression Semester 1, 2018 68 / 99
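A small NumPy sketch evaluating equation (1) on made-up data; it solves the linear system rather than forming the inverse explicitly, which is numerically preferable but mathematically equivalent.

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 100, 3
X = rng.normal(size=(n, d))                       # made-up feature matrix
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=n)    # targets from a known linear model

# Equation (1): w_hat = (X^T X)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated weights:", w_hat)                # close to [1.5, -2.0, 0.5]
```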
Multivariate linear regression
Bivariate linear regression *
First, we derive the basic expressions.

$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{n1} \\ x_{12} & \cdots & x_{n2} \end{pmatrix} \begin{pmatrix} x_{11} & x_{12} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix} = n \begin{pmatrix} \sigma_{11} + \bar{x}_1^2 & \sigma_{12} + \bar{x}_1\bar{x}_2 \\ \sigma_{12} + \bar{x}_1\bar{x}_2 & \sigma_{22} + \bar{x}_2^2 \end{pmatrix}$

$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{nD} \begin{pmatrix} \sigma_{22} + \bar{x}_2^2 & -\sigma_{12} - \bar{x}_1\bar{x}_2 \\ -\sigma_{12} - \bar{x}_1\bar{x}_2 & \sigma_{11} + \bar{x}_1^2 \end{pmatrix}, \qquad D = (\sigma_{11} + \bar{x}_1^2)(\sigma_{22} + \bar{x}_2^2) - (\sigma_{12} + \bar{x}_1\bar{x}_2)^2$

$\mathbf{X}^T\mathbf{y} = \begin{pmatrix} x_{11} & \cdots & x_{n1} \\ x_{12} & \cdots & x_{n2} \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = n \begin{pmatrix} \sigma_{1y} + \bar{x}_1\bar{y} \\ \sigma_{2y} + \bar{x}_2\bar{y} \end{pmatrix}$
COMP9417 ML & DM Regression Semester 1, 2018 69 / 99
Multivariate linear regression
Bivariate linear regression *
We now consider two special cases. The first is that $\mathbf{X}$ is in homogeneous coordinates, i.e., we are really dealing with a univariate problem. In that case we have $x_{i1} = 1$ for $1 \le i \le n$; $\bar{x}_1 = 1$; and $\sigma_{11} = \sigma_{12} = \sigma_{1y} = 0$. We then obtain (we write $x$ instead of $x_2$, $\sigma_{xx}$ instead of $\sigma_{22}$ and $\sigma_{xy}$ instead of $\sigma_{2y}$):

$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{n\sigma_{xx}} \begin{pmatrix} \sigma_{xx} + \bar{x}^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix} \qquad \mathbf{X}^T\mathbf{y} = n \begin{pmatrix} \bar{y} \\ \sigma_{xy} + \bar{x}\bar{y} \end{pmatrix}$

$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \frac{1}{\sigma_{xx}} \begin{pmatrix} \sigma_{xx}\bar{y} - \sigma_{xy}\bar{x} \\ \sigma_{xy} \end{pmatrix}$

This is the same result as obtained for the univariate case.
COMP9417 ML & DM Regression Semester 1, 2018 70 / 99
Multivariate linear regression
Bivariate linear regression *
The second special case we consider is where we assume $x_1$, $x_2$ and $y$ to be zero-centred, which means that the intercept is zero and $\mathbf{w}$ contains the two regression coefficients. In this case we obtain

$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{n(\sigma_{11}\sigma_{22} - \sigma_{12}^2)} \begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{pmatrix} \qquad \mathbf{X}^T\mathbf{y} = n \begin{pmatrix} \sigma_{1y} \\ \sigma_{2y} \end{pmatrix}$

$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^2} \begin{pmatrix} \sigma_{22}\sigma_{1y} - \sigma_{12}\sigma_{2y} \\ \sigma_{11}\sigma_{2y} - \sigma_{12}\sigma_{1y} \end{pmatrix}$

The last expression shows, e.g., that the regression coefficient for $x_1$ may be non-zero even if $x_1$ doesn't correlate with the target variable ($\sigma_{1y} = 0$), on account of the correlation between $x_1$ and $x_2$ ($\sigma_{12} \neq 0$).
COMP9417 ML & DM Regression Semester 1, 2018 71 / 99
Regularisation
Regularisation
COMP9417 ML & DM Regression Semester 1, 2018 72 / 99
Regularisation
Parameter Estimation by Optimization I
Regularisation is a general method to avoid overfitting by applying
additional constraints to the weight vector. A common approach is to
make sure the weights are, on average, small in magnitude: this is referred
to as shrinkage.
Recall the setting for regression in terms of cost minimization.
• Can add penalty terms to a cost function, forcing coefficients to shrink towards zero
$Y = f_{\theta_0, \theta_1, \ldots, \theta_n}(X_1, X_2, \ldots, X_n) = f_\theta(X)$
COMP9417 ML & DM Regression Semester 1, 2018 73 / 99
Regularisation
Parameter Estimation by Optimization II
• MSE as a cost function, given data $(x_1, y_1), \ldots, (x_n, y_n)$:
$\mathrm{Cost}(\theta) = \frac{1}{n}\sum_i (f_\theta(x_i) - y_i)^2$
and with a penalty function:
$\mathrm{Cost}(\theta) = \frac{1}{n}\sum_i (f_\theta(x_i) - y_i)^2 + \frac{\lambda}{n}\sum_{i=1}^{n} \theta_i^2$
• Parameter estimation by optimisation will attempt to find values for $\theta_0, \theta_1, \ldots, \theta_n$ such that $\mathrm{Cost}(\theta)$ is a minimum
• It will be easier to take the $\frac{1}{n}$ term as $\frac{1}{2n}$, which will not affect the minimisation
COMP9417 ML & DM Regression Semester 1, 2018 74 / 99
Regularisation
Parameter Estimation by Optimization III
• Using gradient descent with the penalty function will do two things: (a) we will move each $\theta_j$ in a direction that minimises the cost; and (b) each value of $\theta_j$ will also get "shrunk" on each iteration by multiplying the old value by an amount $< 1$:
$\theta_j^{(i+1)} = \alpha\theta_j^{(i)} - \eta\nabla_{\theta_j}$, where $\alpha < 1$
COMP9417 ML & DM Regression Semester 1, 2018 75 / 99
Regularisation
Regularised regression
The multivariate least-squares regression problem can be written as an optimisation problem:
$\mathbf{w}^* = \arg\min_{\mathbf{w}} (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})$
The regularised version of this optimisation is then as follows:
$\mathbf{w}^* = \arg\min_{\mathbf{w}} (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda\|\mathbf{w}\|^2$
where $\|\mathbf{w}\|^2 = \sum_i w_i^2$ is the squared norm of the vector $\mathbf{w}$, or, equivalently, the dot product $\mathbf{w}^T\mathbf{w}$; $\lambda$ is a scalar determining the amount of regularisation.
COMP9417 ML & DM Regression Semester 1, 2018 76 / 99
Regularisation
Regularised regression
This regularised problem still has a closed-form solution:
$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$
where $\mathbf{I}$ denotes the identity matrix. Regularisation amounts to adding $\lambda$ to the diagonal of $\mathbf{X}^T\mathbf{X}$, a well-known trick to improve the numerical stability of matrix inversion. This form of least-squares regression is known as ridge regression.
An interesting alternative form of regularised regression is provided by the lasso, which stands for 'least absolute shrinkage and selection operator'. It replaces the ridge regularisation term $\sum_i w_i^2$ with the sum of absolute weights $\sum_i |w_i|$. The result is that some weights are shrunk, but others are set to 0, and so lasso regression favours sparse solutions.
COMP9417 ML & DM Regression Semester 1, 2018 77 / 99
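A sketch of the ridge closed form on made-up data, with an illustrative value of λ; the lasso has no closed-form solution and is usually fitted by iterative methods, so it is not shown here.

```python
import numpy as np

rng = np.random.default_rng(4)

n, d = 50, 5
X = rng.normal(size=(n, d))                         # made-up, zero-centred features
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=n)

lam = 1.0                                           # illustrative regularisation strength

# Ridge regression: w_hat = (X^T X + lambda I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares for comparison (lambda = 0).
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("OLS:  ", w_ols)
print("ridge:", w_ridge)   # shrunk towards zero relative to OLS
```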
Some further issues in learning linear regression models
Some further issues in learning linear regression models
COMP9417 ML & DM Regression Semester 1, 2018 78 / 99
Some further issues in learning linear regression models
What do the Coefficients bi Mean?
• Consider the two equations:
$\hat{Y} = a + bX$
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$
• b: change in Y that accompanies a unit change in X
• b1: change in Y that accompanies a unit change in X1 provided X2
remains constant
• More generally, bi (i > 0) is the change in Y that accompanies a unit
change in Xi provided all other X’s are constant
• So: if all relevant variables are included, then we can assess the effect
of each one in a controlled manner
COMP9417 ML & DM Regression Semester 1, 2018 79 / 99
Some further issues in learning linear regression models
Categoric Variables: X’s I
• “Indicator” variables are those that take on the values 0 or 1
• They are used to include the effects of categoric variables. For example, let D be a variable that takes the value 1 if a patient takes a drug and 0 if the patient does not. Suppose you want to know the effect of drug D on blood pressure Y, keeping age (X) constant:
$\hat{Y} = 70 + 5D + 0.44X$
So, taking the drug (a unit change in D) makes a difference of 5
units, provided age is held constant
COMP9417 ML & DM Regression Semester 1, 2018 80 / 99
Some further issues in learning linear regression models
Categoric Variables: X’s II
• How do we capture any interaction effect between age and drug
intake? Introduce a new interaction variable $DX = D \times X$:
$\hat{Y} = 70 + 5D + 0.44X + 0.21DX$
COMP9417 ML & DM Regression Semester 1, 2018 81 / 99
Some further issues in learning linear regression models
Categoric Values: Y values
• Sometimes, Y values are simply one of two values (let’s call them 0
and 1)
• We can’t use the regression model as we described earlier, in which
the Y ’s can take any real value
• But we can define a new linear regression model which predicts not the value of $Y$, but what are called the log odds of $Y$:
$\text{log odds of } Y = \mathrm{Odds} = b_0 + b_1 X_1 + \cdots + b_n X_n$
• Once the Odds are estimated, they can be used to calculate the probability of $Y$:
$\Pr(Y = 1) = \frac{e^{\mathrm{Odds}}}{1 + e^{\mathrm{Odds}}}$
We can then use the value of $\Pr(Y = 1)$ to decide if $Y = 1$
• This procedure is called logistic regression (we’ll see this again)
COMP9417 ML & DM Regression Semester 1, 2018 82 / 99
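A small sketch of the second step only: converting estimated log odds to Pr(Y = 1) and thresholding. The coefficients are made up for illustration; estimating them requires maximum likelihood, which is covered later.

```python
import numpy as np

# Hypothetical fitted coefficients for the log-odds model:
# log-odds = b0 + b1*X1 + b2*X2  (numbers are made up for illustration).
b0, b1, b2 = -1.0, 0.8, 0.05

def prob_y_equals_1(x1, x2):
    log_odds = b0 + b1 * x1 + b2 * x2
    return np.exp(log_odds) / (1.0 + np.exp(log_odds))   # the logistic transform

p = prob_y_equals_1(x1=1.0, x2=10.0)
print("Pr(Y=1):", p)
print("predict Y=1" if p >= 0.5 else "predict Y=0")
```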
Some further issues in learning linear regression models
Is the Model Appropriate ? * I
COMP9417 ML & DM Regression Semester 1, 2018 83 / 99
Some further issues in learning linear regression models
Is the Model Appropriate ? * II
• The residuals from the regression line can be calculated numerically,
along with their mean, variance and standard deviation. It can be
shown that the residual standard deviation is related to the standard
deviation of the Y values in the following manner:
$\mathrm{rsd} = s_y\sqrt{1 - r^2}$
• This helps us understand how much the regression line helped reduce
the scatter of the y values (sy gives a measure of the scatter of y
values about the mean y; and rsd gives a measure of the scatter of y
values about the regression line)
• This also gives you another way of understanding the correlation
coefficient. With r = 0.9, the scatter about the regression line is still
almost 45% of the original scatter about the mean
COMP9417 ML & DM Regression Semester 1, 2018 84 / 99
Some further issues in learning linear regression models
Is the Model Appropriate ? * III
• If there is no systematic pattern to the residuals—that is, there are
approximately half of them that are positive and half that are
negative, then the line is a good fit
• It should also be the case that there should be no pattern to the
residual scatter all along the line. If the average size of the residuals
varies along the line (this condition is called heteroscedasticity) then
the relationship is probably more complex than a straight line
• Residuals from a well-fitting line should show an approximate
symmetric, bell-shaped frequency distribution with a mean of 0
COMP9417 ML & DM Regression Semester 1, 2018 85 / 99
Some further issues in learning linear regression models
Non-linear Relationships
• Sometimes, the linear model may be inappropriate
• Some non-linear relationships can be captured in a linear model by a transformation ("trick"). For example, the curved model $\hat{Y} = b_0 + b_1 X_1 + b_2 X_1^2$ can be transformed by $X_2 = X_1^2$ into a linear model. This works for polynomial relationships.
• Some other non-linear relationships may require more complicated transformations. For example, the relationship $Y = b_0 X_1^{b_1} X_2^{b_2}$ can be transformed into the linear relationship $\log(Y) = \log b_0 + b_1\log X_1 + b_2\log X_2$
• Other relationships cannot be transformed quite so easily, and will
require full non-linear estimation (in subsequent topics in the ML
course we will find out more about these)
COMP9417 ML & DM Regression Semester 1, 2018 86 / 99
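A sketch of the polynomial "trick" on made-up data: the squared feature is added as an extra column and the ordinary linear least-squares machinery is reused unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up data following a curved relationship Y = 1 + 2*X - 0.5*X^2 plus noise.
x1 = rng.uniform(-3, 3, size=40)
y = 1 + 2 * x1 - 0.5 * x1**2 + rng.normal(scale=0.2, size=40)

# The "trick": introduce X2 = X1^2 as an extra column, then fit an ordinary linear model.
X = np.column_stack([np.ones_like(x1), x1, x1**2])
b = np.linalg.solve(X.T @ X, X.T @ y)

print("b0, b1, b2:", b)   # should be close to 1, 2, -0.5
```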
Some further issues in learning linear regression models
Non-Linear Relationships (contd.)
• Main difficulty with non-linear relationships is choice of function
• How to learn ?
• Can use a form of gradient descent to estimate the parameters
• After a point, almost any sufficiently complex mathematical function
will do the job in a sufficiently small range
• Some kind of prior knowledge or theory is the only way to help here.
• Otherwise, it becomes a process of trial-and-error, in which case,
beware of conclusions that can be drawn
COMP9417 ML & DM Regression Semester 1, 2018 87 / 99
Some further issues in learning linear regression models
Model Selection
• Suppose there are a lot of variables $X_i$, some of which may represent products, powers, etc.
• Taking all the Xi will lead to an overly complex model. There are 3
ways to reduce complexity:
1 Subset-selection, by search over subset lattice. Each subset results in a
new model, and the problem is one of model-selection
2 Shrinkage, or regularization of coefficients to zero, by optimization.
There is a single model, and unimportant variables have near-zero
coefficients.
3 Dimensionality-reduction, by projecting points into a lower dimensional
space (this is different to subset-selection, and we will look at it later)
COMP9417 ML & DM Regression Semester 1, 2018 88 / 99
Some further issues in learning linear regression models
Model Selection as Search I
• The subsets of the set of possible variables form a lattice with S1 ∩ S2
as the g.l.b. or meet and S1 ∪ S2 as the l.u.b. or join
• Each subset refers to a model, and a pair of subsets are connected if
they differ by just 1 element
• A lattice is a graph, and we know how to search a graph
• A∗, greedy, randomised, etc.
• "Cost" of a node in the graph: MSE of the model. The parameters (coefficients) of each candidate model can be found as before (e.g., by least squares)
• Historically, model-selection for regression has been done using
“forward-selection”, “backward-elimination”, or “stepwise” methods
• These are greedy search techniques that either: (a) start at the top of
the subset lattice, and add variables; (b) start at the bottom of the
subset lattice and remove variables; or (c) start at some interior point
and proceed by adding or removing single variables (examining nodes
connected to the node above or below)
COMP9417 ML & DM Regression Semester 1, 2018 89 / 99
Some further issues in learning linear regression models
Model Selection as Search II
• Greedy selection is done on the basis of calculating the coefficient of determination (often denoted by $R^2$), which denotes the proportion of total variation in the dependent variable $Y$ that is explained by the model
• Given a model formed with a subset of variables $X$, it is possible to compute the observed change in $R^2$ due to the addition or deletion of some variable $x$
• This is used to select greedily the next best move in the graph-search
To set other hyper-parameters, such as the shrinkage parameter $\lambda$, we can use grid search
COMP9417 ML & DM Regression Semester 1, 2018 90 / 99
Some further issues in learning linear regression models
Prediction I
• It is possible to quantify what happens if the regression line is used
for prediction:
• The intuition is this:
• Recall the regression line goes through the mean $(\bar{X}, \bar{Y})$
COMP9417 ML & DM Regression Semester 1, 2018 91 / 99
Some further issues in learning linear regression models
Prediction II
• If the $X_i$ are slightly different, then the mean is not going to change much. So, the regression line stays somewhat "fixed" at $(\bar{X}, \bar{Y})$ but with a different slope
• With each different sample of the $X_i$ we will get a slightly different regression line
• The variation in $Y$ values is greater the further we move from $(\bar{X}, \bar{Y})$
• MORAL: Be careful when predicting far away from the centre value
• ANOTHER MORAL: The model only works under approximately the same conditions that held when collecting the data
COMP9417 ML & DM Regression Semester 1, 2018 92 / 99
Local (nearest-neighbour) regression
Local (nearest-neighbour) regression
COMP9417 ML & DM Regression Semester 1, 2018 93 / 99
Local (nearest-neighbour) regression
Local learning
• Related to the simplest form of learning: rote learning or
memorization
• Training instances are searched for instance that most closely
resembles query or test instance
• The instances themselves represent the knowledge
• Called: nearest-neighbour, instance-based, memory-based or
case-based learning; all forms of local learning
• The similarity or distance function defines “learning”, i.e., how to go
beyond simple memorization
• Intuition — classify an instance similarly to examples “close by” —
neighbours or exemplars
• A form of lazy learning – don’t need to build a model!
COMP9417 ML & DM Regression Semester 1, 2018 94 / 99
Local (nearest-neighbour) regression
Nearest neighbour for numeric prediction
Store all training examples $\langle x_i, f(x_i)\rangle$.
Nearest neighbour:
• Given query instance $x_q$,
• first locate nearest training example $x_n$,
• then estimate $\hat{y} = \hat{f}(x_q) = f(x_n)$
k-Nearest neighbour:
• Given $x_q$, take mean of $f$ values of $k$ nearest neighbours:
$\hat{y} = \hat{f}(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}$
COMP9417 ML & DM Regression Semester 1, 2018 95 / 99
Local (nearest-neighbour) regression
Distance function
The distance function defines what is “learned”, i.e., predicted.
Instance xi is described by an m-vector of feature values:
〈xi1, xi2, . . . xim〉
where xik denotes the value of the kth feature of xi.
Most commonly used distance function is Euclidean distance, where the
distance between two instances xi and xj is defined to be:
$d(x_i, x_j) = \sqrt{\sum_{k=1}^{m}(x_{ik} - x_{jk})^2}$
COMP9417 ML & DM Regression Semester 1, 2018 96 / 99
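A minimal sketch of k-nearest-neighbour numeric prediction using the Euclidean distance above; the training data and the choice k = 3 are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """k-nearest-neighbour numeric prediction with Euclidean distance."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # d(x_i, x_q)
    nearest = np.argsort(dists)[:k]                           # indices of k closest examples
    return y_train[nearest].mean()                            # mean of their target values

# Made-up training data: y is roughly the sum of the two features.
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 3.0], [4.0, 3.5], [5.0, 5.0]])
y_train = np.array([2.1, 3.4, 6.1, 7.4, 10.2])

print(knn_predict(X_train, y_train, np.array([2.5, 2.0]), k=3))
```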
Local (nearest-neighbour) regression
Local regression
Use kNN to form a local approximation to f for each query point xq using
a linear function of the form
$\hat{f}(x) = b_0 + b_1 x_1 + \ldots + b_m x_m$
where xi denotes the value of the ith feature of instance x.
Where does this linear regression model come from ?
• fit linear function to k nearest neighbours
• or quadratic or higher-order polynomial . . .
• produces “piecewise approximation” to f
COMP9417 ML & DM Regression Semester 1, 2018 97 / 99
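A sketch of a simple (unweighted) variant of local regression: for each query point, fit a linear model to its k nearest neighbours and evaluate it at the query. The data and the value of k are illustrative.

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_query, k=5):
    """Fit an (unweighted) linear model to the k nearest neighbours of x_query."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # Local design matrix with an intercept column.
    A = np.column_stack([np.ones(k), X_train[nearest]])
    b, *_ = np.linalg.lstsq(A, y_train[nearest], rcond=None)  # local least squares
    return np.concatenate([[1.0], x_query]) @ b               # evaluate local model at query

rng = np.random.default_rng(6)
X_train = rng.uniform(0, 10, size=(50, 1))                    # made-up 1-D data
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=50)

print(local_linear_predict(X_train, y_train, np.array([3.0]), k=7))
```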
Summary
Summary
COMP9417 ML & DM Regression Semester 1, 2018 98 / 99
Summary
Summary
• Linear models give us a glimpse into many aspects of Machine
Learning
Terminology. Training data, test data, resubstitution error, prediction
error.
Conceptual. Learning as search, learning as optimisation,
assumptions underlying a technique
Implementation. Approximate alternatives to analytical methods
Application. Overfitting, problems of prediction
Each of these aspects will have counterparts in other kinds of
machine learning
• Linear models are one way to predict numerical quantities
• Ordinal regression: predicting ranks (not in the lectures)
• Neural networks: non-linear regression models (later)
• Regression trees: piecewise regression models (later)
• Class-probability trees: predicting probabilities (later)
• Model trees: piecewise non-linear models (later)
COMP9417 ML & DM Regression Semester 1, 2018 99 / 99
Aims
Introduction
Learning Linear Regression Models
Statistical Techniques for Data Analysis
Sampling
Estimation
Bias-Variance Decomposition
Correlation
Regression
Univariate linear regression
Multivariate linear regression
Regularisation
Some further issues in learning linear regression models
Local (nearest-neighbour) regression
Summary