Regression
COMP9417 Machine Learning and Data Mining
Term 2, 2020
Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
Aims
After a brief introduction to this course and the topics in it, this lecture will introduce you to machine learning approaches to the problem of numerical prediction. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
• the supervised learning task of numeric prediction
• how linear regression solves the problem of numeric prediction
• fitting linear regression by the least squares error criterion
• non-linear regression via linear-in-the-parameters models
• parameter estimation for regression
• local (nearest-neighbour) regression
A Brief Course Introduction
Overview
This course will introduce you to machine learning, covering some of the core ideas, methods and theory currently used and understood by practitioners, including, but not limited to:
• categories of learning (supervised learning, unsupervised learning, etc.)
• widely-used machine learning techniques and algorithms
• batch vs. online settings
• parametric vs. non-parametric approaches
• generalisation in machine learning
• training, validation and testing phases in applications
• limits on learning
What we will cover
• core algorithms and model types in machine learning
• foundational concepts regarding learning from data
• relevant theory to inform and generalise understanding
• practical applications
What we will NOT cover
• lots of probability and statistics
• lots of neural nets and deep learning
• “big data”
• commercial and business aspects of “analytics”
• ethical aspects of AI and ML
although all of these are interesting and important topics!
Why Study Machine Learning?
Some history
One can imagine that after the machine had been in operation for some time, the instructions would have been altered out of recognition, but nevertheless still be such that one would have to admit that the machine was still doing very worthwhile calculations. Possibly it might still be getting results of the type desired when the machine was first set up, but in a much more efficient manner. In such a case one would have to admit that the progress of the machine had not been foreseen when its original instructions were put in. It would be like a pupil who had learnt much from his master, but had added much more by his own work.
From A. M. Turing’s lecture to the London Mathematical Society. (1947)
Some definitions
The field of machine learning is concerned with the question of how to construct computer programs that automatically improve from experience.
“Machine Learning”. T. Mitchell (1997)
Machine learning, then, is about making computers modify or adapt their actions (whether these actions are making predictions, or controlling a robot) so that these actions get more accurate, where accuracy is measured by how well the chosen actions reflect the correct ones.
“Machine Learning”. S. Marsland (2015)
Machine learning is the systematic study of algorithms and systems that improve their knowledge or performance with experience.
“Machine Learning”. P. Flach (2012)
The term machine learning refers to the automated detection of meaningful patterns in data.
“Understanding Machine Learning”. S. Shalev-Shwartz and S. Ben-David (2014)
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.
“Data Mining”. I. Witten et al. (2016)
Machine Learning is . . .
Trying to get programs to work in a reasonable way to predict stuff.
R. Kohn (2015)
How is Machine Learning different from . . .
Machine learning comes originally from Artificial Intelligence (AI), where the motivation is to build intelligent agents, capable of acting autonomously. Learning is a characteristic of intelligence, so to be successful an agent must ultimately be able to learn, apply, understand and communicate what it has learned.
These are not requirements in:
• statistics — the results are typically mathematical models for humans
• data mining — the results are typically models of “insight” for humans
These criteria are often also necessary, but not always sufficient, for machine learning.
Categories of Machine Learning
Supervised and unsupervised learning
The most widely used categories of machine learning algorithms are:
• Supervised learning – output class (or label) is given
• Unsupervised learning – no output class is given
There are also hybrids, such as semi-supervised learning, and alternative strategies to acquire data, such as reinforcement learning and active learning.
Note: output class can be real-valued or discrete, scalar, vector, or other structure . . .
Supervised learning tends to dominate in applications. Why?
Generally, because it is much easier to define the problem and develop an error measure (loss function) to evaluate different algorithms, parameter settings, data transformations, etc. for supervised learning than for unsupervised learning.
Unfortunately . . .
In the real world it is often difficult to obtain good labelled data in sufficient quantities
So in such cases unsupervised learning is really what you want . . .
but currently, finding good unsupervised learning algorithms for complex machine learning tasks remains a research challenge.
Models: the output of machine learning
Machine learning models can be distinguished according to their main intuition, for example:
• Geometric models use intuitions from geometry such as separating (hyper-)planes, linear transformations and distance metrics.
• Probabilistic models view learning as a process of reducing uncertainty, modelled by means of probability distributions.
• Logical models are defined in terms of easily interpretable logical expressions.
Alternatively, models can be characterised by algorithmic properties:
• Regression models predict a numeric output
• Classification models predict a discrete class value
• Neural networks learn based on a biological analogy
• Local models predict in the local region of a query instance
• Tree-based models partition the data to make predictions
• Ensembles learn multiple models and combine their predictions
Introduction to Regression
The “most typical” machine learning approach is to apply supervised learning methods for classification, where the task is to learn a model to predict a discrete value for data instances . . .
. . . however, we often find tasks where the most natural representation is that of prediction of numeric values
Example: the task is to learn a model to predict CPU performance from a dataset of 209 examples of different computer configurations.
Result: a linear regression equation fitted to the CPU dataset.
$$\text{PRP} = -56.1 + 0.049\,\text{MYCT} + 0.015\,\text{MMIN} + 0.006\,\text{MMAX} + 0.630\,\text{CACH} - 0.270\,\text{CHMIN} + 1.46\,\text{CHMAX}$$
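As an illustration of how such an equation could be produced, here is a minimal sketch using scikit-learn, assuming the CPU data are available in a CSV file with the column names above (the file name cpu.csv is hypothetical):

```python
# Sketch only: fit a linear regression to CPU-style data.
# Assumes a hypothetical file "cpu.csv" whose columns include the
# six predictors below and the target PRP.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv("cpu.csv")
features = ["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX"]
X, y = data[features], data["PRP"]

model = LinearRegression().fit(X, y)
print(model.intercept_)                  # plays the role of the -56.1 term
print(dict(zip(features, model.coef_)))  # one weight per feature
```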
For the class of symbolic representations, machine learning is viewed as: searching a space of hypotheses . . .
represented in a formal hypothesis language (trees, rules, graphs . . . ).
For the class of numeric representations, machine learning is viewed as: “searching” a space of functions . . .
represented as mathematical models (linear equations, neural nets, . . . ).
Note: in both settings, the models may be probabilistic . . .
Methods to predict a numeric output from statistics and machine learning:
• linear regression (statistics) determining the “line of best fit” using the least squares criterion
• linear models (machine learning) learning a predictive model from data under the assumption of a linear relationship between predictor and target variables
Very widely-used, many applications
Ideas that are generalised in Artificial Neural Networks
Regression as a term occurs in many areas of machine learning:
• non-linear regression by adding non-linear basis functions
• multi-layer neural networks (machine learning) learning non-linear predictors via hidden nodes between input and output
• regression trees (statistics / machine learning) tree where each leaf predicts a numeric quantity
• local (nearest-neighbour) regression
Learning Linear Regression Models
Regression
We will look at the simplest model for numerical prediction: a regression equation
Outcome will be a linear sum of feature values with appropriate weights.
Note: the term regression is overloaded – it can refer to:
• the process of determining the weights for the regression equation, or
• the regression equation itself.
Linear Regression
Assumes: the expected value of the output given an input, $E[y|x]$, is linear.
Simplest case: $\text{Out}(x) = bx$ for some unknown $b$.
Learning problem: given the data, estimate $b$ (i.e., $\hat{b}$).
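For this one-parameter case, the least-squares criterion (introduced formally below) gives a simple closed form. Setting the derivative of the squared error to zero:
$$\frac{d}{db} \sum_i (y_i - b x_i)^2 = -2 \sum_i x_i (y_i - b x_i) = 0 \quad \Rightarrow \quad \hat{b} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$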
Linear Models
• Numeric attributes and numeric prediction, i.e., regression
• Linear models, i.e. the outcome is a linear combination of attributes:
$$y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n$$
• Weights are calculated from the training data
• Predicted value for the first training instance $x^{(1)}$ is (with $x_0^{(1)} = 1$):
$$b_0 x_0^{(1)} + b_1 x_1^{(1)} + b_2 x_2^{(1)} + \ldots + b_n x_n^{(1)} = \sum_{i=0}^{n} b_i x_i^{(1)}$$
Minimizing Squared Error
The difference between predicted and actual values is the error!
The $n + 1$ coefficients are chosen so that the sum of squared errors on all instances in the training data is minimized. Squared error:
$$\sum_{j=1}^{m} \Big( y^{(j)} - \sum_{i=0}^{n} b_i x_i^{(j)} \Big)^2$$
Coefficients can be derived using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking).
Known as “Ordinary Least Squares” (OLS) regression – minimizing the sum of squared distances of data points to the estimated regression line.
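A minimal sketch of these matrix operations (an illustration, not code from the course): with a leading column of ones standing in for $x_0$, the least-squares coefficients solve the normal equations $X^\top X b = X^\top y$.

```python
# Sketch: ordinary least squares via the normal equations.
import numpy as np

def ols(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    Xb = np.column_stack([np.ones(len(X)), X])   # x_0 = 1 for every instance
    # lstsq solves the least-squares problem more stably than inverting X^T X
    b, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)
print(ols(X, y))   # approximately [ 3.  2. -1.]
```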
Multiple Regression
Given two real-valued variables $X_1, X_2$, labelled with a real-valued variable $Y$, find the “plane of best fit” that captures the dependency of $Y$ on $X_1, X_2$.
Learning here is by minimizing MSE, i.e., the average of squared vertical distances of actual values of $Y$ from the learned function $\hat{Y} = \hat{f}(X)$.
Step back: Statistical Techniques for Data Analysis
Probability vs Statistics: The Difference
• Probability: reasons from populations to samples
• This is deductive reasoning, and is usually sound (in the logical sense of the word)
• Statistics: reasons from samples to populations
• This is inductive reasoning, and is usually unsound (in the logical sense of the word)
Statistical Analyses
• Statistical analyses usually involve one of 3 things:
1 The study of populations;
2 The study of variation; and
3 Techniques for data abstraction and data reduction
• Statistical analysis is more than statistical computation:
1 What is the question to be answered?
2 Can it be quantitative (i.e., can we make measurements about it)?
3 How do we collect data?
4 What can the data tell us?
Sampling
Where do the Data come from? (Sampling)
• For groups (populations) that are fairly homogeneous, we do not need to collect a lot of data. (We do not need to sip a cup of tea several times to decide that it is too hot.)
• For populations which have irregularities, we will need to either take measurements of the entire group, or find some way of getting a good idea of the population without having to do so
• Sampling is a way to draw conclusions about the population without having to measure all of the population. The conclusions need not be completely accurate
• All this is possible if the sample closely resembles the population about which we are trying to draw some conclusions
What We Want From a Sampling Method
• No systematic bias, or at least no bias that we cannot account for in our calculations
• The chance of obtaining an unrepresentative sample can be calculated. (So, if this chance is high, we can choose not to draw any conclusions.)
• The chance of obtaining an unrepresentative sample decreases with the size of the sample
The inductive learning hypothesis
Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.¹
¹ T. Mitchell (1997) “Machine Learning”
Estimation
Estimation from a Sample
• Estimating some aspect of the population using a sample is a common task. Along with the estimate, we also want to have some idea of the accuracy of the estimate (usually expressed in terms of confidence limits)
• Some measures calculated from the sample are very good estimates of corresponding population values. For example, the sample mean m is a very good estimate of the population mean μ. But this is not always the case. For example, the range of a sample usually under-estimates the range of the population
• We will have to clarify what is meant by a “good estimate”. One meaning is that an estimator is correct on average. For example, on average, the mean of a sample is a good estimator of the mean of the population
• For example, when a number of samples are drawn and the mean of each is found, then the average of these means is equal to the population mean
• Such an estimator is said to be statistically unbiased
Sample Estimates of the Mean and the Spread I
Mean. This is calculated as follows.
• Find the total T of N observations. Estimate the (arithmetic) mean by m = T/N.
• This works very well when the data follow a symmetric bell-shaped frequency distribution (of the kind modelled by “normal” distribution)
• A simple mathematical expression of this is $m = \frac{1}{N} \sum_i x_i$, where the observations are $x_1, x_2, \ldots, x_N$
• If we can group the data so that the observation $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times and so on, then the mean is calculated as
$$m = \frac{\sum_i x_i f_i}{\sum_i f_i}$$
Sample Estimates of the Mean and the Spread II
• If, instead of frequencies, you had relative frequencies (i.e. instead of $f_i$ you had $p_i = f_i/N$), then the mean is simply the observations weighted by relative frequency. That is, $m = \sum_i x_i p_i$
• We want to connect this up to computing the mean value of observations modelled by some theoretical probability distribution function. That is, we want a similar counting method for calculating the mean of random variables modelled using some known distribution
Sample Estimates of the Mean and the Spread III
• Correctly, this is the mean value of the values of the random variable function. But this is a bit cumbersome, so we will just say the “mean value of the r.v.” For discrete r.v.’s this is:
$$E(X) = \sum_i x_i \, p(X = x_i)$$
Variance. This is calculated as follows:
• Calculate the total $T$ and the sum of squares of the $N$ observations. The estimate of the standard deviation is
$$s = \sqrt{\frac{1}{N-1} \sum_i (x_i - m)^2}$$
• Again, this is a very good estimate when the data are modelled by a normal distribution
Sample Estimates of the Mean and the Spread IV
• For grouped data, this is modified to
$$s = \sqrt{\frac{1}{N-1} \sum_i (x_i - m)^2 f_i}$$
• Again, we have a similar formula in terms of expected values, for the scatter (spread) of values of an r.v. $X$ around a mean value $E(X)$:
$$Var(X) = E((X - E(X))^2) = E(X^2) - [E(X)]^2$$
• You can remember this as “the mean of the squares minus the square of the mean”
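A small numerical check of these estimators (the data below are invented for illustration):

```python
# Sketch: sample mean and spread, plain and grouped, plus the variance identity.
import numpy as np

x = np.array([2.0, 2.0, 3.0, 5.0, 5.0, 5.0])
print(x.mean(), x.std(ddof=1))        # m = T/N and s with the N-1 denominator

# Grouped form: value x_i occurring with frequency f_i gives the same answers.
vals, freqs = np.array([2.0, 3.0, 5.0]), np.array([2, 1, 3])
m = (vals * freqs).sum() / freqs.sum()
s = np.sqrt(((vals - m) ** 2 * freqs).sum() / (freqs.sum() - 1))
print(m, s)

# Discrete r.v. with probabilities p_i: Var(X) = E(X^2) - [E(X)]^2.
p = freqs / freqs.sum()
EX, EX2 = (vals * p).sum(), (vals ** 2 * p).sum()
print(EX2 - EX ** 2)
```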
Covariance and Correlation
Correlation I
• The correlation coefficient is a number between -1 and +1 that indicates whether a pair of variables x and y are associated or not, and whether the scatter in the association is high or low
• A value near +1 indicates that high values of x are associated with high values of y, low values of x with low values of y, and scatter is low
• A value near 0 indicates that there is no particular association and that there is a large scatter associated with the values
• A value close to -1 suggests an inverse association between x and y
• Only appropriate when x and y are roughly linearly associated (doesn’t work well when the association is curved)
• The formula for computing correlation between x and y is:
$$r = \frac{cov(x, y)}{\sqrt{var(x)\,var(y)}}$$
This is sometimes also called Pearson’s correlation coefficient
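A sketch computing $r$ directly from these definitions, on made-up data:

```python
# Sketch: Pearson correlation from its definition.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9])

cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r = cov / np.sqrt(x.var(ddof=1) * y.var(ddof=1))
print(r)                          # close to +1: strong linear association
print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in gives the same value
```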
Correlation II
• The terms in the denominator are simply the standard deviations of x and y. But the numerator is different. This is the covariance, calculated as the average of the product of deviations from the mean:
$$cov(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
• What does “covariance” actually mean? Consider the sign of each term $(x_i - \bar{x})(y_i - \bar{y})$:
1 Case 1: $x_i > \bar{x}, y_i > \bar{y}$ (product positive)
2 Case 2: $x_i < \bar{x}, y_i < \bar{y}$ (product positive)
3 Case 3: $x_i < \bar{x}, y_i > \bar{y}$ (product negative)
4 Case 4: $x_i > \bar{x}, y_i < \bar{y}$ (product negative)
If pairs of the first two kinds dominate, the covariance is positive; if the last two dominate, it is negative.
Some further issues in learning linear regression models
• So: if all relevant variables are included, then we can assess the effect of each one in a controlled manner
Categoric Variables: X’s I
• “Indicator” variables are those that take on the values 0 or 1
• They are used to include the effects of categoric variables. For example, if D is a variable that takes the value 1 if a patient takes a drug and 0 if the patient does not. Suppose you want to know the effect of drug D on blood pressure Y keeping age (X) constant
$$\hat{Y} = 70 + 5D + 0.44X$$
So, taking the drug (a unit change in $D$) makes a difference of 5 units, provided age is held constant
• How do we capture any interaction effect between age and drug intake? Introduce a new indicator variable DX = D × X
$$\hat{Y} = 70 + 5D + 0.44X + 0.21\,DX$$
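A sketch of recovering such an equation by least squares; the data-generating coefficients below are made up to mirror the example:

```python
# Sketch: regression with an indicator D and an interaction column D*X.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(20, 70, n)        # age
D = rng.integers(0, 2, n)         # 1 if the patient takes the drug, else 0
Y = 70 + 5 * D + 0.44 * X + 0.21 * D * X + rng.normal(0.0, 1.0, n)

design = np.column_stack([np.ones(n), D, X, D * X])
b, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(b)   # approximately [70, 5, 0.44, 0.21]
```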
Categoric Values: Y values
• Sometimes, Y values are simply one of two values (let’s call them 0 and 1)
• We can’t use the regression model as we described earlier, in which the Y ’s can take any real value
• But, we can define a new linear regression model which predicts not the value of $Y$, but what are called the log odds of $Y$:
$$\text{log odds of } Y = \text{Odds} = b_0 + b_1 X_1 + \cdots + b_n X_n$$
• Once Odds are estimated, they can be used to calculate the probability of $Y$:
$$Pr(Y = 1) = \frac{e^{\text{Odds}}}{1 + e^{\text{Odds}}}$$
• We can then use the value of $Pr(Y = 1)$ to decide if $Y = 1$
• This procedure is called logistic regression (we’ll see this again)
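As a small sketch, converting an estimated log-odds value into a probability (the coefficients here are invented):

```python
# Sketch: from estimated log-odds to Pr(Y = 1).
import numpy as np

def prob_from_log_odds(odds):
    """Logistic transform: Pr(Y = 1) = e^odds / (1 + e^odds)."""
    return np.exp(odds) / (1.0 + np.exp(odds))

b0, b1 = -4.0, 0.08               # hypothetical fitted coefficients
odds = b0 + b1 * 60.0             # log-odds for an instance with X1 = 60
p = prob_from_log_odds(odds)
print(p, "-> predict Y = 1" if p > 0.5 else "-> predict Y = 0")
```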
Is the Model Appropriate?
• If there is no systematic pattern to the residuals (that is, approximately half of them are positive and half negative), then the line is a good fit
• It should also be the case that there should be no pattern to the residual scatter all along the line. If the average size of the residuals varies along the line (this condition is called heteroscedasticity) then the relationship is probably more complex than a straight line
• Residuals from a well-fitting line should show an approximate symmetric, bell-shaped frequency distribution with a mean of 0
Non-linear Relationships
A question: is it possible to do better than the line of best fit? Maybe.
Linear regression assumes that the $(x_i, y_i)$ examples in the data are “generated” by the true (but unknown) function $Y = f(X)$.
So any training set is a sample from the true distribution $E(Y) = f(X)$.
But what if $f$ is non-linear?
We may be able to reduce the mean squared error (MSE) value $\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$ by trying a different function.
• Some non-linear relationships can be captured in a linear model by a transformation (“trick”). For example, the curved model $\hat{Y} = b_0 + b_1 X_1 + b_2 X_1^2$ can be transformed by $X_2 = X_1^2$ into a linear model. This works for polynomial relationships (see the sketch after this list).
• Some other non-linear relationships may require more complicated transformations. For example, the relationship $Y = b_0 X_1^{b_1} X_2^{b_2}$ can be transformed into the linear relationship $\log(Y) = \log b_0 + b_1 \log X_1 + b_2 \log X_2$
• Other relationships cannot be transformed quite so easily, and will require full non-linear estimation (in subsequent topics in the ML course we will find out more about these)
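A sketch of the polynomial trick on invented data: add the column $X_2 = X_1^2$ and fit an ordinary linear model:

```python
# Sketch: a curved relationship captured by a linear model via X2 = X1^2.
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(-3.0, 3.0, 100)
y = 1.0 + 2.0 * x1 + 0.5 * x1 ** 2 + rng.normal(0.0, 0.1, 100)

design = np.column_stack([np.ones_like(x1), x1, x1 ** 2])   # [1, X1, X1^2]
b, *_ = np.linalg.lstsq(design, y, rcond=None)
print(b)   # approximately [1.0, 2.0, 0.5]
```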
Non-linear Relationships
• Main difficulty with non-linear relationships is choice of function
• How to learn?
• Can use a form of gradient descent to estimate the parameters
• After a point, almost any sufficiently complex mathematical function will do the job in a sufficiently small range
• Some kind of prior knowledge or theory is the only way to help here.
• Otherwise, it becomes a process of trial-and-error, in which case, beware of conclusions that can be drawn
Model Selection
• Suppose there are a lot of variables Xi, some of which may be representing products, powers, etc.
• Taking all the Xi will lead to an overly complex model. There are 3 ways to reduce complexity:
1 Subset-selection, by search over subset lattice. Each subset results in a new model, and the problem is one of model-selection
2 Shrinkage, or regularization of coefficients to zero, by optimization. There is a single model, and unimportant variables have near-zero coefficients.
3 Dimensionality-reduction, by projecting points into a lower dimensional space (this is different to subset-selection, and we will look at it later)
Model Selection as Search I
• The subsets of the set of possible variables form a lattice with $S_1 \cap S_2$ as the g.l.b. or meet and $S_1 \cup S_2$ as the l.u.b. or join
• Each subset refers to a model, and a pair of subsets are connected if they differ by just 1 element
• A lattice is a graph, and we know how to search a graph: $A^*$, greedy, randomised, etc.
• “Cost” of a node in the graph: MSE of the model. The parameters (coefficients) of the model can be found by least squares, as before
• Historically, model-selection for regression has been done using “forward-selection”, “backward-elimination”, or “stepwise” methods
• These are greedy search techniques that either: (a) start at the top of the subset lattice, and add variables; (b) start at the bottom of the subset lattice and remove variables; or (c) start at some interior point and proceed by adding or removing single variables (examining nodes connected to the node above or below)
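A minimal sketch of forward selection (my own illustration of the greedy search just described). It scores each candidate subset on a held-out split, since training MSE alone always improves as variables are added; the $R^2$-change criterion on the next slide plays a similar role:

```python
# Sketch: greedy forward selection over the subset lattice.
import numpy as np

def val_mse(Xtr, ytr, Xva, yva, subset):
    """Fit OLS on the chosen columns (plus intercept); return validation MSE."""
    A = np.column_stack([np.ones(len(Xtr))] + [Xtr[:, j] for j in subset])
    B = np.column_stack([np.ones(len(Xva))] + [Xva[:, j] for j in subset])
    b, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return np.mean((yva - B @ b) ** 2)

def forward_select(Xtr, ytr, Xva, yva):
    remaining = set(range(Xtr.shape[1]))
    chosen, best = [], val_mse(Xtr, ytr, Xva, yva, [])
    while remaining:
        j, score = min(((j, val_mse(Xtr, ytr, Xva, yva, chosen + [j]))
                        for j in remaining), key=lambda t: t[1])
        if score >= best:          # no single addition helps: stop
            break
        chosen.append(j)
        remaining.remove(j)
        best = score
    return chosen, best
```

Backward elimination would run the same loop from the full set, removing one variable at a time.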
Model Selection as Search II
• Greedy selection is done on the basis of calculating the coefficient of determination (often denoted by $R^2$), which denotes the proportion of total variation in the dependent variable $Y$ that is explained by the model
• Given a model formed with a subset of variables $X$, it is possible to compute the observed change in $R^2$ due to the addition or deletion of some variable $x$
• This is used to greedily select the next best move in the graph-search
To set other hyper-parameters, such as the shrinkage parameter $\lambda$, one can use grid search
Prediction I
• It is possible to quantify what happens if the regression line is used for prediction:
• The intuition is this:
• Recall the regression line goes through the point of means $(\bar{X}, \bar{Y})$
Prediction II
• If the $X_i$ are slightly different, then the mean is not going to change much. So, the regression line stays somewhat “fixed” at $(\bar{X}, \bar{Y})$, but with a different slope
• With each different sample of the $X_i$ we will get a slightly different regression line
• The variation in $Y$ values is greater the further we move from $(\bar{X}, \bar{Y})$
• MORAL: Be careful when predicting far away from the centre value
• ANOTHER MORAL: The model only works under approximately the same conditions that held when collecting the data
Local (nearest-neighbour) regression
Local learning
• Related to the simplest form of learning: rote learning, or memorization
• Training instances are searched for the instance that most closely resembles the query or test instance
• The instances themselves represent the knowledge
• Called: nearest-neighbour, instance-based, memory-based or case-based learning; all forms of local learning
• The similarity or distance function defines “learning”, i.e., how to go beyond simple memorization
• Intuition — classify an instance similarly to examples “close by” — neighbours or exemplars
• A form of lazy learning – don’t need to build a model!
Nearest neighbour for numeric prediction
Store all training examples $\langle x_i, f(x_i) \rangle$.
Nearest neighbour:
• Given query instance $x_q$,
• first locate the nearest training example $x_n$,
• then estimate $\hat{y} = \hat{f}(x_q) = f(x_n)$
k-Nearest neighbour:
• Given $x_q$, take the mean of the $f$ values of the $k$ nearest neighbours:
$$\hat{y} = \hat{f}(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}$$
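A sketch of k-NN prediction on a toy one-dimensional dataset, using the Euclidean distance defined on the next slide:

```python
# Sketch: k-nearest-neighbour regression.
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Mean of the targets of the k training points closest to x_query."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.1, 1.1, 1.9, 3.2, 3.9])
print(knn_predict(X_train, y_train, np.array([2.4]), k=3))   # ~2.07
```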
Distance function
The distance function defines what is “learned”, i.e., predicted.
Instance $x_i$ is described by an $m$-vector of feature values:
$$\langle x_{i1}, x_{i2}, \ldots, x_{im} \rangle$$
where $x_{ik}$ denotes the value of the $k$th feature of $x_i$.
The most commonly used distance function is Euclidean distance, where the distance between two instances $x_i$ and $x_j$ is defined to be:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}$$
Local regression
Use kNN to form a local approximation to $f$ for each query point $x_q$, using a linear function of the form
$$\hat{f}(x) = b_0 + b_1 x_1 + \ldots + b_m x_m$$
where $x_i$ denotes the value of the $i$th feature of instance $x$.
Where does this linear regression model come from?
• fit a linear function to the $k$ nearest neighbours (see the sketch after this list)
• or a quadratic or higher-order polynomial . . .
• produces a “piecewise approximation” to $f$
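A sketch of the piecewise idea referenced above: fit an ordinary linear model to just the $k$ nearest neighbours of each query point:

```python
# Sketch: local linear regression, i.e. OLS on the k nearest neighbours only.
import numpy as np

def local_linear_predict(X_train, y_train, x_query, k=5):
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    idx = np.argsort(dists)[:k]                      # k nearest neighbours
    A = np.column_stack([np.ones(k), X_train[idx]])  # local design matrix
    b, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    return np.concatenate([[1.0], x_query]) @ b      # predict from the local fit

X_train = np.linspace(0.0, 6.0, 30).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
print(local_linear_predict(X_train, y_train, np.array([1.5])))  # ~sin(1.5)
```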
Summary
• Regression gives us a glimpse into many aspects of Machine Learning:
Terminology. Training data, test data, resubstitution error, prediction error (later lecture).
Conceptual. Learning as search, learning as optimisation, assumptions underlying a technique.
Implementation. Approximate alternatives to analytical methods.
Application. Overfitting, problems of prediction.
Each of these aspects will have counterparts in other kinds of machine learning
• Linear models are one way to predict numerical quantities
• Ordinal regression: predicting ranks (not in the lectures)
• Neural networks: non-linear regression models (later)
• Regression trees: piecewise regression models (later)
• Class-probability trees: predicting probabilities (later)
• Model trees: piecewise non-linear models (later)