Regression
COMP9417: Machine Learning & Data Mining Term 1, 2022
Adapted from slides by Dr Michael
• Lecturer in Charge:
• Course Admin:
o Ryan de Belen
o Zhizhou Ma
o Yiyang (Sophia)
Communication
• Course website: Moodle
• UNSW email for all correspondence
• Moodle forums
• Consult with your tutor in the time slots
• Consultation
Lectures & Tutorials
• Tutorials start in week 2 (No tutorials this week)
• There are no lectures or tutorials in week 6 (term break)
• Attend your allocated tutorial session
• The assignments and homework are in Python
• References: a list is provided in the course outline
• Note: Lecture slides do not cover every detail we go through in the classroom
• Most weeks will have both lab and tutorial activities.
• We provide pre-recorded videos for each tutorial at the end of each week.
• You are supposed to go through the tutorial and lab activities before attending the tutorial class. Your tutor will cover the more challenging questions, and there will be time for you to ask your questions.
• You have to hand in your work on a set of pre-specified questions every week. This counts towards a portion of your final mark. You will not be provided with detailed feedback; it is more an evaluation of effort than a check for correct answers.
Assessments
• Homework 0 (1%) – Due in week 1
• Homework 1 (7%) – Due in week 4
• Homework 2 (7%) – Due in week 7
• Tutorial work (5%)
• Assignment (group project) (30%)
– Due in week 10
• Final Exam (50%)
Late submission penalties will apply to all assessable work. All submissions will be checked for plagiarism.
Final Project
Two options:
– A Hackathon, working on some current/real world problem
– Working on a dataset we provide to you
• Groups of 4-5 students
• No need to be in the same tutorial
• You need to finalize your groups by Friday 11 March (end of week 4)
A Brief Course Introduction
This course will introduce you to machine learning, covering some of the core ideas, methods and theory currently used and understood by practitioners, including, but not limited to:
o Categories of learning (supervised / unsupervised learning, Neural Nets, etc.)
o Widely-used machine learning techniques and algorithms
o Parametric vs. non-parametric approaches
o Generalisation in machine learning
o Training, validation and testing phases in applications
o Learning Theory
What we will cover
• Core algorithms and model types in machine learning
• Foundational concepts regarding learning from data
• Relevant theory to inform and generalise understanding
• Practical applications
What we will NOT cover
• Probability and statistics
• Lots of neural nets and deep learning
• Reinforcement learning
• Big data
• Ethical aspects of AI and ML
although all of these are interesting and important topics!
Some definitions
The field of machine learning is concerned with the question of how to construct computer programs that automatically improve from experience.
“Machine Learning”. T. Mitchell (1997)
Machine learning, then, is about making computers modify or adapt their actions (whether these actions are making predictions, or controlling a robot) so that these actions get more accurate, where accuracy is measured by how well the chosen actions reflect the correct ones.
“Machine Learning”. S. Marsland (2015)
Some definitions
The term machine learning refers to the automated detection of meaningful patterns in data.
“Understanding Machine Learning”. S. Shalev-Shwartz and S. Ben-David (2014)
Trying to get programs to work in a reasonable way to predict stuff. R. Kohn (2015)
Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.
“Data Mining”. I. Witten et al. (2016)
Applications
• Object recognition / object classification
• Text classification (e.g., spam/non-spam emails)
• Speech recognition
• Event detection
• Recommender systems
• Human behaviour recognition (emotions, state of mind, etc.)
• Automatic medical diagnosis
Simple face recognition pipeline:
[figure: input image → learning algorithm → predicted label (“a man”)]
Machine learning pipeline
[figure: stages of a typical machine learning pipeline]
Supervised and unsupervised learning
Supervised and unsupervised learning
The most widely used categories of machine learning algorithms are:
o Supervised learning – output class (or label) is given
o Unsupervised learning – no output class is given
There are also hybrids, such as semi-supervised learning, and alternative strategies to acquire data, such as reinforcement learning and active learning.
Note: output class can be real-valued or discrete, scalar, vector, or other structure . . .
Supervised and unsupervised learning
Supervised learning tends to dominate in applications. Why?
Supervised and unsupervised learning
Supervised learning tends to dominate in applications because generally, it is much easier to define the problem and develop an error measure (loss function) to evaluate different algorithms, parameter settings, data transformations, etc. for supervised learning than for unsupervised learning.
Supervised and unsupervised learning
Unfortunately . . .
In the real world it is often difficult to obtain good labelled data in sufficient quantities
So, in such cases unsupervised learning is really what you want . . .
but currently, finding good unsupervised learning algorithms for complex machine learning tasks remains a research challenge.
This lecture (which is on regression) will introduce you to machine learning approaches to the problem of numerical prediction. Following it you should be able to reproduce theoretical results, outline algorithmic techniques and describe practical applications for the topics:
– supervised learning task of numeric prediction
– how linear regression solves the problem of numeric prediction
– fitting linear regression by least squares error criterion
– non-linear regression via linear-in-the-parameters models
– parameter estimation for regression
– local (nearest-neighbour) regression
Introduction to Regression
Introduction to Regression
The “most typical” machine learning approach is to apply supervised learning methods for classification, where the task is to learn a model to predict a discrete value for data instances . . .
. . . however, we often find tasks where the most natural representation is the prediction of numeric values. For those tasks, we need regression.
Introduction to Regression
Example – the task is to learn a model to predict CPU performance from a dataset of 209 different computer configurations:
[table: CPU dataset with attributes MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX and the measured performance]
Introduction to Regression
Result: a linear regression equation fitted to the CPU dataset.
PRP = −56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH − 0.270 CHMIN + 1.46 CHMAX
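To make the fitted equation concrete, here is a minimal sketch that evaluates it for one machine. The coefficients are taken from the slide (the −56.1 intercept follows the original CPU example of Witten et al.); the example configuration is hypothetical.

```python
# Minimal sketch: evaluate the fitted CPU-performance equation from the slide.
# Coefficients are from the slide; the machine's attribute values are made up.
coef = {
    "MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
    "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46,
}
intercept = -56.1  # reconstructed from Witten et al.'s CPU example

machine = {"MYCT": 480, "MMIN": 1000, "MMAX": 4000,
           "CACH": 8, "CHMIN": 1, "CHMAX": 5}  # hypothetical configuration

predicted_prp = intercept + sum(coef[k] * machine[k] for k in coef)
print(f"Predicted performance: {predicted_prp:.1f}")  # ≈ 18.5
```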
Learning Linear Regression Models
Linear Regression
In linear regression, we assume there is a linear relationship between the output and the feature(s)/input variable(s); this means the expected value of the output given an input, $E[y|x]$, is linear in the input $x$.
Example with one variable: $\hat{y} = bx + c$. Problem: given the data, estimate $b$ and $c$.
Linear regression
• Training instances: $(x_j, y_j)$, $j = 1, \dots, m$
• $x_j$ is an input vector
• $y_j$ is the output to be predicted
• Each input has $n$ attributes/features: $x_j = [x_{j1}, x_{j2}, \dots, x_{jn}]^T$
• $x$ and $y$ denote a general observation, with $x = [x_1, x_2, \dots, x_n]^T$
• $\theta = [\theta_0, \theta_1, \dots, \theta_n]^T$ is the parameter vector
• $\mathbb{y}$ is the vector of outputs: $\mathbb{y} = [y_1, \dots, y_m]^T$
• 𝑋 is the matrix of inputs where each row is one observation
Linear regression
• Linear regression can be used for numeric prediction from numeric attributes
• In linear models, the outcome is a linear combination of attributes:
$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = h(x)$
• Weights are estimated from the observed data (training data)
• Predicted value for the first training instance $x_1$ is:
$\hat{y}_1 = \theta_0 + \theta_1 x_{11} + \theta_2 x_{12} + \dots + \theta_n x_{1n} = \sum_{i=0}^{n} \theta_i x_{1i} = x_1^T \theta = h(x_1)$
To simplify the notation, we set $x_0 = 1$ and define $x = [x_0, x_1, \dots, x_n]^T$
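As a minimal illustration of computing $h(x) = x^T \theta$ with the $x_0 = 1$ convention (the parameter and feature values below are hypothetical):

```python
import numpy as np

theta = np.array([1.0, 2.0, -0.5])  # hypothetical [θ0, θ1, θ2]
x = np.array([1.0, 3.0, 4.0])       # x0 = 1 prepended to raw features [3.0, 4.0]

y_hat = x @ theta                   # h(x) = θ0 + θ1*x1 + θ2*x2
print(y_hat)                        # 1 + 6 - 2 = 5.0
```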
Linear regression
• Univariate regression: one input variable/feature/attribute is used
to predict one output variable
• Multiple regression / multivariable regression: more than one variable/feature/attribute is used to predict one output variable
Linear Regression
An infinite number of lines can be fitted, depending on how we define the best-fit criteria
…but the most popular estimation model is “Least Squares”, also known as “Ordinary Least Squares” (OLS) regression
Minimizing Error
• Error = Difference between the predicted value and the actual value
• We want to minimize the error over all samples!
• We define the total error as the sum of squared errors and search for the $n + 1$ parameters that minimize it
• Sum of squared errors for $m$ samples/observations:
$J(\theta) = \sum_{j=1}^{m} \left(y_j - \sum_{i=0}^{n} \theta_i x_{ji}\right)^2 = \sum_{j=1}^{m} (y_j - x_j^T \theta)^2 = (\mathbb{y} - X\theta)^T (\mathbb{y} - X\theta)$
$J(\theta)$ is called the cost function or loss function. This concept generalizes to all functions that measure the distance between the predicted and true output. (In the OLS framework, it is also called the Residual Sum of Squares (RSS).)
Minimizing Squared Error (normal equations)
Here is what matrix $X$ and vector $\mathbb{y}$ look like:
$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \qquad \mathbb{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$
You have to be careful to add a column of ones in $X$ to account for the intercept parameter.
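A minimal sketch of building such a design matrix and evaluating the cost $J(\theta)$, using hypothetical data:

```python
import numpy as np

X_raw = np.array([[2.0, 1.0],
                  [3.0, 4.0],
                  [5.0, 2.0]])      # m = 3 observations, n = 2 features
y = np.array([3.0, 8.0, 7.0])

# Prepend the column of ones that accounts for the intercept parameter.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

def sse_cost(theta, X, y):
    """J(θ) = (y - Xθ)^T (y - Xθ), the residual sum of squares."""
    r = y - X @ theta
    return r @ r

print(sse_cost(np.zeros(X.shape[1]), X, y))  # cost of the all-zero model
```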
Minimizing Squared Error
This cost function can also be written as:
$J(\theta) = \frac{1}{m} \sum_{j=1}^{m} (y_j - x_j^T \theta)^2$
which is the average of the squared errors; minimizing it leads to the same $\theta$ that we get with $J(\theta)$ without taking the average.
The quantity $\frac{1}{m} \sum_{j=1}^{m} (y_j - x_j^T \theta)^2$ is called the mean squared error, or briefly MSE.
Minimizing Squared Error
Multiple Regression
Multiple regrGeisvesnio2nreealx-vamluepdlvea:riables X1, X2, labelled with a real-valued variable
Y , find “plane of best fit” that captures the dependency of Y on X , X .
Learning Linear Regression Models
Given 2 variables/attributes with real values 𝑥 , 𝑥 ∈ R, labelled with a real- !”
valued variable 𝑦, find “plane of best fit” that captures the dependency of 𝑦 on 𝑥! and 𝑥”.
Again we can use MSE to find the best plane that fits the data.
For more than two variables, we find the best hyper-plane.
Learning here is by minimizing MSE, i.e., average of squared vertical
ˆˆ distances of actual values of Y from the learned function Y = f(X).
COMP9417 T1, 2022
COMP9417 ML & DM Regression Term 2, 2019
Least Squares Regression
Gradient Descent
Gradient descent starts with some initial $\theta$, and repeatedly performs the update:
$\theta_i^{(t+1)} := \theta_i^{(t)} - \alpha \frac{\partial}{\partial \theta_i} J(\theta^{(t)})$
$\alpha$ is the learning rate (a.k.a. step size). This algorithm takes a step in the direction of the steepest decrease in $J(\theta)$.
To implement this algorithm, we have to work out the partial derivative term for $J(\theta) = \frac{1}{m} \sum_{j=1}^{m} (y_j - x_j^T \theta)^2$
Gradient Descent
[figure: gradient descent illustration] From: “Gradient Descent: All you need to know”, by S. Suryansh
Gradient Descent
If we focus on only one sample out of the $m$ samples, then the cost function is:
$J(\theta) = (y_j - h_\theta(x_j))^2 = \left(y_j - \sum_{i=0}^{n} x_{ji} \theta_i\right)^2$
where $h_\theta(x_j) = \sum_{i=0}^{n} x_{ji} \theta_i = x_j^T \theta$
Taking the derivative gives:
$\frac{\partial}{\partial \theta_i} J(\theta) = -2 (y_j - h_\theta(x_j))\, x_{ji}$
So, for a single training sample, the update rule is:
$\theta_i := \theta_i + 2\alpha (y_j - h_\theta(x_j))\, x_{ji}$   (for every $i$)
The update rule for squared distance is called “Least Mean Squares” (LMS), also known as the Widrow-Hoff rule.
Gradient Descent
We looked at the LMS rule for only a single sample. But what about the other samples? There are two approaches: Batch Gradient Descent and Stochastic Gradient Descent. For the following cost/loss function:
$J(\theta) = \frac{1}{m} \sum_{j=1}^{m} (y_j - x_j^T \theta)^2$
1. Batch Gradient Descent:
$\theta_i^{(t+1)} = \theta_i^{(t)} + \alpha \frac{2}{m} \sum_{j=1}^{m} \left(y_j - h_{\theta^{(t)}}(x_j)\right) x_{ji}$   (for every $i$)
Replace the single-sample gradient with the sum of gradients over all samples and continue until convergence. Convergence means that the estimated $\theta$ stabilizes.
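A minimal sketch of batch gradient descent for this cost; the learning rate, iteration budget, and stopping tolerance below are illustrative choices, not prescribed values:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=10_000, tol=1e-8):
    """Minimize J(θ) = (1/m) Σ_j (y_j - x_j^T θ)^2 by batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        residuals = y - X @ theta                   # (y_j - h_θ(x_j)) for all j
        step = alpha * (2.0 / m) * (X.T @ residuals)
        theta += step
        if np.linalg.norm(step) < tol:              # θ has stabilized: converged
            break
    return theta

# Usage with the X, y from the earlier sketch (ones column already included):
# theta_hat = batch_gradient_descent(X, y)
```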
Gradient Descent
2. Stochastic Gradient Descent:
for $j = 1$ to $m$ {
    $\theta_i := \theta_i + 2\alpha (y_j - h_\theta(x_j))\, x_{ji}$   (for every $i$)
}
Repeat this algorithm until convergence. What does this algorithm do?
Gradient Descent
In stochastic gradient descent, $\theta$ gets updated for each sample separately. This algorithm is much less costly than batch gradient descent; however, it may never converge to the minimum.
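A minimal sketch of stochastic gradient descent under the same assumptions; visiting the samples in a fresh random order each epoch is one common variant, not the only one:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50, seed=0):
    """LMS / Widrow-Hoff updates: θ is adjusted after every single sample."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for j in rng.permutation(m):           # random sample order per epoch
            error = y[j] - X[j] @ theta        # (y_j - h_θ(x_j))
            theta += 2 * alpha * error * X[j]  # update every θ_i at once
    return theta
```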
Gradient Descent
In both algorithms:
– You can start with arbitrary (random) values for your parameters $\theta_i$ (initialization)
– You have to be careful with the selection of the learning rate $\alpha$: a small value can make the algorithm very slow, and a large value may stop the algorithm from converging
Minimizing Squared Error (normal equations)
Gradient descent, an iterative algorithm, is one way of minimizing our cost function $J(\theta)$.
But we can also find the minimum of $J(\theta)$ explicitly by taking its derivatives and setting them to zero. This is called the exact or closed-form solution:
$J(\theta) = (\mathbb{y} - X\theta)^T (\mathbb{y} - X\theta)$
$\frac{\partial J(\theta)}{\partial \theta} = -2 X^T (\mathbb{y} - X\theta) = 0$
$X^T (\mathbb{y} - X\theta) = 0$
$\theta = (X^T X)^{-1} X^T \mathbb{y}$
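A minimal sketch of the closed-form solution in Python; solving the linear system is the standard numerical choice over explicitly inverting $X^T X$:

```python
import numpy as np

def fit_ols(X, y):
    """Solve the normal equations (X^T X) θ = X^T y for θ."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq also works and handles the rank-deficient case:
# theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```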
Minimizing Squared Error (probabilistic interpretation)
We can write the relationship between input variable x and output variable y as:
𝑦’ =𝑥’*𝜃+𝜀’
And 𝜀’ is an error term which might be unmodeled effect or random noise. Let’s assume 𝜀’s are independent and identically distributed (𝑖. 𝑖. 𝑑.) according to a Gaussian distribution:
𝜀’~𝑁(0,𝜎”)
1 𝜀'” 𝑝𝜀’ = 2𝜋𝜎”exp(−2𝜎”)
Minimizing Squared Error (probabilistic interpretation)
This implies that:
$p(\varepsilon_j) = p(y_j \mid x_j; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_j - x_j^T \theta)^2}{2\sigma^2}\right)$
So we want to estimate $\theta$ such that we maximize the probability of output $y$ given input $x$ over all $m$ training samples:
$L(\theta) = p(\mathbb{y} \mid X; \theta)$   (this is called the likelihood function)
Minimizing Squared Error (probabilistic interpretation)
Since we assumed independence over the $\varepsilon_j$, that implicitly means that our training samples are also independent from each other. Based on this assumption, we can write $L(\theta)$ as follows:
$L(\theta) = \prod_{j=1}^{m} p(y_j \mid x_j; \theta) = \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_j - x_j^T \theta)^2}{2\sigma^2}\right)$
Now, we have to estimate $\theta$ such that $L(\theta)$ is as high as possible. This is called maximum likelihood.
Minimizing Squared Error (probabilistic interpretation)
We know (from math) that to find the $\theta$ that maximizes $L(\theta)$, we can equally maximize any strictly increasing function of $L(\theta)$. In this case, it is easier to maximize the log-likelihood $\ell(\theta)$:
$\ell(\theta) = \log L(\theta) = \log \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_j - x_j^T \theta)^2}{2\sigma^2}\right) = m \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_{j=1}^{m} (y_j - x_j^T \theta)^2$
So, maximizing $\ell(\theta)$ is equivalent to minimizing $\sum_{j=1}^{m} (y_j - x_j^T \theta)^2$.
Do you know what $\sum_{j=1}^{m} (y_j - x_j^T \theta)^2$ is?
Minimizing Squared Error (probabilistic interpretation)
• This simply shows that under certain assumptions ($\varepsilon_j \sim N(0, \sigma^2)$ and $i.i.d.$), least-squares regression is equivalent to finding the maximum likelihood estimate of $\theta$.
• Note that the value of 𝜎” does not affect the choice of 𝜃
Linear Regression Assumptions
• Linearity: The relationship between 𝑥 and the mean of 𝑦 is linear.
• Homoscedasticity: The variance of the residuals is the same for any value of $x$.
• Independence: Observations are independent of each other.
• Normality: For any fixed value of 𝑥, 𝑦 is normally distributed.
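These assumptions can be informally checked by inspecting the residuals of a fitted model; a minimal sketch (a rough diagnostic, not a formal statistical test):

```python
import numpy as np

def residual_summary(X, y, theta):
    """Print rough diagnostics of the linear-regression assumptions."""
    residuals = y - X @ theta
    print("mean residual (≈ 0 for an unbiased fit):", residuals.mean())
    print("residual variance:", residuals.var())
    # For homoscedasticity/normality, plot residuals against the fitted values
    # and look for constant spread, e.g.: plt.scatter(X @ theta, residuals)
```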
Step back: Statistical Techniques for Data Analysis
Probability vs Statistics: The Difference
• Probability: reasons from populations to samples
– This is deductive reasoning, and is usually sound (in the logical sense of the word)
• Statistics: reasons from samples to populations
– This is inductive reasoning, and is usually unsound (in the logical sense of the word)
Where do the Data come from? (Sampling)
• For groups (populations) that are fairly homogeneous, we do not need to collect a lot of data. (We do not need to sip a cup of tea several times to decide that it is too hot.)
• For populations which have irregularities, we will need to either take measurements of the entire group, or find some way of getting a good idea of the population without having to do so
• Sampling is a way to draw conclusions about the population without having to measure all of the population. The conclusions need not be completely accurate.
• All this is possible if the sample closely resembles the population about which we are trying to draw some conclusions
What We Want From a Sampling Method
• No systematic bias, or at least no bias that we cannot account for in our calculations
• The chance of obtaining an unrepresentative sample can be calculated. (So, if this chance is high, we can choose not to draw any conclusions.)
• The chance of obtaining an unrepresentative sample decreases with the size of the sample
Estimation
Estimation from a Sample
• In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.
• Estimating some aspect of the population using a sample is a common task, e.g. sample means are used to estimate population means
• Along with the estimate, we also want to have some idea of the accuracy of the estimate (usually expressed in terms of confidence limits)
• Some measures calculated from the sample are very good estimates of corresponding population values. For example, the sample mean 𝑚 is a very good estimate of the population mean 𝜇. But this is not always the case. For example, the range of a sample usually underestimates the range of the population
Estimation from a Sample
• We will have to clarify what is meant by a “good estimate”. One meaning is that an estimator is correct on average. For example, on average, the mean of a sample is a good estimator of the mean of the population
• For example, when a number of sample sets are drawn and the mean of each is found, the average of these means is equal to the population mean
• Such an estimator is said to be statistically unbiased
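A minimal simulation (with a hypothetical population) illustrating both claims: averaged over many samples, the sample mean matches the population mean, while the sample range systematically underestimates the population range:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50.0, scale=10.0, size=100_000)

sample_means, sample_ranges = [], []
for _ in range(2_000):
    sample = rng.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())
    sample_ranges.append(sample.max() - sample.min())

print("population mean:   ", population.mean())       # ≈ 50
print("avg. sample mean:  ", np.mean(sample_means))   # ≈ population mean
print("population range:  ", np.ptp(population))      # max - min
print("avg. sample range: ", np.mean(sample_ranges))  # noticeably smaller
```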
Estimates of the Mean and Variance
• The arithmetic mean of a sample is $m = \frac{1}{N} \sum_{i=1}^{N} x_i$, where the observations are $x_1, x_2, \dots, x_N$
• This works very well when the data follow a “normal” distribution
• If we can group the data so that observation $x_1$ occurs $f_1$ times, $x_2$ occurs $f_2$ times and so on, then the mean is calculated even more easily as $m = \frac{1}{N} \sum_i x_i f_i$
• If we define relative frequencies as $p_i = f_i / N$, then the mean is simply the observations weighted by relative frequency, that is $m = \sum_i x_i p_i$
• So, the expected value of a discrete random variable $X$ is:
$E[X] = \sum_i x_i\, p(X = x_i)$
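A minimal sketch of the frequency-weighted mean, matching $E[X] = \sum_i x_i\, p(X = x_i)$; the values and frequencies below are hypothetical:

```python
import numpy as np

values = np.array([1.0, 2.0, 5.0])  # distinct observed values x_i
freqs = np.array([3, 5, 2])         # f_i: how often each value occurs

N = freqs.sum()
p = freqs / N                       # relative frequencies p_i = f_i / N
mean = (values * p).sum()           # m = Σ_i x_i p_i
print(mean)                         # 2.3; same as np.repeat(values, freqs).mean()
```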
Estimates of the Mean and Variance
Variance: measures how far a set of random numbers is spread out from their average