Supervised learning – Introduction
School of Computing and Information Systems
@University of Melbourne 2022
Regression vs Classification
COMP20008 Elements of Data Processing 2
Classification – Example 1
Predicting disease from microarray data
[Figure: gene-expression (microarray) training records, each labelled with an outcome such as "Develop cancer < 1 year"]
Classification – Example 2
Animal classification
https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf
Classification – Example 3
Banking: classifying borrowers
Attributes: Home Owner, Marital Status, Annual Income; class label: Defaulted Borrower
Classification – Example 4
Detecting tax cheats
[Table: ten records (Tid 1–10) with categorical attributes, the continuous attribute Taxable Income, Marital Status, and a Yes/No class label column: No, No, No, No, Yes, No, No, Yes, No, Yes]
Classification: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes and one class label.
• Find a predictive model for the class label as a function of the values of all attributes, i.e., Y = f(X1, X2, ..., Xn)
• Y: discrete value, target variable
• X1, ..., Xn: attributes, predictors
• f: the predictive model (a tree, a rule, a mathematical formula)
• Goal: previously unseen records should be assigned a class as accurately as possible
• A test set is used to determine the accuracy of the model, i.e. the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
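The definition above can be sketched in code. Everything here is a toy illustration, not course material: the records are made up, and a 1-nearest-neighbour rule stands in for the predictive model f. The point is the shape of the task: fit on a training set, then measure accuracy on a held-out test set.

```python
# Toy classification sketch: learn f from a training set and measure
# accuracy on a held-out test set. All data are invented for illustration.

def euclidean(a, b):
    # Distance between two attribute vectors (X1, ..., Xn)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(train, x):
    # f(x): the class label of the closest training record
    return min(train, key=lambda rec: euclidean(rec[0], x))[1]

train = [((1.0, 1.0), "no"), ((1.2, 0.9), "no"),
         ((5.0, 5.0), "yes"), ((5.2, 4.8), "yes")]
test = [((1.1, 1.1), "no"), ((4.9, 5.1), "yes")]

correct = sum(predict(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(accuracy)  # 1.0 on this tiny test set
```

In practice you would use a library classifier rather than this hand-rolled rule; the train/test discipline is what carries over.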
Classification framework
Regression: Definition
• Given a collection of records (training set)
• Each record contains a set of attributes and one target variable.
• Find a predictive model for the target variable as a function of the values of all attributes, i.e., Y = f(X1, X2, ..., Xn)
• Y: continuous value, target variable
• X1, ..., Xn: attributes, predictors
• f: the predictive model (a tree, a rule, a mathematical formula)
• Goal: previously unseen records should be assigned a value as accurately as possible
• A test set is used to determine the accuracy of the model, i.e., the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
Regression example 1
Predicting ice-cream consumption from temperature: Y = f(X)?
Regression example 2
Predicting the activity level of a target gene
[Figure: gene-expression levels for persons 1..m used for training; the model predicts the target gene's activity for person m+1]
Regression – Linear regression
School of Computing and Information Systems
@University of Melbourne 2022
Learning Objectives
• How to use linear regression analysis to predict the value of a dependent variable based on independent variables
• Make inferences about the slope and correlation coefficient
• Evaluate the assumptions of regression analysis and know what to do
if the assumptions are violated
Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to predict or explain
• Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described by a linear function
• Changes in Y are assumed to be related to changes in X
Types of Relationships (1)
Linear relationships
Non-linear relationships
Types of Relationships (2)
Strong relationships
Weak relationships
No relationship
Simple Linear Regression Model
The simple linear regression equation provides an estimate of the population regression line
Yi = β0 + β1Xi + εi

where Yi is the dependent variable, Xi the independent variable, β0 the intercept, and β1 the slope coefficient; β0 + β1Xi is the linear component and εi is the random error component.
Simple Linear Regression Model (2)
Yi = β0 + β1Xi + εi

[Figure: the regression line with intercept β0 and slope β1; for a given Xi, the observed value of Y differs from the predicted value on the line by the random error εi]
Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between the observed Y and the predicted Ŷ:

min Σ (Yi − Ŷi)² = min Σ (Yi − (b0 + b1Xi))²
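The least-squares minimisation above has a closed-form solution: b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1·X̄. A minimal sketch on made-up data (the values are chosen to lie roughly on y = 2x):

```python
# Least-squares fit of b0, b1 via the closed-form formulas.
# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]  # roughly y = 2x

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)

# b1 = sum of cross-deviations / sum of squared x-deviations
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

print(round(b1, 2), round(b0, 2))  # 1.98 0.05
```

The fitted slope is close to 2 and the intercept close to 0, matching how the toy data were constructed.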
Interpretation of Slope and Intercept
• b0 is the estimated average value of Y when the value of X is zero (intercept)
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X (slope)
Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet
Sample Data for House Price Model
[Table: the 10 sampled houses, House Price ($1000s) vs. Square Feet]
Graphical Presentation
House price model: scatter plot
[Figure: scatter plot of Square Feet (1000–3000) vs. House Price ($1000s)]
Calculation Output

Regression Statistics
  Multiple R         0.76211
  R Square           0.58082
  Adjusted R Square  0.52842
  Standard Error     41.33032
  Observations       10

ANOVA
              df   SS           MS
  Regression   1   18934.9348   18934.9348
  Residual     8   13665.5652    1708.1957
  Total        9   32600.5000

               Coefficients   Lower 95%   Upper 95%
  Intercept       98.24833    -35.57720   232.07386
  Square Feet      0.10977      0.03374     0.18580

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
Graphical Presentation
House price model: scatter plot and regression line
[Figure: scatter plot of Square Feet vs. House Price ($1000s) with the fitted regression line; intercept = 98.248]
house price = 98.24833 + 0.10977 (square feet)
Interpretation of the Intercept b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of Y when the value of X is zero
• Here, no houses had 0 square feet, so b0 = 98.24833 ($1000) just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient b1
house price = 98.24833 + 0.10977 (square feet)
• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
• Here, b1 = .10977 tells us that the average value of a house increases by .10977 ($1000) = $109.77, on average, for each additional one square foot of size
Predictions using Regression
Predict the price for a house with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.) = 98.25 + 0.1098 (2000) = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
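The arithmetic above can be reproduced directly with the rounded coefficients from the slide:

```python
# Prediction with the fitted (rounded) coefficients from the slides.
b0, b1 = 98.25, 0.1098          # intercept and slope
sq_feet = 2000
price_thousands = b0 + b1 * sq_feet
print(round(price_thousands, 2))  # 317.85, i.e. $317,850
```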
Interpolation vs. Extrapolation
When using a regression model for prediction, only predict within the relevant range of data
[Figure: scatter plot of Square Feet (0–3000) vs. House Price ($1000s); a shaded band over the observed data marks the relevant range for interpolation. Source: Department of Statistics, ITS Surabaya]
Do not try to extrapolate beyond the range of observed X’s
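One way to honour this rule in code is to refuse predictions outside the observed range of X. A sketch using the house-price coefficients; the range bounds (1000–3000 ft²) are an assumption read off the scatter plot, not stated numbers:

```python
# Guarded prediction: interpolate only within the observed X range.
# X_MIN/X_MAX are assumed from the sample's 1000-3000 sq. ft. span.
X_MIN, X_MAX = 1000, 3000

def predict_price(sq_feet, b0=98.24833, b1=0.10977):
    if not (X_MIN <= sq_feet <= X_MAX):
        raise ValueError("outside relevant range; refusing to extrapolate")
    return b0 + b1 * sq_feet

print(round(predict_price(2000), 2))  # 317.79
```

A call like `predict_price(5000)` raises an error instead of silently extrapolating.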
Multiple Regression
• Multiple regression is an extension of simple linear regression
• It is used when we want to predict the value of a variable based
on the value of two or more other variables
• The variable we want to predict is called the dependent variable
• The variables we are using to predict the value of the dependent variable are called the independent variables
Multiple Regression Example
A researcher may be interested in the relationship between the weight of a car, the power of the engine, and petrol consumption.
• Independent Variable 1: weight
• Independent Variable 2: horsepower
• Dependent Variable: miles per gallon
Multiple Regression Fitting
• Linear regression is based on fitting a line as close as possible to the plotted coordinates of the data on a two-dimensional graph
• Multiple regression with two independent variables is based on fitting a plane as close as possible to the plotted coordinates of your data on a three-dimensional graph: Y = b0 + b1X1 + b2X2
• More independent variables extend this into higher dimensions
• The plane (or higher dimensional shape) will be placed so that it minimises the distance (sum of squared errors) to every data point
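The plane-fitting idea can be sketched by solving the normal equations (XᵀX)b = Xᵀy directly. Everything below is illustrative: the data are invented to lie exactly on the plane y = 1 + 2·x1 + 3·x2, and the tiny Gaussian-elimination solver stands in for the linear-algebra routine a real library would provide.

```python
# Fit Y = b0 + b1*X1 + b2*X2 by solving the normal equations.

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square system.
    n = len(A)
    M = [row[:] + [v] for row, v in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

# Made-up records lying exactly on the plane y = 1 + 2*x1 + 3*x2
rows = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (4.0, 2.0)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
X = [[1.0, x1, x2] for x1, x2 in rows]          # design matrix with intercept column

XtX = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(3)] for i in range(3)]
Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(3)]
b = solve(XtX, Xty)
print([round(v, 6) for v in b])  # [1.0, 2.0, 3.0]
```

Because the toy data are exactly planar, the least-squares plane recovers the generating coefficients.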
Multiple Regression Graphic
Multiple Regression Assumptions
• Multiple regression assumes that the independent variables are not highly correlated with each other
• Use scatter plots to check
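Alongside scatter plots, a quick numeric check is the pairwise Pearson correlation between candidate independent variables. A sketch with made-up car data echoing the earlier example (the weight and horsepower values are invented):

```python
# Pairwise Pearson correlation between two candidate predictors.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

weight = [1.2, 1.5, 1.8, 2.2, 2.5]       # made-up car weights (tonnes)
horsepower = [80, 100, 120, 150, 170]    # made-up engine power
r = pearson(weight, horsepower)
print(round(r, 3))  # close to 1: these two predictors are highly correlated
```

A value this close to 1 warns that including both predictors in the same multiple regression may violate the assumption above.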
Regression – Linear regression cont.
School of Computing and Information Systems
@University of Melbourne 2022
Measures of Variation (1)
Total variation is made up of two parts:

SST = SSR + SSE

(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)

SST = Σ(Yi − Ȳ)²   SSR = Σ(Ŷi − Ȳ)²   SSE = Σ(Yi − Ŷi)²

Ȳ = average value of the dependent variable
Yi = observed values of the dependent variable
Ŷi = predicted value of Y for the given Xi value
Measures of Variation (2)
§ SST = total sum of squares
§ Measures the variation of the Yi values around their mean, Ȳ
§ SSR = regression sum of squares
§ Explained variation attributable to the relationship
between X and Y
§ SSE = error sum of squares
§ Variation attributable to factors other than the
relationship between X and Y
Measures of Variation (3)
[Figure: SST, SSR and SSE shown as vertical distances on the scatter plot: SST = Σ(Yi − Ȳ)², SSR = Σ(Ŷi − Ȳ)², SSE = Σ(Yi − Ŷi)²]
Coefficient of Determination, r2
§ The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
§ The coefficient of determination is also called r-squared and is denoted as r2
r2 = SSR / SST = regression sum of squares / total sum of squares

0 ≤ r2 ≤ 1
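The variation measures and r² can be computed directly from a fitted line. A sketch on toy data (invented, close to y = 2x), using the closed-form least-squares fit:

```python
# Compute SST, SSR, SSE and r^2 = SSR/SST for a least-squares line.
# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

xbar, ybar = sum(xs) / 5, sum(ys) / 5
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]

sst = sum((y - ybar) ** 2 for y in ys)                 # total variation
ssr = sum((yh - ybar) ** 2 for yh in yhat)             # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))    # unexplained variation

r2 = ssr / sst          # SST = SSR + SSE holds up to rounding
print(round(r2, 4))  # 0.998
```

The near-linear toy data give an r² close to 1, i.e. almost all variation in Y is explained by variation in X.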
Examples of approximate r2 values
r2 = 1
Perfect linear relationship between X and Y:
100% of the variation in Y is explained by variation in X
Examples of approximate r2 values (2)
0 < r2 < 1
Weaker linear relationships between X and Y:
Some but not all of the variation in Y is explained by variation in X
Examples of approximate r2 values (3)
r2 = 0
No linear relationship between X and Y:
The value of Y does not depend on X.
Calculation Output

r2 = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

Regression Statistics
  Multiple R         0.76211
  R Square           0.58082
  Adjusted R Square  0.52842
  Standard Error     41.33032
  Observations       10

ANOVA
              df   SS           MS
  Regression   1   18934.9348   18934.9348
  Residual     8   13665.5652    1708.1957
  Total        9   32600.5000

               Coefficients  Standard Error  t Stat    Lower 95%   Upper 95%
  Intercept       98.24833        58.03348   1.69296   -35.57720   232.07386
  Square Feet      0.10977         0.03297   3.32938     0.03374     0.18580
Assumptions of Regression
• Linearity: The underlying relationship between X and Y is linear
• Independence of residuals: Residual values (also known as fitting errors) are statistically independent of one another
Residual Analysis
ei = Yi − Ŷi

The residual for observation i, ei, is the difference between the observed and predicted value of Y
• Check the assumptions of regression by examining the residuals
• Examine for linearity assumption
• Evaluate independence assumption
• Examine for constant variance for all levels of X
• Graphical analysis of residuals: plot residuals vs. X
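Before plotting, a basic sanity check: for a least-squares line fitted with an intercept, the residuals always sum to (numerically) zero, so any clearly non-zero sum signals a fitting bug. A sketch on invented data:

```python
# Residuals of a least-squares line with an intercept sum to ~0.
# Toy data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.1, 3.8]

xbar, ybar = sum(xs) / 4, sum(ys) / 4
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)  # True
```

The substantive checks (linearity, independence, constant variance) still require plotting the residuals against X, as the slides show next.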
Residual Analysis for Linearity
[Figure: residual-vs-X plots; a curved pattern indicates a non-linear relationship, a patternless band is consistent with linearity]
Residual Analysis for Independence
[Figure: residual-vs-X plots; a systematic pattern over X indicates non-independent residuals, a random scatter is consistent with independence]
Residual Analysis for Equal Variance
[Figure: residual-vs-X plots; a funnel shape indicates non-constant variance, an even band is consistent with constant variance]
House Price Residual Output
House Price Model Residual Plot
[Figure: residuals vs. Square Feet (1000–3000) for the house price model, scattered randomly around zero; table of predicted prices for the 10 houses omitted]
Does not appear to violate any regression assumptions
Avoiding the Pitfalls of Regression
• Start with a scatter plot of X vs. Y to observe possible relationships
• Perform residual analysis to check the assumptions: plot residuals vs. X to check for violations
• Avoid making predictions or forecasts outside the relevant range
• For multiple regression, remember the importance of the independence assumption on the independent variables
Classification – Decision Trees
School of Computing and Information Systems
@University of Melbourne 2022
Decision Tree example: Prediction of tax cheats
[Figure: the ten-record tax-cheat training data (categorical attributes, continuous Taxable Income, Yes/No class label) alongside a decision tree model fitted to it; internal nodes test the splitting attributes, e.g. MarSt with branches Married vs. Single, Divorced, and leaves assign the class]
Decision Tree example: Prediction of tax cheats 2
[Figure: the same ten training records shown with a second, different decision tree that also fits them]
There can be more than one tree that fits the same data!
Apply DT Model to Test Data
Start from the root of the tree. At each internal node, follow the branch matching the test record's attribute value (e.g. MarSt = Married vs. Single, Divorced) until a leaf node is reached.
[Figure sequence, slides 1–6: one test record routed step by step down the decision tree]
Assign Cheat to “No”
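Applying a decision tree is just a chain of attribute tests. The sketch below encodes a tree in the shape of the tax-cheat example; the exact branch order, the Taxable Income threshold (80, in $1000s) and the test record's attribute values are assumptions based on the standard textbook version of this example, not values stated on these slides.

```python
# Walking one test record down a hand-coded decision tree.
# Tree structure and threshold are assumed (textbook tax-cheat example).

def classify(record):
    if record["Refund"] == "Yes":          # root test
        return "No"
    if record["MarSt"] == "Married":       # second split
        return "No"
    # Single/Divorced: split on Taxable Income (threshold assumed: 80)
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # "No", matching the slide's final step
```

Each `if` corresponds to one internal node of the tree; the returned string is the class label at the leaf.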
Decision Trees
Decision tree
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute