COMP20008 Elements of Data Processing

Supervised learning – Introduction
School of Computing and Information Systems
@University of Melbourne 2022


Regression vs Classification

Classification – Example 1
Predicting disease from microarray data
[Figure: microarray training samples with class labels such as "Develop cancer < 1 year"]

Classification – Example 2
Animal classification
https://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf

Classification – Example 3
Banking: classifying borrowers
Attributes: Home Owner, Marital Status, Annual Income; class label: Defaulted Borrower

Classification – Example 4
Detecting tax cheats
Attributes include Marital Status (categorical) and Taxable Income (continuous); the class label is Cheat
[Table: 10 training records; the Cheat column reads No, No, No, No, Yes, No, No, Yes, No, Yes]

Classification: Definition
• Given a collection of records (the training set), where each record contains a set of attributes and one class label
• Find a predictive model for the class label as a function of the values of all attributes, i.e., y = f(x1, x2, ..., xn)
  – y: discrete value, target variable
  – x1, ..., xn: attributes, predictors
  – f: the predictive model (a tree, a rule, a mathematical formula)
• Goal: previously unseen records should be assigned a class as accurately as possible
• A test set is used to determine the accuracy of the model, i.e., the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it

Classification framework
[Figure: overall framework — learn a model from the training set, then apply it to the test set]

Regression: Definition
• Given a collection of records (the training set), where each record contains a set of attributes and one target variable
• Find a predictive model for the target variable as a function of the values of all attributes, i.e., y = f(x1, x2, ..., xn)
  – y: continuous value, target variable
  – x1, ..., xn: attributes, predictors
  – f: the predictive model (a tree, a rule, a mathematical formula)
• Goal: previously unseen records should be assigned a value as accurately as possible
• A test set is used to determine the accuracy of the model, i.e., the full data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it

Regression example 1
Predicting ice-cream consumption from temperature: y = f(x)?

Regression example 2
Predicting the activity level of a target gene for a new person (person m+1)

Regression – Linear regression
School of Computing and Information Systems
@University of Melbourne 2022

Learning Objectives
• How to use linear regression analysis to predict the value of a dependent variable based on independent variables
• Make inferences about the slope and correlation coefficient
• Evaluate the assumptions of regression analysis and know what to do if the assumptions are violated

Introduction to Regression Analysis
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
Here the dependent variable is the variable we wish to predict or explain, and an independent variable is a variable used to explain the dependent variable.

Simple Linear Regression Model
• Only one independent variable, X
• The relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X

Types of Relationships (1)
[Figures: linear vs. non-linear relationships]

Types of Relationships (2)
[Figures: strong, weak, and no relationships]
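The build-on-training, validate-on-test workflow described above can be sketched in a few lines of numpy. The data here are synthetic (a hypothetical true line y = 3 + 2x plus noise), purely to illustrate the framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x plus noise (true line: y = 3 + 2x).
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, 200)

# Divide the full data set into a training set (used to build the model)
# and a test set (used to validate it).
train, test = np.arange(150), np.arange(150, 200)

# Fit y = b0 + b1*x by least squares on the training set.
b1, b0 = np.polyfit(x[train], y[train], deg=1)

# Validate on the held-out test set.
pred = b0 + b1 * x[test]
rmse = np.sqrt(np.mean((y[test] - pred) ** 2))
```

With this setup the fitted slope and intercept land close to the true values of 2 and 3, and the test-set error stays near the noise level.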
Simple Linear Regression Model
The simple linear regression equation provides an estimate of the population regression line:

  Yi = β0 + β1 Xi + εi

where Yi is the dependent variable, Xi the independent variable, β0 the intercept, β1 the slope coefficient, and εi the error term; β0 + β1 Xi is the linear component and εi is the error component.

Simple Linear Regression Model (2)
[Figure: for a given Xi, the observed value of Y differs from the predicted value on the line by the error εi; intercept = β0, slope = β1]

Least Squares Method
b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between Y and Ŷ:

  min Σi (Yi − Ŷi)² = min Σi (Yi − (b0 + b1 Xi))²

Interpretation of Slope and Intercept
• b0 is the estimated average value of Y when the value of X is zero (intercept)
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X (slope)

Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet). A random sample of 10 houses is selected.
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet

Sample Data for House Price Model
[Table: square feet and house price ($1000s) for the 10 sampled houses]

Graphical Presentation
House price model: scatter plot of house price ($1000s) against square feet (roughly 1000–3000)

Calculation Output
The regression equation is:
  house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
  Multiple R          0.76211
  R Square            0.58082
  Adjusted R Square   0.52842
  Standard Error      41.33032
  Observations        10

ANOVA
               df   SS           MS
  Regression    1   18934.9348   18934.9348
  Residual      8   13665.5652    1708.1957
  Total         9   32600.5000

               Coefficient   Lower 95%   Upper 95%
  Intercept     98.24833     -35.57720   232.07386
  Square Feet    0.10977       0.03374     0.18580
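The least squares estimates can be computed directly from the formulas above. This is a minimal sketch on hypothetical toy data (the lecture's raw house-price values are not reproduced here):

```python
import numpy as np

# Hypothetical toy data, not the lecture's house-price sample.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# The same estimates via numpy's built-in least squares polynomial fit.
b1_check, b0_check = np.polyfit(x, y, deg=1)
```

Both routes give identical estimates, since np.polyfit also minimises the sum of squared differences.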
Graphical Presentation
House price model: scatter plot and regression line (intercept = 98.248)
  house price = 98.24833 + 0.10977 (square feet)

Interpretation of the Intercept b0
  house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of Y when the value of X is zero
• Here, no houses had 0 square feet, so b0 = 98.24833 ($1000s) just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

Interpretation of the Slope Coefficient b1
  house price = 98.24833 + 0.10977 (square feet)
• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
• Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 ($1000s) = $109.77, on average, for each additional square foot of size

Predictions using Regression
Predict the price for a house with 2000 square feet:
  house price = 98.25 + 0.1098 (sq. ft.) = 98.25 + 0.1098 (2000) = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850.

Interpolation vs. Extrapolation
When using a regression model for prediction, only predict within the relevant range of the data (here, roughly 1000–3000 square feet). Do not try to extrapolate beyond the range of observed X values.

Multiple Regression
• Multiple regression is an extension of simple linear regression
• It is used when we want to predict the value of a variable based on the values of two or more other variables
• The variable we want to predict is called the dependent variable
• The variables we use to predict it are called the independent variables

Multiple Regression Example
A researcher may be interested in the relationship between the weight of a car, the power of the engine, and petrol consumption.
• Independent variable 1: weight
• Independent variable 2: horsepower
• Dependent variable: miles per gallon

Multiple Regression Fitting
• Linear regression is based on fitting a line as close as possible to the plotted coordinates of the data on a two-dimensional graph
• Multiple regression with two independent variables is based on fitting a plane as close as possible to the plotted coordinates of the data on a three-dimensional graph: Y = b0 + b1 X1 + b2 X2
• More independent variables extend this into higher dimensions
• The plane (or higher-dimensional shape) is placed so that it minimises the distance (sum of squared errors) to every data point

Multiple Regression Graphic
[Figure: a regression plane fitted to three-dimensional data]

Multiple Regression Assumptions
• Multiple regression assumes that the independent variables are not highly correlated with each other
• Use scatter plots to check

Regression – Linear regression cont.
School of Computing and Information Systems
@University of Melbourne 2022

Measures of Variation (1)
Total variation is made up of two parts: SST = SSR + SSE

  SST = Σ (Yi − Ȳ)²   (Total Sum of Squares)
  SSR = Σ (Ŷi − Ȳ)²   (Regression Sum of Squares)
  SSE = Σ (Yi − Ŷi)²  (Error Sum of Squares)

where Ȳ is the average value of the dependent variable, Yi the observed values of the dependent variable, and Ŷi the predicted value of Y for the given Xi.

Measures of Variation (2)
• SST = total sum of squares: measures the variation of the Yi values around their mean
• SSR = regression sum of squares: explained variation attributable to the relationship between X and Y
• SSE = error sum of squares: variation attributable to factors other than the relationship between X and Y

Measures of Variation (3)
[Figure: for a single observation, the total deviation Yi − Ȳ splits into an explained part Ŷi − Ȳ and an unexplained part Yi − Ŷi]

Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
• It is also called r-squared and is denoted r²

  r² = SSR / SST = regression sum of squares / total sum of squares, with 0 ≤ r² ≤ 1

Examples of approximate r² values
r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X

Examples of approximate r² values (2)
0 < r² < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X

Examples of approximate r² values (3)
r² ≈ 0: no linear relationship between X and Y; the value of Y does not depend on X
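The SST = SSR + SSE decomposition and the resulting r² can be checked numerically. The data below are hypothetical, chosen only to show the identity holding for a least squares fit:

```python
import numpy as np

# Hypothetical data with a strong linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
y_bar = y.mean()

sst = np.sum((y - y_bar) ** 2)      # total variation of Y around its mean
ssr = np.sum((y_hat - y_bar) ** 2)  # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)      # residual (unexplained) variation

r2 = ssr / sst  # coefficient of determination
```

For simple linear regression, r² also equals the square of the correlation coefficient between X and Y, which gives an easy cross-check.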
Calculation Output
  r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082
58.08% of the variation in house prices is explained by variation in square feet.

Regression Statistics
  Multiple R          0.76211
  R Square            0.58082
  Adjusted R Square   0.52842
  Standard Error      41.33032
  Observations        10

ANOVA
               df   SS           MS
  Regression    1   18934.9348   18934.9348
  Residual      8   13665.5652    1708.1957
  Total         9   32600.5000

               Coefficient   Std Error   t Stat    Lower 95%   Upper 95%
  Intercept     98.24833     58.03348    1.69296   -35.57720   232.07386
  Square Feet    0.10977      0.03297    3.32938     0.03374     0.18580

Assumptions of Regression
• Linearity: the underlying relationship between X and Y is linear
• Independence of residuals: the residual values (also known as fitting errors) are statistically independent and sum to zero

Assumptions of Regression (2)
The residual for observation i, ei = Yi − Ŷi, is the difference between the observed and predicted value.
• Check the assumptions of regression by examining the residuals:
  – examine the linearity assumption
  – evaluate the independence assumption
  – examine for constant variance for all levels of X
• Graphical analysis of residuals: plot residuals vs. X

Residual Analysis for Linearity
[Figures: residual plots for a non-linear vs. a linear relationship]

Residual Analysis for Independence
[Figures: residual plots for dependent vs. independent residuals]

Residual Analysis for Equal Variance
[Figures: residual plots with non-constant vs. constant variance]

House Price Residual Output
[Figure: residuals of the house price model plotted against square feet (1000–3000); the plot does not appear to violate any regression assumptions]

Avoiding the Pitfalls of Regression
• Start with a scatter plot of X vs. Y to observe possible relationships
• Perform residual analysis to check the assumptions: plot residuals vs. X to check for violations
• Avoid making predictions or forecasts outside the relevant range
• For multiple regression, remember the importance of the independence assumption on the independent variables

Classification – Decision Trees
School of Computing and Information Systems
@University of Melbourne 2022

Decision Tree example: Prediction of tax cheats
[Figure: the 10-record tax-cheat training data (categorical and continuous attributes, class Cheat) alongside a decision tree model; one subtree splits on MarSt, with a "Single, Divorced" branch and leaves labelled NO]

Decision Tree example: Prediction of tax cheats (2)
[Figure: a second decision tree fitted to the same training data]
There can be more than one tree that fits the same data!

Applying the DT Model to Test Data
Start from the root of the tree and, at each internal node, follow the branch that matches the test record's attribute value until a leaf is reached; the class label at the leaf is then assigned to the record. For the example test record, Cheat is assigned to "No".

Decision Trees
A decision tree is a flow-chart-like tree structure:
• Internal node denotes a test on an attribute
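The root-to-leaf traversal described above can be sketched as a plain function. This is a hypothetical reconstruction of the tax-cheat tree: the first splitting attribute (here called refund) and the income threshold of 80 ($1000s) are assumptions, since the extract does not show the full tree.

```python
# Minimal sketch of applying a decision-tree model to a test record,
# mirroring the tax-cheat example. Hypothetical details: the root attribute
# name ("refund") and the income threshold (80, in $1000s) are assumed.
def predict_cheat(refund: str, marital_status: str, taxable_income: float) -> str:
    # Root split on a categorical yes/no attribute.
    if refund == "Yes":
        return "No"
    # Next split on MarSt: the Married branch is a "NO" leaf.
    if marital_status == "Married":
        return "No"
    # Single or Divorced branch: split on the continuous attribute.
    return "Yes" if taxable_income > 80 else "No"
```

In practice a library such as scikit-learn's DecisionTreeClassifier would learn these splits from the training records rather than hard-coding them.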