
Data mining
Prof. Dr. Matei Demetrescu Summer 2020
Statistics and Econometrics (CAU Kiel)

Today’s outline
Linear Regression
1 Linear regression
2 Adding flexibility
3 Further regression details
4 Up next


Linear regression
The simplest approach
Linear regression is a simple approach to supervised learning. It assumes that the dependence of E(Y |X) on X1, X2, . . . , Xp is linear.¹

Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically. It gives first answers to questions like:

- Is there some relation between the variables? Which of them is a potential predictor?
- How accurately can we predict prices?
- Is the relation linear?

And: it is interpretable, and easy to fit and predict with.

(You may complicate your life with other methods, but only if you must!)

¹ This is seldom true in practice; but, as we’ll see, it is a good start.

Linear regression
Real estate again

[Figure: scatterplot matrix of the real estate variables price, house.age, house.loc and nr.stores, with pairwise sample correlations – among them −0.67 for price and house.loc, 0.57 for price and nr.stores, −0.21 for price and house.age, and −0.60 for house.loc and nr.stores.]

Linear regression
Simple linear regression
Let’s assume a simple linear model (i.e. with one predictor),

Y = β0 + β1X + ε.

The error term ε has zero conditional mean if the model is true, but is in any case uncorrelated with X.

Given some estimates β̂0 and β̂1 of the model coefficients, we predict prices using

Ŷ = β̂0 + β̂1X.

Of course, Ŷ ≠ Y, so there are prediction errors.²

² In sample, we call predictions fitted values and errors residuals. OLS estimation minimizes the sum of squared residuals.
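(Not from the slides: a minimal numpy sketch of these formulas, with simulated data standing in for the real estate prices.)

```python
import numpy as np

# Simulated stand-in for the data: one predictor x, target y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 45.0 - 7.0 * x + rng.normal(scale=10.0, size=100)

# Closed-form OLS estimates:
# beta1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),
# beta0_hat = ybar - beta1_hat * xbar.
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# In-sample predictions are the fitted values; y - y_hat are the residuals,
# whose sum of squares OLS minimizes.
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat
print(beta0_hat, beta1_hat, np.sum(residuals ** 2))
```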

Linear regression
Assessing accuracy of coefficient estimates
We’ve seen that forecast errors depend on estimation error.
Standard errors reflect how estimators vary under repeated sampling. (You should know how they look for OLS estimation.)
These standard errors can be used to compute confidence intervals, e.g.

β̂1 ± 2 · SE(β̂1).

Recall that a 95% confidence interval “covers” the true unknown value of the parameter with 95% probability under repeated sampling. That is, there is approximately a 95% chance that the interval

[β̂1 − 2 · SE(β̂1), β̂1 + 2 · SE(β̂1)]

will contain the true value of β1 (regularity conditions assumed).
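(Again not from the slides: continuing the numpy sketch above with the textbook standard error of the slope; the factor 2 stands in for the normal/t quantile.)

```python
import numpy as np

def slope_ci(x, y):
    """OLS slope, its standard error, and the approximate 95% CI beta1_hat +/- 2*SE."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    beta0_hat = y.mean() - beta1_hat * x.mean()
    resid = y - beta0_hat - beta1_hat * x
    s2 = np.sum(resid ** 2) / (n - 2)   # unbiased residual variance estimate
    se = np.sqrt(s2 / sxx)              # SE(beta1_hat)
    return beta1_hat, se, (beta1_hat - 2 * se, beta1_hat + 2 * se)
```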

Linear regression
Hypothesis Testing I
Standard errors can also be used to perform hypothesis tests. E.g., we quite often test the null

H0 : there is no (linear) relation between X and Y

against the alternative

HA : there is some relation between X and Y.

This corresponds to testing

H0 : β1 = 0 against HA : β1 ≠ 0,

since if β1 = 0, the model reduces to Y = β0 + ε, and X is not (linearly) associated with Y.

Linear regression
Hypothesis Testing II
To test this null hypothesis, we compute the t-statistic

t = (β̂1 − 0) / SE(β̂1),

which is approximately standard normal (regularity conditions assumed).

Using statistical software, it is easy to compute the probability, under the null, of observing a value equal to |t| or larger, i.e. the p-value.
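(A sketch of this computation, reusing the slope_ci helper and the simulated x, y from the earlier sketches; scipy is assumed to be available. The normal approximation from the slide is used; software typically uses the t(n−2) distribution instead.)

```python
from scipy import stats

beta1_hat, se, _ = slope_ci(x, y)
t = (beta1_hat - 0.0) / se           # t-statistic for H0: beta1 = 0
p_value = 2 * stats.norm.sf(abs(t))  # two-sided p-value, normal approximation
print(t, p_value)
```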
Results for the real estate prices:

            Coefficient   Std. Error   t-statistic   p-value
intercept       45.8514       0.6526         70.26    < 10⁻⁴
Location        -7.2621       0.3925        -18.50    < 10⁻⁴

Linear regression
Assessing overall model accuracy

- Compute the Residual Standard Error – this gives an indication of prediction error.
- Compute R² – this quantifies the linear dependence and is scale-free.³

Real estate results:

Quantity                  Value
Residual Standard Error   10.07
R²                        0.4538
F-statistic               342.2

This is Prices on Location only – what about all available predictors?

³ For simple linear regression, R² = r², where r is the sample correlation of X and Y.

Linear regression
More inputs/features/predictors

Here our model is (finally!)

Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε.

We interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed. And of course we look for better predictions!

Given estimates β̂0, β̂1, . . . , β̂p, we predict

Ŷ = β̂0 + β̂1X1 + β̂2X2 + · · · + β̂pXp.

One typically uses OLS estimation (recall what this implies for the estimated quantity).

Linear regression
Interpreting regression coefficients

The ideal scenario is when the predictors are uncorrelated.⁴

- Each coefficient can be estimated and tested separately.
- Interpretations such as “a unit change in Xj is associated with a βj change in Y , while all the other variables stay fixed” are possible.
- In any case, claims of causality should be avoided for observational data.

Correlations amongst predictors cause interpretation problems:

- Omitted variable bias (though not a killer when focussing on prediction)
- Interpretations become hazardous – when Xj changes, the other features do not stay fixed on average.

⁴ This is called a balanced design.

Linear regression
Results for real estate data

            Coefficient   Std. Error   t-statistic   p-value
intercept       42.9772       1.3845        31.041    < 10⁻⁴
Age             -0.2529       0.0401        -6.305    < 10⁻⁴
Location        -5.3791       0.4530       -11.874    < 10⁻⁴
Nr.Stores        1.2974       0.1943         6.678    < 10⁻⁴

Follow-up questions:

1 Is at least one of the predictors X1, X2, . . . , Xp useful for prediction?
2 Do all the predictors help to explain Y , or is only a subset of the predictors useful? Which subset?
3 How well does the model fit the data?
4 How accurate is our prediction?

Linear regression
... and some (linear) answers

1 Use the F statistic, which is approx. χ²(p) distributed under the null;
2 Use individual t statistics (see model selection later on);
3 Compute R²;
4 Compute the residual standard error.

(More Q&A in due time.) For the real estate data:

Quantity                  Value
Residual Standard Error   9.251
R²                        0.5411
F-statistic               161.1

Adding flexibility
Discrete-valued inputs

Some predictors are not quantitative but qualitative, taking a discrete set of values. These are also called categorical predictors or factor variables. Don’t confuse them with count variables – these can be treated as usual (unless they’re targets).

The linearity assumption is usually understood as Y changing at a constant rate as X increases in value. But Y = β0 + β1 × category is not very meaningful, so we look at the marginal effect of other (metric) predictors for each category.
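(An aside, not from the slides: the dummy encoding used on the following slides can be produced with pandas; the tiny data frame below is made up.)

```python
import pandas as pd

df = pd.DataFrame({
    "balance": [512.0, 604.0, 333.0, 710.0],
    "gender":  ["Female", "Male", "Female", "Male"],
})

# One dummy per level except the baseline: k levels -> k - 1 dummies.
# drop_first=True drops the alphabetically first level, which becomes the baseline.
dummies = pd.get_dummies(df["gender"], drop_first=True, dtype=float)
print(dummies)
```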
Adding flexibility
Example: Credit rating data I

[Figure (James et al., Fig. 3.6): scatterplot matrix of the Credit data – balance, rating, cards, age, education, income and limit – for a number of potential customers.]

Four additional qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA) or Asian).

(Data available from the companion website of the James et al. textbook.)

Adding flexibility
Example: Credit rating data II

Investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable

Xi = { 1 if the ith person is female; 0 if the ith person is male }.

Resulting model:

Yi = β0 + β1Xi + εi = { β0 + β1 + εi if the ith person is female; β0 + εi if the ith person is male }.

Results for the gender model:

                 Coefficient   Std. Error   t-statistic   p-value
intercept             509.80        33.13        15.389    < 10⁻⁴
gender[Female]         19.73        46.05         0.429    0.6690

Adding flexibility
More than two categories/levels I

With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables. The first could be

Xi1 = { 1 if the ith person is Asian; 0 if the ith person is not Asian },

and the second could be

Xi2 = { 1 if the ith person is Caucasian; 0 if the ith person is not Caucasian }.

(And so on for more categories.)

Adding flexibility
More than two categories/levels II

Then both of these variables can be used in the regression equation:

Yi = β0 + β1Xi1 + β2Xi2 + εi = { β0 + β1 + εi if the ith person is Asian; β0 + β2 + εi if the ith person is Caucasian; β0 + εi if the ith person is AA }.

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable (here, AA) is the baseline.

Results for ethnicity:

                       Coef.   Std. Err.   t-stat.    p-val.
intercept             531.00       46.32    11.464    < 10⁻⁴
ethnicity[Asian]      -18.69       65.02    -0.287    0.7740
ethnicity[Caucasian]  -12.50       56.68    -0.221    0.8260

Adding flexibility
Interactions

In analysing the real estate data, we assumed additivity: the slope coefficient (i.e. effect) of Location does not depend on the other variables. This is what the linear model states!

Price = β̂0 + β̂1 · Age + β̂2 · Location + β̂3 · Nr.Stores

But what if Location has a different effect on prices depending on Age? Taking this into account would increase prediction accuracy. In marketing, this is known as a synergy effect (in statistics: interaction).

Adding flexibility
Modelling interactions – real estate data

The model now takes the form

Price = β0 + β1 · Age + β2 · Location + β3 · Nr.Stores + β4 · (Age · Location) + ε
      = β0 + β1 · Age + (β2 + β4 · Age) · Location + β3 · Nr.Stores + ε.

Results:

                 Coefficient   Std. Error   t-statistic   p-value
Intercept           45.20927      1.51872        29.768    < 10⁻⁴
Age                 -0.35166      0.04924        -7.141    < 10⁻⁴
Location            -8.24568      0.95964        -8.592    < 10⁻⁴
Nr.Stores            1.24911      0.19240         6.492    < 10⁻⁴
Age · Location       0.14428      0.04273         3.377    0.0008

The interpretation of the coefficients of both Age and Location changes.
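(A sketch, not from the slides, of fitting such an interaction model with the statsmodels formula interface; the data frame and its column names age, location, stores and price are hypothetical stand-ins for the real estate data.)

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the real estate data.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.uniform(0, 40, size=200),
    "location": rng.uniform(0, 7, size=200),
    "stores": rng.integers(0, 10, size=200).astype(float),
})
df["price"] = (45 - 0.35 * df["age"] - 8.2 * df["location"] + 1.25 * df["stores"]
               + 0.14 * df["age"] * df["location"] + rng.normal(scale=9, size=200))

# 'age:location' adds only the product term; 'age*location' would add both main
# effects and the interaction, in line with the hierarchy principle noted below.
res = smf.ols("price ~ age + location + stores + age:location", data=df).fit()
print(res.summary())
```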
For the sake of interpretability, always include all linear terms entering an interaction, even if insignificant (hierarchy principle).

Adding flexibility
Interactions between qualitative and quantitative variables

Consider the credit ratings data set again, and suppose that we wish to predict balance using income (quantitative) and student (qualitative). Without an interaction term, the model takes the form

balance ≈ β0 + β1 · income + { β2 if the person is a student; 0 if not }
        = β1 · income + { β0 + β2 if the person is a student; β0 if not }.

Adding flexibility
Slopes also vary between groups

With interactions, it takes the form

balance ≈ β0 + β1 · income + { β2 + β3 · income if student; 0 if not }.

[Figure (James et al.): balance vs. income for students and non-students in the Credit data. Left: no interaction between income and student (parallel fitted lines). Right: with the interaction, the two groups have different slopes.]

Adding flexibility
(Non-)Linearity and feature engineering

A linear relation will easily turn out to be too simplistic. (Use e.g. residual plots & model selection to judge on that.)

The linear regression model is linear in the predictors, ...

- ... but there’s no law stopping us from adding suitable features.
- Interaction terms are actually such artificial predictors ...
- ... and we recall experimenting with squares.

One may consider other transformations of the predictors⁵ in addition to (low degree!) polynomials.

⁵ Logs are very popular, but: when working with log Y rather than Y, OLS fits the conditional expectation of the logs, which is not the log of E(Y |X), and the interpretation of coefficients changes, of course.

Adding flexibility
All together

                       Coefficient   Std. Error   t-statistic   p-value
Intercept                49.321393     1.659967        29.712    < 10⁻⁴
Age                      -1.051326     0.146640        -7.169    < 10⁻⁴
Age²                      0.017914     0.003515         5.096    < 10⁻⁴
Location                 -6.408751     0.939301        -6.823    < 10⁻⁴
Nr.Stores                 1.580103     0.197932         7.983    < 10⁻⁴
Nr.Stores · Location     -1.178413     0.236816        -4.976    < 10⁻⁴
Age · Location            0.123380     0.040284         3.063    0.0023

We only kept significant coefficients; this is a form of feature selection coming on top of the feature engineering (squares, interactions).

Adding flexibility
Some standard diagnostics

[Figure: R’s standard diagnostic plots for the fitted model – Residuals vs Fitted, Normal Q–Q, Scale–Location, and Residuals vs Leverage with Cook’s distance contours; observations 271, 313 and 114 stand out.]
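(The panels shown come from R’s plot() method for fitted lm objects. Below is a rough Python analogue of the first two panels, a sketch reusing the hypothetical res object fitted above.)

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: look for curvature (nonlinearity) and fanning
# (non-constant error variance).
ax1.scatter(res.fittedvalues, res.resid, s=10)
ax1.axhline(0.0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax1.set_title("Residuals vs Fitted")

# Normal Q-Q plot of the residuals.
sm.qqplot(res.resid, line="45", fit=True, ax=ax2)
ax2.set_title("Normal Q-Q")
plt.tight_layout()
plt.show()
```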
Further regression details
Linearity is not the only assumption

What else could go wrong?

- correlation of errors
- non-constant variance of errors
- outliers
- high-leverage points
- collinearity

Further regression details
Correlation of error terms

What are correlated errors?

- The regression errors are no longer independent: εi provides information about εj.
- This is a form of model misspecification.

When should we expect correlated errors?

- time series data
- when there are hidden factors

Why are they bad?

- Correlated errors lead to biased estimation of the standard errors, ...
- ... which means confidence intervals are too small or too large.
- Fortunately, there is no bias in the slope parameter estimators (unless you have a dynamic model).

One may resort to robust standard errors (but it is better to treat the cause).

Further regression details
Non-constant error variance: Simulated data set

[Figure: two simulated scatterplots of Y against the predictor, one with constant and one with increasing error variance.]

Error variance may change with the predictor too:

- This also biases standard errors ...
- ... but not the LS estimators.

Further regression details
Outliers

Outliers are values far away from model predictions:

[Figure: simulated scatterplot with one outlying observation far from the fitted line.]

Remove if bad data, otherwise model! Alternatively, use robust regression methods such as Least Absolute Deviations estimation ... but beware of differences.

Further regression details
High-leverage points

- Outliers: unusual Y for usual X values.
- Leverage: unusual X values – a problem even if the corresponding Y is only just a bit outlying.

[Figure: simulated scatterplot with a high-leverage observation.]

Such points are worth taking a closer look at (they typically lie far away from the centroid of the X distribution).

Further regression details
(Multi-)Collinearity

High correlations among the regressors cause:

- reduced predictive power for certain combinations of feature values
- unstable estimates

How to detect?

- look at the correlations between the covariates
- look at the eigenvalues of X′X

Ways out?

- select a subset of predictors (model selection)
- put some constraints on β̂ (regularization & dimensionality reduction)

Up next
Coming up

(Almost) Linear classification