
Recap from Week 4

Clustering vs. Segmentation
Clustering Algorithms
Hierarchical
K-means

Hierarchical Clustering

FIGURE 5.2, Vidgen et al. 2019

K-means Clustering

FIGURE 5.3, Vidgen et al. 2019

Lecture
Outcomes

Lecture Outcomes
The learning outcomes from this week’s lecture are:
Specify a multiple linear regression model
Describe the output of a multiple linear regression model
Interpret the output of a multiple linear regression model in business terms
Assess whether the assumptions of multiple linear regression are met
Discuss the implications of ‘overfitting’ a predictive model
Use SAS VA to build a predictive model with multiple linear regression

Linear
Regression

Relationship between Variables

Linear Regression Model
Mathematical representation of a linear regression model:

$Y = \beta_0 + \beta_1 X + \varepsilon$

where $\beta_0$ is the Y intercept, $\beta_1$ is the gradient, and $\varepsilon$ is the error term.
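As a minimal illustration, the intercept and gradient can be estimated by ordinary least squares. The sketch below uses made-up advertising figures (the X and Y values are illustrative, not textbook data):

```python
import numpy as np

# Hypothetical data (illustrative only): X = ad spend, Y = sales
X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
Y = np.array([25.0, 33.0, 41.0, 52.0, 58.0])

# Ordinary least squares estimates of the gradient (beta_1)
# and the Y intercept (beta_0)
beta_1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta_0 = Y.mean() - beta_1 * X.mean()

Y_hat = beta_0 + beta_1 * X   # fitted values from the model
residuals = Y - Y_hat         # estimates of the error term
print(f"Y = {beta_0:.2f} + {beta_1:.2f} * X")
```

In practice SAS VA (or a library such as statsmodels) computes these estimates for you; the formulas are written out here only to make the intercept and gradient concrete.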

Linear Regression Example
Example:

FIGURE 6.2, Vidgen et al. 2019

Model
Evaluation

Model Interpretation

The F-test statistic allows us to determine whether the model, as a whole, is a significant improvement over simply using the average as a predictor.

In this case, the null hypothesis that the model is no better than using the mean is rejected with a probability of p < 0.0001.

Null Hypothesis and Significance Tests

In this example we want to test whether the model predicts sales better than simply using the mean.
From this hypothesis we can establish a null hypothesis: the model is no better than using the mean.
If the p-value is less than 0.05, the result is significant and we reject the null hypothesis.

P-Values

In reporting p-values we follow the standard convention that a p-value of less than:
0.05 is flagged ‘*’
0.01 is flagged ‘**’
0.001 is flagged ‘***’
representing the 5%, 1% and 0.1% levels of significance, respectively.

Coefficient Significance

A t-test is used to test whether a predictor is significant.
The null hypothesis is that the TV advertisement expenditure parameter is equal to zero, indicating that TV expenditure is not a meaningful predictor of the level of sales.
In this case, the TV parameter estimate is significant at p < 0.0001 and hence we can reject the null hypothesis.

Variance Explained and Goodness of Fit (R²)

R² is the proportion of the variation in the response variable (Y) that is explained by the explanatory variable (X).
It measures the strength of the linear relationship between the variables: the closer R² is to 1, the stronger the relationship.

Calculating R² Example

TABLE 6.2 & 6.4, Vidgen et al. 2019

Multiple
Regression

Multiple Regression

Mathematical representation of a multiple linear regression model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

where $\beta_0$ is the Y intercept, $\beta_1, \dots, \beta_k$ are the regression coefficients, and $\varepsilon$ is the error term.

Model Interpretation

TABLE 6.11, Vidgen et al. 2019
(A worked fitting and interpretation sketch appears at the end of this section.)

Model Assumptions

Given the following model:

$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon$

Here are the assumptions for this model:
The observations are independent.
The relationship is linear (i.e. the model is correctly specified).
The errors are normally distributed with mean zero.

Model Checks in SAS VA

SAS VA relies on two basic checks:
Error terms (residuals) are normally distributed.
There may be influential observations (such as outliers) that exercise an undue influence on the model.

Checking Residuals

Handling Non-Normal Residuals

Commonly used transformations that can be applied to a variable, X, to make its distribution more normal include:
Log transformation (log X)
Square root transformation (√X)
Reciprocal transformation (1/X)
By making the distribution more closely resemble a normal distribution, these transformations can also help to stabilize the variance.

Outliers

We examine the standardized residuals to identify potential outliers. Given a normal distribution we would expect, as a rule of thumb:
99.9% should lie within ±3.29. Any standardized residual with an absolute value greater than 3 is cause for concern, as this is unlikely to happen by chance.
99% should lie within ±2.58. If more than 1% of standardized residuals have an absolute value greater than 2.5, there is cause for concern.
95% should lie within ±1.96. If more than 5% of standardized residuals have an absolute value greater than 2, there is cause for concern.

Influential Cases

An observation is said to be influential if its removal substantially changes the estimates of the regression coefficients.
Leverage is a measure of how far an instance of an independent variable deviates from its mean. High-leverage points can have a big impact on the estimates of a model’s regression coefficients.
Cook’s distance (D) is a measure that gives an idea of the leverage of an observation. Cook’s distance values greater than 0.30 are generally accepted to indicate an influential observation.
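To make the model interpretation concrete, here is a sketch in Python using statsmodels, with simulated (not textbook) advertising data and made-up coefficient values. It reports the overall F-test, the per-coefficient t-test p-values, R², and an R² computed by hand as 1 − SSE/SST:

```python
import numpy as np
import statsmodels.api as sm

# Simulated advertising data (illustrative values, not the textbook's)
rng = np.random.default_rng(0)
n = 100
tv = rng.uniform(0, 300, n)     # TV advertising expenditure
radio = rng.uniform(0, 50, n)   # radio advertising expenditure
sales = 3.0 + 0.045 * tv + 0.19 * radio + rng.normal(0, 1.5, n)

X = sm.add_constant(np.column_stack([tv, radio]))  # adds the intercept column
model = sm.OLS(sales, X).fit()

print(model.fvalue, model.f_pvalue)  # F-test: model vs. mean-only predictor
print(model.params)                  # Y intercept and regression coefficients
print(model.pvalues)                 # t-test p-value for each coefficient
print(model.rsquared)                # proportion of variance in Y explained

# R-squared by hand: 1 - SSE/SST, matching model.rsquared
sse = np.sum(model.resid ** 2)
sst = np.sum((sales - sales.mean()) ** 2)
print(1 - sse / sst)
```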
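A companion sketch, again on simulated data, runs the residual and influence checks just described: standardized residuals against the ±1.96 / ±2.58 / 3 rules of thumb, and Cook's distance against the 0.30 threshold from the slides:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data with one deliberately injected outlier (illustrative only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 60)
y = 5.0 + 0.8 * x + rng.normal(0, 4, 60)
y[10] += 40  # inject an outlier

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()
std_resid = influence.resid_studentized_internal  # standardized residuals
cooks_d = influence.cooks_distance[0]             # Cook's D per observation

# Rules of thumb from the slides
print(f"{np.mean(np.abs(std_resid) > 1.96):.1%} beyond +/-1.96 (expect ~5%)")
print(f"{np.mean(np.abs(std_resid) > 2.58):.1%} beyond +/-2.58 (expect ~1%)")
print("Potential outliers (|std resid| > 3):",
      np.where(np.abs(std_resid) > 3)[0])
print("Influential (Cook's D > 0.30):", np.where(cooks_d > 0.30)[0])
```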
Extending the Regression Model

Categorical Variables

We can code categorical variables with a numeric value (e.g. a dummy variable that is 1 when an observation belongs to a category and 0 otherwise).

Interaction Effects

An interaction effect occurs when two variables have a joint impact that is greater than the sum of their parts.

Non-linear Relationships

Where the relationship between variables is non-linear we might want to add a quadratic term, giving a model of the form $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$. (A combined sketch of these three extensions appears after the summary below.)

Summary

Summary of Linear Regression

There will always be some error; a perfect fit is highly unlikely (and suspicious).
Remember that all models are wrong: some are more wrong than others, and some are more useful than others.
Steps can be taken to reduce the error, but one must be cautious of following the data too closely, as this can result in an overfitted model.
Ultimately, models must be fit for purpose!
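As flagged above, here is a combined sketch (simulated data, illustrative coefficient values) of the three extensions: a dummy-coded categorical variable, an interaction term, and a quadratic term:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data (illustrative coefficient values)
rng = np.random.default_rng(2)
n = 120
spend = rng.uniform(0, 100, n)   # continuous predictor, e.g. ad spend
online = rng.integers(0, 2, n)   # dummy-coded categorical: 1 = online campaign
sales = (10 + 0.5 * spend + 8 * online   # main effects
         + 0.1 * spend * online          # interaction effect
         - 0.002 * spend ** 2            # non-linear (quadratic) effect
         + rng.normal(0, 3, n))

X = np.column_stack([
    spend,           # linear term
    online,          # dummy variable for the categorical predictor
    spend * online,  # interaction term
    spend ** 2,      # quadratic term for the non-linear relationship
])
model = sm.OLS(sales, sm.add_constant(X)).fit()
print(model.params)  # intercept plus the four estimated coefficients
```

Each extension is just an extra column in the design matrix, so the model remains a linear regression in its parameters and everything in this lecture (F-tests, t-tests, R², residual checks) still applies.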