
DSME5110F: Statistical Analysis
Lecture 10
Linear Regression and Applications


• Multiple Linear Regression
  – Selecting Subsets of Predictors
  – Dummy Variables
  – Application: Predicting the Quality of Vintage
• Extensions of Linear Regression
  – Non-linearity, interaction effects, …
  – Log Transformation

Example 10.1: Predicting the Price of Used Toyota Corolla Cars
• A dataset:
  – Previous sales of used Toyota Corollas during the summer of 2004 at a retailer in the Netherlands
  – 1436 records
  – Variables (selected): …
• Goal: predict the price of a used Toyota Corolla car
  – Dependent variable (outcome): Price
  – Independent variable(s) (predictor(s)): the others

Adding More Predictors
• Models fit with successively larger predictor sets:
  – Age, Kilometers
  – Age, Kilometers, Horsepower
  – Age, Kilometers, Horsepower, Weight
  – All predictors (~ . in the R formula)
• Adding more predictors never decreases R² (it usually increases it)
• Diminishing returns as more predictors are added
• Questions:
  – Are these models good?
  – Which one is the best? How do we compare different models?

Multiple Coefficient of Determination
Multiple coefficient of determination (multiple r-squared): an important measure of the goodness of fit of the regression model
– Any model with a high coefficient of determination may contain independent variables that shouldn’t be included
The adjusted multiple coefficient of determination compensates for the number of independent variables
– the adjusted r-squared value corrects R-squared by penalizing models with more predictors.
– Calculated and interpreted in much the same way as the multiple coefficient of determination
– Models with more predictors always explain more variation, and hence the
multiple coefficient of determination is larger
– The adjusted R-squared is useful for comparing the performance of models with different numbers of predictors (an R check of this formula is sketched below):
  r²_adj = r² − k(1 − r²) / (n − k − 1)
  where k(1 − r²) / (n − k − 1) is the penalty for the number of predictors k, given n observations
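A minimal R sketch of this calculation, assuming fit is any model fitted with lm() (for example, fit <- lm(Price ~ Age_08_04 + KM, data = toyota), where toyota is a hypothetical data frame holding the Corolla data):

s  <- summary(fit)
r2 <- s$r.squared                # multiple r-squared
n  <- length(residuals(fit))     # number of observations
k  <- length(coef(fit)) - 1      # number of predictors
r2 - k * (1 - r2) / (n - k - 1)  # equals s$adj.r.squared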

Selecting Subsets of Predictors
• Not all available predictors should be used: more independent variables = more (possible) problems
  – Each new predictor requires more data
  – Overfitting: the model may include variables that are unrelated to the dependent variable
  – Multicollinearity: two predictors are highly correlated
    • If there is multicollinearity in the model, two predictors represent essentially the same information
    • This may make the predictors seem insignificant when they should be significant
    • By removing just one of them, it can be discovered that the other predictor is significant in the model
• Two preferred qualities of independent variable sets:
  – The variables are linearly related to the dependent variable without being correlated with one another
  – The set of independent variables is smaller rather than larger (assuming the model has good predictive & explanatory power)

Selecting Subsets of Predictors
• How do we reduce the number of predictors to a minimum while still retaining as much information as possible? That is, how do we find the simplest model that performs sufficiently well?
  – Advanced methods (e.g., principal component analysis)
  – Best subsets regression / exhaustive search: also known as "all possible regressions" and "all possible models", a model selection approach that fits every combination of the predictor variables and then selects the best model according to some statistical criterion (e.g., adjusted coefficient of determination)
    • Computationally intensive, not feasible for big data
    • If we have n predictor variables, we have 2^n possible unique subsets. For example, if there are 10 potential predictors, then best subsets regression will need to fit 2^10 = 1024 models, including the model with only y and no predictor variable.
  – Automatic stepwise regression techniques (e.g., stepwise, backward elimination, and forward selection)

Selecting Subsets of Predictors
• Automatic stepwise regression techniques
  – No guarantee of finding the best subset under any criterion
  – Forward selection:
    • Start with a model with no predictors (one of the inputs)
    • At each step, add the predictor with the largest contribution to R²
    • Stop when the addition is not statistically significant
  – Backward elimination:
    • Start with a model with all predictors (one of the inputs)
    • At each step, eliminate the least useful predictor (according to statistical significance)
    • Stop when all remaining predictors are statistically significant
  – Stepwise regression:
    • Start with a model with all predictors (one of the inputs)
    • At each step, consider adding and dropping predictors: drop predictors that are not statistically significant, and add the predictor with the largest contribution to R²

Selecting Subsets of Predictors in R
• Best subsets regression/Exhaustive Search:
– use regsubsets() in package leaps: regsubsets(y ~ x1 + x2 + x3 + x5 + x6, data = dataset, nbest = 1)
• nbest = number of subsets of each size to record
• Automatic stepwise regression techniques:
– Forward selection:
• use step() to run forward regression: direction = “forward”
– Backward elimination:
• use step() to run backward regression: direction = “backward”
– Stepwise regression:
• use step() to run stepwise regression: direction = “both”
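A minimal sketch of both approaches on the Toyota Corolla data, assuming a data frame toyota with the column names used on these slides (Price, Age_08_04, KM, HP, Met_Color, Quarterly_Tax, Weight, Fuel_Type); note that step() ranks candidate models by AIC rather than by individual significance tests:

library(leaps)

# Best subsets (exhaustive) search, keeping the best model of each size
best_subsets <- regsubsets(Price ~ Age_08_04 + KM + HP + Met_Color +
                             Quarterly_Tax + Weight + Fuel_Type,
                           data = toyota, nbest = 1)
summary(best_subsets)$adjr2        # adjusted r-squared of each best model

# Automatic stepwise techniques with step()
full <- lm(Price ~ Age_08_04 + KM + HP + Met_Color + Quarterly_Tax +
             Weight + Fuel_Type, data = toyota)
null <- lm(Price ~ 1, data = toyota)
fwd  <- step(null, scope = formula(full), direction = "forward")
bwd  <- step(full, direction = "backward")
both <- step(full, direction = "both")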

Best Subsets Regression
• Run best subsets regression of Price on all other 10 variables; the results are given below.
  – The best 1-variable model: Age_08_04
  – The best 2-variable model: Age_08_04 + Weight
  – The best 3-variable model: Age_08_04 + Weight + KM
  – The best 4-variable model: Age_08_04 + Weight + KM + HP
  – The best 5-variable model: …

• Multiple Linear Regression
  – Selecting Subsets of Predictors
  – Dummy Variables
  – Application: Predicting the Quality of Vintage
• Extensions of Linear Regression
  – Non-linearity, interaction effects, …
  – Log Transformation

Modeling Qualitative Variables with Dummy Variables
• Sometimes, we may want to include “qualitative” or “categorical” variables that cannot be measured numerically in a regression model.
• To include categorical variables into a regression model, the lm() function automatically applies a technique known as dummy coding
• Dummy coding allows a nominal feature to be converted into 0-1 dummy variables, for each category of the feature
• One category is always left out to serve as the reference category.
  – If a categorical variable has k categories, then we need k − 1 dummy variables.
  – Leaving one category out also avoids multicollinearity.
  – By default, R uses the first level of the factor variable as the reference.
  – The estimates are then interpreted relative to the reference.

Modeling Qualitative Variables with Dummy Variables
• For example:
  – To model the variable "gender", which consists of two categories (male and female), we only need one dummy variable: 1 for female and 0 for male (or vice versa).
  – To model the variable "seasons", which consists of four levels (Quarter 1, Quarter 2, Quarter 3, and Quarter 4), we need 3 dummy variables, defined below. We don't need a dummy variable for Quarter 4 because it is represented by X1 = X2 = X3 = 0.
    X1 = 1 if it is Quarter 1, 0 otherwise; X2 = 1 if it is Quarter 2, 0 otherwise; X3 = 1 if it is Quarter 3, 0 otherwise
  – In the Toyota example, R automatically dummy-codes the three-category predictor Fuel_Type (levels CNG, Diesel, Petrol); see the sketch below:
    • Fuel_TypeDiesel and Fuel_TypePetrol enter the model; Fuel_TypeCNG is held out as the reference category
    • Interpretation: diesel cars are more expensive than CNG cars (the reference category), holding the other predictors constant
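A minimal sketch of how dummy coding can be inspected and controlled in R, assuming the Toyota data frame toyota has a Fuel_Type column with levels "CNG", "Diesel", and "Petrol":

toyota$Fuel_Type <- factor(toyota$Fuel_Type)

# lm() dummy-codes the factor automatically; the first level is the reference
fit <- lm(Price ~ Age_08_04 + KM + Fuel_Type, data = toyota)
summary(fit)                            # shows Fuel_TypeDiesel, Fuel_TypePetrol

# Inspect the dummy variables lm() would create
head(model.matrix(~ Fuel_Type, data = toyota))

# Choose a different reference category if desired
toyota$Fuel_Type <- relevel(toyota$Fuel_Type, ref = "Petrol")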

Comparing Methods
[Table comparing which of the candidate variables — Age_08_04, KM, HP, Met_Color, Quarterly_Tax, Weight, Fuel_TypeCNG, Fuel_TypeDiesel, Fuel_TypePetrol — are selected by Forward, Backward, and Both (using step) versus Exhaustive search (using leaps; *for 7 predictors).]

• Multiple Linear Regression
  – Selecting Subsets of Predictors
  – Dummy Variables
  – Application: Predicting the Quality of Vintage
• Extensions of Linear Regression
  – Non-linearity, interaction effects, …
  – Log Transformation

Example 10.2: Predicting the Quality of Wine
• Large differences in price and quality between years, although wine is produced in a similar way
• Meant to be aged, so hard to tell if wine will be good when it is on the market
• Expert tasters predict which ones will be good
• Orley Ashenfelter, a Princeton economics professor, claimed he could predict wine quality without tasting the wine (March 1990)
• Reaction from the world’s most influential wine expert:
– “Ashenfelter is an absolute total sham”
– “rather like a movie critic who never goes to see the movie but tells you how good it is based on the actors and the director”

Building a Model
• Outcome: typical price in 1990-1991 wine auctions (approximates quality)
• Predictors:
– Age – older wines are more expensive
– Weather:
• Average Growing Season Temperature (AGST)
• Harvest Rain
• Winter Rain

Multicollinearity

The Results
• The Final Model (now known as the Bordeaux equation)
– lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
• The world’s most influential wine expert:
– 1986 is “very good to sometimes exceptional”
• Ashenfelter:
– 1986 is mediocre
– 1989 will be “the wine of the century” and 1990 will be even better!
• In wine auctions,
  – 1989 sold for more than twice the price of 1986
  – 1990 sold for even higher prices!
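A minimal sketch of fitting and using the Bordeaux equation in R, assuming a data frame wine with the columns named in the lm() call above (Price, AGST, HarvestRain, WinterRain, Age); the values in new_vintage are purely illustrative:

bordeaux <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
summary(bordeaux)

# Predict the auction-price proxy for quality of a hypothetical vintage
new_vintage <- data.frame(AGST = 17, HarvestRain = 80,
                          WinterRain = 600, Age = 2)
predict(bordeaux, newdata = new_vintage)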

• Multiple Linear Regression
  – Selecting Subsets of Predictors
  – Dummy Variables
  – Application: Predicting the Quality of Vintage
• Extensions of Linear Regression
  – Non-linearity, interaction effects, …
  – Log Transformation

Extensions of Linear Regression
• Standard linear regression model: y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
• In reality, Y and X’s may not be related in the ways that are assumed previously. For examples:
– The relationship between Y and X may be better described by a curve than by a linear equation.
– The value of Y may not be affected independently by the value of each Xi. The effect of one unit increase in Xi on Y may also depend on the values of other independent variables.
– (Random errors may not have a constant variance. This will cause the fluctuation of y around the regression line to be dependent on E(y).)
• How do we build regression models that can take the above situations into account?
  – Interaction terms
  – Non-linear relationships, for example:
    • quadratic model, polynomial model

Example 10.3: Medical Expenses
• For a health insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care to its beneficiaries
• It is important to accurately forecast medical expenses for the insured population
• The problem is challenging because the costliest conditions are rare and seemingly random.
• Still, some conditions are more prevalent for certain segments of the population. For example,
– Lung cancer is more likely among smokers than non-smokers
– Heart disease may be more likely among the obese
• Objective: use patient data to estimate the average medical care
expenses for such population segments.
– Can be used to create actuarial tables that set the price of yearly premiums higher or lower, depending on the expected treatment costs

Data and Preliminary Analysis
• A simulated dataset containing hypothetical medical expenses for patients in the United States
• Simulated using demographic statistics from the US Census Bureau, so it approximately reflects real-world conditions

Exploring Relationships Among Variables: Multicollinearity
• Before fitting a regression model to data, it can be useful to determine how the independent variables are related to the dependent variable and each other
– Use a correlation matrix and a scatterplot matrix
– If the correlation of two independent variables is
large, there may be the problem of multicollinearity
– If the correlation of an independent variable and the dependent variable is 0, it does not mean they do not have any relationship

Correlations
• cor() works for numeric variables only
– Hence the nominal variables sex, smoker and
region cannot be included.
• Because the correlations are small, there is no multicollinearity issue
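A minimal sketch in R, assuming the data frame is called insurance and its numeric columns include age, bmi, children, and expenses (the exact column names, especially expenses, are assumptions about the file):

num_vars <- c("age", "bmi", "children", "expenses")
round(cor(insurance[num_vars]), 2)   # correlation matrix
pairs(insurance[num_vars])           # scatterplot matrix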

Improvements
• Adding Non-Linear Relationships
  – Add a higher-order term to account for a non-linear relationship (treating the model as a polynomial, e.g., y = β0 + β1x + β2x²)
  – insurance$age2 <- insurance$age^2
• Converting Numeric to Nominal
  – Some features may have an effect only after a specific threshold has been reached
  – One can create a binary indicator variable
    • For example, BMI may be related to medical expenditures only if it is above 30
    • insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
• Adding interaction effects
  – If certain features have a combined impact on the dependent variable
    • For example, smoking and obesity may have harmful effects separately, but their combined effect may be worse than the sum of each one alone
  – When two features have a combined effect, this is known as an interaction
  – An interaction should never be included in a model without also including each of the interacting variables, so one should create interactions using the * operator (see the sketch after this list)
    • bmi30*smoker is equivalent to bmi30 + smokeryes + bmi30:smokeryes, in which bmi30:smokeryes is the interaction
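A minimal sketch combining the three improvements above in one model, assuming the data frame insurance has columns expenses, age, children, bmi, sex, smoker, and region (the outcome name expenses and the exact predictor set are assumptions):

insurance$age2  <- insurance$age^2
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

improved <- lm(expenses ~ age + age2 + children + bmi + sex +
                 bmi30*smoker + region, data = insurance)
summary(improved)   # check adjusted R-squared and the bmi30:smokeryes term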

Suggested Steps for Regression
1. Collect sample data to fit a selected model with the least squares method.
– Select the best set of predictors
2. Model Validity Checks:
– The adjusted R² takes the number of predictors and the sample size into consideration.
– F Statistic is used to test “the overall usefulness of the model”, i.e., whether the dependent variable 𝑦𝑦 is linearly related to at least one of the independent variables.
– Individual t test is used to test whether the dependent variable 𝑦𝑦 is linearly related to an independent variable 𝑥𝑥𝑖𝑖.
3. Diagnostic Checks of the Regression Model:
– Checking Model Assumptions: Check whether regression residuals are
independently and normally distributed with a constant standard deviation.
– Checking Outliers: Check whether there are influential observations (outliers) in the data set that may have significant impact on the fitted regression model.
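A minimal sketch of the validity and diagnostic checks for a generic fitted model fit (for example, fit <- lm(y ~ x1 + x2, data = mydata)):

summary(fit)        # adjusted R-squared, overall F test, individual t tests

par(mfrow = c(2, 2))
plot(fit)           # residuals vs fitted, normal Q-Q, scale-location,
                    # and residuals vs leverage (influential observations)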

Example 10.4: Shipping Cost vs. Weight and Distance traveled
• How does shipping cost depend on the weight of the package and the distance traveled? Data collected from 20 packages can be found in the file: shipping_costs.csv.
[Data table: Cost, Weight, and Distance for the 20 packages]
• The variables are:
  – y = cost
  – x1 = weight (lbs)
  – x2 = distance (miles)
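A minimal R sketch for this example, assuming shipping_costs.csv has the columns Cost, Weight, and Distance; the two models it fits are discussed on the slides that follow:

shipping <- read.csv("shipping_costs.csv")

# Pure linear model (no interaction)
fit1 <- lm(Cost ~ Weight + Distance, data = shipping)
summary(fit1)
plot(fit1, which = 1)          # residuals vs fitted; look for a U-shape

# Model with the Weight x Distance interaction
fit2 <- lm(Cost ~ Weight * Distance, data = shipping)
summary(fit2)

# Predicted average cost for a 5 lb package shipped 100 miles
predict(fit2, newdata = data.frame(Weight = 5, Distance = 100))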

A Pure Linear Model
• Let’s start by fitting the following pure linear model without interaction
– y = β0 + β1x1 + β2x2
• F test is highly significant (p-value < 0.0001).
• With adjusted R² = 0.9063, the model fits the data quite well.
• Individual t tests indicate that all predictors are highly significant.
• Although the linear model fits the data quite well, U-shaped residual plots suggest that there may be a curved relationship between y and the predictors.

A Model with Interaction Term
• To model the curvature shown in the residual plots, we now try to fit the following model:
  – y = β0 + β1x1 + β2x2 + β3x1x2
• The regression equation:
  – Cost = -0.14 + 0.019 Weight + 0.0077 Distance + 0.0078 Weight*Distance

Discussions
• Once the interaction term has been deemed important in the regression model, do not conduct individual t-tests on the β coefficients of the first-order terms. These terms should be kept in the model regardless of the magnitude of their associated p-values shown on the printout.
  – By doing so, the model will be able to separate the independent effect from the interaction effect. If the independent effect is unimportant, it will be reflected by the estimated value of the coefficient βi.
• Shipping cost is significantly affected by the interaction of both variables.
  – A one-pound increase in weight will increase the shipping cost by $0.019 + 0.0078 Distance. So, the effect of a one-pound increase in weight on shipping cost is larger when the shipping distance is larger.
  – A one-mile increase in distance will increase the shipping cost by $0.0077 + 0.0078 Weight. So, the effect of a one-mile increase in distance on shipping cost is larger when the weight of the package is larger.
• Prediction: How much would the average shipping cost be for a 5 lb package that will be sent 100 miles away?

• Multiple Linear Regression
  – Selecting Subsets of Predictors
  – Dummy Variables
  – Application: Predicting the Quality of Vintage
• Extensions of Linear Regression
  – Non-linearity, interaction effects, …
  – Log Transformation

Multiplicative Models
• The two most commonly used multiplicative models:
  – y = e^(β0 + β1x1 + β2x2 + ⋯ + βkxk + ε)
  – y = β0 · x1^β1 · x2^β2 ⋯ xk^βk · ε
• Both are non-linear models. For both models, y = E(y) · ε. Thus, the variability of y depends on E(y). In other words, the random errors of both models do not have a constant variance.
• We can "linearize" the models by taking a log transformation:
  – ln(y) = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
  – ln(y) = ln(β0) + β1 ln(x1) + β2 ln(x2) + ⋯ + βk ln(xk) + ln(ε)
• After the log transformation, the variance of the random errors can usually be stabilized to a constant.

Interpretation of Coefficients in Multiplicative Models
• ln(y) = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
  – Each coefficient βi is interpreted as the "percentage change in y for a small change in xi".
  – For example, if the regression equation is ln(y) = 2 + 0.05x, then a one-unit increase in x will result in approximately a 5% increase in y.
• ln(y) = ln(β0) + β1 ln(x1) + β2 ln(x2) + ⋯ + βk ln(xk) + ln(ε)
  – Each coefficient βi is interpreted as the "percentage change in y for a small percentage change in xi".
  – For example, if the regression equation is ln(y) = 2 + 0.05 ln(x), then a one percent increase in x will result in approximately a 0.05% increase in y.

Example 10.5: Income vs. Spending
• How much money do people spend out of their income? The file "income_spending.csv" contains data collected from 31 people.
• Construct a regression model to describe the relationship between "Spending" and "Income".
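A minimal R sketch for Example 10.5, assuming income_spending.csv has the columns Income and Spending; the log-transformed model is discussed on the slides that follow:

spend <- read.csv("income_spending.csv")

# Simple linear model and its residual plot
fit_lin <- lm(Spending ~ Income, data = spend)
plot(fitted(fit_lin), resid(fit_lin))   # spread grows with fitted values

# Log-transformed model
fit_log <- lm(log(Spending) ~ Income, data = spend)
summary(fit_log)

# Predicted average spending at an income of $70,000:
# the prediction is on the log scale, so take exp() to return to dollars
exp(predict(fit_log, newdata = data.frame(Income = 70000)))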
A Linear Model
• We will start with a simple linear model first.
• Although the model is highly significant (p-value < 0.0001), the residual plot shows that the random errors do not have a constant variance. In fact, the variability of the random errors increases with the fitted value of y. This is because, as people's incomes increase, the amounts they spend tend to fluctuate more.
• The violation of the equal-variance assumption can lead to biased prediction interval estimation for y: the interval tends to be too wide for small x and too narrow for large x.

Log Transformed Model
• The problem of unequal variance of the random errors can often be solved by fitting a model of (natural) log transformed y against x:
  – ln(Spending) = β0 + β1 Income + ε
• After taking the log transformation of spending, the residual plot shows that the random errors of the model have been stabilized.

Discussions
• The regression output for the model ln(Spending) = β0 + β1 Income + ε is:
  – ln(Spending) = 8.05 + 0.000024 Income
  – So, a one-dollar increase in income will result in approximately a 0.0024% increase in spending.
• Both R² and adjusted R² are reasonably high, suggesting that the model fits the data very well.
• The global F test has a very small p-value, suggesting that the model is overall significant.
• How much would the average spending be for people whose income is $70,000?
• Note that the predicted value is in log(y). So, to get the predicted value for y, we need to take the anti-log (that is, exp) of the prediction.

Summary and Final Remarks
• Linear regression models are very popular tools.