SSC442
Lab 2: Linear Regression and Simple Analyses
This question will require you to analyze a new dataset. Specifically, this lab utilizes the ames.csv data found here. This is a famous dataset about home sales from Ames, Iowa.
Backstory and Set Up
You have been recently hired to Zillows Zestimate product team as a junior analyst. As a part of their regular hazing, they have given you access to a small subset of their historic sales data. Your job is to present some basic predictions for housing values in a small geographic area Ames, IA using this historical pricing.
First, lets load the data.
ameslist read.tablehttps:msudataanalytics.github.ioSSC442Labsdataames.csv,
header TRUE,
sep ,
Before we proceed, lets note a few things about the simple code above. First, we have specified header TRUE becauseyou guessed itthe original dataset has headers. Although simple, this is an incredibly important step because it allows R to do some smart R things. Specifically, once the headers are in, the variables are formatted as int and factor where appropriate. It is absolutely vital that we format the data correctly; otherwise, many R commands will whine at us.
Try it: Run the above, but instead specifying header FALSE. What data type are the various columns? Now try ommitting the line altogether. What is the default behavior of the read.table function?1
Data Exploration and Processing
We are not going to tell you anything about this data. This is intended to replicate a realworld experience that you will all encounter in the possibly near future: someone hands you data and youre expected to make sense of it. Fortunately for us, this data is somewhat selfcontained. Well first check the variable names to try to divine some information. Recall, we have a handy little function for that:
namesameslist
Note that, when doing data exploration, we will sometimes choose to not save our output. This is a judgement call; here weve chosen to merely inspect the variables rather than diving in.
Inspection yields some obvious truths. For example:
Variable
Explanation
Type
ID
Unique identifier for each row
int
LotArea
Size of lot units unknown
int
SalePrice
Sale price of house
int
but we face some notsoobvious things as well. For example:
Variable
Explanation
Type
LotShape
? Something about the lot
factor
MSSubClass
? No clue at all
int
Condition1
? Seems like street info
factor
It will be difficult to learn anything about the data that is of type int without outside documentation. However, we can learn something more about the factortype variables. In order to understand these a little better, we need to review some of the values that each take on.
Try it: Go through the variables in the dataset and make a note about your interpretation for each. Many will be obvious, but some require additional thought.
We now turn to another central issueand one that explains our nomenclature choice thus far: the data object is of type list. To verify this for yourself, check:
typeofameslist
This isnt idealfor some visualization packages, for instance, we need data frames and not lists. Well make a mental note of this as something to potentially clean up if we desire.
Although there are some variables that would be difficult to clean, there are a few that we can address with relative ease. Consider, for instance, the variable GarageType. This might not be that important, but, remember, the weather in Ames, IA is pretty crummya detached garage might be a dealbreaker for some wouldbe homebuyers. Lets inspect the values:
uniqueameslistGarageType
1 Attchd Detchd BuiltIn CarPort NA Basment 2Types
With this, we could make an informed decision and create a new variable. Lets create OutdoorGarage to indicate, say, homes that have any type of garage that requires the homeowner to walk outdoors after parking their car. For those who arent familiar with different garage types, a car port is not insulated and is therefore considered outdoors. A detached garage presumably requires that the person walks outside after parking. The three other types are inside the main structure, and 2Types we can assume includes at least one attached garage of some sort. This is going to require a bit more coding and we will have to think through each step carefully.
First, lets create a new object that has indicator variables that is, a variable whose values are either zero or one for each of the GarageType values. As with everything in R, theres a handy function to do this for us:
GarageTemp model.matrix GarageType 1, dataameslistGarageType
We now have two separate objects living in our computers memory: ameslist and GarageTempso named to indicate that it is a temporary object.2 We now need to stitch it back onto our original data; well use a simple concatenation and write over our old list with the new one:
ameslist cbindameslist, GarageTemp
Error in data.frame…, check.names FALSE :
arguments imply differing number of rows: 1460, 1379
Huh. Whats going on?
Try it: Figure out whats going on above. Fix this code so that you have a working version.
Now that weve got that working ha! we can generate a new variable for our outdoor garage. Well use a somewhat gross version below because it is verbose; that said, this can be easily accomplished using logical indexing for those who like that approach.
ameslistGarageOutside ifelseameslistGarageTypeDetchd 1 ameslistGarageTypeCarPort 1, 1, 0
uniqueameslistGarageOutside
1 0 1 NA
This seems to have worked. The command above ifelse does what it says: if some condition is met here, either of two variables equals one then it returns a one; else it returns a zero. Such functions are very handy, though as mentioned above, there are other ways of doing this. Also note, that while fixed the issue with NA above, weve got new issues: we definitely dont want NA outputted from this operation. Accordingly, were going to need to deal with it somehow.
Try it: Utilizing a similar approach to what you did above, fix this so that the only outputs are zero and one.
Generally speaking, this is a persistent issue, and you will spend an extraordinary amount of time dealing with missing data or data that does not encode a variable exactly as you want it. This is expecially true if you deal with realworld data: you will need to learn how to handle NAs. There are a number of fixes as always, Google is your friend and anything that works is good. But you should spend some time thinking about this and learning at least one approach.
Exercise 1:
1. Prune the data to all of the variables that are type int about which you have some reasonable intuition for what they mean. This must include the variable SalePrice. Save this new dataset as Ames. Produce documentation for this object in the form of a .txt file. This must describe each of the preserved variables, the values it can take e.g., can it be negative? and your interpretation of the variable.
2. Produce a scatterplot matrix which includes 12 of the variables that are type int in the data set. Choose those that you believe are likely to be correlated with SalePrice.3
3. Compute a matrix of correlations between these variables using the function cor. Does this match your prior beliefs? Briefly discuss the correlation between the miscellaneous variables and SalePrice.
4. Produce a scatterplot between SalePrice and GrLivArea. Use the abline function to plot the relationship that youve found in the simple linear regression.
What is the largest outlier that is above the regression line? Produce the other information about this house?
Bonus; more challenging Create a visualization that shows the proliferation of air conditioning in homes in Ames.
Building a Model
Were now ready to start playing with a model. We will start by using the lm function to fit a simple linear regression model, with SalePrice as the response and lstat as the predictor.
Recall that the basic lm syntax is lmyx,data, where y is the response, x is the predictor, and data is the data set in which these two variables are kept. Lets quickly run this with two variables:
lm.fit lmSalePrice GrLivArea
This yields: Error in evalexpr, envir, enclos : Object SalePrice not found
This command causes an error because R does not know where to find the variables. We can fix this by attaching the data:
attachAmes
lm.fit lmSalePrice GrLivArea
Try it: What is GrLivArea? If you did not include this variable above, check its relationship to other variables in the dataset to get a better idea what it is.
The next line tells R that the variables are in the object known as Ames. If you havent created this object yet as in Exercise 1 youll get an error at this stage. But once we attach Ames, the first line works fine because R now recognizes the variables. Alternatively, we could specify this within the lm call using data Ames. Weve presented this way because it may be new to you; choose whichever you find most reasonable.
If we type lm.fit, some basic information about the model is output. For more detailed information, we use summarylm.fit. This gives us pvalues and standard errors for the coefficients, as well as the R2 statistic and Fstatistic for the entire model.4
Utilizing these functions hels us see some interesting results. Note that we built nearly the simplest possible model:
But even on its own, this model is instructive. It suggest that an increase in overall living area of 1 ft 2 is correlated with an expected increase in sales price of 107. Note that we cannot make causal claims!
Saving the model as we did above is useful because we can explore other pieces of information it stores. Specifically, we can use the names function in order to find out what else is stored in lm.fit. Although we can extract these quan tities by namee.g. lm.fitcoefficientsit is safer to use the extractor functions like coef to access them. We can also use a handy tool like plot applied directly to lm.fit to see some interesting data that is automatically stored by the model.
Try it: Use plot to explore the model above. Do you suspect that some outliers have a large influence on the data? We will explore this point specifically in the future.
We can now go crazy adding variables to our model. Its as simple as appending them to the previous codethough you should be careful executing this, as it will overwrite your previous output:
lm.fit lmSalePrice GrLivArea LotArea
Try it: Does controlling for LotArea change the qualitative conclusions from the previous regression? What about the quantitative results? Does the direction of the change in the quantitative results make sense to you?
Exercise 2:
1. Use the lm function in a simple linear regression e.g., with only one predictor with SalePrice as the response to determine the value of an indoor garage.
2. Use the lm function to perform a multiple linear regression with SalePrice as the response and all other variables from your Ames data as the predictors. Use the summary function to print the results. Comment on the output. For instance:
Is there a relationship between the predictors and the response?
Which predictors appear to have a statistically significant relationship to the response?
What does the coefficient for the year variable suggest?
3. Use the plot function applied to your model to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
4. Use the and : symbols to fit linear regression models with some wellchosen interaction effects. Do any interactions appear to be statistically significant?
5. Try a few e.g., three different transformations of the variables, such as lnx, x2, sqrt x. Do any of these make sense to include in a model of SalePrice? Comment on your findings.
Bonus; very very challenging How might we build a model to estimate the elasticity of demand from this dataset?
1. Of course, you could find out the defaults of the function by simply using the handy ? command. Dont forget about this tool!
2. Its not exactly true that these objects are in memory. They are sort of. But how R handles memory is complicated and silly and blah blah who cares. Its basically in memory.
3. If you are not familiar with this type of visualization, consult the course textbook ISL, Chapters 2 and 3.
4. When we use the simple regression model with a single input, the Fstat includes the intercept term. Otherwise, it does not. See Lecture 5 for more detail.