Linear_Regression_housingPrices
Linear Regression¶
Predictive modeling, using data samples to make predictions about unobserved or future events, is a common data analytics task. Predictive modeling is considered to be a form of machine learning.
Linear regression is a technique for predicting a response/dependent variable based on one or more explanatory/independent variables, or features. The term “linear” refers to the fact that the method models data as a linear combination of explanatory variables.
Linear regression, in its simplest form, fits a straight line to the response variable data so that the line minimizes the squared differences (also called errors or residuals) between the actual observed response and the predicted point on the line. Since linear regression fits the observed data with a line, it is most effective when the response and the explanatory variable have an approximately linear relationship.
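As a minimal sketch of this idea on synthetic data (not the housing data used below), numpy's polyfit can compute the least squares slope and intercept directly; the variable names here are purely illustrative:
import numpy as np

# synthetic data: y is roughly 2*x + 5 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 5 + rng.normal(scale=1.0, size=100)

# a degree-1 polyfit returns the least squares slope and intercept
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)            # close to 2 and 5

# predict the response for a new explanatory value
print(intercept + slope * 7.5)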
Motivating Example: Housing Data¶
In this notebook, we will generate a model for predicting the sale price of houses, given information about the house features. To train this model we will be using housing data from the USA.
Install Relevant Libraries¶
#!pip install numpy
#!pip install pandas
#!pip install scikit-learn
#!pip install matplotlib
#!pip install seaborn
#!pip install statsmodels
Import Libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Needed to show plots inline for notebooks
%matplotlib inline
Load the data¶
First, let’s load the data into a pandas dataframe.
housingDF = pd.read_csv('USA_housing.csv')
housingDF.head()
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price Address
0 79545.458574 5.682861 7.009188 4.09 23086.800503 1.059034e+06 208 Apt. 674\nLaurabury, NE 3701…
1 79248.642455 6.002900 6.730821 3.09 40173.072174 1.505891e+06 188 Suite 079\nLake Kathleen, CA…
2 61287.067179 5.865890 8.512727 5.13 36882.159400 1.058988e+06 9127 \nDanieltown, WI 06482…
3 63345.240046 7.188236 5.586729 3.26 34310.242831 1.260617e+06 USS Barnett\nFPO AP 44820
4 59982.197226 5.040555 7.839388 4.23 26354.109472 6.309435e+05 USNS Raymond\nFPO AE 09386
housingDF.columns.values
array(['Avg. Area Income', 'Avg. Area House Age',
       'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms',
       'Area Population', 'Price', 'Address'], dtype=object)
What are the features?
Avg. Area Income: The average income of the residents of the city the house is located in.
Avg. Area House Age: The average age of houses in the same city
Avg. Area Number of Rooms: The average number of rooms for houses in the same city
Avg. Area Number of Bedrooms: The average number of bedrooms for houses in the same city
Area Population: The population of the city the house is located in
Address: The address of the house
What is the response?
Price: The price the house was/will be sold at
Shape of dataframe¶
Now let’s look at the shape of the dataframe:
# print the size of the DataFrame object, i.e., the size of the dataset
housingDF.shape
There are 5000 observations, 6 features and 1 target variable.
We can also get more information from the dataframe by using pandas info() and describe() functions.
housingDF.info()
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 Avg. Area Income 5000 non-null float64
1 Avg. Area House Age 5000 non-null float64
2 Avg. Area Number of Rooms 5000 non-null float64
3 Avg. Area Number of Bedrooms 5000 non-null float64
4 Area Population 5000 non-null float64
5 Price 5000 non-null float64
6 Address 5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
housingDF.describe()
Avg. Area Income Avg. Area House Age Avg. Area Number of Rooms Avg. Area Number of Bedrooms Area Population Price
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5.000000e+03
mean 68583.108984 5.977222 6.987792 3.981330 36163.516039 1.232073e+06
std 10657.991214 0.991456 1.005833 1.234137 9925.650114 3.531176e+05
min 17796.631190 2.644304 3.236194 2.000000 172.610686 1.593866e+04
25% 61480.562388 5.322283 6.299250 3.140000 29403.928702 9.975771e+05
50% 68804.286404 5.970429 7.002902 4.050000 36199.406689 1.232669e+06
75% 75783.338666 6.650808 7.665871 4.490000 42861.290769 1.471210e+06
max 107701.748378 9.519088 10.759588 6.500000 69621.713378 2.469066e+06
From the above, we can discern that there is no missing data (i.e., no null elements in any column). We can also see some basic statistics about the dataframe. For example, the lowest selling price of the houses in the dataframe is $\$$15,938.66, and the highest price is $\$$2,469,066.
Data Exploration¶
Now, let’s do some data exploration of our data!
A pairplot plots pairwise relationships between the variables. The diagonal plots show the univariate distribution of the data.
sns.pairplot(housingDF, kind='reg')
From here, we can see that the average area income has the strongest linear relationship with the price of all the features. We can also see that several of the features and the target variable have normal distributions.
We can get the distribution for only one variable by using the seaborn distplot function. Let’s look at the distribution of the target variable, price.
sns.distplot(housingDF.Price)
C:\Users\roman\anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
As seen above, the price has a fairly normal (bell-shaped) distribution with a mean of around 1.2 million.
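Since distplot is deprecated, the same picture can be produced with the newer histplot function suggested by the warning (a sketch, assuming seaborn 0.11 or later):
sns.histplot(housingDF['Price'], kde=True)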
We can also look at the correlation between the features and the target variable using seaborn’s heatmap function.
We can mask the upper right triangle, as its values are identical to those in the lower left triangle. The diagonal is also irrelevant, as it is the correlation of a feature with itself (always 1).
mask = np.zeros_like(housingDF.corr())
mask[np.triu_indices_from(mask)] = True
#fig,ax = plt.subplots(figsize=(12,12))
sns.heatmap(housingDF.corr(), mask=mask, annot=True)
As expected, average area income has the highest positive correlation with the target variable. Average number of bedrooms has the lowest positive correlation. None of the features are negatively correlated with the target variable.
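The same information can be read off numerically by sorting each column's correlation with the target variable (a sketch; the numeric_only argument is needed on recent pandas versions to skip the text Address column):
housingDF.corr(numeric_only=True)['Price'].sort_values(ascending=False)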
Data Cleaning¶
Now let’s clean the data a bit. As mentioned before, there is no missing data. We can verify this by plotting a heatmap of the nulls in the dataset.
fig, ax = plt.subplots(figsize=(12,8))
sns.heatmap(housingDF.isnull(), cmap='coolwarm', yticklabels=False, cbar=False, ax=ax)
As expected, there are no red lines indicating NaNs in the dataset.
Address column¶
The address column is in a text format so it cannot be fed into a scikit-learn model directly. This feature could be parsed to create new features, such as the type of street, whether it is a PO box, etc.; a brief parsing sketch is shown below for illustration. For this example we will just drop the column and see how the model performs given the remaining features.
housingDF['Address'].head()
0 208 Apt. 674\nLaurabury, NE 3701…
1 188 Suite 079\nLake Kathleen, CA…
2 9127 \nDanieltown, WI 06482…
3 USS Barnett\nFPO AP 44820
4 USNS Raymond\nFPO AE 09386
Name: Address, dtype: object
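As an illustration only (we drop the column in the next cell), a simple feature such as the two-letter state/routing code could be pulled out of the address text with a regular expression; the pattern below is an assumption about the address format, not part of the original analysis:
# hypothetical example: extract a two-letter code such as 'NE', 'CA' or 'AP'
housingDF['Address'].str.extract(r'\b([A-Z]{2})\s+\d+', expand=False).head()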
# Drop Address column
housingDF.drop('Address', axis=1, inplace=True)
Training a Linear Regression Model¶
Simple Linear Regression¶
Simple linear regression is an approach for modeling the relationship between a dependent variable (a “response”) and an explanatory variable, also known as a “predictor” or “feature”. The relationship is modeled as a linear function $y = \beta_0 + \beta_1x$ whose parameters are estimated from the available data.
In the equation above:
$y$ is called the response, regressand, endogenous variable, dependent variable, etc.
$x$ is the feature, regressor, exogenous variable, explanatory variable, predictor, etc.
$\beta_0$ is known as the intercept
$\beta_1$ is the regression coefficient, effect, etc.
Together, $\beta_0$ and $\beta_1$ are called parameters, model/regression coefficients, or effects. To create a model, we must discover/learn/estimate the values of these coefficients.
Estimating/Learning Model/Regression Coefficients¶
Regression coefficients can be estimated using a variety of methods. The least squares method, which finds the line that minimizes the sum of squared residuals (or "sum of squared errors"), is among the most commonly used; a small numeric sketch of it follows the list below.
In a typical plot of a least squares fit:
The blue dots are the observed values of x and y.
The red line is the least squares line.
The residuals are the distances between the observed values and the least squares line.
$\beta_0$ is the intercept of the least squares line (the value of $y$ when $x$=0)
$\beta_1$ is the slope of the least squares line, i.e. the ratio of the vertical change (in $y$) and the horizontal change (in $x$).
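As a sketch of what the least squares method actually computes in the simple case, the closed-form estimates $\beta_1 = \mathrm{cov}(x, y)/\mathrm{var}(x)$ and $\beta_0 = \bar{y} - \beta_1\bar{x}$ can be evaluated directly with numpy; they should match the statsmodels estimates below up to rounding:
x = housingDF['Avg. Area Income']
y = housingDF['Price']

# sample covariance and sample variance (ddof=1)
beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)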
We can use the statsmodels package to estimate the model coefficients for avg. area income vs. price.
Note: We rename the Avg. Area Income column to avgAreaIncome to work with the ols function from statsmodels.
import statsmodels.formula.api as sf
# create a model with Price as the dependent variable and avgAreaIncome as the explanatory variable
model = sf.ols('Price ~ avgAreaIncome', housingDF.rename(index=str, columns={"Avg. Area Income": "avgAreaIncome"}))
#fit the model to the data
fitted_model = model.fit()
# print the coefficients
print(fitted_model.params)
Intercept -221579.478206
avgAreaIncome 21.195483
dtype: float64
We can now pull the $\beta$ values from the params attribute:
b0 = fitted_model.params['Intercept']
b1 = fitted_model.params['avgAreaIncome']
Interpreting Model Coefficients¶
Q: How do we interpret the coefficient ($\beta_1$) of the explanatory variable “Avg. Area Income”?
A: A unit ($\$$1 USD) increase in avg. area income is **associated with** a 21.195483 unit ($\$$1 USD) increase in Price, i.e., an additional $\$$1000 of avg. area income is associated with an increase in house price of ~$\$$21,195.48.
Note that it is, in general, possible to have a negative effect, e.g., an increase in avg area income can be associated with a decrease in house prices. $\beta_1$ would be negative in this case.
Using the Model for Prediction¶
Can we use the model we developed to predict housing prices? For example, if the average area income of an area is \$100,000, what would the model predict as the house price?
$$price = \beta_0 + \beta_1\times avg. area\ income$$
print("y = {} + {}*(avg. area income)".format(round(b0,3), round(b1,3)))
y = -221579.478 + 21.195*(avg. area income)
y = b0 + b1*100000
print("The predicted house price for a house in the region with a $100,000 average income is ${}".format(round(y,2)))
The predicted house price for a house in the region with a $100,000 average income is $1897968.84
Using statsmodel to predict¶
We can also make this prediction using the stats model directly.
# create a DataFrame to use with the Statsmodels formula interface
avg_area_income = pd.DataFrame({'avgAreaIncome': [100000]})
#check the newly created DataFrame
avg_area_income.head()
   avgAreaIncome
0         100000
# use the model created above to predict the price
price_prediction = fitted_model.predict(pd.DataFrame({'avgAreaIncome': [100000]}))
print(price_prediction)
0 1.897969e+06
dtype: float64
As expected, the statsmodels prediction matches the result of the linear function above.
Plotting the Least Squares Line¶
Let’s make predictions for the smallest and largest observed values of avg. area income, and then use the predicted values to plot the least squares line:
# create a DataFrame with the minimum and maximum values of avg. area income
avg_area_income = pd.DataFrame({'avgAreaIncome': [housingDF['Avg. Area Income'].min(), housingDF['Avg. Area Income'].max()]})
print(avg_area_income.head())
avgAreaIncome
0 17796.631190
1 107701.748378
# make predictions for those x values and store them
price_predictions = fitted_model.predict(avg_area_income)
print(price_predictions)
0 1.556287e+05
1 2.061211e+06
dtype: float64
# plot the observed data
housingDF.plot(kind='scatter', x='Avg. Area Income', y='Price')
# plot the least squares line
plt.plot(avg_area_income, price_predictions, c='red', linewidth=2)
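Seaborn can also draw the scatter plot and the fitted line in a single call with regplot (a sketch; seaborn fits its own least squares line internally, so this is a convenience rather than a reuse of our statsmodels model):
sns.regplot(x='Avg. Area Income', y='Price', data=housingDF,
            scatter_kws={'alpha': 0.3}, line_kws={'color': 'red'})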
Confidence in Linear Regression Models¶
Q: Is linear regression a high bias/low variance model, or a low variance/high bias model?
A: High bias/low variance. Under repeated sampling, the line will stay roughly in the same place (low variance), but the average of those models won’t do a great job capturing the true relationship (high bias). (A low variance is a useful characteristic when limited training data is available.)
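One informal way to see this low-variance behaviour is to refit the simple model on a few bootstrap resamples of the data and observe how little the estimated slope moves (a sketch, reusing the column-renaming trick from above):
incomeDF = housingDF.rename(columns={'Avg. Area Income': 'avgAreaIncome'})
for seed in range(5):
    resample = incomeDF.sample(frac=1.0, replace=True, random_state=seed)
    refit = sf.ols('Price ~ avgAreaIncome', resample).fit()
    print(refit.params['avgAreaIncome'])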
We can use Statsmodels to calculate 95% confidence intervals for the model coefficients, which are interpreted as follows: If the population from which this sample was drawn was sampled 100 times, approximately 95 of those confidence intervals would contain the “true” coefficient.
# print the confidence intervals for the model coefficients
print(fitted_model.conf_int())
0 1
Intercept -270590.702759 -172568.253653
avgAreaIncome 20.489332 21.901634
Since we only have a single sample of data, and not the entire population, the "true" value of the regression coefficient is either within this interval or it isn't, but there is no way to actually know.
We estimate the regression coefficient using the data we have, and then we characterize the uncertainty about that estimate by giving a confidence interval, an interval that will "probably" contain the true value of the coefficient. Note that there is no probability associated with the true value of the regression coefficient being in the given confidence interval!
Also note that using 95% confidence intervals is simply a convention. One can create 90% confidence intervals (narrower intervals), 99% confidence intervals (wider intervals), etc.
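For example, the conf_int method accepts an alpha argument (the total probability left outside the interval), so 90% and 99% intervals for the same fitted model are one call away:
print(fitted_model.conf_int(alpha=0.10))   # 90% intervals (narrower)
print(fitted_model.conf_int(alpha=0.01))   # 99% intervals (wider)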
Hypothesis Testing and p-values¶
Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.
(Note that “failing to reject” the null is not the same as “accepting” the null hypothesis. The alternative hypothesis may indeed be true, except that you just don’t have enough data to show that.)
As it relates to model coefficients, here is the conventional hypothesis test:
null hypothesis: There is no relationship between avg. area income and price (and thus $\beta_1$ equals zero)
alternative hypothesis: There is a relationship between avg. area income and price (and thus $\beta_1$ is not equal to zero)
How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Closely related is the p-value, which is (roughly speaking) the probability of observing a coefficient estimate at least as far from zero as the one we got, if the true coefficient were actually zero:
# print the p-values for the model coefficients
fitted_model.pvalues
Intercept 1.064127e-18
avgAreaIncome 0.000000e+00
dtype: float64
If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again, using 0.05 as the cutoff is just a convention.)
In this case, the p-value for avgAreaIncome is far less than 0.05, so we believe that there is a relationship between avg. area income and price.
Note that we generally ignore the p-value for the intercept.
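The coefficient estimates, their confidence intervals and p-values, and the R-squared discussed next are all reported together by the fitted model's summary method:
print(fitted_model.summary())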
How Well Does the Model Fit the data?¶
The most common way to evaluate the overall fit of a linear model to the available data is by calculating the R-squared (a.k.a, “coefficient of determination”) value.
R-squared has several interpretations:
(1) R-squared ×100 percent of the variation in the dependent variable ($y$) is reduced by taking into account predictor $x$
(2) R-squared is the proportion of variance in the observed data that is “explained” by the model.
R-squared is between 0 and 1, and, generally speaking, higher is considered to be better because more variance is accounted for (“explained”) by the model.
Note, however, that R-squared does not indicate whether a regression model is actually good. You can have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data!
One should evaluate the adequacy of a model by looking at R-squared values as well as residual (i.e., observed value – fitted value) plots, other model statistics, and subject area knowledge.
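As a quick sketch of such a residual plot for our simple model (fittedvalues and resid are attributes of the fitted statsmodels results object), a reasonable fit shows residuals scattered roughly symmetrically around zero with no obvious pattern:
plt.scatter(fitted_model.fittedvalues, fitted_model.resid, alpha=0.3)
plt.axhline(0, color='red', linewidth=2)
plt.xlabel('Fitted Price')
plt.ylabel('Residual')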
The R-squared value for our simple linear regression model is:
# print the R-squared value for the model
fitted_model.rsquared
0.4092593070338846
Is that a “good” R-squared value? One cannot generally assess that. What a “good” R-squared value is depends on the domain and therefore R-squared is most useful as a tool for comparing different models.
Multiple Linear Regression¶
Simple linear regression can be extended to include multiple explanatory variables:
$y = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n$
Each $x$ represents a different predictor/feature, and each predictor has its own coefficient. In our case:
$y = \beta_0 + \beta_1 \times avgAreaIncome + \beta_2 \times avgAreaHouseAge + \beta_3 \times avgAreaNumOfRooms + \beta_4 \times avgAreaNumOfBedrooms$
Let’s use Statsmodels to estimate these coefficients:
# create a model with all the features
# Rename the columns since ols doesn’t take spaces
tempDF = housingDF.rename(index=str, columns={
    "Avg. Area Income": "avgAreaIncome",
    "Avg. Area House Age": "avgAreaHouseAge",
    "Avg. Area Number of Rooms": "avgAreaNumOfRooms",
    "Avg. Area Number of Bedrooms": "avgAreaNumOfBedrooms",
    "Area Population": "areaPopulation"
})
multi_model = sf.ols(formula='Price ~ avgAreaIncome + avgAreaHouseAge + avgAreaNumOfRooms + avgAreaNumOfBedrooms', data=tempDF)
fitted_multi_model = multi_model.fit()
# print the coefficients
print(fitted_multi_model.params)