Trying out a linear model:
There have been a few great scripts on xgboost already, so I figured I’d try something simpler: a regularized linear regression model. Surprisingly, it does really well with very little feature engineering. The key point is to log-transform the numeric variables, since most of them are skewed.
In [2]:
import pandas as pd
import numpy as np
# import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr
%config InlineBackend.figure_format = 'png'  # set 'png' here when working in a notebook
%matplotlib inline
In [3]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
In [4]:
train.head()
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities … PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub … 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub … 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub … 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub … 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub … 0 NaN NaN NaN 0 12 2008 WD Normal 250000
5 rows × 81 columns
In [5]:
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))
In [6]:
all_data.head()
Out[6]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig … ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside … 0 0 NaN NaN NaN 0 2 2008 WD Normal
1 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 … 0 0 NaN NaN NaN 0 5 2007 WD Normal
2 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside … 0 0 NaN NaN NaN 0 9 2008 WD Normal
3 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner … 0 0 NaN NaN NaN 0 2 2006 WD Abnorml
4 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 … 0 0 NaN NaN NaN 0 12 2008 WD Normal
5 rows × 79 columns
Data preprocessing:
We’re not going to do anything fancy here:
First I’ll transform the skewed numeric features by taking log(feature + 1), which brings their distributions closer to normal
Create dummy variables for the categorical features
Replace the numeric missing values (NaNs) with the mean of their respective columns
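The three steps above can be sketched on a tiny made-up frame before running them on the real data. The column names and values here are hypothetical (not taken from train.csv), and `select_dtypes` is used to pick numeric columns as an assumption of convenience; the notebook itself filters on `dtypes != "object"`.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Toy data with hypothetical values: one skewed numeric column,
# one numeric column with a missing value, one categorical column.
toy = pd.DataFrame({
    "area": [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0, 12800.0],
    "rooms": [3.0, 4.0, np.nan, 5.0, 4.0, 6.0, 3.0, 5.0],
    "zone": ["RL", "RM", "RL", "RL", "RM", "RL", "RM", "RL"],
})

# Step 1: log(1 + x) for numeric columns whose skewness exceeds 0.75.
numeric_cols = toy.select_dtypes(include="number").columns
skewness = toy[numeric_cols].apply(lambda col: skew(col.dropna()))
skewed_cols = skewness[skewness > 0.75].index
toy[skewed_cols] = np.log1p(toy[skewed_cols])
skew_after = skew(toy["area"])

# Step 2: one-hot encode the categorical column.
toy = pd.get_dummies(toy)

# Step 3: fill remaining NaNs with the column mean.
toy = toy.fillna(toy.mean())

print("transformed columns:", list(skewed_cols))
print("skewness of area after log1p:", round(float(skew_after), 3))
print(toy.head())
```

Only `area` crosses the 0.75 threshold here, so `rooms` is left on its original scale and merely has its missing entry imputed.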
In [7]:
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price": train["SalePrice"], "log(price + 1)": np.log1p(train["SalePrice"])})
prices.hist()
Out[7]:
[Figure: side-by-side histograms of price and log(price + 1)]
In [8]:
# log transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])
# log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))  # compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
In [9]:
all_data = pd.get_dummies(all_data)
In [10]:
# filling NAs with the mean of the column:
all_data = all_data.fillna(all_data.mean())
In [11]:
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
Models
Now we are going to use regularized linear regression models from the scikit-learn module. I’m going to try both l_1 (Lasso) and l_2 (Ridge) regularization. I’ll also define a function that returns the cross-validation RMSE so we can evaluate our models and pick the best tuning parameter.
In [15]:
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    # 5-fold cross-validated RMSE; sklearn reports negated MSE, so flip the sign
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
In [16]:
model_ridge = Ridge()
The main tuning parameter for the Ridge model is alpha, a regularization parameter that controls how flexible our model is. The higher the regularization, the less prone our model will be to overfit. However, it will also lose flexibility and might not capture all of the signal in the data.
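To make the effect of alpha concrete, here is a minimal sketch on synthetic data (the data, the coefficient values and the alpha grid are all made up for illustration). It fits Ridge at several alphas and prints the length of the coefficient vector, which shrinks toward zero as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (hypothetical, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.5, size=100)

demo_alphas = [0.01, 1, 100, 10000]
coef_norms = {}
for alpha in demo_alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # Euclidean norm of the fitted coefficients at this penalty strength
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: ||coef|| = {coef_norms[alpha]:.3f}")
```

A small alpha leaves the fit close to ordinary least squares; a very large alpha pulls every coefficient toward zero, which is the loss of flexibility described above.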
In [17]:
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha=alpha)).mean()
            for alpha in alphas]
In [13]:
cv_ridge = pd.Series(cv_ridge, index=alphas)
cv_ridge.plot(title="Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")
Note the U-ish shaped curve above. When alpha is too large the regularization is too strong and the model cannot capture all the complexities in the data. If however we let the model be too flexible (alpha small) the model begins to overfit. A value of alpha = 10 is about right based on the plot above.
In [14]:
cv_ridge.min()
So for the Ridge regression we get an RMSLE of about 0.127.
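The score is a root mean squared logarithmic error because the target was replaced by log(price + 1) before fitting, so an ordinary RMSE on that scale is an RMSLE on the price scale. A small check of that identity, using the first three sale prices from the table above and made-up predictions (the predicted values are hypothetical):

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + x) of both arrays."""
    return rmse(np.log1p(y_true), np.log1p(y_pred))

prices_true = np.array([208500.0, 181500.0, 223500.0])
prices_pred = np.array([200000.0, 190000.0, 230000.0])  # hypothetical predictions

# RMSE on the log1p scale (what rmse_cv measures) ...
log_scale_rmse = rmse(np.log1p(prices_true), np.log1p(prices_pred))
# ... equals RMSLE on the original price scale.
price_scale_rmsle = rmsle(prices_true, prices_pred)
print(log_scale_rmse, price_scale_rmsle)
```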
Let’s try out the Lasso model. We will take a slightly different approach here and use the built-in LassoCV to figure out the best alpha for us. Note that the useful alphas for LassoCV are much smaller than those for Ridge: in both models a larger alpha means a stronger penalty, but the two objectives are scaled differently.
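For reference, these are the objectives as stated in the scikit-learn documentation for `Ridge` and `Lasso`, with n the number of training samples. The squared-error term in the Lasso is divided by 2n while the Ridge one is not, which is one reason the same numeric alpha means very different things in the two models.

```latex
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\qquad\qquad
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
```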
In [15]:
model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)
In [16]:
rmse_cv(model_lasso).mean()
Nice! The lasso performs even better so we’ll just use this one to predict on the test set. Another neat thing about the Lasso is that it does feature selection for you – setting coefficients of features it deems unimportant to zero. Let’s take a look at the coefficients:
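The zeroing-out behaviour is easy to see on synthetic data where the answer is known. This is a sketch under made-up assumptions: ten features of which only the first three carry signal, and a hand-picked alpha of 0.1 rather than one chosen by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only features 0, 1 and 2 affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)

# Indices of the features the Lasso kept (non-zero coefficients).
selected = [int(i) for i in np.flatnonzero(model.coef_)]
print("selected features:", selected)
print("coefficients:", np.round(model.coef_, 3))
```

The surviving coefficients come out somewhat smaller in magnitude than the true values, since the l_1 penalty shrinks everything it does not eliminate.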
In [17]:
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
In [18]:
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Good job Lasso. One thing to note here however is that the features selected are not necessarily the “correct” ones, especially since there are a lot of collinear features in this dataset. One idea to try here is to run Lasso a few times on bootstrapped samples and see how stable the feature selection is.
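A minimal sketch of that bootstrap idea follows. The helper name `selection_frequency`, the number of resamples and the synthetic data are all assumptions made for illustration; nothing here was run on the housing data.

```python
import numpy as np
from sklearn.linear_model import Lasso

def selection_frequency(X, y, alpha, n_boot=50, seed=0):
    """Fraction of bootstrap resamples in which each feature gets a non-zero coefficient."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts += coef != 0
    return counts / n_boot

# Synthetic data (hypothetical): feature 1 is a noisy copy of feature 0,
# and only feature 0 truly drives the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=200)
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

freq = selection_frequency(X, y, alpha=0.05)
print(np.round(freq, 2))
```

On the notebook's data the same helper could presumably be called as `selection_frequency(X_train.values.astype(float), y.values, alpha=model_lasso.alpha_)`. Features with a frequency well below 1 are the ones whose selection depends on the particular sample.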
We can also take a look directly at what the most important coefficients are:
In [19]:
imp_coef = pd.concat([coef.sort_values().head(10),
                      coef.sort_values().tail(10)])
In [20]:
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Coefficients in the Lasso Model")
The most important positive feature is GrLivArea, the above-ground living area in square feet. This definitely makes sense. Then a few other location and quality features contributed positively. Some of the negative features make less sense and would be worth looking into more; it seems like they might come from unbalanced categorical variables.
Also note that unlike the feature importance you’d get from a random forest these are actual coefficients in your model – so you can say precisely why the predicted price is what it is. The only issue here is that we log_transformed both the target and the numeric features so the actual magnitudes are a bit hard to interpret.
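One way to read the magnitudes: since the model is linear in log(price + 1), a coefficient acts multiplicatively on (price + 1). The helpers below are a sketch of that arithmetic with made-up coefficient values; the function names are hypothetical and the effects hold only with all other features fixed.

```python
import numpy as np

def pct_change_for_dummy(coef):
    """Percent change in (price + 1) when a 0/1 feature switches from 0 to 1."""
    return 100.0 * float(np.expm1(coef))

def pct_change_for_log_feature(coef, pct_increase):
    """Percent change in (price + 1) when (feature + 1) grows by pct_increase percent,
    for a feature that entered the model as log(feature + 1)."""
    return 100.0 * ((1.0 + pct_increase / 100.0) ** coef - 1.0)

# Hypothetical coefficients, not values read off the fitted model.
dummy_effect = pct_change_for_dummy(0.1)
area_effect = pct_change_for_log_feature(0.4, 10.0)
print(f"dummy with coef 0.1: {dummy_effect:.2f}% change in price + 1")
print(f"log feature with coef 0.4, +10%: {area_effect:.2f}% change in price + 1")
```

For house prices in the hundred-thousand range the difference between price and price + 1 is negligible, so these can be read as percent changes in price.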
In [21]:
# let's look at the residuals as well:
matplotlib.rcParams['figure.figsize'] = (6.0, 6.0)
preds = pd.DataFrame({"preds": model_lasso.predict(X_train), "true": y})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x="preds", y="residuals", kind="scatter")
The residual plot looks pretty good. To wrap it up let’s predict on the test set and submit on the leaderboard:
Adding an xgboost model:
Let’s add an xgboost model to our linear model to see if we can improve our score:
In [22]:
import xgboost as xgb
In [59]:
dtrain = xgb.DMatrix(X_train, label=y)
dtest = xgb.DMatrix(X_test)
params = {"max_depth": 2, "eta": 0.1}
model = xgb.cv(params, dtrain, num_boost_round=500, early_stopping_rounds=100)
In [61]:
model.loc[30:, ["test-rmse-mean", "train-rmse-mean"]].plot()
In [62]:
model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1) #the params were tuned using xgb.cv
model_xgb.fit(X_train, y)
In [63]:
xgb_preds = np.expm1(model_xgb.predict(X_test))
lasso_preds = np.expm1(model_lasso.predict(X_test))
In [64]:
predictions = pd.DataFrame({"xgb": xgb_preds, "lasso": lasso_preds})
predictions.plot(x="xgb", y="lasso", kind="scatter")
In [62]:
preds = 0.7*lasso_preds + 0.3*xgb_preds
In [63]:
solution = pd.DataFrame({"id": test.Id, "SalePrice": preds})
solution.to_csv("ridge_sol.csv", index=False)
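The 0.7 / 0.3 weights in the blend above are set by hand. If some held-out prices were available, the weight could instead be picked by minimizing the log-scale error over a grid. The sketch below assumes exactly that; `blend`, `best_weight` and all of the numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def blend(pred_a, pred_b, weight_a=0.7):
    """Weighted average of two prediction arrays; weight_a goes to pred_a."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    return weight_a * np.asarray(pred_a) + (1.0 - weight_a) * np.asarray(pred_b)

def best_weight(y_true, pred_a, pred_b, grid=None):
    """Grid-search the blend weight that minimizes RMSE on the log(1 + x) scale."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else np.asarray(grid)
    errors = [
        np.sqrt(np.mean((np.log1p(blend(pred_a, pred_b, w)) - np.log1p(y_true)) ** 2))
        for w in grid
    ]
    return float(grid[int(np.argmin(errors))])

# Made-up hold-out predictions from two models, and targets constructed
# so that the ideal weight on model A is exactly 0.7.
pred_a = np.array([100.0, 200.0, 300.0])
pred_b = np.array([120.0, 180.0, 330.0])
y_holdout = 0.7 * pred_a + 0.3 * pred_b

chosen = best_weight(y_holdout, pred_a, pred_b)
print("chosen weight for model A:", chosen)
```

With real hold-out data the chosen weight would be noisy, so a coarse grid like this one is probably as fine as is worth searching.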