Homework Task 4: Nonparametric Regression
Advanced Analytics (QBUS3830)
Homework Task 4: Additive Models
Setting up the notebook for figures:
In [11]:
# Packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Settings
sns.set_context('notebook')
sns.set_style('ticks')
colors = ['#4E79A7', '#F28E2C', '#E15759', '#76B7B2', '#59A14F',
          '#EDC949', '#AF7AA1', '#FF9DA7', '#9C755F', '#BAB0AB']
sns.set_palette(colors)
plt.rcParams['figure.figsize'] = (9, 6)
In [13]:
import warnings
warnings.filterwarnings('ignore')
In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.linear_model import LinearRegression
import xgboost as xgb
import lightgbm as lgb
California housing dataset
Data processing
In [1]:
from sklearn.datasets import fetch_california_housing
raw = fetch_california_housing()
print(raw.DESCR)
California housing dataset.
The original database is available from StatLib
http://lib.stat.cmu.edu/datasets/
The data contains 20,640 observations on 9 variables.
This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.
References
----------
Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.
In [4]:
data = pd.DataFrame(raw.data, columns=raw.feature_names)
data['MedianHouseValue'] = raw.target
data.head()
Out[4]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedianHouseValue
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
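Extreme quantiles are a quick way to see how severe the outliers in predictors such as `AveRooms` and `AveOccup` are before deleting them. A minimal sketch of such a check, on a small synthetic frame so it runs standalone (the column values and cutoffs here are illustrative, not the notebook's actual thresholds):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for `data`, so the check runs standalone
rng = np.random.default_rng(0)
data = pd.DataFrame({'AveRooms': np.append(rng.normal(5, 1, 99), 140.0),
                     'AveOccup': np.append(rng.normal(3, 0.5, 99), 1243.0)})

# Comparing the 99th percentile with the maximum exposes severe outliers
print(data[['AveRooms', 'AveOccup']].quantile([0.5, 0.99, 1.0]))

# Keep only rows below a chosen cutoff (illustrative values)
data = data[(data['AveRooms'] < 50) & (data['AveOccup'] < 100)]
print(len(data))
```

The same pattern, with cutoffs chosen from the quantile table, drops the outlying rows in one boolean-indexing step.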
Some of the predictors have severe outliers, which we delete for simplicity.
In [5]:
data = data[data['AveRooms'] < ...]

y_pred[y_pred > 1.609] = 1.609
y_pred[y_pred < -2] = -2
results.iloc[i,0] = np.sqrt(mean_squared_error(y_test, y_pred))
results.iloc[i,1] = r2_score(y_test, y_pred)
results.iloc[i,2] = mean_absolute_error(y_test, y_pred)
results.round(3)
Out[24]:
Test RMSE Test R2 Test MAE
OLS 0.318 0.691 0.239
LightGBM 0.235 0.831 0.163
XGBoost 0.236 0.830 0.165
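The evaluation cell above references a train/test split, a loop over fitted models, and a `results` frame whose defining cells are not shown. A minimal sketch of such a pipeline, using synthetic data in place of the housing set so it runs standalone (the split ratio, seed, and the contents of `models` are assumptions, not the notebook's actual settings):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Synthetic stand-in for the housing data, so the sketch runs standalone
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# LightGBM and XGBoost regressors (lgb.LGBMRegressor(), xgb.XGBRegressor())
# could be added here: they expose the same fit/predict API
models = {'OLS': LinearRegression()}

results = pd.DataFrame(index=list(models),
                       columns=['Test RMSE', 'Test R2', 'Test MAE'],
                       dtype=float)

for i, (name, model) in enumerate(models.items()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results.iloc[i, 0] = np.sqrt(mean_squared_error(y_test, y_pred))
    results.iloc[i, 1] = r2_score(y_test, y_pred)
    results.iloc[i, 2] = mean_absolute_error(y_test, y_pred)

print(results.round(3))
```

With the real data, predictions would also be clipped to the target's observed range (the `1.609` and `-2` bounds above) before computing the metrics.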