
Homework Task 4 Nonparametric Regression

Advanced Analytics (QBUS3830)

Homework Task 4: Additive Models

Setting up the notebook for figures:

In [11]:

# Packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Settings
sns.set_context('notebook')
sns.set_style('ticks')
colors = ['#4E79A7', '#F28E2C', '#E15759', '#76B7B2', '#59A14F',
          '#EDC949', '#AF7AA1', '#FF9DA7', '#9C755F', '#BAB0AB']
sns.set_palette(colors)
plt.rcParams['figure.figsize'] = (9, 6)

In [13]:

import warnings
warnings.filterwarnings(‘ignore’)

In [7]:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

from sklearn.linear_model import LinearRegression
import xgboost as xgb
import lightgbm as lgb

California housing dataset
Data processing

In [1]:

from sklearn.datasets import fetch_california_housing
raw = fetch_california_housing()
print(raw.DESCR)

California housing dataset.

The original database is available from StatLib

http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.

In [4]:

data = pd.DataFrame(raw.data, columns=raw.feature_names)
data['MedianHouseValue'] = raw.target
data.head()

Out[4]:

MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedianHouseValue
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Some of the predictors have severe outliers, which we remove for simplicity.
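The outlier-removal cell below is truncated in this copy, so the original thresholds are not recoverable. As an illustration only (the quantile-based cut-off and the helper `drop_outliers` are hypothetical, not the notebook's actual code), such a filter might look like:

```python
import numpy as np
import pandas as pd

def drop_outliers(df, cols, q=0.99):
    # Hypothetical helper: keep only rows at or below the q-th quantile
    # of each listed (right-skewed) predictor.
    mask = np.ones(len(df), dtype=bool)
    for col in cols:
        mask &= df[col] <= df[col].quantile(q)
    return df[mask]

# Toy demonstration on synthetic, housing-like columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({'AveRooms': rng.lognormal(1.5, 0.5, 1000),
                    'AveOccup': rng.lognormal(1.0, 0.5, 1000)})
clean = drop_outliers(toy, ['AveRooms', 'AveOccup'])
print(len(toy), len(clean))
```

A quantile rule is only one reasonable choice; the original cell appears to have used fixed numeric cut-offs instead.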

In [5]:

In [5]:

data = data[data['AveRooms']

[The remainder of this cell and the intervening train/test-split and model-fitting cells are missing from this copy.]

In [24]:

y_pred[y_pred > 1.609] = 1.609
y_pred[y_pred < -2] = -2
results.iloc[i, 0] = np.sqrt(mean_squared_error(y_test, y_pred))
results.iloc[i, 1] = r2_score(y_test, y_pred)
results.iloc[i, 2] = mean_absolute_error(y_test, y_pred)

results.round(3)

Out[24]:

          Test RMSE  Test R2  Test MAE
OLS           0.318    0.691     0.239
LightGBM      0.235    0.831     0.163
XGBoost       0.236    0.830     0.165
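The cells that split the data, fit the models, and fill in `results` are not fully legible above. Below is a minimal self-contained sketch of that evaluation loop under stated assumptions: synthetic data stands in for the housing set, a single `LinearRegression` stands in for the OLS/LightGBM/XGBoost comparison, and the clipping bounds (1.609 and -2) simply mirror the fragment above rather than being derived here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Synthetic stand-in for the (transformed) housing data
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
y = X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.3, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {'OLS': LinearRegression()}  # LightGBM/XGBoost would be added here
results = pd.DataFrame(index=models.keys(),
                       columns=['Test RMSE', 'Test R2', 'Test MAE'],
                       dtype=float)

for i, (name, model) in enumerate(models.items()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Clip extreme predictions, mirroring the fragment in the notebook
    y_pred[y_pred > 1.609] = 1.609
    y_pred[y_pred < -2] = -2
    results.iloc[i, 0] = np.sqrt(mean_squared_error(y_test, y_pred))
    results.iloc[i, 1] = r2_score(y_test, y_pred)
    results.iloc[i, 2] = mean_absolute_error(y_test, y_pred)

print(results.round(3))
```

The same loop structure extends to the boosted-tree models by adding them to the `models` dict; the numbers printed here come from the synthetic data and will not match the housing results in the table above.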