QBUS2820 – Predictive Analytics
Tutorial 6 – Model Selection and Estimation¶
Recall that in Tut 04 we focused on demonstrating the bias-variance decomposition, and in Tut 05 we practiced model selection for kNN.
In Tut 06, we
1) Continue model selection for kNN using alternative built-in CV methods
2) Perform model selection in linear regression with CV
3) Look at how to implement AIC/BIC for model selection in linear regression
First, let’s use the same approach as in previous weeks to simulate some data.
In [24]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
# Initialise RNG so we get the same result every time
np.random.seed(0)
# Number of training points
n = 50
x = np.linspace(0.0, 1.0, n)
# Function coefficients/parameters
beta0 = 4
beta1 = 1.5
beta2 = 3.2
# f values from 2nd order polynomial
f = beta0 + beta1 * x + beta2 * np.power(x,2)
$f$ contains data perfectly sampled along the 2nd-order polynomial. In the real world we usually observe noisy data, so we will add a small amount of Gaussian noise to the sample.
In [25]:
# Generate noisy sample from population
sigma2 = 0.1
y_train = f + np.random.normal(0, np.sqrt(sigma2), n)
y_validation = f + np.random.normal(0, np.sqrt(sigma2), n)
In [26]:
fig1 = plt.figure()
plt.plot(x, f, label="f (Ground Truth)")
plt.scatter(x, y_train, label="y (Training Points)", color="red")
plt.scatter(x, y_validation, label="y (Validation Points)", color="orange", alpha=0.5)
plt.xlabel("Predictor/Feature Value")
plt.ylabel("Target Value")
plt.title("Underlying Function vs Observations")
plt.legend(loc="upper left")
Out[26]:
[Figure: Underlying Function vs Observations]
Selecting k in kNN with GridSearchCV¶
GridSearchCV performs an exhaustive search over specified hyperparameter values, which is convenient when you have only a few hyperparameters to search over for the best combination. In kNN there is only one hyperparameter, the number of neighbours $k$.
In [27]:
# GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
# Set up the grid of parameters to search (in this case it's 1-dimensional)
param_grid = {
    'n_neighbors': np.arange(1, 11)
}
model = KNeighborsRegressor()
# Create the grid search object
grid_cv_obj = GridSearchCV(model, param_grid)
# Do the grid search
grid_cv_obj.fit(x.reshape(-1, 1), y_train)
print("Grid Search CV Complete!")
Grid Search CV Complete!
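By default, GridSearchCV uses 5-fold cross-validation and each estimator's default scorer ($R^2$ for regressors), which is why the test scores below can be negative. A minimal sketch making these settings explicit (the name grid_cv_explicit and the choice of negative MSE are ours, to match the rest of this tutorial):
grid_cv_explicit = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,                              # 5-fold CV (sklearn's default when cv is not given)
    scoring='neg_mean_squared_error',  # higher (less negative) is better
)
grid_cv_explicit.fit(x.reshape(-1, 1), y_train)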
In [28]:
# Get the best model
best_model = grid_cv_obj.best_estimator_
# Print the best model
print(best_model)
KNeighborsRegressor(n_neighbors=4)
In [29]:
# Print the best params
print(grid_cv_obj.best_params_)
{'n_neighbors': 4}
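The best cross-validated score is also stored on the search object, and because refit=True by default, best_estimator_ has already been refit on the full training data and can be used for prediction directly. A short sketch (the name y_hat is ours):
print(grid_cv_obj.best_score_)                 # mean CV score of the best parameter setting
y_hat = best_model.predict(x.reshape(-1, 1))   # best_estimator_ is refit on all of (x, y_train)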
In [30]:
import pandas as pd
results = pd.DataFrame(grid_cv_obj.cv_results_)
results.head()
Out[30]:
mean_fit_time std_fit_time mean_score_time std_score_time param_n_neighbors params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.001209 0.000742 0.001587 0.000476 1 {'n_neighbors': 1} -0.081308 -4.900943 -0.322660 0.636097 -1.474163 -1.228595 1.957575 6
1 0.000593 0.000485 0.000806 0.000403 2 {'n_neighbors': 2} -1.379118 -1.497278 0.073428 0.687549 -1.823326 -0.787749 0.984246 3
2 0.000411 0.000504 0.000997 0.000631 3 {'n_neighbors': 3} -1.619946 -1.534856 0.247589 0.868792 -1.403595 -0.688403 1.038912 2
3 0.000803 0.000402 0.000393 0.000481 4 {'n_neighbors': 4} -1.420876 -1.618033 0.414264 0.667964 -1.328431 -0.657022 0.986007 1
4 0.000403 0.000494 0.000587 0.000480 5 {'n_neighbors': 5} -1.613556 -2.656514 0.409688 0.640341 -1.667683 -0.977545 1.283878 4
Understanding Grid Search¶
In this kNN example we only had one hyperparameter, so the grid is just a one-dimensional list. GridSearchCV performs cross-validation for each value of n_neighbors and returns the best value of n_neighbors.
We can use the ParameterGrid object in sklearn to list all combinations of values that will be checked. In this case there are only 10.
In [31]:
from sklearn.model_selection import ParameterGrid
grid_df = pd.DataFrame(list(ParameterGrid(param_grid)))
grid_df
Out[31]:
n_neighbors
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
GridSearchCV is convenient to use when you have only a few hyperparameters.
If we have 3 hyperparameters, for example, then the grid is 3-dimensional,
and the number of hyperparameter combinations that we have to search increases dramatically: we perform cross-validation for each row in the parameter grid.
Suppose we have three hyperparameters: a (integer from 1 to 10), b (integer from 101 to 110) and c (integer from -4 to 5).
Below we show the first 15 rows from 1000 ($10 \times 10 \times 10$) total rows.
In [32]:
from sklearn.model_selection import ParameterGrid
param_grid = {
    'param_a': np.arange(1, 11),
    'param_b': np.arange(101, 111),
    'param_c': np.arange(-4, 6),
}
grid_df_3 = pd.DataFrame(list(ParameterGrid(param_grid)))
grid_df_3.head(15)
Out[32]:
param_a param_b param_c
0 1 101 -4
1 1 101 -3
2 1 101 -2
3 1 101 -1
4 1 101 0
5 1 101 1
6 1 101 2
7 1 101 3
8 1 101 4
9 1 101 5
10 1 102 -4
11 1 102 -3
12 1 102 -2
13 1 102 -1
14 1 102 0
Total number of combinations of hyperparameters:
In [33]:
len(grid_df_3)
Out[33]:
1000
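The grid size is simply the product of the number of candidate values per hyperparameter, which we can verify directly with a quick sketch (the name n_combinations is ours):
n_combinations = np.prod([len(values) for values in param_grid.values()])
print(n_combinations)  # 10 * 10 * 10 = 1000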
CV for linear regression¶
In [34]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
max_deg = 10
cv_scores_lr = []
for i in range(1, max_deg):
    poly_transformer = PolynomialFeatures(i)
    poly_x = poly_transformer.fit_transform(x.reshape(-1, 1))
    # Create the linear regression object
    lin_reg = LinearRegression()
    # Use 3-fold CV
    # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
    scores_lr = cross_val_score(lin_reg, poly_x, y_train, cv=3, scoring='neg_mean_squared_error')
    # Get the mean of the scores from the 3 folds
    cv_score = np.mean(scores_lr)
    cv_scores_lr.append(cv_score)
You can change the number of folds in cross-validation and see what impact it has on the model selection.¶
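For example, a brief sketch repeating the degree search with different fold counts (reusing x, y_train and max_deg from above; the names n_folds and fold_scores are ours):
for n_folds in [3, 5, 10]:
    fold_scores = []
    for d in range(1, max_deg):
        poly_x_d = PolynomialFeatures(d).fit_transform(x.reshape(-1, 1))
        scores = cross_val_score(LinearRegression(), poly_x_d, y_train,
                                 cv=n_folds, scoring='neg_mean_squared_error')
        fold_scores.append(np.mean(scores))
    print(n_folds, 'folds -> selected degree:', 1 + int(np.argmax(fold_scores)))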
In [35]:
cv_scores_lr
Out[35]:
[-0.7677682428258845,
-0.2471889370479724,
-1.6382231164261842,
-11.62963133724679,
-300.13865968973823,
-2832.9134362128243,
-10117.513235502312,
-21932.13243979918,
-3126766.599095208]
The optimal polynomial order (np.argmax finds the index of the highest score; adding 1 converts the 0-based index to a degree):
In [36]:
1 + np.argmax(cv_scores_lr)
Out[36]:
2
In [37]:
fig = plt.figure()
plt.plot(np.arange(1,max_deg), cv_scores_lr)
plt.xlabel("Polynomial Degree")
plt.ylabel("CV score (Higher is better)")
Out[37]:
Text(0, 0.5, 'CV score (Higher is better)')
Note that in the above plot the extreme values, e.g. -3126766.599, obscure the pattern. Let’s plot the CV score against only the first 4 polynomial orders.
In [38]:
fig = plt.figure()
plt.plot(np.arange(1, 5), cv_scores_lr[0:4])
plt.xlabel("Polynomial Degree")
plt.ylabel("CV score (Higher is better)")
Out[38]:
Text(0, 0.5, 'CV score (Higher is better)')
Using CV to Evaluate Model Performance¶
Let’s compare the performance of our best polynomial regression model with that of the kNN regressor.
To ensure that each model sees exactly the same K-folds, we store the K-fold split in a Python object and pass it to the cross_val_score function.
In [40]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Build the degree-2 polynomial model
poly_transformer = PolynomialFeatures(2)
poly_x = poly_transformer.fit_transform(x.reshape(-1, 1))
poly_reg = LinearRegression()
poly_reg.fit(poly_x, y_train)
# Build the kNN model
knn = KNeighborsRegressor(n_neighbors=4)
knn.fit(x.reshape(-1, 1), y_train)
Out[40]:
KNeighborsRegressor(n_neighbors=4)
In [41]:
from sklearn.model_selection import KFold
# 3-fold cross-validation; the same kf object is passed to both models
kf = KFold(3)
# Default scoring uses each estimator's score method (R^2 for regressors)
poly_scores = cross_val_score(poly_reg, poly_x, y_train, cv=kf)
poly_mean = np.mean(poly_scores)
knn_scores = cross_val_score(knn, x.reshape(-1, 1), y_train, cv=kf)
knn_mean = np.mean(knn_scores)
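One caveat worth noting: KFold(3) without shuffling splits the sorted x values into three contiguous blocks, so each fold asks the model to extrapolate. A sketch of the shuffled alternative (shuffle=True and the variable names below are our additions, not part of the original comparison):
kf_shuffled = KFold(3, shuffle=True, random_state=0)
poly_mean_shuffled = np.mean(cross_val_score(poly_reg, poly_x, y_train, cv=kf_shuffled))
knn_mean_shuffled = np.mean(cross_val_score(knn, x.reshape(-1, 1), y_train, cv=kf_shuffled))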
In [42]:
# Make the results pretty
data = [
    {"Model": "Polynomial Regression", "CV score": poly_mean},
    {"Model": "KNN Regressor", "CV score": knn_mean}
]
results = pd.DataFrame(data)
results = results.set_index(“Model”)
results
Out[42]:
CV score
Model
Polynomial Regression -0.63556
KNN Regressor -1.74397
Model selection with AIC and BIC in linear regression¶
Consider a linear regression model with response vector $y$ and design matrix $X$. Let
$\hat{\sigma}^2=\frac{1}{n}(y-X\hat{\beta})^\top(y-X\hat{\beta})$
be the estimate of the variance of the error term $\epsilon$. The AIC and BIC are given by
$\text{AIC}=n\log(\hat{\sigma}^2)+2d$
$\text{BIC}=n\log(\hat{\sigma}^2)+\log(n)\,d$
with $n$ the sample size and $d$ the total number of parameters in the model. Please refer to a question on Ed Discussion for a derivation of these AIC and BIC formulas.
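As a rough sketch of where these come from (assuming Gaussian errors, and up to constants that are the same for every candidate model): the maximised log-likelihood of the linear regression model is
$\log \hat{L} = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat{\sigma}^2) - \frac{n}{2},$
so $-2\log \hat{L} = n\log(\hat{\sigma}^2) + \text{const}$, and substituting into $\text{AIC} = -2\log \hat{L} + 2d$ and $\text{BIC} = -2\log \hat{L} + \log(n)\,d$ and dropping the constant gives the expressions above.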
Now, let’s create a synthetic dataset as in Tutorial 04 and let’s see how AIC/BIC work as model choice techniques. The data comes from the following model
$y = \beta_0 + \beta_1 x + \beta_2 x^2 +\epsilon,$
and we compare between the following potential models
$y = \beta_0 + \beta_1 x +\epsilon$
$y = \beta_0 + \beta_1 x + \beta_2 x^2 +\epsilon$
$y = \beta_0 + \beta_1 x + \beta_2 x^2 +\beta_3 x^3+\epsilon$
For the degree-$k$ polynomial regression model, the total number of parameters is $d=k+2$ (why?)
In [43]:
import numpy as np
import matplotlib.pyplot as plt
# Initialise RNG so we get the same result every time
np.random.seed(0)
# Number of training points
n = 100
x = np.linspace(0.0, 1.0, n)
# true coefficients/parameters
beta0 = 4
beta1 = 1.5
beta2 = 3.2
# generate data
sigma2 = 0.1
y = beta0 + beta1 * x + beta2 * np.power(x,2) + np.random.normal(0, np.sqrt(sigma2), n)
Calculate AIC and BIC values for the three models
In [44]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
max_deg = 4
aic = list()
bic = list()
for k in range(1, max_deg):
    poly_transformer = PolynomialFeatures(k)
    poly_x = poly_transformer.fit_transform(x.reshape(-1, 1))
    # Create the linear regression object
    lin_reg = LinearRegression()
    # Estimate coefficients
    lin_reg.fit(poly_x, y)
    # Calculate predictions
    y_pred = lin_reg.predict(poly_x)
    # Calculate sigma2_hat
    sigma2_hat = mean_squared_error(y, y_pred)
    number_of_parameters = k + 2
    aic_value = n * np.log(sigma2_hat) + 2 * number_of_parameters
    bic_value = n * np.log(sigma2_hat) + np.log(n) * number_of_parameters
    aic.append(aic_value)
    bic.append(bic_value)
In [45]:
aic
Out[45]:
[-145.12914244244618, -238.13407604531412, -236.15160054620478]
In [46]:
bic
Out[46]:
[-137.3136318844819, -227.71339530136174, -223.12574961626433]
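Both criteria are smallest for the second model, so AIC and BIC agree with CV here in selecting the quadratic. A quick sketch to read off the selected degree (lower AIC/BIC is better; the index is 0-based, so we add 1):
print('AIC selects degree:', 1 + int(np.argmin(aic)))  # quadratic model (degree 2)
print('BIC selects degree:', 1 + int(np.argmin(bic)))  # quadratic model (degree 2)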