Assignment_5_Solved
Follow these instructions:¶
Once you are finished, ensure to complete the following steps.
Copyright By PowCoder代写 加微信 powcoder
Restart your kernel by clicking ‘Kernel’ > ‘Restart & Run All’.
Fix any errors which result from this.
Repeat steps 1. and 2. until your notebook runs without errors.
Submit your completed notebook to OWL by the deadline.
Grade: /100 pts¶
Assignment 05: Confidence Intervals & Bootstrap¶
# Add the necessary imports for this homework
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection
import sklearn.linear_model
from scipy.stats import t
Question 1: /10pts¶
Load in the hockey_draftees_2005.csv data into pandas. It contains data from hockey players drafted in 2005, including their age (age – in years), weight (wt – in pounds), height (ht – in inches), and overall ranking within the year (1 being best).
Perform the following actions:
Fit a linear model of the influence of weight (wt) on the overall ranking (overall) using a quadratic regression model from sklearn. Call your fitted model model.
Make a scatter plot of the weight (x-axis) against overall (y-axis).
Plot the predicted values for the overall ability for 170-250 pounds as a line plot.
Calculate the r-squared value for this fit directly from the residuals (not using the built-in function).
df=pd.read_csv(‘hockey_draftees_2005.csv’)
# Create and fit the model
model = sklearn.linear_model.LinearRegression()
model.fit(np.c_[df.wt,df.wt**2],df.overall)
# Make the scatter plot
ax=sns.scatterplot(x=df.wt,y=df.overall)
ax.set_xlabel(‘Weight’)
ax.set_ylabel(‘Overall’)
#Generate and plot the predicted values
xp = np.linspace(170, 250, 30)
yp = model.predict(np.c_[xp,xp**2])
plt.plot(xp,yp)
# Calculate
res = df.overall-model.predict(np.c_[df.wt,df.wt**2])
RSS = np.sum(res**2)
TSS = np.sum((df.overall-df.overall.mean())**2)
R2 = 1 – RSS/TSS
print(‘Training r-squared is’,R2)
Training r-squared is 0.10022402355587823
Question 2: /15pts¶
Bootstrap confidence intervals on parameters¶
How confident can we be about the relation between weight and the overall ranking?
To judge this, we need confidence intervals on the linear and quadratic coefficients.
Write the function BootstrapCoef to conduct a boostrap analysis for this regression model. Input arguments to this function should be:
data, which is a dataframe having columns ‘weight’ and ‘overall’
numboot which is an integer denoting how many bootstrap replications to perform.
Write bootstrap to perform bootstrap resampling. You can use pd.DataFrame.sample with replace = True to perform the resampling.
Your function bootstrap should return:
params: a numpy array of size [numboot,numParams] of bootstraped parameter values. The parameters are the intercept, the slope and the quadratic term of the linear regression model.
Tip: Note that the intercept can be retrieved from model.intercept, whereas all the other regression coefficients are stored in model.coef.
Read through the documentation for pd.DataFrame.sample (link) to understand how to make the samples for each bootstrap run.
Then call the function to get 100 boostrap samples for your quadratic regression model of overall ability explained by weight. Make a joint scatter plot of the parameter value for the intercept and the quadratic coefficient.
# Write a Bootstrap function that records the fitted models
def BootstrapCoef(data,numboot=1000):
regr = sklearn.linear_model.LinearRegression()
numboot = 1000
n = len(data)
theta = np.zeros((numboot,3))
np.random.seed(4)
for i in range(numboot):
d = data.sample(n, replace=True)
X_fit = np.c_[d.wt,d.wt**2]
regr.fit(X_fit,d.overall)
theta[i,0]=regr.intercept_
theta[i,1]=regr.coef_[0]
theta[i,2]=regr.coef_[1]
return theta
params = BootstrapCoef(df,100)
ax = sns.jointplot(x=params[:,0],y=params[:,2])
ax.ax_joint.set_xlabel(‘Intercept’)
ax.ax_joint.set_ylabel(‘Quadratic’)
plt.show()
Question 3: /5pts¶
Make a copy of your data frame and subtract the mean weight from the column weight. Call the boostrap function with the modified data and provide a joint plot of the estimate for the intercept and the quadratic coefficient.
Written answer: Compare this graph to the one from Question 2. Why do the two graphs look so different? How come the two parameters are so positively correlated in one case and nearly look independent in the other case? What is actually in common in the two cases?
df1 = df.copy()
df1[‘wt’] = df1[‘wt’]-df1.wt.mean()
params = BootstrapCoef(df1,100)
ax = sns.jointplot(x=params[:,0],y=params[:,2])
ax.ax_joint.set_xlabel(‘Intercept’)
ax.ax_joint.set_ylabel(‘Quadratic’)
plt.show()
Intercept and slope coefficient are highly colinear –
As the quadratic term changes, the intercept has to also change dramatically, just to fit the mean of the data. This is because the intercept gives the predicted value for zero and the mean weight is quite far away from it.
After mean-subtracting the weight variable, the depdendency is broken. However, the marginal distribution of the quadratic term does not change.
Question 4: /10pts¶
Plot the bootstrap estimates for the quadratic coefficient as a histogram (use the estimates from the model fitted in Question 3). Use your samples to compute a 95% confidence interval. Note that the CI should be constructed around the sample estimate of the quadratic coefficient.
Written answer: How can you interpret this confidence interval?
bs_res = params[:,2]-model.coef_[1]
plt.hist(bs_res, edgecolor = ‘white’, density=True)
ci_lower, ci_upper = np.quantile(bs_res, [0.025, 0.975])
boot_ci = [model.coef_[1] – ci_upper,
model.coef_[1] – ci_lower]
print(‘My confidence interval is between’, boot_ci[0], ‘and’, boot_ci[1])
My confidence interval is between -0.015220241421753069 and 0.03671889322326306
Written answer: The interval contains the true slope parameter with a probability of 95%. This means that we cannot be very certain that the relationship between weight and overall ability is convex – it could also be linear (quadratic term = 0) or concave (negative quadratic coefficient).
Question 5: /15pts¶
Write the function BootstrapPred to bootstrap your fit and generate a prediction from each of these bootstrapped models.
Draw again a scatter plot of weight against overall ability. Plot the predictions from the 20 fitted bootstrap models to derive predictions for the weight ranging from 170 to 250. Draw all these predicted quadratic functions onto your scatter plot.
Written answer: Where (what weight range) are we most uncertain in our prediction about the overall ability of a player? Why?
# Write a Bootstrap function that records the fitted models
def BootstrapPred(data,xp):
regr = sklearn.linear_model.LinearRegression()
numboot = 20
n = len(data)
X_pred = np.c_[xp,xp**2]
y_pred = np.zeros((numboot,X_pred.shape[0]))
for i in range(numboot):
d = data.sample(n, replace=True)
X_fit = np.c_[d.wt,d.wt**2]
regr.fit(X_fit,d.overall)
y_pred[i,:]=regr.predict(X_pred)
return y_pred
# Get predictions from 20 bootstrapped models
xp = np.linspace(170, 250, 30)
YP = BootstrapPred(df,xp)
# Make a scatterplot and draw the 20 lines
ax=sns.scatterplot(x=df.wt,y=df.overall)
for i in range(20):
plt.plot(xp,YP[i,:])
ax.set_xlabel(‘Weight’)
ax.set_ylabel(‘Overall’)
plt.show()
The greatest uncertainty we have for very heavy players (235pounds and larger). The reason is that we do not have many data examples here, and we do not know for certain whether the underlying function is concave or convex.
Question 6: /15pts¶
Now, let’s see how well our model performs out of sample. Load in the hockey_draftees_test.csv file into a dataframe.
Use your fitted model to make predictions on the test data. Do not refit your model using the test data.
Make a scatter plot of the test data (weight against overall) and superimpose the prediction of the model.
To evaluate this prediction, calculate the r-square value for the out of sample (oos) data. Don’t rely on a built-in function, but calculate the predictive r-square from the test data and your prediction directly.
Written answer: What might explain the terrible performance of your model?
df_test=pd.read_csv(‘hockey_draftees_test.csv’)
# Make the scatter plot
ax=sns.scatterplot(x=df_test.wt,y=df_test.overall)
ax.set_xlabel(‘Weight’)
ax.set_ylabel(‘Overall’)
#Generate and plot the predicted values
xp = np.linspace(170, 250, 30)
yp = model.predict(np.c_[xp,xp**2])
plt.plot(xp,yp,’r’)
# Now do the prediction for the test data and compute R2
y = df_test.overall.values
yhat = model.predict(np.c_[df_test.wt,df_test.wt**2])
rsquared_oos = 1 – (np.sum((y – yhat)**2)) / np.sum((y – y.mean())**2)
print(‘Out of sample r-squared is’, rsquared_oos)
Out of sample r-squared is -0.7358097599787574
50.02757707304956
The model makes very bad predictions, as the mean ranking in ability of the players in the training set goes to 94 only, and here the ranking sometimes exceeds 200 (as there are more players per year). So the model predicts the mean ranking very wrong, and thus makes predictions that are further from the data than the mean of the test set.
Question 7: /10pts¶
In this question, you will construct a confidence interval for the sample mean, not using the normal distribution, but the t-distribution (see end of lecture 4.3), which is more accurate for small sample sizes.
The $100(1-\alpha)\%$ confidence interval is
$$ \bar{x} \pm t_{1-\alpha/2, n-1} \dfrac{\hat{\sigma}}{\sqrt{n}} $$Where $ t_{1-\alpha/2, n-1}$ is the appropiorate quantile of a Student’s t distribution with $n-1$ degrees of freedom.
Write a function called confidence_interval which takes as it’s argument an array of data called data and returns two things:
An estimated mean of data, and
The lower and upper bounds of the 95% confidence interval for the mean of data. Ensure these are returned in a numpy array of shape (2,)
To get the appropirate quantiles for the t-distribution, you can use scipy.stats.t, which implements some statistical functions for the t-distribution. Take a look at the documentation for scipy.stats.t, especially the ppf method.
Here is the documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html
def confidence_interval(data):
n = data.size
t_quantile = t.ppf(1-0.05/2, df=n-1)
estimated_mean = data.mean()
# Note, np.std divides by n and not n-1
# Force it to apply the correct formula by ussing ddof=1
# Alternaively, you can use scipy.stats.sem to compute
#The standard error
estimated_se = data.std(ddof=1)/np.sqrt(data.size)
bounds = estimated_mean + t_quantile*estimated_se*np.array([-1,1])
return estimated_mean, bounds
Question 8: /15pts¶
The “95% confidence interval” is named so because the long term relative frequency of these estimators containing the true estimand is 95%. That is to say if I construct 100 95% confidence intervals for the sample mean again and again from the same data generating mechanism, 95 of these intervals I construct will contain the true population mean.
Write a function called ci_simulation that runs some simulations to show this is the case. From a standard normal distirbution, sample 25 observations and construct a confidence interval. Do this 20 times and plot the intervals using matplotlib.pyplot.errorbar. Color the bar red if the confidence interval does not capture the true mean and blue if it does.
If you are unfamilliar with matplotlib.pyplot.errorbar, I highly suggest reading Matplotlib’s excellent documentation which has some examples at the bottom of the webpage. The documentation for matplotlib.pyplot.errorbar is in this link.
If you are unfamilliar with how to sample random numbers, I suggest you look at numpy.random.normal. Try searching for the documentation for that function yourself if you need to.
def ci_simulation():
# Set the random seed to always get the same random numbers.
# This is for Reproducibility.
np.random.seed(4)
# Create the figure.
fig, ax = plt.subplots(dpi = 120)
# If the interval crosses this line, it should be blue, else red.
ax.axhline(0, color = ‘k’)
# Do the following 20 times
for i in range(20):
#Draw 25 observations from a standard normal
data = np.random.normal(size = 25)
# Compute what we need for the CI, namely the mean and the bounds
mu,bounds = confidence_interval(data)
# color should be blue if it crosses the black line
color = ‘blue’
if (min(bounds)>0)|(max(bounds)<0):
# but in the case it does not, turn it red
color = 'red'
# Need to get the length of the interval from bounds
interval_len = 1.0/2*(bounds[1] - bounds[0])
ax.errorbar(i, mu, yerr=interval_len, color = color, fmt = 'o')
# This function does not have to return anything
return None
ci_simulation()
Question 9: /5pts¶
If you haven't changed the random seed from 4 and if you implemented the solution correctly, you should have only one red interval.
Answer the following below in no more than 3 sentences:
a) How many red intervals did we expect to see? What is your justifiation for this?
Changing the random seed might affect how many red intervals you see. Try changing the random seed in your function to 3. This will yield two red intervals (which is different than what you should expect to see).
b) Why does the simulation sometimes deviate from the predicted results
a) We expected 1 interval. Only 95% of the intervals should contain the true group mean - 5% not. 5% of 20 is 1.
b) Each simulation of 20 intervals is a random experiment - the mean number of CIs that does not contain the mean should be on, but have binomial distribution with n=20 and p=0.05.
Follow these instructions:¶
Once you are finished, ensure to complete the following steps.
Restart your kernel by clicking 'Kernel' > ‘Restart & Run All’.
Fix any errors which result from this.
Repeat steps 1. and 2. until your notebook runs without errors.
Submit your completed notebook to OWL by the deadline.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com