Set 5: Linear Regression, Linear Algebra Review,
and Nonparametric Methods
(The Bootstrap & Permutation Tests)
STAT GU4206/GR5206 Statistical Computing & Introduction to Data Science
Gabriel Young
Columbia University
October 1, 2021
Last Time
• Exploratory Data Analysis.
• Base R Graphics.
• ggplot.
Section I
Warm up
Check Yourself (Warm Up)
Iris
• Use the built-in iris dataset:
• This famous (Fisher’s or Anderson’s) iris data set gives the
measurements in centimeters of the variables sepal length and width
and petal length and width, respectively, for 50 flowers from each of 3
species of iris. The species are Iris setosa, versicolor, and virginica.
Check Yourself (Warm Up)
Tasks: Plotting Data
Use the built-in iris dataset:
• Use the hist() function to plot a histogram of iris sepal width. Label
the axes properly.
• Create a scatterplot of iris sepal width vs. iris sepal length. Color the
points by whether or not the species is versicolor.
• Create side-by-side boxplots of iris petal length for each species. (A sketch of one possible solution follows.)
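One possible solution sketch in base R (column names taken from the built-in iris data; many variants are equally valid):
> # histogram of sepal width with labeled axes
> hist(iris$Sepal.Width,
+      main = "Histogram of Iris Sepal Width",
+      xlab = "Sepal width (cm)", ylab = "Frequency")
> # sepal width vs. sepal length, colored by versicolor or not
> plot(iris$Sepal.Length, iris$Sepal.Width,
+      col = ifelse(iris$Species == "versicolor", "red", "black"),
+      xlab = "Sepal length (cm)", ylab = "Sepal width (cm)",
+      main = "Sepal Width vs. Sepal Length")
> # side-by-side boxplots of petal length by species
> boxplot(Petal.Length ~ Species, data = iris,
+         main = "Petal Length by Species", ylab = "Petal length (cm)")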
Section II
Multiple Linear Regression
Multiple Linear Regression
Example
A large national grocery retailer tracks productivity and costs of its
facilities closely. Consider a data set obtained from a single distribution
center for a one-year period. Each data point for each variable represents
one week of activity. The variables included are number of cases shipped
in thousands (X1), the indirect costs of labor as a percentage of total costs
(X2), a qualitative predictor called holiday that is coded 1 if the week has
a holiday and 0 otherwise (X3), and total labor hours (Y ).
Multiple Linear Regression
Suppose, as statisticians, we are asked to build a model to predict total
labor hours in the future using this dataset.
What information would be useful to provide such a model?
• Is there a relationship between holidays and total labor hours? What
about number of cases shipped? Indirect costs?
• How strong are these relationships?
• Are these relationships linear?
Multiple linear regression can be used to answer each of these questions.
Multiple Linear Regression
Models a relationship between two or more explanatory variables and a
response variable by fitting a linear equation to observed data.
General Model
Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε,
with ε ∼ N(0, σ²).
The coefficient βj quantifies the association between the j-th predictor and the response: interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed.
Multiple Linear Regression
Matrix Formulation
Using a set of training observations (data) (Yi, Xi1, Xi2, . . . , Xip) for i = 1, 2, . . . , n, we want to estimate the model
Yi = β0 + β1Xi1 + β2Xi2 + · · · + βpXip + εi,
with εi ∼ N(0, σ²). We can represent this with matrices as follows:
Y = Xβ + ε,
where
Y = (Y1, Y2, . . . , Yn)^T ∈ R^n,    X = design matrix ∈ R^(n×(p+1)),
β = (β0, β1, . . . , βp)^T ∈ R^(p+1),    ε = (ε1, ε2, . . . , εn)^T ∈ R^n.
The Training Set
Note that we refer to the observations as the training data because we
will use these training data observations to train, or teach, our method
how to estimate the model.
This is in contrast to the test set, which is data that isn't used to estimate (or train) the model, but rather to assess how well the model predicts.
Multiple Linear Regression
Example (Multiple Linear Regression Model)
Y = β0 + β1X1 + β2X2 + β3X3 + ε,    ε ∼ N(0, σ²),
where,
• Total labor hours (Y ).
• Number of cases shipped (in thousands) (X1) .
• Indirect costs of labor as a percentage of total costs (X2).
• Holiday (X3), with X3 = 1 for a holiday week and 0 otherwise.
Multiple Linear Regression
Example
Case    Y      X1       X2     X3
1       4264   305.657  7.17   0
2       4496   328.476  6.20   0
3       4317   317.164  4.61   0
4       4292   366.745  7.02   0
5       4945   265.518  8.61   1
6       4325   301.995  6.88   0
...
48      4993   442.782  7.61   1
49      4309   322.303  7.39   0
50      4499   290.455  7.99   0
51      4186   411.750  7.83   0
52      4342   292.087  7.77   0
Multiple Linear Regression
Design Matrix
X =
    [ 1  X11  X12  ...  X1p ]
    [ 1  X21  X22  ...  X2p ]
    [ ...                   ]
    [ 1  Xn1  Xn2  ...  Xnp ]
What is the dimension of the design matrix?
Example
X =
    [ 1  305.657  7.17  0 ]
    [ 1  328.476  6.20  0 ]
    [ ...                 ]
    [ 1  411.750  7.83  0 ]
    [ 1  292.087  7.77  0 ]
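In R, this design matrix can be built automatically with model.matrix(); a sketch, assuming the Grocery data frame loaded on a later slide. Its dimension is n × (p + 1), here 52 × 4:
> X <- model.matrix(~ X1 + X2 + X3, data = Grocery)
> dim(X)   # 52 rows (n) and 4 columns (p + 1)
[1] 52  4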
Parameter Estimation
Using the training data, how do we estimate the parameters of the linear
regression model? How do we find
β̂ = (β̂0, β̂1, . . . , β̂p)^T,
which provides the predictions
Ŷ = β̂0 + β̂1X1 + β̂2X2 + · · · + β̂pXp?
In other words, how do we fit, or train, the model?
Least Squares Estimate
The least squares line is calculated from the training data by minimizing
the sum of the squares of the vertical deviations from each data point to
the line.
The Least Squares Line
[Figure sequence: a scatterplot of the data with the least squares line, highlighting the vertical deviations from each point to the line.]
Parameter Estimation
Least Squares Estimate
Define an objective function Q(b) as follows.
Q(b0, b1, . . . , bp) := Σ_{i=1}^{n} (Yi − (b0 + b1Xi1 + b2Xi2 + · · · + bpXip))² = ‖Y − Xb‖²,
which we minimize with respect to b = (b0, b1, b2, . . . , bp)^T ∈ R^(p+1).
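As a quick numerical sanity check (a sketch; it uses the design matrix X and the estimate beta_hat constructed on the following slides), Q evaluated at the least squares estimate is smaller than at any perturbed coefficient vector:
> Q <- function(b) sum((Grocery$Y - X %*% b)^2)   # the objective above
> Q(beta_hat)                    # the minimum of Q
> Q(beta_hat + c(10, 0, 0, 0))   # perturbing any coefficient increases Q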
Parameter Estimation
Theorem
If the design matrix X has full column rank, then the global minimizer of
Q(b) = ‖Y − Xb‖²
with respect to b = (b0, b1, . . . , bp)^T is
β̂ = (X^T X)^(-1) X^T Y.
To come:
• What do we mean by full column rank?
• Is there a geometric interpretation of this result?
Parameter Estimation
Sketch of Theorem Proof
First note,
Q(b) = ‖Y − Xb‖² = (Y − Xb)^T (Y − Xb)
     = Y^T Y − Y^T Xb − (Xb)^T Y + (Xb)^T Xb
     = b^T X^T X b − 2 Y^T Xb + Y^T Y,
since (Xb)^T Y = Y^T Xb (both are scalars).
Taking the first derivative of Q(b) with respect to b and setting the derivative equal to zero gives
2 X^T X b − 2 X^T Y = 0.
Solving the expression for b yields b = (X^T X)^(-1) X^T Y.
The second derivative (the Hessian, 2 X^T X) is a positive definite matrix because X has full column rank, which implies that Q(b) achieves its minimum at β̂ = (X^T X)^(-1) X^T Y.
Parameter Estimation
Example
> Grocery <- read.table("Kutner_6_9.txt", header = T)
> head(Grocery)
Y X1 X2 X3
1 4264 305.657 7.17 0
2 4496 328.476 6.20 0
3 4317 317.164 4.61 0
4 4292 366.745 7.02 0
5 4945 265.518 8.61 1
6 4325 301.995 6.88 0
> # Construct design matrix
> X <- cbind(rep(1, 52), Grocery$X1, Grocery$X2, Grocery$X3)

Parameter Estimation
Example
Least Squares Estimator: β̂ = (X^T X)^(-1) X^T Y
> beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Grocery$Y
> round(t(beta_hat), 2)
[,1] [,2] [,3] [,4]
[1,] 4149.89 0.79 -13.17 623.55
What is the estimated model?
Ŷ = 4149.89 + 0.79X1 − 13.17X2 + 623.55X3,
that is,
Predicted Labor.Hours = 4149.89 + 0.79 × Cases.Shipped − 13.17 × Indirect.Costs + 623.55 × Holiday.
Parameter Estimation
Estimated Model:
Predicted Labor.Hours = 4149.89 + 0.79 × Cases.Shipped − 13.17 × Indirect.Costs + 623.55 × Holiday.
Example: Prediction
How many labor hours does our model predict for a holiday week with 350,000 cases shipped (X1 = 350, since X1 is measured in thousands of cases) and indirect costs at 8.5 percent?
Predicted Labor.Hours = 4149.89 + 0.79(350) − 13.17(8.5) + 623.55(1) ≈ 4938.0 hours.
Fitting Linear Models in R
lm(formula, data) is used to fit linear models.
Example
> lm0 <- lm(Y ~ X1 + X2 + X3, data = Grocery)
> lm0
Call:
lm(formula = Y ~ X1 + X2 + X3, data = Grocery)
Coefficients:
(Intercept) X1 X2 X3
4149.8872 0.7871 -13.1660 623.5545
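With lm0 in hand, the holiday-week prediction from the previous slides can be reproduced with predict(); a sketch, where new_week is a hypothetical data frame:
> new_week <- data.frame(X1 = 350, X2 = 8.5, X3 = 1)
> # about 4937 hours; slightly different from the hand calculation,
> # which used coefficients rounded to two decimals
> predict(lm0, newdata = new_week)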
Fitted Values and Residuals
• The i-th fitted value is denoted Ŷi and defined by
  Ŷi = β̂0 + β̂1Xi1 + β̂2Xi2 + · · · + β̂pXip.
• Denote the fitted values by the n × 1 vector
  Ŷ = (Ŷ1, Ŷ2, . . . , Ŷn)^T = Xβ̂.
• The i-th residual, denoted ei, is the difference between the actual response value and its corresponding fitted value: ei = Yi − Ŷi.
• Denote the residuals by the n × 1 vector
  e = (e1, e2, . . . , en)^T = Y − Ŷ.
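These formulas can be checked directly with the matrix objects from the earlier slides (a sketch; X, beta_hat, and Grocery as constructed before). The results should match fitted(lm0) and residuals(lm0) on the next slide:
> Y_hat <- X %*% beta_hat    # fitted values, X times beta-hat
> e <- Grocery$Y - Y_hat     # residuals, Y minus Y-hat
> head(cbind(Y_hat, e))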
Fitted Values and Residuals
For an estimated linear model in R,
• Compute residuals with residuals().
• Compute fitted values with fitted().
Example
> lm0 <- lm(Y ~ X1 + X2 + X3, data = Grocery)
> residuals(lm0)[1:5]
1 2 3 4 5
-32.06348 169.20509 -21.82543 -54.11955 75.93372
> fitted(lm0)[1:5]
1 2 3 4 5
4296.063 4326.795 4338.825 4346.120 4869.066
Model Summary
> summary(lm0)
Residuals:
Min 1Q Median 3Q Max
-264.05 -110.73 -22.52 79.29 295.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4149.8872   195.5654  21.220  < 2e-16 ***
X1             0.7871     0.3646   2.159   0.0359 *
X2           -13.1660    23.0917  -0.570   0.5712
X3           623.5545    62.6409   9.954 2.94e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared: 0.6883, Adjusted R-squared: 0.6689
F-statistic: 35.34 on 3 and 48 DF, p-value: 3.316e-12

Section III
Linear Algebra Review

Rank
Definition
The rank of a matrix A, denoted rank(A), is the number of linearly independent columns of A.
Example
A = ( v1 v2 v3 ) =
    [  3  −2    8 ]
    [ −2  12  −16 ]
    [  7   9    5 ]
rank(A) = 2 because v1 and v2 are linearly independent, but
0 = 2v1 − v2 − v3 = 2(3, −2, 7)^T − (−2, 12, 9)^T − (8, −16, 5)^T = (0, 0, 0)^T.

Inverse
Theorem
The following statements are equivalent for a p × p matrix A:
• A is invertible.
• C(A) = R^p (the column space of A is all of R^p).
• rank(A) = p.
Theorem
If X is an n × p matrix with rank(X) = p, then rank(X^T X) = p.
Why do we care about the above result? The least squares estimate
β̂ = (X^T X)^(-1) X^T Y
exists only when X^T X is invertible, i.e., when X has full column rank.

In the Grocery retailer example, consider introducing another variable, "not holiday".
Example
For i = 1, 2, . . . , 52 weeks,
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi,    εi ∼ N(0, σ²),
with Xi3 = 1 for a holiday week and 0 otherwise, and Xi4 = 1 for a non-holiday week and 0 otherwise.
For week i,
• Total labor hours (Yi).
• Number of cases shipped (Xi1).
• Indirect costs of labor as a percentage of total costs (Xi2).
• Holiday (Xi3).
• Not holiday (Xi4).

Multiple Linear Regression
Example
X =
    [ 1  305.657  7.17  0  1 ]
    [ 1  328.476  6.20  0  1 ]
    [ 1  317.164  4.61  0  1 ]
    [ 1  366.745  7.02  0  1 ]
    [ 1  265.518  8.61  1  0 ]
    [ 1  301.995  6.88  0  1 ]
    [ ...                    ]
    [ 1  442.782  7.61  1  0 ]
    [ 1  322.303  7.39  0  1 ]
    [ 1  290.455  7.99  0  1 ]
    [ 1  411.750  7.83  0  1 ]
    [ 1  292.087  7.77  0  1 ]
Notice: col1 = col4 + col5, so rank(X) = 4 and rank(X^T X) = 4, while X has 5 columns. X^T X is therefore not invertible, and β̂ = (X^T X)^(-1) X^T Y is not defined.

Section IV
The Bootstrap

The Bootstrap Principle
• If we could repeat an experiment over and over again, we could actually find a very good approximation to the sampling distribution.
• Grocery example: if we had 1000 years of data, we could run the regression model on each year to see how the estimates change.
• This is often too expensive or time-consuming.
• Bradley Efron's idea (1979): use computers to simulate replication.
• Instead of repeatedly obtaining new, independent datasets from the population, we repeatedly obtain datasets from the sample itself, the original dataset. (A sketch for the Grocery regression follows.)
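As a preview of that last point, here is a minimal sketch of the bootstrap applied to the Grocery regression (it resamples whole weeks, i.e., rows, with replacement; assumes the Grocery data frame from Section II; the procedure is formalized on the next slides):
> B <- 1000
> n <- nrow(Grocery)
> boot_coefs <- matrix(NA, nrow = B, ncol = 4)
> for (b in 1:B) {
+   rows <- sample(1:n, size = n, replace = TRUE)   # resample weeks
+   boot_coefs[b, ] <- coef(lm(Y ~ X1 + X2 + X3, data = Grocery[rows, ]))
+ }
> apply(boot_coefs, 2, sd)   # spread of each coefficient across resamples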
Bootstrap Methods
"Pull yourself up by your bootstraps!"

Bootstrap Methods
To get a bootstrap estimate,
1. Resample from the original data n times with replacement (note an original data observation could be in the new sample more than once),
2. Use the new dataset to compute a bootstrap estimate,
3. Repeat this to create B new datasets, and B new estimates.

Bootstrap Methods
Formally, you have original data (xi)_{i=1}^n and you are interested in estimating a population parameter Θ from the data. Label the estimate Θ̂.
Procedure
1. For b = 1, . . . , B,
   • Create a new dataset B_b = (x_i^(b))_{i=1}^n by sampling from the original dataset with replacement.
   • Use the new dataset to find an estimate Θ̂^(b).
2. The collection (Θ̂^(b) − Θ̂)_{b=1}^B estimates the sampling distribution of Θ̂ − Θ.

Example: Gaussian Random Variables
You sample n = 100 data points, x1, . . . , x100 ∼ N(µ, 1). (Recall, Lab 1.)
> n <- 100
> vec <- rnorm(n, mean = mu)
> head(vec)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
[6] -0.8204684
• What’s a good estimator for µ?
> mean(vec)
[1] 0.1088874
Set µ̂ = 0.11. Recall, µ̂ ∼ N(µ, 1/100).
• How can we estimate Var(µ̂)?
Example: Gaussian Random Variables
We'll use the bootstrap to estimate the variance! For b = 1, . . . , B,
• Resample x1, . . . , x100 with replacement to get x1^(b), . . . , x100^(b).
• Compute µ̂^(b) = (1/100) Σ_{i=1}^{100} x_i^(b).
> B <- 1000
> estimates <- vector(length = B)
> for (b in 1:B) {
+   new_sample <- sample(vec, size = n, replace = TRUE)
+   estimates[b] <- mean(new_sample)
+ }
> head(estimates)
[1] -0.033571603 0.088058195 0.164764913 0.007993406
[5] 0.313045453 0.114700212
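The same loop can be written more compactly with replicate() (equivalent up to the random draws):
> estimates <- replicate(B, mean(sample(vec, size = n, replace = TRUE)))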
Example: Gaussian Random Variables
[Figure: a histogram of the original sample beside a histogram of a single resampled bootstrap sample.]
Example: Gaussian Random Variables
The Bootstrap Distribution of the Statistic. Recall that (µ̂^(b) − µ̂)_{b=1}^B approximates the sampling distribution of µ̂ − µ. [Figure: histogram of the bootstrap estimates.]
Example: Gaussian Random Variables
We’ll use the bootstrap to estimate the variance!
Estimating the Variance
> var(estimates)
[1] 0.007474996
True variance: Var(µ̂) = σ²/n = 1/100 = 0.01.
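The boot package (included with R) wraps this resampling workflow; a sketch with the same vec:
> library(boot)
> # the statistic must accept the data and a vector of resampled indices
> boot_mean <- function(data, indices) mean(data[indices])
> out <- boot(data = vec, statistic = boot_mean, R = 1000)
> var(out$t)   # bootstrap estimate of Var(mu-hat); close to 1/100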
Regular Bootstrap Intervals
Regular Bootstrap Interval Formula:
• The 95% bootstrap interval is (L, U), where
  L = 2Θ̂ − Θ̂*_{0.975} and U = 2Θ̂ − Θ̂*_{0.025}.
• Note that Θ̂*_p is the p-th percentile of (Θ̂^(b))_{b=1}^B.
> L <- 2*mean(vec) - quantile(estimates, .975); L
     97.5%
-0.0592433
> U <- 2*mean(vec) - quantile(estimates, .025); U
      2.5%
 0.2759329

Regular Bootstrap Intervals (Derivation)
[Derivation worked out on the slides; no text recoverable.]

Percentile Based Bootstrap Intervals
Percentile Based Bootstrap Formula:
• The 95% percentile bootstrap interval is (L, U), where
  L = Θ̂*_{0.025} and U = Θ̂*_{0.975}.
• Note that Θ̂*_p is the p-th percentile of (Θ̂^(b))_{b=1}^B.
> L <- quantile(estimates, .025); L
       2.5%
-0.05815818
> U <- quantile(estimates, .975); U
   97.5%
0.277018

Bootstrap Intervals
[Figure: bootstrap intervals illustrated as a nonparametric testing device; no text recoverable.]

Bootstrapping Summary
Bootstrapping is very flexible!
• Bootstrapping gives you a distribution over estimators.
• This can be used to:
  • Approximate more complicated metrics (medians, quantiles, etc.).
  • Approximate distributional properties.
  • Create confidence intervals.
• By resampling (xi, yi)_{i=1}^n pairs, we could create bootstrap estimators for linear model regression parameters (sketched earlier for the Grocery regression).
Intervals
• Intervals can be used as a nonparametric hypothesis testing procedure.
• Both the regular and percentile based intervals are common techniques.
• The percentile based bootstrap interval is more intuitive to construct.

Section V
Permutation Tests

Nonparametric Statistics
• We can all agree that bootstrapping is awesome!
• Are there other useful computation-based nonparametric methods?
• Yes... obviously.
• Traditional statistical inference assumes a parametric model on your data.
• We can relax some aspects of the parametric assumption.

Cats
The cats dataset includes the heart and body weights of samples of male and female cats. All the cats are adults and over 2 kg in body weight.
> # install.packages("MASS")
> library(MASS)
> head(cats)
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
Male and Female Cat Heart Weights
> boxplot(cats$Hwt ~ cats$Sex,
+         main = "Male and Female Cat Heart Weights")
[Figure: side-by-side boxplots of cats$Hwt (heart weight) by cats$Sex, titled "Male and Female Cat Heart Weights".]
Male and Female Cat Heart Weights
> plot(density(cats$Hwt[cats$Sex == "F"]), col = "red",
+      xlim = c(4, 18), main = "Male and Female Cat Heart Weights")
> lines(density(cats$Hwt[cats$Sex == "M"]), col = "blue")
[Figure: kernel density estimates of female (red) and male (blue) cat heart weights; female curve: N = 47, bandwidth = 0.5442.]
Male and Female Cat Heart Weights
Recall the Independent Two-Sample T-test
• Tests if the population means of two samples are equal.
• Assume the data in each sample come from a normal distribution.
> girlcats <- cats$Sex == "F"
> t.test(cats$Hwt[girlcats], cats$Hwt[!girlcats])
Welch Two Sample t-test
data: cats$Hwt[girlcats] and cats$Hwt[!girlcats]
t = -6.5179, df = 140.61, p-value = 1.186e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.763753 -1.477352
sample estimates:
mean of x mean of y
9.202128 11.322680
Male and Female Cat Heart Weights
What if the assumptions don’t hold?
• In our dataset, 47 female cats and 97 male cats.
• We study the test statistic
  D̂ = mean(XF) − mean(XM),
  where XF are the female cat heart weights and XM are the male cat heart weights.
• For our data, D̂ = −2.12.
Permutation Principle
If there were no difference in male and female heart weights (null
hypothesis), then all datasets obtained by randomly assigning 47 of the
values in cats$Hwt to female cats and 97 to male cats would have equal
chance of being observed in the study.
Male and Female Cat Heart Weights
What if the assumptions don’t hold?
This gives us
C(144, 47) = 144! / (47! 97!) = 2.231033 × 10^38
possible two-sample datasets (under the null hypothesis).
How rare is our result?
How many of these datasets have |mean(XF) − mean(XM)| ≥ 2.12? What is the probability of seeing differences in the group means as extreme (or more extreme) than ours?
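The count of possible datasets can be verified in R:
> choose(144, 47)
[1] 2.231033e+38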
Male and Female Cat Heart Weights
What if the assumptions don’t hold?
• Could consider each of the 2.231033 × 10^38 possible two-sample datasets, calculate |mean(XF) − mean(XM)| for each, and then compare to our original D̂…
• Instead we use a permutation test, which randomly samples from the 2.231033 × 10^38 possible two-sample datasets and estimates a p-value based on our original D̂.
Permutation Test
Steps
For step i = 1, 2, . . . ,P,
• Randomly assign heart weights in cats$Hwt as either male or female
with exactly 97 males and 47 females.
• Compute D̂i = mean(XF,i) − mean(XM,i), where XF,i are the female cat heart weights in step i and XM,i are the male cat heart weights.
Finally, we calculate a (two-sided) p-value as follows:
(1/P) Σ_{i=1}^{P} I(|D̂i| ≥ |D̂|).
Permutation Test
What are we actually testing?
• Using the permutation test, we test the null hypothesis that the two
samples are from the same distribution.
• How do we do this? Have we seen enough evidence (in the form of
the observed difference between the sample means being large
enough) to reject the null hypothesis that the two groups have
identical probability distributions?
Check Yourself: Permutation Test
Task
Fill in the following code to run a permutation test on the cats$Hwt data.
> girlcats <- cats$Sex == "F"
> Dhat <- mean(cats$Hwt[girlcats]) - mean(cats$Hwt[!girlcats])
> nf <- sum(girlcats); nm <- sum(!girlcats)
> P <- 10000
> sample_diffs <- rep(NA, P)
> for (i in 1:P) {
+ ###################
+ ## Add code here ##
+ ###################
+ }
> pval <- mean(abs(sample_diffs) >= abs(Dhat))
Check Yourself: Permutation Test
Solution
> girlcats <- cats$Sex == "F"
> Dhat <- mean(cats$Hwt[girlcats]) - mean(cats$Hwt[!girlcats])
> nf <- sum(girlcats); nm <- sum(!girlcats)
> P <- 100000
> sample_diffs <- rep(NA, P)
> for (i in 1:P) {
+   perm_data <- cats$Hwt[sample(1:(nf + nm))]   # shuffle all heart weights
+   meanf <- mean(perm_data[1:nf])               # first nf values play the females
+   meanm <- mean(perm_data[-(1:nf)])            # remaining nm values play the males
+   sample_diffs[i] <- meanf - meanm
+ }
> pval <- mean(abs(sample_diffs) >= abs(Dhat))
> pval
[1] 0
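A side note, not from the slides: a reported p-value of exactly 0 only means that none of the P permuted differences was as extreme as |D̂|. A common correction counts the observed labeling as one of the permutations, so the estimated p-value is never below 1/(P + 1):
> (sum(abs(sample_diffs) >= abs(Dhat)) + 1) / (P + 1)   # here 1/(P + 1)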