MAST90083 Computational Statistics & Data Mining Bootstrap and SVM
Practical 10: Bootstrap and Support Vector Machines
This practical covers
• the use of the bootstrap R package, boot, and
• the use of the support vector machines R package, e1071.
Question 1: Bootstrap
The table below gives 13 measurements of corrosion loss (yi) in copper-nickel alloys, each
with a specific iron content (xi):
x    0.01   0.48   0.71   0.95   1.19   0.01   0.48   1.44   0.71   1.96   0.01   1.44   1.96
y   127.6  124.0  110.8  103.9  101.5  130.1  122.0   92.3  113.1   83.7  128.0   91.4   86.2
Of interest is the change in corrosion in the alloys as the iron content increases, relative to
the corrosion loss when there is no iron. Thus θ = β1/β0 is the quantity we estimate from the
linear regression model yi = β0 + β1 xi + εi. Now, set the seed to 5 and use the boot function
in R to estimate the bias and standard deviation of θ using the following approaches. Also, for
each approach, use boot.ci to compute a normal confidence interval for θ.
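A minimal R sketch for setting up this data and the plug-in estimate of θ might look as follows
(the object names corrosion, fit and theta.hat are our own choices, not prescribed by the practical):

library(boot)

## Corrosion data from the table above
x <- c(0.01, 0.48, 0.71, 0.95, 1.19, 0.01, 0.48, 1.44, 0.71, 1.96, 0.01, 1.44, 1.96)
y <- c(127.6, 124.0, 110.8, 103.9, 101.5, 130.1, 122.0, 92.3, 113.1, 83.7, 128.0, 91.4, 86.2)
corrosion <- data.frame(x = x, y = y)

## Plug-in estimate of theta = beta1 / beta0 from the fitted regression
fit <- lm(y ~ x, data = corrosion)
theta.hat <- unname(coef(fit)[2] / coef(fit)[1])
theta.hat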
1. Use the “bootstrap the cases” approach to obtain 20 different bootstrap replicates of the
estimator θ. To achieve this, use the boot function with the lm function and 20 replicates.
This allows you to resample the pairs (xi, yi) with replacement and to estimate replicates
of θ as θ∗j, j = 1, ..., 20. Print the results from this function and use boot.array to find
the array of bootstrap resamples. Use this array to do “bootstrap the cases” manually and
compare the results with those from the boot function. (An R sketch covering this and the
next two approaches is given after this list.)
2. Use the “bootstrap the residuals” approach to obtain 20 different bootstrap replicates of
the estimator θ. As in the previous case, you can use the boot function with the lm function
and 20 replicates. This allows you to resample the residuals εi with replacement to construct
bootstrap responses y∗i = ŷi + ε∗i (where the ε∗i are bootstrap samples of the residuals and
the ŷi are the fitted values from the initial fit), and to estimate replicates of θ as θ∗j,
where j = 1, ..., 20.
3. Use the “parametric bootstrap” approach to obtain 20 different bootstrap replicates of the
estimator θ. In this case, use the boot function along with the lm function, 20 replicates,
the mle parameters, and ran.gen.
4. Let ρ = cor(x, y) and perform a non-parametric bootstrap for ρ with 5000 replicates.
Overlay the density plot on the histogram of the 5000 replicates, and annotate the plot with
the original value of ρ and the normal bootstrap confidence interval. (A sketch for this part
also follows the list.)
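A hedged R sketch of the three regression-based bootstrap schemes in parts 1-3, reusing the
corrosion data frame set up earlier, is given below; the statistic functions, the error model
assumed in ran.gen, and all object names are illustrative choices rather than the prescribed solution.

set.seed(5)

## Part 1 -- bootstrap the cases: resample (x_i, y_i) pairs through the index vector i
theta.cases <- function(dat, i) {
  fit <- lm(y ~ x, data = dat[i, ])
  unname(coef(fit)[2] / coef(fit)[1])
}
b.cases <- boot(corrosion, theta.cases, R = 20)
b.cases                                          # bias and standard error of theta
boot.ci(b.cases, type = "norm")
ind <- boot.array(b.cases, indices = TRUE)       # one row of resample indices per replicate
theta.manual <- apply(ind, 1, function(i) theta.cases(corrosion, i))
cbind(boot = b.cases$t, manual = theta.manual)   # manual recomputation should agree row by row

## Part 2 -- bootstrap the residuals: resample e_i and form y*_i = yhat_i + e*_i
fit0 <- lm(y ~ x, data = corrosion)
res.df <- data.frame(x = corrosion$x, fitted = fitted(fit0), res = resid(fit0))
theta.resid <- function(dat, i) {
  y.star <- dat$fitted + dat$res[i]
  fit <- lm(y.star ~ dat$x)
  unname(coef(fit)[2] / coef(fit)[1])
}
b.resid <- boot(res.df, theta.resid, R = 20)
b.resid
boot.ci(b.resid, type = "norm")

## Part 3 -- parametric bootstrap: simulate responses from the fitted normal model
gen.fun <- function(dat, mle) {
  dat$y <- mle$fitted + rnorm(nrow(dat), 0, mle$sigma)   # assumed Gaussian error model
  dat
}
theta.param <- function(dat) {
  fit <- lm(y ~ x, data = dat)
  unname(coef(fit)[2] / coef(fit)[1])
}
b.param <- boot(corrosion, theta.param, R = 20, sim = "parametric",
                ran.gen = gen.fun,
                mle = list(fitted = fitted(fit0), sigma = summary(fit0)$sigma))
b.param
boot.ci(b.param, type = "norm")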
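For part 4, a nonparametric bootstrap of ρ = cor(x, y) with 5000 replicates, together with the
requested histogram, density overlay and annotations, might be sketched as follows (the bin count,
colours and line types are arbitrary choices):

set.seed(5)
rho.fun <- function(dat, i) cor(dat$x[i], dat$y[i])
b.rho <- boot(corrosion, rho.fun, R = 5000)
ci <- boot.ci(b.rho, type = "norm")$normal          # (confidence level, lower, upper)

hist(b.rho$t, breaks = 40, freq = FALSE,
     main = "Bootstrap distribution of rho", xlab = "rho*")
lines(density(as.numeric(b.rho$t)), lwd = 2)        # density overlay
abline(v = b.rho$t0, col = "red", lwd = 2)          # original value of rho
abline(v = ci[2:3], col = "blue", lty = 2)          # normal bootstrap CI limits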
Question 2: Support Vector Machines
Set the seed to 114, and let us consider an example with two classes. For this purpose, generate
two classes (X1, X2) of vector values (i.e. p = 2 predictors), each of size n = 100, from two different
probability distributions, respectively. The distribution for the first class is a mixture of
10 bivariate Gaussians, with the 10 means sampled from a N((1, 0)^T, C) distribution with C = I,
where I is the 2 × 2 identity matrix. The distribution for the second class is the same except
that the 10 means are sampled from N((0, 1)^T, C) with C = I. The data points of each class are
then generated by first randomly picking one of the ten bivariate Gaussians, and then sampling
from that bivariate Gaussian with the associated mean and the covariance C = 0.2 I2, where I2 is
the 2 × 2 identity matrix (a data-generation sketch is given after the list below). Obtain a
decision boundary for this simulated data by
1. Finding where the two mixture density functions are equal. The boundary can be
visually displayed as a level-0 contour of the difference between the two densities on a
grid of points.
2. Using the svm function with a polynomial kernel of degree one (x′·y in R). Is your data
linearly separable? The boundary can be visually displayed as a level-0 contour of the
svm fit on a grid of points (see the svm sketch after this list).
3. Using the svm function with a third-degree polynomial kernel ((γ·x′·y + C)^d in R) to make
a nonlinear decision boundary. Try different values of the parameter γc (slide 69) and of γ
in the R polynomial kernel. The boundary can be visually displayed as a level-0 contour of
the svm fit on a grid of points.
4. Using the svm function with a radial basis function kernel to make a nonlinear decision boundary.
Try different values of γc and of γ in the R radial basis function. The boundary can be
visually displayed as a level-0 contour of the svm fit on a grid of points.
5. What effect does the parameter γc have on the decision boundary of the polynomial kernel of
degree three, and what effect do both γc and γ of the R radial basis function have on the
decision boundary?
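A sketch of the data generation and of the level-0 contour of the density difference in part 1 is
given below; the helper functions draw.class and mix.dens, the grid resolution and the equal
mixture weights are our own modelling choices under the description above.

library(MASS)    # mvrnorm
set.seed(114)

p <- 2; n <- 100; K <- 10
C <- diag(p)

## 10 component means per class, drawn around (1,0)' and (0,1)'
mu1 <- mvrnorm(K, c(1, 0), C)
mu2 <- mvrnorm(K, c(0, 1), C)

## n points per class: pick a component at random, then sample with covariance 0.2*I2
draw.class <- function(mu) {
  comp <- sample(1:K, n, replace = TRUE)
  t(sapply(comp, function(k) mvrnorm(1, mu[k, ], 0.2 * diag(p))))
}
X   <- rbind(draw.class(mu1), draw.class(mu2))
cls <- factor(rep(c(1, 2), each = n))

## mixture density (equal weights 1/K) evaluated at the rows of z
mix.dens <- function(z, mu) {
  rowMeans(sapply(1:K, function(k)
    dnorm(z[, 1], mu[k, 1], sqrt(0.2)) * dnorm(z[, 2], mu[k, 2], sqrt(0.2))))
}

## grid of points for contouring
gx   <- seq(min(X[, 1]) - 0.5, max(X[, 1]) + 0.5, length.out = 100)
gy   <- seq(min(X[, 2]) - 0.5, max(X[, 2]) + 0.5, length.out = 100)
grid <- expand.grid(X1 = gx, X2 = gy)

## Part 1: level-0 contour of the difference between the two mixture densities
d.diff <- mix.dens(as.matrix(grid), mu1) - mix.dens(as.matrix(grid), mu2)
plot(X, col = as.numeric(cls), pch = 19, xlab = "X1", ylab = "X2")
contour(gx, gy, matrix(d.diff, length(gx)), levels = 0, add = TRUE, lwd = 2)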
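Continuing from the objects defined in that sketch, the svm fits in parts 2-4 might follow the
pattern below; the cost, gamma, coef0 and degree arguments of svm are the natural places to vary
the parameters referred to in parts 3-5, and the specific values used here are placeholders to
experiment with, not prescribed settings.

library(e1071)
dat <- data.frame(X1 = X[, 1], X2 = X[, 2], cls = cls)

## helper: plot the data and the level-0 contour of the svm decision values over the grid
svm.boundary <- function(fit, title) {
  dv <- attributes(predict(fit, grid, decision.values = TRUE))$decision.values
  plot(X, col = as.numeric(cls), pch = 19, xlab = "X1", ylab = "X2", main = title)
  contour(gx, gy, matrix(dv, length(gx)), levels = 0, add = TRUE, lwd = 2)
}

## Part 2: degree-1 polynomial (i.e. linear) kernel
fit.lin <- svm(cls ~ ., data = dat, kernel = "polynomial", degree = 1, cost = 1)
svm.boundary(fit.lin, "Degree-1 polynomial kernel")

## Part 3: degree-3 polynomial kernel -- vary cost, gamma and coef0
fit.poly <- svm(cls ~ ., data = dat, kernel = "polynomial", degree = 3,
                gamma = 1, coef0 = 1, cost = 1)
svm.boundary(fit.poly, "Degree-3 polynomial kernel")

## Part 4: radial basis kernel -- vary cost and gamma
fit.rbf <- svm(cls ~ ., data = dat, kernel = "radial", gamma = 1, cost = 1)
svm.boundary(fit.rbf, "Radial basis kernel")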