MAST90083 Computational Statistics & Data Mining Regression Splines
Tutorial & Practical 5: Regression Splines
The implementation of splines is described in detail in your course book; here we are
going to call built-in functions from R. Our aim in this tutorial is to use different types
of splines to estimate a smooth curve from noisy data (which we can also call the predicted data).
Question 1:
This question consists of curve generation. The first task uses random number generation,
so set the seed equal to 5.
1. Noisy data: Generate one-dimensional noisy data, termed y, consisting of 50
samples. For this purpose, use the expression [y = cos(2πx) − 0.2x + e], where x is
drawn from a uniform distribution between 0 and 1, and e is drawn from a normal
distribution with zero mean and standard deviation 0.2.
2. True data: Also generate one-dimensional noise-free (true) data, termed b, given
by [b = cos(2πa) − 0.2a], where a consists of 50 samples generated as an evenly
spaced sequence between 0 and 1.
3. Plotting: Using the plot command in R, plot y against x; then, using the lines
command, plot b against a.
Hint: Use these R commands: set.seed for setting the seed, runif for the uniform distribution,
rnorm for the normal distribution, and seq for a.
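The steps above can be sketched as follows (variable names x, y, a, b are those used in the question; the seed and sample size are as specified):

```r
set.seed(5)                          # seed fixed at 5, as the question asks
n <- 50
x <- runif(n)                        # x ~ Uniform(0, 1)
e <- rnorm(n, mean = 0, sd = 0.2)    # noise with mean 0, sd 0.2
y <- cos(2 * pi * x) - 0.2 * x + e   # noisy data
a <- seq(0, 1, length.out = n)       # evenly spaced grid of 50 points
b <- cos(2 * pi * a) - 0.2 * a       # true (noise-free) curve
plot(x, y)                           # noisy points
lines(a, b)                          # true curve on top
```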
Question 2:
This question is about fitting a natural cubic spline to the noisy data generated in Question 1.
1. Generate interior knots k for x by first deciding on the number of knots, and then
choosing their locations (you may want to use the quantile command).
2. Using these knots k and the ns function in the splines package [xns = ns(x, knots = k, intercept =
TRUE)], generate a B-spline basis matrix xns for a natural cubic spline with boundary
knots at 0 and 1. Use ?ns at the command prompt to learn how to impose boundary
knots.
3. Use this basis matrix to create a fit for your noisy data, using the pracma package (for pinv),
as [yfit = xns (xns^T xns)^(-1) xns^T y]. The expression [(xns^T xns)^(-1) xns^T] can also be
implemented using pinv(xns).
4. Draw all three plots together, i.e. use plot for the noisy data, lines for the true data,
and dashed lines for the fitted data. You may want to choose a different color for each.
5. Experiment with different numbers of knots and different knot placements.
6. At what point does the overfitting start to happen?
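A sketch of these steps, assuming the data x, y, a, b from Question 1 (regenerated here so the code is self-contained) and an illustrative choice of five interior knots at equally spaced quantiles:

```r
## Data carried over from Question 1
set.seed(5)
x <- runif(50); y <- cos(2 * pi * x) - 0.2 * x + rnorm(50, sd = 0.2)
a <- seq(0, 1, length.out = 50); b <- cos(2 * pi * a) - 0.2 * a

library(splines)
library(pracma)   # for pinv

## Five interior knots at equally spaced quantiles of x (an illustrative choice)
k <- quantile(x, probs = (1:5) / 6)
xns <- ns(x, knots = k, intercept = TRUE, Boundary.knots = c(0, 1))

## Least-squares fit yfit = xns (xns^T xns)^(-1) xns^T y via the pseudoinverse
yfit <- xns %*% pinv(xns) %*% y

plot(x, y, col = "grey40")                             # noisy data
lines(a, b, col = "blue")                              # true curve
lines(sort(x), yfit[order(x)], lty = 2, col = "red")   # fitted curve (dashed)
```

Sorting by x before drawing the fitted line avoids the zig-zag that lines produces on unordered points.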
Question 3:
This question is about fitting a cubic smoothing spline to the noisy data generated in Question
1. This time, however, we will not generate the basis ourselves; instead, we will let R do all
the work via the gam function.
1. Use the gam function from the gam package [xss = gam(y ~ s(x, df = 6))] to generate
an object xss that has all the information necessary for curve fitting, where df refers to
the degrees of freedom.
2. Use the object xss and R's predict function to create a fit for the noisy data, i.e. [yfit =
predict(xss)].
3. Draw all three plots together, i.e. use the plot function for the noisy data, lines for the
true data, and dashed lines for the fitted data.
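These steps can be sketched as follows, regenerating the Question 1 data so the code stands alone:

```r
## Data carried over from Question 1
set.seed(5)
x <- runif(50); y <- cos(2 * pi * x) - 0.2 * x + rnorm(50, sd = 0.2)
a <- seq(0, 1, length.out = 50); b <- cos(2 * pi * a) - 0.2 * a

library(gam)
xss <- gam(y ~ s(x, df = 6))   # smoothing-spline term with 6 degrees of freedom
yfit <- predict(xss)           # fitted values at the observed x

plot(x, y, col = "grey40")                             # noisy data
lines(a, b, col = "blue")                              # true curve
lines(sort(x), yfit[order(x)], lty = 2, col = "red")   # fitted curve (dashed)
```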
Question 4:
This question is about plotting the MSE curve against the degrees of freedom for a cubic
smoothing spline fit.
1. Using the gam function from the last question, estimate the mean squared error between
the predicted data (yfit) and the true data (b) for degrees of freedom (df) ranging from
1 to 15.
2. Draw the plot of MSE against df and label the plot.
3. What is the optimal number of degrees of freedom for your example?
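One way to sketch this loop (an assumption made here: since b is evaluated on the grid a rather than at the observed x, the true function is re-evaluated at x so that the comparison is point-by-point):

```r
## Data carried over from Question 1
set.seed(5)
x <- runif(50); y <- cos(2 * pi * x) - 0.2 * x + rnorm(50, sd = 0.2)
btrue <- cos(2 * pi * x) - 0.2 * x   # true curve evaluated at the observed x

library(gam)
dfs <- 1:15
mse <- sapply(dfs, function(d) {
  yfit <- predict(gam(y ~ s(x, df = d)))   # df = 1 is essentially a linear fit
  mean((yfit - btrue)^2)                   # MSE against the true curve
})

plot(dfs, mse, type = "b", xlab = "Degrees of freedom", ylab = "MSE",
     main = "MSE of cubic smoothing spline fit vs. df")
dfs[which.min(mse)]   # df minimising the MSE for this realisation
</code>
```

The minimiser depends on the random realisation, so report the df you observe rather than a fixed number.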
Question 5:
This question is about fitting a penalized spline. For this purpose, we will use the smooth.spline
function. The main difference between the gam function and smooth.spline is that smooth.spline
can model only a single predictor. For sample sizes greater than 50, smooth.spline uses a
penalized spline approximation.
1. Read the provided file data.txt into a data variable using the read.table function, and
place its first column (after dropping the first row and converting it to numeric using the
as.numeric function) into x and the second column into y.
2. With the smoothing parameter spar set to 0.9 and the use of all knots set to FALSE, use the
function [xps = smooth.spline(x, y, spar = 0.9, all.knots = FALSE)] to find a penalized
spline fit.
3. Use function [yfit = predict(xps,x)$y] to find a fit for y.
4. Draw both plots together, i.e. use the plot function for the original y and dashed lines
for the fitted yfit.
5. What happens if you vary the smoothing parameter between 0 and 1.2? At what values
of this parameter do overfitting and underfitting start?
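A sketch of steps 1–4, assuming data.txt sits in the working directory with a header row to be dropped (the exact file layout is an assumption; adjust the indexing to match the file you were given):

```r
## Read the provided file; drop the first (header) row and coerce to numeric
data <- read.table("data.txt")        # assumes data.txt is in the working directory
x <- as.numeric(data[-1, 1])
y <- as.numeric(data[-1, 2])

## Penalized spline fit: smoothing parameter 0.9, not all points used as knots
xps <- smooth.spline(x, y, spar = 0.9, all.knots = FALSE)
yfit <- predict(xps, x)$y             # fitted values at the observed x

plot(x, y)                            # original data
lines(sort(x), yfit[order(x)], lty = 2)   # fitted curve (dashed)
```

Re-running the fit with spar values between 0 and 1.2 and re-plotting shows the transition from a wiggly, overfitted curve (small spar) to an oversmoothed, underfitted one (large spar).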