MAST90083 Computational Statistics & Data Mining KR and GAM

Tutorial & Practical 6: Local & Kernel Regression (KR) and Generalized Additive Models (GAM)

For this practical, generate a sinusoid by extending the curvy dataset from the last practical
to 250 samples. Increase the range of the uniform distribution used for the noisy data from 1
to 5, and likewise the range of the sequence used for the true data from 1 to 5. Set the seed to 25.
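
As a starting point, the setup above might be sketched as follows. The curve f, the noise level, and all variable names are hypothetical stand-ins, since the curvy dataset from the last practical is not reproduced in this sheet; substitute the actual definition from that practical.

```r
# Minimal sketch of the data setup; f and the noise sd are assumptions,
# not the actual "curvy" definition from the previous practical.
set.seed(25)                              # seed requested in this practical
n <- 250                                  # extended sample size

f <- function(x) sin(2 * x)               # hypothetical sinusoid
x <- sort(runif(n, min = 0, max = 5))     # uniform design points, range 5
y <- f(x) + rnorm(n, sd = 0.3)            # noisy data (sd is an assumption)

x_true <- seq(0, 5, length.out = n)       # sequence for the true data, range 5
y_true <- f(x_true)

plot(x, y, col = "grey40", main = "Noisy and true data")
lines(x_true, y_true, col = "red", lwd = 2)
```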

Question 1:

This question is about implementing a local regression using the loess function.

1. Using the function "loess(y ~ x)", fit a local regression to y, leaving all other
arguments of the loess function at their defaults. Use an overlaid plot of the fitted, true,
and noisy data to see the results.

2. Since the first part did not produce a good fit to the curve, now use two arguments of the
loess function, span and degree. What effect do the span and degree parameters have
on the fit?

3. Using a residual sum of squares (RSS) curve, select the best value for span. Remember that
the mean squared error (MSE) is a normalized version of RSS, MSE = RSS/n, where n is
the length of the data.

4. Change the sample size and range for your curve back to 50 samples and 1,
respectively. What effect does this have on the span parameter?
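
One possible way to build the RSS curve over span is sketched below. The data-generating curve f, the noise level, the span grid, and the variable names are assumptions for illustration only. The RSS here is computed in-sample against the noisy y, which tends to favour small spans, so comparing the fit with the true curve as well may be informative.

```r
# Minimal sketch: RSS curve over span for loess (all names are hypothetical).
set.seed(25)
n <- 250
f <- function(x) sin(2 * x)               # assumed stand-in for the curvy data
x <- sort(runif(n, 0, 5))
y <- f(x) + rnorm(n, sd = 0.3)

fit_default <- loess(y ~ x)               # defaults: span = 0.75, degree = 2

spans <- seq(0.1, 1, by = 0.05)           # assumed grid of span values
rss <- sapply(spans, function(s) {
  fit <- loess(y ~ x, span = s, degree = 1)   # local linear fit
  sum(residuals(fit)^2)                       # in-sample RSS
})
mse <- rss / n                            # MSE = RSS / n
best_span <- spans[which.min(rss)]

plot(spans, rss, type = "b", xlab = "span", ylab = "RSS")

fit_best <- loess(y ~ x, span = best_span, degree = 1)
plot(x, y, col = "grey40")
lines(x, f(x), col = "red", lwd = 2)                    # true curve
lines(x, fitted(fit_default), col = "darkgreen", lty = 2)  # default fit
lines(x, fitted(fit_best), col = "blue", lwd = 2)       # selected span
```

Rerunning the same sketch with n set to 50 and the range set to 1 gives a direct comparison for part 4.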

Question 2:

This question is about implementing a linear kernel regression using the locfit function from
the locfit package.

1. Using the function "locfit(y ~ lp(x, nn=1, deg=1))", fit a local linear regression to
y. As with the span parameter of loess, find the best value of nn for the fit. Then plot the
fitted, noisy, and true data in the same figure.

2. Using the function "locfit(y ~ lp(x, h=1), kern="rect")", fit a linear kernel regression;
here "h" is the bandwidth and "kern" is the kernel. Produce all the plots again. In the
function lp, use the "deg" parameter to select linear kernel regression.

3. Using RSS curves, select the best kernel and the best value of the bandwidth. You may
consider using a nested loop and a string vector.

4. Overlay the fit for the best kernel and bandwidth on a plot, together with the true
function and the (noisy) data.
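
The nested loop over kernels and bandwidths suggested in part 3 might look like the sketch below. It assumes the locfit package is installed; the curve f, the noise level, the bandwidth grid, and the particular subset of kernel names are illustrative choices, not values prescribed by the practical.

```r
# Minimal sketch: RSS over kernel and bandwidth with locfit
# (data, grids, and variable names are assumptions).
library(locfit)

set.seed(25)
n <- 250
f <- function(x) sin(2 * x)               # assumed stand-in for the curvy data
x <- sort(runif(n, 0, 5))
y <- f(x) + rnorm(n, sd = 0.3)

kernels    <- c("rect", "tria", "epan", "bisq", "tcub", "gauss")  # string vector
bandwidths <- seq(0.3, 2, by = 0.1)       # assumed grid for h

rss <- matrix(NA_real_, nrow = length(kernels), ncol = length(bandwidths),
              dimnames = list(kernels, bandwidths))

for (k in seq_along(kernels)) {
  for (b in seq_along(bandwidths)) {
    fit <- locfit(y ~ lp(x, h = bandwidths[b], deg = 1), kern = kernels[k])
    rss[k, b] <- sum((y - predict(fit, newdata = x))^2)   # in-sample RSS
  }
}

# One RSS curve per kernel
matplot(bandwidths, t(rss), type = "l", lty = 1, col = seq_along(kernels),
        xlab = "bandwidth h", ylab = "RSS")
legend("topleft", legend = kernels, lty = 1, col = seq_along(kernels))

# Kernel and bandwidth with the smallest RSS on this grid
best <- which(rss == min(rss), arr.ind = TRUE)[1, ]
best_kernel <- kernels[best["row"]]
best_h      <- bandwidths[best["col"]]

fit_best <- locfit(y ~ lp(x, h = best_h, deg = 1), kern = best_kernel)
plot(x, y, col = "grey40")
lines(x, f(x), col = "red", lwd = 2)
lines(x, predict(fit_best, newdata = x), col = "blue", lwd = 2)
```

The same pattern with "lp(x, nn = s, deg = 1)" and a single loop over s covers the nearest-neighbour search in part 1.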

Question 3:

This question is about fitting a different linear or non-linear function for each variable
using a generalized additive model (GAM). Load the libraries "ISLR" and "gam" and use the
"Wage" dataset for this question.

1. The call "M1 = gam(wage ~ s(age, df=5) + education, data=Wage)" defines model (i): it
fits wage using a smoothing spline function of age, while education is treated as a
qualitative variable. Use the command "coef(M1)" to read the coefficients and interpret them.

2. Extend the gam call to two more models: (ii) year is added to the model as a linear
function; (iii) year is added to the model using a natural cubic spline function.

3. Using the function "anova(M1, M2, M3, test="F")" for an F-test, determine which of the
three models above is statistically most significant by looking at the p-values in the
rightmost column.
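
The three models could be set up along the following lines. This sketch assumes the ISLR and gam packages are installed; the choice df = 4 for the natural spline in year is an assumption, since the question does not fix the degrees of freedom.

```r
# Minimal sketch of models (i)-(iii); df = 4 for ns(year) is an assumption.
library(ISLR)
library(gam)
library(splines)                          # provides ns()

# (i) smoothing spline in age, education as a qualitative variable
M1 <- gam(wage ~ s(age, df = 5) + education, data = Wage)
# (ii) year added as a linear term
M2 <- gam(wage ~ s(age, df = 5) + year + education, data = Wage)
# (iii) year added through a natural cubic spline
M3 <- gam(wage ~ s(age, df = 5) + ns(year, df = 4) + education, data = Wage)

print(coef(M1))                           # coefficients of model (i)

cmp <- anova(M1, M2, M3, test = "F")      # sequential F-tests
print(cmp)
```

When reading coef(M1), note that under R's default treatment contrasts each education coefficient is a difference from the baseline education level.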

Question 4:

This question is also about GAMs. Use the provided dataset "heartdis" for this question.

1. Using the gam function and a natural cubic spline for each variable except the qualitative
variable, fit the "chd" variable. Add the variables to the model one by one according to how
much they improve the AIC measure (forward stagewise; you need to choose the order, and
any order will be OK).

2. Now examine the plot of the marginal effect of the tobacco variable on heart disease.
To do this, first rerun the model (gam function) with all variables included; second,
estimate the basis matrix for the tobacco variable using the "ns" function; then multiply
the basis matrix from the ns function by the tobacco coefficients from the gam function.
Label the plot with tobacco on the x-axis and heart disease on the y-axis. Is the plot
linear?

3. What happens to the plot of the tobacco variable when you do not take out the effect of
the other variables? Plot this as a line plot, along with a plot of the original chd variable
against the tobacco variable from the heartdis data.
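
The basis-times-coefficients step in part 2 can be sketched as below. Because the heartdis file is not reproduced here, the sketch runs on a simulated stand-in data frame (demo) whose columns tobacco, age, famhist, and chd are hypothetical; the binomial family and df = 4 are also assumptions. With the real data, replace demo by heartdis and include all of its variables.

```r
# Minimal sketch on simulated stand-in data; column names other than
# tobacco and chd, the family, and df = 4 are assumptions.
library(gam)
library(splines)

set.seed(1)
n <- 300
demo <- data.frame(
  tobacco = rexp(n, rate = 0.25),
  age     = runif(n, 15, 64),
  famhist = factor(sample(c("Absent", "Present"), n, replace = TRUE))
)
eta <- -3 + 0.15 * demo$tobacco + 0.04 * demo$age +
  0.5 * (demo$famhist == "Present")
demo$chd <- rbinom(n, 1, plogis(eta))     # binary response

# Full model: natural cubic splines for the quantitative variables
fit_all <- gam(chd ~ ns(tobacco, df = 4) + ns(age, df = 4) + famhist,
               family = binomial, data = demo)

# Basis matrix for tobacco and the matching coefficients from the fit
tob_basis  <- ns(demo$tobacco, df = 4)
is_tob     <- grep("tobacco", names(coef(fit_all)), fixed = TRUE)
tob_coef   <- coef(fit_all)[is_tob]
tob_effect <- as.vector(tob_basis %*% tob_coef)   # partial effect of tobacco

ord <- order(demo$tobacco)
plot(demo$tobacco[ord], tob_effect[ord], type = "l",
     xlab = "tobacco", ylab = "heart disease (partial effect)")

# Part 3: fitted values that still contain the other variables' effects
plot(demo$tobacco, demo$chd, col = "grey40",
     xlab = "tobacco", ylab = "chd")
lines(demo$tobacco[ord], fitted(fit_all)[ord], col = "blue")
```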
