Student nr.:
Problem 1 (30 pts)
Consider a prediction task (with one predictor X for simplicity) where you have enough data to allow for the use of k-nearest neighbours regression.
(a) You are told that the bias of the KNN estimator at a given $x_0$ is proportional to $k/n$, and that the variance is inversely proportional to $k$. Relate this to underfitting and overfitting.
(b) Instead of KNN you decide to use a local linear regression. Explain briefly (no formulas) how the bandwidth influences the bias-variance trade-off.
(c) For either of them, describe a bootstrap algorithm that would produce confidence intervals for the estimated $\hat{f}(x_0)$ at a given nominal coverage probability $1 - \alpha$.
Solution:
(a) Overfitting is characterized by a small error variance on the training set but a large error variance on test data. This happens when the estimation bias is low but the estimation variance is high (a large estimation bias implies large error variances on both training and test data). Hence a $k$ that is too small leads to overfitting (and, conversely, a $k$ that is too large leads to underfitting).
(b) The bandwidth essentially controls which observations receive weights that are nonzero (or not so close to zero as to be negligible): a large bandwidth means many observations play a role in the WLS estimation, a small bandwidth means fewer do. The bandwidth therefore plays a role analogous to that of $k$ in (a): a small bandwidth gives low bias but high variance, a large bandwidth the reverse; a minimal illustration is sketched below.
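A minimal numerical sketch of this trade-off, assuming a Gaussian kernel and a made-up data-generating process (all names and values below are illustrative, not part of the exam):

```python
import numpy as np

def local_linear_fit(x, y, x0, h):
    """Local linear estimate of f(x0): weighted least squares with Gaussian kernel weights."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # kernel weights; large h makes many weights non-negligible
    X = np.column_stack([np.ones_like(x), x - x0])   # local design: intercept + slope around x0
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)  # WLS via rescaled OLS
    return beta[0]                                   # intercept = fitted value at x0

# toy data
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=200)

for h in (0.02, 0.1, 0.5):   # small h: low bias / high variance; large h: the reverse
    print(h, local_linear_fit(x, y, 0.5, h))
```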
(c) The following two steps are repeated B times:
(a) Draw n times with replacement from the original sample, obtaining the $b$-th bootstrap sample $(Y_i, X_i)^{*b}$, $i = 1, \dots, n$.
(b) With these data, fit a KNN estimator at $x_0$, generating $\hat{f}^{*}_{b}(x_0)$.
Given the B bootstrapped values $\hat{f}^{*}_{b}(x_0)$, compute their standard deviation and use it together with standard normal quantiles to generate the confidence interval. (Alternatively: use their empirical distribution to obtain a confidence interval not based on the normal distribution.)
This works for both KNN and local linear regression. For local linear regression, one may bootstrap the coefficients of the regression local to $x_0$ and employ their bootstrap distribution to work out the confidence interval for $\hat{f}(x_0)$. (A sketch of the bootstrap for the KNN case is given below.)
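A minimal sketch of this bootstrap for the KNN case (function names and default values are my own choices, not prescribed by the exam):

```python
import numpy as np
from scipy.stats import norm

def knn_fit(x, y, x0, k):
    """KNN regression estimate of f(x0): average of y over the k nearest x's."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

def bootstrap_ci(x, y, x0, k, B=1000, alpha=0.05, seed=0):
    """Bootstrap confidence intervals for f(x0) at nominal coverage 1 - alpha."""
    rng = np.random.default_rng(seed)
    n = len(x)
    boot = np.empty(B)
    for b in range(B):
        i = rng.integers(0, n, n)                 # step (a): draw n times with replacement
        boot[b] = knn_fit(x[i], y[i], x0, k)      # step (b): refit the KNN estimator at x0
    est = knn_fit(x, y, x0, k)
    se = boot.std(ddof=1)
    z = norm.ppf(1 - alpha / 2)
    normal_ci = (est - z * se, est + z * se)                               # normal-quantile interval
    percentile_ci = tuple(np.quantile(boot, [alpha / 2, 1 - alpha / 2]))   # empirical-distribution interval
    return normal_ci, percentile_ci
```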
Problem 2 (25 pts)
Consider KNN again. You are inclined to use a small $k$ (i.e. small bias, large variance) but have concerns about the statistical properties of the procedure. For this reason you take as $\hat{f}(x_0)$ the weighted sum of the average dependent variable $\bar{y}$ and a KNN estimator with small $k$.
(a) Argue that this is a shrinkage-type estimator (one where the shrinkage target is not zero) and discuss its properties as a function of the weights.
(b) Can overfitting appear in this case, and how?
(c) How would you use cross-validation to prevent the method from overfitting? Give a precise description of the algorithm!
Solution:
(a) Denoting the KNN estimator at $x_0$ by $\hat{f}_k(x_0)$, the weighted average is given by
\[
\hat{f}_w(x_0) = (1 - \lambda)\,\hat{f}_k(x_0) + \lambda\,\bar{y},
\]
where $\lambda \in [0, 1]$. Letting $\lambda \to 0$, $\hat{f}_w(x_0) \to \hat{f}_k(x_0)$, whereas $\lambda \to 1$ implies $\hat{f}_w(x_0) \to \bar{y}$. The average dependent variable $\bar{y}$ has a very small variance compared to the KNN estimator, so it acts as a variance stabilizer as $\lambda$ increases, although at the cost of bias, since $\bar{y}$ is only a good estimator of the regression curve $f(x_0)$ when $f$ is constant. This makes $\hat{f}_w(x_0)$ a shrinkage estimator with shrinkage intensity $\lambda$. (Should you not like the fact that $\lambda \in [0, 1]$, you can reparameterize, e.g. as $\lambda = 1 - \exp(-\Lambda)$ with $\Lambda \in [0, \infty)$.)
(b) Yes, if $\lambda$ is chosen too small: the estimator then stays close to the high-variance small-$k$ KNN fit.
(c) The main idea is that it is $\lambda$ (and not $k$!) that has to be chosen by minimizing the CV criterion; see the textbook for the details.
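A minimal sketch of such a cross-validation over λ (self-contained, with its own plain KNN helper; the λ grid, fold count, and k are arbitrary illustrative choices):

```python
import numpy as np

def knn_fit(x, y, x0, k):
    """KNN regression estimate of f(x0)."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

def shrunk_fit(x, y, x0, k, lam):
    """f_w(x0) = (1 - lambda) * f_k(x0) + lambda * ybar."""
    return (1.0 - lam) * knn_fit(x, y, x0, k) + lam * y.mean()

def cv_choose_lambda(x, y, k=3, lambdas=np.linspace(0.0, 1.0, 21), n_folds=10, seed=0):
    """Pick lambda (k stays fixed and small) by minimizing the K-fold CV criterion."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(x)) % n_folds          # random fold assignment
    cv_err = np.zeros(len(lambdas))
    for j, lam in enumerate(lambdas):
        for f in range(n_folds):
            tr, te = fold != f, fold == f
            pred = np.array([shrunk_fit(x[tr], y[tr], x0, k, lam) for x0 in x[te]])
            cv_err[j] += np.sum((y[te] - pred) ** 2)  # out-of-fold squared error
    cv_err /= len(x)
    return lambdas[int(np.argmin(cv_err))]
```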
Problem 3 (20 pts)
Consider now a classification problem (say good customers/bad customers) given characteristics of the potential customers, such as eye color.
(a) Describe step-by-step how to grow and prune a classification tree for this task.
(b) You are told to boost the classification tree. Since this has not been treated in class, you fit a boosted regression tree instead (the classes are denoted by 0 and 1, as in logistic regression).
(a) Explain the role of the tuning parameter λ in the formula of the boosted tree.
(b) Fitting a linear regression instead of a logistic one to categorical data can generate funny predictions (e.g. negative conditional class probabilities). Does this happen with our boosted regression tree?
Solution:
(a) See textbook; make sure you mention all the important details, such as what criterion is minimized in each step etc.
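Not the textbook algorithm verbatim, but a minimal sketch of the grow-then-prune idea using scikit-learn's cost-complexity pruning (the data and all parameter values are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# made-up customer data: 3 numeric characteristics, classes good (1) / bad (0)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# 1. grow a large tree; each split minimizes the Gini impurity of the child nodes
big_tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5, random_state=0).fit(X, y)

# 2. cost-complexity pruning path: candidate subtrees indexed by the penalty alpha
alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# 3. choose alpha by cross-validation and refit the pruned tree
cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
             for a in alphas]
pruned_tree = DecisionTreeClassifier(ccp_alpha=alphas[int(np.argmax(cv_scores))], random_state=0).fit(X, y)
```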
(b) (a) This is a parameter controlling the learning rate, and it can also be interpreted as a shrinkage rate; see the textbook.
(b) Typically no: within each region, the prediction of a regression tree is an average of 0’s and 1’s, hence always in the interval [0, 1]. Since λ is typically small, the boosted prediction is therefore likely to stay within (or very close to) [0, 1] as well. Alternatively, you may argue that a regression tree generates a nonlinear fit, so that it mimics an S-shaped regression curve (as fitted by logistic regression) rather than a linear fit.
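A minimal sketch of such a boosted regression tree on 0/1 responses, assuming the usual fit-to-residuals boosting scheme with a small λ (data and settings are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(float)   # classes coded 0/1, treated as a regression target

lam, B, depth = 0.01, 500, 1     # lam = learning-rate / shrinkage parameter of the boosted tree
f_hat = np.zeros(len(y))         # boosted fit, built up additively
residual = y.copy()
for b in range(B):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, residual)  # fit a small tree to the current residuals
    step = lam * tree.predict(X)                                    # each tree contributes only a lam-scaled step
    f_hat += step
    residual -= step

print(f_hat.min(), f_hat.max())  # in this toy example the fitted values stay (close to) within [0, 1]
```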
Problem 4 (25 pts)
Let’s get back to a regression problem.
(a) Describe the basic bagging algorithm for the case of simple linear regression.
(b) How would that change for multiple linear regression?
(c) How could you transfer the idea of random forests to bagging the multiple linear regression from (b)?
Solution:
(a) See textbook.
(b) In principle nothing changes. (Model-selection steps may be included, since there is more than one regressor, but this is not required in this problem.)
(c) Simply select m < p regressors randomly for each bootstrapped sample.
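A minimal sketch combining (a)–(c): bagged multiple linear regressions with a random subset of m < p regressors per bootstrap sample (all names and values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def bagged_linear_predict(X, y, X_new, B=200, m=2, seed=0):
    """Bag multiple linear regressions; each fit uses a bootstrap sample and,
    random-forest-style, a random subset of m < p regressors."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    preds = np.zeros((B, X_new.shape[0]))
    for b in range(B):
        rows = rng.integers(0, n, n)                    # bootstrap sample of the observations
        cols = rng.choice(p, size=m, replace=False)     # random subset of the regressors
        fit = LinearRegression().fit(X[rows][:, cols], y[rows])
        preds[b] = fit.predict(X_new[:, cols])
    return preds.mean(axis=0)                           # bagged prediction: average over the B fits

# toy usage
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=300)
print(bagged_linear_predict(X, y, X[:3], m=2))
```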