Student nr.:
Problem 1 (30 pts)
You’re considering fitting a local linear regression to data generated from a population with a very nonlinear regression function.
1. (10 pts) Explain in your own words what the curse of dimensionality is and how it would affect your fitting problem and the forecasts based on it.
2. (10 pts) Assume now that you only have binary regressors in your data set (i.e. X1, X2, … Xp are all dummy variables). (How) Does the curse of dimensionality affect your fitting now?
3. (10 pts) Back to the continuous regressors case. (How) Does the curse of dimensionality affect regression trees?
Solution:
1. Data become sparse in high dimensions, so estimating a regression function locally to a given degree of precision requires more data than in lower dimensions. The forecasts depend on the estimation error as well as on the irreducible error, so forecasts built on local estimates of the regression function will be less precise in higher dimensions (the first sketch after this solution illustrates the sparsity).
2. Since each dummy takes exactly two possible values, the data cannot become sparse; or, in other words, you only need to estimate the regression function at two points per regressor, where all the data are concentrated anyway. There is no curse of dimensionality for dummies.
3. Regression trees essentially fit piecewise constant models. With enough nodes, the fit becomes local in nature, so the curse of dimensionality is back. (Do not let yourselves be fooled by the binary splitting: the distributions of the regressors remain continuous.)
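The two sketches below are illustrations added for clarity, not part of the exam solution, and assume numpy and scikit-learn are available. The first relates to item 1: it simulates how quickly a fixed local neighbourhood of a query point empties out as the number of continuous regressors grows (the sample size, neighbourhood radius and uniform design are arbitrary choices).

# Sketch 1: sparsity of local neighbourhoods in higher dimensions (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, half_width = 10_000, 0.10      # sample size and half-width of the local cube
x0 = 0.5                          # query point: the centre of the unit cube

for p in (1, 2, 5, 10, 20):
    X = rng.uniform(size=(n, p))                          # continuous regressors on [0, 1]^p
    local = np.all(np.abs(X - x0) <= half_width, axis=1)  # falls inside the cube around x0?
    print(f"p = {p:2d}: {local.mean():.4%} of the sample is 'local'")

The second sketch relates to item 3: a deep regression tree is a piecewise constant function whose leaves average only a handful of observations each, i.e. a local estimator in the sense above.

# Sketch 2: a deep regression tree as a local (piecewise constant) estimator.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, p = 2_000, 10
X = rng.uniform(size=(n, p))
y = np.sin(4 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)   # very nonlinear regression function

tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)
leaf_sizes = np.bincount(tree.apply(X))                          # observations per leaf node
print("number of leaves:", tree.get_n_leaves())
print("median observations per leaf:", np.median(leaf_sizes[leaf_sizes > 0]))
# Each prediction is the average of the few training points sharing a leaf --
# exactly the kind of local averaging that the curse of dimensionality makes data-hungry.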
Problem 2 (35 pts)
Consider a classification problem with k = 2 classes and p < n predictors.
1. (10 pts) Discuss the circumstances under which using linear discriminant analysis would be preferable to using classification trees.
2. (15 pts) Now, consider a situation where p ≈ n (while p < n still holds). Describe how shrinkage can be applied for model selection within this setup. Pick either LDA or classification trees (but not both) for your answer.
3. (10 pts) What would you do if p > n?
Solution:
1. LDA produces a linear decision boundary (a separating hyperplane). If the Bayes boundary is roughly linear, it can easily be captured by LDA. This is only seldom the case for classification trees (essentially, the Bayes boundary would have to correspond to one, or at most a few, binary splits).
2. Since binary splitting looks at one regressor at a time, there is no direct shrinkage method that may be applied. The most one could do for classification trees is to “pool” a large tree with a simple one such that, by suitably choosing the weights, the large tree is shrunken towards the simple one. For LDA, one simply replaces the required mean vector and covariance matrix estimators with shrinkage estimators (see the sketch after this solution).
3. This is a trick question: if shrinkage works for p < n, it probably works for p > n as well (e.g. for LDA, a shrunken covariance estimate remains invertible even when the sample covariance matrix is not).
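The following sketch is an illustration added after the solution, not part of the exam. It assumes scikit-learn and an arbitrary simulated Gaussian design with p close to (but below) n. It touches item 1 (with a linear Bayes boundary, LDA tends to beat a single classification tree) and item 2 (scikit-learn's LDA can shrink its covariance estimate towards a scaled identity; the shrinkage intensity is the tuning parameter one would select, e.g. by cross-validation). Note that scikit-learn shrinks only the covariance estimate; shrinking the mean vectors as well would require a custom estimator.

# Illustrative sketch for items 1 and 2 (simulated data; all numbers are arbitrary).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 120, 80                                   # p close to, but below, n
mu = np.zeros(p); mu[:5] = 0.7                   # classes differ only in the first 5 coordinates
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)
X[y == 1] += mu                                  # common covariance, shifted means -> linear Bayes boundary

# Item 1: LDA vs. a classification tree when the Bayes boundary is a hyperplane.
lda_plain = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=None)
tree = DecisionTreeClassifier(random_state=0)
print("plain LDA, 5-fold CV accuracy:", cross_val_score(lda_plain, X, y, cv=5).mean().round(3))
print("tree,      5-fold CV accuracy:", cross_val_score(tree, X, y, cv=5).mean().round(3))

# Item 2: shrink the covariance estimate and select the shrinkage intensity by CV.
for a in (0.1, 0.25, 0.5, 0.75, "auto"):         # "auto" = Ledoit-Wolf choice in scikit-learn
    lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage=a)
    print(f"shrinkage = {a}: 5-fold CV accuracy = {cross_val_score(lda, X, y, cv=5).mean():.3f}")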
Problem 3 (35 pts)
Consider a classification problem with k = 2 classes and two features X1 and X2.
1. (5 pts) Describe the idea behind the KNN classifier.
2. (10 pts) You are being told that the bias of the KNN estimator at a given x0 is proportional to k/n, and that the variance is inversely proportional to k. Relate this to underfitting and overfitting.
3. (10 pts) Now you want to use a logistic regression for the same classification problem. You need to decide which features are related to the response (only X1, only X2, or both features). Explain how k-fold cross-validation could be used for this problem (illustrative sketches follow after this problem).
4. (10 pts) What are the advantages and disadvantages of k-fold cross-validation relative to (i) the validation set approach and (ii) LOOCV? Which approach (LOOCV or k-fold cross-validation) will have a lower bias? Which approach (LOOCV or k-fold cross-validation) will be characterized by a higher variance?
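No solution is given for Problem 3; the two sketches below are merely illustrations of the techniques named in items 2 and 3, with simulated data and scikit-learn estimators chosen as assumptions, not model answers.

# Sketch for item 2: small k overfits (low bias, high variance), large k underfits.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1_000
X = rng.uniform(-2, 2, size=(n, 2))                       # two features X1, X2
p_class1 = 1 / (1 + np.exp(-3 * X[:, 0] * X[:, 1]))       # nonlinear Bayes boundary
y = rng.binomial(1, p_class1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for k in (1, 5, 25, 125, 500):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k = {k:3d}: train accuracy = {knn.score(X_tr, y_tr):.3f}, "
          f"test accuracy = {knn.score(X_te, y_te):.3f}")
# k = 1 fits the training sample (almost) perfectly but generalises worse (overfitting);
# a very large k averages over most of the sample and underfits.

# Sketch for item 3: score each candidate feature set by k-fold cross-validated
# accuracy of a logistic regression and keep the best-scoring set.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
m = 500
X1 = rng.standard_normal(m)
X2 = rng.standard_normal(m)
z = (1.5 * X1 + 0.5 * rng.standard_normal(m) > 0).astype(int)   # assumed DGP: only X1 drives the label

features = np.column_stack([X1, X2])
candidates = {"X1 only": [0], "X2 only": [1], "X1 and X2": [0, 1]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 folds
for name, cols in candidates.items():
    scores = cross_val_score(LogisticRegression(), features[:, cols], z, cv=cv)
    print(f"{name:9s}: mean CV accuracy = {scores.mean():.3f}")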