STAT318/462
STAT 318/462: Data Mining Assignment 1
Due Date: 11:59pm, 15th August, 2021
Please submit your assignment as a single pdf on Learn.
You may do the assignment by yourself or with one other person from the same cohort (300-level students cannot work with 400-level students). If you hand in a joint assignment, you will each be given the same mark. Marks will be lost for unexplained, poorly presented and incomplete answers. Whenever you are asked to do computations with data, feel free to do them any way that is convenient. If you use R (recommended), please provide your code. All figures and plots must be clearly labelled.
1. (4 marks) Describe one advantage and one disadvantage of a flexible (versus a less flexible) approach to regression. Under what conditions might a less flexible approach be preferred?
2. (6 marks) Consider a binary classification problem Y ∈ {0,1} with one predictor X. The prior probability of being in class 0 is Pr(Y = 0) = π0 = 0.69 and the density function for X in class 0 is a standard normal
$$f_0(x) = \mathrm{Normal}(0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right).$$
The density function for X in class 1 is also normal, but with μ = 1 and σ² = 0.5:
$$f_1(x) = \mathrm{Normal}(1, 0.5) = \frac{1}{\sqrt{\pi}} \exp\left(-(x - 1)^2\right).$$
(a) Plot π0f0(x) and π1f1(x) in the same figure.
(b) Find the Bayes decision boundary (Hint: π0f0(x) = π1f1(x) on the boundary).
(c) Using the Bayes classifier, classify the observation X = 3. Justify your prediction.
(d) What is the probability that an observation with X = 2 is in class 1?
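The quantities in Question 2 can be evaluated numerically. The block below is a minimal sketch in Python (the assignment recommends R, and any convenient tool is allowed). The function names f0, f1, posterior_class1 and bayes_classify are illustrative choices, not part of the course material. It codes the two densities exactly as written above, compares the weighted densities π0 f0(x) and π1 f1(x), and applies Bayes' theorem to get Pr(Y = 1 | X = x). The evaluation points are arbitrary.

```python
import math
from statistics import NormalDist

PI0 = 0.69        # prior Pr(Y = 0), given in the question
PI1 = 1.0 - PI0   # prior Pr(Y = 1), since the two priors sum to one


def f0(x):
    """Class 0 density: Normal(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def f1(x):
    """Class 1 density: Normal with mean 1 and variance 0.5."""
    return math.exp(-((x - 1.0) ** 2)) / math.sqrt(math.pi)


def posterior_class1(x):
    """Pr(Y = 1 | X = x) from Bayes' theorem."""
    w0 = PI0 * f0(x)
    w1 = PI1 * f1(x)
    return w1 / (w0 + w1)


def bayes_classify(x):
    """Assign the class with the larger weighted density."""
    return 1 if PI1 * f1(x) > PI0 * f0(x) else 0


# Sanity check: the closed forms agree with the standard library densities.
# NormalDist takes a standard deviation, so variance 0.5 becomes sqrt(0.5).
assert abs(f0(0.3) - NormalDist(0.0, 1.0).pdf(0.3)) < 1e-12
assert abs(f1(0.3) - NormalDist(1.0, math.sqrt(0.5)).pdf(0.3)) < 1e-12

# Arbitrary illustration points (not the ones asked about in the question).
for x in (-1.0, 0.0, 1.0):
    print(x, PI0 * f0(x), PI1 * f1(x), posterior_class1(x), bayes_classify(x))
```

The Bayes decision boundary lies where the two weighted densities are equal, so scanning a grid of x values for sign changes of π1 f1(x) − π0 f0(x) is one way to check a boundary found by hand.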
3. (8 marks) In this question, you will fit kNN regression models to the Auto data set to predict Y = mpg using X = horsepower. This data has been divided into training and testing sets: AutoTrain.csv and AutoTest.csv (download these sets from Learn). The kNN() R function on Learn should be used to answer this question (you need to run the kNN code before calling the function).
(a) Perform kNN regression with k = 2, 5, 10, 20, 30, 50 and 100 (learning from the training data), and compute the training and testing MSE for each value of k.
(b) Which value of k performed best? Explain.
(c) Plot the training data, testing data and the best kNN model in the same figure.
(The points() function is useful to plot the kNN model because it is discontinuous.)
(d) Describe the bias-variance trade-off for kNN regression.
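The workflow in Question 3 can be sketched as follows. This is a Python illustration only: knn_predict and mse are hypothetical helpers written for this sketch, not the kNN() R function supplied on Learn, and the data are synthetic (a noisy sine curve generated with a fixed seed), not the Auto data. It shows how a kNN regression prediction is the average response of the k nearest training points, and how training and testing MSE are computed over a range of k.

```python
import math
import random


def knn_predict(x_train, y_train, x0, k):
    """Predict at x0 as the mean response of the k nearest training points."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k


def mse(x_train, y_train, x_eval, y_eval, k):
    """Mean squared error of the kNN fit evaluated on (x_eval, y_eval)."""
    errors = [
        (y - knn_predict(x_train, y_train, x, k)) ** 2
        for x, y in zip(x_eval, y_eval)
    ]
    return sum(errors) / len(errors)


def make_data(rng, n):
    """Synthetic data for illustration only: y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(318)  # fixed seed so the sketch is reproducible
x_train, y_train = make_data(rng, 60)
x_test, y_test = make_data(rng, 40)

# Training and testing MSE for several k (values chosen for this toy set).
for k in (1, 2, 5, 10, 20, 30, 60):
    train_err = mse(x_train, y_train, x_train, y_train, k)
    test_err = mse(x_train, y_train, x_test, y_test, k)
    print(k, round(train_err, 4), round(test_err, 4))
```

With k = 1 each training point is its own nearest neighbour, so the training MSE is zero, and with k equal to the training set size every prediction is the overall mean of the training responses. These two extremes are the ends of the flexibility range that the bias-variance discussion in part (d) is about.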
University of Canterbury, Gábor Erdélyi, 2021