STAT318/462
STAT 318/462: Data Mining Assignment 2
Due Date: 11.59pm, 26th Sep, 2021
Please submit your assignment as a single PDF on Learn.
You may do the assignment by yourself or with one other person from the same cohort (300-level students cannot work with 400-level students). If you hand in a joint assignment, you will each be given the same mark. Marks will be lost for unexplained, poorly presented and incomplete answers. Whenever you are asked to do computations with data, feel free to do them any way that is convenient. If you use R (recommended), please provide your code. All figures and plots must be clearly labelled.
1. (2 marks) Suppose we collect data for a group of students that have taken STAT318 with variables $X_1$ = hours spent studying per week, $X_2$ = number of classes attended and
$$Y = \begin{cases} 1 & \text{if the student received a GPA value} \geq 7 \text{ in STAT318,} \\ 0 & \text{otherwise.} \end{cases}$$
We fit a logistic regression model and find the estimated coefficients to be $\hat{\beta}_0 = -16$, $\hat{\beta}_1 = 1.4$ and $\hat{\beta}_2 = 0.3$.
(a) Estimate the probability of a student getting a GPA value ≥ 7 in STAT318 if they study for 5 hours per week and attend all 36 classes.
(b) If a student attends 18 classes, how many hours do they need to study per week to have a 50% chance of getting a GPA value ≥ 7 in STAT318?
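As a reminder of how a fitted logistic regression model is evaluated, the estimated probability is $1 / (1 + e^{-z})$ with $z = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2$, and a probability of 0.5 corresponds to $z = 0$. The R sketch below illustrates this with the coefficients from question 1. The helper names `logit_prob` and `hours_for_prob` are hypothetical and are not part of the assignment.

```r
# Estimated coefficients as stated in question 1.
b0 <- -16; b1 <- 1.4; b2 <- 0.3

# Hypothetical helper: estimated Pr(Y = 1) for given study hours and attendance.
logit_prob <- function(hours, classes) {
  plogis(b0 + b1 * hours + b2 * classes)  # plogis(z) = 1 / (1 + exp(-z))
}

# Hypothetical helper: hours needed to reach probability p at a given attendance,
# found by solving b0 + b1 * hours + b2 * classes = qlogis(p) for hours.
hours_for_prob <- function(p, classes) {
  (qlogis(p) - b0 - b2 * classes) / b1
}

p_a <- logit_prob(hours = 5, classes = 36)
h_b <- hours_for_prob(p = 0.5, classes = 18)
print(c(p_a = p_a, h_b = h_b))
```

Working on the log-odds scale with `qlogis` keeps the second helper valid for any target probability, not only 0.5.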
2. (10 marks) In this question, you will fit a logistic regression model to predict the probability of a banknote being forged using the Banknote data set. This data has been divided into training and testing sets: BankTrain.csv and BankTest.csv (download these sets from Learn). The response variable is y (the fifth column), where y = 1 denotes a forged banknote and y = 0 denotes a genuine banknote. Although this data set has four predictors, you will be using x1 and x3 to fit your model.[1]
(a) Perform multiple logistic regression using the training data. Comment on the model obtained.
(b) Suppose we classify observations using
$$f(x) = \begin{cases} \text{forged banknote} & \text{if } \Pr(Y = 1 \mid X = x) > \theta, \\ \text{genuine banknote} & \text{otherwise.} \end{cases}$$
i. Plot the training data (using a different symbol for each class) and the decision boundary for θ = 0.5 on the same figure.
ii. Using θ = 0.5, compute the confusion matrix for the testing set and comment on your output.
[1] These predictors are features extracted from an image of a banknote: x1 is the variance of a Wavelet Transformed image and x3 is the kurtosis of a Wavelet Transformed image.
University of Canterbury, Gábor Erdélyi, 2021
iii. Compute confusion matrices for the testing set using θ = 0.3 and θ = 0.6. Comment on your output. Describe a situation when the θ = 0.3 model may be the preferred model.
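The workflow in question 2 (fit, threshold, tabulate, draw the boundary) could be sketched in R roughly as follows. This is only an illustration under stated assumptions: the column names `x1`, `x3` and `y` are assumed to match the CSV headers, the helpers `fit_logit`, `confusion_at` and `boundary_line` are hypothetical names, and the data used here are simulated stand-ins so that the block runs without the Learn files.

```r
# Hypothetical helper: logistic regression of y on x1 and x3.
fit_logit <- function(train) glm(y ~ x1 + x3, data = train, family = binomial)

# Hypothetical helper: confusion matrix on a test set at threshold theta.
confusion_at <- function(fit, test, theta = 0.5) {
  prob <- predict(fit, newdata = test, type = "response")  # Pr(Y = 1 | X = x)
  pred <- factor(as.integer(prob > theta), levels = c(0, 1))
  table(predicted = pred, actual = factor(test$y, levels = c(0, 1)))
}

# Hypothetical helper: the boundary Pr(Y = 1 | x) = theta written as
# x3 = intercept + slope * x1, from b0 + b1 * x1 + b3 * x3 = qlogis(theta).
boundary_line <- function(fit, theta = 0.5) {
  b <- unname(coef(fit))  # order: intercept, x1, x3
  c(intercept = (qlogis(theta) - b[1]) / b[3], slope = -b[2] / b[3])
}

# With the assignment files one would instead use (assumed column names):
# train <- read.csv("BankTrain.csv"); test <- read.csv("BankTest.csv")

# Simulated stand-in data, NOT the banknote data.
set.seed(1)
sim <- function(n) {
  y <- rbinom(n, 1, 0.5)
  data.frame(x1 = rnorm(n, mean = 1 - 2 * y), x3 = rnorm(n, mean = y), y = y)
}
train <- sim(200); test <- sim(100)

fit   <- fit_logit(train)
cm_03 <- confusion_at(fit, test, theta = 0.3)
cm_06 <- confusion_at(fit, test, theta = 0.6)
bl    <- boundary_line(fit, theta = 0.5)
print(cm_03); print(cm_06); print(bl)

# Plot with a different symbol per class and the theta = 0.5 boundary:
# plot(train$x1, train$x3, pch = ifelse(train$y == 1, 4, 1), xlab = "x1", ylab = "x3")
# abline(a = bl[["intercept"]], b = bl[["slope"]])
```

Lowering θ can only move observations from the "genuine" prediction to the "forged" prediction, so the predicted-forged row total at θ = 0.3 is never smaller than at θ = 0.6.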
3. (6 marks) In this question, you will fit linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) models to the training set from question 2 of this assignment.
(a) Fit an LDA model to predict the probability of a banknote being forged using the predictors x1 and x3. Compute the confusion matrix for the testing set from question 2.
(b) Repeat part (a) using QDA.
(c) Comment on your results from parts (a) and (b). Compare these methods with the logistic regression model (using θ = 0.5) from question 2. Which method would you recommend for this problem and why?
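A minimal R sketch of fitting LDA and QDA and tabulating test-set predictions is given below, using `lda()` and `qda()` from the MASS package. As assumptions, the column names `x1`, `x3` and `y` are taken to match the CSV headers, the helpers `confusion` and `error_rate` are hypothetical names, and simulated stand-in data are used so that the block runs on its own.

```r
library(MASS)  # provides lda() and qda()

# With the assignment files one would instead use (assumed column names):
# train <- read.csv("BankTrain.csv"); test <- read.csv("BankTest.csv")

# Simulated stand-in data, NOT the banknote data.
set.seed(1)
sim <- function(n) {
  y <- rbinom(n, 1, 0.5)
  data.frame(x1 = rnorm(n, mean = 1 - 2 * y), x3 = rnorm(n, mean = y), y = y)
}
train <- sim(200); test <- sim(100)

lda_fit <- lda(factor(y) ~ x1 + x3, data = train)
qda_fit <- qda(factor(y) ~ x1 + x3, data = train)

# Hypothetical helper: confusion matrix from the predicted class labels,
# i.e. the class with the largest posterior probability.
confusion <- function(fit, test) {
  pred <- predict(fit, newdata = test)$class
  table(predicted = pred, actual = factor(test$y, levels = c(0, 1)))
}

# Hypothetical helper: overall misclassification rate from a confusion matrix.
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

cm_lda <- confusion(lda_fit, test)
cm_qda <- confusion(qda_fit, test)
print(cm_lda); print(cm_qda)
print(c(lda = error_rate(cm_lda), qda = error_rate(cm_qda)))
```

For two classes, the default class prediction corresponds to a posterior threshold of 0.5, which makes these tables directly comparable with the θ = 0.5 logistic regression classifier.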
4. (2 marks) Consider a binary classification problem Y ∈ {0, 1} with one predictor X. Assume that X is normally distributed (X ∼ N(μ, σ²)) in each class, with X ∼ N(0, 4) in class 0 and X ∼ N(2, 4) in class 1. Calculate the Bayes error rate when the prior probability of being in class 0 is π0 = 0.4. (The Bayes error rate is the test error rate of the Bayes classifier.)
Recall: The normal density function takes the form:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right).$$
For normal densities, probabilities can be calculated in R as follows:
P(X ≤ x) = pnorm(x, μ, σ)
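Note that the third argument of `pnorm` is the standard deviation σ, not the variance σ². As a hedged illustration of how these pieces could be combined, the sketch below computes the Bayes error rate for two normal classes that share a common σ. The helper name `bayes_error` is hypothetical, and the formula assumes μ0 < μ1, so that class 1 is predicted to the right of the point where π0 f0(x) = π1 f1(x).

```r
# Hypothetical helper: Bayes error rate for two normal classes with a common
# standard deviation sigma and mu0 < mu1; pi0 is the prior probability of class 0.
bayes_error <- function(mu0, mu1, sigma, pi0) {
  pi1 <- 1 - pi0
  # Solve pi0 * f0(x) = pi1 * f1(x) for x; class 1 is predicted when x > x_star.
  x_star <- (mu0 + mu1) / 2 + sigma^2 * log(pi0 / pi1) / (mu1 - mu0)
  err <- pi0 * (1 - pnorm(x_star, mu0, sigma)) +  # class 0 observations predicted as 1
         pi1 * pnorm(x_star, mu1, sigma)          # class 1 observations predicted as 0
  c(boundary = x_star, error = err)
}

# N(0, 4) and N(2, 4) have variance 4, so pnorm() is called with sigma = 2.
res <- bayes_error(mu0 = 0, mu1 = 2, sigma = 2, pi0 = 0.4)
print(res)

# Independent numerical check: the Bayes error is the integral of
# min(pi0 * f0(x), pi1 * f1(x)) over the real line.
check <- integrate(function(x) pmin(0.4 * dnorm(x, 0, 2), 0.6 * dnorm(x, 2, 2)),
                   -Inf, Inf)$value
print(check)
```

The closed-form boundary holds only because both classes have the same variance; with unequal variances the equation is quadratic in x and can have two boundary points.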