Problem 1
Figure 1
Set up:
Consider the data set NormalMix.csv and its histogram displayed in Figure 1. The
histogram shows a clear bimodal shape in the distribution of X. One way to model a
distribution of this type is to use a mixture of two probability distributions. Here we assume
that the observations in NormalMix.csv are realizations of a random variable X governed by
the probability density f(x), defined by
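$$
\begin{aligned}
f(x) &= f(x; \mu_1, \sigma_1, \mu_2, \sigma_2, \delta) \\
     &= \delta f_1(x; \mu_1, \sigma_1) + (1 - \delta) f_2(x; \mu_2, \sigma_2) \\
     &= \delta \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{1}{2\sigma_1^2}(x - \mu_1)^2\right)
      + (1 - \delta) \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2\sigma_2^2}(x - \mu_2)^2\right),
\end{aligned}
$$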
where −∞ < x < ∞ and the parameter space is defined by −∞ < µ1, µ2 < ∞, σ1, σ2 > 0,
and 0 ≤ δ ≤ 1. The mixture parameter δ governs how much mass is placed on the first
distribution f1(x;µ1, σ1), and its complement 1 − δ governs how much mass is placed on
the second distribution f2(x;µ2, σ2).
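The mixture density above can be sketched in R using the built-in dnorm(). The function name dmix and the parameter values used in the checks are illustrative assumptions, not part of the problem statement.

```r
# Two-component normal mixture density, following the definition of f(x) above.
# The name dmix and the numeric values below are hypothetical choices.
dmix <- function(x, mu1, sigma1, mu2, sigma2, delta) {
  delta * dnorm(x, mean = mu1, sd = sigma1) +
    (1 - delta) * dnorm(x, mean = mu2, sd = sigma2)
}

# With delta = 1 the mixture collapses to the first component.
all.equal(dmix(3, 4, 2, 8, 2, 1), dnorm(3, mean = 4, sd = 2))

# A valid density integrates to (approximately) 1 over the real line.
integrate(dmix, -Inf, Inf,
          mu1 = 4, sigma1 = 2, mu2 = 8, sigma2 = 2, delta = 0.5)$value
```

Because δ and 1 − δ are nonnegative and sum to one, the weighted sum of two normal densities is itself a density, which is what the numerical integral checks.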
In our setting, we have n = 10,000 sampled observations, but we do not know how many
males and females were sampled. Assume that the distribution of males is governed by
$$
f_1(x; \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{1}{2\sigma_1^2}(x - \mu_1)^2\right), \quad -\infty < x < \infty,
$$
and the distribution of females is governed by
$$
f_2(x; \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2\sigma_2^2}(x - \mu_2)^2\right), \quad -\infty < x < \infty.
$$
Our goal is to use a maximum likelihood approach to estimate the parameters µ1, µ2, σ1, σ2, δ.
Using these estimated parameters, we can answer questions about the individual populations
and what percentage of males and females contribute to the distribution of X.
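As a sketch of what the maximum likelihood approach optimizes, and assuming the n observations are independent draws from f (an assumption the problem statement does not spell out), the log-likelihood is a sum of logs of the mixture density:

```latex
\ell(\mu_1, \mu_2, \sigma_1, \sigma_2, \delta;\, x_1, \ldots, x_n)
  = \sum_{i=1}^{n} \log\Bigl[\, \delta f_1(x_i; \mu_1, \sigma_1)
      + (1 - \delta) f_2(x_i; \mu_2, \sigma_2) \Bigr]
```

The logarithm cannot be pushed inside the sum of the two components, which is why the expression has no closed-form maximizer and is typically optimized numerically.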
Perform the following tasks
i. Set up the log-likelihood function ℓ(µ1, µ2, σ1, σ2, δ; x1, x2, ..., xn). Note that this
function will not simplify very much.
ii. Run the following R code.
NormalMix <- read.csv("NormalMix.csv")[, -1]
hist(NormalMix, breaks = 20, xlab = "x", probability = TRUE)
iii. Define the negative log-likelihood function in R using the data set NormalMix.csv.
Evaluate the negative log-likelihood function at the point µ1 = 4, σ1 = 2, µ2 = 8,
σ2 = 2, δ = 0.5.
iv. Compute the maximum likelihood estimates in R using the nlm() function.
v. Approximately what percentage of males and females contribute to the distribution of
X based on our data set?
Hint: In Homework 2 when computing the MLE, you only had to optimize with respect
to 1 parameter. In this exam problem, you have to optimize with respect to 5 parameters.
There is another MLE example using two parameters posted on Canvas. The file is named
gammaMLE.
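The workflow in tasks iii and iv can be illustrated on simulated data. This is a minimal sketch, not a solution for NormalMix.csv: the simulated sample, its true parameter values, the function name neg_loglik, and the penalty value 1e10 are all assumptions made for illustration. Only nlm() and its documented arguments f and p are taken as given.

```r
# Simulated stand-in for the real data; the true values here are made up.
set.seed(1)
n <- 2000
is_first <- rbinom(n, 1, 0.4) == 1
x <- ifelse(is_first, rnorm(n, 4, 1), rnorm(n, 9, 1.5))

# Negative log-likelihood; theta = (mu1, mu2, sigma1, sigma2, delta).
neg_loglik <- function(theta, x) {
  # Outside the parameter space, return a large penalty (hypothetical choice)
  # so that the unconstrained optimizer backs away.
  if (any(!is.finite(theta)) || theta[3] <= 0 || theta[4] <= 0 ||
      theta[5] < 0 || theta[5] > 1) {
    return(1e10)
  }
  dens <- theta[5] * dnorm(x, theta[1], theta[3]) +
    (1 - theta[5]) * dnorm(x, theta[2], theta[4])
  val <- -sum(log(dens))
  if (is.finite(val)) val else 1e10
}

start <- c(4, 8, 2, 2, 0.5)
neg_loglik(start, x)                  # value at the starting point

fit <- nlm(neg_loglik, p = start, x = x)
fit$estimate                          # (mu1, mu2, sigma1, sigma2, delta)
```

nlm() is an unconstrained minimizer, so the penalty is one simple way to keep σ1, σ2 > 0 and 0 ≤ δ ≤ 1. Reparameterizing (for example, optimizing over log σ) is a common alternative. The result can depend on the starting values because the mixture likelihood may have several local optima.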
Problem 2
Consider the kNNData.csv data set posted under the midterm module. The goal of this
exercise is to apply a classification model using the basic kNN algorithm. The response variable
Class is a categorical variable with three levels: Group1, Group2, Group3. We will build a
kNN classification model from the training data kNNData.train and validate the trained
model using the test data kNNData.test.
Perform the following tasks
2i. Run the following code so that everyone in the class has the same training data set
and test data set.
kNNData <- read.csv("kNNData.csv")[, c("X1", "X2", "Class")]
set.seed(2)
test.index <- sample(1:nrow(kNNData), 100, replace = FALSE)
kNNData.test <- kNNData[test.index, ]
kNNData.train <- kNNData[-test.index, ]
2ii. Run the following code to gain a visual representation of how the response behaves for
different values of the features X1 and X2.
library(ggplot2)
ggplot(data = kNNData.train) +
  geom_point(mapping = aes(x = X1, y = X2, col = Class)) +
  labs(title = "kNN Classification")
2iii. Modify the KNN.decision() function from class so that it can be applied to the
kNNData data frame. Using K = 5, test your function at the query points (X1test =
0, X2test = 10) and (X1test = 0, X2test = 5).
2iv. Compute the prediction error for K = 5.
2v. Compute the prediction error for K = 1, 2, 3, ..., 200. Create a plot of the prediction
error versus K. Note that you could also plot the prediction error versus 1/K so that
the plot is consistent with the text, but this is not required.
2vi. Based on the plot from Part 2v, what range of values would you choose for the tuning
parameter K? Why did you pick this range?
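The kNN steps above can be sketched on simulated data. This is not the KNN.decision() function from class, whose definition is not shown here. The simulated data frame, the group centers, and the names knn_decision_sketch and knn_error are hypothetical, and the sketch assumes Euclidean distance with a simple majority vote.

```r
# Simulated stand-in for kNNData; column names follow the problem statement,
# but the group centers and sample size are made up.
set.seed(2)
n <- 300
Class <- sample(c("Group1", "Group2", "Group3"), n, replace = TRUE)
ctr <- rbind(Group1 = c(0, 10), Group2 = c(0, 5), Group3 = c(5, 7))
sim <- data.frame(X1 = unname(ctr[Class, 1]) + rnorm(n),
                  X2 = unname(ctr[Class, 2]) + rnorm(n),
                  Class = Class)
test_idx <- sample(seq_len(n), 100)
train <- sim[-test_idx, ]
test <- sim[test_idx, ]

# Majority vote among the K nearest training points (Euclidean distance).
knn_decision_sketch <- function(x1, x2, K, train) {
  d <- sqrt((train$X1 - x1)^2 + (train$X2 - x2)^2)
  nearest <- order(d)[seq_len(K)]
  votes <- table(train$Class[nearest])
  names(votes)[which.max(votes)]  # ties go to the first class in table order
}

# Proportion of test points whose predicted class differs from the truth.
knn_error <- function(K, train, test) {
  pred <- mapply(knn_decision_sketch, test$X1, test$X2,
                 MoreArgs = list(K = K, train = train))
  mean(pred != test$Class)
}

knn_decision_sketch(0, 10, K = 5, train = train)
Ks <- 1:200
errs <- sapply(Ks, knn_error, train = train, test = test)
plot(Ks, errs, type = "l", xlab = "K", ylab = "Test prediction error")
```

With 200 training rows, K = 200 uses every training point, so the prediction reduces to the most frequent training class. That is the high-bias end of the curve, while K = 1 is the high-variance end.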
Problem 3
Recall the Weather data set from the PCA applications covered during lecture. This data
set is named Daily1995.csv. Note that this is a high dimensional PCA example.
Figure 2
Perform the following tasks
3i. How many principal components do we require to explain 95% of the variance captured
by this data set? To receive full credit, validate your claim with the appropriate plot.
3ii. Construct the yearly weather for Flagstaff using the minimum number of PCs that
explain 95% of the data’s variation. Plot this constructed case with the actual data
for Flagstaff. Make sure to label your plots appropriately.
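The two PCA tasks can be sketched on simulated data. The layout of Daily1995.csv is not shown here, so this sketch assumes a matrix with one row per station and one column per day. The simulated values and all variable names are hypothetical; prcomp() and its center, sdev, rotation, and x components are standard R.

```r
# Simulated stand-in: 30 stations (rows) by 365 daily values (columns).
set.seed(3)
n_station <- 30
n_day <- 365
day <- seq_len(n_day)
seasonal <- cos(2 * pi * (day - 200) / 365)
level <- rnorm(n_station, 60, 10)   # station-specific mean level
amp <- rnorm(n_station, 15, 5)      # station-specific seasonal amplitude
X <- outer(level, rep(1, n_day)) + outer(amp, seasonal) +
  matrix(rnorm(n_station * n_day, sd = 3), n_station, n_day)

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Cumulative proportion of variance and the smallest k reaching 95%.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.95)[1]
plot(cum_var, type = "b", xlab = "Number of PCs",
     ylab = "Cumulative proportion of variance")
abline(h = 0.95, lty = 2)

# Reconstruct one row from its first m PC scores:
# x_hat = column means + scores %*% t(loadings).
reconstruct <- function(pca, i, m) {
  as.numeric(pca$center +
    pca$x[i, 1:m, drop = FALSE] %*% t(pca$rotation[, 1:m, drop = FALSE]))
}

i <- 1  # hypothetical row index standing in for the city of interest
x_hat <- reconstruct(pca, i, k)
plot(day, X[i, ], type = "l", xlab = "Day of year", ylab = "Value",
     main = "Actual vs. reconstruction from k PCs")
lines(day, x_hat, col = "red")
legend("topleft", legend = c("Actual", "Reconstructed"),
       col = c("black", "red"), lty = 1)
```

Using all available components reproduces the original row exactly (up to rounding), which is a quick sanity check on the reconstruction formula before truncating to k components.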