Problem 1

Figure 1

Set up:

Consider the data set NormalMix.csv and its histogram displayed in Figure 1. The histogram above shows a clear bimodal shape in the distribution of X. One way to model a distribution of this type is to use a mixture of two probability distributions. Here we assume that the observations in NormalMix.csv are realizations of a random variable governed by the probability density f(x), defined by

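$$
\begin{aligned}
f(x) &= f(x; \mu_1, \sigma_1, \mu_2, \sigma_2, \delta) \\
     &= \delta f_1(x; \mu_1, \sigma_1) + (1 - \delta) f_2(x; \mu_2, \sigma_2) \\
     &= \delta \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left\{-\frac{1}{2\sigma_1^2}(x - \mu_1)^2\right\} + (1 - \delta) \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left\{-\frac{1}{2\sigma_2^2}(x - \mu_2)^2\right\},
\end{aligned}
$$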

where −∞ < x < ∞ and the parameter space is defined by −∞ < µ1, µ2 < ∞, σ1, σ2 > 0, and 0 ≤ δ ≤ 1. The mixture parameter δ governs how much mass is placed on the first distribution f1(x;µ1, σ1), and its complement 1 − δ governs how much mass is placed on the second distribution f2(x;µ2, σ2).
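As a rough illustration of the mixture density defined above, the following R sketch evaluates f(x) with the built-in dnorm(). The function name dmix and the plotting grid are hypothetical choices made for this example, and the parameter values are illustrative only (they are not estimates from NormalMix.csv).

```r
# Hypothetical helper: density of a two-component normal mixture.
# delta is the weight on the first component, 1 - delta on the second.
dmix <- function(x, mu1, sigma1, mu2, sigma2, delta) {
  delta * dnorm(x, mean = mu1, sd = sigma1) +
    (1 - delta) * dnorm(x, mean = mu2, sd = sigma2)
}

# Illustrative parameter values only (assumed for this sketch)
x_grid <- seq(-5, 20, length.out = 500)
dens <- dmix(x_grid, mu1 = 4, sigma1 = 2, mu2 = 8, sigma2 = 2, delta = 0.5)
plot(x_grid, dens, type = "l", xlab = "x", ylab = "f(x)",
     main = "Two-component normal mixture (illustrative)")

# Sanity check: a valid density should integrate to approximately 1
total_mass <- integrate(dmix, -Inf, Inf,
                        mu1 = 4, sigma1 = 2, mu2 = 8, sigma2 = 2,
                        delta = 0.5)$value
print(total_mass)
```

Setting delta to 1 or 0 reduces the mixture to a single normal component, which is a quick way to check that the weights are attached to the intended distributions.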

In our setting, we have n = 10,000 sampled observations, but we do not know how many males and females were sampled. Assume that the distribution of males is governed by

$$
f_1(x; \mu_1, \sigma_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left\{-\frac{1}{2\sigma_1^2}(x - \mu_1)^2\right\}, \quad -\infty < x < \infty,
$$

and the distribution of females is governed by

$$
f_2(x; \mu_2, \sigma_2) = \frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left\{-\frac{1}{2\sigma_2^2}(x - \mu_2)^2\right\}, \quad -\infty < x < \infty.
$$

Our goal is to use a maximum likelihood approach to estimate the parameters µ1, µ2, σ1, σ2, δ. Using these estimated parameters, we can answer questions about the individual populations and what percentage of males and females contribute to the distribution of X.

Perform the following tasks:

i. Set up the log-likelihood function ℓ(µ1, µ2, σ1, σ2, δ; x1, x2, . . . , xn). Note that this function will not simplify very much.

ii. Run the following R code.

    NormalMix <- read.csv("NormalMix.csv")[,-1]
    hist(NormalMix,breaks=20,xlab="x",probability = T)

iii. Define the negative log-likelihood function in R using the data set NormalMix.csv. Evaluate the negative log-likelihood function at the point µ1 = 4, σ1 = 2, µ2 = 8, σ2 = 2, δ = .5.

iv. Compute the maximum likelihood estimates in R using the nlm() function.

v. Approximately what percentage of males and females contribute to the distribution of X based on our data set?

Hint: In Homework 2, when computing the MLE, you only had to optimize with respect to 1 parameter. In this exam problem, you have to optimize with respect to 5 parameters. There is another MLE example using two parameters posted on Canvas. The file is named gammaMLE.

Problem 2

Consider the kNNData.csv data set posted under the midterm module. The goal of this exercise is to apply a classification model using the basic kNN algorithm. The response variable Class is a categorical variable with three levels: Group1, Group2, Group3. We will build a kNN classification model from the training data kNNData.train and validate the trained model using the test data kNNData.test.

Perform the following tasks:

2i. Run the following code so that everyone in the class has the same training data set and test data set.
    kNNData <- read.csv("kNNData.csv")[,c("X1","X2","Class")]
    set.seed(2)
    test.index <- sample(1:nrow(kNNData),100,replace=F)
    kNNData.test <- kNNData[test.index, ]
    kNNData.train <- kNNData[-test.index, ]

2ii. Run the following code to gain a visual representation of how the response behaves for different values of the features X1 and X2.

    library(ggplot2)
    ggplot(data=kNNData.train)+
      geom_point(mapping=aes(x=X1,y=X2,col=Class))+
      labs(title="kNN Classification")

2iii. Modify the KNN.decision() function from class so that it can be applied to the kNNData data frame. Using K = 5, test your function at the query points (X1test = 0, X2test = 10) and (X1test = 0, X2test = 5).

2iv. Compute the prediction error for K = 5.

2v. Compute the prediction error for K = 1, 2, 3, . . . , 200. Create a plot of the prediction error versus K. Note that you could also plot the prediction error versus 1/K so that the plot is consistent with the text, but this is not required.

2vi. Based on the plot from Part 2v, what range of values would you choose for the tuning parameter K? Why did you pick this range?

Problem 3

Recall the Weather data set from the PCA applications covered during lecture. This data set is named Daily1995.csv. Note that this is a high-dimensional PCA example.

Figure 2

Perform the following tasks:

3i. How many principal components do we require to explain 95% of the variance captured by this data set? To receive full credit, validate your claim with the appropriate plot.

3ii. Construct the yearly weather for Flagstaff using the minimum number of PCs that explain 95% of the data's variation. Plot this constructed case together with the actual data for Flagstaff. Make sure to label your plots appropriately.
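The general mechanics behind choosing a number of PCs by cumulative variance and then reconstructing a single case can be sketched on simulated data. This is a minimal sketch under stated assumptions, not a solution: the matrix X below is simulated and merely stands in for Daily1995.csv, whose layout is not shown here; the arrangement (rows are cities, columns are days) and all object names are hypothetical.

```r
# Simulated stand-in for a high-dimensional weather matrix (assumption:
# rows = cities, columns = days; the real file layout may differ).
set.seed(1)
n_cities <- 30
n_days <- 365
days <- seq_len(n_days)
X <- t(sapply(seq_len(n_cities), function(j) {
  50 + (10 + j) * sin(2 * pi * (days - 100 - j) / 365) + rnorm(n_days, sd = 3)
}))

# PCA on the centered data
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Proportion and cumulative proportion of variance explained
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of PCs whose cumulative proportion reaches 95%
k <- which(cum_var >= 0.95)[1]

plot(cum_var, type = "b", xlab = "Number of principal components",
     ylab = "Cumulative proportion of variance")
abline(h = 0.95, lty = 2)

# Rank-k reconstruction of one case (row i): center + loadings %*% scores
i <- 1
recon <- as.numeric(pca$center +
                      pca$rotation[, 1:k, drop = FALSE] %*% pca$x[i, 1:k])

# Compare the reconstructed case with the simulated data for that row
plot(days, X[i, ], type = "l", col = "grey50",
     xlab = "Day of year", ylab = "Simulated value",
     main = "Simulated case vs. rank-k reconstruction")
lines(days, recon, col = "red", lwd = 2)
legend("topright", legend = c("Simulated data", "Reconstruction"),
       col = c("grey50", "red"), lwd = c(1, 2))
```

Using all available components in place of 1:k reproduces the original row up to rounding error, which is a convenient check that the centering and the loadings are being combined correctly.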