### Conceptual
1. The goal of this exercise is to gain a deeper understanding of the trade-off between sensitivity and specificity and of the ROC curve. Assume a binary classification problem with a single quantitative predictor $X$. Assume that you know the true distribution of $X$ for each of the two classes (in practice these distributions would not be known, but they can be estimated, as is done in LDA). Specifically, $\,X \sim N(\mu=-1, \sigma=1)$ in the negative class $\,Y=0$ and $\,X \sim N(\mu=2, \sigma=1)$ in the positive class $\,Y=1$.
Assume also that the two classes are equally likely.
a. Derive the posterior probabilities $\, P(Y=0 \;|\; X=x)$ and $\, P(Y=1 \;|\; X=x)$.
b. Derive the Bayes rule that classifies an observation with $\,X=x$ to $\,Y=1$ if $\, P(Y=1 \;|\; X=x) > P(Y=0 \;|\; X=x)$
c. Show that there is a cutoff $\,t_{Bayes}$ such that the Bayes rule can be expressed as:
\[
Y = \begin{cases}
0 & \text{if} \quad x \le t_{Bayes} \\
1 & \text{if} \quad x > t_{Bayes}
\end{cases}
\]
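As a starting point for parts a–c (the remaining algebra, which yields the explicit value of $\,t_{Bayes}$, is part of the exercise), note that with equal priors $\,P(Y=0) = P(Y=1) = 1/2$ and class-conditional densities $f_0$ and $f_1$, Bayes' theorem gives
\[
P(Y=1 \;|\; X=x) = \frac{f_1(x)}{f_0(x) + f_1(x)},
\qquad
f_k(x) = \frac{1}{\sqrt{2\pi}}\, e^{-(x-\mu_k)^2/2},
\]
with $\mu_0 = -1$, $\mu_1 = 2$, and $P(Y=0 \;|\; X=x) = 1 - P(Y=1 \;|\; X=x)$. The condition $\,P(Y=1 \;|\; X=x) > P(Y=0 \;|\; X=x)$ is therefore equivalent to $f_1(x) > f_0(x)$; taking logarithms turns this into a linear inequality in $x$, which gives the cutoff form in part c.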
d. Compute the specificity, sensitivity, false positive rate, false negative rate, and overall misclassification rate for the Bayes rule. (Hint: recall that you can use the R function `pnorm` to compute the probability that a normal variable exceeds or falls below a given threshold.)
e. Consider now the more general decision rule (below) with an arbitrary cutoff $t$ (not necessarily $\,t_{Bayes}$). Compute the specificity, sensitivity, false positive rate, false negative rate, and overall misclassification rate for a grid of 20 equally spaced values of the cutoff $t$ ranging from $t=-4$ to $t=6$. (Hint: use `seq` to generate the grid and a `for` loop to iterate over the grid values.)
\[
Y = \begin{cases}
0 & \text{if} \quad x \le t \\
1 & \text{if} \quad x > t
\end{cases}
\]
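A minimal R sketch for parts d and e, assuming the class-conditional distributions given above; `t_bayes` below is a placeholder and should be replaced by the cutoff derived in part c.

```{r}
## Class-conditional distributions: X | Y=0 ~ N(-1, 1), X | Y=1 ~ N(2, 1); equal priors.
rates_at_cutoff = function(t) {
  spec = pnorm(t, mean = -1, sd = 1)     # P(X <= t | Y = 0): true negative rate
  sens = 1 - pnorm(t, mean = 2, sd = 1)  # P(X >  t | Y = 1): true positive rate
  fpr  = 1 - spec                        # false positive rate
  fnr  = 1 - sens                        # false negative rate
  misc = 0.5 * fpr + 0.5 * fnr           # overall misclassification rate (equal priors)
  c(specificity = spec, sensitivity = sens, FPR = fpr, FNR = fnr, misclass = misc)
}

# Part d: error rates of the Bayes rule
t_bayes = 0  # placeholder: replace with the cutoff derived in part c
rates_at_cutoff(t_bayes)

# Part e: the same quantities over a grid of 20 cutoffs between -4 and 6
t_grid = seq(-4, 6, length.out = 20)
results = matrix(NA, nrow = length(t_grid), ncol = 5,
                 dimnames = list(NULL, c("specificity", "sensitivity", "FPR", "FNR", "misclass")))
for (i in seq_along(t_grid)) {
  results[i, ] = rates_at_cutoff(t_grid[i])
}
results
```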
f. Plot in the same graph the sensitivity, specificity and misclassification rate as a function of $t$. Interpret the plot and comment on the cutoff where the minimum misclassification rate is attained.
g. Plot the ROC curve.
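A possible plotting sketch for parts f and g, building on the `t_grid` and `results` objects from the sketch above (object names and color choices are illustrative):

```{r}
# Part f: sensitivity, specificity, and misclassification rate as functions of the cutoff t
plot(t_grid, results[, "sensitivity"], type = "l", col = "blue",
     ylim = c(0, 1), xlab = "cutoff t", ylab = "rate")
lines(t_grid, results[, "specificity"], col = "red")
lines(t_grid, results[, "misclass"], lty = 2)
legend("right", legend = c("sensitivity", "specificity", "misclassification"),
       col = c("blue", "red", "black"), lty = c(1, 1, 2))
abline(v = t_grid[which.min(results[, "misclass"])], lty = 3)  # grid cutoff with smallest error

# Part g: ROC curve (false positive rate vs. true positive rate as t varies)
plot(results[, "FPR"], results[, "sensitivity"], type = "l",
     xlab = "false positive rate (1 - specificity)", ylab = "sensitivity (true positive rate)")
abline(0, 1, lty = 3)  # reference line: a classifier with no discriminating ability
```

With equal priors, the misclassification-rate curve should attain its minimum near $\,t_{Bayes}$ (up to the resolution of the grid), which connects to the interpretation asked for in part f.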
```{r}
breast = read.csv("/Users/jp/Google Drive/Teaching/Machine Learning/Datasets/Breast Cancer/breast-cancer.data.txt",
                  header = TRUE, stringsAsFactors = TRUE)  # read character columns as factors (the default only in R < 4.0)
breast = breast[complete.cases(breast), ] # keeps complete cases only
levels(breast$recurrence) = c("no-recurrence", "recurrence") # renames levels using shorter names
breast$age_quant = as.integer(breast$age) # creates a quantitative age variable
breast$tumor_size_quant = factor(breast$tumor_size, levels(breast$tumor_size)[c(1, 10, 2, 3, 4, 5, 6, 7, 8, 9, 11)]) # reorders tumor size levels into increasing order
breast$tumor_size_quant = as.integer(breast$tumor_size_quant) # creates a quantitative tumor size variable
table(breast$tumor_size_quant, breast$tumor_size) # check that the recoding worked as expected
breast$inv_nodes_quant = factor(breast$inv_nodes, levels(breast$inv_nodes)[c(1, 5, 6, 7, 2, 3, 4)]) # reorders invasive nodes levels into increasing order
breast$inv_nodes_quant = as.integer(breast$inv_nodes_quant) # creates a quantitative invasive nodes variable
table(breast$inv_nodes_quant, breast$inv_nodes) # check that the recoding worked as expected
```