For the second midterm you will fill in missing pieces of the code in the following blocks. Suppose we observe $N$ data points, $\{X_i\}_{i=1}^{N}$, in three dimensions, so that $X_i = (x_{i1}, x_{i2}, x_{i3})$. Recall that the isolation forest separates points by repeating the steps in the following algorithm.
1. Randomly choose one of the variables in the dataset, say $x_1$;
2. Randomly choose a number $c$ in the interval $(\min_i(x_{i1}), \max_i(x_{i1}))$;
3. Divide the data into two groups depending on whether $x_{i1} > c$ or $x_{i1} \leq c$.
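The three splitting steps above can be sketched in R on a small synthetic dataset. This is purely illustrative; the object names `toy`, `mydim`, and `c_cut` are mine and not part of the midterm code:

```r
## Illustrative sketch: one random cut of the isolation-forest splitting step.
set.seed(1)
toy <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

## 1. randomly choose a variable
mydim <- sample(1:ncol(toy), 1)
## 2. randomly choose a cut point c between the min and max of that variable
c_cut <- runif(1, min(toy[, mydim]), max(toy[, mydim]))
## 3. divide the observations into two groups around the cut point
group <- ifelse(toy[, mydim] > c_cut, 1, 2)
table(group)
```

Repeating these three steps refines the partition, which is exactly what the loop over `J` in the first code block does.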
_The midterm project is worth 16 points and due on April 2nd. You may consult other students in the class to finish the midterm, but you must submit your own project. You may not ask for help from me or the teaching assistants._
CODE BLOCK #1: This code block divides the data into groups. The results are recorded in an N × N matrix where entry (i, j) indicates whether observations i and j belong to the same cluster. In this block, I have replaced parts of the code with a question mark. Make the necessary changes to the code.
## you will need to change the path to load the data
out_dat <- read.csv("~/Dropbox/Teaching/341/data/cluster_outlier_set.csv")
getClusters <- function(mydata, J){
N <- nrow(mydata)
K <- ncol(mydata)
cluster_matrix <- matrix(1,N,N)
for (j in c(1:J)){
## 1. sample a dimension (1 point)
#mydim <- sample(1:?,1)
## 2. sample a number c greater than the minimum and less than the maximum of that dimension (1 point)
#c <- runif(1,min(mydata[,?]),max(mydata[,?]))
## 3. compute a matrix that determines whether each pair of points satisfies the condition (1 point)
#tmp <- (mydata[,mydim] < c) %*% t(mydata[,mydim] < c) ? (mydata[,mydim] >= c) %*% t(mydata[,mydim] >= c)
## 4. Update the cluster matrix (1 point)
#cluster_matrix <- ? * tmp
}
return(cluster_matrix)
}
set.seed(341)
mycluster_matrix <- getClusters(out_dat, J = 4)
CODE BLOCK #2: This code block determines the number of groups, assigns each observation to a group, and plots the data coloured by group. In this block, I have again replaced parts of the code with a question mark. Make the necessary changes to the code.
## 5. determine the number of groups (1 point)
#unique_row <- mycluster_matrix[which(!duplicated(mycluster_matrix)),]
#number_groups <- nrow(?)
print(number_groups)
## 6. get the group assignments (1 point)
#mygroups <- apply(unique_row,1,function(x){which(x==1)})
#cluster_assignments <- numeric(nrow(out_dat))
#for (ii in c(1:number_groups)){
# cluster_assignments[mygroups[[ii]]] <- ?
#}
## 7. assign a colour to each group (1 point)
#mycols <- rainbow(n = ?, alpha = .75)
#group_cols <- mycols[cluster_assignments]
## 8. Make a plot the data with the points coloured by cluster number (1 point)
#plot(out_dat[,c(1,2)],typ="p",col=?,pch = 20)
#plot(out_dat[,c(1,3)],typ="p",col=?,pch = 20)
#plot(out_dat[,c(2,3)],typ="p",col=?,pch = 20)
CODE BLOCK #3: Now, let’s repeat this R = 2,500 times and average the results of the cluster matrices. Entry $(i, j)$ in the resulting matrix is the probability that two points belong to the same cluster, $q_{ij}$. We will set the number of clusters, G, to be equal to the number of clusters in the single run. Then we will sample from the set of cluster assignments and plot the results.
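The averaging step can be sketched as follows. The names `avgCoCluster` and `oneCut` are my own for illustration; any function that returns an N × N 0/1 co-cluster matrix (such as `getClusters` above) could be plugged in as `clusterFun`:

```r
## Illustrative sketch: estimate q_ij by averaging 0/1 co-cluster matrices
## over R independent runs of a clustering function.
avgCoCluster <- function(mydata, R, J, clusterFun) {
  N <- nrow(mydata)
  q <- matrix(0, N, N)
  for (r in 1:R) {
    q <- q + clusterFun(mydata, J)  ## accumulate same-cluster indicators
  }
  q / R  ## entry (i, j) estimates q_ij
}

## stand-in splitter: a single random cut along the first dimension
oneCut <- function(mydata, J) {
  cut <- runif(1, min(mydata[, 1]), max(mydata[, 1]))
  g <- as.numeric(mydata[, 1] > cut)
  outer(g, g, "==") * 1
}

set.seed(2)
toy <- matrix(rnorm(30), ncol = 3)
qhat <- avgCoCluster(toy, R = 200, J = 1, clusterFun = oneCut)
```

Each diagonal entry of the averaged matrix is 1 (every point is always in its own cluster), and off-diagonal entries fall in $[0, 1]$.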
In the following code block, I have written a function that computes the logged probability of a cluster assignment
$$p(C_1,\ldots,C_{G}) = \prod_{i=1}^N \prod_{j < i} q_{ij}^{\mathbb{1}(i \sim j)}\,(1-q_{ij})^{1 - \mathbb{1}(i \sim j)}$$
where $i \sim j$ indicates that observations $i$ and $j$ are assigned to the same cluster.
if (p1 > p2){
mycluster[ind] <- newc
}
s <- s + 1
}
## 12. plot the results (1 point - you can use the other code block to do this)
CODE BLOCK #4: Challenge: Now compare the results of the single run to the 2,500 runs. For the two results, compute the sum of squared residuals given by
$$\text{SSR} = \sum_{g = 1}^G \sum_{i: \mathbf{x}_i \in C_g}d(\mathbf{x}_i,\bar{\mathbf{c}}_g)^2$$
where $G$ is the number of groups, $C_g$ is the set of points in cluster $g$, and $d(\mathbf{x}_i,\bar{\mathbf{c}}_g)$ is the Euclidean distance between $X_i$ and the mean of its cluster, $\bar{\mathbf{c}}_g$.
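A direct way to compute this quantity is sketched below. The function name `computeSSR` and the label vector `z` are my own; `z` holds a cluster label for each row of the data matrix `X`:

```r
## Illustrative sketch: SSR for a given cluster assignment, following the
## definition above: sum over clusters of squared distances to the cluster mean.
computeSSR <- function(X, z) {
  ssr <- 0
  for (g in unique(z)) {
    pts <- X[z == g, , drop = FALSE]
    centre <- colMeans(pts)                    ## cluster mean, \bar{c}_g
    ssr <- ssr + sum(sweep(pts, 2, centre)^2)  ## squared Euclidean distances
  }
  ssr
}

## small worked example: two clusters of two points each
X <- matrix(c(0, 0,
              2, 0,
              0, 1,
              0, 3), ncol = 2, byrow = TRUE)
z <- c(1, 1, 2, 2)
computeSSR(X, z)  ## each cluster contributes 1 + 1 = 2, so SSR = 4
```

Applying the same function to both sets of cluster assignments gives the two SSR values to compare.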
## 13. Compute the SSR for the two sets of results (2 points)
## 14. In the space below, answer the following question. What happens to the SSR for a single run of our clustering method as the number of cuts (J) grows large? Why? (2 points)