---
title: "Problem Set 4 – Part B"
output: html_document
---
```{r, echo = FALSE}
library("quanteda", quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
```
### Clustering methods
#### 1. **Distance matrices and hierarchical clustering.**
Suppose that we have four observations, for which we compute a dissimilarity matrix, given by
$$\left[ \begin{array}{cccc}
& 0.3 & 0.4 & 0.7 \\
0.3 & & 0.5 & 0.8 \\
0.4 & 0.5 & & 0.45 \\
0.7 & 0.8 & 0.45 &
\end{array} \right]$$
For instance, the dissimilarity (distance) between the first and second observations is 0.3, and the dissimilarity (distance) between the second and fourth observations is 0.8.
1.a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations. Use the same set of steps outlined in the lecture slides.
Include a graphic (fine to snap a photo of the drawing using your phone, or draw any other way).
**YOUR ANSWER HERE**
1.b) Now export the distances to R and plot the dendrogram using R’s built-in functions. Compare your results.
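
As a hedged illustration of the R workflow (not a solution to this exercise), the sketch below uses a hypothetical three-observation dissimilarity matrix. It assumes complete linkage, which is the default of `hclust()`; the lecture slides may call for a different linkage, in which case the `method` argument would change.

```r
# Hypothetical 3 x 3 dissimilarity matrix (NOT the matrix from this problem).
obs <- c("a", "b", "c")
m <- matrix(c(0.0, 0.2, 0.9,
              0.2, 0.0, 0.6,
              0.9, 0.6, 0.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(obs, obs))

# as.dist() treats the matrix as precomputed distances (lower triangle).
d <- as.dist(m)

# Assumption: complete linkage (hclust's default), stated explicitly.
hc <- hclust(d, method = "complete")

# Fusion heights: a and b fuse first at 0.2; under complete linkage
# c then joins at max(0.9, 0.6) = 0.9.
hc$height

# Draw the dendrogram.
plot(hc)
```
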
```{r}
```
#### 2. **$k$-means clustering on texts.**
2.a) Extract a subset of the texts from the `quanteda.corpora::data_corpus_ukmanifestos` corpus that includes just the Conservative, Labour, and Lib Dem parties from 1970 onward.
```{r}
```
2.b) Perform a $k$-means clustering of these texts for $k=3$. Examine which manifestos fall into each cluster. What do you learn?
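
The mechanics of running $k$-means and inspecting cluster membership can be sketched on toy data. Everything below is hypothetical: the matrix `x` and the row names `doc1` to `doc6` stand in for a document-feature matrix, which would probably need converting with `as.matrix()` before being passed to `kmeans()`.

```r
set.seed(42)  # k-means starts from randomly chosen centroids

# Hypothetical toy data: 6 "documents" x 2 features, in two obvious groups.
x <- rbind(matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2))
rownames(x) <- paste0("doc", 1:6)

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best.
km <- kmeans(x, centers = 2, nstart = 10)

# List which rows fall into each cluster.
split(rownames(x), km$cluster)

# Cluster sizes.
table(km$cluster)
```
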
```{r}
```
**YOUR ANSWER HERE.**
2.c) Now perform a $k$-means clustering of the texts for each $k$ from 1 to 8. For each outcome, save the total within-group sum of squares. Plot the log of the total within-group sum of squares as a function of $k$. Use this "scree plot" to select the best $k$ using the elbow method described in the lecture.
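
One way the loop over $k$ might look is sketched below on hypothetical toy data (three simulated groups, not the manifestos). The names `x`, `ks`, and `wss` are illustrative; `tot.withinss` is the component of the `kmeans()` result that holds the total within-group sum of squares.

```r
set.seed(1)

# Hypothetical toy data: three well-separated groups of 20 points each.
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 6),  ncol = 2),
           matrix(rnorm(40, mean = 12), ncol = 2))

# Fit k-means for k = 1, ..., 8 and keep the total within-group SS each time.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# Scree plot on the log scale; the "elbow" is where the curve flattens.
plot(ks, log(wss), type = "b",
     xlab = "k", ylab = "log total within-group sum of squares")
```
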
```{r}
```
**YOUR ANSWER HERE.**
2.d) Examine the clusters of party labels produced by this "best-fitting" $k$-cluster solution. Do the groupings make sense?
```{r}
```
**YOUR ANSWER HERE.**
2.e) Now repeat (c)-(d) after weighting the dfm by the relative proportion of terms within documents. What difference does this make?
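
To make "relative proportion of terms within documents" concrete, here is a minimal base R sketch on a hypothetical count matrix (the documents and features are invented). For a quanteda dfm, `dfm_weight()` with `scheme = "prop"` should perform the equivalent transformation.

```r
# Hypothetical document-feature counts (rows = documents, columns = terms).
counts <- matrix(c( 2,  1,  1,
                   20, 10, 10,
                    1,  1,  6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("short", "long", "other"),
                                 c("tax", "health", "europe")))

# Divide each row by its total so that every document sums to 1.
props <- counts / rowSums(counts)

# Check: each row of proportions sums to 1.
rowSums(props)

# "short" and "long" differ only in length, so their proportions coincide.
props[c("short", "long"), ]
```
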
```{r}
```
**With just a few exceptions, now it seems each cluster corresponds to a different party.**
#### 3. **Hierarchical clustering on texts.**
3.a) Compute the matrix of Euclidean distances between each of the party manifestos in the previous exercise. Should you use relative frequencies rather than counts here? Apply the option that makes more sense based on what you know about Euclidean distances.
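
For reference, the Euclidean distance between two documents can be written as follows. The notation is an assumption introduced here, not taken from the lecture: $x_{ij}$ is the value (count or proportion) of feature $j$ in document $i$, and $J$ is the number of features.

$$d(\mathbf{x}_i, \mathbf{x}_k) = \sqrt{\sum_{j=1}^{J} \left( x_{ij} - x_{kj} \right)^2}$$

The differences are taken feature by feature on whatever scale the values are on, so the choice between raw counts and within-document proportions feeds directly into the distances.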
```{r}
```
**YOUR ANSWER HERE.**
3.b) Plot the dendrogram and describe the groupings.
```{r, fig.width = 6, fig.height = 8}
```
**YOUR ANSWER HERE.**