MATH 189 Homework Assignment 5
Due May 10th, 2019
Problem 1
Let's revisit Fisher's iris dataset (iris.CSV) and treat it as a clustering problem. To mimic an unsupervised learning problem, discard the true species labels during the learning process and use them only to validate your clustering results.
1. Apply Agglomerative Hierarchical Clustering (agnes) to the dataset. Use the Euclidean distance as the distance measure and average linkage as the linkage function. Draw the dendrogram of your clustering results. Cut the dendrogram at three clusters. Compare your clustering labels with the truth. Report the correct clustering rate, which is the ratio of the number of correctly classified observations to the total sample size.
2. Report the correct clustering rate under each of the following three scenarios:
(1) Retain all the settings in part 1, except use the Taxicab (Manhattan) distance as the distance measure.
(2) Retain all the settings in part 1, except use single linkage as the linkage function.
(3) Retain all the settings in part 1, except standardize each variable before applying the clustering. (Standardize each column in the dataset to have zero mean and unit variance.)
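A note on the correct clustering rate: the cluster labels produced by a cut are arbitrary integers, so they need to be matched to species names before correct observations can be counted. The Python sketch below shows one possible way to do this, by trying every one-to-one assignment of cluster IDs to species and keeping the best match. The function name correct_clustering_rate and the toy labels are illustrative assumptions, not part of the assignment or of the iris results, and the sketch assumes there are no more clusters than species.

```python
from itertools import permutations


def correct_clustering_rate(true_labels, cluster_labels):
    """Fraction of observations matched under the best cluster-to-class mapping.

    Assumption: the number of distinct clusters does not exceed the number
    of distinct true classes (three and three in this assignment).
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    best_hits = 0
    # Try every one-to-one assignment of cluster IDs to class names.
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_labels, cluster_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Toy example with made-up labels (NOT results from the iris data).
truth = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
cut = [2, 2, 1, 1, 1, 3]
rate = correct_clustering_rate(truth, cut)
print(rate)  # 5 of the 6 toy observations match under the best mapping
```

With only three clusters there are just six assignments to try, so the exhaustive search is cheap.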
Some technical hints
For students working with R:
See week5_demo.pdf on Piazza for the implementation of agnes.
To cut the dendrogram, you can use the following code:
clusterCut = cutree(hc.average, 3)
Here hc.average is the cluster object generated by the hclust function, and 3 means the tree is cut into 3 clusters. You can obtain the clustering labels with print(clusterCut). For more details, see week5_demo.pdf on Piazza.
For students working with Matlab:
See the following link for an example of implementing Hierarchical Clustering:
https://www.mathworks.com/help/stats/hierarchical-clustering.html
For students working with Python:
See the following link for an example of implementing Hierarchical Clustering:
https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/
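As a rough Python counterpart to the cutree hint given for R, the sketch below uses SciPy rather than the scikit-learn approach in the linked tutorial. It shows how the linkage function, the distance, and standardization can each be changed one at a time. The data are synthetic stand-ins (three well-separated groups), not the iris measurements. The helper name cluster_labels is hypothetical, and using the sample standard deviation (ddof=1) for standardization is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the four iris measurements (NOT the real data):
# three well-separated groups of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [5, 5, 5, 5], [10, 0, 10, 0]], dtype=float)
X = np.vstack([c + 0.3 * rng.standard_normal((10, 4)) for c in centers])


def cluster_labels(data, method="average", metric="euclidean", k=3):
    """Agglomerative clustering cut into at most k clusters (labels start at 1)."""
    Z = linkage(data, method=method, metric=metric)  # builds the merge tree
    return fcluster(Z, t=k, criterion="maxclust")    # plays the role of cutree


base = cluster_labels(X)                         # settings of part 1
taxicab = cluster_labels(X, metric="cityblock")  # part 2 (1): Taxicab distance
single = cluster_labels(X, method="single")      # part 2 (2): single linkage

# Part 2 (3): give each column zero mean and unit variance, then cluster.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
standardized = cluster_labels(X_std)

print(base)
```

The linkage matrix Z can also be passed to scipy.cluster.hierarchy.dendrogram to draw the tree, which requires matplotlib for plotting.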