ECE 219 Large-Scale Data Mining: Models and Algorithms
Project 2: Data Representations and Clustering
Introduction
Machine learning algorithms are applied to a wide variety of data, including text and images. Before applying these algorithms, one needs to convert the raw data into feature representations that are suitable for downstream algorithms. In Project 1, we studied feature extraction from text data, and the downstream task of classification. We also learned that reducing the dimension of the extracted features often helps with a downstream task.
In this project, we explore the concepts of feature extraction and clustering together. In an ideal world, all we need are data points, encoded using certain features, and AI should be able to find what is important to learn, or more specifically, determine the underlying modes or categories in the dataset. This is the ultimate goal of General AI: a machine that can bootstrap a knowledge base, acting as its own teacher and interacting with the outside world so that it can learn to operate autonomously in an environment.
We first explore this field of unsupervised learning using textual data, which is a continuation of concepts learned in Project 1. We ask if a combination of feature engineering and clustering techniques can automatically separate a document set into groups that match known labels.
Next we focus on a new type of data, i.e. images. Specifically, we first explore how to use “deep learning” or “deep neural networks (DNNs)” to obtain image features. Large neural networks have been trained on huge labeled image datasets to recognize objects of different types from images. For example, networks trained on the ImageNet dataset can classify more than one thousand different categories of objects. Such networks can be viewed as comprising two parts: the first part maps a given RGB image into a feature vector using convolutional filters, and the second part then classifies this feature vector into an appropriate category, using a fully-connected multi-layered neural network (we will study such NNs in a later lecture). Such pre-trained networks can be considered as experienced agents that have learned to discover features that are salient for image understanding. Can one use the experience of such pre-trained agents in understanding new images that the machine has never seen before? It is akin to asking a human expert on forensics to explore a new crime scene. One would expect such an expert to be able to transfer their domain knowledge into a new scenario. In a similar vein, can a pre-trained network for image understanding be used for transfer learning? One could use the output of the network in the last few layers as expert features. Then, given a multi-modal dataset, consisting of images from categories that the DNN was not trained for, one can use feature engineering (such as dimensionality reduction) and clustering algorithms to automatically extract unlabeled categories from such expert features.
For both the text and image data, one can use a common set of multiple evaluation metrics to compare the groups extracted by the unsupervised learning algorithms to the corresponding ground truth human labels.
Clustering Methods
Clustering is the task of grouping a dataset in such a way that data points in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Thus, there is an inherent notion of a metric that is used to compute similarity among data points, and different clustering algorithms differ in the type of similarity measure they use, e.g., Euclidean vs Riemannian geometry. Clustering algorithms are considered “unsupervised learning”, i.e. they do not require labels during training. In principle, if two categories of objects or concepts are distinct from some perspective (e.g. visual or functional), then data points from these two categories – when properly coded in a feature space and augmented with an associated distance metric – should form distinct clusters. Thus, if one can perform perfect clustering then one can discover and obtain computational characterizations of categories without any labeling. In practice, however, finding such optimal choices of features and metrics has proven to be a computationally intractable task, and any clustering result needs to be validated against tasks for which one can measure performance. Thus, we use labeled datasets in this project, which allows us to evaluate the learned clusters by comparing them with ground truth labels.
Below, we summarize several clustering algorithms:
K-means: K-means clustering is a simple and popular clustering algorithm. Given a set of data points $\{x_1, \ldots, x_N\}$ in multidimensional space, and a hyperparameter $K$ denoting the number of clusters, the algorithm finds the $K$ cluster centers such that each data point belongs to exactly one cluster. This cluster membership is found by minimizing the sum of the squares of the distances between each data point and the center of the cluster it belongs to. If we define $\mu_k$ to be the “center” of the $k$th cluster, and

$$r_{nk} = \begin{cases} 1, & \text{if } x_n \text{ is assigned to cluster } k \\ 0, & \text{otherwise} \end{cases}, \qquad n = 1,\ldots,N, \quad k = 1,\ldots,K,$$

then our goal is to find the $r_{nk}$'s and $\mu_k$'s that minimize $J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2$. The approach of the K-means algorithm is to repeatedly perform the following two steps until convergence:
1. (Re)assign each data point to the cluster whose center is nearest to the data point.
2. (Re)calculate the position of the centers of the clusters: setting the center of the cluster to the mean of the data points that are currently within the cluster.
The center positions may be initialized randomly.
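To make these two alternating steps concrete, here is a minimal sketch using scikit-learn's KMeans on a small synthetic dataset; the dataset and all parameter values are illustrative assumptions, not part of the project specification.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn from 3 Gaussian blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# KMeans performs the assignment and center-update steps internally until convergence
kmeans = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
cluster_labels = kmeans.fit_predict(X)  # cluster index for each point (the r_nk assignments)
centers = kmeans.cluster_centers_       # final cluster centers (the mu_k)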
Hierarchical Clustering: Hierarchical clustering is a general family of clustering algorithms that builds nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). A flat clustering result is obtained by cutting the dendrogram at a level that yields a desired number of clusters.
DBSCAN: DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, finds core samples of high density and expands clusters from them. It is a non-parametric, density-based clustering algorithm: given a set of points, it groups together points that are closely packed (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).
HDBSCAN: HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using an empirical technique to extract a flat clustering based on the stability of clusters (similar to the elbow method in K-means). The resulting algorithm gets rid of the hyperparameter “epsilon”, which is necessary in DBSCAN (see here for more on that).
Common Clustering Evaluation Metrics
In order to evaluate a clustering pipeline, one can use the ground-truth class labels and compare them with the cluster labels. This analysis determines the quality of the clustering algorithm in recovering the underlying ground-truth labels. It also indicates whether the adopted feature extraction and dimensionality reduction methods retain enough information about the ground-truth classes. Below we describe several evaluation metrics available in sklearn.metrics. We emphasize that for clustering evaluation you do not need to separate your data into training and test sets. (Why?)
Homogeneity is a measure of how “pure” the clusters are. Homogeneity is satisfied if each cluster contains only data points from a single class.
Completeness indicates how many of the data points of a given class are assigned to the same cluster.
V-measure is the harmonic mean of the homogeneity score and the completeness score.
Adjusted Rand Index is similar to accuracy: it computes the similarity between the clustering labels and the ground-truth labels by counting all pairs of points that either fall in the same cluster and the same class, or in different clusters and different classes.
Adjusted mutual information score measures the mutual information between the cluster label distribution and the ground-truth label distribution, adjusted for chance.
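Since these five measures are reported repeatedly throughout this project, it may help to wrap them in a small helper. The sketch below (the function name report_metrics is our own choice, not part of the project) computes all of them with sklearn.metrics from the ground-truth labels and the cluster labels.

from sklearn import metrics

def report_metrics(true_labels, cluster_labels):
    # The five clustering evaluation metrics used throughout this project
    print("Homogeneity: %.3f" % metrics.homogeneity_score(true_labels, cluster_labels))
    print("Completeness: %.3f" % metrics.completeness_score(true_labels, cluster_labels))
    print("V-measure: %.3f" % metrics.v_measure_score(true_labels, cluster_labels))
    print("Adjusted Rand Index: %.3f" % metrics.adjusted_rand_score(true_labels, cluster_labels))
    print("Adjusted Mutual Information: %.3f" % metrics.adjusted_mutual_info_score(true_labels, cluster_labels))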
Dimensionality Reduction Methods
In project 1, we studied SVD/PCA and NMF as linear dimensionality reduction techniques. Here, we consider some additional non-linear methods.
Uniform Manifold Approximation and Projection (UMAP): The UMAP algorithm constructs a graph-based representation of the high-dimensional data manifold, and learns a low-dimensional representation space based on the relative inter-point distances. UMAP allows more choices of distance metrics besides the Euclidean distance. In particular, we are interested in the “cosine distance” for text data because, as we shall see, it ignores the magnitude of the vectors, so that the length of a document does not affect the distance metric.
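As a brief illustration of how the metric choice enters the pipeline, a call to the umap-learn package might look like the sketch below; the variable X_tfidf and the value of n_components are assumptions made only for illustration.

import umap

# Reduce the TF-IDF vectors to 20 dimensions using cosine distance
reducer = umap.UMAP(n_components=20, metric="cosine", random_state=42)
X_umap = reducer.fit_transform(X_tfidf)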
Autoencoders: An autoencoder¹ is a special type of neural network that is trained to copy its input to its output. For example, given an image of a handwritten digit, an autoencoder first encodes the image into a lower-dimensional latent representation, then decodes the latent representation back to an image. An autoencoder learns to compress the data while minimizing the reconstruction error. Further details can be found in chapter 14 of [4].
¹ Also known as “auto-associative networks” in older jargon.
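For concreteness, the sketch below shows one possible fully-connected autoencoder in PyTorch; the framework choice, layer sizes, and input dimension are our own assumptions, not requirements of the project.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    # Encoder compresses the input to a latent code; decoder reconstructs the input
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # low-dimensional latent representation
        return self.decoder(z)   # reconstruction of the input

# Training minimizes the reconstruction error (mean squared error)
model = Autoencoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)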
Part 1 – Clustering on Text Data
In this part of the project, we will work with the “20 Newsgroups” dataset, which is a collection of approximately 20,000 documents, partitioned (nearly) evenly across 20 different newsgroups (newsgroups are discussion groups like forums, which originated during the early age of the Internet), each corresponding to a different topic. Use the fetch_20newsgroups function from scikit-learn to load the dataset. Detailed usage of the dataset and sample code can be found at this link.
To get started with a simple clustering task, we work with a well-separable subset of samples from the larger dataset. Specifically, we define two classes comprising the following categories (see Table 1).
Table 1: Two well-separated classes

Class 1: comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware
Class 2: rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
Clustering with Sparse Text Representations
1. Generate sparse TF-IDF representations: Following the steps in Project 1, transform the documents into TF-IDF vectors. Use min_df=3, exclude the stopwords (no need to do stemming or lemmatization), and remove the headers and footers.
QUESTION 1: Report the dimensions of the TF-IDF matrix you obtain.
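One possible way to build this matrix is sketched below; the category names come from Table 1 and the parameter values from the instructions above, while variable names such as X_tfidf and y_true are our own.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ["comp.graphics", "comp.os.ms-windows.misc",
              "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware",
              "rec.autos", "rec.motorcycles",
              "rec.sport.baseball", "rec.sport.hockey"]

# Load all documents from the eight categories, removing headers and footers
dataset = fetch_20newsgroups(subset="all", categories=categories,
                             remove=("headers", "footers"),
                             shuffle=True, random_state=0)

# Ground-truth binary class: 0 for the "comp.*" categories, 1 for "rec.*"
y_true = [0 if dataset.target_names[t].startswith("comp") else 1
          for t in dataset.target]

# TF-IDF with min_df=3 and English stopwords excluded
vectorizer = TfidfVectorizer(min_df=3, stop_words="english")
X_tfidf = vectorizer.fit_transform(dataset.data)
print(X_tfidf.shape)  # the dimensions asked for in Question 1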
2. Clustering: Apply K-means clustering with k = 2 using the TF-IDF data. Note that the KMeans class in sklearn has parameters named random_state, max_iter and n_init. Please use random_state=0, max_iter ≥ 1000 and n_init ≥ 30 (if you have enough computation power, the larger the better). You can refer to sklearn – Clustering text documents using k-means for a basic workflow.
(a) Given the clustering result and ground truth labels, the contingency table A is the matrix whose entry Aij is the number of data points that belong to the i-th class and the j-th cluster.
QUESTION 2: Report the contingency table of your clustering result. You may use the provided plotmat.py to visualize the matrix. Does the contingency matrix have to be square-shaped?
QUESTION 3: Report the 5 clustering measures explained in the introduction for K-means clustering.
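A sketch of this step, reusing X_tfidf and y_true from the earlier sketch and the report_metrics helper from the metrics section, might look like:

from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix

# Cluster the sparse TF-IDF vectors into two groups
km = KMeans(n_clusters=2, random_state=0, max_iter=1000, n_init=30)
cluster_labels = km.fit_predict(X_tfidf)

# Contingency table: rows are ground-truth classes, columns are clusters
print(contingency_matrix(y_true, cluster_labels))

# Five clustering measures for Question 3 (helper sketched in the metrics section)
report_metrics(y_true, cluster_labels)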
Clustering with Dense Text Representations
As you may have observed, high-dimensional sparse TF-IDF vectors do not yield a good clustering result, especially when K-means clustering is used. One of the reasons is that in a high-dimensional space, the Euclidean distance is no longer a good metric, in the sense that the distances between data points tend to be almost the same (see [1]).
K-means clustering has other limitations. Since its objective is to minimize the sum of within-cluster l2 distances, it implicitly assumes that the clusters are isotropic, i.e. round-shaped. When the clusters are not round-shaped, K-means may fail to identify the clusters properly. Even when the clusters are round, the K-means algorithm may also fail when the clusters have unequal variances. A direct visualization of these problems can be found at sklearn – Demonstration of k-means assumptions.
In this part we try to find a “better” representation tailored to the downstream task of K-means clustering. Towards this end, we can reduce the dimension of our data with different methods before clustering.
1. Generate dense representations for better K-Means Clustering
(a) First we want to find the effective dimension of the data through inspection of the top singular values of the TF-IDF matrix and see how many of them are significant in reconstructing the matrix with the truncated SVD representation. A guideline is to see what ratio of the variance of the original data is retained after the dimensionality reduction.
QUESTION 4: Report the plot of the percentage of variance that the top r principal components retain vs. r, for r = 1 to 1000.
Hint: explained_variance_ratio_ of TruncatedSVD objects. See the sklearn documentation.
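One way to produce this plot is sketched below, assuming X_tfidf is the TF-IDF matrix built earlier.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# Fit a single truncated SVD with 1000 components
svd = TruncatedSVD(n_components=1000, random_state=0)
svd.fit(X_tfidf)

# Cumulative fraction of variance retained by the top r components
cumulative_ratio = np.cumsum(svd.explained_variance_ratio_)
plt.plot(np.arange(1, 1001), cumulative_ratio)
plt.xlabel("Number of principal components r")
plt.ylabel("Retained variance ratio")
plt.show()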
(b) Now, use the following two methods to reduce the dimension of the data. Sweep over the dimension parameters for each method, and choose one that yields better results in terms of clustering purity metrics.
• Truncated SVD / PCA
Note that you don’t need to perform SVD multiple times: performing SVD with r = 1000 gives you the data projected onto all of the top 1000 principal components, so for smaller r’s, you just need to exclude the least important features.
QUESTION 5: Let r be the dimension that we want to reduce the data to (i.e. n_components). Try r = 1, 2, 3, 5, 10, 20, 50, 100, 300, and plot the 5 measure scores vs. r for both SVD and NMF. Report a good choice of r for SVD and NMF respectively (a sketch of this sweep is given below).
Note: In the choice of r, there is a trade-off between information preservation and the better performance of K-means in lower dimensions.
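The sweep referred to in Question 5 could be organized as in the sketch below; it assumes X_tfidf and y_true from the earlier sketches, and the NMF settings (init, max_iter) are illustrative assumptions.

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, TruncatedSVD

def five_scores(y, labels):
    # Homogeneity, completeness, V-measure, adjusted Rand index, adjusted mutual information
    return [metrics.homogeneity_score(y, labels),
            metrics.completeness_score(y, labels),
            metrics.v_measure_score(y, labels),
            metrics.adjusted_rand_score(y, labels),
            metrics.adjusted_mutual_info_score(y, labels)]

r_values = [1, 2, 3, 5, 10, 20, 50, 100, 300]

# SVD: fit once with the largest r, then keep only the leading columns for smaller r
X_svd_full = TruncatedSVD(n_components=max(r_values), random_state=0).fit_transform(X_tfidf)

svd_scores, nmf_scores = [], []
for r in r_values:
    km = KMeans(n_clusters=2, random_state=0, max_iter=1000, n_init=30)
    svd_scores.append(five_scores(y_true, km.fit_predict(X_svd_full[:, :r])))

    # NMF has to be re-fit for each r
    X_nmf = NMF(n_components=r, init="random", random_state=0, max_iter=400).fit_transform(X_tfidf)
    km = KMeans(n_clusters=2, random_state=0, max_iter=1000, n_init=30)
    nmf_scores.append(five_scores(y_true, km.fit_predict(X_nmf)))

# np.array(svd_scores) and np.array(nmf_scores) can then be plotted against r_values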
QUESTION 6: How do you explain the non-monotonic behavior of the measures as r increases?
QUESTION 7: Are these measures on average better than those computed in Question 3?
2. Visualize the clusters
We can visualize the clustering results by projecting the dimension-reduced data points onto a 2-D plane by once again using SVD, and coloring the points according to:
• the ground truth class label;
• the clustering label,
respectively.
QUESTION 8: Visualize the clustering results for:
• SVD with your optimal choice of r for K-means clustering;
• NMF with your choice of r for K-means clustering.
To recap, you can accomplish this by first creating the dense representations and then once again projecting these representations onto a 2-D plane for visualization (a sketch is given below).
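A possible way to produce these plots, assuming X_reduced is the SVD- or NMF-reduced data at your chosen r, y_true are the class labels, and cluster_labels are the corresponding K-means assignments, is sketched below.

import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# Project the r-dimensional representation down to 2-D for plotting
X_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_reduced)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(X_2d[:, 0], X_2d[:, 1], c=y_true, s=5)          # ground truth classes
axes[0].set_title("Ground truth labels")
axes[1].scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, s=5)  # K-means assignments
axes[1].set_title("K-means cluster labels")
plt.show()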
QUESTION 9: What do you observe in the visualization? How are the data points of the two classes distributed? Is the distribution of the data ideal for K-means clustering?
3. Clustering of the Entire 20 Classes
We have been dealing with a relatively simple clustering task with only two well-separated classes. Now let’s face a more challenging one: clustering for the entire 20 categories in the 20newsgroups dataset.
QUESTION 10: Load documents with the same configuration as in Question 1, but for ALL 20 categories. Construct the TF-IDF matrix, reduce its dimensionality properly using either NMF or SVD, and perform K-means clustering with k=20. Visualize the contingency matrix and report the five clustering metrics.
There is a mismatch between cluster labels and class labels. For example, cluster #3 may correspond to class #8. As a result, the high-value entries of the 20 × 20 contingency matrix can be scattered around, making it messy to inspect, even if the clustering result is not bad.
One can use scipy.optimize.linear_sum_assignment to identify the best-matching cluster-class pairs, and permute the columns of the contingency matrix accordingly. See below for an example:
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

from plotmat import plot_mat  # using the provided plotmat.py

# labels: ground-truth class labels; clustering_labels: K-means cluster assignments
cm = confusion_matrix(labels, clustering_labels)

# Best one-to-one matching between classes (rows) and clusters (columns)
rows, cols = linear_sum_assignment(cm, maximize=True)

# Plot the contingency matrix with its columns permuted according to the matching
plot_mat(cm[rows[:, np.newaxis], cols], xticklabels=cols,
         yticklabels=rows, size=(15, 15))
The clustering result is probably not very satisfactory for the 20-category data. To see if we can improve this performance, we consider UMAP for dimensionality reduction. To motivate using UMAP, consider two documents that are about the same topic and are similar, but one is very long while the other is short. The magnitude of the TF-IDF vector will be high for the long document and low for the short one, even though the orientation of their TF-IDF vectors might be very close. In this case, the cosine distance adopted by UMAP will correctly identify the similarity, whereas the Euclidean distance might fail.
QUESTION 11: Use UMAP to reduce the dimensionality of the 20-category TF-IDF matrix, and apply K-means clustering with n_clusters=20.
Find a good n_components choice for UMAP, and compare the performance of two distance metrics by setting metric="euclidean" and metric="cosine" respectively (a sketch is given below).
Report the permuted contingency matrix and the five clustering evaluation metrics for "euclidean" and "cosine".
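One way to organize this comparison is sketched below; X_tfidf_20 and y_true_20 denote the TF-IDF matrix and labels for all 20 categories, and the candidate n_components values are illustrative assumptions.

import umap
from sklearn.cluster import KMeans

for metric in ["euclidean", "cosine"]:
    for n_comp in [5, 20, 200]:  # candidate dimensions to sweep (illustrative)
        X_umap = umap.UMAP(n_components=n_comp, metric=metric,
                           random_state=42).fit_transform(X_tfidf_20)
        labels = KMeans(n_clusters=20, random_state=0, max_iter=1000,
                        n_init=30).fit_predict(X_umap)
        # evaluate 'labels' against y_true_20 with the five clustering metrics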
QUESTION 12: Analyze the contingency matrix.
QUESTION 13: So far, we have attempted K-Means clustering with 4 different representation learning techniques (sparse representation, PCA, NMF, UMAP). Compare and contrast the results from the previous sections, and discuss which approach is best for the K-Means clustering task on the 20-class text data.
Clustering Algorithms that do not explicitly rely on the Gaussian distribution per cluster
While we have successfully shown in the previous section that some representation learning techniques perform better than others for the task of K-means clustering on this text dataset, this sweep only covers half of the end-to-end pipeline, namely the representation learning part. In this section we introduce 3 additional clustering algorithms.
1. Agglomerative Clustering
The AgglomerativeClustering object performs a hierarchical clustering using a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together. There are 4 linkage criteria that determine the merge strategy.
QUESTION 14: Use UMAP to reduce the dimensionality properly, and perform Agglomerative clustering with n_clusters=20. Compare the performance of the "ward" and "single" linkage criteria (a sketch is given below).
Report the five clustering evaluation metrics for each case.
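A minimal sketch of this comparison, assuming X_umap is the UMAP-reduced 20-category data, is shown below.

from sklearn.cluster import AgglomerativeClustering

for linkage in ["ward", "single"]:
    agg = AgglomerativeClustering(n_clusters=20, linkage=linkage)
    labels = agg.fit_predict(X_umap)
    # evaluate 'labels' against the ground-truth labels with the five clustering metrics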
2. DBSCAN and HDBSCAN
QUESTION 15: Apply DBSCAN and HDBSCAN on the UMAP-transformed 20-category data. Use min_cluster_size=100 for HDBSCAN.
Experiment with the hyperparameters and report your findings in terms of the five clustering evaluation metrics (a sketch is given below).
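A starting point for these experiments, assuming X_umap is the UMAP-reduced 20-category data and using the standalone hdbscan package (recent scikit-learn versions also provide sklearn.cluster.HDBSCAN), is sketched below; the DBSCAN eps and min_samples values are only initial guesses to tune.

import hdbscan
from sklearn.cluster import DBSCAN

# DBSCAN: eps and min_samples need to be tuned for the UMAP-transformed data
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_umap)

# HDBSCAN with the required minimum cluster size
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=100).fit_predict(X_umap)

# Noise points are labeled -1; evaluate both label vectors with the five clustering metrics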
QUESTION 16: Contingency matrix. Plot the contingency matrix for the best clustering model from Question 15.
How many clusters are given by the model? What does "-1" mean for the clustering labels? Interpret the contingency matrix considering the answers to these questions.
QUESTION 17: Based on your experiments, which dimensionality reduction technique and clustering method worked best together for the 20-class text data, and why? Consider the alternatives summarized in the table below.

Module                       Alternatives
Dimensionality Reduction     Sparse TF-IDF (none), SVD/PCA, NMF, UMAP
Clustering                   K-means, Agglomerative Clustering, DBSCAN, HDBSCAN

Hint: DBSCAN and HDBSCAN do not accept the number of clusters as an input parameter. So pay close attention to how the different clustering metrics are being computed for these methods.

QUESTION 18: Bonus. If you can find creative ways to further enhance the clustering performance, report your method and the results you obtain.