
Clustering

Reference book: Bishop, "Pattern Recognition and Machine Learning", Chapter 9.1

Clustering

• Unsupervised learning

• Generating “classes”

• Distance/similarity measures

• Agglomerative methods

• Divisive methods

Unsupervised Learning

• No labels/responses. Finding structure in data.

• Dimensionality Reduction.

Clustering: $T:\mathbb{R}^D \to \{1, 2, \ldots, K\}$

Subspace learning: $T:\mathbb{R}^D \to \mathbb{R}^m$, $m < D$

Uses of Unsupervised Learning

• Data compression. Example: a list of RGB pixels (12,165,73), (11,167,79), … is replaced by cluster labels 3, 43, … together with a dictionary 1 ~ (10, 160, 70), 2 ~ (40, 240, 20), …

• To improve classification/regression (semi-supervised learning):

1. From unlabeled data, learn a good feature map $T:\mathbb{R}^D \to \mathbb{R}^m$.
2. Apply the transformation $T$ to the labeled data: $(T(\boldsymbol{x}_1), \boldsymbol{y}_1), \ldots, (T(\boldsymbol{x}_N), \boldsymbol{y}_N)$.
3. Perform classification/regression on the transformed low-dimensional data.

What is Clustering?

• Unsupervised learning - no information from a teacher

• The process of partitioning a set of data into a set of meaningful (hopefully) sub-classes, called clusters

• Cluster:
  • a collection of data points that are "similar" to one another and should collectively be treated as a group (not the "group" we learned in linear algebra)
  • as a collection, sufficiently different from other groups

Clustering Problem

Input. Training data $\mathcal{S}_N = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}$, $\boldsymbol{x}_n \in \mathbb{R}^D$, and an integer $K$.

Output. Clusters $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_K \subset \{1, 2, \ldots, N\}$ such that every data point is in one and only one cluster, and cluster representatives $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$. Some clusters could be empty!

How to Specify a Cluster

• By listing all its elements, e.g. $\mathcal{C}_1 = \{1,3,4,7,8,11\}$, $\mathcal{C}_2 = \{2,5,6,9,10\}$

• Using a representative:
  a. a point at the center of the cluster (centroid)
  b. a point in the training data (exemplar)

Each point $\boldsymbol{x}_n$ will be assigned to the closest representative. We can partition all the points in the space into regions according to their closest representative: this gives a Voronoi diagram.

Distance/Similarity Measures (sometimes called loss functions)

A distance measure quantifies how close two data points are. Nearby points (i.e. small distance) are more likely to belong to the same cluster.

• Euclidean distance
$$\mathrm{dist}(\boldsymbol{x}, \boldsymbol{y}) = \|\boldsymbol{x} - \boldsymbol{y}\|^2 := \sum_{i=1}^{D} (x_i - y_i)^2$$

A similarity measure quantifies how alike two data points are. Similar points (i.e. large similarity) are more likely to belong to the same cluster.

• Cosine similarity
$$\cos(\boldsymbol{x}, \boldsymbol{y}) = \frac{\boldsymbol{x}^{\mathsf{T}} \boldsymbol{y}}{\|\boldsymbol{x}\|\,\|\boldsymbol{y}\|}$$

Distance (Similarity) Matrix

$d_{ij}$ = similarity (or distance) of item $I_i$ to item $I_j$:

$$\begin{array}{c|cccc}
 & I_1 & I_2 & \cdots & I_n \\ \hline
I_1 & & d_{12} & \cdots & d_{1n} \\
I_2 & d_{21} & & \cdots & d_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
I_n & d_{n1} & d_{n2} & \cdots &
\end{array}$$

Note that $d_{ij} = d_{ji}$ (i.e., the matrix is symmetric), so we only need the lower-triangular part of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).
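To make the distance and similarity measures above concrete, here is a minimal NumPy sketch (illustrative only; the function names are my own, not from the lecture) computing the squared Euclidean distance, the cosine similarity, and a symmetric pairwise distance matrix:

```python
import numpy as np

def sq_euclidean(x, y):
    """Squared Euclidean distance: sum_i (x_i - y_i)^2."""
    d = x - y
    return float(np.dot(d, d))

def cosine_similarity(x, y):
    """Cosine similarity: x^T y / (||x|| * ||y||)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def distance_matrix(X):
    """N x N symmetric matrix of pairwise squared Euclidean distances.
    Only the entries below the diagonal are really needed; the diagonal is 0."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            D[i, j] = D[j, i] = sq_euclidean(X[i], X[j])
    return D

# Example: 4 points in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
print(distance_matrix(X))
print(cosine_similarity(X[1], X[3]))
```

Because the matrix is symmetric with a constant diagonal, storing only the lower triangle halves the memory, which is how the term-term example below is displayed.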
Similarity (Distance) Matrix

• Based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values

• The (i, j)th entry in the matrix is the distance (similarity) between items i and j

Example: Term Similarities in Documents

        T1  T2  T3  T4  T5  T6  T7  T8
Doc1     0   4   0   0   0   2   1   3
Doc2     3   1   4   3   1   2   0   1
Doc3     3   0   0   0   3   0   3   0
Doc4     0   1   0   3   0   0   2   0
Doc5     2   2   2   3   1   4   0   2

$$\mathrm{sim}(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \times w_{jk})$$

where $w_{ik}$ is the weight of term $T_i$ in document $k$ and $N$ is the number of documents.

Term-Term Similarity Matrix

      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3

Similarity (Distance) Thresholds

• A similarity (distance) threshold may be used to mark pairs that are "sufficiently" similar. Using a threshold value of 10 in the previous example:

      T1  T2  T3  T4  T5  T6  T7
T2     0
T3     1   0
T4     1   1   1
T5     1   0   0   0
T6     1   1   1   1   0
T7     0   0   0   0   0   0
T8     0   1   0   0   0   1   0

Graph Representation

• The similarity matrix can be visualized as an undirected graph

• Each item is represented by a node, and edges represent the fact that two items are similar (a 1 in the thresholded similarity matrix)

• If no threshold is used, the matrix can be represented as a weighted graph

(Figure: the graph for the thresholded example, with nodes T1 through T8 and edges between the pairs marked 1.)

Training Loss

The training loss is the sum of squared distances from each point to its closest representative (cluster centre).

• Example: loss ≈ 11 × 1² = 11, assuming the length of each edge is about 1.

• Example with the representative at the cluster centre: loss ≈ 9 × 0.1² = 0.09, assuming the length of each edge is about 0.1.

Optimizing both clusters and representatives:

$$\mathcal{L}(\boldsymbol{R}, \boldsymbol{M}; \mathcal{S}_N) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\boldsymbol{x}_n - \boldsymbol{\mu}_k\|^2$$

where $\mathcal{S}_N = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}$, $\boldsymbol{x}_n \in \mathbb{R}^D$.

$r_{nk} \in \{0, 1\}$ denotes which of the $K$ clusters data point $\boldsymbol{x}_n$ is assigned to: if $\boldsymbol{x}_n$ is assigned to cluster $k$, then $r_{nk} = 1$ and $r_{nj} = 0$ for $j \neq k$.

$\boldsymbol{R} = (r_{nk}) \in \{0, 1\}^{N \times K}$, $n = 1, \ldots, N$, $k = 1, \ldots, K$; $\boldsymbol{M} = (\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K)$.

Goal: find values for $\boldsymbol{R}$ and $\boldsymbol{M}$ so as to minimize $\mathcal{L}$.

Basic Clustering Methodology

Two approaches:

• Agglomerative: pairs of items/clusters are successively linked to produce larger clusters

• Divisive (partitioning): items are initially placed in one cluster and then divided into separate groups

We need to define an update rule for how to combine two clusters:

• Single-link clustering: the distance between two clusters is the shortest distance between any two points, one from each cluster

• Complete-link clustering: the distance between two clusters is the longest distance between two points, one from each cluster

• Average-link clustering: the distance is the average distance between points of the two clusters

• Centroid clustering: the distance is the distance between the cluster centroids

Hierarchical/Agglomerative Methods

• Based on some method of representing a hierarchy of data points

• One idea: a hierarchical dendrogram (connects points based on similarity)

(Figure: dendrogram over the terms T1 through T8 from the example above.)

Hierarchical Agglomerative (a code sketch of this procedure is given below)

• Compute the distance matrix

• Put each data point in its own cluster

• Find the most similar pair of clusters
  • merge the pair of clusters (show the merger in the dendrogram)
  • update the similarity matrix

• Repeat until all data points are in one cluster

K-Means Optimization Algorithm

Goal. Minimize $\mathcal{L}(x, y)$.

Coordinate Descent (Optimization). Repeat until convergence:
1. Find the optimal $x$ while holding $y$ constant.
2. Find the optimal $y$ while holding $x$ constant.
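The following is a minimal sketch (my own illustration, not code from the lecture) of the hierarchical agglomerative procedure listed above, using the single-link rule: every point starts in its own cluster, and the closest pair of clusters is repeatedly merged.

```python
import numpy as np
from itertools import combinations

def single_link_agglomerative(X, num_clusters):
    """Merge the closest pair of clusters (single-link: shortest distance
    between any two points, one from each cluster) until num_clusters remain."""
    # Pairwise squared Euclidean distance matrix between points.
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(-1)
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > num_clusters:
        # Find the most similar (closest) pair of clusters under single-link.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(dist[a, b]
                                 for a in clusters[pair[0]]
                                 for b in clusters[pair[1]]),
        )
        # Merge cluster j into cluster i.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
print(single_link_agglomerative(X, 2))
```

Replacing the inner `min` with a `max` or a mean over the point pairs gives complete-link or average-link clustering, respectively.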
Coordinate Optimization

Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function. Applied to

$$\mathcal{L}(\boldsymbol{R}, \boldsymbol{M}; \mathcal{S}_N) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\boldsymbol{x}_n - \boldsymbol{\mu}_k\|^2,$$

we repeat until convergence:

• Find the best clusters $\boldsymbol{R}$ given the centres

• Find the best centres $\boldsymbol{M}$ given the clusters

Optimization Algorithm

1. Initialize the centres $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K$ from the data.
2. Repeat until there is no further change in the training loss:
   a. For each $n \in \{1, \ldots, N\}$,
      $$r_{nk} = \begin{cases} 1, & \text{if } k = \arg\min_j \|\boldsymbol{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0, & \text{otherwise.} \end{cases}$$
      We assign the $n$th data point to the closest cluster centre.
   b. For each $k \in \{1, \ldots, K\}$,
      $$\boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \boldsymbol{x}_n}{\sum_n r_{nk}}.$$
      We recompute the cluster means.

Convergence

• The training loss never increases in either step (coordinate descent).

• K-means converges to a local minimum, not necessarily the global minimum. Repeat the algorithm over many initial points and pick the configuration with the smallest training loss.

• The training loss decreases after every update step, so k-means must terminate in finite time: suppose k-means could run forever. There are only finitely many assignment configurations ($K^N$ many), so eventually the same assignment matrix $\boldsymbol{R}$ would be chosen now and chosen again later, which is impossible if the loss keeps decreasing.

Discussion: Initialization

• Empty clusters
  o Pick data points to initialize the clusters

• Bad local minima
  o Initialize many times and pick the solution with the smallest training loss
  o Pick good starting positions

Choose points in your dataset as the initial representatives $\boldsymbol{\mu}_k$.

(Figure: starting position vs. final position of the centroids.)

Problem. How do we choose good starting positions?
Solution. Place them far apart with high probability.

Number of Clusters

How do we choose $K$, the optimal number of clusters?

• Elbow method

• Training loss

• Validation loss

(Figure: training loss vs. number of clusters, with the "elbow" marked.)

Check your understanding

• Clustering gives you an idea of how the data is distributed.

• You have 1000 data vectors that are very similar to each other (e.g., Euclidean distance less than 0.0001). You can only divide them into a few clusters.

• When you use k-means, you usually obtain a global minimum of the loss function.

• When you use k-means, you can never achieve the global minimum of the loss function.

• The number of clusters is a parameter that can be optimized by k-means.

• In K-fold cross-validation, K is a hyperparameter.

• Agglomerative clustering, as introduced in this lecture, has a fixed clustering result given the distance measure, the desired number of clusters, and the tie-breaker.

• Increasing K for k-means always ensures that the optimal solution for that K has a smaller loss.
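To recap the k-means optimization algorithm above, here is a minimal NumPy sketch (an illustration of the stated algorithm, not reference code from the course). It initializes the centres by sampling K data points, then alternates the assignment step and the mean-update step until the training loss stops changing:

```python
import numpy as np

def kmeans(X, K, seed=None, max_iter=100):
    """Minimal k-means sketch: initialize centres from the data, then
    alternate assignment (step a) and mean update (step b)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1. Initialize centres mu_1..mu_K by picking K distinct data points.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    prev_loss = np.inf
    for _ in range(max_iter):
        # a. Assign each x_n to the closest centre (r_nk = 1 for the argmin k).
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # N x K
        assign = sq_dists.argmin(axis=1)
        loss = sq_dists[np.arange(N), assign].sum()
        # 2. Stop when the training loss no longer changes.
        if loss == prev_loss:
            break
        prev_loss = loss
        # b. Recompute each centre as the mean of its assigned points
        #    (keep the old centre if a cluster becomes empty).
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                mu[k] = members.mean(axis=0)
    return assign, mu, loss

# Toy example: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
assign, mu, loss = kmeans(X, K=2, seed=0)
print(mu, loss)
```

Since the result depends on the initialization, in practice one would run this from several random seeds and keep the solution with the smallest training loss, as suggested in the discussion of bad local minima above.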