Spectral Clustering
Introduction
Illustrate the key idea of spectral clustering
Define basic graph notations useful for spectral clustering
Revisiting k-means & Mixture Models
|K-means uses "hard" membership, while mixture models allow "soft" membership.
|Both take a feature/vector representation of the data as input ➔ e.g., Euclidean distance is one natural (dis)similarity measure.
-What if the input data is NOT represented in feature/vector format?
• E.g., graph data.
• E.g., objects with only pair-wise similarities (like individuals on a social network ➔ community detection)
Revisiting k-means & Mixture Models
|In both k-means and mixture models, we look for compact clustering structures.
|In some cases, connected-component structures may be more desirable.
Source: Ng, A.Y., Jordan, M.I., and Weiss, Y. "On spectral clustering: Analysis and an algorithm." Advances in Neural Information Processing Systems. 2002.
Spectral Clustering
|A family of methods for finding such similarity-based clusters
-"Spectral": they use the eigenvalues (spectrum) of the similarity matrix of the data.
-Also called graph clustering or similarity-based clustering
|The objects to be clustered need not be in a vector space.
-The primary feature is the similarity between objects.
-For any pair of objects i and j, we have a value s(i,j) measuring their similarity; all such values form the similarity matrix.
➔ Graphs are intuitive for representing/visualizing such data.
Graph Representation
|Definition: A graph G = (V, E) is defined by V, a set of N vertices, and E, a set of edges.
(Figure: an undirected graph and a directed graph)
|In spectral clustering, we consider undirected graphs.
Graph Representation (1/4)
|Adjacency matrix W of undirected graph
– N × N symmetric binary matrix
– The rows and columns are indexed by the vertices, and the entries represent the edges of the graph:
$$w_{i,j} = \begin{cases} 1 & \text{if vertices } i, j \text{ are connected} \\ 0 & \text{if vertices } i, j \text{ are not connected} \end{cases}$$
– For a simple graph (no self-loops), the diagonal is all zeros.
Graph Representation (2/4)
| Weighted adjacency matrix (sometimes called affinity matrix)
– Allow values other than 0 or 1
– Each edge is weighted by pairwise similarity
$$w_{i,j} = \begin{cases} s(i,j) & \text{if } i, j \text{ are connected} \\ 0 & \text{if } i, j \text{ are not connected} \end{cases}$$
|$w_{i,j}$ may be defined through a kernel function, as in the sketch below.
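As a concrete illustration, here is a minimal numpy sketch of a fully connected affinity matrix built with a Gaussian kernel; the bandwidth `sigma` is an assumed tuning parameter, not something fixed by these slides.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Fully connected affinity matrix from a Gaussian kernel.

    X is an (N, d) array of feature vectors; sigma is an assumed
    bandwidth parameter that must be tuned for the data at hand.
    """
    # Pairwise squared Euclidean distances, shape (N, N).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # simple graph: no self-loops
    return W
```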
Graph Representation (3/4)
|Degree matrix D of undirected graph
– N × N diagonal matrix that contains information about the degree of each vertex.
– Degree $d(v_i)$ of a vertex $v_i$: the number of edges incident to the vertex.
• Extended to the sum of weights of the edges incident to the vertex.
– So, we have:
$$\mathbf{D} = \begin{bmatrix} d(v_1) & & 0 \\ & \ddots & \\ 0 & & d(v_N) \end{bmatrix}$$
Graph Representation (4/4)
|Laplacian matrix L of undirected graph
– $\mathbf{L} = \mathbf{D} - \mathbf{W}$ (degree matrix minus affinity matrix) (unnormalized)
– $\mathbf{L}$ is symmetric and positive semi-definite
– It has N non-negative real-valued eigenvalues
– The smallest eigenvalue is 0; the corresponding eigenvector is the all-ones vector $\mathbf{1}$.
– The smallest non-zero eigenvalue of L is called the spectral gap.
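A quick numpy sanity check of these properties, on a small symmetric affinity matrix with assumed weights:

```python
import numpy as np

# A small symmetric affinity matrix with assumed weights.
W = np.array([[0.0, 0.8, 0.6, 0.1],
              [0.8, 0.0, 0.7, 0.0],
              [0.6, 0.7, 0.0, 0.2],
              [0.1, 0.0, 0.2, 0.0]])

D = np.diag(W.sum(axis=1))  # degree matrix: row sums on the diagonal
L = D - W                   # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)     # ascending eigenvalues
print(np.allclose(eigvals[0], 0.0))      # smallest eigenvalue is 0
print(np.allclose(L @ np.ones(4), 0.0))  # the all-ones vector is its eigenvector
print(eigvals[1])                        # spectral gap: smallest non-zero eigenvalue
```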
Spectral Clustering
A Graph Cut Formulation
Objectives
Define the graph partition formulation
Learn how to solve simple graph partition
Clustering as Graph Partition/Cut
|Find a partition of a graph such that the edges between different groups have a very low weight while the edges within a group have high weight.
|E.g., minimum cut
|More general, consider weighted edges.
(Figure: an example cut of an unweighted graph with CutSize = 2)
2-way Spectral Graph Partitioning
|Weighted adjacency matrix W
-𝑤𝑖,𝑗 : the weight between two vertices i and j
|(Cluster) membership vector $\mathbf{q}$:
$$q_i = \begin{cases} 1 & i \in \text{Cluster } A \\ -1 & i \in \text{Cluster } B \end{cases}$$
$$\mathbf{q}^* = \arg\min_{\mathbf{q} \in \{-1,1\}^n} \text{CutSize}$$
$$\text{CutSize} = J = \frac{1}{4}\sum_{i,j}(q_i - q_j)^2 w_{i,j}$$
Solving the Optimization Problem
$$\mathbf{q}^* = \arg\min_{\mathbf{q} \in \{-1,1\}^n} \frac{1}{4}\sum_{i,j}(q_i - q_j)^2 w_{i,j}$$
|Directly solving the above problem requires combinatorial search → exponential complexity
|How to reduce the computational complexity?
Relaxation Approach
|Key difficulty: each $q_i$ has to be either −1 or 1.
-Relax $q_i$ to be any real number.
-Impose the constraint: $\sum_{i=1}^{n} q_i^2 = n$
$$J = \frac{1}{4}\sum_{i,j}(q_i - q_j)^2 w_{i,j} = \frac{1}{4}\sum_{i,j}(q_i^2 - 2 q_i q_j + q_j^2)\, w_{i,j}$$
$$= \frac{1}{4}\sum_i 2 q_i^2 \sum_j w_{i,j} - \frac{1}{4}\sum_{i,j} 2 q_i q_j w_{i,j}$$
$$= \frac{1}{2}\sum_i q_i^2 d_i - \frac{1}{2}\sum_{i,j} q_i w_{i,j} q_j = \frac{1}{2}\sum_{i,j} q_i (d_i \delta_{i,j} - w_{i,j})\, q_j$$
where $d_i = \sum_j w_{i,j}$ and $\mathbf{D} \equiv [d_i \delta_{i,j}]$
$$\Rightarrow J = \frac{1}{2}\,\mathbf{q}^T(\mathbf{D} - \mathbf{W})\,\mathbf{q}$$
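A numeric check of this identity, reusing the toy matrix above (assumed weights) and an arbitrary ±1 membership vector:

```python
import numpy as np

W = np.array([[0.0, 0.8, 0.6, 0.1],
              [0.8, 0.0, 0.7, 0.0],
              [0.6, 0.7, 0.0, 0.2],
              [0.1, 0.0, 0.2, 0.0]])
L = np.diag(W.sum(axis=1)) - W

q = np.array([1.0, 1.0, -1.0, -1.0])  # an arbitrary ±1 membership vector

# The pairwise sum ...
J_sum = 0.25 * sum((q[i] - q[j]) ** 2 * W[i, j]
                   for i in range(4) for j in range(4))
# ... equals the quadratic form (1/2) q^T L q.
J_quad = 0.5 * q @ L @ q
print(np.isclose(J_sum, J_quad))  # True
```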
Relaxation Approach (cont’d)
|The final problem formulation:
$$\mathbf{q}^* = \arg\min_{\mathbf{q}} J = \arg\min_{\mathbf{q}} \mathbf{q}^T(\mathbf{D} - \mathbf{W})\,\mathbf{q}, \quad \text{subject to } \sum_{i=1}^{n} q_i^2 = n$$
|Solution: the eigenvector of $\mathbf{D} - \mathbf{W}$ corresponding to the second smallest eigenvalue:
$$(\mathbf{D} - \mathbf{W})\,\mathbf{q} = \lambda_2\, \mathbf{q}$$
Graph Laplacian
|𝑳 = 𝑫 − 𝑾
|L is a positive semi-definite matrix.
‒ For any $\mathbf{x}$, we have $\mathbf{x}^T\mathbf{L}\mathbf{x} \ge 0$. (Why? Recall $\mathbf{x}^T\mathbf{L}\mathbf{x} = \frac{1}{2}\sum_{i,j}(x_i - x_j)^2 w_{i,j}$.)
|Minimum eigenvalue $\lambda_1 = 0$ (what is the eigenvector?)
$$0 = \lambda_1 \le \lambda_2 \le \lambda_3 \le \dots \le \lambda_n$$
|The eigenvector that corresponds to the second minimum eigenvalue 𝝀𝟐 gives the best bipartite graph partition.
Recovering the Partitions
|Due to the relaxation, the entries of q can be any real number (not just −1 and 1).
|How to construct the partition based on the eigenvector?
|A simple strategy:
$$A = \{\, i \mid q_i < 0 \,\}, \quad B = \{\, i \mid q_i \ge 0 \,\}$$
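Putting the relaxation and the thresholding together, a minimal numpy sketch of the 2-way partition via the second smallest eigenvector (the Fiedler vector); the function name is illustrative, not from the slides:

```python
import numpy as np

def spectral_bipartition(W):
    """2-way spectral partition of a graph given its affinity matrix W."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # eigenvector of the 2nd smallest eigenvalue
    A = np.where(fiedler < 0)[0]
    B = np.where(fiedler >= 0)[0]
    return A, B
```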
One Obvious Drawback
|Minimum cut does not balance the sizes of the two parts of the bipartition.
|How should we consider other factors like the sizes of the partitions?
Spectral Clustering
Going Beyond MinCut
Discuss several graph cut approaches
Illustrate the algorithm through an example
|In MinCut, we used the following objective function:
$$J_{MinCut} = Cut(A, B)$$
|We noted one drawback of MinCut: the sizes of the partitions are not considered.
|A few extensions exist.
Characterizing Graph Cut
|$Cut(A,B) = \sum_{i \in A, j \in B} w_{ij}$, e.g., $Cut(A,B) = 0.3$
|$Cut(A,A) = \sum_{i \in A, j \in A} w_{ij}$, e.g., $Cut(A,A) = 2.6$
|$Cut(B,B) = \sum_{i \in B, j \in B} w_{ij}$, e.g., $Cut(B,B) = 2.4$
|$Vol(A) = \sum_{i \in A}\sum_{j=1}^{n} w_{ij}$, e.g., $Vol(A) = 5.5$
|$Vol(B) = \sum_{i \in B}\sum_{j=1}^{n} w_{ij}$, e.g., $Vol(B) = 5.1$
|$|A| = 4$, $|B| = 3$
(Figure: the example weighted graph used for these values)
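These quantities are straightforward to compute from W; a small helper sketch (the function names and the index-array representation of vertex sets are assumptions):

```python
import numpy as np

def cut(W, A, B):
    """Total weight of edges between vertex sets A and B (index arrays)."""
    return W[np.ix_(A, B)].sum()

def vol(W, A):
    """Sum of the degrees of the vertices in A."""
    return W[A, :].sum()
```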
The Ratio Cut Method
|The Objective function:
$$J_{RatioCut}(A,B) = Cut(A,B)\left(\frac{1}{|A|} + \frac{1}{|B|}\right)$$
|Attempts to produce balanced clusters.
Example: $J_{RatioCut}(A,B) = 0.3 \times \left(\frac{1}{4} + \frac{1}{3}\right) = 0.175$
The Ratio Cut Method (cont’d)
|Similar to MinCut, the solution can be found by the following generalized eigenvalue problem:
$$(\mathbf{D} - \mathbf{W})\,\mathbf{q} = \lambda\,\mathbf{D}\,\mathbf{q} \quad \Leftrightarrow \quad \mathbf{L}\mathbf{q} = \lambda\,\mathbf{D}\mathbf{q}$$
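Assuming the graph has no isolated vertices (so D is positive definite), this generalized problem can be solved directly with scipy; a minimal sketch:

```python
import numpy as np
from scipy.linalg import eigh

def generalized_fiedler(W):
    """Second eigenvector of the generalized problem L q = lambda D q."""
    D = np.diag(W.sum(axis=1))  # must be positive definite: no isolated vertices
    L = D - W
    # Two-argument eigh solves the generalized symmetric-definite problem.
    eigvals, eigvecs = eigh(L, D)
    return eigvecs[:, 1]
```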
Normalized Cut (NCut)
|In Ratio Cut, the balance of the partitions is defined based on the number of vertices.
|We may consider the “size” of a set based on weights of its edges ➔ Ncut
|The objective function is:
$$J_{NCut}(A,B) = Cut(A,B)\left(\frac{1}{Vol(A)} + \frac{1}{Vol(B)}\right)$$
Example: $J_{NCut}(A,B) = 0.3 \times \left(\frac{1}{5.5} + \frac{1}{5.1}\right) = 0.1134$
Additional Considerations
|In clustering, we should also consider within-cluster connections.
|A good partition should consider
-Inter-cluster connections, and
-Intra-cluster connections.
|1st constraint: the inter-cluster connection should be minimized: $\min Cut(A, B)$
|2nd constraint: the intra-cluster connections should be maximized: $\max Cut(A, A)$ and $\max Cut(B, B)$
|These requirements may be simultaneously satisfied by minimizing the objective function:
$$J_{MinMaxCut}(A,B) = Cut(A,B)\left(\frac{1}{Cut(A,A)} + \frac{1}{Cut(B,B)}\right)$$
Example: $J_{MinMaxCut}(A,B) = 0.3 \times \left(\frac{1}{2.6} + \frac{1}{2.4}\right) = 0.240$
Normalized and MinMaxCut Methods
|Similar to before, we may relax the indicator vector q to real values.
|For both NCut and MinMaxCut, the solution may be found by solving generalized eigenvalue problems.
An Illustrative Example
Graph and Similarity Matrix
Graph and Laplacian Matrix
Solve Eigen Problem
|Pre-processing
-Build the Laplacian matrix L of the graph.
-Compute the eigenvalues Λ and eigenvectors x of the matrix L.
-Map each vertex to the corresponding component of the 2nd eigenvector.
Spectral Clustering
|Split at value 0: Cluster A = negative points, Cluster B = positive points.
(Figure: the example graph with each vertex labeled by its component in the 2nd eigenvector)
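As an end-to-end illustration, here is a sketch on an assumed 6-vertex graph with two obvious communities (two triangles joined by one weak edge); the weights are made up for illustration:

```python
import numpy as np

# Assumed toy graph: two triangles {0,1,2} and {3,4,5} joined by a weak edge.
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
print(np.where(fiedler < 0)[0], np.where(fiedler >= 0)[0])
# Splitting at 0 recovers the two triangles (the sign orientation may flip).
```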
Spectral Clustering
Practical Considerations in Implementation
Discuss several practical implementation issues
Recursive Bi-partitioning
|Recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner.
|Disadvantages: inefficient, stability issues.
K-way Graph Cuts
|Generalizing the 2-way objective functions:
$$J_{RatioCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{Cut(A_i, \bar{A}_i)}{|A_i|}$$
$$J_{NCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{Cut(A_i, \bar{A}_i)}{Vol(A_i)}$$
$$J_{MinMaxCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{Cut(A_i, \bar{A}_i)}{Cut(A_i, A_i)}$$
where $\bar{A}_i$ denotes the complement of $A_i$.
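For reference, a small sketch that evaluates the k-way NCut objective for a given partition (the helper name and the list-of-index-arrays representation are assumptions):

```python
import numpy as np

def k_way_ncut(W, clusters):
    """J_NCut for a partition given as a list of vertex index arrays."""
    n = W.shape[0]
    total = 0.0
    for A in clusters:
        comp = np.setdiff1d(np.arange(n), A)  # complement of A
        total += W[np.ix_(A, comp)].sum() / W[A, :].sum()
    return total
```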
Implementation Considerations (1/4)
|Preprocessing: spectral clustering methods can be interpreted as tools for analysis of the block structure of the similarity matrix.
-Carefully building such matrices can significantly improve the results.
|When building graphs from real data:
-Calculating the similarity matrix is not trivial.
-The choice of similarity function can strongly affect the results of the subsequent steps.
-A Gaussian kernel is often chosen, but other similarities, such as cosine similarity, may be more appropriate for specific applications.
Implementation Considerations (2/4)
|Graph and similarity matrix construction: Laplacian matrices are generally chosen to be positive semi-definite, so their eigenvalues are non-negative.
-A few variants
Unnormalized: $\mathbf{L} = \mathbf{D} - \mathbf{W}$
Symmetric: $\mathbf{L}_{Sym} = \mathbf{D}^{-1/2}\,\mathbf{L}\,\mathbf{D}^{-1/2} = \mathbf{I} - \mathbf{D}^{-1/2}\,\mathbf{W}\,\mathbf{D}^{-1/2}$
Asymmetric: $\mathbf{L}_{As} = \mathbf{D}^{-1}\mathbf{L} = \mathbf{I} - \mathbf{D}^{-1}\mathbf{W}$
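A compact sketch computing all three variants (assuming W has no isolated vertices, so D is invertible):

```python
import numpy as np

def laplacians(W):
    """Unnormalized, symmetric, and asymmetric Laplacians of W."""
    d = W.sum(axis=1)                    # degrees (assumed nonzero)
    L = np.diag(d) - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt  # D^{-1/2} L D^{-1/2}
    L_as = np.diag(1.0 / d) @ L          # D^{-1} L
    return L, L_sym, L_as
```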
Implementation Considerations (3/4)
|Computing the eigenvectors
‒Efficient methods exist for sparse matrices.
|Different ways of building the similarity graphs
‒ε-neighborhood graph.
‒k-nearest neighbor graph.
‒Fully connected graph.
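For instance, a k-nearest-neighbor similarity graph can be sketched with scikit-learn's `kneighbors_graph`; `k` and the Gaussian bandwidth `sigma` are assumed tuning parameters:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_affinity(X, k=10, sigma=1.0):
    """Sparse k-nearest-neighbor similarity graph with Gaussian weights."""
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    G.data = np.exp(-G.data ** 2 / (2.0 * sigma ** 2))  # distances -> similarities
    return G.maximum(G.T)  # symmetrize, since kNN relations are not symmetric
```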
Implementation Considerations (4/4)
|Choosing k:
‒ Similar to k-means, there are many heuristics to use.
‒ The eigengap heuristic: choose a k such that the first k eigenvalues are very small but the (k+1)-th one is relatively large.
|Clustering: simple algorithms other than k-means can be used in the last stage, such as single linkage, k-lines, elongated k-means, mixture models, etc.
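Pulling the pipeline together, a hedged sketch in the style of the cited Ng, Jordan & Weiss (2002) algorithm: embed the vertices with the bottom-k eigenvectors of the symmetric normalized Laplacian, row-normalize, and cluster with k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """k-way spectral clustering in the style of Ng, Jordan & Weiss (2002)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L_sym = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # ascending eigenvalues
    # For large sparse graphs, scipy.sparse.linalg.eigsh is much cheaper.
    U = eigvecs[:, :k]                        # bottom-k eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```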
Recap: Pros and Cons of Spectral Clustering
|Advantages:
-Does not make strong assumptions on the forms of the clusters.
-Easy to implement, and can be implemented efficiently even for large data sets as long as the similarity graph is sparse.
-Good clustering results.
-Reasonably fast for sparse data sets of several thousand elements.
|Disadvantages:
- May be sensitive to the choice of parameters for the neighborhood graph (e.g., ε, k, or the kernel bandwidth).
- Computationally expensive for large datasets.