(Answer)
Euclidean distance and cosine similarity break down in high dimensions, which makes clustering in high-dimensional spaces difficult:
– Almost all pairs of points have similar Euclidean distances (every point is roughly equally far from every other point), or
– The angle between almost any two vectors is close to 90° (nearly orthogonal).
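Both effects can be checked empirically. Below is a minimal sketch (not part of the answer key; the point count and dimensions are arbitrary choices) that draws random Gaussian points and reports how pairwise distances and cosines concentrate as the dimension grows:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for d in (2, 10, 1000):
    X = rng.standard_normal((500, d))     # 500 random points in d dimensions

    dists = pdist(X)                      # all pairwise Euclidean distances
    cosines = 1.0 - pdist(X, "cosine")    # pdist returns 1 - cosine similarity

    # As d grows, the relative spread of distances shrinks toward 0, and the
    # mean |cosine| shrinks toward 0 (angles concentrate near 90 degrees).
    print(f"d={d:5d}  distance spread/mean={dists.std() / dists.mean():.3f}  "
          f"mean |cos|={np.abs(cosines).mean():.3f}")
```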
(Answer) K-means (1.5 pts; a code sketch follows the steps):
1. Initialize k clusters by selecting k points as the initial centroids.
2. Assign each point to the cluster whose centroid is closest to it.
3. Update each centroid based on the points assigned to its cluster.
4. Reassign each point to the cluster whose new centroid is closest to it.
5. Repeat steps 3 and 4 until convergence.
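A minimal NumPy sketch of these steps (Lloyd's algorithm); the function name and defaults are our own, not from the answer key:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means following steps 1-5 above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2 and 4: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: converged once the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```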
Hierarchical clustering (either agglomerative or divisive: 1.5 pts):
Agglomerative (bottom up; a sketch follows the steps):
1. Start with each point as its own cluster.
2. Repeatedly merge the two closest clusters into one.
3. Stop merging when the number of clusters reaches 1 or a pre-defined value.
Divisive (top down): Start with one cluster and recursively split it.
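The agglomerative steps can be run directly with SciPy's hierarchical-clustering routines (a sketch assuming single linkage as the "closest clusters" rule; the data here are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)),    # two well-separated blobs
               rng.normal(6, 1, (30, 2))])

# Steps 1-2: start from singleton clusters and merge the closest pair repeatedly.
Z = linkage(X, method="single")
# Step 3: stop at a pre-defined number of clusters (here 2) instead of merging to 1.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```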
(Answer)

BFR
– Assumptions (2 pts):
  – Clusters are normally distributed around a centroid.
  – The axes of the cluster align with the axes of the space (the dimensions are independent).
– Representing a cluster (2 pts): 2d+1 values (d: dimension)
  – N: the number of points
  – SUM vector: the i-th component is the sum of the coordinates of the points in the i-th dimension
  – SUMSQ vector: the i-th component is the sum of the squares of the coordinates in the i-th dimension
– Data space (1 pt): Euclidean space

CURE
– Assumptions (2 pts): Euclidean space; clusters can have any shape.
– Representing a cluster (2 pts): a set of representative points
– Data space (1 pt): Euclidean space
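A minimal sketch (the class name is ours, not from the answer key) of why the 2d+1 values suffice for BFR: the centroid and per-dimension variance can be recovered from N, SUM, and SUMSQ alone, so the points themselves can be discarded after being summarized:

```python
import numpy as np

class BFRSummary:
    """A BFR cluster summarized by 2d+1 values: N, SUM, SUMSQ."""

    def __init__(self, d):
        self.n = 0                  # N: the number of points
        self.sum = np.zeros(d)      # SUM: per-dimension sum of coordinates
        self.sumsq = np.zeros(d)    # SUMSQ: per-dimension sum of squared coordinates

    def add(self, point):
        # Fold a point into the summary; the point itself need not be kept.
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Per-dimension variance: E[x^2] - (E[x])^2, from the summary alone.
        return self.sumsq / self.n - (self.sum / self.n) ** 2
```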