Clustering
Information modelling & database systems
in the previous lecture we learnt how to compare objects using distance and similarity measures
in this lecture, we will learn about clustering, which is used to automatically group similar objects together
Information overload
a difficulty a person may have when trying to deal with more information than they are able to process to make sensible decisions
results in delayed decisions or wrong decisions
an increasing problem in the context of big data
Clustering
solution: divide & conquer
cluster analysis divides data into groups (clusters) that are meaningful, useful or both
Clustering for understanding
classes are conceptually meaningful groups of objects that share common characteristics
classes play an important role in how we analyse and describe the world
in the context of understanding data, clusters are potential classes
Clustering for utility
some clustering techniques characterise each cluster in terms of a cluster prototype
a cluster prototype is a data object that is representative of the other objects in the cluster
cluster prototypes can be used as the basis for a number of data analysis or data processing techniques
What is clustering?
Clustering
clustering – a data mining technique that automatically groups objects with similar properties into clusters
objects within a cluster should be similar (or related) to one another and different from (or unrelated to) the objects in other clusters
the greater the similarity within a cluster and the greater the difference between clusters, the better the clustering
unsupervised learning – no annotated training data required!
Clustering
the goal is to:
minimise intra–cluster distances
maximise inter–cluster distances
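for intuition, here is a minimal sketch (assuming Python with numpy and made-up 2-D points) that measures average intra-cluster and inter-cluster distances for a toy clustering:

```python
import numpy as np

# two hypothetical, well-separated 2-D clusters
cluster_a = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])
cluster_b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])

def avg_distance(x, y):
    # average Euclidean distance between every point of x and every point of y
    # (when x == y this includes the zero self-distances; fine for illustration)
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2).mean()

intra = avg_distance(cluster_a, cluster_a)   # small value = cohesive cluster
inter = avg_distance(cluster_a, cluster_b)   # large value = well-separated clusters
print(intra, inter)                          # a good clustering keeps intra << inter
```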
How many clusters?
2, 4 or 6?
notion of a cluster is ambiguous
difficult to decide what constitutes a cluster
the best definition of a cluster depends on the nature of data and desired outputs
Hierarchical vs. partitional clustering
two basic types of clustering:
hierarchical
non–hierarchical (partitional)
Hierarchical clustering
clusters are permitted to have subclusters
a set of nested clusters that are organised in a tree
each node (cluster) in the tree (except the leaves) is the union of its children (subclusters)
the root of the tree contains all objects
Partitional clustering
partitional clustering is simply a division of the set of objects into non–overlapping subsets (clusters)
note that hierarchical clustering can be viewed as a sequence of partitional clusterings
a partitional clustering can be obtained by cutting the tree at a particular level
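as a sketch of this relationship (assuming scipy is available and using made-up points), a hierarchy built with linkage can be cut with fcluster to obtain a flat partition:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical 2-D points forming two obvious groups
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [5.2, 5.1]])

# build the full hierarchy (single link chosen here as an example)
tree = linkage(points, method='single')

# "cut" the tree to obtain a partitional clustering with 2 clusters
labels = fcluster(tree, t=2, criterion='maxclust')
print(labels)   # e.g. [1 1 1 2 2 2]
```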
Complete vs. partial clustering
a complete clustering assigns every object to a cluster
a partial clustering does not
the motivation for partial clustering is that some objects may not belong to well defined groups
many times data objects may represent noise, outliers or uninteresting background
Types of clusters
clustering aims to find useful groups of objects
usefulness is defined by the goals of data analysis
there are several different notions of a cluster that prove useful in practice:
well–separated
prototype–based
graph–based
density–based
shared–property (conceptual)
described by an objective function
Well–separated clusters
each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster
a threshold can be used to specify that all objects in a cluster must be sufficiently close to one another
idealistic definition of a cluster
satisfied only when the data contains natural clusters that are quite far from one another
well–separated clusters do not need to be spherical, but can have any shape
Prototype–based clusters
each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster
for many types of data, the prototype can be regarded as the most central point
in such instances, we commonly refer to prototype–based clusters as centre–based clusters
such clusters tend to be spherical
Graph–based clusters
if the data is represented as a graph, where the nodes are objects and the links represent connections between objects, then a cluster can be defined as a connected component
a group of objects that are connected to one another, but are not connected to objects outside of the group
an important example of graph–based clusters are contiguity–based clusters, where two objects are connected if they are within a specified distance from each other
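a minimal sketch of contiguity-based clusters, assuming scipy and hypothetical points: threshold the pairwise distances to build a graph, then take its connected components as clusters:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# hypothetical points; two groups separated by a gap larger than the threshold
points = np.array([[0.0, 0.0], [0.4, 0.0], [0.8, 0.1],
                   [4.0, 4.0], [4.3, 4.2]])

threshold = 1.0                              # "connected" if within this distance
dist = squareform(pdist(points))             # full pairwise distance matrix
adjacency = csr_matrix(dist <= threshold)    # link sufficiently close points

# each connected component of the graph is one contiguity-based cluster
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)                    # e.g. 2, [0 0 0 1 1]
```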
Density–based clusters
a cluster is a dense region of objects that is surrounded by a region of low density
a density–based definition of a cluster is often used when the clusters are irregular or intertwined and when noise and outliers are present
a contiguity–based definition of a cluster would not work well since the noise would tend to form connections between clusters
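one well-known density-based algorithm (not covered in this lecture) is DBSCAN; a minimal sketch with made-up data, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical data: two dense regions plus a stray point
points = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
                   [3.0, 3.0], [3.1, 2.9], [3.0, 3.1],
                   [10.0, 10.0]])              # likely noise

# eps and min_samples control what counts as a "dense" neighbourhood
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # noise points are labelled -1, e.g. [0 0 0 1 1 1 -1]
```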
Shared–property (conceptual) clusters
a group of objects that share some property
encompasses all previous definitions of clusters
… but also covers new types of clusters
points in a cluster share some general property that derives from the entire set of objects
however, too sophisticated a notion of a cluster takes us into the area of pattern recognition
Clusters described by an objective function
finds clusters that minimise or maximise an objective function
enumerate all possible ways of dividing the points into clusters and evaluate them using the objective function (NP–hard!)
can have global or local objectives
hierarchical clustering algorithms typically have local objectives
partitional clustering algorithms typically have global objectives
Clustering strategies
Hierarchical clustering
hierarchical clustering = a method of cluster analysis which seeks to build a hierarchy of clusters
two strategies:
agglomerative (bottom-up approach)
each instance starts separately in its own cluster
pairs of clusters are merged recursively
divisive (top-down approach)
all instances start together in a single cluster
splits are performed recursively
the result is usually presented as a dendrogram
… may correspond to a meaningful taxonomy
Hierarchical clustering
nested cluster diagram
dendrogram
Agglomerative hierarchical clustering
start with individual points as clusters
successively merge the two closest clusters until only one cluster remains
basic algorithm:
Compute the proximity matrix.
repeat
Merge the closest two clusters.
Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
until only one cluster remains.
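a rough Python sketch of this basic algorithm, assuming a precomputed distance matrix and using the single-link (MIN) update described later; it is an illustration, not the definitive implementation:

```python
import numpy as np

def agglomerative_single_link(dist):
    """Naive sketch of the basic algorithm on a precomputed distance matrix.
    Returns the list of merges (cluster members, merge distance) in order."""
    dist = dist.astype(float)
    np.fill_diagonal(dist, np.inf)            # never merge a cluster with itself
    clusters = [[i] for i in range(len(dist))]
    active = list(range(len(dist)))
    merges = []

    while len(active) > 1:
        # find the closest pair of active clusters in the proximity matrix
        d, i, j = min((dist[i, j], i, j) for i in active for j in active if i < j)
        # merge the closest two clusters
        clusters[i] = clusters[i] + clusters[j]
        merges.append((clusters[i], d))
        active.remove(j)
        # update the proximity matrix: single link keeps the minimum distance
        for k in active:
            if k != i:
                dist[i, k] = dist[k, i] = min(dist[i, k], dist[j, k])
    return merges

# example usage with a hypothetical 4-point distance matrix
D = np.array([[0, 1, 4, 5],
              [1, 0, 3, 6],
              [4, 3, 0, 2],
              [5, 6, 2, 0]], dtype=float)
print(agglomerative_single_link(D))   # [([0, 1], 1.0), ([2, 3], 2.0), ([0, 1, 2, 3], 3.0)]
```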
Cluster proximity
the key operation of the algorithm is the computation of the proximity between clusters
the definition of cluster proximity differentiates the various agglomerative hierarchical techniques
cluster proximity is typically defined with a particular type of cluster in mind
e.g. many agglomerative hierarchical clustering techniques, such as MIN, MAX & group average, are defined with graph–based clusters in mind
proximity?
Cluster proximity
MIN defines cluster proximity as the proximity between the closest two points in different clusters
this yields contiguity-based clusters
MAX defines cluster proximity as the proximity between the farthest two points in different clusters
alternative names:
MIN = single link
MAX = complete link
Cluster proximity
the group average technique defines cluster proximity as the average pair–wise proximity over all pairs of points from different clusters
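a small sketch (assuming scipy and hypothetical 2-D points) computing the three cluster-proximity definitions described above, MIN, MAX and group average:

```python
import numpy as np
from scipy.spatial.distance import cdist

# two hypothetical clusters of 2-D points
cluster_1 = np.array([[0.0, 0.0], [0.5, 0.5]])
cluster_2 = np.array([[3.0, 3.0], [3.5, 2.5], [4.0, 3.0]])

pairwise = cdist(cluster_1, cluster_2)   # all distances between the two clusters

single_link   = pairwise.min()    # MIN: distance between the closest pair of points
complete_link = pairwise.max()    # MAX: distance between the farthest pair of points
group_average = pairwise.mean()   # average of all pairwise distances
print(single_link, complete_link, group_average)
```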
Cluster proximity
if, instead of a graph–based view, we take a prototype–based view…
each cluster is represented by a centroid
proximity is commonly defined as the proximity between cluster centroids
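and, under the prototype-based view, a minimal sketch of centroid proximity for the same hypothetical clusters:

```python
import numpy as np

cluster_1 = np.array([[0.0, 0.0], [0.5, 0.5]])
cluster_2 = np.array([[3.0, 3.0], [3.5, 2.5], [4.0, 3.0]])

centroid_1 = cluster_1.mean(axis=0)   # prototype = mean of the cluster's points
centroid_2 = cluster_2.mean(axis=0)
centroid_distance = np.linalg.norm(centroid_1 - centroid_2)
print(centroid_distance)              # proximity between the cluster centroids
```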
Single–link (MIN) hierarchical clustering
Single link (MIN) hierarchical clustering
cluster proximity is defined as the proximity between the closest two points in different clusters
similarity matrix:
      1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00
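a small worked sketch in Python: because the matrix stores similarities, the "closest" pair of points between two clusters is the most similar pair, so single link takes the maximum similarity (indices below are 0-based):

```python
import numpy as np

# the similarity matrix above (rows/columns are points 1-5)
sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                [0.90, 1.00, 0.70, 0.60, 0.50],
                [0.10, 0.70, 1.00, 0.40, 0.30],
                [0.65, 0.60, 0.40, 1.00, 0.80],
                [0.20, 0.50, 0.30, 0.80, 1.00]])

def single_link_similarity(c1, c2):
    # with similarities, the closest two points are the MOST similar pair
    return max(sim[i, j] for i in c1 for j in c2)

# 0-based indices: point 1 -> 0, point 2 -> 1, ...
print(single_link_similarity([2], [0, 1]))   # sim({3}, {1,2}) = 0.70
print(single_link_similarity([2], [3, 4]))   # sim({3}, {4,5}) = 0.40
```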
Single link (MIN) hierarchical clustering
start with all points as singleton clusters:
6 clusters: {a}, {b}, {c}, {d}, {e}, {f}
at each step, merge the two clusters that contain the two closest points
Single link (MIN) hierarchical clustering
c and f are the two closest points
5 clusters: {a}, {b}, {c, f}, {d}, {e}
Single link (MIN) hierarchical clustering
{b} and {e} are the two closest clusters
4 clusters: {a}, {b, e}, {c, f}, {d}
Single link (MIN) hierarchical clustering
{c, f} and {b, e} are the two closest clusters because of the proximity between b and c
3 clusters: {a}, {b, e, c, f}, {d}
Single link (MIN) hierarchical clustering
d is closer to c than a is to b
2 clusters: {a}, {b, c, d, e, f}
Single link (MIN) hierarchical clustering
merge the only two remaining clusters
1 cluster: {a, b, c, d, e, f}
Strength/limitation of MIN
strength: can handle non–elliptical shapes
limitation: sensitive to noise and outliers
Complete–link (MAX) hierarchical clustering
Complete link (MAX) hierarchical clustering
cluster proximity is defined as the proximity between the farthest two points in different clusters
similarity matrix:
sim({3}, {1, 2}) = sim(3, 1) = 0.10
sim({3}, {4, 5}) = sim(3, 5) = 0.30
sim({1, 2}, {4, 5}) = sim(1, 5) = 0.20
      1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00
Complete link (MAX) hierarchical clustering – step–by–step merges (figures)
Strength/limitation of MAX
strength: less susceptible to noise and outliers
limitation: tends to break large clusters
Hierarchical clustering: group average
Hierarchical clustering: group average
cluster proximity is defined as the average of pair–wise proximity between points in the two clusters
similarity matrix:
      1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00
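a small worked sketch for the group average definition on the same similarity matrix (0-based indices), averaging all pairwise similarities between two clusters:

```python
import numpy as np

sim = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
                [0.90, 1.00, 0.70, 0.60, 0.50],
                [0.10, 0.70, 1.00, 0.40, 0.30],
                [0.65, 0.60, 0.40, 1.00, 0.80],
                [0.20, 0.50, 0.30, 0.80, 1.00]])

def group_average_similarity(c1, c2):
    # average of all pairwise similarities between the two clusters
    return np.mean([sim[i, j] for i in c1 for j in c2])

# 0-based indices: point 1 -> 0, point 2 -> 1, ...
print(group_average_similarity([2], [0, 1]))     # (0.10 + 0.70) / 2 = 0.40
print(group_average_similarity([0, 1], [3, 4]))  # (0.65 + 0.20 + 0.60 + 0.50) / 4 = 0.4875
```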
Hierarchical clustering: group average – step–by–step merges (figures)
Strength/limitation of group average
group average is a compromise between MIN and MAX
strength: less susceptible to noise and outliers
limitation: biased towards spherical clusters
Hierarchical clustering: Summary
comparison of MIN, MAX and group average clusterings (figures)
Time and space requirements
n = number of points
space: O(n²)
it stores an n × n proximity matrix
time: O(n³)
there are n steps
at each step the n × n proximity matrix needs to be searched and updated
can be reduced to O(n² log n) in some approaches
Limitations
once a decision has been made to combine two clusters, it cannot be undone
no objective function is directly optimised
different schemes have problems with one or more of the following:
sensitivity to noise and outliers
difficulty handling different sized clusters and convex shapes
breaking large clusters
K–means clustering
Non-hierarchical clustering
non-hierarchical clustering attempts to directly partition the dataset into a set of disjoint clusters
k-means clustering aims to partition the dataset into k clusters in which each instance belongs to the cluster with the nearest mean
basic principles:
each cluster is associated with its centre (centroid)
each point is assigned to the cluster with the closest centroid
K–means algorithm
Select k points as the initial centroids.
repeat
Form k clusters by assigning all points to the closest centroid.
Recompute the centroid of each cluster.
until the centroids don’t change.
initial centroids are often chosen randomly
stopping condition is often changed to “until relatively few points change clusters”
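a minimal numpy sketch of this algorithm (hypothetical data, no handling of empty clusters); a sketch rather than a definitive implementation:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal sketch of the basic k-means algorithm (assumes no cluster empties)."""
    rng = np.random.default_rng(seed)
    # 1. select k points as the initial centroids (randomly, as noted above)
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iter):
        # 2. form k clusters by assigning every point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # 3. recompute the centroid of each cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

        # 4. stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# example usage on hypothetical data: two blobs around (0, 0) and (3, 3)
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0.0, 0.2, (20, 2)), rng.normal(3.0, 0.2, (20, 2))])
labels, centroids = kmeans(data, k=2)
```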
K–means clustering
works well when the dataset is composed of a number of distinct classes
the number of clusters k must be specified
problem: How to choose k objectively?
K-means will converge for most common similarity measures in the first few iterations
less computationally intensive than hierarchical methods: O(n k i a), where n, k, i and a are the numbers of points, clusters, iterations and attributes
modest space requirements because only points and centroids are stored: O((n + k) a)
Two different k–means clusterings
figure: the same set of original points grouped into an optimal and a sub–optimal clustering
Importance of choosing initial centroids
Problems with choosing initial centroids
if there are k “real” clusters then the chance of selecting one centroid from each cluster is small
… very small when k is large! (a rough calculation is sketched below)
sometimes the initial centroids will readjust themselves in the “right” way and sometimes they will not
possible “solutions”:
multiple runs may help
sample the data and use hierarchical clustering to determine initial centroids
select the most widely separated points
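a rough illustration of why this chance is small: assuming k equally sized "real" clusters and k initial centroids picked uniformly at random, the probability that every cluster receives exactly one centroid is k!/k^k:

```python
from math import factorial

def prob_one_centroid_per_cluster(k):
    # assumes k equally sized clusters and k centroids drawn uniformly at random
    return factorial(k) / k**k

for k in (2, 5, 10):
    print(k, prob_one_centroid_per_cluster(k))
# k = 2  -> 0.5
# k = 5  -> 0.0384
# k = 10 -> ~0.00036
```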
Limitations of k–means
has problems when clusters differ in size or density, or have non–spherical shapes
Evaluation
Evaluation
evaluating clustering algorithms is complex because it is difficult to find an objective measure of quality of clusters
typical objective functions in clustering formalise the goal of attaining:
cohesion – high intra-cluster similarity
(instances within a cluster are similar)
separation – low inter-cluster similarity
(instances from different clusters are dissimilar)
Graph–based view of cohesion and separation
mathematically, cohesion and separation for a graph–based cluster can be expressed as sums of pair–wise proximities:
cohesion(C_i) = Σ_{x ∈ C_i, y ∈ C_i} proximity(x, y)
separation(C_i, C_j) = Σ_{x ∈ C_i, y ∈ C_j} proximity(x, y)
Prototype–based view of cohesion and separation
mathematically, cohesion and separation for a prototype–based cluster can be expressed in terms of the centroids (prototypes) c_i:
cohesion(C_i) = Σ_{x ∈ C_i} proximity(x, c_i)
separation(C_i, C_j) = proximity(c_i, c_j)
Cluster validity
previous definitions of cohesion and separation give us some simple and well–defined measures of cluster validity
they can be combined into an overall validity measure by using a weighted sum:
overall validity = Σ_{i=1}^{K} w_i · validity(C_i)
the weights will vary depending on the cluster validity measure
in some cases, the weights are simply 1 or the cluster sizes
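a minimal sketch of such a weighted-sum validity measure, using prototype-based cohesion as the per-cluster validity and cluster sizes as the weights (the function name and these choices are illustrative assumptions, not a standard API):

```python
import numpy as np

def prototype_validity(points, labels, weights=None):
    """Sketch of an overall validity score: weighted sum of per-cluster cohesion."""
    clusters = [points[labels == c] for c in np.unique(labels)]
    centroids = [c.mean(axis=0) for c in clusters]

    # cohesion of each cluster: total distance of its points to its own centroid
    cohesion = [np.linalg.norm(c - m, axis=1).sum() for c, m in zip(clusters, centroids)]

    # weights: cluster sizes by default (or simply 1 per cluster, as noted above)
    if weights is None:
        weights = [len(c) for c in clusters]

    return sum(w * v for w, v in zip(weights, cohesion))
```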