Lecture 10: Clustering
GGR376
Dr. Adams
Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
Clustering Visual I
Clustering Visual II
Clustering Visual III
Iris Example
Example from: Teja Kodali, Data Analyst at DV Trading LLC, Chicago (IL)
Iris Data
library(datasets)
library(knitr)  # provides kable()
kable(head(iris))
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa
         5.4          3.9           1.7          0.4  setosa
Petal Length and Width
[Figure: "Iris Petals" scatter plot of petal width (0.0-2.5) against petal length (2-6), coloured by species: setosa, versicolor, virginica.]
Clustering
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
# Cluster Sizes
irisCluster$size
[1] 50 52 48
Compare Clusters to Species
Cluster  setosa  versicolor  virginica
1            50           0          0
2             0          48          4
3             0           2         46
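This comparison is a cross-tabulation of the cluster assignments against the true species. In R it can be produced with the base table() function, along these lines (a sketch; the exact call is not shown on the slide):

table(irisCluster$cluster, iris$Species)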
Plot Correct / Incorrect
[Figure: "Iris Petals" scatter plot of petal width against petal length, with points marked as correctly or incorrectly identified and the correct species (setosa, versicolor, virginica) indicated.]
Case Study Example
Kernel density estimation and K-means clustering to profile road accident hotspots
“Every year approximately 3500 people are killed on Britain’s roads and 40,000 are seriously injured.”
“Road traffic accidents according to the WHO are the leading injury related cause of death among people aged 10–24.”
Anderson, T. K. (2009). Kernel density estimation and K-means clustering to profile road accident hotspots. Accident Analysis & Prevention, 41(3), 359-364.
Objectives - Anderson 2009
1. Present a methodology for the identification of high density accident zones using GIS and kernel density estimation (KDE)
2. Add attribute data to the accident zones.
3. Identify similar zones using a K-means clustering algorithm with regard to the attribute data, and compare them.
Methodology - Anderson 2009
5 years (1999–2003) of road accident data for the metropolitan area of London
Kernel density calculated from the traffic accident locations
Assigned attributes to the zones: road length, cycle lane length, pedestrian crossings, London Underground stations, traffic lights, bus stops, schools, and speed cameras
Applied k-means clustering to the attribute data
Kernel Density - Anderson 2009
Hot Spots
Hot spots were identified using a threshold of accidents per cell, converting the point observations to a continuous surface.
Hot Spots
Clustering - Anderson 2009
Attempted to classify each of the 428 hotspots into relatively homogeneous types based on their environmental characteristics
Given the number of hotspots identified by the kernel density estimation, K-means clustering was applied
K-means requires a specification of the number of clusters
Outcomes from Clusters - Anderson 2009
Outcomes from Clusters II - Anderson 2009
Extracted themes based on accident type and environmental conditions.
Policy?
k-means Clustering
k-means clustering aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the centre of the cluster.
k-means Overview
Unsupervised clustering algorithm
You do not supply any information on the clusters.
k is the number of clusters; usually user specified.
An iterative approach
I.e., a solution is found through optimization.
Does not necessarily find a global solution
An NP-hard problem.
Different solutions are possible on each run
Especially as n gets large.
k-means Video
K-means clustering: how it works https://www.youtube.com/watch?v=_aWzGGNrcic
Process
1. Locate k random centroids to initiate the clusters.
2. Compute the distance from each observation to each centroid.
3. Assign each observation to its nearest centroid.
4. Recompute each centroid as the mean of its assigned observations.
k-means
Iterate until convergence or until the maximum number of iterations is reached.
Typically requires fewer than 50 iterations.
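To make the process concrete, here is a minimal, illustrative implementation of this loop in base R (a sketch only: empty clusters are not handled, and in practice R's built-in kmeans() should be used):

simple_kmeans <- function(x, k, max_iter = 50) {
  x <- as.matrix(x)
  # 1. Locate k random centroids to initiate the clusters
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # 2. Distance from each observation (rows) to each centroid (columns)
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k, drop = FALSE]
    # 3. Assign each observation to its nearest centroid
    new_assignment <- apply(d, 1, which.min)
    if (all(new_assignment == assignment)) break  # converged
    assignment <- new_assignment
    # 4. Recompute each centroid as the mean of its assigned observations
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(20)
simple_kmeans(iris[, 3:4], 3)$cluster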
Distance Measure
Euclidean Distance
Straight line distance between two points.
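For two points p = (p1, ..., pd) and q = (q1, ..., qd):

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pd - qd)^2 )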
Depending on the data, you may want to standardize your values.
If the units are drastically different, convert each variable to a mean of 0 and a standard deviation of 1.
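In R, the built-in scale() function performs exactly this standardization, e.g.:

iris_std <- scale(iris[, 3:4])  # each column rescaled to mean 0, standard deviation 1
set.seed(20)
kmeans(iris_std, centers = 3, nstart = 20)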
k-means challenges
1. Selecting k.
2. It assumes there actually are k clusters.
3. It assumes SSE (the sum of squared errors) is the right objective to minimize.
4. It assumes all clusters have roughly the same SSE.
5. It assumes all variables have the same importance for every cluster.
What if we don’t know k
You will not always know how many clusters exist (or are required).
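One common heuristic, not covered in the slides' sources, is the elbow method: run k-means for a range of k values, plot the total within-cluster sum of squares against k, and look for the "bend". A sketch with the iris data:

set.seed(20)
wss <- sapply(1:10, function(k) kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")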
Hierarchical Clustering
k-means is a partitioning approach
The end result is a single set of final clusters.
Hierarchical methods
Divide or aggregate the data in stages.
The end result includes all stages, from a single cluster to n clusters.
Select the k clusters after the analysis (see the sketch below).
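A sketch in base R, using the built-in USArrests data (which appears to be what the dendrogram below shows; that is an assumption). The full hierarchy is fitted once and can then be cut at any chosen k:

hc <- hclust(dist(USArrests), method = "average")  # fit the full hierarchy
cutree(hc, k = 4)                                  # extract k = 4 clusters afterwards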
Bottom Up Method
Each point begins as its own cluster.
Combine the two closest clusters.
Repeat until a single cluster remains.
Dendrogram
[Figure: "Cluster Dendrogram" of US states, hclust(*, "average"), heights 0-150.]
Agglomeration methods (i.e., linkage methods)
Complete linkage
Single linkage
Mean linkage
Centroid linkage
Ward's method
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.
Complete linkage clustering
Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2.
Considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters.
Tends to produce compact clusters.
Single linkage clustering
Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2.
Considers the smallest of these dissimilarities as a linkage criterion.
Tends to produce long, “loose” clusters.
Mean linkage clustering
Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2.
Considers the average of these dissimilarities as the distance between the two clusters.
Centroid linkage clustering
Computes the dissimilarity between the centroid of cluster 1 and the centroid of cluster 2.
Ward’s method
Minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.
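In R's built-in hclust(), these criteria correspond to the method argument (a sketch; "ward.D2" is the implementation of Ward's method for Euclidean distances):

d <- dist(USArrests)
hclust(d, method = "complete")  # complete linkage
hclust(d, method = "single")    # single linkage
hclust(d, method = "average")   # mean linkage
hclust(d, method = "centroid")  # centroid linkage
hclust(d, method = "ward.D2")   # Ward's method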
Spatially Constrained Clustering
Often, when dealing with spatial data, we have a reason to force clusters to be spatially contiguous.
Spatially constrained clustering is the approach for doing so.
Regionalization
Contiguity-constrained clustering
Gerrymandering
https://www.youtube.com/watch?v=Mky11UJb9AY
The Problem
The new areal units must balance two objectives:
Similarity of attributes (non-spatial clustering)
Similarity of location (spatial contiguity)
These two objectives may not be aligned.
Modified k-means
Start with a k-means result.
Split or combine the non-contiguous clusters; the number of clusters k may change.
An inefficient approach.
The Weighting Approach
Clusters without an explicit contiguity constraint.
Multi-objective: include the x-y coordinates as attributes in the clustering (as sketched below).
Does not ensure contiguity.
Choosing a weight for the x-y attributes is challenging.
With strong enough weights, the clusters eventually become spatially compact.
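A sketch of the weighting idea on made-up data (the variable names and the weight value are illustrative, not from the slides):

set.seed(1)
attrs <- matrix(rnorm(200), ncol = 2)  # hypothetical attribute data
xy    <- matrix(runif(200), ncol = 2)  # hypothetical point coordinates
w <- 2  # weight on location; larger w pushes clusters to be spatially compact
km <- kmeans(cbind(scale(attrs), w * scale(xy)), centers = 5, nstart = 20)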
Automatic Zoning Procedure (AZP)
Start with an initial solution (e.g., the output from k-means).
Improve it by swapping polygons between zones; no optimal solution is guaranteed.
An NP-hard problem.
Graph-based methods
Spatial contiguity is represented with a graph:
Observations are nodes.
Contiguity relations are edges.
Clusters are created by splitting edges.
Optimization Problems
Defined as an integer programming problem (decision variables are 1 or 0).
Specify a criterion to optimize,
e.g., the maximum difference between groups.
Common Methods
SKATER
REDCAP
Max-p
Implementations
R
spdep::skater
ClustGeo::hclustgeo
Semi-constrained
GeoDa
Open-source GUI
RedCap
Open-source GUI
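For reference, a minimal SKATER sketch with spdep (here polys is an assumed sf polygon object whose non-geometry columns are all numeric; the call sequence follows the spdep documentation):

library(spdep)
library(sf)
# polys: assumed input, an sf polygon data frame with numeric attribute columns
dat   <- st_drop_geometry(polys)
nb    <- poly2nb(polys)                  # contiguity graph: polygons are nodes
costs <- nbcosts(nb, data = dat)         # edge costs from attribute dissimilarity
lw    <- nb2listw(nb, glist = costs, style = "B")
mst   <- mstree(lw)                      # minimum spanning tree of the graph
res   <- skater(edges = mst[, 1:2], data = dat, ncuts = 4)  # 4 cuts -> 5 contiguous clusters
res$groups                               # cluster membership per polygon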
Assignment
Spatially constrained clustering can be done outside of R, but you will need to describe how you did this in your methods.