ETX2250/ETF5922: Data Visualization and Analytics
Hierarchical clustering
Lecturer:
Department of Econometrics and Business Statistics
Week 7
What is cluster analysis?
Cluster: a collection of data objects
Grouping similar objects in a cluster
No pre-defined classes means we are doing unsupervised classification
For analysis of data
Cluster analysis – grouping a set of data objects into clusters
In what contexts would you want to complete a cluster analysis?
Cluster analysis applications:
Market segmentation – targeted marketing
Market structure analysis – identify groups of similar products according to competitive measures of similarity
Finance – develop a balanced portfolio, e.g. volatility, beta, vega, returns over different periods
Land use – identify areas of similar land use in an Earth observation database
Medical – e.g. types of tumour
Similar responses on a survey
Dimension reduction
Clustering methods
Hierarchical Clustering
Agglomerative
1. Start with each observation in a cluster of one
2. Merge the most similar clusters
3. Merge those a bit further apart
4. The result is a sequence of nested clusters
Divisive
1. Start with one large cluster (all observations)
2. Recursively split into smaller clusters
Partitional clustering
K-means
1. Randomly assign initial cluster centres
2. Repeatedly re-assign points to the nearest cluster
3. Minimise within-cluster distances
4. Maximise between-cluster distances
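The k-means steps above can be sketched in a few lines. This is an illustrative Python sketch, not part of the unit's R materials; the function name `kmeans` and the toy points are invented for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means: assign each point to its nearest centre,
    then move each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # step 1: random initial centres
    for _ in range(iters):
        # step 2: re-assign each point to its closest centre
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            groups[j].append(p)
        # step 3: recompute each centre as the within-cluster mean
        centres = [tuple(sum(vals) / len(g) for vals in zip(*g)) if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres, groups

# Two well-separated toy clusters
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centres, groups = kmeans(pts, k=2)
```

With well-separated points like these, the re-assignment loop settles on the two obvious groups regardless of the random start.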
An example
[Figure: points labelled in black with pairwise distances in red, shown over several slides as clusters are successively merged]
Dendrogram
Visually, see what the possible numbers of clusters are by introducing a horizontal line
Exercise: based on the dendrogram on the next slide, if there are to be 7 clusters, list the ID numbers in the cluster containing ID 10
In the hierarchical clustering tutorial, we use R to select the clusters and then investigate cluster properties
Example
Red line = 4 clusters
Green line = 7 clusters
Distance
We need to decide:
How to define the distance between two data points
Recall that the data may be of different types
How to define the distance between two clusters
This will depend on the distance between two points, but once that distance is decided there are still many alternative ways of defining the distance between clusters
Distance
In the example, we used:
Euclidean distance (the usual shortest distance between two points)
Between-cluster distance taken to be the distance between the two closest points, one in each cluster
When you have a data frame whose columns represent many different kinds of variable, defining distance between data points takes some thought.
Distance
Need to decide:
What are “distances” between points (= cases = rows)?
And for the hierarchical method, in addition: what are “distances” between clusters?
Distance is a dissimilarity measure (larger distance, less similar)
For categorical variables, see how many attributes match: matching is a similarity measure (more matching, more similar)
Where a similarity measure s has been used, a corresponding “distance” d is defined to feed into the clustering algorithm:
d = √(1 − s)
Distance between clusters
In this diagram, we assume the distance between clusters is the distance between the closest points.
Distance between clusters
Single linkage – shortest distance between clusters
Complete linkage – longest distance between clusters
Average linkage – average distance between all pairs of points, one in each cluster
Centroid linkage – distance between the cluster centroids
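The four linkage rules can be computed directly from the pairwise distances between two small clusters. A Python sketch for illustration (the cluster coordinates are made up for the example):

```python
import numpy as np

# Two hypothetical clusters of 2-D points
c1 = np.array([[0.0, 0.0], [0.0, 1.0]])
c2 = np.array([[3.0, 0.0], [4.0, 0.0]])

# Matrix of all distances between a point in c1 and a point in c2
pair = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2)

single = pair.min()      # single linkage: shortest pair
complete = pair.max()    # complete linkage: longest pair
average = pair.mean()    # average linkage: mean over all pairs
centroid = np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))  # centroid linkage
```

Note that single ≤ centroid ≤ average ≤ complete here; the rules can disagree about which clusters are "closest", which is why they produce different dendrograms.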
Distance Matrix to Dendrogram
     A   B   C   D   E
A    0   2   8  10  13
B    2   0  10  15  14
C    8  10   0   1   5
D   10  15   1   0   4
E   13  14   5   4   0

Given these distances, create a dendrogram for:
1. Complete linkage
2. Single linkage
Complete linkage
library(ggplot2)
library(ggdendro)
library(ade4)
hc_complete <- hclust(as.dist(dist_matrix), method = "complete")
ggdendrogram(hc_complete) + theme_gray()
Breaking down the steps
Let’s start with the smallest distance:
C–D (dist 1): this is the first cluster; recompute the similarities by treating C and D as one observation
A–B (dist 2): this is the second cluster; recompute the similarities
For the next step, look at the distances D–E and C–E. Which is longer? (D–E = 4, C–E = 5). For complete linkage, use the maximum distance between clusters
A new cluster C–D–E is formed at height 5
Now look for the maximum distance between A–B and C–D–E
A–B to C–D–E = 15
Single linkage
hc_single <- hclust(as.dist(dist_matrix), method = "single")
ggdendrogram(hc_single) + theme_gray()
Breaking down the steps
Let’s start with the smallest distance:
C–D (dist 1): this is the first cluster; recompute the similarities by treating C and D as one observation
A–B (dist 2): this is the second cluster; recompute the similarities
For the next step, look at the distances D–E and C–E. Which is shorter? (D–E = 4, C–E = 5). For single linkage, use the minimum distance between clusters
A new cluster C–D–E is formed at height 4
Now look for the minimum distance between A–B and C–D–E
A–B to C–D–E = 8
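The merge heights from the worked example can be checked against a standard implementation. A Python sketch using SciPy for illustration (the tutorial itself uses R's hclust); the condensed vector lists the A–E distances in the pair order (A,B), (A,C), (A,D), (A,E), (B,C), (B,D), (B,E), (C,D), (C,E), (D,E):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed distance vector for points A, B, C, D, E
d = np.array([2, 8, 10, 13, 10, 15, 14, 1, 5, 4], dtype=float)

# Column 2 of the linkage matrix holds the height of each successive merge
complete_heights = linkage(d, method="complete")[:, 2]  # merges at 1, 2, 5, 15
single_heights = linkage(d, method="single")[:, 2]      # merges at 1, 2, 4, 8
```

Both methods agree on the first two merges (C–D at 1, A–B at 2) and differ only once clusters contain several points, exactly as in the step-by-step breakdown.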
Centroid
A centroid is the average observation of a cluster
That is, the average value for each variable across the points in the cluster
The centroid doesn’t have to be an actual observation
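A minimal sketch of the definition, in Python for illustration (the three observations are made up): the centroid is just the column-wise mean.

```python
import numpy as np

# A hypothetical cluster of three observations on two variables (Age, Children)
cluster = np.array([[61, 2],
                    [27, 0],
                    [35, 1]])

# The centroid: the mean of each variable across the cluster
centroid = cluster.mean(axis=0)
# Here the centroid is (41, 1), which is not an actual observation in the cluster
```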
Hierarchical Clustering: Pros and Cons
Hierarchical clustering methods differ in the method of representing similarity between clusters, each with advantages and disadvantages
Single linkage
Most versatile algorithm
Poorly delineated cluster structures within the data can produce unacceptable snake-like chains for clusters
Complete linkage
Eliminates the chaining problem
Only considers the outermost observations in a cluster, so can be heavily affected by outliers
Hierarchical Clustering: Pros and Cons
Average linkage
Based on the average similarity of all individuals in a cluster; tends to generate clusters with small within-cluster variation
Less affected by outliers
Centroid linkage
Measures the distance between cluster centroids
Less affected by outliers
Different types of distance
Numerical distance
For numerical variables, dissimilarity is measured by:
Euclidean distance
Manhattan distance
Many others
Numerical distance
Euclidean distance becomes smaller as observations become more similar with respect to their variable values. The same is true for Manhattan distance.
Euclidean distance: d_E = √((x₂ − x₁)² + (y₂ − y₁)²)
Manhattan distance: d_M = |x₂ − x₁| + |y₂ − y₁|
Euclidean distance
Most common method to measure the distance between observations, when observations include continuous variables.
Let observations u = (Age_u, Income_u, Kids_u) and v = (Age_v, Income_v, Kids_v) each comprise measurements of 3 variables.
The Euclidean distance between observations u and v is
d_{u,v} = √((Age_u − Age_v)² + (Income_u − Income_v)² + (Kids_u − Kids_v)²)
Manhattan Distance
Manhattan distance: distance between observations that include continuous variables
Let observations u and v each comprise measurements of the variables Age, Income, and Kids. The Manhattan distance between observations u and v is
d_{u,v} = |Age_u − Age_v| + |Income_u − Income_v| + |Kids_u − Kids_v|
Also need to standardize variables for Manhattan distance
Manhattan distance is more robust to outliers
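Both formulas applied to the two KTC customers used later in the slides (Age, Income, Children). A Python sketch for illustration:

```python
u = (61, 57881, 2)  # Age, Income, Children for customer 1
v = (27, 55539, 0)  # Age, Income, Children for customer 2

# Euclidean: square root of the sum of squared differences
euclidean = sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Manhattan: sum of absolute differences = 34 + 2342 + 2
manhattan = sum(abs(a - b) for a, b in zip(u, v))
```

Either way the Income gap dominates the result, which is why the later slides standardize the variables before clustering.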
Matching (Dis)Similarity between observations
We've discussed using distance measures for numeric variables.
How can we define similarity measures for nominal or binary variables?
For binary variables we consider two (there are many others!):
Matching coefficient
Jaccard's coefficient
Matching coefficient
Simplest overlap measure of similarity between two observations
Find the proportion of variables with matching values
Used when clustering observations solely on the basis of categorical variables encoded as 0 or 1
Matching coefficient:
S_M = (number of variables with matching values for observations u and v) / (total number of variables)
Weakness of the matching coefficient
If two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations
However, matching 0 entries do not necessarily imply similarity
If 1 means “Owns a Toyota Land Cruiser” and 0 means “Does not own a Toyota Land Cruiser”, then two observations sharing a zero does not indicate significant similarity
Jaccard's coefficient: a similarity measure that does not count matching zero entries
S_J = (number of variables with matching nonzero values for observations u and v) / ((total number of variables) − (number of variables with matching zero values for observations u and v))
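Both coefficients are one-liners. A Python sketch for illustration; the example rows follow the KTC binary variables (Female, Married, CarLoan, Mortgage) from a later slide:

```python
def matching(u, v):
    """Matching coefficient: proportion of variables whose values agree."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard(u, v):
    """Jaccard's coefficient: matching zeros are dropped from the denominator."""
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    both_zero = sum(a == 0 and b == 0 for a, b in zip(u, v))
    return both_one / (len(u) - both_zero)

u = (1, 1, 0, 1)
v = (0, 1, 0, 1)
# matching(u, v) = 3/4, while jaccard(u, v) = 2/3: the shared 0 no longer counts
```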
Your turn!
For u and v as follows, what is the similarity using:
Matching
Jaccard

u = 0 0 0 0 0 1 1 1 1 1
v = 0 0 0 0 1 1 0 0 0 0
Example: Know Thy Customer
Example
KTC is a financial advising company
It wishes to perform market segmentation of its clients; thus each client is a case or “point”
Variables: Age, Gender, Annual Income, Marital Status, Number of Children, Car Loan or not, Home Mortgage or not
What type is each variable?
How should similarity between cases be measured in relation to values of:
Numerical variables?
Categorical variables?
Example
Corresponding to each customer, KTC has a vector of measurements on seven customer variables: Age, Female, Income, Married, Children, CarLoan, Mortgage

Age  Female  Income  Married  Children  CarLoan  Mortgage
 61       1   57881        1         2        0         1
 27       0   55539        1         0        0         1
...     ...     ...      ...       ...      ...       ...

Row 1: a 61-year-old female with an annual income of $57,881, married with two children, no car loan but a mortgage
Row 2: a 27-year-old male with an annual income of $55,539, married, no children, no car loan but a mortgage
Start by considering only the binary variables: Female, Married, Car Loan, Mortgage
Similarity for Categorical Variables
case  Female  Married  CarLoan  Mortgage
u          1        1        0         1
w          1        1        1         1
v          0        1        0         1

Calculate the similarity between:
u and v
v and w
w and u
using Jaccard and Matching
Similarity for categorical variables
Binary variables in KTC data:
Gender
Married or not
Car loan or not
Home mortgage or not
Given these variables, would you use Matching or Jaccard?
Example
Corresponding to each customer, KTC has a vector of measurements on seven customer variables: Age, Female, Income, Married, Children, CarLoan, Mortgage

Age  Female  Income  Married  Children  CarLoan  Mortgage
 61       1   57881        1         2        0         1
 27       0   55539        1         0        0         1
...     ...     ...      ...       ...      ...       ...
Similarity for continuous variables?

Age  Income  Children
 61   57881         2
 27   55539         0

What is the problem with this?
d_{1,2} = √((61 − 27)² + (57881 − 55539)² + (2 − 0)²)
d_{1,2} = √(1156 + 5484964 + 4)
d_{1,2} = √5486124 ≈ 2342
Similarity for continuous variables?

Age  Income  Children
 61   57881         2
 27   55539         0

What about Manhattan distance?
d_{1,2} = |61 − 27| + |57881 − 55539| + |2 − 0| = 34 + 2342 + 2 = 2378
The differences in scale are smaller, but the variables Age and number of children are still swamped by Income.
Scaling distances
Euclidean and Manhattan distance are highly influenced by the scale on which variables are measured. Therefore, it is common to standardize the units of each variable of each observation u.
Example: Income_u, the value of the variable Income in observation u, is replaced with its z-score.
The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.
Recall: what are z-scores?
Z-scores
Consider values of a variable x = x₁, x₂, …, xₙ
Suppose the mean and standard deviation of this set of values are given by x̄ and s. Then the z-score corresponding to xᵢ is given by
zᵢ = (xᵢ − x̄) / s
If certain variables are known to be more important, deliberately choose unequal weighting
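A quick check of the formula, in Python for illustration (the four ages are made up for the example):

```python
import numpy as np

x = np.array([61.0, 27.0, 35.0, 45.0])  # e.g. an Age column

# z-score each value: subtract the mean, divide by the sample standard deviation
z = (x - x.mean()) / x.std(ddof=1)

# After standardizing, every variable has mean 0 and standard deviation 1,
# so no single variable swamps the distance calculation.
```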
Variables of Mixed Types
A database may contain all types of variables
A weighted formula is used to combine the effects of numerical and categorical variables [see the gower metric in the command daisy in the package cluster]
We focus on cluster analysis using just one type of variable
Limitations of Hierarchical Clustering
Need to calculate and store an n × n distance matrix, infeasible for large datasets
Low stability: change a few records and a very different result may be obtained
Influenced by the choice of metric, especially when using “average”
Sensitive to outliers
Only one pass through the data: misallocations can never be fixed
Interpreting, Profiling, & Validating Clusters
The cluster centroid, a mean profile of the cluster on each clustering variable, is particularly useful in the interpretation stage
Interpretation involves examining the distinguishing characteristics of each cluster’s profile and identifying substantial differences between clusters
If there is no substantial difference between clusters, then other cluster solutions should be examined
The cluster centroid should also be assessed for correspondence with the researcher’s prior expectations based on theory or practical experience
Interpreting, Profiling, & Validating Clusters
Validation is essential in cluster analysis since the clusters are descriptive of structure and require additional support for their relevance:
Cross-validation empirically validates a cluster solution by creating two sub-samples (randomly splitting the sample) and then comparing the two cluster solutions for consistency with respect to the number of clusters and the cluster profiles.
Validation is also achieved by examining differences on variables not included in the cluster analysis but for which there is a theoretical and relevant reason to expect variation across the clusters
How to in R
An example using the Palmer Penguins
R packages: ggdendro, ade4
library(tidytuesdayR)
penguins <- tidytuesdayR::tt_load(2020, week = 31)$penguins
##
## Downloading file 1 of 2: `penguins.csv`
## Downloading file 2 of 2: `penguins_raw.csv`
head(penguins)
## # A tibble: 6 x 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## [output truncated]
Create the binary variables:
library(dplyr)   # for %>% and mutate()
penguins <- penguins %>%
  mutate(torgersen_yn = ifelse(island == "Torgersen", 1, 0),
         female_yn = ifelse(sex == "female", 1, 0))
Let’s also only select the first 50 observations
penguins_min <- penguins[1:50,]
Focus on binary variables
Calculate the distance (method 1 is Jaccard, method 2 is Matching). Which would be best to use in our case?
library(ade4)
library(tidyr)   # for drop_na()
penguins_binary <- penguins_min %>%
  select("torgersen_yn", "female_yn") %>%
  drop_na()   # removes any NA entries
distance_binary <- dist.binary(penguins_binary, method = 2)
Calculate clusters using average linkage (other options are possible!)
cluster_binary<-hclust(distance_binary, method = "average")
Create a dendrogram
library(ggdendro)
ggdendrogram(cluster_binary)
Calculate clusters
Specify the number of clusters we want (k) or the height h at which to cut, and add the result to the data frame
penguins_binary <- penguins_binary %>%
mutate(cut1_binary = cutree(cluster_binary, k= 3))
head(penguins_binary)
## # A tibble: 6 x 3
## torgersen_yn female_yn cut1_binary
##
## 1 1 0 1
## 2 1 1 1
## 3 1 1 1
## 4 1 1 1
## 5 1 0 1
## 6 1 1 1
Focus on continuous variables
Use bill length and body mass for the cluster
Use the scale function to scale the variables
penguins_cont <- penguins_min %>%
  select("bill_length_mm", "body_mass_g") %>%
  drop_na() %>%   # removes any NA entries
  scale() %>%
  data.frame()
distance_cont <- dist(penguins_cont, method = "euclidean")
Hierarchical clustering
Run complete clustering
cluster_cont <- hclust(distance_cont, method = "complete")
Plot the dendrogram
ggdendrogram(cluster_cont)
What about the clusters?
Choose 3 clusters.
penguins_cont <- penguins_cont %>%
mutate(cut1_continuous = factor(cutree(cluster_cont, k = 3)))
Plot the clusters
library(ggplot2)
ggplot(penguins_cont, aes(x = body_mass_g, y = bill_length_mm,
                          colour = cut1_continuous)) +
  geom_point()
A cautionary example
What should we call the different categories?
Larger animal, larger bill – call this group “pelicans”
Larger animal, small bill – call this group “chickens”
Smaller animal, small bill – call this group “sparrows”
But the group names don’t change the birds from being all penguins. Where else have we seen the group names being overinterpreted?
e.g., personality tests look at which items correlate strongly together to form different personality “types”. An example of this is the Eagle – Dove – Owl – Peacock test.
Summary
There are two types of distances to pay attention to in hierarchical clustering:
The distance between points
The distance between clusters
The choice of how to measure these distances has implications for the shape and potential validity of the clusters.
Validation and interpretation is very important, and must be completed carefully!
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.