
ETX2250/ETF5922: Data Visualization and Analytics
K means clustering
Lecturer:
Department of Econometrics and Business Statistics 
 Week 8

Partitional Clustering
1. Randomly assign points to clusters
2. Repeatedly re-assign points to clusters
3. Minimise within-cluster distances
4. Maximise between-cluster distances
One example is k-means, which we focus on today

Comparing clustering algorithms

Hierarchical clustering
Suitable for small data sets (<500 observations)
Easily examine solutions with increasing numbers of clusters
Convenient method if you want to observe how clusters are nested

K-means clustering
Suitable if you know how many clusters you want
Works well on large data sets
Good for summarising data with k "average" observations that describe the data with the minimum amount of error

Obtaining clusters: K-means method
Non-hierarchical clustering method
Requires the number of clusters to be specified before assigning observations
No single objective method for determining the number of clusters

Obtaining clusters: K-means method
Consider the following:
Want cohesion within clusters, diversity between them
How much improvement does one more cluster provide?
Single-member or very small clusters are not usually useful
All clusters should be significantly different across the set of clustering variables
The utility of the clusters ultimately depends on external validation:
Other variables not used in the analysis
External information, e.g. location, geopolitical information

K-means Clustering
Given the value of K, the K-means algorithm randomly partitions the observations into K clusters
After all observations have been assigned to a cluster, the resulting cluster centroids are calculated
Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid (usually by Euclidean distance)
Continue until the assignments no longer change or a predetermined maximum number of iterations is reached

K-means Clustering
The number of clusters K needs to be specified; it is usually advisable to try several values of K
The initial K clusters are chosen randomly, and the results depend strongly on this particular starting point, so the procedure should be repeated a number of times (>10) with different starting points
The "best" result is chosen
A hand-rolled sketch of this assign/recompute loop is given below.
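To make the loop concrete, here is a minimal hand-rolled sketch in R. It is purely illustrative: R's built-in kmeans() uses a more refined algorithm (Hartigan-Wong by default), and the function name simple_kmeans, plus the absence of any safeguard against empty clusters, are my own simplifications rather than lecture code.

simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 1: pick k random observations as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    new_assignment <- max.col(-d)   # nearest centroid for each point
    # Stop when no assignment changes
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: recompute each centroid as the mean of its cluster
    for (j in seq_len(k)) {
      centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}

For a lucky starting point, simple_kmeans(ktc_numeric, k = 3) should give clusters broadly similar to the kmeans() fit used later in these slides; running it from different random starts is exactly why the "repeat >10 times" advice above matters.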

The algorithm in pictures
[Figure sequence: a step-by-step illustration of the k-means algorithm, from random initialisation, through assigning points to the nearest centroid and recomputing the centroids, to convergence. Artwork by @allison_horst.]

Centroid
A centroid is the average observation of a cluster:
The average value for each variable across the observations in the cluster
A centroid does not have to be an actual observation (see the small example below)
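As a tiny numeric illustration (my own, not from the lecture materials), take three observations on two variables:

# Three observations on two variables; the centroid is the column-wise mean
cluster <- rbind(c(1, 10),
                 c(3, 14),
                 c(5, 12))
colMeans(cluster)
## [1]  3 12

The centroid (3, 12) is not itself one of the three observations.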

Recall KTC Example
Example: continuous variables in KTC data
These are: Age, Income, Children (columns 2, 4 and 6)

ID  Age  Female  Income  Married  Children  CarLoan  Mortgage
 1   48       1   17546        0         1        0         0
 2   40       0   30085        1         3        1         1
 3   51       1   16575        1         0        1         0
 4   23       1   20375        1         3        0         0

Use Euclidean distance: the variables will need standardising so that each of the three variables is on a similar scale. This is easily done in R.

Within Sum of Squares
Use $\lVert \cdot \rVert$ to represent Euclidean distance in $n$-dimensional space, where $n$ is the number of variables
Suppose we have $k$ clusters, call them $S_1, \dots, S_k$
We are trying to select the clusters in such a way as to minimise the "within sum of squares":

$$\mathrm{WSS} = \sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \bar{m}_j \rVert^2$$

where $\bar{m}_j$ is the centroid of cluster $S_j$.
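A direct translation of this formula into R might look as follows; the helper name wss is my own assumption, and kmeans() reports the same quantity as tot.withinss.

# Within sum of squares for a given cluster assignment: sum over clusters
# of squared distances from each point to its cluster centroid
wss <- function(X, cluster) {
  X <- as.matrix(X)
  sum(sapply(unique(cluster), function(j) {
    Xj <- X[cluster == j, , drop = FALSE]
    m_j <- colMeans(Xj)          # centroid of cluster j
    sum(sweep(Xj, 2, m_j)^2)     # squared distances to that centroid
  }))
}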

Total SS and between SS
Total SS = the sum of squared distances from each data point to the global sample mean
To obtain the between SS:
1. For each cluster, calculate the squared distance from the centroid of the cluster to the global sample mean
2. Multiply by the number of points in the cluster
3. Add up the results for each cluster
If there were no discernible pattern of clustering, the means of the groups would all be close to the global mean, and the between SS would be a very small fraction of the total SS. The decomposition of total SS into within SS plus between SS is sketched below.
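One useful identity, easy to verify from a kmeans() fit (which stores all three quantities), is that total SS = within SS + between SS. A quick sketch, using R's built-in iris data purely for illustration:

# Verify: totss equals tot.withinss + betweenss on any kmeans fit
fit <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
fit$totss                           # total SS
fit$tot.withinss + fit$betweenss    # the same number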

Visualising the sums of squares
[Figure sequence: a two-variable scatter plot of the clusters, annotated in turn with the group centroids, the overall centroid, the within sum of squares, the total sum of squares, and the between sum of squares.]

Outcome measure
The algorithm repeats this process (calculate cluster centroids, assign observations to the cluster with the nearest centroid) until there is no change in the clusters or a specified maximum number of iterations has been reached
Since we want little variation within clusters and a lot of variation between clusters, the ratio Between SS / Total SS is a possible measure of how good the clustering that k-means has found
One rule of thumb is that the ratio Between SS / Within SS should exceed one (i.e. there is more variation between clusters than within)
These ratios are easy to read off a kmeans fit, as sketched below.
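Here km is assumed to be a fitted kmeans object (such as the one created later in these slides):

# Both ratios from the components kmeans() returns
km$betweenss / km$totss          # share of total variation that is between clusters
km$betweenss / km$tot.withinss   # rule of thumb: should exceed 1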

How to in R

Know thy Customer
Remember from your tutorials last week
library(tidyverse)  # assumed loaded in the tutorials: read_csv(), %>%, ggplot() etc.
ktc_df <- read_csv(here::here("data/KTC.csv"))
head(ktc_df)
## # A tibble: 6 x 8
##      ID   Age Female Income Married Children CarLoan Mortgage
##   <dbl> <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
## 1     1    48      1  17546       0        1       0        0
## 2     2    40      0  30085       1        3       1        1
## 3     3    51      1  16575       1        0       1        0
## 4     4    23      1  20375       1        3       0        0
## 5     5    57      1  50576       1        0       0        0
## 6     6    57      1  37870       1        2       0        0

Focus only on the numeric predictors:
ktc_numeric <- ktc_df %>%
  select(ID, Age, Income, Children) %>%
  # We move the ID to rownames so we can keep track of the rows but ID
  # doesn't get included in the clustering (which it would be if we left
  # it in the data frame)
  column_to_rownames(var = "ID") %>%
  # Remember last week? k-means has the same sensitivity to scale,
  # so we need to scale all our numeric variables.
  scale()

head(ktc_numeric)
## Age Income Children
## 1 0.1559026 -0.7637480 0.06169096
## 2 -0.4574848 0.1512843 1.91241969
## 3 0.3859229 -0.8346067 -0.86367341
## 4 -1.7609332 -0.5573020 1.91241969
## 5 0.8459635 1.6466130 -0.86367341
## 6 0.8459635 0.7193939 0.98705533
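A quick sanity check (my addition, not in the lecture code): after scale(), every column should have mean approximately 0 and standard deviation 1.

round(colMeans(ktc_numeric), 10)   # all ~0
apply(ktc_numeric, 2, sd)          # all 1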

How do we do kmeans?
set.seed(145)
km <- kmeans(ktc_numeric, centers = 3, nstart = 25)

centers is the number of clusters. Note that for k-means (unlike hierarchical clustering), we need to specify this. You can also specify a set of cluster centers.
nstart is the number of times you run through the procedure. You only need to specify this if you specify the number of clusters (why? more on that later...)

What does km look like?

km
## K-means clustering with 3 clusters of sizes 8, 11, 11
##
## Cluster means:
##          Age     Income   Children
## 1 -0.6300000 -0.4368752  1.3340670
## 2  0.9644588  0.9939719 -0.3589292
## 3 -0.5062770 -0.6762445 -0.6113013
##
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
##  3  1  3  1  2  2  3  2  1  1  2  3  3  2  3  3  1  2  2  3  2  1  2  3  1  2
## 27 28 29 30
##  3  3  1  2
##
## Within cluster sum of squares by cluster:

What do the clusters look like?

Let's use our visualisation skills!
ktc_numeric %>%
  # This is technically not a data frame (scale() returns a matrix),
  # so let's make it one
  as.data.frame() %>%
  # Add in the cluster allocations
  mutate(k3 = factor(km$cluster)) %>%
  ggplot(data = .,
         aes(x = Age,
             y = Income,
             colour = k3)) +
  geom_point() +
  ggthemes::scale_color_colorblind()

[Figure: scatter plot of Income against Age, coloured by the three cluster allocations.]

How can we summarise the clusters?
Cluster 1 is characterised by younger, low income customers
Cluster 2 is characterised by older, high income customers
Cluster 3 is characterised by younger, low income customers
It appears that the separation between clusters 1 and 3 is not good, because they appear very similar…

What do the clusters look like (pt 2)?
ktc_numeric %>%
  # This is technically not a data frame, so let's make it one
  as.data.frame() %>%
  # Add in the cluster allocations
  mutate(k3 = factor(km$cluster)) %>%
  ggplot(data = .,
         aes(x = Age,
             y = Income,
             colour = k3)) +
  geom_point() +
  facet_grid(. ~ Children) +
  ggthemes::scale_color_colorblind()

What do the clusters look like (pt 2)?
[Figure: the same scatter plot of Income against Age, coloured by cluster, now faceted by the standardised number of Children.]

How can we summarise the clusters (pt 2)?
Cluster 1 is characterised by younger, low income customers with more children
Cluster 2 is characterised by older, high income customers with fewer children
Cluster 3 is characterised by younger, low income customers with no children
The separation appears much better!

How stable is k-means?
We can explore this by running the same kmeans, with the same number of clusters, a few times over, as sketched below:
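The slide's plots are not reproduced here, but a minimal way to see the instability (my own sketch) is to run kmeans() with a single random start each time and compare the fits:

# Four runs with one random start each: the total within SS (and hence the
# clustering) can differ from run to run. Setting nstart = 25 guards
# against this by keeping the best of 25 random starts.
set.seed(1)
runs <- lapply(1:4, function(i) kmeans(ktc_numeric, centers = 3, nstart = 1))
sapply(runs, function(fit) fit$tot.withinss)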

But wait…why are we using k = 3?
What if we tried different values of k?

But wait…why are we using k = 3?
It's difficult to tell from visual inspection which clustering is best (though sometimes it is possible). Can we visualise a diagnostic of some kind instead?
Consider:
The total within sum of squares. This measures the compactness (i.e., goodness) of the clustering. Smaller is better

To do this, let's run k-means for a number of different values of k
We could do this by hand, but it is a bit of a pain (a sketch of the by-hand version follows below)
Fortunately there's a function in the factoextra package to help us!
library(factoextra)
# specify the data frame, method (kmeans) and within sum of squares (wss)
fviz_nbclust(ktc_numeric, kmeans, method = "wss")
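For reference, the by-hand version that fviz_nbclust() automates might look like this (my own sketch, using base R plotting):

# Total within SS for k = 1, ..., 10, then plot it against k
wss_by_k <- sapply(1:10, function(k) {
  kmeans(ktc_numeric, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss_by_k, type = "b",
     xlab = "Number of clusters k", ylab = "Total within sum of squares")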

Elbow plot
[Figure: total within sum of squares plotted against the number of clusters k for the KTC data; the "elbow" in the curve suggests a suitable k.]

A sharper elbow
[Figure: an example elbow plot with a much more pronounced bend, where the best choice of k is obvious.]

Summary
Unlike hierarchical clustering, k-means requires us to specify the number of clusters directly
k-means is also iterative, which means we are less likely to get strangely shaped clusters (recall the snake-like cluster from the previous lecture)
Choosing k can be difficult, but we can use the elbow plot to help us understand which number is best.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.