Lecture 10: Clustering
GGR376
Dr. Adams

Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.

Clustering Visual I

Clustering Visual II

Clustering Visual III

Iris Example
Example from Teja Kodali, Data Analyst at DV Trading LLC, Chicago, IL

Iris Data
library(datasets)
library(knitr)
kable(head(iris))
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2   setosa
         4.9          3.0           1.4          0.2   setosa
         4.7          3.2           1.3          0.2   setosa
         4.6          3.1           1.5          0.2   setosa
         5.0          3.6           1.4          0.2   setosa
         5.4          3.9           1.7          0.4   setosa

Petal Length and Width
[Figure: "Iris Petals" — scatter plot of petal length (x axis, "Length") against petal width (y axis, "Width"), coloured by species: setosa, versicolor, virginica.]
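The petal scatter plot above can be reproduced with a few lines of base R. This is a sketch: the slide's actual plotting code is not shown, and the colour choices here are assumptions.

```r
# Recreate the "Iris Petals" scatter: petal length vs. width by species.
# Colour mapping is an assumption; the slide's original code is not shown.
cols <- c(setosa = "red", versicolor = "green3", virginica = "blue")
plot(iris$Petal.Length, iris$Petal.Width,
     col = cols[iris$Species], pch = 19,
     main = "Iris Petals", xlab = "Length", ylab = "Width")
legend("topleft", legend = names(cols), col = cols, pch = 19,
       title = "Species")
```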

Clustering
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)

# Cluster sizes
irisCluster$size
[1] 50 52 48

Compare Clusters to Species

  setosa versicolor virginica
1     50          0         0
2      0         48         4
3      0          2        46

Plot Correct / Incorrect
[Figure: "Iris Petals" — petal length (x, "Length") vs. petal width (y, "Width"), points marked Correct / Incorrect against the true species: setosa, versicolor, virginica.]

Case Study Example
Kernel density estimation and K-means clustering to profile road accident hotspots
- "Every year approximately 3500 people are killed on Britain's roads and 40,000 are seriously injured."
- "Road traffic accidents according to the WHO are the leading injury-related cause of death among people aged 10–24."

Anderson, T. K. (2009). Kernel density estimation and K-means clustering to profile road accident hotspots. Accident Analysis & Prevention, 41(3), 359-364.

Objectives - Anderson 2009
1. Present a methodology for the identification of high-density accident zones using GIS and kernel density estimation (KDE).
2. Add attribute data to the accident zones.
3. Identify similar zones using a K-means clustering algorithm with regard to the attribute data, and compare.

Methodology - Anderson 2009
- 5 years (1999–2003) of road accident data for the metropolitan area of London
- Kernel density calculated from traffic accidents
- Assigned attributes: road length, cycle lane length, pedestrian crossings, London Underground stations, traffic lights, bus stops, schools, & speed cameras
- Applied k-means clustering to the attribute data

Kernel Density - Anderson 2009

Hot Spots
- Hot spots were based on a threshold of accidents per cell.
- Convert point observations to a surface.
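The hot-spot step above (point observations converted to a density surface, then thresholded) can be sketched in R. This is only an illustration: `MASS::kde2d`, the synthetic "accident" points, and the 95th-percentile threshold are assumptions, not Anderson's actual implementation.

```r
# Kernel density estimation on synthetic "accident" points, in the
# spirit of Anderson (2009). The data and threshold are illustrative.
library(MASS)

set.seed(1)
x <- c(rnorm(200, 0, 1), rnorm(100, 4, 0.5))  # two artificial hotspots
y <- c(rnorm(200, 0, 1), rnorm(100, 4, 0.5))

dens <- kde2d(x, y, n = 100)         # density surface on a 100 x 100 grid
threshold <- quantile(dens$z, 0.95)  # flag the densest 5% of cells
hot <- dens$z > threshold            # logical "hot spot" mask
sum(hot)                             # number of hot-spot cells
```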
Hot Spots Clustering - Anderson 2009
- Attempted to classify each of the 428 hotspots into relatively homogeneous types based on their environmental characteristics
- Given the number of hotspots identified in the kernel density surface, K-means clustering was applied
- K-means requires a specification of the number of clusters

Outcomes from Clusters - Anderson 2009

Outcomes from Clusters II - Anderson 2009
- Extracted themes based on accident type and environmental conditions.
- Policy?

k-means Clustering
k-means clustering aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the centre of the cluster.

k-means Overview
- Unsupervised clustering algorithm
  - You do not supply any information on the clusters
- k is the number of clusters
  - Usually user specified
- An iterative approach
  - I.e., a solution is found through optimization
- Does not necessarily find a global solution
  - NP-hard problem
  - Different solutions are possible on each run
  - Especially as n gets large

k-means Video
K-means clustering: how it works
https://www.youtube.com/watch?v=_aWzGGNrcic

Process
- Locate k random centroids to initiate the clusters.
- Compute, for each observation, the distance to each centroid.
- Assign each observation to its nearest centroid.
- Compute the centroid means (hence "k-means").
- Iterate until convergence or until the maximum number of iterations is reached.
- Typically requires < 50 iterations.

Distance Measure
Euclidean distance: the straight-line distance between two points.
Depending on the data, you may want to standardize your values:
- If the units are drastically different
- Convert to a mean of 0 and a standard deviation of 1

k-means challenges
1. Selecting k.
2. Assumes there are actually k clusters.
3. Assumes SSE is the right objective to minimize.
4. Assumes clusters have the same SSE.
5. Assumes all variables have the same importance for every cluster.

What if we don't know k?
You will not always know how many clusters exist (or are required).
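One common way to pick k when it is unknown is the "elbow" plot: run k-means for a range of k and look for the point where the total within-cluster SSE stops dropping sharply. A minimal sketch on the iris petal data used earlier:

```r
# Elbow-plot sketch for choosing k: total within-cluster SSE vs. k.
set.seed(20)
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss
})
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SSE")
```

The curve bends at k = 3 for this data, consistent with the three species.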
Hierarchical Clustering
- k-means is a partitioning approach
  - The end result is a set of final clusters.
- Hierarchical methods
  - Divide or aggregate the data in stages
  - The end result includes all stages, from a single cluster to n clusters
  - Select k clusters after the analysis

Bottom-Up Method
- Each point begins as its own cluster.
- Combine the two closest clusters.
- Repeat until a single cluster remains.

Dendrogram
[Figure: "Cluster Dendrogram" of the US states, hclust (*, "average"), height 0–150.]

Agglomeration methods (i.e., linkage methods)
- Complete linkage
- Single linkage
- Mean linkage
- Centroid linkage
- Ward's method

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations.

Complete linkage clustering
- Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2.
- Takes the largest (i.e., maximum) of these dissimilarities as the distance between the two clusters.
- Tends to produce compact clusters.

Single linkage clustering
- Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2.
- Takes the smallest of these dissimilarities as the linkage criterion.
- Tends to produce long, "loose" clusters.

Mean linkage clustering
- Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2.
- Takes the average of these dissimilarities as the distance between the two clusters.
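The bottom-up procedure and the US-states dendrogram can be sketched with base R's `hclust`. Whether the original slide standardized the data first is not shown, so the `scale()` step and the choice of k = 4 below are assumptions.

```r
# Average-linkage hierarchical clustering of the built-in USArrests
# data (standardized first; this pre-processing is an assumption).
d  <- dist(scale(USArrests))          # Euclidean distances on z-scores
hc <- hclust(d, method = "average")   # bottom-up, mean linkage
plot(hc, main = "Cluster Dendrogram", sub = "US States")

# Choose k after the analysis by cutting the tree
groups <- cutree(hc, k = 4)
table(groups)
```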
Centroid linkage clustering
Computes the dissimilarity between the centroid of cluster 1 and the centroid of cluster 2.

Ward's method
Minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.

Spatially Constrained Clustering
Often, when dealing with spatial data, we have a reason to force clusters to be spatially contiguous. Spatially constrained clustering is the approach, also known as:
- Regionalization
- Contiguity-constrained clustering

Gerrymandering
https://www.youtube.com/watch?v=Mky11UJb9AY

The Problem
The new areal units may be at odds with the two objectives:
- Similarity of attributes
  - Non-spatial clustering
- Similarity of location
  - Spatial contiguity
These may not be aligned.

Modified k-means
- Start with a k-means result
- Split / combine the non-contiguous clusters
  - The number k may change
- An inefficient approach

The Weighting Approach
Without explicit contiguity.
- Multi-objective
- Include x-y coordinates as attributes in the clustering
- Does not ensure contiguity
  - Choosing a weight for the x-y attributes is challenging
  - Contiguity eventually emerges with strong enough weights

Automatic Zoning Procedure (AZP)
- Start with an initial solution
  - e.g., output from k-means
- Swap polygons between zones
- No optimal solution
  - NP-hard problem

Graph-based methods
- Spatial contiguity is represented with a graph
  - Observations are nodes
  - Connections are edges
- Splitting edges creates clusters

Optimization Problems
- Defined as an integer programming problem
  - 1 or 0
- Specify a criterion to optimize
  - e.g., maximum difference between groups

Common Methods
- SKATER
- REDCAP
- Max-p

Implementations
- R
  - spdep::skater
  - The ClustGeo package (semi-constrained)
- GeoDa
  - Open-source GUI
- RedCap
  - Open-source GUI

Assignment
Spatially constrained clustering can be done outside of R, but you will need to describe how you did this in your methods.
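The weighting approach described earlier (adding x-y coordinates as extra attributes before k-means) can be sketched on toy data. The data, the weight w, and the choice of k here are all hypothetical; note that this encourages, but does not guarantee, contiguous clusters.

```r
# Sketch of the weighting approach to spatially constrained clustering:
# append weighted x-y coordinates to the attribute matrix, then k-means.
set.seed(20)
n   <- 100
xy  <- cbind(x = runif(n), y = runif(n))  # toy point locations
att <- matrix(rnorm(n * 2), ncol = 2)     # toy attributes

w <- 2  # hypothetical weight: larger w pulls clusters toward compactness
z <- cbind(scale(att), w * scale(xy))     # standardize, then weight x-y

cl <- kmeans(z, centers = 4, nstart = 20)
table(cl$cluster)
```

Choosing w is the hard part: too small and the clusters fragment spatially; too large and the attributes are ignored.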