
Data Mining and Machine Learning
Clustering I
Peter Jančovič

Data Mining
• Objective of Data Mining is to find structure and patterns in large, abstract data sets
– Is the data homogeneous or does it consist of several separately identifiable subsets?
– Are there patterns in the data?
– If so, do these patterns have an intuitive interpretation?
– Are there correlations in the data?
– Is there redundancy in the data?

Partitioning data into “clusters”
• In this lecture we will start to develop tools to understand the structure of data that can be partitioned into (more or less) distinct subsets
• We can think of these subsets as arising from distinct “sources”
• We will consider three different techniques:
– Clustering
– Multi-modal statistical modelling (Gaussian Mixture Models – GMMs)
– Decision trees

Clustering – Objectives
• To explain the motivation for clustering
• To introduce the ideas of distance and distortion
• To describe agglomerative and divisive clustering
• To explain the relationships between clustering and decision trees

What does the data look like?
[Scatter plot: an example two-dimensional data set]

Structure of data
• Typical real data is not uniformly distributed
• It has structure
• Variables might be correlated
• The data might be grouped into natural ‘clusters’ – it may have been generated by several different “sources”
• The purpose of cluster analysis is to find this underlying structure automatically

Clusters and centroids
• Assume clusters are spherical – determined by their centres
• Cluster centres are called centroids
• Questions: How many centroids do we need? Where should we put them?

Distance
• A function d(x, y) defined on pairs of points x and y is called a distance or metric if it satisfies:
– d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
– d(x, y) = d(y, x) for all points x and y (symmetry)
– d(x, z) ≤ d(x, y) + d(y, z) for all points x, y and z (triangle inequality)

Example metrics
• The most common metric is the Euclidean metric
• If x = [x1, x2, …, xN] and y = [y1, y2, …, yN] then:

  $d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_N - y_N)^2}$

• This is the normal notion of distance in Euclidean space
• There are lots of other metrics, but we will focus on this one
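As a minimal sketch (not part of the original slides; assumes NumPy), the Euclidean metric can be computed directly:

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two N-dimensional points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

# Example: sqrt(3^2 + 4^2) = 5
print(euclidean_distance([6.0, 8.0], [9.0, 12.0]))  # 5.0
```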

The Lp Metrics
• The Euclidean distance is sometimes called the L2 metric:

  $d_2(\mathbf{x}, \mathbf{y}) = \left( \sum_{n=1}^{N} (x_n - y_n)^2 \right)^{1/2}$

• It is one of a family of metrics called the Lp metrics:

  $d_p(\mathbf{x}, \mathbf{y}) = \left( \sum_{n=1}^{N} |x_n - y_n|^p \right)^{1/p}$

Special Lp metrics
• p = 1 – the ‘City Block’ metric:

  $d_1(\mathbf{x}, \mathbf{y}) = \sum_{n=1}^{N} |x_n - y_n|$

• p = ∞:

  $d_\infty(\mathbf{x}, \mathbf{y}) = \max_{n=1,\dots,N} |x_n - y_n|$
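A sketch of the general Lp metric (illustrative, assuming NumPy); p = 1, p = 2 and p = ∞ reproduce the city-block, Euclidean and max metrics:

```python
import numpy as np

def lp_distance(x, y, p=2):
    """Lp distance between two points; p = np.inf gives the max metric."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float(np.sum(diff ** p) ** (1.0 / p))

x, y = [0.0, 0.0], [3.0, 4.0]
print(lp_distance(x, y, p=1))       # 7.0 (city block)
print(lp_distance(x, y, p=2))       # 5.0 (Euclidean)
print(lp_distance(x, y, p=np.inf))  # 4.0 (max metric)
```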

Unit sphere
• For a metric d defined on N-dimensional space, the unit sphere is the set of vectors x such that d(x, 0) = 1
• What do the unit spheres in 2D look like for these metrics?

Example Unit Spheres (2D)
[Figure: the three unit spheres drawn on axes from −1 to 1]
• L2 unit sphere: x² + y² = 1 (a circle)
• L1 unit sphere: |x| + |y| = 1 (a diamond)
• L∞ unit sphere: max{|x|, |y|} = 1 (a square)
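A small illustrative script (an assumption of this write-up, not from the slides; uses NumPy and matplotlib) that traces the three unit spheres by rescaling directions to unit Lp norm:

```python
import numpy as np
import matplotlib.pyplot as plt

def unit_sphere(p, num=400):
    """Points on the 2-D Lp unit sphere: directions around the circle,
    rescaled so that each has Lp norm exactly 1."""
    theta = np.linspace(0.0, 2.0 * np.pi, num)
    dirs = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    norms = np.linalg.norm(dirs, ord=p, axis=1)
    return dirs / norms[:, None]

for p, label in [(1, "L1 (diamond)"), (2, "L2 (circle)"), (np.inf, "Linf (square)")]:
    pts = unit_sphere(p)
    plt.plot(pts[:, 0], pts[:, 1], label=label)

plt.gca().set_aspect("equal")
plt.legend()
plt.show()
```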

Distortion
• Distortion is a measure of how well a set of centroids models a set of data
• Suppose we have:
– data points y1, y2, …, yT
– centroids c1, …, cM
• For each data point yt let ci(t) be its closest centroid; in other words:

  $d(y_t, c_{i(t)}) = \min_m d(y_t, c_m)$

Distortion
• The distortion for the centroid set C = {c1, …, cM} is defined by:

  $\mathrm{Dist}(C) = \sum_{t=1}^{T} d(y_t, c_{i(t)})$

• In other words, the distortion is the sum of distances between each data point and its nearest centroid
• The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
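A direct sketch of this definition (illustrative; assumes the Euclidean metric and NumPy):

```python
import numpy as np

def distortion(data, centroids):
    """Sum, over all data points, of the distance to the nearest centroid."""
    data = np.asarray(data, dtype=float)            # shape (T, N)
    centroids = np.asarray(centroids, dtype=float)  # shape (M, N)
    # Pairwise Euclidean distances between points and centroids, shape (T, M)
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return float(dists.min(axis=1).sum())

data = [[6.0, 8.0], [7.0, 9.0], [12.0, 13.0]]
print(distortion(data, [[6.5, 8.5], [12.0, 13.0]]))  # ≈ 1.414
```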

Types of Clustering
• We will start with two types of cluster analysis:
– Agglomerative clustering, or ‘bottom-up’ hierarchical clustering
– Divisive clustering, or ‘top-down’ clustering
• In the next lecture we will focus on a more sophisticated clustering method called k-means clustering

Agglomerative clustering
• Agglomerative clustering begins by assuming that each data point belongs to its own, unique, single-point cluster – each point is a centroid
• Clusters are then combined until the required number of centroids is obtained
• The simplest agglomerative clustering algorithm is one which, at each stage, combines the two closest centroids into a single centroid
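A minimal sketch of that greedy algorithm (illustrative; it assumes Euclidean distance and replaces a merged pair by the weighted mean of the points each centroid represents, a common convention the slides do not specify):

```python
import numpy as np

def agglomerative_centroids(data, num_centroids):
    """Treat every point as a centroid, then repeatedly merge the two
    closest centroids until only num_centroids remain."""
    centroids = [np.asarray(x, dtype=float) for x in data]
    weights = [1] * len(centroids)  # points represented by each centroid
    while len(centroids) > num_centroids:
        # Find the closest pair of centroids (naive O(M^2) search)
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Replace the pair by its weighted mean, then drop centroid j
        w = weights[i] + weights[j]
        centroids[i] = (weights[i] * centroids[i] + weights[j] * centroids[j]) / w
        weights[i] = w
        del centroids[j], weights[j]
    return np.array(centroids)

print(agglomerative_centroids([[6, 8], [7, 9], [12, 13], [13, 13]], 2))
# [[ 6.5  8.5] [12.5 13. ]]
```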

[Plots: agglomerative clustering of the original data (302 points), shown after successive merges leaving 252, 152, 52 and finally 12 centroids]

Optimality of agglomerative clustering
• The result of agglomerative clustering is not optimal
• In general it does not result in a set of centroids C such that Dist(C) is minimised
• For example:
– Outliers may be given their own centroids
– Dense clusters may be given too few centroids

Divisive Clustering
• Divisive clustering begins by assuming that there is just one centroid – typically at the centre of the set of data points
• That centroid is then replaced with two new centroids
• Then each of these is replaced with two new centroids, and so on …
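One possible sketch of this scheme (illustrative; the split rule, perturbation size and re-assignment step are assumptions of this example, not prescribed by the slides):

```python
import numpy as np

def divisive_centroids(data, num_centroids, epsilon=0.05):
    """Start from a single centroid at the mean of the data, then
    repeatedly split one centroid into two perturbed copies and
    re-assign every point to its nearest centroid."""
    data = np.asarray(data, dtype=float)
    centroids = [data.mean(axis=0)]
    while len(centroids) < num_centroids:
        # Split the first centroid (a fuller implementation might pick
        # the centroid contributing most to the distortion)
        c = centroids.pop(0)
        centroids += [c - epsilon, c + epsilon]
        # Re-assign points to their nearest centroid, recompute the means
        C = np.array(centroids)
        nearest = np.linalg.norm(
            data[:, None, :] - C[None, :, :], axis=2).argmin(axis=1)
        centroids = [data[nearest == k].mean(axis=0)
                     for k in range(len(C)) if np.any(nearest == k)]
    return np.array(centroids)

print(divisive_centroids([[6, 8], [7, 9], [12, 13], [13, 13]], 2))
# [[ 6.5  8.5] [12.5 13. ]]
```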

[Plots: divisive clustering of the original data (302 points): first with a single centroid at the centre of the data, then with two centroids after the first split]

Optimality of divisive clustering
• The result of divisive clustering is not optimal
• In general it does not result in a set of centroids C such that Dist(C) is minimised
• Sequential decision making is normally suboptimal:
– Decisions are not reversible
– If a point goes to a particular half of a partition it will never be re-allocated to the other half
– Probably not how a human would do it

Decision tree interpretation

[Diagram: clustering viewed as a tree. The root is a single centroid representing the whole data set; the leaves are multiple centroids, one per data point. Divisive (‘top-down’) clustering works from the root towards the leaves; agglomerative (‘bottom-up’) clustering works from the leaves towards the root]

Optimality
• An ‘optimal’ set of centroids is one which minimises the distortion
• In general, neither method gives an optimal set of centroids
• A more principled approach is to treat the distortion as a function of the centroid set and minimise it directly

Notation and method
• N-dimensional space
• T data points X = {x1, …, xT}
• K centroids C = {c1, …, cK}
• Calculate the partial derivative

  $\frac{\partial \,\mathrm{Dist}(C)}{\partial c_{kn}}$

  for each k and n, set it to zero and solve
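As a worked sketch (this assumes the squared Euclidean distance is used inside the distortion, as in k-means; the slides leave the choice of metric open), setting the derivative to zero shows that each centroid should be the mean of the points assigned to it:

```latex
% Distortion with squared Euclidean distance and fixed assignments i(t):
%   Dist(C) = \sum_{t=1}^{T} \sum_{n=1}^{N} ( x_{tn} - c_{i(t),n} )^2
\frac{\partial \,\mathrm{Dist}(C)}{\partial c_{kn}}
  = -2 \sum_{t \,:\, i(t) = k} ( x_{tn} - c_{kn} ) = 0
\quad\Longrightarrow\quad
c_{kn} = \frac{1}{|\{ t : i(t) = k \}|} \sum_{t \,:\, i(t) = k} x_{tn}
% i.e. the k-th centroid is the mean of the data points closest to it.
```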

Summary
• Distance metrics and distortion
• Agglomerative clustering
• Divisive clustering
• Decision tree interpretation