DATA MINING AND ANALYSIS
Fundamental Concepts and Algorithms
MOHAMMED J. ZAKI
Rensselaer Polytechnic Institute, Troy, New York
WAGNER MEIRA JR.
Universidade Federal de Minas Gerais, Brazil
32 Avenue of the Americas, New York, NY 10013-2473, USA
Cambridge University Press is part of the University of Cambridge.
It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9780521766333
Copyright Mohammed J. Zaki and Wagner Meira Jr. 2014
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2014
A catalog record for this publication is available from the British Library.
Library of Congress Cataloging in Publication Data
Zaki, Mohammed J., 1971–
Data mining and analysis: fundamental concepts and algorithms / Mohammed J. Zaki, Rensselaer Polytechnic Institute, Troy, New York, Wagner Meira Jr., Universidade Federal de Minas Gerais, Brazil.
pages cm
Includes bibliographical references and index.
ISBN 978-0-521-76633-3 (hardback)
1. Data mining. I. Meira, Wagner, 1967– II. Title. QA76.9.D343Z36 2014
006.3′ 12–dc23 2013037544
ISBN 978-0-521-76633-3 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.
Contents
Contents iii
Preface vii
1 Data Mining and Analysis 1
1.1 Data Matrix 1
1.2 Attributes 3
1.3 Data: Algebraic and Geometric View 4
1.4 Data: Probabilistic View 14
1.5 Data Mining 25
1.6 Further Reading 30
1.7 Exercises 30
PART I DATA ANALYSIS FOUNDATIONS 31
2 Numeric Attributes 33
2.1 Univariate Analysis 33
2.2 Bivariate Analysis 42
2.3 Multivariate Analysis 48
2.4 Data Normalization 52
2.5 Normal Distribution 54
2.6 Further Reading 60
2.7 Exercises 60
3 Categorical Attributes 63
3.1 Univariate Analysis 63
3.2 Bivariate Analysis 72
3.3 Multivariate Analysis 82
3.4 Distance and Angle 87
3.5 Discretization 89
3.6 Further Reading 91
3.7 Exercises 91
4 Graph Data 93
4.1 Graph Concepts 93
4.2 Topological Attributes 97
4.3 Centrality Analysis 102
4.4 Graph Models 112
4.5 Further Reading 132
4.6 Exercises 132
5 Kernel Methods 134
5.1 Kernel Matrix 138
5.2 Vector Kernels 144
5.3 Basic Kernel Operations in Feature Space 148
5.4 Kernels for Complex Objects 154
5.5 Further Reading 161
5.6 Exercises 161
6 High-dimensional Data 163
6.1 High-dimensional Objects 163
6.2 High-dimensional Volumes 165
6.3 Hypersphere Inscribed within Hypercube 168
6.4 Volume of Thin Hypersphere Shell 169
6.5 Diagonals in Hyperspace 171
6.6 Density of the Multivariate Normal 172
6.7 Appendix: Derivation of Hypersphere Volume 175
6.8 Further Reading 180
6.9 Exercises 180
7 Dimensionality Reduction 183
7.1 Background 183
7.2 Principal Component Analysis 187
7.3 Kernel Principal Component Analysis 202
7.4 Singular Value Decomposition 208
7.5 Further Reading 213
7.6 Exercises 214
PART II FREQUENT PATTERN MINING 215
8 Itemset Mining 217
8.1 Frequent Itemsets and Association Rules 217
8.2 Itemset Mining Algorithms 221
8.3 Generating Association Rules 234
8.4 Further Reading 236
8.5 Exercises 237
9 Summarizing Itemsets 242
9.1 Maximal and Closed Frequent Itemsets 242
9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm 245
9.3 Mining Closed Frequent Itemsets: Charm Algorithm 248
9.4 Nonderivable Itemsets 250
9.5 Further Reading 256
9.6 Exercises 256
10 Sequence Mining 259
10.1 Frequent Sequences 259
10.2 Mining Frequent Sequences 260
10.3 Substring Mining via Suffix Trees 267
10.4 Further Reading 277
10.5 Exercises 277
11 Graph Pattern Mining 280
11.1 Isomorphism and Support 280
11.2 Candidate Generation 284
11.3 The gSpan Algorithm 288
11.4 Further Reading 296
11.5 Exercises 297
12 Pattern and Rule Assessment 301
12.1 Rule and Pattern Assessment Measures 301
12.2 Significance Testing and Confidence Intervals 316
12.3 Further Reading 328
12.4 Exercises 328
PART III CLUSTERING 331
13 Representative-based Clustering 333
13.1 K-means Algorithm 333
13.2 Kernel K-means 338
13.3 Expectation-Maximization Clustering 342
13.4 Further Reading 360
13.5 Exercises 361
14 Hierarchical Clustering 364
14.1 Preliminaries 364
14.2 Agglomerative Hierarchical Clustering 366
14.3 Further Reading 372
14.4 Exercises 373
15 Density-based Clustering 375
15.1 The DBSCAN Algorithm 375
15.2 Kernel Density Estimation 379
15.3 Density-based Clustering: DENCLUE 385
15.4 Further Reading 390
15.5 Exercises 391
16 Spectral and Graph Clustering 394
16.1 Graphs and Matrices 394
16.2 Clustering as Graph Cuts 401
16.3 Markov Clustering 416
16.4 Further Reading 422
16.5 Exercises 423
17 Clustering Validation 425
17.1 External Measures 425
17.2 Internal Measures 440
17.3 Relative Measures 448
17.4 Further Reading 461
17.5 Exercises 462
PART IV CLASSIFICATION 464
18 Probabilistic Classification 466
18.1 Bayes Classifier 466
18.2 Naive Bayes Classifier 472
18.3 K Nearest Neighbors Classifier 476
18.4 Further Reading 478
18.5 Exercises 478
19 Decision Tree Classifier 480
19.1 Decision Trees 482
19.2 Decision Tree Algorithm 484
19.3 Further Reading 495
19.4 Exercises 495
20 Linear Discriminant Analysis 497
20.1 Optimal Linear Discriminant 497
20.2 Kernel Discriminant Analysis 504
20.3 Further Reading 510
20.4 Exercises 511
21 Support Vector Machines 513
21.1 Support Vectors and Margins 513
21.2 SVM: Linear and Separable Case 519
21.3 Soft Margin SVM: Linear and Nonseparable Case 523
21.4 Kernel SVM: Nonlinear Case 529
21.5 SVM Training Algorithms 533
21.6 Further Reading 544
21.7 Exercises 545
22 Classification Assessment 547
22.1 Classification Performance Measures 547
22.2 Classifier Evaluation 561
22.3 Bias-Variance Decomposition 571
22.4 Further Reading 580
22.5 Exercises 581
Index 585
Preface
This book is an outgrowth of data mining courses at Rensselaer Polytechnic Institute (RPI) and Universidade Federal de Minas Gerais (UFMG); the RPI course has been offered every Fall since 1998, whereas the UFMG course has been offered since 2002. Although there are several good books on data mining and related topics, we felt that many of them are either too high-level or too advanced. Our goal was to write an introductory text that focuses on the fundamental algorithms in data mining and analysis. It lays the mathematical foundations for the core data mining methods, with key concepts explained when first encountered; the book also tries to build the intuition behind the formulas to aid understanding.
The main parts of the book include exploratory data analysis, frequent pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and it also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. It integrates concepts from related disciplines such as machine learning and statistics and is also ideal for a course on data analysis. Most of the prerequisite material is covered in the text, especially on linear algebra, and probability and statistics.
The book includes many examples to illustrate the main technical concepts. It also has end-of-chapter exercises, which have been used in class. All of the algorithms in the book have been implemented by the authors. We suggest that readers use their favorite data analysis and mining software to work through our examples and to implement the algorithms we describe in text; we recommend the R software or the Python language with its NumPy package. The datasets used and other supplementary material such as project ideas and slides are available online at the book’s companion site and its mirrors at RPI and UFMG:
• http://dataminingbook.info
• http://www.cs.rpi.edu/~zaki/dataminingbook
• http://www.dcc.ufmg.br/dataminingbook
Having understood the basic principles and algorithms in data mining and data analysis, readers will be well equipped to develop their own methods or use more advanced techniques.
Figure 0.1. Chapter dependencies
Suggested Roadmaps
The chapter dependency graph is shown in Figure 0.1. We suggest some typical roadmaps for courses and readings based on this book. For an undergraduate-level course, we suggest the following chapters: 1–3, 8, 10, 12–15, 17–19, and 21–22. For an undergraduate course without exploratory data analysis, we recommend Chapters 1, 8–15, 17–19, and 21–22. For a graduate course, one possibility is to quickly go over the material in Part I or to assume it as background reading and to directly cover Chapters 9–22; the other parts of the book, namely frequent pattern mining (Part II), clustering (Part III), and classification (Part IV), can be covered in any order. For a course on data analysis the chapters covered must include 1–7, 13–14, 15 (Section 2), and 20. Finally, for a course with an emphasis on graphs and kernels we suggest Chapters 4, 5, 7 (Sections 1–3), 11–12, 13 (Sections 1–2), 16–17, and 20–22.
Acknowledgments
Initial drafts of this book have been used in several data mining courses. We received many valuable comments and corrections from both the faculty and students. Our thanks go to
• Muhammad Abulaish, Jamia Millia Islamia, India
• Mohammad Al Hasan, Indiana University Purdue University at Indianapolis
• Marcio Luiz Bunte de Carvalho, Universidade Federal de Minas Gerais, Brazil
• Loïc Cerf, Universidade Federal de Minas Gerais, Brazil
• Ayhan Demiriz, Sakarya University, Turkey
• Murat Dundar, Indiana University Purdue University at Indianapolis
• Jun Luke Huan, University of Kansas
• Ruoming Jin, Kent State University
• Latifur Khan, University of Texas, Dallas
• Pauli Miettinen, Max-Planck-Institut für Informatik, Germany
• Suat Ozdemir, Gazi University, Turkey
• Naren Ramakrishnan, Virginia Polytechnic and State University
• Leonardo Chaves Dutra da Rocha, Universidade Federal de São João del-Rei, Brazil
• Saeed Salem, North Dakota State University
• Ankur Teredesai, University of Washington, Tacoma
• Hannu Toivonen, University of Helsinki, Finland
• Adriano Alonso Veloso, Universidade Federal de Minas Gerais, Brazil
• Jason T.L. Wang, New Jersey Institute of Technology
• Jianyong Wang, Tsinghua University, China
• Jiong Yang, Case Western Reserve University
• Jieping Ye, Arizona State University
We would like to thank all the students enrolled in our data mining courses at RPI and UFMG, as well as the anonymous reviewers who provided technical comments on various chapters. We appreciate the collegial and supportive environment within the computer science departments at RPI and UFMG and at the Qatar Computing Research Institute. In addition, we thank NSF, CNPq, CAPES, FAPEMIG, Inweb – the National Institute of Science and Technology for the Web, and Brazil’s Science without Borders program for their support. We thank Lauren Cowles, our editor at Cambridge University Press, for her guidance and patience in realizing this book.
Finally, on a more personal front, MJZ dedicates the book to his wife, Amina, for her love, patience and support over all these years, and to his children, Abrar and Afsah, and his parents. WMJ gratefully dedicates the book to his wife Patricia; to his children, Gabriel and Marina; and to his parents, Wagner and Marlene, for their love, encouragement, and inspiration.
CHAPTER 1 Data Mining and Analysis
Data mining is the process of discovering insightful, interesting, and novel patterns, as well as descriptive, understandable, and predictive models from large-scale data. We begin this chapter by looking at basic properties of data modeled as a data matrix. We emphasize the geometric and algebraic views, as well as the probabilistic interpretation of data. We then discuss the main data mining tasks, which span exploratory data analysis, frequent pattern mining, clustering, and classification, laying out the roadmap for the book.
1.1 DATA MATRIX
Data can often be represented or abstracted as an n × d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or properties of interest. Each row in the data matrix records the observed attribute values for a given entity. The n × d data matrix is given as
$$\mathbf{D} = \begin{pmatrix}
 & X_1 & X_2 & \cdots & X_d \\
\mathbf{x}_1 & x_{11} & x_{12} & \cdots & x_{1d} \\
\mathbf{x}_2 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{x}_n & x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}$$
where $\mathbf{x}_i$ denotes the $i$th row, which is a d-tuple given as
$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$$
and $X_j$ denotes the $j$th column, which is an n-tuple given as
$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$$
Depending on the application domain, rows may also be referred to as entities, instances, examples, records, transactions, objects, points, feature-vectors, tuples, and so on. Likewise, columns may also be called attributes, properties, features, dimensions, variables, fields, and so on. The number of instances n is referred to as the size of
the data, whereas the number of attributes d is called the dimensionality of the data. The analysis of a single attribute is referred to as univariate analysis, whereas the simultaneous analysis of two attributes is called bivariate analysis and the simultaneous analysis of more than two attributes is called multivariate analysis.

Table 1.1. Extract from the Iris dataset

        Sepal    Sepal    Petal    Petal
        length   width    length   width    Class
        X1       X2       X3       X4       X5
x1      5.9      3.0      4.2      1.5      Iris-versicolor
x2      6.9      3.1      4.9      1.5      Iris-versicolor
x3      6.6      2.9      4.6      1.3      Iris-versicolor
x4      4.6      3.2      1.4      0.2      Iris-setosa
x5      6.0      2.2      4.0      1.0      Iris-versicolor
x6      4.7      3.2      1.3      0.2      Iris-setosa
x7      6.5      3.0      5.8      2.2      Iris-virginica
x8      5.8      2.7      5.1      1.9      Iris-virginica
...     ...      ...      ...      ...      ...
x149    7.7      3.8      6.7      2.2      Iris-virginica
x150    5.1      3.4      1.5      0.2      Iris-setosa
Example 1.1. Table 1.1 shows an extract of the Iris dataset; the complete data forms a 150 × 5 data matrix. Each entity is an Iris flower, and the attributes include sepal length, sepal width, petal length, and petal width in centimeters, and the type or class of the Iris flower. The first row is given as the 5-tuple
x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor)
Not all datasets are in the form of a data matrix. For instance, more complex datasets can be in the form of sequences (e.g., DNA and protein sequences), text, time-series, images, audio, video, and so on, which may need special techniques for analysis. However, in many cases even if the raw data is not a data matrix it can usually be transformed into that form via feature extraction. For example, given a database of images, we can create a data matrix in which rows represent images and columns correspond to image features such as color, texture, and so on. Sometimes, certain attributes may have special semantics associated with them requiring special treatment. For instance, temporal or spatial attributes are often treated differently. It is also worth noting that traditional data analysis assumes that each entity or instance is independent. However, given the interconnected nature of the world we live in, this assumption may not always hold. Instances may be connected to other instances via various kinds of relationships, giving rise to a data graph, where a node represents an entity and an edge represents the relationship between two entities.
1.2 ATTRIBUTES
Attributes may be classified into two main types depending on their domain, that is, depending on the types of values they take on.
Numeric Attributes
A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers), is numeric, and so is petal length in Table 1.1, with domain(petallength) = R+ (the set of all positive real numbers). Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. As a special case of discrete, if an attribute has as its domain the set {0,1}, it is called a binary attribute. Numeric attributes can be classified further into two types:
• Interval-scaled: For these kinds of attributes only differences (addition or subtraction) make sense. For example, attribute temperature measured in °C or °F is interval-scaled. If it is 20°C on one day and 10°C on the following day, it is meaningful to talk about a temperature drop of 10°C, but it is not meaningful to say that it is twice as cold as the previous day.
• Ratio-scaled: Here one can compute both differences as well as ratios between values. For example, for attribute Age, we can say that someone who is 20 years old is twice as old as someone who is 10 years old.
Categorical Attributes
A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex and Education could be categorical attributes with their domains given as
domain(Sex) = {M,F} domain(Education) = {HighSchool,BS,MS,PhD}
Categorical attributes may be of two types:
• Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful. That is, we can check only whether the value of the attribute for two given instances is the same or not. For example, Sex is a nominal attribute. Also class in Table 1.1 is a nominal attribute with domain(class) = {iris-setosa,iris-versicolor,iris-virginica}.
• Ordinal: The attribute values are ordered, and thus both equality comparisons (is one value equal to another?) and inequality comparisons (is one value less than or greater than another?) are allowed, though it may not be possible to quantify the difference between values. For example, Education is an ordinal attribute because its domain values are ordered by increasing educational qualification.
1.3 DATA: ALGEBRAIC AND GEOMETRIC VIEW
If the d attributes or dimensions in the data matrix D are all numeric, then each row can be considered as a d-dimensional point:
$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in \mathbb{R}^d$$
or equivalently, each row may be considered as a d-dimensional column vector (all vectors are assumed to be column vectors by default):
$$\mathbf{x}_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{id} \end{pmatrix} = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{pmatrix}^T \in \mathbb{R}^d$$
where $T$ is the matrix transpose operator.
The d-dimensional Cartesian coordinate space is specified via the d unit vectors,
called the standard basis vectors, along each of the axes. The j th standard basis vector ej is the d-dimensional unit vector whose jth component is 1 and the rest of the components are 0
ej =(0,…,1j,…,0)T
Any other vector in Rd can be written as a linear combination of the standard basis
vectors. For example, each of the points xi can be written as the linear combination
$$\mathbf{x}_i = x_{i1}\mathbf{e}_1 + x_{i2}\mathbf{e}_2 + \cdots + x_{id}\mathbf{e}_d = \sum_{j=1}^{d} x_{ij}\mathbf{e}_j$$
where the scalar value $x_{ij}$ is the coordinate value along the $j$th axis or attribute.
Example 1.2. Consider the Iris data in Table 1.1. If we project the entire data onto the first two attributes, then each row can be considered as a point or a vector in 2-dimensional space. For example, the projection of the 5-tuple x1 = (5.9, 3.0, 4.2, 1.5, Iris-versicolor) on the first two attributes is shown in Figure 1.1a. Figure 1.2 shows the scatterplot of all the n = 150 points in the 2-dimensional space spanned by the first two attributes. Likewise, Figure 1.1b shows x1 as a point and vector in 3-dimensional space, by projecting the data onto the first three attributes. The point (5.9,3.0,4.2) can be seen as specifying the coefficients in the linear combination of the standard basis vectors in R3:
$$\mathbf{x}_1 = 5.9\,\mathbf{e}_1 + 3.0\,\mathbf{e}_2 + 4.2\,\mathbf{e}_3 = 5.9\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + 3.0\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + 4.2\begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 5.9 \\ 3.0 \\ 4.2 \end{pmatrix}$$
Figure 1.1. Row x1 as a point and vector in (a) R² and (b) R³.

Figure 1.2. Scatterplot: sepal length versus sepal width. The solid circle shows the mean point.

Each numeric column or attribute can also be treated as a vector in an n-dimensional space $\mathbb{R}^n$:
$$X_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$
If all attributes are numeric, then the data matrix D is in fact an n × d matrix, also written as $\mathbf{D} \in \mathbb{R}^{n\times d}$, given as
$$\mathbf{D} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix} = \begin{pmatrix} -\,\mathbf{x}_1^T\,- \\ -\,\mathbf{x}_2^T\,- \\ \vdots \\ -\,\mathbf{x}_n^T\,- \end{pmatrix} = \begin{pmatrix} | & | & & | \\ X_1 & X_2 & \cdots & X_d \\ | & | & & | \end{pmatrix}$$
As we can see, we can consider the entire dataset as an n × d matrix, or equivalently as a set of n row vectors $\mathbf{x}_i^T \in \mathbb{R}^d$ or as a set of d column vectors $X_j \in \mathbb{R}^n$.
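To make the two views concrete, the following is a minimal NumPy sketch (not from the text; the array holds only the eight numeric rows of the Table 1.1 extract, and the variable names are illustrative) that stores the data as an n × d matrix and extracts a row vector and a column vector.

```python
import numpy as np

# Numeric part of the extract in Table 1.1: rows x1..x8, columns X1..X4
# (sepal length, sepal width, petal length, petal width).
D = np.array([
    [5.9, 3.0, 4.2, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [6.6, 2.9, 4.6, 1.3],
    [4.6, 3.2, 1.4, 0.2],
    [6.0, 2.2, 4.0, 1.0],
    [4.7, 3.2, 1.3, 0.2],
    [6.5, 3.0, 5.8, 2.2],
    [5.8, 2.7, 5.1, 1.9],
])

n, d = D.shape      # size n = 8, dimensionality d = 4
x1 = D[0, :]        # first row: the point x1 = (5.9, 3.0, 4.2, 1.5)
X2 = D[:, 1]        # second column: attribute X2 (sepal width) as a vector in R^n

print(n, d)
print(x1)
print(X2)
```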
1.3.1 Distance and Angle
Treating data instances and attributes as vectors, and the entire dataset as a matrix, enables one to apply both geometric and algebraic methods to aid in the data mining and analysis tasks.
Let a, b ∈ Rm be two m-dimensional vectors given as
$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} \qquad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}$$
Dot Product
The dot product between a and b is defined as the scalar value
$$\mathbf{a}^T\mathbf{b} = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \end{pmatrix} \times \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m = \sum_{i=1}^{m} a_i b_i$$

Length
The Euclidean norm or length of a vector $\mathbf{a} \in \mathbb{R}^m$ is defined as
$$\|\mathbf{a}\| = \sqrt{\mathbf{a}^T\mathbf{a}} = \sqrt{a_1^2 + a_2^2 + \cdots + a_m^2} = \sqrt{\sum_{i=1}^{m} a_i^2}$$
The unit vector in the direction of a is given as
$$\mathbf{u} = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \left(\frac{1}{\|\mathbf{a}\|}\right)\mathbf{a}$$
By definition u has length ∥u∥ = 1, and it is also called a normalized vector, which can be used in lieu of a in some analysis tasks.
The Euclidean norm is a special case of a general class of norms, known as Lp-norm, defined as
$$\|\mathbf{a}\|_p = \left(|a_1|^p + |a_2|^p + \cdots + |a_m|^p\right)^{\frac{1}{p}} = \left(\sum_{i=1}^{m}|a_i|^p\right)^{\frac{1}{p}}$$
for any p ̸= 0. Thus, the Euclidean norm corresponds to the case when p = 2.
Distance
From the Euclidean norm we can define the Euclidean distance between a and b, as follows
$$\delta(\mathbf{a},\mathbf{b}) = \|\mathbf{a} - \mathbf{b}\| = \sqrt{(\mathbf{a}-\mathbf{b})^T(\mathbf{a}-\mathbf{b})} = \sqrt{\sum_{i=1}^{m}(a_i - b_i)^2} \tag{1.1}$$
Thus, the length of a vector is simply its distance from the zero vector 0, all of whose elements are 0, that is, ∥a∥ = ∥a − 0∥ = δ(a,0).
From the general Lp-norm we can define the corresponding Lp-distance function, given as follows
$$\delta_p(\mathbf{a},\mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_p \tag{1.2}$$
If p is unspecified, as in Eq. (1.1), it is assumed to be p = 2 by default.
Angle
The cosine of the smallest angle between vectors a and b, also called the cosine similarity, is given as
$$\cos\theta = \frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|} = \left(\frac{\mathbf{a}}{\|\mathbf{a}\|}\right)^T\left(\frac{\mathbf{b}}{\|\mathbf{b}\|}\right) \tag{1.3}$$
Thus, the cosine of the angle between a and b is given as the dot product of the unit vectors $\mathbf{a}/\|\mathbf{a}\|$ and $\mathbf{b}/\|\mathbf{b}\|$.
The Cauchy–Schwarz inequality states that for any vectors a and b in $\mathbb{R}^m$
$$|\mathbf{a}^T\mathbf{b}| \leq \|\mathbf{a}\| \cdot \|\mathbf{b}\|$$
It follows immediately from the Cauchy–Schwarz inequality that
$$-1 \leq \cos\theta \leq 1$$
Figure 1.3. Distance and angle. Unit vectors are shown in gray.
Because the smallest angle θ ∈ [0◦,180◦] and because cosθ ∈ [−1,1], the cosine similarity value ranges from +1, corresponding to an angle of 0◦, to −1, corresponding to an angle of 180◦ (or π radians).
Orthogonality
Two vectors a and b are said to be orthogonal if and only if $\mathbf{a}^T\mathbf{b} = 0$, which in turn implies that $\cos\theta = 0$, that is, the angle between them is 90° or $\frac{\pi}{2}$ radians. In this case, we say that they have no similarity.
Example 1.3 (Distance and Angle). Figure 1.3 shows the two vectors
$$\mathbf{a} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} \qquad \text{and} \qquad \mathbf{b} = \begin{pmatrix} 1 \\ 4 \end{pmatrix}$$
Using Eq. (1.1), the Euclidean distance between them is given as
$$\delta(\mathbf{a},\mathbf{b}) = \sqrt{(5-1)^2 + (3-4)^2} = \sqrt{16+1} = \sqrt{17} = 4.12$$
The distance can also be computed as the magnitude of the vector
$$\mathbf{a} - \mathbf{b} = \begin{pmatrix} 5 \\ 3 \end{pmatrix} - \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 4 \\ -1 \end{pmatrix}$$
because $\|\mathbf{a} - \mathbf{b}\| = \sqrt{4^2 + (-1)^2} = \sqrt{17} = 4.12$.
The unit vector in the direction of a is given as
$$\mathbf{u}_a = \frac{\mathbf{a}}{\|\mathbf{a}\|} = \frac{1}{\sqrt{5^2+3^2}}\begin{pmatrix} 5 \\ 3 \end{pmatrix} = \frac{1}{\sqrt{34}}\begin{pmatrix} 5 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.86 \\ 0.51 \end{pmatrix}$$
The unit vector in the direction of b can be computed similarly:
$$\mathbf{u}_b = \begin{pmatrix} 0.24 \\ 0.97 \end{pmatrix}$$
These unit vectors are also shown in gray in Figure 1.3.
By Eq. (1.3) the cosine of the angle between a and b is given as
$$\cos\theta = \frac{\begin{pmatrix} 5 & 3 \end{pmatrix}\begin{pmatrix} 1 \\ 4 \end{pmatrix}}{\sqrt{5^2+3^2}\,\sqrt{1^2+4^2}} = \frac{17}{\sqrt{34 \times 17}} = \frac{1}{\sqrt{2}}$$
We can get the angle by computing the inverse of the cosine:
$$\theta = \cos^{-1}\left(1/\sqrt{2}\right) = 45^\circ$$
Let us consider the $L_p$-norm for a with p = 3; we get
$$\|\mathbf{a}\|_3 = \left(5^3 + 3^3\right)^{1/3} = (152)^{1/3} = 5.34$$
The distance between a and b using Eq. (1.2) for the $L_p$-norm with p = 3 is given as
$$\|\mathbf{a} - \mathbf{b}\|_3 = \left\|(4,-1)^T\right\|_3 = \left(4^3 + |-1|^3\right)^{1/3} = (65)^{1/3} = 4.02$$
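The computations in Example 1.3 can be verified with a few NumPy calls; this is a small sketch of our own (variable names are illustrative), not part of the text.

```python
import numpy as np

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

dist = np.linalg.norm(a - b)                  # Euclidean (L2) distance, Eq. (1.1): ~4.123
u_a = a / np.linalg.norm(a)                   # unit vector along a: ~(0.86, 0.51)
u_b = b / np.linalg.norm(b)                   # unit vector along b: ~(0.24, 0.97)
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # Eq. (1.3): ~0.707
theta_deg = np.degrees(np.arccos(cos_theta))  # angle between a and b: 45 degrees
norm3_a = np.linalg.norm(a, ord=3)            # L3-norm of a: ~5.34
dist3 = np.linalg.norm(a - b, ord=3)          # L3-distance, Eq. (1.2): ~4.02

print(dist, u_a, u_b, cos_theta, theta_deg, norm3_a, dist3)
```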
1.3.2 Mean and Total Variance
Mean
The mean of the data matrix D is the vector obtained as the average of all the points:
$$mean(\mathbf{D}) = \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$$

Total Variance
The total variance of the data matrix D is the average squared distance of each point from the mean:
$$var(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n}\delta(\mathbf{x}_i,\boldsymbol{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i - \boldsymbol{\mu}\|^2 \tag{1.4}$$
Simplifying Eq. (1.4) we obtain
$$\begin{aligned}
var(\mathbf{D}) &= \frac{1}{n}\sum_{i=1}^{n}\left(\|\mathbf{x}_i\|^2 - 2\,\mathbf{x}_i^T\boldsymbol{\mu} + \|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\right) + n\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - 2n\,\boldsymbol{\mu}^T\boldsymbol{\mu} + n\|\boldsymbol{\mu}\|^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}\|\mathbf{x}_i\|^2 - \|\boldsymbol{\mu}\|^2
\end{aligned}$$
The total variance is thus the difference between the average of the squared magnitude of the data points and the squared magnitude of the mean (average of the points).
Centered Data Matrix
Often we need to center the data matrix by making the mean coincide with the origin of the data space. The centered data matrix is obtained by subtracting the mean from all the points:
$$\mathbf{Z} = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}^T = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} - \begin{pmatrix} \boldsymbol{\mu}^T \\ \boldsymbol{\mu}^T \\ \vdots \\ \boldsymbol{\mu}^T \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T - \boldsymbol{\mu}^T \\ \mathbf{x}_2^T - \boldsymbol{\mu}^T \\ \vdots \\ \mathbf{x}_n^T - \boldsymbol{\mu}^T \end{pmatrix} = \begin{pmatrix} \mathbf{z}_1^T \\ \mathbf{z}_2^T \\ \vdots \\ \mathbf{z}_n^T \end{pmatrix} \tag{1.5}$$
where zi = xi − μ represents the centered point corresponding to xi , and 1 ∈ Rn is the n-dimensional vector all of whose elements have value 1. The mean of the centered data matrix Z is 0 ∈ Rd , because we have subtracted the mean μ from all the points xi .
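As a quick illustration (our own sketch, using only the two-attribute, eight-point extract of Table 1.1 rather than the full Iris data), the mean vector, total variance [Eq. (1.4)], and centered data matrix [Eq. (1.5)] can be computed as follows.

```python
import numpy as np

D = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2],
              [6.0, 2.2], [4.7, 3.2], [6.5, 3.0], [5.8, 2.7]])
n = D.shape[0]

mu = D.mean(axis=0)                    # mean vector: average of all points
total_var = np.sum((D - mu) ** 2) / n  # Eq. (1.4): average squared distance from the mean
Z = D - mu                             # centered data matrix, Eq. (1.5); broadcasting subtracts mu from every row

print(mu, total_var)
print(np.allclose(Z.mean(axis=0), 0))  # the mean of the centered data is the zero vector
```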
1.3.3 Orthogonal Projection
Often in data mining we need to project a point or vector onto another vector, for example, to obtain a new point after a change of the basis vectors. Let a, b ∈ Rm be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction
of another vector a, illustrated in Figure 1.4, is given as
$$\mathbf{b} = \mathbf{b}_\parallel + \mathbf{b}_\perp = \mathbf{p} + \mathbf{r} \tag{1.6}$$
where $\mathbf{p} = \mathbf{b}_\parallel$ is parallel to a, and $\mathbf{r} = \mathbf{b}_\perp$ is perpendicular or orthogonal to a. The vector p is called the orthogonal projection or simply projection of b on the vector a. Note that the point $\mathbf{p} \in \mathbb{R}^m$ is the point closest to b on the line passing through a. Thus, the magnitude of the vector $\mathbf{r} = \mathbf{b} - \mathbf{p}$ gives the perpendicular distance between b and a, which is often interpreted as the residual or error vector between the points b and p.

Figure 1.4. Orthogonal projection.
We can derive an expression for p by noting that p = ca for some scalar c, as p is parallel to a. Thus, r = b − p = b − ca. Because p and r are orthogonal, we have
$$\mathbf{p}^T\mathbf{r} = (c\mathbf{a})^T(\mathbf{b} - c\mathbf{a}) = c\,\mathbf{a}^T\mathbf{b} - c^2\,\mathbf{a}^T\mathbf{a} = 0$$
which implies that
$$c = \frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}$$
Therefore, the projection of b on a is given as
$$\mathbf{p} = \mathbf{b}_\parallel = c\mathbf{a} = \left(\frac{\mathbf{a}^T\mathbf{b}}{\mathbf{a}^T\mathbf{a}}\right)\mathbf{a} \tag{1.7}$$
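Eq. (1.7) translates directly into code. The sketch below (our own; the function name is illustrative) projects b onto a and checks that the residual r = b − p is orthogonal to a.

```python
import numpy as np

def project(b, a):
    """Orthogonal projection of vector b onto vector a, Eq. (1.7): p = (a^T b / a^T a) a."""
    c = (a @ b) / (a @ a)
    return c * a

a = np.array([5.0, 3.0])
b = np.array([1.0, 4.0])

p = project(b, a)     # component of b parallel to a
r = b - p             # residual, perpendicular to a
print(p, r, np.isclose(a @ r, 0.0))   # a^T r is (numerically) zero
```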
Example 1.4. Restricting the Iris dataset to the first two dimensions, sepal length and sepal width, the mean point is given as
$$mean(\mathbf{D}) = \begin{pmatrix} 5.843 \\ 3.054 \end{pmatrix}$$
which is shown as the black circle in Figure 1.2. The corresponding centered data is shown in Figure 1.5, and the total variance is var(D) = 0.868 (centering does not change this value).
Figure 1.5 shows the projection of each point onto the line l, which is the line that maximizes the separation of the class iris-setosa (squares) from the other two classes, namely iris-versicolor (circles) and iris-virginica (triangles). The line l is given as the set of all the points $(x_1, x_2)^T$ satisfying the constraint
$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = c\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix}$$
for all scalars $c \in \mathbb{R}$.

Figure 1.5. Projecting the centered data onto the line l.
1.3.4 Linear Independence and Dimensionality
Given the data matrix
$$\mathbf{D} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_n \end{pmatrix}^T = \begin{pmatrix} X_1 & X_2 & \cdots & X_d \end{pmatrix}$$
we are often interested in the linear combinations of the rows (points) or the columns (attributes). For instance, different linear combinations of the original d attributes yield new derived attributes, which play a key role in feature extraction and dimensionality reduction.
Given any set of vectors v1,v2,…,vk in an m-dimensional vector space Rm, their linear combination is given as
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k$$
where $c_i \in \mathbb{R}$ are scalar values. The set of all possible linear combinations of the k vectors is called the span, denoted as $span(\mathbf{v}_1, \ldots, \mathbf{v}_k)$, which is itself a vector space being a subspace of $\mathbb{R}^m$. If $span(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \mathbb{R}^m$, then we say that $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is a spanning set for $\mathbb{R}^m$.
Row and Column Space
There are several interesting vector spaces associated with the data matrix D, two of which are the column space and row space of D. The column space of D, denoted col(D), is the set of all linear combinations of the d attributes Xj ∈ Rn, that is,
col(D)=span(X1,X2,…,Xd)
By definition col(D) is a subspace of Rn. The row space of D, denoted row(D), is the
set of all linear combinations of the n points xi ∈ Rd , that is, row(D) = span(x1,x2,…,xn)
By definition row(D) is a subspace of Rd. Note also that the row space of D is the column space of DT:
row(D) = col(DT)
Linear Independence
We say that the vectors v1,…,vk are linearly dependent if at least one vector can be written as a linear combination of the others. Alternatively, the k vectors are linearly dependent if there are scalars c1,c2,…,ck, at least one of which is not zero, such that
c1v1 +c2v2 +···+ckvk =0
On the other hand, v1,··· ,vk are linearly independent if and only if
$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \text{ implies } c_1 = c_2 = \cdots = c_k = 0$$
Simply put, a set of vectors is linearly independent if none of them can be written as a
linear combination of the other vectors in the set.
Dimension and Rank
Let S be a subspace of Rm. A basis for S is a set of vectors in S, say v1,…,vk, that are linearly independent and they span S, that is, span(v1,…,vk) = S. In fact, a basis is a minimal spanning set. If the vectors in the basis are pairwise orthogonal, they are said to form an orthogonal basis for S. If, in addition, they are also normalized to be unit vectors, then they make up an orthonormal basis for S. For instance, the standard basis for Rm is an orthonormal basis consisting of the vectors
$$\mathbf{e}_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \quad \mathbf{e}_2 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix} \quad \cdots \quad \mathbf{e}_m = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}$$
Any two bases for S must have the same number of vectors, and the number of vectors in a basis for S is called the dimension of S, denoted as dim(S). Because S is a subspace of Rm, we must have dim(S) ≤ m.
It is a remarkable fact that, for any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix D ∈ Rn×d, we have rank(D) ≤ min(n,d), which follows from the fact that the column space can have dimension at most d, and the row space can have dimension at most n. Thus, even though the data points are ostensibly in a d dimensional attribute space (the extrinsic dimensionality), if rank(D) < d, then the data points reside in a lower dimensional subspace of Rd, and in this case rank(D) gives an indication about the intrinsic dimensionality of the data. In fact, with dimensionality reduction methods it is often possible to approximate D ∈ Rn×d with a derived data matrix D′ ∈ Rn×k, which has much lower dimensionality, that is, k ≪ d. In this case k may reflect the “true” intrinsic dimensionality of the data.
Example 1.5. The line l in Figure 1.5 is given as $l = span\left(\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix}\right)$, with dim(l) = 1. After normalization, we obtain the orthonormal basis for l as the unit vector
$$\frac{1}{\sqrt{12.19}}\begin{pmatrix} -2.15 \\ 2.75 \end{pmatrix} = \begin{pmatrix} -0.615 \\ 0.788 \end{pmatrix}$$
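The orthonormal basis of Example 1.5 and the notion of rank as intrinsic dimensionality can be checked numerically; the following sketch (ours, with a made-up rank-deficient matrix for illustration) uses np.linalg.matrix_rank.

```python
import numpy as np

# Direction vector of the line l from Example 1.5, and its unit-norm (orthonormal) basis.
v = np.array([-2.15, 2.75])
u = v / np.linalg.norm(v)           # ~(-0.615, 0.788)

# The rank of a data matrix bounds its intrinsic dimensionality: here the second
# column is a multiple of the first, so rank(D) = 1 even though d = 2.
D = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(u, np.linalg.matrix_rank(D))  # rank(D) = 1 <= min(n, d)
```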
Table 1.2. Iris dataset: sepal length (in centimeters).
5.9 6.9 6.6 4.6 6.0 4.7 6.5 5.8 6.7 6.7 5.1 5.1 5.7 6.1 4.9 5.0 5.0 5.7 5.0 7.2 5.9 6.5 5.7 5.5 4.9 5.0 5.5 4.6 7.2 6.8 5.4 5.0 5.7 5.8 5.1 5.6 5.8 5.1 6.3 6.3 5.6 6.1 6.8 7.3 5.6 4.8 7.1 5.7 5.3 5.7 5.7 5.6 4.4 6.3 5.4 6.3 6.9 7.7 6.1 5.6 6.1 6.4 5.0 5.1 5.6 5.4 5.8 4.9 4.6 5.2 7.9 7.7 6.1 5.5 4.6 4.7 4.4 6.2 4.8 6.0 6.2 5.0 6.4 6.3 6.7 5.0 5.9 6.7 5.4 6.3 4.8 4.4 6.4 6.2 6.0 7.4 4.9 7.0 5.5 6.3 6.8 6.1 6.5 6.7 6.7 4.8 4.9 6.9 4.5 4.3 5.2 5.0 6.4 5.2 5.8 5.5 7.6 6.3 6.4 6.3 5.8 5.0 6.7 6.0 5.1 4.8 5.7 5.1 6.6 6.4 5.2 6.4 7.7 5.8 4.9 5.4 5.1 6.0 6.5 5.5 7.2 6.9 6.2 6.5 6.0 5.4 5.5 6.7 7.7 5.1
1.4 DATA: PROBABILISTIC VIEW
The probabilistic view of the data assumes that each numeric attribute X is a random variable, defined as a function that assigns a real number to each outcome of an experiment (i.e., some process of observation or measurement). Formally, X is a function X : O → R, where O, the domain of X, is the set of all possible outcomes of the experiment, also called the sample space, and R, the range of X, is the set of real numbers. If the outcomes are numeric, and represent the observed values of the random variable, then X : O → O is simply the identity function: X(v) = v for all v ∈ O. The distinction between the outcomes and the value of the random variable is important, as we may want to treat the observed values differently depending on the context, as seen in Example 1.6.
A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, whereas X is called a continuous random variable if it can take on any value in its range.
Example 1.6. Consider the sepal length attribute (X1) for the Iris dataset in Table 1.1. All n = 150 values of this attribute are shown in Table 1.2, which lie in the range [4.3,7.9], with centimeters as the unit of measurement. Let us assume that these constitute the set of all possible outcomes O.
By default, we can consider the attribute X1 to be a continuous random variable, given as the identity function X1(v) = v, because the outcomes (sepal length values) are all numeric.
On the other hand, if we want to distinguish between Iris flowers with short and long sepal lengths, with long being, say, a length of 7 cm or more, we can define a discrete random variable A as follows:
$$A(v) = \begin{cases} 0 & \text{if } v < 7 \\ 1 & \text{if } v \geq 7 \end{cases}$$
In this case the domain of A is [4.3,7.9], and its range is {0,1}. Thus, A assumes nonzero probability only at the discrete values 0 and 1.
Probability Mass Function
If X is discrete, the probability mass function of X is defined as
$$f(x) = P(X = x) \quad \text{for all } x \in \mathbb{R}$$
In other words, the function f gives the probability P(X = x) that the random variable X has the exact value x. The name "probability mass function" intuitively conveys the fact that the probability is concentrated or massed at only discrete values in the range of X, and is zero for all other values. f must also obey the basic rules of probability. That is, f must be non-negative:
$$f(x) \geq 0$$
and the sum of all probabilities should add to 1:
$$\sum_x f(x) = 1$$
Example 1.7 (Bernoulli and Binomial Distribution). In Example 1.6, A was defined as a discrete random variable representing long sepal length. From the sepal length data in Table 1.2 we find that only 13 Irises have sepal length of at least 7 cm. We can thus estimate the probability mass function of A as follows:
$$f(1) = P(A = 1) = \frac{13}{150} = 0.087 = p$$
and
$$f(0) = P(A = 0) = \frac{137}{150} = 0.913 = 1 - p$$
In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1], which denotes the probability of a success, that is, the probability of picking an Iris with a long sepal length at random from the set of all points. On the other hand, 1 − p is the probability of a failure, that is, of not picking an Iris with long sepal length.
Let us consider another discrete random variable B, denoting the number of Irises with long sepal length in m independent Bernoulli trials with probability of success p. In this case, B takes on the discrete values [0,m], and its probability mass function is given by the Binomial distribution
$$f(k) = P(B = k) = \binom{m}{k} p^k (1-p)^{m-k}$$
The formula can be understood as follows. There are $\binom{m}{k}$ ways of picking k long sepal length Irises out of the m trials. For each selection of k long sepal length Irises, the total probability of the k successes is $p^k$, and the total probability of m − k failures is $(1-p)^{m-k}$. For example, because p = 0.087 from above, the probability of observing exactly k = 2 Irises with long sepal length in m = 10 trials is given as
$$f(2) = P(B = 2) = \binom{10}{2}(0.087)^2(0.913)^8 = 0.164$$
Figure 1.6 shows the full probability mass function for different values of k for m = 10. Because p is quite small, the probability of k successes in so few trials falls off rapidly as k increases, becoming practically zero for values of k ≥ 6.
Figure 1.6. Binomial distribution: probability mass function (m = 10, p = 0.087).
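The Binomial probabilities of Example 1.7 are easy to reproduce; here is a small sketch of our own using math.comb for the binomial coefficient, with m = 10 and p = 13/150.

```python
from math import comb

p = 13 / 150          # estimated probability of a long sepal length (>= 7 cm)
m = 10                # number of independent Bernoulli trials

def binomial_pmf(k, m, p):
    """P(B = k) = C(m, k) * p^k * (1 - p)^(m - k)."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)

print(round(binomial_pmf(2, m, p), 3))          # ~0.164, matching Example 1.7
print([round(binomial_pmf(k, m, p), 4) for k in range(m + 1)])  # full PMF shown in Figure 1.6
```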
Probability Density Function
If X is continuous, its range is the entire set of real numbers R. The probability of any specific value x is only one out of the infinitely many possible values in the range of X, which means that P(X = x) = 0 for all x ∈ R. However, this does not mean that the value x is impossible, because in that case we would conclude that all values are impossible! What it means is that the probability mass is spread so thinly over the range of values that it can be measured only over intervals [a,b] ⊂ R, rather than at specific points. Thus, instead of the probability mass function, we define the probability density function, which specifies the probability that the variable X takes on values in any interval [a,b] ⊂ R:
$$P\left(X \in [a,b]\right) = \int_a^b f(x)\,dx$$
As before, the density function f must satisfy the basic laws of probability:
$$f(x) \geq 0, \text{ for all } x \in \mathbb{R}$$
and
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
We can get an intuitive understanding of the density function f by considering the probability density over a small interval of width 2ǫ > 0, centered at x, namely
$[x - \epsilon, x + \epsilon]$:
$$P\left(X \in [x-\epsilon, x+\epsilon]\right) = \int_{x-\epsilon}^{x+\epsilon} f(x)\,dx \simeq 2\epsilon \cdot f(x)$$
$$f(x) \simeq \frac{P\left(X \in [x-\epsilon, x+\epsilon]\right)}{2\epsilon} \tag{1.8}$$
f (x) thus gives the probability density at x, given as the ratio of the probability mass to the width of the interval, that is, the probability mass per unit distance. Thus, it is important to note that P (X = x) ̸= f (x).
Even though the probability density function f (x) does not specify the probability P (X = x ), it can be used to obtain the relative probability of one value x1 over another x2 because for a given ǫ > 0, by Eq. (1.8), we have
$$\frac{P(X \in [x_1-\epsilon, x_1+\epsilon])}{P(X \in [x_2-\epsilon, x_2+\epsilon])} \simeq \frac{2\epsilon \cdot f(x_1)}{2\epsilon \cdot f(x_2)} = \frac{f(x_1)}{f(x_2)} \tag{1.9}$$
Thus, if f (x1 ) is larger than f (x2 ), then values of X close to x1 are more probable than values close to x2, and vice versa.
Example 1.8 (Normal Distribution). Consider again the sepal length values from the Iris dataset, as shown in Table 1.2. Let us assume that these values follow a Gaussian or normal density function, given as
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$
There are two parameters of the normal density distribution, namely, μ, which represents the mean value, and σ 2 , which represents the variance of the values (these parameters are discussed in Chapter 2). Figure 1.7 shows the characteristic “bell” shape plot of the normal distribution. The parameters, μ = 5.84 and σ 2 = 0.681, were estimated directly from the data for sepal length in Table 1.2.
Whereas $f(x = \mu) = f(5.84) = \frac{1}{\sqrt{2\pi \cdot 0.681}}\exp\{0\} = 0.483$, we emphasize that
the probability of observing X = μ is zero, that is, P(X = μ) = 0. Thus, P(X = x) is not given by f(x), rather, P(X = x) is given as the area under the curve for an infinitesimally small interval [x − ǫ,x + ǫ] centered at x, with ǫ > 0. Figure 1.7 illustrates this with the shaded region centered at μ = 5.84. From Eq. (1.8), we have
$$P(X = \mu) \simeq 2\epsilon \cdot f(\mu) = 2\epsilon \cdot 0.483 = 0.967\epsilon$$
As ǫ → 0, we get P (X = μ) → 0. However, based on Eq. (1.9) we can claim that the probability of observing values close to the mean value μ = 5.84 is 2.69 times the probability of observing values close to x = 7, as
$$\frac{f(5.84)}{f(7)} = \frac{0.483}{0.18} = 2.69$$
Figure 1.7. Normal distribution: probability density function (μ = 5.84, σ² = 0.681).
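The density values quoted in Example 1.8 can be reproduced directly from the normal density formula; the sketch below (ours) evaluates f(x) at the mean and at x = 7 and takes their ratio, as in Eq. (1.9).

```python
from math import sqrt, exp, pi

mu, var = 5.84, 0.681   # parameters estimated from the sepal length data

def normal_pdf(x, mu, var):
    """Gaussian density f(x) = (1 / sqrt(2*pi*var)) * exp(-(x - mu)^2 / (2*var))."""
    return (1.0 / sqrt(2 * pi * var)) * exp(-(x - mu) ** 2 / (2 * var))

f_mu = normal_pdf(mu, mu, var)   # ~0.483; a density value, not a probability
f_7 = normal_pdf(7.0, mu, var)   # ~0.18
print(round(f_mu, 3), round(f_7, 2), round(f_mu / f_7, 2))   # ratio ~2.69
```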
Cumulative Distribution Function
For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) F : R → [0,1], which gives the probability of observing a value at most some given value x:
$$F(x) = P(X \leq x) \quad \text{for all } -\infty < x < \infty$$
is a binary indicator variable that indicates whether the given condition is satisfied
or not. Intuitively, to obtain the empirical CDF we compute, for each value x ∈ R,
how many points in the sample are less than or equal to x. The empirical CDF puts a
probability mass of 1/n at each point $x_i$. Note that we use the notation $\hat{F}$ to denote the fact that the empirical CDF is an estimate for the unknown population CDF F.
Inverse Cumulative Distribution Function
Define the inverse cumulative distribution function or quantile function for a random variable X as follows:
$$F^{-1}(q) = \min\{x \mid F(x) \geq q\} \quad \text{for } q \in [0,1] \tag{2.2}$$
That is, the inverse CDF gives the least value of X, for which q fraction of the values are lower, and 1 − q fraction of the values are higher. The empirical inverse cumulative distribution function Fˆ −1 can be obtained from Eq. (2.1).
Empirical Probability Mass Function
The empirical probability mass function (PMF) of X is given as
$$\hat{f}(x) = P(X = x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i = x) \tag{2.3}$$
where
$$I(x_i = x) = \begin{cases} 1 & \text{if } x_i = x \\ 0 & \text{if } x_i \neq x \end{cases}$$
The empirical PMF also puts a probability mass of 1/n at each point $x_i$.
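The empirical CDF, its inverse (the quantile function), and the empirical PMF of Eqs. (2.1)–(2.3) amount to a few lines of NumPy; this is a minimal sketch with our own function names and a toy sample.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF, Eq. (2.1): fraction of sample points <= x."""
    return np.mean(np.asarray(sample) <= x)

def ecdf_inv(sample, q):
    """Empirical inverse CDF, Eq. (2.2): smallest sample value x with F_hat(x) >= q."""
    xs = np.sort(np.asarray(sample))
    fracs = np.arange(1, len(xs) + 1) / len(xs)   # F_hat evaluated at the sorted values
    return xs[np.searchsorted(fracs, q)]

def epmf(sample, x):
    """Empirical PMF, Eq. (2.3): fraction of sample points exactly equal to x."""
    return np.mean(np.asarray(sample) == x)

data = [5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8, 6.7, 6.7]
print(ecdf(data, 6.0), ecdf_inv(data, 0.5), epmf(data, 6.7))
```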
2.1.1 Measures of Central Tendency
These measures give an indication about the concentration of the probability mass, the "middle" values, and so on.
Mean
The mean, also called the expected value, of a random variable X is the arithmetic average of the values of X. It provides a one-number summary of the location or central tendency for the distribution of X.
The mean or expected value of a discrete random variable X is defined as
$$\mu = E[X] = \sum_x x f(x) \tag{2.4}$$
where f(x) is the probability mass function of X.
The expected value of a continuous random variable X is defined as
$$\mu = E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$
where f (x) is the probability density function of X.
Sample Mean The sample mean is a statistic, that is, a function μˆ : {x1,x2,…,xn} → R,
defined as the average value of xi ’s:
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{2.5}$$
It serves as an estimator for the unknown mean value μ of X. It can be derived by plugging in the empirical PMF fˆ(x) in Eq. (2.4):
$$\hat\mu = \sum_x x\,\hat{f}(x) = \sum_x x\left(\frac{1}{n}\sum_{i=1}^{n} I(x_i = x)\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Sample Mean Is Unbiased An estimator θˆ is called an unbiased estimator for parameter θ if E[θˆ ] = θ for every possible value of θ . The sample mean μˆ is an unbiased estimator for the population mean μ, as
$$E[\hat\mu] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i] = \frac{1}{n}\,n\mu = \mu \tag{2.6}$$
where we use the fact that the random variables xi are IID according to X, which implies that they have the same mean μ as X, that is, E[xi ] = μ for all xi . We also used the fact that the expectation function E is a linear operator, that is, for any two random variables X and Y, and real numbers a and b, we have E [aX + bY] = aE[X] + bE[Y].
Robustness We say that a statistic is robust if it is not affected by extreme values (such as outliers) in the data. The sample mean is unfortunately not robust because a single large value (an outlier) can skew the average. A more robust measure is the trimmed mean obtained after discarding a small fraction of extreme values on one or both ends. Furthermore, the mean can be somewhat misleading in that it is typically not a value that occurs in the sample, and it may not even be a value that the random variable can actually assume (for a discrete random variable). For example, the number of cars per capita is an integer-valued random variable, but according to the US Bureau of Transportation Studies, the average number of passenger cars in the United States was 0.45 in 2008 (137.1 million cars, with a population size of 304.4 million). Obviously, one cannot own 0.45 cars; it can be interpreted as saying that on average there are 45 cars per 100 people.
Median
The median of a random variable is defined as the value m such that
$$P(X \leq m) \geq \frac{1}{2} \quad \text{and} \quad P(X \geq m) \geq \frac{1}{2}$$
In other words, the median m is the “middle-most” value; half of the values of X are less and half of the values of X are more than m. In terms of the (inverse) cumulative distribution function, the median is therefore the value m for which
F (m) = 0.5 or m = F −1 (0.5)
The sample median can be obtained from the empirical CDF [Eq. (2.1)] or the
empirical inverse CDF [Eq. (2.2)] by computing
$$\hat{F}(m) = 0.5 \quad \text{or} \quad m = \hat{F}^{-1}(0.5)$$
A simpler approach to compute the sample median is to first sort all the values $x_i$ ($i \in [1,n]$) in increasing order. If n is odd, the median is the value at position $\frac{n+1}{2}$. If n is even, the values at positions $\frac{n}{2}$ and $\frac{n}{2}+1$ are both medians.
Unlike the mean, median is robust, as it is not affected very much by extreme values. Also, it is a value that occurs in the sample and a value the random variable can actually assume.
Mode
The mode of a random variable X is the value at which the probability mass function or the probability density function attains its maximum value, depending on whether X is discrete or continuous, respectively.
The sample mode is a value for which the empirical probability mass function [Eq. (2.3)] attains its maximum, given as
$$mode(X) = \arg\max_x \hat{f}(x)$$
The mode may not be a very useful measure of central tendency for a sample because by chance an unrepresentative element may be the most frequent element. Furthermore, if all values in the sample are distinct, each of them will be the mode.
Example 2.1 (Sample Mean, Median, and Mode). Consider the attribute sepal length (X1) in the Iris dataset, whose values are shown in Table 1.2. The sample mean is given as follows:
$$\hat\mu = \frac{1}{150}(5.9 + 6.9 + \cdots + 7.7 + 5.1) = \frac{876.5}{150} = 5.843$$
Figure 2.1 shows all 150 values of sepal length, and the sample mean. Figure 2.2a shows the empirical CDF and Figure 2.2b shows the empirical inverse CDF for sepal length.
Because n = 150 is even, the sample median is the value at positions $\frac{n}{2} = 75$ and $\frac{n}{2}+1 = 76$ in sorted order. For sepal length both these values are 5.8; thus the sample median is 5.8. From the inverse CDF in Figure 2.2b, we can see that
$$\hat{F}(5.8) = 0.5 \quad \text{or} \quad 5.8 = \hat{F}^{-1}(0.5)$$
The sample mode for sepal length is 5, which can be observed from the frequency of 5 in Figure 2.1. The empirical probability mass at x = 5 is
$$\hat{f}(5) = \frac{10}{150} = 0.067$$
Figure 2.1. Sample mean for sepal length. Multiple occurrences of the same value are shown stacked.

Figure 2.2. Empirical CDF and inverse CDF: sepal length. (a) Empirical CDF. (b) Empirical inverse CDF.
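A small sketch (ours, on a toy ten-value sample rather than the full sepal length data) showing how the sample mean, median, and mode of Example 2.1 would typically be computed.

```python
import numpy as np
from collections import Counter

x = np.array([5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8, 5.0, 5.0])

mean = x.mean()                        # sample mean, Eq. (2.5)
median = np.median(x)                  # middle value(s) of the sorted sample
mode, count = Counter(x.tolist()).most_common(1)[0]   # most frequent value and its count

print(mean, median, mode, count / len(x))   # last value: empirical PMF at the mode
```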
2.1.2 Measures of Dispersion
The measures of dispersion give an indication about the spread or variation in the values of a random variable.
Range
The value range or simply range of a random variable X is the difference between the maximum and minimum values of X, given as
r = max{X} − min{X}
The (value) range of X is a population parameter, not to be confused with the range of the function X, which is the set of all the values X can assume. Which range is being used should be clear from the context.
The sample range is a statistic, given as
$$\hat{r} = \max_{i=1}^{n}\{x_i\} - \min_{i=1}^{n}\{x_i\}$$
By definition, range is sensitive to extreme values, and thus is not robust.
Interquartile Range
Quartiles are special values of the quantile function [Eq. (2.2)] that divide the data into four equal parts. That is, quartiles correspond to the quantile values of 0.25, 0.5, 0.75, and 1.0. The first quartile is the value q1 = F−1(0.25), to the left of which 25% of the points lie; the second quartile is the same as the median value q2 = F −1 (0.5), to the left of which 50% of the points lie; the third quartile q3 = F −1 (0.75) is the value to the left of which 75% of the points lie; and the fourth quartile is the maximum value of X, to the left of which 100% of the points lie.
A more robust measure of the dispersion of X is the interquartile range (IQR), defined as
$$IQR = q_3 - q_1 = F^{-1}(0.75) - F^{-1}(0.25) \tag{2.7}$$
IQR can also be thought of as a trimmed range, where we discard 25% of the low and high values of X. Or put differently, it is the range for the middle 50% of the values of X. IQR is robust by definition.
The sample IQR can be obtained by plugging in the empirical inverse CDF in Eq. (2.7):
$$\widehat{IQR} = \hat{q}_3 - \hat{q}_1 = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$$
Variance and Standard Deviation
The variance of a random variable X provides a measure of how much the values of X deviate from the mean or expected value of X. More formally, variance is the expected
value of the squared deviation from the mean, defined as
$$\sigma^2 = var(X) = E[(X-\mu)^2] = \begin{cases} \displaystyle\sum_x (x-\mu)^2 f(x) & \text{if X is discrete} \\[6pt] \displaystyle\int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx & \text{if X is continuous} \end{cases} \tag{2.8}$$
The standard deviation, σ , is defined as the positive square root of the variance, σ 2 . We can also write the variance as the difference between the expectation of X2 and
the square of the expectation of X:
$$\begin{aligned}\sigma^2 = var(X) &= E[(X-\mu)^2] = E[X^2 - 2\mu X + \mu^2] \\ &= E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 \\ &= E[X^2] - (E[X])^2 \end{aligned} \tag{2.9}$$
It is worth noting that variance is in fact the second moment about the mean, corresponding to r = 2, which is a special case of the rth moment about the mean for a random variable X, defined as E [(x − μ)r ].
Sample Variance The sample variance is defined as
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2 \tag{2.10}$$
It is the average squared deviation of the data values xi from the sample mean μˆ , and can be derived by plugging in the empirical probability function fˆ from Eq. (2.3) into Eq. (2.8), as
$$\hat\sigma^2 = \sum_x (x - \hat\mu)^2 \hat{f}(x) = \sum_x (x - \hat\mu)^2\left(\frac{1}{n}\sum_{i=1}^{n} I(x_i = x)\right) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2$$
The sample standard deviation is given as the positive square root of the sample
variance:
$$\hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2}$$
The standard score, also called the z-score, of a sample value xi is the number of standard deviations the value is away from the mean:
$$z_i = \frac{x_i - \hat\mu}{\hat\sigma}$$
Put differently, the z-score of $x_i$ measures the deviation of $x_i$ from the mean value $\hat\mu$, in units of $\hat\sigma$.
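The dispersion measures of this section map onto short NumPy expressions; the sketch below (ours, on a toy sample) computes the sample range, IQR, variance, standard deviation, and z-scores. Note that np.var divides by n by default, matching Eq. (2.10), while np.quantile interpolates between sample values by default, which can differ slightly from the empirical inverse CDF of Eq. (2.2).

```python
import numpy as np

x = np.array([5.9, 6.9, 6.6, 4.6, 6.0, 4.7, 6.5, 5.8, 6.7, 5.1])

rng = x.max() - x.min()                 # sample range
q1, q3 = np.quantile(x, [0.25, 0.75])   # sample quartiles (linear interpolation by default)
iqr = q3 - q1                           # interquartile range, Eq. (2.7)
var = x.var()                           # sample variance, Eq. (2.10) (divides by n)
std = x.std()                           # sample standard deviation
z = (x - x.mean()) / std                # z-scores: deviations in units of sigma_hat

print(rng, iqr, var, std)
print(np.round(z, 2))
```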
Geometric Interpretation of Sample Variance We can treat the data sample for attribute X as a vector in n-dimensional space, where n is the sample size. That is, we write X = (x1,x2,…,xn)T ∈ Rn. Further, let
$$Z = X - \mathbf{1}\cdot\hat\mu = \begin{pmatrix} x_1 - \hat\mu \\ x_2 - \hat\mu \\ \vdots \\ x_n - \hat\mu \end{pmatrix}$$
denote the mean subtracted attribute vector, where 1 ∈ Rn is the n-dimensional vector all of whose elements have value 1. We can rewrite Eq. (2.10) in terms of the magnitude of Z, that is, the dot product of Z with itself:
$$\hat\sigma^2 = \frac{1}{n}\|Z\|^2 = \frac{1}{n}\, Z^T Z = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2 \tag{2.11}$$
The sample variance can thus be interpreted as the squared magnitude of the centered attribute vector, or the dot product of the centered attribute vector with itself, normalized by the sample size.
Example 2.2. Consider the data sample for sepal length shown in Figure 2.1. We can see that the sample range is given as
$$\max_{i}\{x_i\} - \min_{i}\{x_i\} = 7.9 - 4.3 = 3.6$$
From the inverse CDF for sepal length in Figure 2.2b, we can find the sample IQR as follows:
$$\hat{q}_1 = \hat{F}^{-1}(0.25) = 5.1 \qquad \hat{q}_3 = \hat{F}^{-1}(0.75) = 6.4$$
$$\widehat{IQR} = \hat{q}_3 - \hat{q}_1 = 6.4 - 5.1 = 1.3$$
The sample variance can be computed from the centered data vector via Eq. (2.11):
$$\hat\sigma^2 = \frac{1}{n}(X - \mathbf{1}\cdot\hat\mu)^T(X - \mathbf{1}\cdot\hat\mu) = 102.168/150 = 0.681$$
The sample standard deviation is then
σˆ = √0.681 = 0.825
Variance of the Sample Mean Because the sample mean μˆ is itself a statistic, we can compute its mean value and variance. The expected value of the sample mean is simply μ, as we saw in Eq. (2.6). To derive an expression for the variance of the sample mean,
we utilize the fact that the random variables $x_i$ are all independent, and thus
$$var\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} var(x_i)$$
Further, because all the $x_i$'s are identically distributed as X, they have the same variance as X, that is,
$$var(x_i) = \sigma^2 \quad \text{for all } i$$
Combining the above two facts, we get
$$var\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} var(x_i) = \sum_{i=1}^{n} \sigma^2 = n\sigma^2 \tag{2.12}$$
Further, note that
$$E\left[\sum_{i=1}^{n} x_i\right] = n\mu \tag{2.13}$$
Using Eqs. (2.9), (2.12), and (2.13), the variance of the sample mean $\hat\mu$ can be computed as
$$var(\hat\mu) = E[(\hat\mu - \mu)^2] = E[\hat\mu^2] - \mu^2 = E\left[\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2\right] - \frac{1}{n^2}E\left[\sum_{i=1}^{n} x_i\right]^2 = \frac{1}{n^2}\left(E\left[\left(\sum_{i=1}^{n} x_i\right)^2\right] - E\left[\sum_{i=1}^{n} x_i\right]^2\right) = \frac{1}{n^2}\, var\left(\sum_{i=1}^{n} x_i\right) = \frac{\sigma^2}{n} \tag{2.14}$$
In other words, the sample mean μˆ varies or deviates from the mean μ in proportion to the population variance σ2. However, the deviation can be made smaller by considering larger sample size n.
Sample Variance Is Biased, but Is Asymptotically Unbiased The sample variance in Eq. (2.10) is a biased estimator for the true population variance, σ 2 , that is, E[σˆ 2 ] ̸= σ 2 . To show this we make use of the identity
$$\sum_{i=1}^{n}(x_i - \mu)^2 = n(\hat\mu - \mu)^2 + \sum_{i=1}^{n}(x_i - \hat\mu)^2 \tag{2.15}$$
Computing the expectation of $\hat\sigma^2$ by using Eq. (2.15) in the first step, we get
$$E[\hat\sigma^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2\right] = E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2\right] - E[(\hat\mu - \mu)^2] \tag{2.16}$$
Recall that the random variables $x_i$ are IID according to X, which means that they have the same mean μ and variance σ² as X. This means that
$$E[(x_i - \mu)^2] = \sigma^2$$
Further, from Eq. (2.14) the sample mean $\hat\mu$ has variance $E[(\hat\mu - \mu)^2] = \frac{\sigma^2}{n}$. Plugging these into Eq. (2.16) we get
$$E[\hat\sigma^2] = \frac{1}{n}\,n\sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$
The sample variance $\hat\sigma^2$ is a biased estimator of σ², as its expected value differs from the population variance by a factor of $\frac{n-1}{n}$. However, it is asymptotically unbiased, that is, the bias vanishes as n → ∞ because
$$\lim_{n\to\infty} \frac{n-1}{n} = \lim_{n\to\infty}\left(1 - \frac{1}{n}\right) = 1$$
Put differently, as the sample size increases, we have
$$E[\hat\sigma^2] \to \sigma^2 \quad \text{as } n \to \infty$$
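The bias factor (n − 1)/n is easy to see empirically; this small simulation (ours, with arbitrarily chosen σ² = 4 and n = 10) averages the biased estimator over many samples and compares it with (n − 1)σ²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 10, 100_000

# Draw many samples of size n and compute the biased sample variance (divides by n) for each.
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))
sigma2_hat = samples.var(axis=1)     # np.var divides by n by default, as in Eq. (2.10)

print(sigma2_hat.mean())             # close to (n - 1)/n * sigma^2 = 3.6, not 4.0
print((n - 1) / n * sigma2)          # theoretical expected value, 3.6
```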
2.2 BIVARIATE ANALYSIS
In bivariate analysis, we consider two attributes at the same time. We are specifically interested in understanding the association or dependence between them, if any. We thus restrict our attention to the two numeric attributes of interest, say X1 and X2, with the data D represented as an n × 2 matrix:
$$\mathbf{D} = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{n1} & x_{n2} \end{pmatrix}$$
where the two columns correspond to the attributes $X_1$ and $X_2$.
Geometrically, we can think of D in two ways. It can be viewed as n points or vectors in 2-dimensional space over the attributes X1 and X2, that is, xi = (xi1,xi2)T ∈ R2. Alternatively, it can be viewed as two points or vectors in an n-dimensional space comprising the points, that is, each column is a vector in Rn, as follows:
X1 = (x11,x21,…,xn1)T X2 = (x12,x22,…,xn2)T
In the probabilistic view, the column vector X = (X1 , X2 )T is considered a bivariate vector random variable, and the points xi (1 ≤ i ≤ n) are treated as a random sample drawn from X, that is, xi ’s are considered independent and identically distributed as X.
Empirical Joint Probability Mass Function
The empirical joint probability mass function for X is given as
$$\hat{f}(\mathbf{x}) = P(\mathbf{X} = \mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x}) \tag{2.17}$$
$$\hat{f}(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = \frac{1}{n}\sum_{i=1}^{n} I(x_{i1} = x_1, x_{i2} = x_2)$$
where $\mathbf{x} = (x_1, x_2)^T$ and I is an indicator variable that takes on the value 1 only when its argument is true:
$$I(\mathbf{x}_i = \mathbf{x}) = \begin{cases} 1 & \text{if } x_{i1} = x_1 \text{ and } x_{i2} = x_2 \\ 0 & \text{otherwise} \end{cases}$$
As in the univariate case, the probability function puts a probability mass of 1/n at each point in the data sample.
2.2.1 Measures of Location and Dispersion
Mean
The bivariate mean is defined as the expected value of the vector random variable X,
defined as follows:
$$\boldsymbol{\mu} = E[\mathbf{X}] = E\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} E[X_1] \\ E[X_2] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \tag{2.18}$$
In other words, the bivariate mean vector is simply the vector of expected values along each attribute.
The sample mean vector can be obtained from fˆX1 and fˆX2, the empirical probability mass functions of X1 and X2, respectively, using Eq. (2.5). It can also be computed from the joint empirical PMF in Eq. (2.17)
$$\hat{\boldsymbol\mu} = \sum_{\mathbf{x}} \mathbf{x}\,\hat{f}(\mathbf{x}) = \sum_{\mathbf{x}} \mathbf{x}\left(\frac{1}{n}\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x})\right) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i \tag{2.19}$$
Variance
We can compute the variance along each attribute, namely $\sigma_1^2$ for $X_1$ and $\sigma_2^2$ for $X_2$, using Eq. (2.8). The total variance [Eq. (1.4)] is given as
$$var(\mathbf{D}) = \sigma_1^2 + \sigma_2^2$$
The sample variances $\hat\sigma_1^2$ and $\hat\sigma_2^2$ can be estimated using Eq. (2.10), and the sample total variance is simply $\hat\sigma_1^2 + \hat\sigma_2^2$.
2.2.2 Measures of Association
Covariance
The covariance between two attributes X1 and X2 provides a measure of the association or linear dependence between them, and is defined as
$$\sigma_{12} = E[(X_1 - \mu_1)(X_2 - \mu_2)] \tag{2.20}$$
By linearity of expectation, we have
$$\begin{aligned}\sigma_{12} &= E[(X_1 - \mu_1)(X_2 - \mu_2)] \\ &= E[X_1X_2 - X_1\mu_2 - X_2\mu_1 + \mu_1\mu_2] \\ &= E[X_1X_2] - \mu_2 E[X_1] - \mu_1 E[X_2] + \mu_1\mu_2 \\ &= E[X_1X_2] - \mu_1\mu_2 \\ &= E[X_1X_2] - E[X_1]E[X_2]\end{aligned} \tag{2.21}$$
Eq. (2.21) can be seen as a generalization of the univariate variance [Eq. (2.9)] to the bivariate case.
If X1 and X2 are independent random variables, then we conclude that their covariance is zero. This is because if X1 and X2 are independent, then we have
$$E[X_1X_2] = E[X_1]\cdot E[X_2]$$
which in turn implies that
$$\sigma_{12} = 0$$
However, the converse is not true. That is, if σ12 = 0, one cannot claim that X1 and X2 are independent. All we can say is that there is no linear dependence between them, but we cannot rule out that there might be a higher order relationship or dependence between the two attributes.
The sample covariance between X1 and X2 is given as

    σ̂12 = (1/n) Σ_{i=1}^n (x_i1 − μ̂1)(x_i2 − μ̂2)    (2.22)

It can be derived by substituting the empirical joint probability mass function f̂(x1, x2) from Eq. (2.17) into Eq. (2.20), as follows:

    σ̂12 = E[(X1 − μ̂1)(X2 − μ̂2)]
        = Σ_{x=(x1,x2)^T} (x1 − μ̂1)(x2 − μ̂2) f̂(x1, x2)
        = (1/n) Σ_{x=(x1,x2)^T} Σ_{i=1}^n (x1 − μ̂1)·(x2 − μ̂2)·I(x_i1 = x1, x_i2 = x2)
        = (1/n) Σ_{i=1}^n (x_i1 − μ̂1)(x_i2 − μ̂2)
Notice that the sample covariance is a generalization of the sample variance [Eq. (2.10)], because

    σ̂11 = (1/n) Σ_{i=1}^n (x_i1 − μ̂1)(x_i1 − μ̂1) = (1/n) Σ_{i=1}^n (x_i1 − μ̂1)² = σ̂1²

and similarly, σ̂22 = σ̂2².
Correlation
The correlation between variables X1 and X2 is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as
    ρ12 = σ12 / (σ1 σ2) = σ12 / √(σ1² σ2²)    (2.23)

The sample correlation for attributes X1 and X2 is given as

    ρ̂12 = σ̂12 / (σ̂1 σ̂2) = Σ_{i=1}^n (x_i1 − μ̂1)(x_i2 − μ̂2) / √( Σ_{i=1}^n (x_i1 − μ̂1)² · Σ_{i=1}^n (x_i2 − μ̂2)² )    (2.24)

Geometric Interpretation of Sample Covariance and Correlation
Let Z1 and Z2 denote the centered attribute vectors in R^n, given as follows:

    Z1 = X1 − 1·μ̂1 = (x11 − μ̂1, x21 − μ̂1, ..., xn1 − μ̂1)^T
    Z2 = X2 − 1·μ̂2 = (x12 − μ̂2, x22 − μ̂2, ..., xn2 − μ̂2)^T

The sample covariance [Eq. (2.22)] can then be written as

    σ̂12 = (Z1^T Z2) / n

In other words, the covariance between the two attributes is simply the dot product between the two centered attribute vectors, normalized by the sample size. The above can be seen as a generalization of the univariate sample variance given in Eq. (2.11).

Figure 2.3. Geometric interpretation of covariance and correlation. The two centered attribute vectors Z1 and Z2, at angle θ, are shown in the (conceptual) n-dimensional space R^n spanned by the n points.
The sample correlation [Eq. (2.24)] can be written as

    ρ̂12 = Z1^T Z2 / ( √(Z1^T Z1) · √(Z2^T Z2) ) = Z1^T Z2 / ( ∥Z1∥ ∥Z2∥ ) = cos θ    (2.25)

Thus, the correlation coefficient is simply the cosine of the angle [Eq. (1.3)] between the two centered attribute vectors, as illustrated in Figure 2.3.

Covariance Matrix

The variance–covariance information for the two attributes X1 and X2 can be summarized in the square 2 × 2 covariance matrix, given as

    Σ = E[(X − μ)(X − μ)^T]
      = E[ (X1 − μ1; X2 − μ2)(X1 − μ1, X2 − μ2) ]
      = [ E[(X1 − μ1)(X1 − μ1)]   E[(X1 − μ1)(X2 − μ2)]
          E[(X2 − μ2)(X1 − μ1)]   E[(X2 − μ2)(X2 − μ2)] ]
      = [ σ1²   σ12
          σ21   σ2² ]    (2.26)
Because σ12 = σ21, Σ is a symmetric matrix. The covariance matrix records the attribute-specific variances on the main diagonal, and the covariance information on the off-diagonal elements.

The total variance of the two attributes is given as the sum of the diagonal elements of Σ, which is also called the trace of Σ, given as

    var(D) = tr(Σ) = σ1² + σ2²

We immediately have tr(Σ) ≥ 0.

The generalized variance of the two attributes also considers the covariance, in addition to the attribute variances, and is given as the determinant of the covariance matrix Σ, denoted as |Σ| or det(Σ). The generalized variance is non-negative, because

    |Σ| = det(Σ) = σ1² σ2² − σ12² = σ1² σ2² − ρ12² σ1² σ2² = (1 − ρ12²) σ1² σ2²

where we used Eq. (2.23), that is, σ12 = ρ12 σ1 σ2. Note that |ρ12| ≤ 1 implies that ρ12² ≤ 1, which in turn implies that det(Σ) ≥ 0, that is, the determinant is non-negative.
The sample covariance matrix is given as

    Σ̂ = [ σ̂1²   σ̂12
          σ̂12   σ̂2² ]

The sample covariance matrix Σ̂ shares the same properties as Σ, that is, it is symmetric and |Σ̂| ≥ 0, and it can be used to easily obtain the sample total and generalized variance.
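As a quick computational check of Eqs. (2.22) to (2.26), the following sketch (assuming two NumPy vectors X1 and X2 of equal length n; the variable names and values are ours, chosen for illustration) computes the sample covariance, correlation, and 2 × 2 covariance matrix by centering the attribute vectors:

```python
import numpy as np

# Illustrative attribute vectors (column view of an n x 2 data matrix).
X1 = np.array([1.0, 5.0, 9.0])
X2 = np.array([0.8, 2.4, 5.5])
n = len(X1)

Z1 = X1 - X1.mean()          # centered attribute vectors
Z2 = X2 - X2.mean()

cov12 = Z1 @ Z2 / n          # Eq. (2.22): dot product of centered vectors over n
rho12 = cov12 / np.sqrt((Z1 @ Z1 / n) * (Z2 @ Z2 / n))   # Eq. (2.24)
Sigma = np.array([[Z1 @ Z1, Z1 @ Z2],
                  [Z2 @ Z1, Z2 @ Z2]]) / n               # sample covariance matrix

print(cov12, rho12)
print(Sigma)                 # trace = total variance, det = generalized variance
```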
Figure 2.4. Correlation between sepal length and sepal width (X1: sepal length on the horizontal axis, X2: sepal width on the vertical axis).
Example 2.3 (Sample Mean and Covariance). Consider the sepal length and sepal width attributes for the Iris dataset, plotted in Figure 2.4. There are n = 150 points in the d = 2 dimensional attribute space. The sample mean vector is given as
    μ̂ = (5.843, 3.054)^T

The sample covariance matrix is given as

    Σ̂ = [  0.681  −0.039
           −0.039   0.187 ]

The variance for sepal length is σ̂1² = 0.681, and that for sepal width is σ̂2² = 0.187. The covariance between the two attributes is σ̂12 = −0.039, and the correlation between them is

    ρ̂12 = −0.039 / √(0.681 · 0.187) = −0.109
Thus, there is a very weak negative correlation between these two attributes, as evidenced by the best linear fit line in Figure 2.4. Alternatively, we can consider the attributes sepal length and sepal width as two points in Rn. The correlation is then the cosine of the angle between them; we have
ρˆ 12 = cos θ = −0.109, which implies that θ = cos−1 (−0.109) = 96.26◦
The angle is close to 90◦, that is, the two attribute vectors are almost orthogonal, indicating weak correlation. Further, the angle being greater than 90◦ indicates negative correlation.
The sample total variance is given as

    tr(Σ̂) = 0.681 + 0.187 = 0.868

and the sample generalized variance is given as

    |Σ̂| = det(Σ̂) = 0.681 · 0.187 − (−0.039)² = 0.126
2.3 MULTIVARIATE ANALYSIS
In multivariate analysis, we consider all the d numeric attributes X1,X2,…,Xd. The
full data is an n × d matrix, given as
    D = [ X1   X2   ···  Xd
          x11  x12  ···  x1d
          x21  x22  ···  x2d
          ...
          xn1  xn2  ···  xnd ]

In the row view, the data can be considered as a set of n points or vectors in the d-dimensional attribute space

    x_i = (x_i1, x_i2, ..., x_id)^T ∈ R^d

In the column view, the data can be considered as a set of d points or vectors in the n-dimensional space spanned by the data points

    Xj = (x_1j, x_2j, ..., x_nj)^T ∈ R^n

In the probabilistic view, the d attributes are modeled as a vector random variable, X = (X1, X2, ..., Xd)^T, and the points x_i are considered to be a random sample drawn from X, that is, they are independent and identically distributed as X.

Mean

Generalizing Eq. (2.18), the multivariate mean vector is obtained by taking the mean of each attribute, given as

    μ = E[X] = (E[X1], E[X2], ..., E[Xd])^T = (μ1, μ2, ..., μd)^T

Generalizing Eq. (2.19), the sample mean is given as

    μ̂ = (1/n) Σ_{i=1}^n x_i
Covariance Matrix
Generalizing Eq. (2.26) to d dimensions, the multivariate covariance information is captured by the d × d (square) symmetric covariance matrix that gives the covariance for each pair of attributes:

    Σ = E[(X − μ)(X − μ)^T] = [ σ1²   σ12  ···  σ1d
                                σ21   σ2²  ···  σ2d
                                ...
                                σd1   σd2  ···  σd² ]

The diagonal element σi² specifies the attribute variance for Xi, whereas the off-diagonal elements σij = σji represent the covariance between attribute pairs Xi and Xj.

Covariance Matrix Is Positive Semidefinite

It is worth noting that Σ is a positive semidefinite matrix, that is,

    a^T Σ a ≥ 0 for any d-dimensional vector a

To see this, observe that

    a^T Σ a = a^T E[(X − μ)(X − μ)^T] a = E[a^T (X − μ)(X − μ)^T a] = E[Y²] ≥ 0

where Y is the random variable Y = a^T(X − μ) = Σ_{i=1}^d a_i (X_i − μ_i), and we use the fact that the expectation of a squared random variable is non-negative.

Because Σ is also symmetric, this implies that all the eigenvalues of Σ are real and non-negative. In other words the d eigenvalues of Σ can be arranged from the largest to the smallest as follows: λ1 ≥ λ2 ≥ ··· ≥ λd ≥ 0. A consequence is that the determinant of Σ is non-negative:

    det(Σ) = Π_{i=1}^d λi ≥ 0    (2.27)

Total and Generalized Variance

The total variance is given as the trace of the covariance matrix:

    var(D) = tr(Σ) = σ1² + σ2² + ··· + σd²    (2.28)

Being a sum of squares, the total variance must be non-negative.

The generalized variance is defined as the determinant of the covariance matrix, det(Σ), also denoted as |Σ|. It gives a single value for the overall multivariate scatter. From Eq. (2.27) we have det(Σ) ≥ 0.
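These properties are easy to check numerically. A minimal sketch (the example matrix is just an illustration; any symmetric covariance matrix would do) verifies positive semidefiniteness via the eigenvalues, and the trace and determinant identities for the total and generalized variance:

```python
import numpy as np

# An arbitrary symmetric covariance matrix used for illustration.
Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])

eigvals = np.linalg.eigvalsh(Sigma)          # real eigenvalues of a symmetric matrix
print(np.all(eigvals >= 0))                  # True: positive semidefinite
print(np.trace(Sigma), eigvals.sum())        # total variance = sum of eigenvalues
print(np.linalg.det(Sigma), eigvals.prod())  # generalized variance = product of eigenvalues
```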
Sample Covariance Matrix
The sample covariance matrix is given as

    Σ̂ = E[(X − μ̂)(X − μ̂)^T] = [ σ̂1²   σ̂12  ···  σ̂1d
                                 σ̂21   σ̂2²  ···  σ̂2d
                                 ...
                                 σ̂d1   σ̂d2  ···  σ̂d² ]    (2.29)

Instead of computing the sample covariance matrix element-by-element, we can obtain it via matrix operations. Let Z represent the centered data matrix, given as the matrix of centered attribute vectors Zi = Xi − 1·μ̂i, where 1 ∈ R^n:

    Z = D − 1·μ̂^T = ( Z1  Z2  ···  Zd )

Alternatively, the centered data matrix can also be written in terms of the centered points z_i = x_i − μ̂:

    Z = D − 1·μ̂^T = ( x_1^T − μ̂^T ; x_2^T − μ̂^T ; ... ; x_n^T − μ̂^T ) = ( z_1^T ; z_2^T ; ... ; z_n^T )

In matrix notation, the sample covariance matrix can be written as

    Σ̂ = (1/n) Z^T Z = (1/n) [ Z1^T Z1   Z1^T Z2  ···  Z1^T Zd
                              Z2^T Z1   Z2^T Z2  ···  Z2^T Zd
                              ...
                              Zd^T Z1   Zd^T Z2  ···  Zd^T Zd ]    (2.30)

The sample covariance matrix is thus given as the pairwise inner or dot products of the centered attribute vectors, normalized by the sample size.

In terms of the centered points z_i, the sample covariance matrix can also be written as a sum of rank-one matrices obtained as the outer product of each centered point:

    Σ̂ = (1/n) Σ_{i=1}^n z_i · z_i^T    (2.31)

Example 2.4 (Sample Mean and Covariance Matrix). Let us consider all four numeric attributes for the Iris dataset, namely sepal length, sepal width, petal length, and petal width. The multivariate sample mean vector is given as

    μ̂ = (5.843, 3.054, 3.759, 1.199)^T
and the sample covariance matrix is given as

    Σ̂ = [  0.681  −0.039   1.265   0.513
           −0.039   0.187  −0.320  −0.117
            1.265  −0.320   3.092   1.288
            0.513  −0.117   1.288   0.579 ]

The sample total variance is

    var(D) = tr(Σ̂) = 0.681 + 0.187 + 3.092 + 0.579 = 4.539

and the generalized variance is

    det(Σ̂) = 1.853 × 10⁻³
Example 2.5 (Inner and Outer Product). To illustrate the inner and outer product–based computation of the sample covariance matrix, consider the 2-dimensional dataset
    D = [ A1   A2
          1    0.8
          5    2.4
          9    5.5 ]

The mean vector is as follows:

    μ̂ = (μ̂1, μ̂2)^T = (15/3, 8.7/3)^T = (5, 2.9)^T

and the centered data matrix is then given as

    Z = D − 1·μ̂^T = [ 1  0.8        [ 1                  [ −4  −2.1
                      5  2.4    −     1  (5  2.9)   =       0  −0.5
                      9  5.5 ]        1 ]                    4   2.6 ]

The inner-product approach [Eq. (2.30)] to compute the sample covariance matrix gives

    Σ̂ = (1/n) Z^T Z = (1/3) [ −4    0    4        [ −4  −2.1
                              −2.1  −0.5  2.6 ]  ·    0  −0.5
                                                      4   2.6 ]
       = (1/3) [ 32    18.8      =  [ 10.67  6.27
                 18.8  11.42 ]        6.27   3.81 ]

Alternatively, the outer-product approach [Eq. (2.31)] gives

    Σ̂ = (1/n) Σ_{i=1}^n z_i · z_i^T
       = (1/3) [ (−4; −2.1)(−4  −2.1) + (0; −0.5)(0  −0.5) + (4; 2.6)(4  2.6) ]
       = (1/3) [ ( 16.0   8.4      ( 0.0  0.0       ( 16.0  10.4
                   8.4    4.41 ) +   0.0  0.25 )  +   10.4   6.76 ) ]
       = (1/3) [ 32.0  18.8      =  [ 10.67  6.27
                 18.8  11.42 ]        6.27   3.81 ]
where the centered points zi are the rows of Z. We can see that both the inner and outer product approaches yield the same sample covariance matrix.
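A short sketch that mirrors this example, computing the sample covariance matrix both ways (the array D below is the 3 × 2 dataset of Example 2.5):

```python
import numpy as np

D = np.array([[1.0, 0.8],
              [5.0, 2.4],
              [9.0, 5.5]])
n = D.shape[0]

Z = D - D.mean(axis=0)                              # centered data matrix

Sigma_inner = Z.T @ Z / n                           # Eq. (2.30): inner products of centered columns
Sigma_outer = sum(np.outer(z, z) for z in Z) / n    # Eq. (2.31): sum of rank-one outer products

print(Sigma_inner)                                  # approximately [[10.67, 6.27], [6.27, 3.81]]
print(np.allclose(Sigma_inner, Sigma_outer))        # True
```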
2.4 DATA NORMALIZATION
When analyzing two or more attributes it is often necessary to normalize the values of
the attributes, especially in those cases where the values are vastly different in scale.
Range Normalization
Let X be an attribute and let x1,x2,…,xn be a random sample drawn from X. In range normalization each value is scaled by the sample range rˆ of X:
    x′_i = (x_i − min_i{x_i}) / r̂ = (x_i − min_i{x_i}) / (max_i{x_i} − min_i{x_i})
After transformation the new attribute takes on values in the range [0, 1].
Standard Score Normalization
In standard score normalization, also called z-normalization, each value is replaced by
its z-score:
    x′_i = (x_i − μ̂) / σ̂
where μˆ is the sample mean and σˆ 2 is the sample variance of X. After transformation, the new attribute has mean μˆ ′ = 0, and standard deviation σˆ ′ = 1.
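A minimal sketch of both normalizations for a one-dimensional NumPy array x (the array name is ours; the values below happen to be the Age attribute of Table 2.1):

```python
import numpy as np

x = np.array([12., 14., 18., 23., 27., 28., 34., 37., 39., 40.])

# Range normalization: scale by the sample range.
x_range = (x - x.min()) / (x.max() - x.min())

# z-normalization: subtract the sample mean, divide by the sample standard
# deviation (ddof=0, the biased estimate used in this chapter).
x_z = (x - x.mean()) / x.std(ddof=0)

print(np.round(x_range, 3))   # first value 0.0, last value 1.0
print(np.round(x_z, 2))       # mean approximately 0, standard deviation 1
```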
Example 2.6. Consider the example dataset shown in Table 2.1. The attributes Age and Income have very different scales, with the latter having much larger values. Consider the distance between x1 and x2:
    ∥x1 − x2∥ = ∥(2, 200)^T∥ = √(2² + 200²) = √40004 = 200.01
As we can observe, the contribution of Age is overshadowed by the value of Income. The sample range for Age is rˆ = 40 − 12 = 28, with the minimum value 12. After
range normalization, the new attribute is given as
Age′ =(0,0.071,0.214,0.393,0.536,0.571,0.786,0.893,0.964,1)T
For example, for the point x2 = (x21,x22) = (14,500), the value x21 = 14 is transformed into
    x′_21 = (14 − 12)/28 = 2/28 = 0.071
Table 2.1. Dataset for normalization

    x_i    Age (X1)   Income (X2)
    x1     12         300
    x2     14         500
    x3     18         1000
    x4     23         2000
    x5     27         3500
    x6     28         4000
    x7     34         4300
    x8     37         6000
    x9     39         2500
    x10    40         2700
Likewise, the sample range for Income is 6000 − 300 = 5700, with a minimum value of 300; Income is therefore transformed into
Income′ =(0,0.035,0.123,0.298,0.561,0.649,0.702,1,0.386,0.421)T
so that x′_22 = 0.035. The distance between x1 and x2 after range normalization is given as

    ∥x′1 − x′2∥ = ∥(0, 0)^T − (0.071, 0.035)^T∥ = ∥(−0.071, −0.035)^T∥ = 0.079
We can observe that Income no longer skews the distance.
For z-normalization, we first compute the mean and standard deviation of both
attributes:
         Age     Income
    μ̂    27.2    2680
    σ̂    9.77    1726.15

Age is transformed into

    Age′ = (−1.56, −1.35, −0.94, −0.43, −0.02, 0.08, 0.70, 1.0, 1.21, 1.31)^T

For instance, the value x21 = 14, for the point x2 = (x21, x22) = (14, 500), is transformed as

    x′_21 = (14 − 27.2)/9.77 = −1.35

Likewise, Income is transformed into

    Income′ = (−1.38, −1.26, −0.97, −0.39, 0.48, 0.77, 0.94, 1.92, −0.10, 0.01)^T

so that x′_22 = −1.26. The distance between x1 and x2 after z-normalization is given as

    ∥x′1 − x′2∥ = ∥(−1.56, −1.38)^T − (−1.35, −1.26)^T∥ = ∥(−0.21, −0.12)^T∥ = 0.242
2.5 NORMAL DISTRIBUTION
The normal distribution is one of the most important probability density functions, especially because many physically observed variables follow an approximately normal distribution. Furthermore, the sampling distribution of the mean of any arbitrary probability distribution follows a normal distribution. The normal distribution also plays an important role as the parametric distribution of choice in clustering, density estimation, and classification.
2.5.1 Univariate Normal Distribution
A random variable X has a normal distribution, with the parameters mean μ and
variance σ², if the probability density function of X is given as follows:

    f(x | μ, σ²) = (1/√(2πσ²)) exp{ −(x − μ)² / (2σ²) }

The term (x − μ)² measures the distance of a value x from the mean μ of the distribution, and thus the probability density decreases exponentially as a function of the distance from the mean. The maximum value of the density occurs at the mean value x = μ, given as f(μ) = 1/√(2πσ²), which is inversely proportional to the standard deviation σ of the distribution.
Example 2.7. Figure 2.5 plots the standard normal distribution, which has the parameters μ = 0 and σ 2 = 1. The normal distribution has a characteristic bell shape, and it is symmetric about the mean. The figure also shows the effect of different values of standard deviation on the shape of the distribution. A smaller value (e.g., σ = 0.5) results in a more “peaked” distribution that decays faster, whereas a larger value (e.g., σ = 2) results in a flatter distribution that decays slower. Because the normal distribution is symmetric, the mean μ is also the median, as well as the mode, of the distribution.
Probability Mass

Given an interval [a, b] the probability mass of the normal distribution within that interval is given as

    P(a ≤ x ≤ b) = ∫_a^b f(x | μ, σ²) dx

In particular, we are often interested in the probability mass concentrated within k standard deviations from the mean, that is, for the interval [μ − kσ, μ + kσ], which can be computed as

    P(μ − kσ ≤ x ≤ μ + kσ) = (1/(√(2π) σ)) ∫_{μ−kσ}^{μ+kσ} exp{ −(x − μ)² / (2σ²) } dx
Figure 2.5. Normal distribution: μ = 0, and different variances (σ = 0.5, σ = 1, σ = 2).

Via a change of variable z = (x − μ)/σ, we get an equivalent formulation in terms of the standard normal distribution:

    P(−k ≤ z ≤ k) = (1/√(2π)) ∫_{−k}^{k} e^{−z²/2} dz = (2/√(2π)) ∫_{0}^{k} e^{−z²/2} dz

The last step follows from the fact that e^{−z²/2} is symmetric, and thus the integral over the range [−k, k] is equivalent to 2 times the integral over the range [0, k]. Finally, via another change of variable t = z/√2, we get

    P(−k ≤ z ≤ k) = 2 · P(0 ≤ t ≤ k/√2) = (2/√π) ∫_{0}^{k/√2} e^{−t²} dt = erf(k/√2)    (2.32)

where erf is the Gauss error function, defined as

    erf(x) = (2/√π) ∫_{0}^{x} e^{−t²} dt

Using Eq. (2.32) we can compute the probability mass within k standard deviations of the mean. In particular, for k = 1, we have

    P(μ − σ ≤ x ≤ μ + σ) = erf(1/√2) = 0.6827
which means that 68.27% of all points lie within 1 standard deviation from the mean. For k = 2, we have erf(2/√2) = 0.9545, and for k = 3 we have erf(3/√2) = 0.9973. Thus, almost the entire probability mass (i.e., 99.73%) of a normal distribution is within ±3σ from the mean μ.
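Eq. (2.32) is straightforward to evaluate with the standard error function; a small sketch:

```python
from math import erf, sqrt

# Probability mass of a normal distribution within k standard deviations of the
# mean: P(mu - k*sigma <= x <= mu + k*sigma) = erf(k / sqrt(2)), per Eq. (2.32).
for k in (1, 2, 3):
    print(k, erf(k / sqrt(2)))   # approximately 0.6827, 0.9545, 0.9973
```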
2.5.2 Multivariate Normal Distribution
Given the d -dimensional vector random variable X = (X1 , X2 , . . . , Xd )T , we say that X has a multivariate normal distribution, with the parameters mean μ and covariance matrix , if its joint multivariate probability density function is given as follows:
    f(x | μ, Σ) = ( 1 / ( √((2π)^d) √|Σ| ) ) exp{ −(x − μ)^T Σ^{−1} (x − μ) / 2 }    (2.33)

where |Σ| is the determinant of the covariance matrix. As in the univariate case, the term

    (x − μ)^T Σ^{−1} (x − μ)    (2.34)

measures the distance, called the Mahalanobis distance, of the point x from the mean μ of the distribution, taking into account all of the variance–covariance information between the attributes. The Mahalanobis distance is a generalization of Euclidean distance because if we set Σ = I, where I is the d × d identity matrix (with diagonal elements as 1's and off-diagonal elements as 0's), we get

    (x − μ)^T I^{−1} (x − μ) = ∥x − μ∥²
The Euclidean distance thus ignores the covariance information between the attributes, whereas the Mahalanobis distance explicitly takes it into consideration.
The standard multivariate normal distribution has parameters μ = 0 and Σ = I. Figure 2.6a plots the probability density of the standard bivariate (d = 2) normal distribution, with parameters

    μ = 0 = (0, 0)^T    and    Σ = I = [ 1  0
                                         0  1 ]
This corresponds to the case where the two attributes are independent, and both follow the standard normal distribution. The symmetric nature of the standard normal distribution can be clearly seen in the contour plot shown in Figure 2.6b. Each level curve represents the set of points x with a fixed density value f (x).
Geometry of the Multivariate Normal
Let us consider the geometry of the multivariate normal distribution for an arbitrary mean μ and covariance matrix Σ. Compared to the standard normal distribution, we can expect the density contours to be shifted, scaled, and rotated. The shift or translation comes from the fact that the mean μ is not necessarily the origin 0. The
Figure 2.6. (a) Standard bivariate normal density and (b) its contour plot. Parameters: μ = (0, 0)^T, Σ = I.
scaling or skewing is a result of the attribute variances, and the rotation is a result of the covariances.
The shape or geometry of the normal distribution becomes clear by considering the eigen-decomposition of the covariance matrix. Recall that Σ is a d × d symmetric positive semidefinite matrix. The eigenvector equation for Σ is given as

    Σ u_i = λ_i u_i

Here λ_i is an eigenvalue of Σ and the vector u_i ∈ R^d is the eigenvector corresponding to λ_i. Because Σ is symmetric and positive semidefinite it has d real and non-negative eigenvalues, which can be arranged in order from the largest to the smallest as follows: λ1 ≥ λ2 ≥ ··· ≥ λd ≥ 0. The diagonal matrix Λ is used to record these eigenvalues:

    Λ = [ λ1  0   ···  0
          0   λ2  ···  0
          ...
          0   0   ···  λd ]
Further, the eigenvectors are unit vectors (normal) and are mutually orthogonal,
that is, they are orthonormal:

    u_i^T u_i = 1    for all i
    u_i^T u_j = 0    for all i ≠ j

The eigenvectors can be put together into an orthogonal matrix U, defined as a matrix with normal and mutually orthogonal columns:

    U = ( u1  u2  ···  ud )

The eigen-decomposition of Σ can then be expressed compactly as follows:

    Σ = U Λ U^T
This equation can be interpreted geometrically as a change in basis vectors. From the original d dimensions corresponding to the d attributes Xj, we derive d new dimensions u_i. Σ is the covariance matrix in the original space, whereas Λ is the covariance matrix in the new coordinate space. Because Λ is a diagonal matrix, we can immediately conclude that after the transformation, each new dimension u_i has variance λ_i, and further that all covariances are zero. In other words, in the new space, the normal distribution is axis aligned (has no rotation component), but is skewed in each axis proportional to the eigenvalue λ_i, which represents the variance along that dimension (further details are given in Section 7.2.4).
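The decomposition Σ = UΛU^T can be computed and verified numerically; a minimal sketch on a 2 × 2 covariance matrix (the matrix itself is just an illustration):

```python
import numpy as np

Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])

# eigh returns real eigenvalues with orthonormal eigenvectors as columns.
lam, U = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]                 # arrange from largest to smallest
lam, U = lam[order], U[:, order]

Lambda = np.diag(lam)
print(lam)                                    # variances along the new axes u_i
print(np.allclose(U @ Lambda @ U.T, Sigma))   # True: Sigma = U Lambda U^T
print(np.allclose(U.T @ U, np.eye(2)))        # True: columns of U are orthonormal
```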
Total and Generalized Variance

The determinant of the covariance matrix Σ is given as det(Σ) = Π_{i=1}^d λ_i. Thus, the generalized variance of Σ is the product of its eigenvalues.

Given the fact that the trace of a square matrix is invariant to similarity transformation, such as a change of basis, we conclude that the total variance var(D) for a dataset D is invariant, that is,

    var(D) = tr(Σ) = Σ_{i=1}^d σ_i² = Σ_{i=1}^d λ_i = tr(Λ)

In other words σ1² + ··· + σd² = λ1 + ··· + λd.
Example 2.8 (Bivariate Normal Density). Treating attributes sepal length (X1) and sepal width (X2) in the Iris dataset (see Table 1.1) as continuous random
variables, we can define a continuous bivariate random variable X = (X1, X2)^T.
Assuming that X follows a bivariate normal distribution, we can estimate its parameters from the sample. The sample mean is given as
μˆ =(5.843,3.054)T
and the sample covariance matrix is given as

    Σ̂ = [  0.681  −0.039
           −0.039   0.187 ]

Figure 2.7. Iris: sepal length and sepal width, bivariate normal density and contours, with the new axes u1 and u2.
The plot of the bivariate normal density for the two attributes is shown in Figure 2.7. The figure also shows the contour lines and the data points.
Consider the point x2 = (6.9, 3.1)^T. We have

    x2 − μ̂ = (6.9 − 5.843, 3.1 − 3.054)^T = (1.057, 0.046)^T

The Mahalanobis distance between x2 and μ̂ is

    (x2 − μ̂)^T Σ̂^{−1} (x2 − μ̂) = (1.057  0.046) [  0.681  −0.039 ]^{−1} ( 1.057
                                                   −0.039   0.187 ]        0.046 )
                               = (1.057  0.046) [ 1.486  0.31   ( 1.057
                                                  0.31   5.42 ]   0.046 )
                               = 1.701

whereas the squared Euclidean distance between them is

    ∥x2 − μ̂∥² = (1.057  0.046) ( 1.057
                                 0.046 ) = 1.119
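A sketch that reproduces the arithmetic of this example (the mean, covariance matrix, and point are the values given above):

```python
import numpy as np

mu = np.array([5.843, 3.054])
Sigma = np.array([[0.681, -0.039],
                  [-0.039, 0.187]])
x2 = np.array([6.9, 3.1])

d = x2 - mu
mahalanobis_sq = d @ np.linalg.inv(Sigma) @ d   # (x - mu)^T Sigma^{-1} (x - mu)
euclidean_sq = d @ d

print(mahalanobis_sq)   # approximately 1.70
print(euclidean_sq)     # approximately 1.12
```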
The eigenvalues and the corresponding eigenvectors of Σ̂ are as follows:

    λ1 = 0.684    u1 = (−0.997, 0.078)^T
    λ2 = 0.184    u2 = (−0.078, −0.997)^T

These two eigenvectors define the new axes in which the covariance matrix is given as

    Λ = [ 0.684  0
          0      0.184 ]

The angle between the original axis e1 = (1, 0)^T and u1 specifies the rotation angle for the multivariate normal:

    cos θ = e1^T u1 = −0.997
    θ = cos^{−1}(−0.997) = 175.5°
Figure 2.7 illustrates the new coordinate axes and the new variances. We can see that in the original axes, the contours are only slightly rotated by angle 175.5◦ (or −4.5◦).
2.6 FURTHER READING
There are several good textbooks that cover the topics discussed in this chapter in more depth; see Evans and Rosenthal (2011), Wasserman (2004), and Rencher and Christensen (2012).
Evans, M. and Rosenthal, J. (2011). Probability and Statistics: The Science of Uncertainty. 2nd ed. New York: W. H. Freeman.
Rencher, A. C. and Christensen, W. F. (2012). Methods of Multivariate Analysis. 3rd ed. Hoboken, NJ: John Wiley & Sons.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer Science + Business Media.
2.7 EXERCISES
Q1. True or False:
(a) Mean is robust against outliers.
(b) Median is robust against outliers.
(c) Standard deviation is robust against outliers.
Q2. Let X and Y be two random variables, denoting age and weight, respectively. Consider a random sample of size n = 20 from these two variables
X = (69,74,68,70,72,67,66,70,76,68,72,79,74,67,66,71,74,75,75,76) Y = (153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265, 185, 112, 140,
150, 165, 185, 210, 220)
(a) Find the mean, median, and mode for X.
(b) What is the variance for Y?
(c) Plot the normal distribution for X.
(d) What is the probability of observing an age of 80 or higher?
(e) Find the 2-dimensional mean μˆ and the covariance matrix for these two
variables.
(f) What is the correlation between age and weight?
(g) Draw a scatterplot to show the relationship between age and weight.
Q3. Show that the identity in Eq. (2.15) holds, that is,

    Σ_{i=1}^n (x_i − μ)² = n(μ̂ − μ)² + Σ_{i=1}^n (x_i − μ̂)²

Q4. Prove that if x_i are independent random variables, then

    var( Σ_{i=1}^n x_i ) = Σ_{i=1}^n var(x_i)

This fact was used in Eq. (2.12).
Q5. Define a measure of deviation called mean absolute deviation for a random variable X as follows:

    (1/n) Σ_{i=1}^n |x_i − μ|

Is this measure robust? Why or why not?
Q6. Prove that the expected value of a vector random variable X = (X1, X2)^T is simply the vector of the expected values of the individual random variables X1 and X2, as given in Eq. (2.18).
Q7. Show that the correlation [Eq. (2.23)] between any two random variables X1 and X2 lies in the range [−1,1].
Q8. Given the dataset in Table 2.2, compute the covariance matrix and the generalized variance.
Table 2.2. Dataset for Q8

         X1   X2   X3
    x1   17   17   12
    x2   11    9   13
    x3   11    8   19

Q9. Show that the outer-product in Eq. (2.31) for the sample covariance matrix is equivalent to Eq. (2.29).

Q10. Assume that we are given two univariate normal distributions, NA and NB, and let their mean and standard deviation be as follows: μA = 4, σA = 1 and μB = 8, σB = 2.
(a) For each of the following values xi ∈ {5, 6, 7} find out which is the more likely normal distribution to have produced it.
(b) Derive an expression for the point for which the probability of having been produced by both the normals is the same.
Q11. Consider Table 2.3. Assume that both the attributes X and Y are numeric, and the table represents the entire population. If we know that the correlation between X and Y is zero, what can you infer about the values of Y?

Table 2.3. Dataset for Q11

    X    1   0   1   0   0
    Y    a   b   c   a   c

Q12. Under what conditions will the covariance matrix Σ be identical to the correlation matrix, whose (i, j) entry gives the correlation between attributes Xi and Xj? What can you conclude about the two variables?
CHAPTER 3 Categorical Attributes
In this chapter we present methods to analyze categorical attributes. Because categorical attributes have only symbolic values, many of the arithmetic operations cannot be performed directly on the symbolic values. However, we can compute the frequencies of these values and use them to analyze the attributes.
3.1 UNIVARIATE ANALYSIS
We assume that the data consists of values for a single categorical attribute, X. Let the domain of X consist of m symbolic values dom(X) = {a1,a2,…,am}. The data D is thus
an n × 1 symbolic data matrix given as
    D = [ X
          x1
          x2
          ...
          xn ]

where each point x_i ∈ dom(X).

3.1.1 Bernoulli Variable

Let us first consider the case when the categorical attribute X has domain {a1, a2}, with m = 2. We can model X as a Bernoulli random variable, which takes on two distinct values, 1 and 0, according to the mapping

    X(v) = 1 if v = a1
           0 if v = a2

The probability mass function (PMF) of X is given as

    P(X = x) = f(x) = p1 if x = 1
                      p0 if x = 0
where p1 and p0 are the parameters of the distribution, which must satisfy the condition
p1 + p0 = 1
Because there is only one free parameter, it is customary to denote p1 = p, from which it follows that p0 = 1 − p. The PMF of Bernoulli random variable X can then be written compactly as
    P(X = x) = f(x) = p^x (1 − p)^{1−x}

We can see that P(X = 1) = p¹(1 − p)⁰ = p and P(X = 0) = p⁰(1 − p)¹ = 1 − p, as
desired.
Mean and Variance
The expected value of X is given as

    μ = E[X] = 1·p + 0·(1 − p) = p

and the variance of X is given as

    σ² = var(X) = E[X²] − (E[X])²
       = (1²·p + 0²·(1 − p)) − p² = p − p² = p(1 − p)    (3.1)

Sample Mean and Variance
To estimate the parameters of the Bernoulli variable X, we assume that each symbolic point has been mapped to its binary value. Thus, the set {x1,x2,…,xn} is assumed to be a random sample drawn from X (i.e., each xi is IID with X).
The sample mean is given as

    μ̂ = (1/n) Σ_{i=1}^n x_i = n1/n = p̂    (3.2)

where n1 is the number of points with x_i = 1 in the random sample (equal to the number of occurrences of symbol a1).

Let n0 = n − n1 denote the number of points with x_i = 0 in the random sample. The sample variance is given as

    σ̂² = (1/n) Σ_{i=1}^n (x_i − μ̂)²
        = (n1/n)(1 − p̂)² + ((n − n1)/n)(−p̂)²
        = p̂(1 − p̂)² + (1 − p̂)p̂²
        = p̂(1 − p̂)(1 − p̂ + p̂)
        = p̂(1 − p̂)
The sample variance could also have been obtained directly from Eq.(3.1), by substituting pˆ for p.
Example 3.1. Consider the sepal length attribute (X1) for the Iris dataset in Table 1.1. Let us define an Iris flower as Long if its sepal length is in the range [7, ∞], and Short if its sepal length is in the range [−∞, 7). Then X1 can be treated as a categorical attribute with domain {Long,Short}. From the observed sample of size n = 150, we find 13 long Irises. The sample mean of X1 is
    μ̂ = p̂ = 13/150 = 0.087

and its variance is

    σ̂² = p̂(1 − p̂) = 0.087(1 − 0.087) = 0.087 · 0.913 = 0.079
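A minimal sketch of these Bernoulli estimates (assuming a NumPy 0/1 vector x obtained by mapping Long to 1 and Short to 0; the vector below is a small illustrative sample, not the Iris data):

```python
import numpy as np

# Hypothetical binary sample: 1 = Long, 0 = Short.
x = np.array([0, 0, 1, 0, 1, 0, 0, 0, 0, 0])
n = len(x)

p_hat = x.sum() / n                 # sample mean, Eq. (3.2)
var_hat = p_hat * (1 - p_hat)       # sample variance, from Eq. (3.1) with p_hat

print(p_hat, var_hat)               # 0.2 and 0.16 for this illustrative sample
```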
Binomial Distribution: Number of Occurrences
Given the Bernoulli variable X, let {x1,x2,…,xn} denote a random sample of size n drawn from X. Let N be the random variable denoting the number of occurrences of the symbol a1 (value X = 1) in the sample. N has a binomial distribution, given as
    f(N = n1 | n, p) = (n choose n1) p^{n1} (1 − p)^{n − n1}    (3.3)

In fact, N is the sum of the n independent Bernoulli random variables x_i IID with X, that is, N = Σ_{i=1}^n x_i. By linearity of expectation, the mean or expected number of occurrences of symbol a1 is given as

    μ_N = E[N] = E[ Σ_{i=1}^n x_i ] = Σ_{i=1}^n E[x_i] = Σ_{i=1}^n p = np

Because the x_i are all independent, the variance of N is given as

    σ_N² = var(N) = Σ_{i=1}^n var(x_i) = Σ_{i=1}^n p(1 − p) = np(1 − p)
Example 3.2. Continuing with Example 3.1, we can use the estimated parameter pˆ = 0.087 to compute the expected number of occurrences N of Long sepal length Irises via the binomial distribution:
E[N]=npˆ =150·0.087=13
In this case, because p is estimated from the sample via pˆ , it is not surprising that the expected number of occurrences of long Irises coincides with the actual occurrences. However, what is more interesting is that we can compute the variance in the number of occurrences:
var(N)=npˆ(1−pˆ)=150·0.079=11.9
As the sample size increases, the binomial distribution given in Eq. (3.3) tends to a normal distribution with μ = 13 and σ = √11.9 = 3.45 for our example. Thus, with confidence greater than 95% we can claim that the number of occurrences of a1 will lie in the range μ ± 2σ = [6.1, 19.9], which follows from the fact that for a normal distribution 95.45% of the probability mass lies within two standard deviations from the mean (see Section 2.5.1).
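A small sketch of these binomial quantities, with n and p̂ taken from Example 3.2:

```python
import numpy as np

n, p = 150, 0.087
mean_N = n * p                      # expected number of occurrences of a1
var_N = n * p * (1 - p)             # variance of the number of occurrences
sd_N = np.sqrt(var_N)

print(mean_N)                       # approximately 13
print(var_N, sd_N)                  # approximately 11.9 and 3.45
print(mean_N - 2 * sd_N, mean_N + 2 * sd_N)   # roughly the interval [6.1, 19.9]
```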
3.1.2 Multivariate Bernoulli Variable
We now consider the general case when X is a categorical attribute with domain {a1,a2,…,am}. We can model X as an m-dimensional Bernoulli random variable X = (A1,A2,…,Am)T, where each Ai is a Bernoulli variable with parameter pi denoting the probability of observing symbol ai . However, because X can assume only one of the symbolic values at any one time, if X=ai, then Ai =1, and Aj =0 for all j ̸= i. The range of the random variable X is thus the set {0,1}m, with the further restriction that if X = ai , then X = ei , where ei is the ith standard basis vector ei ∈ Rm given as
    e_i = (0, ..., 0, 1, 0, ..., 0)^T

where the single 1 appears in the ith position, preceded by i − 1 zeros and followed by m − i zeros.
In ei, only the ith element is 1 (eii = 1), whereas all other elements are zero (eij =0,∀j̸=i).
This is precisely the definition of a multivariate Bernoulli variable, which is a generalization of a Bernoulli variable from two outcomes to m outcomes. We thus model the categorical attribute X as a multivariate Bernoulli variable X defined as
X(v)=ei ifv=ai
The range of X consists of m distinct vector values {e1,e2,…,em}, with the PMF of X
given as
    P(X = e_i) = f(e_i) = p_i

where p_i is the probability of observing value a_i. These parameters must satisfy the condition

    Σ_{i=1}^m p_i = 1

The PMF can be written compactly as follows:

    P(X = e_i) = f(e_i) = Π_{j=1}^m p_j^{e_ij}    (3.4)

Because e_ii = 1, and e_ij = 0 for j ≠ i, we can see that, as expected, we have

    f(e_i) = Π_{j=1}^m p_j^{e_ij} = p_1^{e_i1} × ··· × p_i^{e_ii} × ··· × p_m^{e_im} = p_1^0 × ··· × p_i^1 × ··· × p_m^0 = p_i
Table 3.1. Discretized sepal length attribute

    Bins          Domain            Counts
    [4.3, 5.2]    Very Short (a1)   n1 = 45
    (5.2, 6.1]    Short (a2)        n2 = 50
    (6.1, 7.0]    Long (a3)         n3 = 43
    (7.0, 7.9]    Very Long (a4)    n4 = 12
Example 3.3. Let us consider the sepal length attribute (X1) for the Iris dataset shown in Table 1.2. We divide the sepal length into four equal-width intervals, and give each interval a name as shown in Table 3.1. We consider X1 as a categorical attribute with domain
{a1 =VeryShort,a2 =Short,a3 =Long,a4 =VeryLong}
We model the categorical attribute X1 as a multivariate Bernoulli variable X,
defined as

    X(v) = e1 = (1, 0, 0, 0)^T  if v = a1
           e2 = (0, 1, 0, 0)^T  if v = a2
           e3 = (0, 0, 1, 0)^T  if v = a3
           e4 = (0, 0, 0, 1)^T  if v = a4
For example, the symbolic point x1 = Short = a2 is represented as the vector (0,1,0,0)T = e2.
Mean
The mean or expected value of X can be obtained as

    μ = E[X] = Σ_{i=1}^m e_i f(e_i) = Σ_{i=1}^m e_i p_i = (p1, p2, ..., pm)^T = p    (3.5)

Sample Mean

Assume that each symbolic point x_i ∈ D is mapped to the variable x_i = X(x_i). The mapped dataset x1, x2, ..., xn is then assumed to be a random sample IID with X. We can compute the sample mean by placing a probability mass of 1/n at each point

    μ̂ = (1/n) Σ_{i=1}^n x_i = Σ_{i=1}^m (n_i/n) e_i = (n1/n, n2/n, ..., nm/n)^T = (p̂1, p̂2, ..., p̂m)^T = p̂    (3.6)
where ni is the number of occurrences of the vector value ei in the sample, which is equivalent to the number of occurrences of the symbol ai. Furthermore, we have
Figure 3.1. Probability mass function: sepal length (f(e1) = 0.3, f(e2) = 0.333, f(e3) = 0.287, f(e4) = 0.08).
Σ_{i=1}^m n_i = n, which follows from the fact that X can take on only m distinct values e_i, and the counts for each value must add up to the sample size n.
Example 3.4 (Sample Mean). Consider the observed counts ni for each of the values ai (ei ) of the discretized sepal length attribute, shown in Table 3.1. Because the total sample size is n = 150, from these we can obtain the estimates pˆ i as follows:
    p̂1 = 45/150 = 0.3        p̂2 = 50/150 = 0.333        p̂3 = 43/150 = 0.287        p̂4 = 12/150 = 0.08

The PMF for X is plotted in Figure 3.1, and the sample mean for X is given as

    μ̂ = p̂ = (0.3, 0.333, 0.287, 0.08)^T
Covariance Matrix
Recall that an m-dimensional multivariate Bernoulli variable is simply a vector of m Bernoulli variables. For instance, X = (A1,A2,…,Am)T, where Ai is the Bernoulli variable corresponding to symbol ai. The variance–covariance information between the constituent Bernoulli variables yields a covariance matrix for X.
Let us first consider the variance along each Bernoulli variable A_i. By Eq. (3.1), we immediately have

    σ_i² = var(A_i) = p_i(1 − p_i)

Next consider the covariance between A_i and A_j. Utilizing the identity in Eq. (2.21), we have

    σ_ij = E[A_i A_j] − E[A_i]·E[A_j] = 0 − p_i p_j = −p_i p_j

which follows from the fact that E[A_i A_j] = 0, as A_i and A_j cannot both be 1 at the same time, and thus their product A_i A_j = 0. This same fact leads to the negative relationship between A_i and A_j. What is interesting is that the degree of negative association is proportional to the product of the mean values for A_i and A_j.

From the preceding expressions for variance and covariance, the m × m covariance matrix for X is given as

    Σ = [ σ1²   σ12  ···  σ1m        [ p1(1−p1)   −p1p2      ···  −p1pm
          σ12   σ2²  ···  σ2m    =     −p1p2      p2(1−p2)   ···  −p2pm
          ...                          ...
          σ1m   σ2m  ···  σm² ]        −p1pm      −p2pm      ···  pm(1−pm) ]    (3.7)

Notice how each row in Σ sums to zero. For example, for row i, we have

    −p_i p1 − p_i p2 − ··· + p_i(1 − p_i) − ··· − p_i pm = p_i − p_i Σ_{j=1}^m p_j = p_i − p_i = 0

Because Σ is symmetric, it follows that each column also sums to zero.

Define P as the m × m diagonal matrix:

    P = diag(p) = diag(p1, p2, ..., pm)

We can compactly write the covariance matrix of X as

    Σ = P − p·p^T    (3.8)

Sample Covariance Matrix

The sample covariance matrix can be obtained from Eq. (3.8) in a straightforward manner:

    Σ̂ = P − p̂·p̂^T    (3.9)

where P = diag(p̂), and p̂ = μ̂ = (p̂1, p̂2, ..., p̂m)^T denotes the empirical probability mass function for X.
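A minimal sketch of Eq. (3.9); the probability vector used below is the one estimated in Example 3.4:

```python
import numpy as np

p_hat = np.array([0.3, 0.333, 0.287, 0.08])

Sigma_hat = np.diag(p_hat) - np.outer(p_hat, p_hat)   # Eq. (3.9): P - p p^T

print(np.round(Sigma_hat, 3))
print(np.round(Sigma_hat.sum(axis=1), 10))   # each row sums to (approximately) zero
```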
Example 3.5. Returning to the discretized sepal length attribute in Example 3.4, we have μ̂ = p̂ = (0.3, 0.333, 0.287, 0.08)^T. The sample covariance matrix is given as

    Σ̂ = P − p̂·p̂^T
       = diag(0.3, 0.333, 0.287, 0.08) − (0.3, 0.333, 0.287, 0.08)^T (0.3, 0.333, 0.287, 0.08)
       = [  0.21   −0.1    −0.086  −0.024
           −0.1     0.222  −0.096  −0.027
           −0.086  −0.096   0.204  −0.023
           −0.024  −0.027  −0.023   0.074 ]
One can verify that each row (and column) in sums to zero.
It is worth emphasizing that whereas the modeling of categorical attribute X as a multivariate Bernoulli variable, X = (A1, A2, ..., Am)^T, makes the structure of the mean and covariance matrix explicit, the same results would be obtained if we simply treat the mapped values X(x_i) as a new n × m binary data matrix, and apply the standard definitions of the mean and covariance matrix from multivariate numeric attribute analysis (see Section 2.3). In essence, the mapping from symbols a_i to binary vectors e_i is the key idea in categorical attribute analysis.
Example 3.6. Consider the sample D of size n = 5 for the sepal length attribute X1 in the Iris dataset, shown in Table 3.2a. As in Example 3.1, we assume that X1 has only two categorical values {Long, Short}. We model X1 as the multivariate Bernoulli variable X1 defined as

    X1(v) = e1 = (1, 0)^T  if v = Long (a1)
            e2 = (0, 1)^T  if v = Short (a2)

The sample mean [Eq. (3.6)] is

    μ̂ = p̂ = (2/5, 3/5)^T = (0.4, 0.6)^T

and the sample covariance matrix [Eq. (3.9)] is

    Σ̂ = P − p̂ p̂^T = [ 0.4  0        [ 0.16  0.24      [  0.24  −0.24
                       0    0.6 ] −    0.24  0.36 ]  =   −0.24   0.24 ]

Table 3.2. (a) Categorical dataset. (b) Mapped binary dataset. (c) Centered dataset.

    (a)                (b)                  (c)
         X                  A1   A2              Z1     Z2
    x1   Short         x1   0    1          z1   −0.4   0.4
    x2   Short         x2   0    1          z2   −0.4   0.4
    x3   Long          x3   1    0          z3    0.6  −0.6
    x4   Short         x4   0    1          z4   −0.4   0.4
    x5   Long          x5   1    0          z5    0.6  −0.6

To show that the same result would be obtained via standard numeric analysis, we map the categorical attribute X1 to the two Bernoulli attributes A1 and A2 corresponding to symbols Long and Short, respectively. The mapped dataset is shown in Table 3.2b. The sample mean is simply

    μ̂ = (1/5) Σ_{i=1}^5 x_i = (1/5)(2, 3)^T = (0.4, 0.6)^T

Next, we center the dataset by subtracting the mean value from each attribute. After centering, the mapped dataset is as shown in Table 3.2c, with attribute Z_i as the centered attribute A_i. We can compute the covariance matrix using the inner-product form [Eq. (2.30)] on the centered column vectors. We have

    σ̂1² = (1/5) Z1^T Z1 = 1.2/5 = 0.24
    σ̂2² = (1/5) Z2^T Z2 = 1.2/5 = 0.24
    σ̂12 = (1/5) Z1^T Z2 = −1.2/5 = −0.24

Thus, the sample covariance matrix is given as

    Σ̂ = [  0.24  −0.24
           −0.24   0.24 ]

which matches the result obtained by using the multivariate Bernoulli modeling approach.
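The equivalence in Example 3.6 can be checked directly; a sketch that one-hot encodes the five symbols and applies the standard centered inner-product formula:

```python
import numpy as np

symbols = ["Short", "Short", "Long", "Short", "Long"]
domain = ["Long", "Short"]                    # a1 = Long, a2 = Short

# Map each symbol to its standard basis vector e_i (one-hot encoding).
X = np.array([[1.0 if s == a else 0.0 for a in domain] for s in symbols])
n = X.shape[0]

mu_hat = X.mean(axis=0)                       # (0.4, 0.6)
Z = X - mu_hat                                # centered binary data matrix
Sigma_hat = Z.T @ Z / n                       # Eq. (2.30)

print(mu_hat)
print(Sigma_hat)                              # [[0.24, -0.24], [-0.24, 0.24]]
```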
Multinomial Distribution: Number of Occurrences
Given a multivariate Bernoulli variable X and a random sample {x1, x2, ..., xn} drawn from X, let N_i be the random variable corresponding to the number of occurrences of symbol a_i in the sample, and let N = (N1, N2, ..., Nm)^T denote the vector random variable corresponding to the joint distribution of the number of occurrences over all the symbols. Then N has a multinomial distribution, given as

    f( N = (n1, n2, ..., nm) | p ) = (n choose n1 n2 ... nm) Π_{i=1}^m p_i^{n_i}

We can see that this is a direct generalization of the binomial distribution in Eq. (3.3). The term

    (n choose n1 n2 ... nm) = n! / (n1! n2! ··· nm!)

denotes the number of ways of choosing n_i occurrences of each symbol a_i from a sample of size n, with Σ_{i=1}^m n_i = n.

The mean and covariance matrix of N are given as n times the mean and covariance matrix of X. That is, the mean of N is given as

    μ_N = E[N] = n E[X] = n·μ = n·p = (np1, ..., npm)^T

and its covariance matrix is given as

    Σ_N = n·(P − p p^T) = [ np1(1−p1)   −np1p2      ···  −np1pm
                            −np1p2      np2(1−p2)   ···  −np2pm
                            ...
                            −np1pm      −np2pm      ···  npm(1−pm) ]

Likewise the sample mean and covariance matrix for N are given as

    μ̂_N = n p̂        Σ̂_N = n(P − p̂ p̂^T)

3.2 BIVARIATE ANALYSIS
Assume that the data comprises two categorical attributes, X1 and X2, with dom(X1)={a11,a12,…,a1m1}
dom(X2)={a21,a22,…,a2m2}
We are given n categorical points of the form xi = (xi1,xi2)T with xi1 ∈ dom(X1) and
xi2 ∈ dom(X2). The dataset is thus an n × 2 symbolic data matrix:
    D = [ X1   X2
          x11  x12
          x21  x22
          ...
          xn1  xn2 ]
We can model X1 and X2 as multivariate Bernoulli variables X1 and X2 with dimensions m1 and m2, respectively. The probability mass functions for X1 and X2 are
given according to Eq. (3.4):

    P(X1 = e1i) = f1(e1i) = p^1_i = Π_{k=1}^{m1} (p^1_k)^{e^1_ik}
    P(X2 = e2j) = f2(e2j) = p^2_j = Π_{k=1}^{m2} (p^2_k)^{e^2_jk}

where e1i is the ith standard basis vector in R^{m1} (for attribute X1) whose kth component is e^1_ik, and e2j is the jth standard basis vector in R^{m2} (for attribute X2) whose kth component is e^2_jk. Further, the parameter p^1_i denotes the probability of observing symbol a1i, and p^2_j denotes the probability of observing symbol a2j. Together they must satisfy the conditions Σ_{i=1}^{m1} p^1_i = 1 and Σ_{j=1}^{m2} p^2_j = 1.

The joint distribution of X1 and X2 is modeled as the d′ = m1 + m2 dimensional vector variable X = (X1; X2), specified by the mapping

    X((v1, v2)^T) = (X1(v1); X2(v2)) = (e1i; e2j)

provided that v1 = a1i and v2 = a2j. The range of X thus consists of m1 × m2 distinct pairs of vector values (e1i, e2j)^T, with 1 ≤ i ≤ m1 and 1 ≤ j ≤ m2. The joint PMF of X is given as

    P( X = (e1i, e2j)^T ) = f(e1i, e2j) = p_ij = Π_{r=1}^{m1} Π_{s=1}^{m2} p_ij^{e^1_ir · e^2_js}

where p_ij is the probability of observing the symbol pair (a1i, a2j). These probability parameters must satisfy the condition Σ_{i=1}^{m1} Σ_{j=1}^{m2} p_ij = 1. The joint PMF for X can be expressed as the m1 × m2 matrix

    P12 = [ p11    p12    ···  p1m2
            p21    p22    ···  p2m2
            ...
            pm11   pm12   ···  pm1m2 ]    (3.10)
Example 3.7. Consider the discretized sepal length attribute (X1) in Table 3.1. We also discretize the sepal width attribute (X2) into three values as shown in Table 3.3. We thus have
dom(X1) = {a11 =VeryShort,a12 =Short,a13 =Long,a14 =VeryLong} dom(X2) = {a21 =Short,a22 =Medium,a23 =Long}
The symbolic point x = (Short,Long) = (a12,a23), is mapped to the vector
    X(x) = (e12; e23) = (0, 1, 0, 0 | 0, 0, 1)^T ∈ R^7
Table 3.3. Discretized sepal width attribute

    Bins          Domain         Counts
    [2.0, 2.8]    Short (a1)     47
    (2.8, 3.6]    Medium (a2)    88
    (3.6, 4.4]    Long (a3)      15
where we use | to demarcate the two subvectors e12 = (0,1,0,0)T ∈ R4 and e23 = (0, 0, 1)T ∈ R3 , corresponding to symbolic attributes sepal length and sepal width, respectively. Note that e12 is the second standard basis vector in R4 for X1, and e23 is the third standard basis vector in R3 for X2.
Mean
The bivariate mean can easily be generalized from Eq. (3.5), as follows:

    μ = E[X] = E[(X1; X2)] = (E[X1]; E[X2]) = (μ1; μ2) = (p1; p2)

where μ1 = p1 = (p^1_1, ..., p^1_{m1})^T and μ2 = p2 = (p^2_1, ..., p^2_{m2})^T are the mean vectors for X1 and X2. The vectors p1 and p2 also represent the probability mass functions for X1 and X2, respectively.
Sample Mean
The sample mean can also be generalized from Eq. (3.6), by placing a probability mass of 1/n at each point:

    μ̂ = (1/n) Σ_{i=1}^n x_i = ( (1/n) Σ_{i=1}^{m1} n^1_i e1i ; (1/n) Σ_{j=1}^{m2} n^2_j e2j )
       = (n^1_1/n, ..., n^1_{m1}/n | n^2_1/n, ..., n^2_{m2}/n)^T = (p̂1; p̂2) = (μ̂1; μ̂2)

where n^i_j is the observed frequency of symbol a_ij in the sample of size n, and μ̂_i = p̂_i = (p̂^i_1, p̂^i_2, ..., p̂^i_{m_i})^T is the sample mean vector for X_i, which is also the empirical PMF for attribute X_i.
Covariance Matrix
The covariance matrix for X is the d′ × d′ = (m1 + m2) × (m1 + m2) matrix given as
    Σ = [ Σ11     Σ12
          Σ12^T   Σ22 ]    (3.11)

where Σ11 is the m1 × m1 covariance matrix for X1, and Σ22 is the m2 × m2 covariance matrix for X2, which can be computed using Eq. (3.8). That is,

    Σ11 = P1 − p1 p1^T
    Σ22 = P2 − p2 p2^T

where P1 = diag(p1) and P2 = diag(p2). Further, Σ12 is the m1 × m2 covariance matrix between variables X1 and X2, given as

    Σ12 = E[(X1 − μ1)(X2 − μ2)^T]
        = E[X1 X2^T] − E[X1] E[X2]^T
        = P12 − μ1 μ2^T
        = P12 − p1 p2^T
        = [ p11 − p^1_1 p^2_1     p12 − p^1_1 p^2_2     ···  p1m2 − p^1_1 p^2_{m2}
            p21 − p^1_2 p^2_1     p22 − p^1_2 p^2_2     ···  p2m2 − p^1_2 p^2_{m2}
            ...
            pm11 − p^1_{m1} p^2_1   pm12 − p^1_{m1} p^2_2   ···  pm1m2 − p^1_{m1} p^2_{m2} ]
where P12 represents the joint PMF for X given in Eq. (3.10).
Incidentally, each row and each column of Σ12 sums to zero. For example, consider row i and column j:

    Σ_{k=1}^{m2} (p_ik − p^1_i p^2_k) = ( Σ_{k=1}^{m2} p_ik ) − p^1_i = p^1_i − p^1_i = 0
    Σ_{k=1}^{m1} (p_kj − p^1_k p^2_j) = ( Σ_{k=1}^{m1} p_kj ) − p^2_j = p^2_j − p^2_j = 0

which follows from the fact that summing the joint mass function over all values of X2 yields the marginal distribution of X1, and summing it over all values of X1 yields the marginal distribution of X2. Note that p^2_j is the probability of observing symbol a2j; it should not be confused with the square of p_j. Combined with the fact that Σ11 and Σ22 also have row and column sums equal to zero via Eq. (3.7), the full covariance matrix Σ has rows and columns that sum up to zero.
Sample Covariance Matrix
The sample covariance matrix is given as
    Σ̂ = [ Σ̂11     Σ̂12
          Σ̂12^T   Σ̂22 ]    (3.12)

where

    Σ̂11 = P1 − p̂1 p̂1^T
    Σ̂22 = P2 − p̂2 p̂2^T
    Σ̂12 = P12 − p̂1 p̂2^T

Here P1 = diag(p̂1) and P2 = diag(p̂2), and p̂1 and p̂2 specify the empirical probability mass functions for X1 and X2, respectively. Further, P12 specifies the empirical joint PMF for X1 and X2, given as

    P12(i, j) = f̂(e1i, e2j) = (1/n) Σ_{k=1}^n I_ij(x_k) = n_ij / n = p̂_ij    (3.13)
where I_ij is the indicator variable

    I_ij(x_k) = 1 if x_k1 = e1i and x_k2 = e2j
                0 otherwise
Taking the sum of I_ij(x_k) over all the n points in the sample yields the number of occurrences, n_ij, of the symbol pair (a1i, a2j) in the sample. One issue with the cross-attribute covariance matrix Σ̂12 is the need to estimate a quadratic number of parameters. That is, we need to obtain reliable counts n_ij to estimate the parameters p_ij, for a total of O(m1 × m2) parameters that have to be estimated, which can be a problem if the categorical attributes have many symbols. On the other hand, estimating Σ̂11 and Σ̂22 requires that we estimate m1 and m2 parameters, corresponding to p^1_i and p^2_j, respectively. In total, computing Σ̂ requires the estimation of m1 m2 + m1 + m2 parameters.
Example 3.8. We continue with the bivariate categorical attributes X1 and X2 in Example 3.7. From Example 3.4, and from the occurrence counts for each of the values of sepal width in Table 3.3, we have
    μ̂1 = p̂1 = (0.3, 0.333, 0.287, 0.08)^T        μ̂2 = p̂2 = (1/150)(47, 88, 15)^T = (0.313, 0.587, 0.1)^T

Thus, the mean for X = (X1; X2) is given as

    μ̂ = (μ̂1; μ̂2) = (p̂1; p̂2) = (0.3, 0.333, 0.287, 0.08 | 0.313, 0.587, 0.1)^T

From Example 3.5 we have

    Σ̂11 = [  0.21   −0.1    −0.086  −0.024
            −0.1     0.222  −0.096  −0.027
            −0.086  −0.096   0.204  −0.023
            −0.024  −0.027  −0.023   0.074 ]

In a similar manner we can obtain

    Σ̂22 = [  0.215  −0.184  −0.031
            −0.184   0.242  −0.059
            −0.031  −0.059   0.09  ]

Table 3.4. Observed counts (n_ij): sepal length and sepal width

                              X2: Short (e21)   Medium (e22)   Long (e23)
    X1: Very Short (e11)            7                33             5
        Short (e12)                 24               18             8
        Long (e13)                  13               30             0
        Very Long (e14)             3                7              2

Next, we use the observed counts in Table 3.4 to obtain the empirical joint PMF for X1 and X2 using Eq. (3.13), as plotted in Figure 3.2. From these probabilities we get

    E[X1 X2^T] = P12 = (1/150) [  7  33  5        [ 0.047  0.22   0.033
                                 24  18  8    =     0.16   0.12   0.053
                                 13  30  0          0.087  0.2    0
                                  3   7  2 ]        0.02   0.047  0.013 ]

Figure 3.2. Empirical joint probability mass function: sepal length and sepal width.

Further, we have

    E[X1] E[X2]^T = μ̂1 μ̂2^T = p̂1 p̂2^T = (0.3, 0.333, 0.287, 0.08)^T (0.313  0.587  0.1)
                  = [ 0.094  0.176  0.03
                      0.104  0.196  0.033
                      0.09   0.168  0.029
                      0.025  0.047  0.008 ]

We can now compute the across-attribute sample covariance matrix Σ̂12 for X1 and X2 using Eq. (3.11), as follows:

    Σ̂12 = P12 − p̂1 p̂2^T
        = [ −0.047   0.044   0.003
             0.056  −0.076   0.02
            −0.003   0.032  −0.029
            −0.005   0       0.005 ]

One can observe that each row and column in Σ̂12 sums to zero. Putting it all together, from Σ̂11, Σ̂22 and Σ̂12 we obtain the sample covariance matrix as follows

    Σ̂ = [ Σ̂11     Σ̂12
          Σ̂12^T   Σ̂22 ]

      = [  0.21   −0.1    −0.086  −0.024 | −0.047   0.044   0.003
          −0.1     0.222  −0.096  −0.027 |  0.056  −0.076   0.02
          −0.086  −0.096   0.204  −0.023 | −0.003   0.032  −0.029
          −0.024  −0.027  −0.023   0.074 | −0.005   0       0.005
          ---------------------------------------------------------
          −0.047   0.056  −0.003  −0.005 |  0.215  −0.184  −0.031
           0.044  −0.076   0.032   0     | −0.184   0.242  −0.059
           0.003   0.02   −0.029   0.005 | −0.031  −0.059   0.09  ]

In Σ̂, each row and column also sums to zero.
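A sketch that recomputes Σ̂12 from the observed counts in Table 3.4:

```python
import numpy as np

# Observed counts n_ij from Table 3.4 (rows: sepal length bins, cols: sepal width bins).
N12 = np.array([[ 7, 33,  5],
                [24, 18,  8],
                [13, 30,  0],
                [ 3,  7,  2]])
n = N12.sum()                       # 150

P12 = N12 / n                       # empirical joint PMF, Eq. (3.13)
p1_hat = N12.sum(axis=1) / n        # empirical PMF of sepal length
p2_hat = N12.sum(axis=0) / n        # empirical PMF of sepal width

Sigma12_hat = P12 - np.outer(p1_hat, p2_hat)    # cross-covariance, Eq. (3.11)

print(np.round(Sigma12_hat, 3))
print(np.round(Sigma12_hat.sum(axis=0), 10))    # columns sum to zero
print(np.round(Sigma12_hat.sum(axis=1), 10))    # rows sum to zero
```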
3.2.1 Attribute Dependence: Contingency Analysis
Testing for the independence of the two categorical random variables X1 and X2 can be done via contingency table analysis. The main idea is to set up a hypothesis testing framework, where the null hypothesis H0 is that X1 and X2 are independent, and the alternative hypothesis H1 is that they are dependent. We then compute the value of the chi-square statistic χ2 under the null hypothesis. Depending on the p-value, we either accept or reject the null hypothesis; in the latter case the attributes are considered to be dependent.
Contingency Table
A contingency table for X1 and X2 is the m1 × m2 matrix of observed counts nij for all pairs of values (e1i , e2j ) in the given sample of size n, defined as
    N12 = n · P12 = [ n11    n12    ···  n1m2
                      n21    n22    ···  n2m2
                      ...
                      nm11   nm12   ···  nm1m2 ]
where P12 is the empirical joint PMF for X1 and X2, computed via Eq. (3.13). The contingency table is then augmented with row and column marginal counts, as follows:

    N1 = n·p̂1 = (n^1_1, ..., n^1_{m1})^T        N2 = n·p̂2 = (n^2_1, ..., n^2_{m2})^T

Note that the marginal row and column entries and the sample size satisfy the following constraints:

    n^1_i = Σ_{j=1}^{m2} n_ij        n^2_j = Σ_{i=1}^{m1} n_ij        n = Σ_{i=1}^{m1} n^1_i = Σ_{j=1}^{m2} n^2_j = Σ_{i=1}^{m1} Σ_{j=1}^{m2} n_ij

It is worth noting that both N1 and N2 have a multinomial distribution with parameters p1 = (p^1_1, ..., p^1_{m1}) and p2 = (p^2_1, ..., p^2_{m2}), respectively. Further, N12 also has a multinomial distribution with parameters P12 = {p_ij}, for 1 ≤ i ≤ m1 and 1 ≤ j ≤ m2.

Example 3.9 (Contingency Table). Table 3.4 shows the observed counts for the discretized sepal length (X1) and sepal width (X2) attributes. Augmenting the table with the row and column marginal counts and the sample size yields the final contingency table shown in Table 3.5.

Table 3.5. Contingency table: sepal length vs. sepal width

                                       Sepal width (X2)
    Sepal length (X1)        Short (a21)   Medium (a22)   Long (a23)    Row counts
    Very Short (a11)             7             33             5         n^1_1 = 45
    Short (a12)                  24            18             8         n^1_2 = 50
    Long (a13)                   13            30             0         n^1_3 = 43
    Very Long (a14)              3             7              2         n^1_4 = 12
    Column counts            n^2_1 = 47    n^2_2 = 88    n^2_3 = 15     n = 150

χ2 Statistic and Hypothesis Testing

Under the null hypothesis X1 and X2 are assumed to be independent, which means that their joint probability mass function is given as

    p̂_ij = p̂^1_i · p̂^2_j

Under this independence assumption, the expected frequency for each pair of values is given as

    e_ij = n · p̂_ij = n · p̂^1_i · p̂^2_j = n · (n^1_i/n) · (n^2_j/n) = n^1_i n^2_j / n    (3.14)

However, from the sample we already have the observed frequency of each pair of values, n_ij. We would like to determine whether there is a significant difference in the observed and expected frequencies for each pair of values. If there is no
significant difference, then the independence assumption is valid and we accept the null hypothesis that the attributes are independent. On the other hand, if there is a significant difference, then the null hypothesis should be rejected and we conclude that the attributes are dependent.
The χ2 statistic quantifies the difference between observed and expected counts for each pair of values; it is defined as follows:
    χ² = Σ_{i=1}^{m1} Σ_{j=1}^{m2} (n_ij − e_ij)² / e_ij    (3.15)
At this point, we need to determine the probability of obtaining the computed χ2 value. In general, this can be rather difficult if we do not know the sampling distribution of a given statistic. Fortunately, for the χ2 statistic it is known that its sampling distribution follows the chi-squared density function with q degrees of freedom:
    f(x | q) = ( 1 / (2^{q/2} Γ(q/2)) ) x^{q/2 − 1} e^{−x/2}    (3.16)

where the gamma function Γ is defined as

    Γ(k > 0) = ∫_0^∞ x^{k−1} e^{−x} dx    (3.17)
The degrees of freedom, q, represent the number of independent parameters. In the contingency table there are m1 × m2 observed counts nij . However, note that each row i and each column j must sum to n1i and nj2, respectively. Further, the sum of the row and column marginals must also add to n; thus we have to remove (m1 + m2) parameters from the number of independent parameters. However, doing this removes one of the parameters, say nm1 m2 , twice, so we have to add back one to the count. The total degrees of freedom is therefore
    q = |dom(X1)| × |dom(X2)| − (|dom(X1)| + |dom(X2)|) + 1
      = m1 m2 − m1 − m2 + 1
      = (m1 − 1)(m2 − 1)
p-value
The p-value of a statistic θ is defined as the probability of obtaining a value at least as extreme as the observed value, say z, under the null hypothesis, defined as
    p-value(z) = P(θ ≥ z) = 1 − F(z)
where F (θ ) is the cumulative probability distribution for the statistic.
The p-value gives a measure of how surprising is the observed value of the statistic. If the observed value lies in a low-probability region, then the value is more surprising. In general, the lower the p-value, the more surprising the observed value, and the
more the grounds for rejecting the null hypothesis. The null hypothesis is rejected if the p-value is below some significance level, α. For example, if α = 0.01, then we reject the null hypothesis if p-value(z) ≤ α. The significance level α corresponds to the probability of rejecting the null hypothesis when it is true. For a given significance level α, the value of the test statistic, say z, with a p-value of p-value(z) = α, is called a critical value. An alternative test for rejection of the null hypothesis is to check if χ² > z, as in that case the p-value of the observed χ² value is bounded by α, that is, p-value(χ²) ≤ p-value(z) = α. The value 1 − α is also called the confidence level.

Table 3.6. Expected counts

                             X2: Short (a21)   Medium (a22)   Long (a23)
    X1: Very Short (a11)          14.1             26.4           4.5
        Short (a12)               15.67            29.33          5.0
        Long (a13)                13.47            25.23          4.3
        Very Long (a14)           3.76             7.04           1.2
Example 3.10. Consider the contingency table for sepal length and sepal width in Table 3.5. We compute the expected counts using Eq. (3.14); these counts are shown in Table 3.6. For example, we have
    e11 = n^1_1 n^2_1 / n = (45 · 47)/150 = 2115/150 = 14.1
Next we use Eq. (3.15) to compute the value of the χ 2 statistic, which is given as χ2 =21.8.
Further, the number of degrees of freedom is given as q =(m1 −1)·(m2 −1)=3·2=6
The plot of the chi-squared density function with 6 degrees of freedom is shown in Figure 3.3. From the cumulative chi-squared distribution, we obtain
p-value(21.8) = 1 − F (21.8|6) = 1 − 0.9987 = 0.0013
At a significance level of α = 0.01, we would certainly be justified in rejecting the null hypothesis because the large value of the χ2 statistic is indeed surprising. Further, at the 0.01 significance level, the critical value of the statistic is
z=F−1(1−0.01|6)=F−1(0.99|6)=16.81
This critical value is also shown in Figure 3.3, and we can clearly see that the observed value of 21.8 is in the rejection region, as 21.8 > z = 16.81. In effect, we reject the null hypothesis that sepal length and sepal width are independent, and accept the alternative hypothesis that they are dependent.
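The computations in Example 3.10 can be reproduced with a standard chi-squared routine; a sketch using scipy (the counts are those of Table 3.5):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

N12 = np.array([[ 7, 33,  5],
                [24, 18,  8],
                [13, 30,  0],
                [ 3,  7,  2]])

stat, p_value, dof, expected = chi2_contingency(N12, correction=False)
print(stat, dof)              # approximately 21.8 with q = 6 degrees of freedom
print(p_value)                # approximately 0.0013
print(chi2.ppf(0.99, dof))    # critical value at alpha = 0.01, approximately 16.81
```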
Figure 3.3. Chi-squared distribution (q = 6), showing the H0 rejection region for α = 0.01, the critical value 16.81, and the observed statistic 21.8.

3.3 MULTIVARIATE ANALYSIS
Assume that the dataset comprises d categorical attributes Xj (1 ≤ j ≤ d) with dom(Xj) = {aj1, aj2, ..., ajmj}. We are given n categorical points of the form x_i = (x_i1, x_i2, ..., x_id)^T with x_ij ∈ dom(Xj). The dataset is thus an n × d symbolic matrix

    D = [ X1   X2   ···  Xd
          x11  x12  ···  x1d
          x21  x22  ···  x2d
          ...
          xn1  xn2  ···  xnd ]

Each attribute Xi is modeled as an mi-dimensional multivariate Bernoulli variable Xi, and their joint distribution is modeled as a d′ = Σ_{j=1}^d mj dimensional vector random variable

    X = (X1; X2; ...; Xd)

Each categorical data point v = (v1, v2, ..., vd)^T is therefore represented as a d′-dimensional binary vector

    X(v) = (X1(v1); ...; Xd(vd)) = (e1k1; ...; edkd)
provided vi = aiki, the ki-th symbol of Xi. Here eiki is the ki-th standard basis vector in R^{mi}.

Mean
Generalizing from the bivariate case, the mean and sample mean for X are given as
    μ = E[X] = (μ1; μ2; ...; μd) = (p1; p2; ...; pd)        μ̂ = (μ̂1; μ̂2; ...; μ̂d) = (p̂1; p̂2; ...; p̂d)

where p_i = (p^i_1, ..., p^i_{mi})^T is the PMF for Xi, and p̂_i = (p̂^i_1, ..., p̂^i_{mi})^T is the empirical PMF for Xi.

Covariance Matrix

The covariance matrix for X, and its estimate from the sample, are given as the d′ × d′ matrices:
    Σ = [ Σ11     Σ12    ···  Σ1d            Σ̂ = [ Σ̂11     Σ̂12    ···  Σ̂1d
          Σ12^T   Σ22    ···  Σ2d                  Σ̂12^T   Σ̂22    ···  Σ̂2d
          ...                                      ...
          Σ1d^T   Σ2d^T  ···  Σdd ]                Σ̂1d^T   Σ̂2d^T  ···  Σ̂dd ]

where d′ = Σ_{i=1}^d mi, and Σij (and Σ̂ij) is the mi × mj covariance matrix (and its estimate) for attributes Xi and Xj:

    Σij = Pij − p_i p_j^T        Σ̂ij = P̂ij − p̂_i p̂_j^T    (3.18)

Here Pij is the joint PMF and P̂ij is the empirical joint PMF for Xi and Xj, which can be computed using Eq. (3.13).
Example 3.11 (Multivariate Analysis). Let us consider the 3-dimensional subset of the Iris dataset, with the discretized attributes sepal length (X1) and sepal width (X2), and the categorical attribute class (X3). The domains for X1 and X2 are given in Table 3.1 and Table 3.3, respectively, and dom(X3) = {iris-versicolor, iris-setosa, iris-virginica}. Each value of X3 occurs 50 times.
The categorical point x = (Short, Medium, iris-versicolor) is modeled as the vector

X(x) = (e12, e22, e31)^T = (0,1,0,0 | 0,1,0 | 1,0,0)^T ∈ R^10
From Example 3.8 and the fact that each value in dom(X3) occurs 50 times in a
sample of n = 150, the sample mean is given as
μ̂ = (μ̂1, μ̂2, μ̂3)^T = (p̂1, p̂2, p̂3)^T = (0.3, 0.333, 0.287, 0.08 | 0.313, 0.587, 0.1 | 0.33, 0.33, 0.33)^T
Using p̂3 = (0.33, 0.33, 0.33)^T we can compute the sample covariance matrix for X3 using Eq. (3.9):

Σ̂33 = (  0.222  −0.111  −0.111 )
      ( −0.111   0.222  −0.111 )
      ( −0.111  −0.111   0.222 )

Using Eq. (3.18) we obtain the cross-covariance matrices Σ̂13 ∈ R^{4×3} and Σ̂23 ∈ R^{3×3} between X1 and X3 and between X2 and X3, respectively. Combined with Σ̂11, Σ̂22 and Σ̂12 from Example 3.8, the final sample covariance matrix is the 10 × 10 symmetric matrix given as

Σ̂ = ( Σ̂11    Σ̂12    Σ̂13 )
    ( Σ̂12^T  Σ̂22    Σ̂23 )
    ( Σ̂13^T  Σ̂23^T  Σ̂33 )
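As a concrete illustration of this representation, the following Python sketch (our own code, not from the text) one-hot encodes a small hypothetical symbolic dataset and computes the sample mean μ̂ and sample covariance matrix Σ̂ directly from the encoded matrix; applied to the discretized Iris attributes it reproduces the quantities of Example 3.11, since Σ̂ = P̂ − μ̂ μ̂^T blockwise.

import numpy as np

def one_hot_encode(D, domains):
    """Map an n x d symbolic matrix D to its n x d' binary representation X(v)."""
    blocks = []
    for j, dom in enumerate(domains):
        idx = {a: k for k, a in enumerate(dom)}
        block = np.zeros((len(D), len(dom)))
        for i, row in enumerate(D):
            block[i, idx[row[j]]] = 1.0     # standard basis vector for the observed symbol
        blocks.append(block)
    return np.hstack(blocks)

# Hypothetical toy data over dom(X1) = {a, b} and dom(X2) = {x, y, z}
D = [("a", "x"), ("a", "y"), ("b", "z"), ("a", "x")]
domains = [("a", "b"), ("x", "y", "z")]
Z = one_hot_encode(D, domains)

mu_hat = Z.mean(axis=0)                                   # stacked empirical PMFs (p̂1 | p̂2)
Sigma_hat = (Z - mu_hat).T @ (Z - mu_hat) / len(Z)        # d' x d' sample covariance matrix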
3.3.1 Multiway Contingency Analysis
For multiway dependence analysis, we have to first determine the empirical joint probability mass function for X:

f̂(e_{1i_1}, e_{2i_2}, …, e_{di_d}) = (1/n) Σ_{k=1}^n I_{i_1 i_2 … i_d}(x_k) = n_{i_1 i_2 … i_d} / n = p̂_{i_1 i_2 … i_d}

where I_{i_1 i_2 … i_d} is the indicator variable

I_{i_1 i_2 … i_d}(x_k) = 1 if x_{k1} = e_{1i_1}, x_{k2} = e_{2i_2}, …, x_{kd} = e_{di_d}, and 0 otherwise

The sum of I_{i_1 i_2 … i_d} over all the n points in the sample yields the number of occurrences, n_{i_1 i_2 … i_d}, of the symbolic vector (a_{1i_1}, a_{2i_2}, …, a_{di_d}). Dividing the occurrences by the sample size results in the probability of observing those symbols. Using the notation i = (i_1, i_2, …, i_d) to denote the index tuple, we can write the joint empirical PMF as the d-dimensional matrix P̂ of size m1 × m2 × ··· × md = ∏_{i=1}^d mi, given as

P̂(i) = p̂_i for all index tuples i, with 1 ≤ i_1 ≤ m1, …, 1 ≤ i_d ≤ md

where p̂_i = p̂_{i_1 i_2 … i_d}. The d-dimensional contingency table is then given as

N = n × P̂ = {n_i} for all index tuples i, with 1 ≤ i_1 ≤ m1, …, 1 ≤ i_d ≤ md

where n_i = n_{i_1 i_2 … i_d}. The contingency table is augmented with the marginal count vectors Ni for all d attributes Xi:

Ni = n p̂i = (n_1^i, …, n_{mi}^i)^T

where p̂i is the empirical PMF for Xi.
χ2-Test
We can test for a d-way dependence between the d categorical attributes using the null hypothesis H0 that they are d-way independent. The alternative hypothesis H1 is that they are not d-way independent, that is, they are dependent in some way. Note that d-dimensional contingency analysis indicates whether all d attributes taken together are independent or not. In general we may have to conduct k-way contingency analysis to test if any subset of k ≤ d attributes are independent or not.
Under the null hypothesis, the expected number of occurrences of the symbol tuple (a1i1,a2i2,…,adid ) is given as
e_i = n · p̂_i = n · ∏_{j=1}^d p̂_{i_j}^j = (n_{i_1}^1 n_{i_2}^2 ··· n_{i_d}^d) / n^{d−1}        (3.19)

The chi-squared statistic measures the difference between the observed counts n_i and the expected counts e_i:

χ2 = Σ_i (n_i − e_i)² / e_i = Σ_{i_1=1}^{m_1} Σ_{i_2=1}^{m_2} ··· Σ_{i_d=1}^{m_d} (n_{i_1,…,i_d} − e_{i_1,…,i_d})² / e_{i_1,…,i_d}        (3.20)
The χ2 statistic follows a chi-squared density function with q degrees of freedom.
For the d-way contingency table we can compute q by noting that there are ostensibly ∏_{i=1}^d |dom(Xi)| independent parameters (the counts). However, we have to remove Σ_{i=1}^d |dom(Xi)| degrees of freedom because the marginal count vector along each dimension Xi must equal Ni. However, doing so removes one of the parameters d times, so we need to add back d − 1 to the free parameters count. The total number of degrees of freedom is given as

q = ∏_{i=1}^d |dom(Xi)| − Σ_{i=1}^d |dom(Xi)| + (d − 1)
  = ∏_{i=1}^d mi − Σ_{i=1}^d mi + d − 1        (3.21)
To reject the null hypothesis, we have to check whether the p-value of the observed χ2 value is smaller than the desired significance level α (say α = 0.01) using the chi-squared density with q degrees of freedom [Eq. (3.16)].
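A minimal Python sketch of the d-way test follows (our own helper, assuming the observed counts are stored in a d-dimensional NumPy array).

import numpy as np
from scipy.stats import chi2
from functools import reduce

def multiway_chi2(N, alpha=0.01):
    """d-way test of independence for a contingency tensor N of observed counts."""
    N = np.asarray(N, dtype=float)
    d = N.ndim
    n = N.sum()
    # marginal count vectors N_i, obtained by summing out all other dimensions
    marginals = [N.sum(axis=tuple(k for k in range(d) if k != i)) for i in range(d)]
    # expected counts e_i = (n^1_{i1} ... n^d_{id}) / n^{d-1}, Eq. (3.19)
    E = reduce(np.multiply.outer, marginals) / n ** (d - 1)
    stat = ((N - E) ** 2 / E).sum()                       # chi-squared statistic, Eq. (3.20)
    q = int(np.prod(N.shape) - sum(N.shape) + d - 1)      # degrees of freedom, Eq. (3.21)
    return stat, q, chi2.sf(stat, q), chi2.ppf(1 - alpha, q)

# For the 3-way table in Figure 3.4 this should give chi2 = 231.06 with q = 28
# and critical value 48.28 at alpha = 0.01, as in Example 3.12.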
Figure 3.4. 3-Way contingency table (with marginal counts along each dimension).

Table 3.7. 3-Way expected counts (identical for each of the three values a31, a32, a33 of X3)

X2 \ X1 |  a11   a12   a13   a14
a21     |  1.25  4.49  5.22  4.70
a22     |  2.35  8.41  9.78  8.80
a23     |  0.40  1.43  1.67  1.50
Example 3.12. Consider the 3-way contingency table in Figure 3.4. It shows the observed counts for each tuple of symbols (a1i , a2j , a3k ) for the three attributes sepal length (X1), sepal width (X2), and class (X3). From the marginal counts for X1 and X2 in Table 3.5, and the fact that all three values of X3 occur 50 times, we can compute the expected counts [Eq. (3.19)] for each cell. For instance,
e(4,1,1) = (n_4^1 · n_1^2 · n_1^3) / n² = (45 · 47 · 50) / (150 · 150) = 4.7
The expected counts are the same for all three values of X3 and are given in Table 3.7. The value of the χ 2 statistic [Eq. (3.20)] is given as
χ2 =231.06
(Figure 3.4: the marginal counts are 12, 43, 50, 45 for a11, …, a14 of X1; 47, 88, 15 for a21, a22, a23 of X2; and 50, 50, 50 for a31, a32, a33 of X3.)
Using Eq. (3.21), the number of degrees of freedom is given as
q = 4 · 3 · 3 − (4 + 3 + 3) + 2 = 36 − 10 + 2 = 28
In Figure 3.4 the counts in bold are the dependent parameters. All other counts are independent. In fact, any eight distinct cells could have been chosen as the dependent parameters.
For a significance level of α = 0.01, the critical value of the chi-square distribution is z = 48.28. The observed value of χ2 = 231.06 is much greater than z, and it is thus extremely unlikely to happen under the null hypothesis. We conclude that the three attributes are not 3-way independent, but rather there is some dependence between them. However, this example also highlights one of the pitfalls of multiway contingency analysis. We can observe in Figure 3.4 that many of the observed counts are zero. This is due to the fact that the sample size is small, and we cannot reliably estimate all the multiway counts. Consequently, the dependence test may not be reliable as well.
3.4 DISTANCE AND ANGLE
With the modeling of categorical attributes as multivariate Bernoulli variables, it is
possible to compute the distance or the angle between any two points xi and xj:

xi = (e_{1i_1}, …, e_{di_d})^T        xj = (e_{1j_1}, …, e_{dj_d})^T
The different measures of distance and similarity rely on the number of matching and mismatching values (or symbols) across the d attributes Xk. For instance, we can compute the number of matching values s via the dot product:
s = xi^T xj = Σ_{k=1}^d (e_{ki_k})^T e_{kj_k}
On the other hand, the number of mismatches is simply d − s. Also useful is the norm of each point:

∥xi∥² = xi^T xi = d

Euclidean Distance
The Euclidean distance between xi and xj is given as
δ(xi, xj) = ∥xi − xj∥ = √(xi^T xi − 2 xi^T xj + xj^T xj) = √(2(d − s))
Thus, the maximum Euclidean distance between any two points is √2d, which happens when there are no common symbols between them, that is, when s = 0.
Hamming Distance
The Hamming distance between xi and xj is defined as the number of mismatched values:
δH(xi, xj) = d − s = (1/2) δ(xi, xj)²
Hamming distance is thus equivalent to half the squared Euclidean distance.
Cosine Similarity
The cosine of the angle between xi and xj is given as

cos θ = xi^T xj / (∥xi∥ · ∥xj∥) = s / d
Jaccard Coefficient
The Jaccard Coefficient is a commonly used similarity measure between two categori- cal points. It is defined as the ratio of the number of matching values to the number of distinct values that appear in both xi and xj , across the d attributes:
J(xi, xj) = s / (2(d − s) + s) = s / (2d − s)
where we utilize the observation that when the two points do not match for dimension k, they contribute 2 to the distinct symbol count; otherwise, if they match, the number of distinct symbols increases by 1. Over the d − s mismatches and s matches, the number of distinct symbols is 2(d − s) + s.
Example 3.13. Consider the 3-dimensional categorical data from Example 3.11. The symbolic point (Short,Medium,iris-versicolor) is modeled as the vector
x1 = (e12, e22, e31)^T = (0,1,0,0 | 0,1,0 | 1,0,0)^T ∈ R^10

and the symbolic point (VeryShort, Medium, iris-setosa) is modeled as

x2 = (e11, e22, e32)^T = (1,0,0,0 | 0,1,0 | 0,1,0)^T ∈ R^10

The number of matching symbols is given as

s = x1^T x2 = (e12)^T e11 + (e22)^T e22 + (e31)^T e32 = 0 + 1 + 0 = 1

The Euclidean and Hamming distances are given as

δ(x1, x2) = √(2(d − s)) = √(2 · 2) = √4 = 2
δH(x1, x2) = d − s = 3 − 1 = 2

The cosine and Jaccard similarity are given as

cos θ = s/d = 1/3 = 0.333
J(x1, x2) = s/(2d − s) = 1/5 = 0.2
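These measures reduce to a few lines of code once the points are in their one-hot representation; the following Python sketch (ours) reproduces the values of Example 3.13.

import numpy as np

def categorical_measures(x, y):
    """Distance and similarity between two one-hot encoded categorical points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    d = int(x.sum())                 # number of attributes: squared norm of each point
    s = float(x @ y)                 # number of matching symbols
    euclidean = np.sqrt(2 * (d - s))
    hamming = d - s
    cosine = s / d
    jaccard = s / (2 * d - s)
    return euclidean, hamming, cosine, jaccard

x1 = [0,1,0,0, 0,1,0, 1,0,0]         # (Short, Medium, iris-versicolor)
x2 = [1,0,0,0, 0,1,0, 0,1,0]         # (VeryShort, Medium, iris-setosa)
print(categorical_measures(x1, x2))  # (2.0, 2.0, 0.333..., 0.2)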
3.5 DISCRETIZATION
Discretization, also called binning, converts numeric attributes into categorical ones. It is usually applied for data mining methods that cannot handle numeric attributes. It can also help in reducing the number of values for an attribute, especially if there is noise in the numeric measurements; discretization allows one to ignore small and irrelevant differences in the values.
Formally, given a numeric attribute X, and a random sample {xi }ni=1 of size n drawn from X, the discretization task is to divide the value range of X into k consecutive intervals, also called bins, by finding k−1 boundary values v1,v2,…,vk−1 that yield the k intervals:
[xmin,v1], (v1,v2], …, (vk−1,xmax] where the extremes of the range of X are given as
xmin = min_i {xi}        xmax = max_i {xi}
The resulting k intervals or bins, which span the entire range of X, are usually mapped to symbolic values that comprise the domain for the new categorical attribute X.
Equal-Width Intervals
The simplest binning approach is to partition the range of X into k equal-width intervals. The interval width is simply the range of X divided by k:
w = (xmax − xmin) / k
Thus, the ith interval boundary is given as
vi =xmin +iw, fori=1,…,k−1
Equal-Frequency Intervals
In equal-frequency binning we divide the range of X into intervals that contain (approximately) equal number of points; equal frequency may not be possible due to repeated values. The intervals can be computed from the empirical quantile or
inverse cumulative distribution function F̂^{-1}(q) for X [Eq. (2.2)]. Recall that F̂^{-1}(q) = min{x | P(X ≤ x) ≥ q}, for q ∈ [0, 1]. In particular, we require that each interval contain 1/k of the probability mass; therefore, the interval boundaries are given as follows:
v_i = F̂^{-1}(i/k) for i = 1, …, k−1
Example3.14. ConsiderthesepallengthattributeintheIrisdataset.Itsminimum and maximum values are
xmin = 4.3 xmax = 7.9
We discretize it into k = 4 bins using equal-width binning. The width of an interval is
given as
w = (7.9 − 4.3) / 4 = 3.6 / 4 = 0.9
and therefore the interval boundaries are
v1 =4.3+0.9=5.2 v2 =4.3+2·0.9=6.1 v3 =4.3+3·0.9=7.0
The four resulting bins for sepal length are shown in Table 3.1, which also shows the number of points ni in each bin, which are not balanced among the bins.
For equal-frequency discretization, consider the empirical inverse cumulative distribution function (CDF) for sepal length shown in Figure 3.5. With k = 4 bins, the bin boundaries are the quartile values (which are shown as dashed lines):
v1 =Fˆ−1(0.25)=5.1 v2 =Fˆ−1(0.50)=5.8 v3 =Fˆ−1(0.75)=6.4
The resulting intervals are shown in Table 3.8. We can see that although the interval widths vary, they contain a more balanced number of points. We do not get identical counts for all the bins because many values are repeated; for instance, there are nine points with value 5.1 and there are seven points with value 5.8.

Figure 3.5. Empirical inverse CDF: sepal length.

Table 3.8. Equal-frequency discretization: sepal length

Bin        | Width | Count
[4.3, 5.1] |  0.8  | n1 = 41
(5.1, 5.8] |  0.7  | n2 = 39
(5.8, 6.4] |  0.6  | n3 = 35
(6.4, 7.9] |  1.5  | n4 = 35
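Both binning schemes are easy to implement. The following Python sketch (our own helper functions) computes the boundaries of Example 3.14 from the raw sepal length values; the empirical inverse CDF is implemented directly from its definition F̂^{-1}(q) = min{x | P(X ≤ x) ≥ q}.

import numpy as np

def equal_width_boundaries(x, k):
    """Equal-width binning: v_i = xmin + i*w for i = 1, ..., k-1."""
    x = np.asarray(x, dtype=float)
    w = (x.max() - x.min()) / k
    return x.min() + w * np.arange(1, k)

def emp_inv_cdf(x, q):
    """Empirical inverse CDF: min{ x | P(X <= x) >= q }."""
    xs = np.sort(np.asarray(x, dtype=float))
    i = int(np.ceil(q * len(xs))) - 1
    return xs[max(i, 0)]

def equal_frequency_boundaries(x, k):
    """Equal-frequency binning: v_i = F^{-1}(i/k) for i = 1, ..., k-1."""
    return np.array([emp_inv_cdf(x, i / k) for i in range(1, k)])

# For the Iris sepal length values with k = 4, equal_width_boundaries gives
# (5.2, 6.1, 7.0) and equal_frequency_boundaries gives (5.1, 5.8, 6.4).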
3.6 FURTHER READING
For a comprehensive introduction to categorical data analysis see Agresti (2012). Some aspects also appear in Wasserman (2004). For an entropy-based supervised discretization method that takes the class attribute into account see Fayyad and Irani (1993).
Agresti, A. (2012). Categorical Data Analysis. 3rd ed. Hoboken, NJ: John Wiley & Sons.
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval Discretization of Continuous-valued Attributes for Classification Learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence. Morgan-Kaufmann, pp. 1022–1027.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer Science + Business Media.
3.7 EXERCISES
Q1. Show that for categorical points, the cosine similarity between any two vectors lies in the range cos θ ∈ [0, 1], and consequently θ ∈ [0°, 90°].

Q2. Prove that E[(X1 − μ1)(X2 − μ2)^T] = E[X1 X2^T] − E[X1] E[X2]^T.
Table 3.9. Contingency table for Q3

          | X=a | X=b | X=c
Z=f, Y=d  |  5  | 15  | 20
Z=f, Y=e  | 10  |  5  | 10
Z=g, Y=d  | 10  |  5  | 25
Z=g, Y=e  |  5  | 20  | 10
Table 3.10. χ2 critical values for different p-values and degrees of freedom (q). For example, for q = 5 degrees of freedom, the critical value χ2 = 11.070 has p-value = 0.05.

q |  0.995  0.99   0.975  0.95   0.90   0.10    0.05    0.025   0.01    0.005
1 |  —      —      0.001  0.004  0.016  2.706   3.841   5.024   6.635   7.879
2 |  0.010  0.020  0.051  0.103  0.211  4.605   5.991   7.378   9.210   10.597
3 |  0.072  0.115  0.216  0.352  0.584  6.251   7.815   9.348   11.345  12.838
4 |  0.207  0.297  0.484  0.711  1.064  7.779   9.488   11.143  13.277  14.860
5 |  0.412  0.554  0.831  1.145  1.610  9.236   11.070  12.833  15.086  16.750
6 |  0.676  0.872  1.237  1.635  2.204  10.645  12.592  14.449  16.812  18.548

Q3. Consider the 3-way contingency table for attributes X, Y, Z shown in Table 3.9. Compute the χ2 metric for the correlation between Y and Z. Are they dependent or independent at the 95% confidence level? See Table 3.10 for χ2 values.

Q4. Consider the "mixed" data given in Table 3.11. Here X1 is a numeric attribute and X2 is a categorical one. Assume that the domain of X2 is given as dom(X2) = {a, b}. Answer the following questions.
(a) What is the mean vector for this dataset?
(b) What is the covariance matrix?

Q5. In Table 3.11, assume that X1 is discretized into three bins, as follows:
c1 = (−2, −0.5]    c2 = (−0.5, 0.5]    c3 = (0.5, 2]
Answer the following questions:
(a) Construct the contingency table between the discretized X1 and X2 attributes. Include the marginal counts.
(b) Compute the χ2 statistic between them.
(c) Determine whether they are dependent or not at the 5% significance level. Use the χ2 critical values from Table 3.10.

Table 3.11. Dataset for Q4 and Q5

X1    | X2
0.3   | a
−0.3  | b
0.44  | a
−0.60 | a
0.40  | a
1.20  | b
−0.12 | a
−1.60 | b
1.60  | b
−1.32 | a
CHAPTER 4 Graph Data
The traditional paradigm in data analysis typically assumes that each data instance is independent of another. However, often data instances may be connected or linked to other instances via various types of relationships. The instances themselves may be described by various attributes. What emerges is a network or graph of instances (or nodes), connected by links (or edges). Both the nodes and edges in the graph may have several attributes that may be numerical or categorical, or even more complex (e.g., time series data). Increasingly, today’s massive data is in the form of such graphs or networks. Examples include the World Wide Web (with its Web pages and hyperlinks), social networks (wikis, blogs, tweets, and other social media data), semantic networks (ontologies), biological networks (protein interactions, gene regulation networks, metabolic pathways), citation networks for scientific literature, and so on. In this chapter we look at the analysis of the link structure in graphs that arise from these kinds of networks. We will study basic topological properties as well as models that give rise to such graphs.
4.1 GRAPH CONCEPTS
Graphs
Formally, a graph G = (V,E) is a mathematical structure consisting of a finite nonempty set V of vertices or nodes, and a set E ⊆ V × V of edges consisting of unordered pairs of vertices. An edge from a node to itself, (vi , vi ), is called a loop. An undirected graph without loops is called a simple graph. Unless mentioned explicitly, we will consider a graph to be simple. An edge e = (vi , vj ) between vi and vj is said to be incident with nodes vi and vj ; in this case we also say that vi and vj are adjacent to one another, and that they are neighbors. The number of nodes in the graph G, given as |V| = n, is called the order of the graph, and the number of edges in the graph, given as |E| = m, is called the size of G.
A directed graph or digraph has an edge set E consisting of ordered pairs of vertices. A directed edge (vi , vj ) is also called an arc, and is said to be from vi to vj . We also say that vi is the tail and vj the head of the arc.
A weighted graph consists of a graph together with a weight wij for each edge (vi , vj ) ∈ E. Every graph can be considered to be a weighted graph in which the edges have weight one.
Subgraphs
A graph H = (VH, EH) is called a subgraph of G = (V, E) if VH ⊆ V and EH ⊆ E. We also say that G is a supergraph of H. Given a subset of the vertices V′ ⊆ V, the induced subgraph G′ = (V′, E′) consists exactly of all the edges present in G between vertices in V′. More formally, for all vi, vj ∈ V′, (vi, vj) ∈ E′ ⟺ (vi, vj) ∈ E. In other words, two nodes are adjacent in G′ if and only if they are adjacent in G. A (sub)graph is called complete (or a clique) if there exists an edge between all pairs of nodes.
Degree
The degree of a node vi ∈ V is the number of edges incident with it, and is denoted as d(vi) or just di. The degree sequence of a graph is the list of the degrees of the nodes sorted in non-increasing order.
Let Nk denote the number of vertices with degree k. The degree frequency distribution of a graph is given as
(N0,N1,…,Nt)
where t is the maximum degree for a node in G. Let X be a random variable denoting the degree of a node. The degree distribution of a graph gives the probability mass function f for X, given as

(f(0), f(1), …, f(t))

where f(k) = P(X = k) = Nk / n is the probability of a node with degree k, given as the number of nodes Nk with degree k, divided by the total number of nodes n. In graph analysis, we typically make the assumption that the input graph represents a population, and therefore we write f instead of f̂ for the probability distributions.
For directed graphs, the indegree of node vi, denoted as id(vi), is the number of edges with vi as head, that is, the number of incoming edges at vi. The outdegree of vi, denoted od(vi), is the number of edges with vi as the tail, that is, the number of outgoing edges from vi .
Path and Distance
A walk in a graph G between nodes x and y is an ordered sequence of vertices, starting at x and ending at y,
x=v0,v1,…,vt−1,vt =y
such that there is an edge between every pair of consecutive vertices, that is, (vi−1,vi)∈E for all i = 1,2,…,t. The length of the walk, t, is measured in terms of hops – the number of edges along the walk. In a walk, there is no restriction on the number of times a given vertex may appear in the sequence; thus both the vertices and edges may be repeated. A walk starting and ending at the same vertex (i.e., with y = x) is called closed. A trail is a walk with distinct edges, and a path is a walk with distinct vertices (with the exception of the start and end vertices). A closed path with length
t ≥ 3 is called a cycle, that is, a cycle begins and ends at the same vertex and has distinct nodes.

Figure 4.1. (a) A graph (undirected). (b) A directed graph.
A path of minimum length between nodes x and y is called a shortest path, and the length of the shortest path is called the distance between x and y, denoted as d(x,y). If no path exists between the two nodes, the distance is assumed to be d(x,y) = ∞.
Connectedness
Two nodes vi and vj are said to be connected if there exists a path between them. A graph is connected if there is a path between all pairs of vertices. A connected component, or just component, of a graph is a maximal connected subgraph. If a graph has only one component it is connected; otherwise it is disconnected, as by definition there cannot be a path between two different components.
For a directed graph, we say that it is strongly connected if there is a (directed) path between all ordered pairs of vertices. We say that it is weakly connected if there exists a path between node pairs only by considering edges as undirected.
Example 4.1. Figure 4.1a shows a graph with |V| = 8 vertices and |E| = 11 edges. Because (v1,v5) ∈ E, we say that v1 and v5 are adjacent. The degree of v1 is d(v1) = d1 = 4. The degree sequence of the graph is
(4,4,4,3,2,2,2,1)
and therefore its degree frequency distribution is given as
(N0,N1,N2,N3,N4) = (0,1,3,1,3)
We have N0 = 0 because there are no isolated vertices, and N4 = 3 because there are three nodes, v1, v4 and v5, that have degree k = 4; the other numbers are obtained in a similar fashion. The degree distribution is given as
f (0), f (1), f (2), f (3), f (4) = (0, 0.125, 0.375, 0.125, 0.375)
The vertex sequence (v3, v1, v2, v5, v1, v2, v6) is a walk of length 6 between v3 and v6. We can see that vertices v1 and v2 have been visited more than once. In contrast, the vertex sequence (v3, v4, v7, v8, v5, v2, v6) is a path of length 6 between v3 and v6. However, this is not the shortest path between them, which happens to be (v3, v1, v2, v6) with length 3. Thus, the distance between them is given as d(v3, v6) = 3.
Figure 4.1b shows a directed graph with 8 vertices and 12 edges. We can see that edge (v5, v8) is distinct from edge (v8, v5). The indegree of v7 is id(v7) = 2, whereas its outdegree is od(v7) = 0. Thus, there is no (directed) path from v7 to any other vertex.

Adjacency Matrix
A graph G = (V, E), with |V| = n vertices, can be conveniently represented in the form of an n × n, symmetric binary adjacency matrix, A, defined as

A(i, j) = 1 if vi is adjacent to vj, and 0 otherwise

If the graph is directed, then the adjacency matrix A is not symmetric, as (vi, vj) ∈ E obviously does not imply that (vj, vi) ∈ E.
If the graph is weighted, then we obtain an n × n weighted adjacency matrix, A, defined as

A(i, j) = wij if vi is adjacent to vj, and 0 otherwise

where wij is the weight on edge (vi, vj) ∈ E. A weighted adjacency matrix can always be converted into a binary one, if desired, by using some threshold τ on the edge weights:

A(i, j) = 1 if wij ≥ τ, and 0 otherwise        (4.1)
Graphs from Data Matrix
Many datasets that are not in the form of a graph can nevertheless be converted into one. Let D = {xi }ni=1 (with xi ∈ Rd ), be a dataset consisting of n points in a d-dimensional space. We can define a weighted graph G = (V, E), where there exists a node for each point in D, and there exists an edge between each pair of points, with weight
wij =sim(xi,xj)
where sim(xi,xj) denotes the similarity between points xi and xj. For instance, similarity can be defined as being inversely related to the Euclidean distance between the points via the transformation
wij = sim(xi, xj) = exp( −∥xi − xj∥² / (2σ²) )        (4.2)
where σ is the spread parameter (equivalent to the standard deviation in the normal density function). This transformation restricts the similarity function sim() to lie in the range [0, 1]. One can then choose an appropriate threshold τ and convert the weighted adjacency matrix into a binary one via Eq. (4.1).
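A minimal NumPy sketch of this construction follows (the function name and parameters are ours); applied to the Iris data matrix with σ = 1/√2 and τ = 0.777 it yields the graph described in Example 4.2.

import numpy as np

def similarity_graph(X, sigma, tau):
    """Weighted adjacency via the Gaussian kernel of Eq. (4.2), thresholded by Eq. (4.1)."""
    X = np.asarray(X, dtype=float)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    W = np.exp(-sq / (2 * sigma ** 2))                          # similarities in [0, 1]
    np.fill_diagonal(W, 0.0)                                    # no loops: keep the graph simple
    A = (W >= tau).astype(int)                                  # binary adjacency, Eq. (4.1)
    return W, A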
Figure 4.2. Iris similarity graph.
Example 4.2. Figure 4.2 shows the similarity graph for the Iris dataset (see Table 1.1). The pairwise similarity between distinct pairs of points was computed using Eq. (4.2), with σ = 1/√2 (we do not allow loops, to keep the graph simple). The mean similarity between points was 0.197, with a standard deviation of 0.290.
A binary adjacency matrix was obtained via Eq. (4.1) using a threshold of τ = 0.777, which results in an edge between points having similarity higher than two standard deviations from the mean. The resulting Iris graph has 150 nodes and 753 edges.
The nodes in the Iris graph in Figure 4.2 have also been categorized according to their class. The circles correspond to class iris-versicolor, the triangles to iris-virginica, and the squares to iris-setosa. The graph has two big components, one of which is exclusively composed of nodes labeled as iris-setosa.
4.2 TOPOLOGICAL ATTRIBUTES
In this section we study some of the purely topological, that is, edge-based or structural, attributes of graphs. These attributes are local if they apply to only a single node (or an edge), and global if they refer to the entire graph.
Degree
We have already defined the degree of a node vi as the number of its neighbors. A more general definition that holds even when the graph is weighted is as follows:
di = Σ_j A(i, j)

The degree is clearly a local attribute of each node. One of the simplest global attributes is the average degree:

μd = (Σ_i di) / n
The preceding definitions can easily be generalized for (weighted) directed graphs. For example, we can obtain the indegree and outdegree by taking the summation over the incoming and outgoing edges, as follows:
id(vi) = Σ_j A(j, i)        od(vi) = Σ_j A(i, j)
The average indegree and average outdegree can be obtained likewise.
Average Path Length
The average path length, also called the characteristic path length, of a connected graph is given as
μL = ( Σ_i Σ_{j>i} d(vi, vj) ) / C(n, 2) = (2 / (n(n−1))) Σ_i Σ_{j>i} d(vi, vj)

where n is the number of nodes in the graph, and d(vi, vj) is the distance between vi and vj. For a directed graph, the average is over all ordered pairs of vertices:

μL = (1 / (n(n−1))) Σ_i Σ_j d(vi, vj)
For a disconnected graph the average is taken over only the connected pairs of vertices.
Eccentricity
The eccentricity of a node vi is the maximum distance from vi to any other node in the graph:
e(vi) = max_j { d(vi, vj) }
If the graph is disconnected the eccentricity is computed only over pairs of vertices with finite distance, that is, only for vertices connected by a path.
Radius and Diameter
The radius of a connected graph, denoted r(G), is the minimum eccentricity of any node in the graph:
r(G) = min_i { e(vi) } = min_i { max_j { d(vi, vj) } }

The diameter, denoted d(G), is the maximum eccentricity of any vertex in the graph:

d(G) = max_i { e(vi) } = max_{i,j} { d(vi, vj) }
For a disconnected graph, the diameter is the maximum eccentricity over all the connected components of the graph.
The diameter of a graph G is sensitive to outliers. A more robust notion is effective diameter, defined as the minimum number of hops for which a large fraction, typically 90%, of all connected pairs of nodes can reach each other. More formally, let H(k) denote the number of pairs of nodes that can reach each other in k hops or less. The effective diameter is defined as the smallest value of k such that H(k) ≥ 0.9 × H(d(G)).
Example 4.3. For the graph in Figure 4.1a, the eccentricity of node v4 is e(v4) = 3 because the node farthest from it is v6 and d(v4,v6) = 3. The radius of the graph is r(G) = 2; both v1 and v5 have the least eccentricity value of 2. The diameter of the graph is d(G) = 4, as the largest distance over all the pairs is d(v6,v7) = 4.
The diameter of the Iris graph is d(G) = 11, which corresponds to the bold path
connecting the gray nodes in Figure 4.2. The degree distribution for the Iris graph
is shown in Figure 4.3. The numbers at the top of each bar indicate the frequency.
For example, there are exactly 13 nodes with degree 7, which corresponds to the
probability f(7) = 13/150 = 0.0867.
The path length histogram for the Iris graph is shown in Figure 4.4. For instance, 1044 node pairs have a distance of 2 hops between them. With n = 150 nodes, there
are C(n, 2) = 11,175 pairs. Out of these, 6502 pairs are unconnected, and there are a total of 4673 reachable pairs. Out of these, a fraction 4175/4673 = 0.89 are reachable in 6 hops, and a fraction 4415/4673 = 0.94 are reachable in 7 hops. Thus, we can determine that the effective diameter is 7. The average path length is 3.58.

Figure 4.3. Iris graph: degree distribution.

Figure 4.4. Iris graph: path length histogram.

Clustering Coefficient
The clustering coefficient of a node vi is a measure of the density of edges in the neighborhood of vi. Let Gi = (Vi, Ei) be the subgraph induced by the neighbors of vertex vi. Note that vi ∉ Vi, as we assume that G is simple. Let |Vi| = ni be the number of neighbors of vi, and |Ei| = mi be the number of edges among the neighbors of vi. The clustering coefficient of vi is defined as

C(vi) = (no. of edges in Gi) / (maximum number of edges in Gi) = mi / C(ni, 2) = 2·mi / (ni(ni − 1))
The clustering coefficient gives an indication about the “cliquishness” of a node’s neighborhood, because the denominator corresponds to the case when Gi is a complete subgraph.
The clustering coefficient of a graph G is simply the average clustering coefficient over all the nodes, given as
C(G) = (1/n) Σ_i C(vi)
Because C(vi) is well defined only for nodes with degree d(vi) ≥ 2, we can define C(vi ) = 0 for nodes with degree less than 2. Alternatively, we can take the summation only over nodes with d (vi ) ≥ 2.
The clustering coefficient C(vi) of a node is closely related to the notion of transitive relationships in a graph or network. That is, if there exists an edge between vi and vj , and another between vi and vk , then how likely are vj and vk to be linked or connected to each other. Define the subgraph composed of the edges (vi , vj ) and (vi , vk ) to be a connected triple centered at vi . A connected triple centered at vi that includes (vj , vk ) is called a triangle (a complete subgraph of size 3). The clustering coefficient of node vi can be expressed as
C(vi) = (no. of triangles including vi) / (no. of connected triples centered at vi)

Note that the number of connected triples centered at vi is simply C(di, 2) = ni(ni − 1)/2, where di = ni is the number of neighbors of vi.
Generalizing the aforementioned notion to the entire graph yields the transitivity
of the graph, defined as
T(G) = (3 × no. of triangles in G) / (no. of connected triples in G)
The factor 3 in the numerator is due to the fact that each triangle contributes to three connected triples centered at each of its three vertices. Informally, transitivity measures the degree to which a friend of your friend is also your friend, say, in a social network.
Efficiency
The efficiency for a pair of nodes vi and vj is defined as 1/d(vi, vj). If vi and vj are not connected, then d(vi, vj) = ∞ and the efficiency is 1/∞ = 0. As such, the smaller the distance between the nodes, the more “efficient” the communication between them. The efficiency of a graph G is the average efficiency over all pairs of nodes, whether connected or not, given as

(2 / (n(n−1))) Σ_i Σ_{j>i} 1/d(vi, vj)
The maximum efficiency value is 1, which holds for a complete graph.
The local efficiency for a node vi is defined as the efficiency of the subgraph Gi induced by the neighbors of vi . Because vi ̸∈ Gi , the local efficiency is an indication of the local fault tolerance, that is, how efficient is the communication between neighbors
of vi when vi is removed or deleted from the graph.
Example 4.4. For the graph in Figure 4.1a, consider node v4. Its neighborhood graph is shown in Figure 4.5. The clustering coefficient of node v4 is given as
C(v4) = 2 / C(4, 2) = 2/6 = 0.33
The clustering coefficient for the entire graph (over all nodes) is given as
C(G) = (1/8) (1/2 + 1/3 + 1 + 1/3 + 1/3 + 0 + 0 + 0) = 2.5/8 = 0.3125
Figure 4.5. Subgraph G4 induced by node v4.
The local efficiency of v4 is given as

(2/(4 · 3)) (1/d(v1, v3) + 1/d(v1, v5) + 1/d(v1, v7) + 1/d(v3, v5) + 1/d(v3, v7) + 1/d(v5, v7))
= (1/6)(1 + 1 + 0 + 0.5 + 0 + 0) = 2.5/6 = 0.417
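The quantities of Examples 4.1, 4.3, and 4.4 can be checked with the NetworkX library (not used in the text); the edge list below is read off the graph of Figure 4.1a as described in the examples.

import networkx as nx

# Edge list of the undirected graph in Figure 4.1a
edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 5), (2, 6),
         (3, 4), (4, 5), (4, 7), (5, 8), (7, 8)]
G = nx.Graph(edges)

print(dict(G.degree()))                                      # d(v1) = d(v4) = d(v5) = 4, ...
print(nx.eccentricity(G)[4], nx.radius(G), nx.diameter(G))   # e(v4) = 3, r(G) = 2, d(G) = 4
print(nx.average_shortest_path_length(G))                    # average path length
print(nx.clustering(G, 4), nx.average_clustering(G))         # C(v4) = 1/3, C(G) = 0.3125
print(nx.transitivity(G))                                    # T(G)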
4.3 CENTRALITY ANALYSIS
The notion of centrality is used to rank the vertices of a graph in terms of how “central” or important they are. A centrality can be formally defined as a function c : V → R, that induces a total order on V. We say that vi is at least as central as vj if c(vi ) ≥ c(vj ).
4.3.1 Basic Centralities
Degree Centrality
The simplest notion of centrality is the degree di of a vertex vi – the higher the degree, the more important or central the vertex. For directed graphs, one may further consider the indegree centrality and outdegree centrality of a vertex.
Eccentricity Centrality
According to this notion, the less eccentric a node is, the more central it is. Eccentricity centrality is thus defined as follows:
c(vi) = 1 / e(vi) = 1 / max_j { d(vi, vj) }
A node vi that has the least eccentricity, that is, for which the eccentricity equals the graph radius, e(vi) = r(G), is called a center node, whereas a node that has the highest eccentricity, that is, for which eccentricity equals the graph diameter, e(vi) = d(G), is called a periphery node.
Eccentricity centrality is related to the problem of facility location, that is, choosing the optimum location for a resource or facility. The central node minimizes the maximum distance to any node in the network, and thus the most central node would be an ideal location for, say, a hospital, because it is desirable to minimize the maximum distance someone has to travel to get to the hospital quickly.
Closeness Centrality
Whereas eccentricity centrality uses the maximum of the distances from a given node, closeness centrality uses the sum of all the distances to rank how central a node is
c(vi) = 1 / Σ_j d(vi, vj)
A node vi with the smallest total distance, Σ_j d(vi, vj), is called the median node. Closeness centrality optimizes a different objective function for the facility location problem. It tries to minimize the total distance over all the other nodes, and thus a median node, which has the highest closeness centrality, is the optimal one to, say, locate a facility such as a new coffee shop or a mall, as in this case it is not as
important to minimize the distance for the farthest node.
Betweenness Centrality
For a given vertex vi the betweenness centrality measures how many shortest paths between all pairs of vertices include vi. This gives an indication as to the central “monitoring” role played by vi for various pairs of nodes. Let ηjk denote the number of shortest paths between vertices vj and vk, and let ηjk(vi) denote the number of such paths that include or contain vi. Then the fraction of paths through vi is denoted as

γjk(vi) = ηjk(vi) / ηjk

If the two vertices vj and vk are not connected, we assume γjk(vi) = 0. The betweenness centrality for a node vi is defined as

c(vi) = Σ_{j≠i} Σ_{k≠i, k>j} γjk(vi) = Σ_{j≠i} Σ_{k≠i, k>j} ηjk(vi) / ηjk        (4.3)
Example 4.5. Consider Figure 4.1a. The values for the different node centrality measures are given in Table 4.1. According to degree centrality, nodes v1, v4, and v5 are the most central. The eccentricity centrality is the highest for the center nodes in the graph, which are v1 and v5. It is the least for the periphery nodes, of which there are two, v6 and v7.
Nodes v1 and v5 have the highest closeness centrality value. In terms of betweenness, vertex v5 is the most central, with a value of 6.5. We can compute this value by considering only those pairs of nodes vj and vk that have at least one shortest
path passing through v5, as only these node pairs have γjk(v5) > 0 in Eq. (4.3). We have

c(v5) = γ18(v5) + γ24(v5) + γ27(v5) + γ28(v5) + γ38(v5) + γ46(v5) + γ48(v5) + γ67(v5) + γ68(v5)
      = 1 + 1/2 + 2/3 + 1 + 2/3 + 1/2 + 1/2 + 2/3 + 1 = 6.5

Table 4.1. Centrality values

Centrality          | v1    | v2    | v3    | v4    | v5    | v6    | v7    | v8
Degree              | 4     | 3     | 2     | 4     | 4     | 1     | 2     | 2
Eccentricity c(vi)  | 0.5   | 0.33  | 0.33  | 0.33  | 0.5   | 0.25  | 0.25  | 0.33
  e(vi)             | 2     | 3     | 3     | 3     | 2     | 4     | 4     | 3
Closeness c(vi)     | 0.100 | 0.083 | 0.071 | 0.091 | 0.100 | 0.056 | 0.067 | 0.071
  Σj d(vi, vj)      | 10    | 12    | 14    | 11    | 10    | 18    | 15    | 14
Betweenness c(vi)   | 4.5   | 6     | 0     | 5     | 6.5   | 0     | 0.83  | 1.17
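The centrality values of Table 4.1 can likewise be verified with NetworkX; the sketch below computes eccentricity, closeness (as defined here, the reciprocal of the total distance), and unnormalized betweenness for the graph of Figure 4.1a.

import networkx as nx

edges = [(1, 2), (1, 3), (1, 4), (1, 5), (2, 5), (2, 6),
         (3, 4), (4, 5), (4, 7), (5, 8), (7, 8)]
G = nx.Graph(edges)

ecc = nx.eccentricity(G)
ecc_centrality = {v: 1.0 / e for v, e in ecc.items()}                      # 1 / e(v_i)
closeness = {v: 1.0 / sum(nx.shortest_path_length(G, v).values())         # 1 / sum_j d(v_i, v_j)
             for v in G}
betweenness = nx.betweenness_centrality(G, normalized=False)              # sum of gamma_jk(v_i)

print(closeness[5], betweenness[5])   # 0.1 and 6.5 for v5, as in Table 4.1 and Example 4.5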
4.3.2 Web Centralities
We now consider directed graphs, especially in the context of the Web. For example, hypertext documents have directed links pointing from one document to another; citation networks of scientific articles have directed edges from a paper to the cited papers, and so on. We consider notions of centrality that are particularly suited to such Web-scale graphs.
Prestige
We first look at the notion of prestige, or the eigenvector centrality, of a node in a directed graph. As a centrality, prestige is supposed to be a measure of the importance or rank of a node. Intuitively the more the links that point to a given node, the higher its prestige. However, prestige does not depend simply on the indegree; it also (recursively) depends on the prestige of the nodes that point to it.
Let G = (V, E) be a directed graph, with |V| = n. The adjacency matrix of G is an n × n asymmetric matrix A given as

A(u, v) = 1 if (u, v) ∈ E, and 0 if (u, v) ∉ E
Let p(u) be a positive real number, called the prestige score for node u. Using the intuition that the prestige of a node depends on the prestige of other nodes pointing to it, we can obtain the prestige score of a given node v as follows:
p(v) = Σ_u A(u, v) · p(u) = Σ_u A^T(v, u) · p(u)
Figure 4.6. Example graph (a), adjacency matrix (b), and its transpose (c):

A = ( 0 0 0 1 0 )        A^T = ( 0 0 1 0 0 )
    ( 0 0 1 0 1 )              ( 0 0 0 1 1 )
    ( 1 0 0 0 0 )              ( 0 1 0 1 0 )
    ( 0 1 1 0 1 )              ( 1 0 0 0 0 )
    ( 0 1 0 0 0 )              ( 0 1 0 1 0 )
For example, in Figure 4.6, the prestige of v5 depends on the prestige of v2 and v4. Across all the nodes, we can recursively express the prestige scores as
p′ =ATp (4.4)
where p is an n-dimensional column vector corresponding to the prestige scores for each vertex.
Starting from an initial prestige vector we can use Eq. (4.4) to obtain an updated prestige vector in an iterative manner. In other words, if pk−1 is the prestige vector across all the nodes at iteration k − 1, then the updated prestige vector at iteration k is given as
pk = A^T p_{k−1} = A^T (A^T p_{k−2}) = (A^T)² p_{k−2} = (A^T)² (A^T p_{k−3}) = (A^T)³ p_{k−3} = ··· = (A^T)^k p0
where p0 is the initial prestige vector. It is well known that the vector pk converges to the dominant eigenvector of AT with increasing k.
The dominant eigenvector of AT and the corresponding eigenvalue can be
computed using the power iteration approach whose pseudo-code is shown in
Algorithm 4.1. The method starts with the vector p0, which can be initialized to the
vector (1,1,…,1)T ∈ Rn. In each iteration, we multiply on the left by AT, and scale
the intermediate pk vector by dividing it by the maximum entry pk[i] in pk to prevent
numeric overflow. The ratio of the maximum entry in iteration k to that in k − 1, given
as λ = pk[i] / p_{k−1}[i], yields an estimate for the eigenvalue. The iterations continue until the difference between successive eigenvector estimates falls below some threshold ε > 0.
ALGORITHM 4.1. Power Iteration Method: Dominant Eigenvector

POWERITERATION (A, ε):
    k ← 0 // iteration counter
    p0 ← 1 ∈ R^n // initial vector
    repeat
        k ← k + 1
        pk ← A^T p_{k−1} // eigenvector estimate
        i ← arg max_j { pk[j] } // index of the maximum entry
        λ ← pk[i] / p_{k−1}[i] // eigenvalue estimate
        pk ← (1/pk[i]) · pk // scale vector
    until ∥pk − p_{k−1}∥ ≤ ε
    p ← (1/∥pk∥) · pk // normalize eigenvector
    return p, λ
Table 4.2. Power method via scaling

k | pk = A^T p_{k−1}               | scaled pk                | λ
1 | (1, 2, 2, 1, 2)^T              | (0.5, 1, 1, 0.5, 1)^T    | 2
2 | (1, 1.5, 1.5, 0.5, 1.5)^T      | (0.67, 1, 1, 0.33, 1)^T  | 1.5
3 | (1, 1.33, 1.33, 0.67, 1.33)^T  | (0.75, 1, 1, 0.5, 1)^T   | 1.33
4 | (1, 1.5, 1.5, 0.75, 1.5)^T     | (0.67, 1, 1, 0.5, 1)^T   | 1.5
5 | (1, 1.5, 1.5, 0.67, 1.5)^T     | (0.67, 1, 1, 0.44, 1)^T  | 1.5
6 | (1, 1.44, 1.44, 0.67, 1.44)^T  | (0.69, 1, 1, 0.46, 1)^T  | 1.444
7 | (1, 1.46, 1.46, 0.69, 1.46)^T  | (0.68, 1, 1, 0.47, 1)^T  | 1.462
Example 4.6. Consider the example shown in Figure 4.6. Starting with an initial prestige vector p0 = (1, 1, 1, 1, 1)^T, in Table 4.2 we show several iterations of the power method for computing the dominant eigenvector of A^T. In each iteration we obtain pk = A^T p_{k−1}. For example,

p1 = A^T p0 = ( 0 0 1 0 0 ) ( 1 )   ( 1 )
              ( 0 0 0 1 1 ) ( 1 )   ( 2 )
              ( 0 1 0 1 0 ) ( 1 ) = ( 2 )
              ( 1 0 0 0 0 ) ( 1 )   ( 1 )
              ( 0 1 0 1 0 ) ( 1 )   ( 2 )
Figure 4.7. Convergence of the ratio to the dominant eigenvalue λ = 1.466.
Before the next iteration, we scale p1 by dividing each entry by the maximum value in the vector, which is 2 in this case, to obtain
p1 = (1/2)(1, 2, 2, 1, 2)^T = (0.5, 1, 1, 0.5, 1)^T

As k becomes large, we get

pk = A^T p_{k−1} ≃ λ p_{k−1}

which implies that the ratio of the maximum element of pk to that of p_{k−1} should approach λ. The table shows this ratio for successive iterations. We can see in Figure 4.7 that within 10 iterations the ratio converges to λ = 1.466. The scaled dominant eigenvector converges to

pk = (1, 1.466, 1.466, 0.682, 1.466)^T

After normalizing it to be a unit vector, the dominant eigenvector is given as

p = (0.356, 0.521, 0.521, 0.243, 0.521)^T
Thus, in terms of prestige, v2, v3, and v5 have the highest values, as all of them have indegree 2 and are pointed to by nodes with the same incoming values of prestige. On the other hand, although v1 and v4 have the same indegree, v1 is ranked higher, because v3 contributes its prestige to v1, but v4 gets its prestige only from v1.
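A minimal NumPy implementation of Algorithm 4.1 follows (our own code, not from the text); run on the adjacency matrix of Figure 4.6 it reproduces the eigenvalue λ ≈ 1.466 and the unit eigenvector of Example 4.6.

import numpy as np

def power_iteration(A, eps=1e-6):
    """Power iteration for the dominant eigenvector of A^T (sketch of Algorithm 4.1)."""
    p = np.ones(A.shape[0])
    while True:
        p_new = A.T @ p                      # eigenvector estimate
        i = int(np.argmax(p_new))            # index of the maximum entry
        lam = p_new[i] / p[i]                # eigenvalue estimate
        p_new = p_new / p_new[i]             # scale to prevent numeric overflow
        if np.linalg.norm(p_new - p) <= eps:
            return p_new / np.linalg.norm(p_new), lam
        p = p_new

# Adjacency matrix of the directed graph in Figure 4.6
A = np.array([[0, 0, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 1, 1, 0, 1],
              [0, 1, 0, 0, 0]], dtype=float)
p, lam = power_iteration(A)
print(np.round(p, 3), round(lam, 3))   # ≈ (0.356, 0.521, 0.521, 0.243, 0.521), λ ≈ 1.466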
PageRank
PageRank is a method for computing the prestige or centrality of nodes in the context of Web search. The Web graph consists of pages (the nodes) connected by hyperlinks (the edges). The method uses the so-called random surfing assumption that a person surfing the Web randomly chooses one of the outgoing links from the current page, or with some very small probability randomly jumps to any of the other pages in the Web graph. The PageRank of a Web page is defined to be the probability of a random web surfer landing at that page. Like prestige, the PageRank of a node v recursively depends on the PageRank of other nodes that point to it.
Normalized Prestige
We assume for the moment that each node u has outdegree at least 1. We discuss later how to handle the case when a node has no outgoing edges. Let od(u) = Σ_v A(u, v) denote the outdegree of node u. Because a random surfer can choose among any of its outgoing links, if there is a link from u to v, then the probability of visiting v from u is 1/od(u).
Starting from an initial probability or PageRank p0(u) for each node, such that
Σ_u p0(u) = 1
we can compute an updated PageRank vector for v as follows:

p(v) = Σ_u A(u, v) · p(u) / od(u) = Σ_u N(u, v) · p(u) = Σ_u N^T(v, u) · p(u)        (4.5)

where N is the normalized adjacency matrix of the graph, given as

N(u, v) = 1/od(u) if (u, v) ∈ E, and 0 if (u, v) ∉ E

Across all nodes, we can express the PageRank vector as follows:

p′ = N^T p        (4.6)

So far, the PageRank vector is essentially a normalized prestige vector.
Random Jumps In the random surfing approach, there is a small probability of jumping from one node to any of the other nodes in the graph, even if they do not have a link between them. In essence, one can think of the Web graph as a (virtual) fully connected directed graph, with an adjacency matrix given as
Ar = 1_{n×n} = ( 1 1 ··· 1 )
               ( 1 1 ··· 1 )
               ( ⋮ ⋮  ⋱  ⋮ )
               ( 1 1 ··· 1 )
Here 1n×n is the n × n matrix of all ones. For the random surfer matrix, the outdegree
of each node is od(u) = n, and the probability of jumping from u to any node v is
simply 1/od(u) = 1/n. Thus, if one allows only random jumps from one node to another, the
PageRank can be computed analogously to Eq. (4.5):
p(v) = Σ_u Ar(u, v) · p(u) / od(u) = Σ_u Nr(u, v) · p(u) = Σ_u Nr^T(v, u) · p(u)

where Nr is the normalized adjacency matrix of the fully connected Web graph, given as

Nr = (1/n) Ar = (1/n) 1_{n×n}, with every entry equal to 1/n
Across all the nodes the random jump PageRank vector can be represented as
p′ = Nr^T p
PageRank The full PageRank is computed by assuming that with some small probability, α, a random Web surfer jumps from the current node u to any other random node v, and with probability 1 − α the user follows an existing link from u to v. In other words, we combine the normalized prestige vector, and the random jump vector, to obtain the final PageRank vector, as follows:
p′ = (1 − α) N^T p + α Nr^T p = ( (1 − α) N^T + α Nr^T ) p = M^T p        (4.7)
where M = (1 − α)N + αNr is the combined normalized adjacency matrix. The PageRank vector can be computed in an iterative manner, starting with an initial PageRank assignment p0, and updating it in each iteration using Eq. (4.7). One minor problem arises if a node u does not have any outgoing edges, that is, when od(u) = 0. Such a node acts like a sink for the normalized prestige score. Because there is no outgoing edge from u, the only choice u has is to simply jump to another random node. Thus, we need to make sure that if od(u) = 0 then for the row corresponding to u in M, denoted as Mu, we set α = 1, that is,
Mu = Mu if od(u) > 0, and Mu = (1/n) 1n^T if od(u) = 0

where 1n is the n-dimensional vector of all ones. We can use the power iteration method in Algorithm 4.1 to compute the dominant eigenvector of M^T.
Example 4.7. Consider the graph in Figure 4.6. The normalized adjacency matrix is given as
N = ( 0    0     0     1    0    )
    ( 0    0     0.5   0    0.5  )
    ( 1    0     0     0    0    )
    ( 0    0.33  0.33  0    0.33 )
    ( 0    1     0     0    0    )

Because there are n = 5 nodes in the graph, the normalized random jump adjacency matrix Nr has all entries equal to 1/n = 0.2. Assuming that α = 0.1, the combined normalized adjacency matrix is given as

M = 0.9 N + 0.1 Nr = ( 0.02  0.02  0.02  0.92  0.02 )
                     ( 0.02  0.02  0.47  0.02  0.47 )
                     ( 0.92  0.02  0.02  0.02  0.02 )
                     ( 0.02  0.32  0.32  0.02  0.32 )
                     ( 0.02  0.92  0.02  0.02  0.02 )

Computing the dominant eigenvector and eigenvalue of M^T we obtain λ = 1 and

p = (0.419, 0.546, 0.417, 0.422, 0.417)^T

Node v2 has the highest PageRank value.
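The same computation can be scripted directly. The following NumPy sketch (ours) builds M from Eq. (4.7), handles sink nodes as described above, and iterates to the dominant eigenvector of M^T; for the graph of Figure 4.6 with α = 0.1 it recovers the PageRank vector of Example 4.7.

import numpy as np

def pagerank(A, alpha=0.1, eps=1e-12):
    """PageRank sketch: build M of Eq. (4.7), then power-iterate on M^T."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1)
    N = np.divide(A, out[:, None], out=np.zeros_like(A), where=out[:, None] > 0)
    M = (1 - alpha) * N + alpha / n          # random jumps: Nr has every entry 1/n
    M[out == 0] = 1.0 / n                    # a sink node jumps uniformly (alpha = 1 for that row)
    p = np.ones(n) / n
    while True:
        p_new = M.T @ p
        p_new = p_new / p_new.sum()          # keep p a probability vector
        if np.linalg.norm(p_new - p) <= eps:
            return p_new / np.linalg.norm(p_new)   # report unit length, as in Example 4.7
        p = p_new

A = np.array([[0, 0, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 1, 1, 0, 1],
              [0, 1, 0, 0, 0]], dtype=float)
print(np.round(pagerank(A), 3))   # ≈ (0.419, 0.546, 0.417, 0.422, 0.417); v2 ranks highest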
Hub and Authority Scores
Note that the PageRank of a node is independent of any query that a user may pose, as it is a global value for a Web page. However, for a specific user query, a page with a high global PageRank may not be that relevant. One would like to have a query-specific notion of the PageRank or prestige of a page. The Hyperlink Induced Topic Search (HITS) method is designed to do this. In fact, it computes two values to judge the importance of a page. The authority score of a page is analogous to PageRank or prestige, and it depends on how many “good” pages point to it. On the other hand, the hub score of a page is based on how many “good” pages it points to. In other words, a page with high authority has many hub pages pointing to it, and a page with high hub score points to many pages that have high authority.
Given a user query the HITS method first uses standard search engines to retrieve the set of relevant pages. It then expands this set to include any pages that point to some page in the set, or any pages that are pointed to by some page in the set. Any pages originating from the same host are eliminated. HITS is applied only on this expanded query specific graph G.
We denote by a(u) the authority score and by h(u) the hub score of node u. The authority score depends on the hub score and vice versa in the following manner:
a(v) = Σ_u A^T(v, u) · h(u)        h(v) = Σ_u A(v, u) · a(u)
In matrix notation, we obtain

a′ = A^T h        h′ = A a
In fact, we can rewrite the above recursively as follows:
ak = AThk−1 = AT(Aak−1) = (ATA)ak−1
hk = Aak−1 = A(AThk−1) = (AAT)hk−1
In other words, as k → ∞, the authority score converges to the dominant eigenvector of ATA, whereas the hub score converges to the dominant eigenvector of AAT. The power iteration method can be used to compute the eigenvector in both cases. Starting with an initial authority vector a = 1n, the vector of all ones, we can compute the vector h = Aa. To prevent numeric overflows, we scale the vector by dividing by the maximum element. Next, we can compute a = ATh, and scale it too, which completes one iteration. This process is repeated until both a and h converge.
Example 4.8. For the graph in Figure 4.6, we can iteratively compute the authority and hub score vectors, by starting with a = (1,1,1,1,1)T. In the first iteration,
we have

h = A a = A (1, 1, 1, 1, 1)^T = (1, 2, 1, 3, 1)^T

After scaling by dividing by the maximum value 3, we get

h′ = (0.33, 0.67, 0.33, 1, 0.33)^T

Next we update a as follows:

a = A^T h′ = (0.33, 1.33, 1.67, 0.33, 1.67)^T

After scaling by dividing by the maximum value 1.67, we get

a′ = (0.2, 0.8, 1, 0.2, 1)^T

This sets the stage for the next iteration. The process continues until a and h converge to the dominant eigenvectors of A^T A and A A^T, respectively, given as

a = (0, 0.46, 0.63, 0, 0.63)^T        h = (0, 0.58, 0, 0.79, 0.21)^T
From these scores, we conclude that v4 has the highest hub score because it points to three nodes – v2, v3, and v5 – with good authority. On the other hand, both v3 and v5 have high authority scores, as the two nodes v4 and v2 with the highest hub scores point to them.
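A short NumPy sketch of the HITS iteration (ours, not from the text) reproduces the converged scores of Example 4.8.

import numpy as np

def hits(A, iters=100):
    """HITS hub and authority scores by alternating a = A^T h and h = A a."""
    A = np.asarray(A, dtype=float)
    a = np.ones(A.shape[0])
    for _ in range(iters):
        h = A @ a
        h = h / h.max()            # scale to prevent numeric overflow
        a = A.T @ h
        a = a / a.max()
    # report unit-length vectors, as in Example 4.8
    return a / np.linalg.norm(a), h / np.linalg.norm(h)

A = np.array([[0, 0, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 1, 1, 0, 1],
              [0, 1, 0, 0, 0]], dtype=float)
a, h = hits(A)
print(np.round(a, 2))   # ≈ (0, 0.46, 0.63, 0, 0.63)
print(np.round(h, 2))   # ≈ (0, 0.58, 0, 0.79, 0.21)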
4.4 GRAPH MODELS
Surprisingly, many real-world networks exhibit certain common characteristics, even though the underlying data can come from vastly different domains, such as social networks, biological networks, telecommunication networks, and so on. A natural question is to understand the underlying processes that might give rise to such real-world networks. We consider several network measures that will allow us to compare and contrast different graph models. Real-world networks are usually large and sparse. By large we mean that the order or the number of nodes n is very large, and by sparse we mean that the graph size or number of edges m = O(n). The models we study below make a similar assumption that the graphs are large and sparse.
Small-world Property
It has been observed that many real-world graphs exhibit the so-called small-world property that there is a short path between any pair of nodes. We say that a graph G exhibits small-world behavior if the average path length μL scales logarithmically with
the number of nodes in the graph, that is, if
μL ∝logn
where n is the number of nodes in the graph. A graph is said to have ultra-small-world property if the average path length is much smaller than log n, that is, if μL ≪ log n.
Scale-free Property
In many real-world graphs it has been observed that the empirical degree distribution f (k) exhibits a scale-free behavior captured by a power-law relationship with k, that is, the probability that a node has degree k satisfies the condition
f (k) ∝ k−γ (4.8)
Intuitively, a power law indicates that the vast majority of nodes have very small degrees, whereas there are a few “hub” nodes that have high degrees, that is, they connect to or interact with lots of nodes. A power-law relationship leads to a scale-free or scale invariant behavior because scaling the argument by some constant c does not change the proportionality. To see this, let us rewrite Eq. (4.8) as an equality by introducing a proportionality constant α that does not depend on k, that is,
f(k) = α k^{−γ}        (4.9)

Then we have

f(ck) = α (ck)^{−γ} = (α c^{−γ}) k^{−γ} ∝ k^{−γ}
Also, taking the logarithm on both sides of Eq. (4.9) gives
log f (k) = log(αk−γ )
or logf(k)=−γlogk+logα
which is the equation of a straight line in the log-log plot of k versus f (k), with −γ giving the slope of the line. Thus, the usual approach to check whether a graph has scale-free behavior is to perform a least-square fit of the points logk,logf(k) to a line, as illustrated in Figure 4.8a.
In practice, one of the problems with estimating the degree distribution for a graph is the high level of noise for the higher degrees, where frequency counts are the lowest. One approach to address the problem is to use the cumulative degree distribution F (k), which tends to smooth out the noise. In particular, we use F c (k) = 1 − F (k), which gives the probability that a randomly chosen node has degree greater than k. If f (k) ∝ k−γ , and assuming that γ > 1, we have
F^c(k) = 1 − F(k) = 1 − Σ_{x=0}^{k} f(x) = Σ_{x=k}^{∞} f(x) ≃ ∫_k^∞ x^{−γ} dx = [ x^{−γ+1} / (−γ+1) ]_k^∞ = (1/(γ−1)) · k^{−(γ−1)} ∝ k^{−(γ−1)}
Figure 4.8. Degree distribution and its cumulative distribution (log-log scale): (a) degree distribution, with fitted slope −γ = −2.15; (b) cumulative degree distribution, with fitted slope −(γ − 1) = −1.85.
In other words, the log-log plot of Fc(k) versus k will also be a power law with slope −(γ − 1) as opposed to −γ . Owing to the smoothing effect, plotting log k versus logFc(k) and observing the slope gives a better estimate of the power law, as illustrated in Figure 4.8b.
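The slope estimates can be obtained with a simple least-squares fit on the log-log points, for example with NumPy's polyfit; the sketch below (our own helper) estimates γ from both the degree distribution and its cumulative version for a NetworkX graph.

import numpy as np
import networkx as nx

def powerlaw_exponent(G):
    """Estimate gamma from least-squares fits of log f(k) and log Fc(k) versus log k."""
    degrees = np.array([d for _, d in G.degree()])
    ks, counts = np.unique(degrees, return_counts=True)
    f = counts / counts.sum()                       # empirical degree distribution f(k)
    Fc = 1.0 - np.cumsum(f)                         # F^c(k) = 1 - F(k)
    mask = ks > 0
    slope_f, _ = np.polyfit(np.log2(ks[mask]), np.log2(f[mask]), 1)
    mask_c = mask & (Fc > 0)
    slope_c, _ = np.polyfit(np.log2(ks[mask_c]), np.log2(Fc[mask_c]), 1)
    return -slope_f, 1.0 - slope_c                  # gamma from f(k) and from Fc(k)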
Clustering Effect
Real-world graphs often also exhibit a clustering effect, that is, two nodes are more likely to be connected if they share a common neighbor. The clustering effect is captured by a high clustering coefficient for the graph G. Let C(k) denote the average clustering coefficient for all nodes with degree k; then the clustering effect also
manifests itself as a power-law relationship between C(k) and k: C(k) ∝ k−γ
In other words, a log-log plot of k versus C(k) exhibits a straight line behavior with negative slope −γ . Intuitively, the power-law behavior indicates hierarchical clustering of the nodes. That is, nodes that are sparsely connected (i.e., have smaller degrees) are part of highly clustered areas (i.e., have higher average clustering coefficients). Further, only a few hub nodes (with high degrees) connect these clustered areas (the hub nodes have smaller clustering coefficients).
Example 4.9. Figure 4.8a plots the degree distribution for a graph of human protein interactions, where each node is a protein and each edge indicates if the two incident proteins interact experimentally. The graph has n = 9521 nodes and m = 37,060 edges. A linear relationship between logk and logf(k) is clearly visible, although very small and very large degree values do not fit the linear trend. The best fit line after ignoring the extremal degrees yields a value of γ = 2.15. The plot of logk versus logFc(k) makes the linear fit quite prominent. The slope obtained here is −(γ − 1) = 1.85, that is, γ = 2.85. We can conclude that the graph exhibits scale-free behavior (except at the degree extremes), with γ somewhere between 2 and 3, as is typical of many real-world graphs.
The diameter of the graph is d(G) = 14, which is very close to log2 n = log2 (9521) = 13.22. The network is thus small-world.
Figure 4.9 plots the average clustering coefficient as a function of degree. The log-log plot has a very weak linear trend, as observed from the line of best fit that gives a slope of −γ = −0.55. We can conclude that the graph exhibits weak hierarchical clustering behavior.
Figure 4.9. Average clustering coefficient distribution (log-log scale); the fitted slope is −γ = −0.55.
4.4.1 Erdős–Rényi Random Graph Model
The Erdős–Rényi (ER) model generates a random graph such that any of the possible graphs with a fixed number of nodes and edges has equal probability of being chosen.
The ER model has two parameters: the number of nodes n and the number of edges m. Let M denote the maximum number of edges possible among the n nodes,
that is,

M = C(n, 2) = n(n − 1)/2

The ER model specifies a collection of graphs G(n, m) with n nodes and m edges, such that each graph G ∈ G has equal probability of being selected:

P(G) = 1 / C(M, m) = C(M, m)^{−1}

where C(M, m) is the number of possible graphs with m edges (with n nodes), corresponding to the ways of choosing the m edges out of a total of M possible edges.

Let V = {v1, v2, …, vn} denote the set of n nodes. The ER method chooses a random graph G = (V, E) ∈ G via a generative process. At each step, it randomly selects two distinct vertices vi, vj ∈ V, and adds an edge (vi, vj) to E, provided the edge is not already in the graph G. The process is repeated until exactly m edges have been added to the graph.

Average Degree
Let X be a random variable denoting the degree of a node for G ∈ G. Let p denote the probability of an edge in G, which can be computed as

p = m / M = m / C(n, 2) = 2m / (n(n − 1))
For any given node in G its degree can be at most n − 1 (because we do not allow loops). Because p is the probability of an edge for any node, the random variable X, corresponding to the degree of a node, follows a binomial distribution with probability of success p, given as
f(k) = P(X = k) = C(n−1, k) p^k (1 − p)^{n−1−k}

The average degree μd is then given as the expected value of X:

μd = E[X] = (n − 1)p

We can also compute the variance of the degrees among the nodes by computing the variance of X:

σd² = var(X) = (n − 1)p(1 − p)
Degree Distribution
To obtain the degree distribution for large and sparse random graphs, we need to derive an expression for f(k) = P(X = k) as n → ∞. Assuming that m = O(n), we can write

p = m / (n(n−1)/2) = O(n) / (n(n−1)/2) = 1/O(n) → 0

In other words, we are interested in the asymptotic behavior of the graphs as n → ∞ and p → 0.
Under these two trends, notice that the expected value and variance of X can be
rewritten as
E[X] = (n − 1)p ≃ np as n → ∞
var(X) = (n − 1)p(1 − p) ≃ np as n → ∞ and p → 0
In other words, for large and sparse random graphs the expectation and variance of X are the same:
E[X] = var(X) = np
and the binomial distribution can be approximated by a Poisson distribution with
parameter λ, given as
f(k) = λ^k e^{−λ} / k!
where λ = np represents both the expected value and variance of the distribution. Using Stirling’s approximation of the factorial k! ≃ kke−k√2πk we obtain
λk e−λ λk e−λ e−λ (λe)k
≃ kke−k√2πk = √2π √kkk k −1 −k
f(k)= k! In other words, we have
for α = λe = npe. We conclude that large and sparse random graphs follow a Poisson degree distribution, which does not exhibit a power-law relationship. Thus, in one crucial respect, the ER random graph model is not adequate to describe real-world scale-free graphs.
Clustering Coefficient
Let us consider a node vi in G with degree k. The clustering coefficient of vi is given as C(vi ) = 2mi
where k = ni also denotes the number of nodes and mi denotes the number of edges in the subgraph induced by neighbors of vi . However, because p is the probability of an edge, the expected number of edges mi among the neighbors of vi is simply
f(k)∝αk 2k
k(k−1)
Thus, we obtain
mi = pk(k−1) 2
C(vi)= 2mi =p k(k−1)
In other words, the expected clustering coefficient across all nodes of all degrees is uniform, and thus the overall clustering coefficient is also uniform:
C(G) = 1 C(vi ) = p
n
i
Furthermore, for sparse graphs we have p → 0, which in turn implies that C(G) = C(vi ) → 0. Thus, large random graphs have no clustering effect whatsoever, which is contrary to many real-world networks.
Diameter
We saw earlier that the expected degree of a node is μ_d = λ, which means that within one hop from a given node, we can reach λ other nodes. Because each of the neighbors of the initial node also has average degree λ, we can approximate the number of nodes that are two hops away as λ². In general, at a coarse level of approximation (i.e., ignoring shared neighbors), we can estimate the number of nodes at a distance of k hops away from a starting node v_i as λ^k. However, because there are a total of n distinct vertices in the graph, we have
$$\sum_{k=1}^{t} \lambda^k = n$$

where t denotes the maximum number of hops from v_i. We have

$$\sum_{k=1}^{t} \lambda^k = \frac{\lambda^{t+1} - 1}{\lambda - 1} \simeq \lambda^t$$

Plugging into the expression above, we have

$$\lambda^t \simeq n \quad\text{or}\quad t\log\lambda \simeq \log n, \text{ which implies } t \simeq \frac{\log n}{\log\lambda} \propto \log n$$
Because the path length from a node to the farthest node is bounded by t, it follows
that the diameter of the graph is also bounded by that value, that is,

$$d(G) \propto \log n$$
assuming that the expected degree λ is fixed. We can thus conclude that random graphs satisfy at least one property of real-world graphs, namely that they exhibit small-world behavior.
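A minimal simulation sketch of the ER generative process described above, using only the standard library: it checks that the empirical mean degree is close to np ≃ 2m/n and that the average clustering coefficient is close to p (and hence near zero for large sparse graphs). The parameter values n and m are illustrative.

```python
import random
from itertools import combinations

def er_graph(n, m, seed=1):
    """Generate a G(n, m) random graph by repeatedly adding distinct random edges."""
    random.seed(seed)
    edges, adj = set(), {v: set() for v in range(n)}
    while len(edges) < m:
        u, v = random.sample(range(n), 2)
        if (u, v) not in edges and (v, u) not in edges:
            edges.add((u, v))
            adj[u].add(v)
            adj[v].add(u)
    return adj

def avg_clustering(adj):
    """Average of the local clustering coefficients C(v) = 2 m_v / (k (k - 1))."""
    coeffs = []
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        m_v = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(2 * m_v / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

n, m = 1000, 3000                      # illustrative parameters
adj = er_graph(n, m)
p = 2 * m / (n * (n - 1))
mean_deg = sum(len(nbrs) for nbrs in adj.values()) / n
print(mean_deg, 2 * m / n)             # empirical versus expected mean degree
print(avg_clustering(adj), p)          # clustering is close to p, hence near 0
```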
4.4.2 Watts–Strogatz Small-world Graph Model
The random graph model fails to exhibit a high clustering coefficient, but it is small-world. The Watts–Strogatz (WS) model tries to explicitly model high local clustering by starting with a regular network in which each node is connected to its k neighbors on the right and left, assuming that the initial n vertices are arranged in a large circular backbone. Such a network will have a high clustering coefficient, but will not be small-world. Surprisingly, adding a small amount of randomness in the regular network by randomly rewiring some of the edges or by adding a small fraction of random edges leads to the emergence of the small-world phenomena.
The WS model starts with n nodes arranged in a circular layout, with each node connected to its immediate left and right neighbors. The edges in the initial layout are
Figure 4.10. Watts–Strogatz regular graph: n = 8, k = 2.
called backbone edges. Each node has edges to an additional k − 1 neighbors to the left and right. Thus, the WS model starts with a regular graph of degree 2k, where each node is connected to its k neighbors on the right and k neighbors on the left, as illustrated in Figure 4.10.
Clustering Coefficient and Diameter of Regular Graph
Consider the subgraph Gv induced by the 2k neighbors of a node v. The clustering coefficient of v is given as
$$C(v) = \frac{m_v}{M_v} \tag{4.10}$$
where mv is the actual number of edges, and Mv is the maximum possible number of edges, among the neighbors of v.
To compute mv , consider some node ri that is at a distance of i hops (with 1 ≤ i ≤ k) from v to the right, considering only the backbone edges. The node ri has edges to k − i of its immediate right neighbors (restricted to the right neighbors of v), and to k − 1 of its left neighbors (all k left neighbors, excluding v). Owing to the symmetry about v, a node li that is at a distance of i backbone hops from v to the left has the same number of edges. Thus, the degree of any node in Gv that is i backbone hops away from v is given as
di =(k−i)+(k−1)=2k−i−1
Because each edge contributes to the degree of its two incident nodes, summing the
degrees of all neighbors of v, we obtain
$$2m_v = 2\sum_{i=1}^{k} (2k - i - 1)$$
$$m_v = 2k^2 - \frac{k(k+1)}{2} - k$$
$$m_v = \frac{3k(k-1)}{2} \tag{4.11}$$
On the other hand, the number of possible edges among the 2k neighbors of v is
given as
$$M_v = \binom{2k}{2} = \frac{2k(2k-1)}{2} = k(2k-1)$$
Plugging the expressions for mv and Mv into Eq. (4.10), the clustering coefficient of a node v is given as
$$C(v) = \frac{m_v}{M_v} = \frac{3k-3}{4k-2}$$

As k increases, the clustering coefficient approaches 3/4, because C(G) = C(v) → 3/4 as k → ∞.
The WS regular graph thus has a high clustering coefficient. However, it does not
satisfy the small-world property. To see this, note that along the backbone, the farthest
node from v is at most n/2 backbone hops away. Further, because each node is connected to k neighbors on either side, one can reach the farthest node in at most n/(2k) hops. More precisely, the diameter of a regular WS graph is given as

$$d(G) = \begin{cases} \dfrac{n}{2k} & \text{if } n \text{ is even} \\[2mm] \dfrac{n-1}{2k} & \text{if } n \text{ is odd} \end{cases}$$
The regular graph has a diameter that scales linearly in the number of nodes, and thus it is not small-world.
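A small sketch that builds the WS regular ring lattice of degree 2k and verifies the closed-form clustering coefficient (3k − 3)/(4k − 2) numerically; the values n = 8 and k = 2 correspond to the graph of Figure 4.10 but are otherwise illustrative.

```python
from itertools import combinations

def ws_regular(n, k):
    """Ring lattice: each node is connected to its k nearest neighbors on each side."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for i in range(1, k + 1):
            adj[v].add((v + i) % n)
            adj[v].add((v - i) % n)
    return adj

def clustering(adj, v):
    """Local clustering coefficient C(v) = m_v / M_v."""
    nbrs = list(adj[v])
    m_v = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    deg = len(nbrs)
    return 2 * m_v / (deg * (deg - 1))

n, k = 8, 2                              # the graph of Figure 4.10
adj = ws_regular(n, k)
print(clustering(adj, 0))                # 0.5
print((3 * k - 3) / (4 * k - 2))         # (3k-3)/(4k-2) = 3/6 = 0.5
```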
Random Perturbation of Regular Graph
Edge Rewiring Starting with the regular graph of degree 2k, the WS model perturbs the regular structure by adding some randomness to the network. One approach is to randomly rewire edges with probability r. That is, for each edge (u,v) in the graph, with probability r, replace v with another randomly chosen node avoiding loops and duplicate edges. Because the WS regular graph has m = kn total edges, after rewiring, r m of the edges are random, and (1 − r )m are regular.
Edge Shortcuts An alternative approach is that instead of rewiring edges, we add a few shortcut edges between random pairs of nodes, as shown in Figure 4.11. The total number of random shortcut edges added to the network is given as mr = knr, so that r can be considered as the probability, per edge, of adding a shortcut edge. The total number of edges in the graph is then simply m + mr = (1 + r)m = (1 + r)kn. Because r ∈ [0,1], the number of edges then lies in the range [kn,2kn].
In either approach, if the probability r of rewiring or adding shortcut edges is r = 0, then we are left with the original regular graph, with high clustering coefficient, but with no small-world property. On the other hand, if the rewiring or shortcut probability r = 1, the regular structure is disrupted, and the graph approaches a random graph, with little to no clustering effect, but with small-world property. Surprisingly, introducing
Figure 4.11. Watts–Strogatz graph (n = 20, k = 3): shortcut edges are shown dotted.
only a small amount of randomness leads to a significant change in the regular network. As one can see in Figure 4.11, the presence of a few long-range shortcuts reduces the diameter of the network significantly. That is, even for a low value of r, the WS model retains most of the regular local clustering structure, but at the same time becomes small-world.
Properties of Watts–Strogatz Graphs
Degree Distribution Let us consider the shortcut approach, which is easier to analyze. In this approach, each vertex has degree at least 2k. In addition there are the shortcut edges, which follow a binomial distribution. Each node can have n′ = n − 2k − 1 additional shortcut edges, so we take n′ as the number of independent trials to add edges. Because a node has degree 2k, with shortcut edge probability of r, we expect roughly 2kr shortcuts from that node, but the node can connect to at most n − 2k − 1 other nodes. Thus, we can take the probability of success as
$$p = \frac{2kr}{n - 2k - 1} = \frac{2kr}{n'} \tag{4.12}$$
Let X denote the random variable denoting the number of shortcuts for each node. Then the probability of a node with j shortcut edges is given as
$$f(j) = P(X = j) = \binom{n'}{j} p^j (1-p)^{n'-j}$$
with E[X] = n′p = 2kr. The expected degree of each node in the network is therefore 2k + E[X] = 2k + 2kr = 2k(1 + r)
It is clear that the degree distribution of the WS graph does not adhere to a power law. Thus, such networks are not scale-free.
Clustering Coefficient After the shortcut edges have been added, each node v has expected degree 2k(1 + r), that is, it is on average connected to 2kr new neighbors, in addition to the 2k original ones. The number of possible edges among v’s neighbors is given as
$$M_v = \binom{2k(1+r)}{2} = \frac{2k(1+r)\,\big(2k(1+r) - 1\big)}{2} = (1+r)\,k\,\big(2k(1+r) - 1\big)$$
Because the regular WS graph remains intact even after adding shortcuts, the neighbors of v retain all 3k(k−1)/2 initial edges, as given in Eq. (4.11). In addition, some
of the shortcut edges may link pairs of nodes among v’s neighbors. Let Y be the random variable that denotes the number of shortcut edges present among the 2k(1+r) neighbors of v; then Y follows a binomial distribution with probability of success p, as given in Eq. (4.12). Thus, the expected number of shortcut edges is given as
E[Y] = pMv
Let mv be the random variable corresponding to the actual number of edges present among v’s neighbors, whether regular or shortcut edges. The expected number of edges among the neighbors of v is then given as
$$E[m_v] = E\left[ \frac{3k(k-1)}{2} + Y \right] = \frac{3k(k-1)}{2} + pM_v$$
Because the binomial distribution is essentially concentrated around the mean, we can now approximate the clustering coefficient by using the expected number of edges, as follows:
$$C(v) \simeq \frac{E[m_v]}{M_v} = \frac{\frac{3k(k-1)}{2} + pM_v}{M_v} = \frac{3k(k-1)}{2M_v} + p = \frac{3(k-1)}{(1+r)\big(4kr + 2(2k-1)\big)} + \frac{2kr}{n - 2k - 1}$$
using the value of p given in Eq. (4.12). For large graphs we have n → ∞, so we can
drop the second term above, to obtain
$$C(v) \simeq \frac{3(k-1)}{(1+r)\big(4kr + 2(2k-1)\big)} = \frac{3k-3}{4k - 2 + 2r(2kr + 4k - 1)} \tag{4.13}$$
As r → 0, the above expression becomes equivalent to Eq. (4.10). Thus, for small values
of r the clustering coefficient remains high.
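As a quick check, Eq. (4.13) can be evaluated directly; for k = 3 it reproduces the clustering values quoted in Example 4.10 below (0.6 at r = 0 and roughly 0.49 at r = 0.1). The chosen values of r are illustrative.

```python
def ws_clustering(k, r):
    """Expected clustering coefficient of the WS shortcut model, Eq. (4.13)."""
    return 3 * (k - 1) / ((1 + r) * (4 * k * r + 2 * (2 * k - 1)))

for r in (0.0, 0.005, 0.1):
    print(r, round(ws_clustering(3, r), 3))   # 0.6, 0.593, 0.487
```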
Diameter Deriving an analytical expression for the diameter of the WS model with random edge shortcuts is not easy. Instead we resort to an empirical study of the behavior of WS graphs when a small number of random shortcuts are added. In Example 4.10 we find that small values of shortcut edge probability r are enough to reduce the diameter from O(n) to O(logn). The WS model thus leads to graphs that are small-world and that also exhibit the clustering effect. However, the WS graphs do not display a scale-free degree distribution.
Figure 4.12. Watts–Strogatz model: diameter d(G) (circles, left y-axis) and clustering coefficient C(G) (triangles, right y-axis) versus edge probability r.
Example 4.10. Figure 4.12 shows a simulation of the WS model, for a graph with n = 1000 vertices and k = 3. The x-axis shows different values of the probability r of adding random shortcut edges. The diameter values are shown as circles using the left y-axis, whereas the clustering values are shown as triangles using the right y-axis. These values are the averages over 10 runs of the WS model. The solid line gives the clustering coefficient from the analytical formula in Eq. (4.13), which is in perfect agreement with the simulation values.
The initial regular graph has diameter

$$d(G) = \frac{n}{2k} = \frac{1000}{6} = 167$$

and its clustering coefficient is given as

$$C(G) = \frac{3(k-1)}{2(2k-1)} = \frac{6}{10} = 0.6$$
We can observe that the diameter quickly reduces, even with a very small edge addition probability. For r = 0.005, the diameter is 61. For r = 0.1, the diameter shrinks to 11, which is on the same scale as O(log2 n) because log2 1000 ≃ 10. On the other hand, we can observe that the clustering coefficient remains high. For r = 0.1, the clustering coefficient is 0.48. Thus, the simulation study confirms that the addition of even a small number of random shortcut edges reduces the diameter of the WS regular graph from O(n) (large-world) to O(log n) (small-world). At the same time the graph retains its local clustering property.
4.4.3 Barabási–Albert Scale-free Model
The Barabási–Albert (BA) model tries to capture the scale-free degree distributions of real-world graphs via a generative process that adds new nodes and edges at each time step. Further, the edge growth is based on the concept of preferential attachment; that is, edges from the new vertex are more likely to link to nodes with higher degrees. For this reason the model is also known as the rich get richer approach. The BA model mimics a dynamically growing graph by adding new vertices and edges at each time-step t = 1, 2, .... Let G_t denote the graph at time t, and let n_t denote the number of nodes, and m_t the number of edges in G_t.
Initialization
The BA model starts at time-step t = 0, with an initial graph G0 with n0 nodes and m0 edges. Each node in G0 should have degree at least 1; otherwise it will never be chosen for preferential attachment. We will assume that each node has initial degree 2, being connected to its left and right neighbors in a circular layout. Thus m0 = n0.
Growth and Preferential Attachment
The BA model derives a new graph Gt+1 from Gt by adding exactly one new node u and adding q ≤ n0 new edges from u to q distinct nodes vj ∈ Gt , where node vj is chosen with probability πt (vj ) proportional to its degree in Gt , given as
$$\pi_t(v_j) = \frac{d_j}{\sum_{v_i \in G_t} d_i} \tag{4.14}$$
Because only one new vertex is added at each step, the number of nodes in Gt is given as
nt = n0 + t
Further, because exactly q new edges are added at each time-step, the number of edges in G_t is given as

$$m_t = m_0 + qt$$

Because the sum of the degrees is two times the number of edges in the graph, we have

$$\sum_{v_i \in G_t} d(v_i) = 2m_t = 2(m_0 + qt)$$

We can thus rewrite Eq. (4.14) as

$$\pi_t(v_j) = \frac{d_j}{2(m_0 + qt)} \tag{4.15}$$

As the network grows, owing to preferential attachment, one intuitively expects high-degree hubs to emerge.
Example 4.11. Figure 4.13 shows a graph generated according to the BA model, with parameters n0 = 3, q = 2, and t = 12. Initially, at time t = 0, the graph has n0 = 3 vertices, namely {v0, v1, v2} (shown in gray), connected by m0 = 3 edges (shown in bold). At each time step t = 1, ..., 12, vertex v_{t+2} is added to the growing network and is connected to q = 2 vertices chosen with a probability proportional to their degree.

Figure 4.13. Barabási–Albert graph (n0 = 3, q = 2, t = 12).
For example, at t = 1, vertex v3 is added, with edges to v1 and v2 , chosen according to the distribution
$$\pi_0(v_i) = 1/3 \quad \text{for } i = 0, 1, 2$$
At t = 2, v4 is added. Using Eq. (4.15), nodes v2 and v3 are preferentially chosen
according to the probability distribution
$$\pi_1(v_0) = \pi_1(v_3) = \frac{2}{10} = 0.2 \qquad \pi_1(v_1) = \pi_1(v_2) = \frac{3}{10} = 0.3$$
The final graph after t = 12 time-steps shows the emergence of some hub nodes, such as v1 (with degree 9) and v3 (with degree 6).
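A minimal sketch of the BA generative process: start from a small seed graph and, at each time-step, attach a new node to q existing nodes sampled with probability proportional to degree. Sampling is done from a list in which each node appears once per unit of degree; the parameter values mirror Example 4.11 but the seed and the resulting hubs are illustrative.

```python
import random

def ba_graph(n0=3, q=2, t=12, seed=7):
    """Barabasi-Albert growth with preferential attachment."""
    random.seed(seed)
    # Seed graph: n0 nodes in a cycle, so each starts with degree 2 and m0 = n0.
    edges = [(i, (i + 1) % n0) for i in range(n0)]
    # 'targets' holds each node once per unit of degree, so uniform sampling
    # from it realizes the preferential attachment probability of Eq. (4.14).
    targets = [v for e in edges for v in e]
    for step in range(t):
        u = n0 + step
        chosen = set()
        while len(chosen) < q:          # q distinct attachment targets
            chosen.add(random.choice(targets))
        for v in chosen:
            edges.append((u, v))
            targets.extend([u, v])
    return edges

edges = ba_graph()
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
print(sorted(degree.items(), key=lambda kv: -kv[1])[:3])   # a few hub nodes emerge
```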
Degree Distribution
We now study two different approaches to estimate the degree distribution for the BA model, namely the discrete approach, and the continuous approach.
Discrete Approach The discrete approach is also called the master-equation method. Let Xt be a random variable denoting the degree of a node in Gt , and let ft (k) denote the probability mass function for Xt. That is, ft(k) is the degree distribution for the
graph Gt at time-step t. Simply put, ft (k) is the fraction of nodes with degree k at time t. Let nt denote the number of nodes and mt the number of edges in Gt. Further, let nt (k) denote the number of nodes with degree k in Gt . Then we have
$$f_t(k) = \frac{n_t(k)}{n_t}$$
Because we are interested in large real-world graphs, as t → ∞, the number of nodes and edges in Gt can be approximated as
$$n_t = n_0 + t \simeq t \qquad m_t = m_0 + qt \simeq qt \tag{4.16}$$

Based on Eq. (4.14), at time-step t + 1, the probability π_t(k) that some node with degree k in G_t is chosen for preferential attachment can be written as

$$\pi_t(k) = \frac{k \cdot n_t(k)}{\sum_i i \cdot n_t(i)}$$
Dividing the numerator and denominator by n_t, we have

$$\pi_t(k) = \frac{k \cdot \frac{n_t(k)}{n_t}}{\sum_i i \cdot \frac{n_t(i)}{n_t}} = \frac{k \cdot f_t(k)}{\sum_i i \cdot f_t(i)} \tag{4.17}$$

Note that the denominator is simply the expected value of X_t, that is, the mean degree in G_t, because

$$E[X_t] = \mu_d(G_t) = \sum_i i \cdot f_t(i) \tag{4.18}$$

Note also that in any graph the average degree is given as

$$\mu_d(G_t) = \frac{\sum_i d_i}{n_t} = \frac{2m_t}{n_t} \simeq \frac{2qt}{t} = 2q \tag{4.19}$$

where we used Eq. (4.16), that is, m_t ≃ qt. Equating Eqs. (4.18) and (4.19), we can rewrite the preferential attachment probability [Eq. (4.17)] for a node of degree k as

$$\pi_t(k) = \frac{k \cdot f_t(k)}{2q} \tag{4.20}$$
We now consider the change in the number of nodes with degree k, when a new vertex u joins the growing network at time-step t + 1. The net change in the number of nodes with degree k is given as the number of nodes with degree k at time t + 1 minus the number of nodes with degree k at time t, given as
(nt +1)·ft+1(k)−nt ·ft(k)
Using the approximation that nt ≃ t from Eq. (4.16), the net change in degree k nodes is
(nt +1)·ft+1(k)−nt ·ft(k)=(t+1)·ft+1(k)−t·ft(k) (4.21)
The number of nodes with degree k increases whenever u connects to a vertex v_i of degree k − 1 in G_t, as in this case v_i will have degree k in G_{t+1}. Over the q edges added at time t + 1, the number of nodes with degree k − 1 in G_t that are chosen to connect to u is given as

$$q \cdot \pi_t(k-1) = \frac{q \cdot (k-1) \cdot f_t(k-1)}{2q} = \frac{1}{2} \cdot (k-1) \cdot f_t(k-1) \tag{4.22}$$

where we use Eq. (4.20) for π_t(k − 1). Note that Eq. (4.22) holds only when k > q. This is because v_i must have degree at least q, as each node that is added at time t ≥ 1 has initial degree q. Therefore, if d_i = k − 1, then k − 1 ≥ q implies that k > q (we can also ensure that the initial n_0 nodes have degree q by starting with a clique of size n_0 = q + 1).
At the same time, the number of nodes with degree k decreases whenever u connects to a vertex v_i with degree k in G_t, as in this case v_i will have degree k + 1 in G_{t+1}. Using Eq. (4.20), over the q edges added at time t + 1, the number of nodes with degree k in G_t that are chosen to connect to u is given as

$$q \cdot \pi_t(k) = \frac{q \cdot k \cdot f_t(k)}{2q} = \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.23}$$
Based on the preceding discussion, when k > q, the net change in the number of nodes with degree k is given as the difference between Eqs. (4.22) and (4.23) in Gt :
$$q \cdot \pi_t(k-1) - q \cdot \pi_t(k) = \frac{1}{2} \cdot (k-1) \cdot f_t(k-1) - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.24}$$

Equating Eqs. (4.21) and (4.24) we obtain the master equation for k > q:

$$(t+1) \cdot f_{t+1}(k) - t \cdot f_t(k) = \frac{1}{2} \cdot (k-1) \cdot f_t(k-1) - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.25}$$

On the other hand, when k = q, assuming that there are no nodes in the graph with degree less than q, then only the newly added node contributes to an increase in the number of nodes with degree k = q by one. However, if u connects to an existing node v_i with degree k, then there will be a decrease in the number of degree k nodes because in this case v_i will have degree k + 1 in G_{t+1}. The net change in the number of nodes with degree k is therefore given as

$$1 - q \cdot \pi_t(k) = 1 - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.26}$$

Equating Eqs. (4.21) and (4.26) we obtain the master equation for the boundary condition k = q:

$$(t+1) \cdot f_{t+1}(k) - t \cdot f_t(k) = 1 - \frac{1}{2} \cdot k \cdot f_t(k) \tag{4.27}$$

Our goal is now to obtain the stationary or time-invariant solutions for the master equations. In other words, we study the solution when

$$f_{t+1}(k) = f_t(k) = f(k) \tag{4.28}$$

The stationary solution gives the degree distribution that is independent of time.
Let us first derive the stationary solution for k = q. Substituting Eq. (4.28) into Eq. (4.27) and setting k = q, we obtain

$$(t+1) \cdot f(q) - t \cdot f(q) = 1 - \frac{1}{2} \cdot q \cdot f(q)$$
$$2f(q) = 2 - q \cdot f(q), \text{ which implies that}$$
$$f(q) = \frac{2}{q+2} \tag{4.29}$$

The stationary solution for k > q gives us a recursion for f(k) in terms of f(k − 1):

$$(t+1) \cdot f(k) - t \cdot f(k) = \frac{1}{2} \cdot (k-1) \cdot f(k-1) - \frac{1}{2} \cdot k \cdot f(k)$$
$$2f(k) = (k-1) \cdot f(k-1) - k \cdot f(k), \text{ which implies that}$$
$$f(k) = \frac{k-1}{k+2} \cdot f(k-1) \tag{4.30}$$
Expanding (4.30) until the boundary condition k = q yields

$$f(k) = \frac{(k-1)}{(k+2)} \cdot f(k-1)$$
$$= \frac{(k-1)(k-2)}{(k+2)(k+1)} \cdot f(k-2)$$
$$\vdots$$
$$= \frac{(k-1)(k-2)(k-3)(k-4)\cdots(q+3)(q+2)(q+1)(q)}{(k+2)(k+1)(k)(k-1)\cdots(q+6)(q+5)(q+4)(q+3)} \cdot f(q)$$
$$= \frac{(q+2)(q+1)q}{(k+2)(k+1)k} \cdot f(q)$$

Plugging in the stationary solution for f(q) from Eq. (4.29) gives the general solution

$$f(k) = \frac{(q+2)(q+1)q}{(k+2)(k+1)k} \cdot \frac{2}{(q+2)} = \frac{2q(q+1)}{k(k+1)(k+2)}$$

For constant q and large k, it is easy to see that the degree distribution scales as

$$f(k) \propto k^{-3} \tag{4.31}$$
In other words, the BA model yields a power-law degree distribution with γ = 3, especially for large degrees.
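A small numeric check of the closed-form stationary solution f(k) = 2q(q+1)/(k(k+1)(k+2)): it sums to 1 over k ≥ q (the sum telescopes) and decays roughly as k^(-3) for large k. The truncation point of the sum and the sample degrees are arbitrary choices for illustration.

```python
def f_stationary(k, q):
    """Stationary degree distribution of the BA model (master-equation solution)."""
    return 2 * q * (q + 1) / (k * (k + 1) * (k + 2))

q = 3
total = sum(f_stationary(k, q) for k in range(q, 100000))   # telescoping sum -> 1
print(round(total, 4))                                      # ~1.0
print(f_stationary(200, q) / f_stationary(100, q))          # ~0.127, close to (1/2)**3
```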
Continuous Approach The continuous approach is also called the mean-field method. In the BA model, the vertices that are added early on tend to have a higher degree, because they have more chances to acquire connections from the vertices that are added to the network at a later time. The time dependence of the degree of a vertex can be approximated as a continuous random variable. Let ki = dt (i) denote the degree of vertex vi at time t. At time t, the probability that the newly added node u links to
v_i is given as π_t(i). Further, the change in v_i's degree per time-step is given as q · π_t(i). Using the approximation that n_t ≃ t and m_t ≃ qt from Eq. (4.16), the rate of change of k_i with time can be written as

$$\frac{dk_i}{dt} = q \cdot \pi_t(i) = q \cdot \frac{k_i}{2qt} = \frac{k_i}{2t}$$

Rearranging the terms in the preceding equation

$$\frac{dk_i}{k_i} = \frac{dt}{2t}$$

and integrating on both sides, we have

$$\int \frac{1}{k_i}\, dk_i = \int \frac{1}{2t}\, dt$$
$$\ln k_i = \frac{1}{2}\ln t + C$$
$$e^{\ln k_i} = e^{\ln t^{1/2}} \cdot e^{C}, \text{ which implies}$$
$$k_i = \alpha \cdot t^{1/2} \tag{4.32}$$

where C is the constant of integration, and thus α = e^C is also a constant.

Let t_i denote the time when node i was added to the network. Because the initial degree for any node is q, we obtain the boundary condition that k_i = q at time t = t_i. Plugging these into Eq. (4.32), we get

$$k_i = \alpha \cdot t_i^{1/2} = q, \text{ which implies that}$$
$$\alpha = \frac{q}{\sqrt{t_i}} \tag{4.33}$$

Substituting Eq. (4.33) into Eq. (4.32) leads to the particular solution

$$k_i = \alpha \cdot \sqrt{t} = q \cdot \sqrt{t/t_i} \tag{4.34}$$
Intuitively, this solution confirms the rich-gets-richer phenomenon. It suggests that if a node vi is added early to the network (i.e., ti is small), then as time progresses (i.e., t gets larger), the degree of vi keeps on increasing (as a square root of the time t).
Let us now consider the probability that the degree of v_i at time t is less than some value k, i.e., P(k_i < k). Note that if k_i < k, then by Eq. (4.34), we have

$$k_i = q\sqrt{t/t_i} < k, \text{ which implies that } t_i > \frac{q^2}{k^2}\, t$$

Thus, we can write

$$P(k_i < k) = P\!\left( t_i > \frac{q^2}{k^2}\, t \right) = 1 - P\!\left( t_i \le \frac{q^2}{k^2}\, t \right)$$

In other words, the probability that node v_i has degree less than k is the same as the probability that the time t_i at which v_i enters the graph is greater than (q²/k²)t, which in turn can be expressed as 1 minus the probability that t_i is less than or equal to (q²/k²)t.

Note that vertices are added to the graph at a uniform rate of one vertex per time-step, that is, 1/n_t ≃ 1/t. Thus, the probability that t_i is less than or equal to (q²/k²)t is given as

$$P\!\left( t_i \le \frac{q^2}{k^2}\, t \right) = \frac{q^2}{k^2}\, t \cdot \frac{1}{t} = \frac{q^2}{k^2}$$

so that P(k_i < k) = 1 − q²/k². Differentiating with respect to k gives the degree density f(k) = 2q²/k³, that is, f(k) ∝ k^{−3}, in agreement with the discrete approach [Eq. (4.31)].

It can further be shown that the expected clustering coefficient of BA graphs scales as

$$E[C(G_t)] = O\!\left( \frac{(\log n_t)^2}{n_t} \right)$$
which is only slightly better than the clustering coefficient for random graphs, which scales as O(n_t^{-1}). In Example 4.12, we empirically study the clustering coefficient and diameter for random instances of the BA model with a given set of parameters.
Example 4.12. Figure 4.14 plots the empirical degree distribution obtained as the average of 10 different BA graphs generated with the parameters n0 = 3, q = 3, and for t = 997 time-steps, so that the final graph has n = 1000 vertices. The slope of the line in the log-log scale confirms the existence of a power law, with the slope given as −γ = −2.64.
The average clustering coefficient over the 10 graphs was C(G) = 0.019, which is not very high, indicating that the BA model does not capture the clustering effect. On the other hand, the average diameter was d(G) = 6, indicating ultra-small-world behavior.
Figure 4.14. Barabási–Albert model (n0 = 3, t = 997, q = 3): degree distribution, log2 f(k) versus log2 k; the line of best fit has slope −γ = −2.64.
4.5 FURTHER READING
The theory of random graphs was founded in Erdős and Rényi (1959); for a detailed treatment of the topic see Bollobás (2001). Alternative graph models for real-world networks were proposed in Watts and Strogatz (1998) and Barabási and Albert (1999). One of the first comprehensive books on graph data analysis was Wasserman and Faust (1994). More recent books on network science include Lewis (2009) and Newman (2010). For PageRank see Brin and Page (1998), and for the hubs and authorities approach see Kleinberg (1999). For an up-to-date treatment of the patterns, laws, and models (including the RMat generator) for real-world networks, see Chakrabarti and Faloutsos (2012).
Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286 (5439): 509–512.
Bollobás, B. (2001). Random Graphs. 2nd ed. Vol. 73. New York: Cambridge University Press.
Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30 (1): 107–117.
Chakrabarti, D. and Faloutsos, C. (2012). Graph Mining: Laws, Tools, and Case Studies. Synthesis Lectures on Data Mining and Knowledge Discovery, 7 (1): 1–207.
Erdős, P. and Rényi, A. (1959). On random graphs. Publicationes Mathematicae Debrecen, 6: 290–297.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46 (5): 604–632.
Lewis, T. G. (2009). Network Science: Theory and Applications. Hoboken, NJ: John Wiley & Sons.
Newman, M. (2010). Networks: An Introduction. Oxford: Oxford University Press.
Wasserman, S. and Faust, K. (1994). Social Network Analysis: Methods and Applications. Structural Analysis in the Social Sciences. New York: Cambridge University Press.
Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393 (6684): 440–442.
4.6 EXERCISES
Q1. Given the graph in Figure 4.15, find the fixed-point of the prestige vector.
Figure 4.15. Graph for Q1.
Q2. Given the graph in Figure 4.16, find the fixed-point of the authority and hub vectors.

Figure 4.16. Graph for Q2.
Q3. Consider the double star graph given in Figure 4.17 with n nodes, where only nodes 1 and 2 are connected to all other vertices, and there are no other links. Answer the following questions (treating n as a variable).
(a) What is the degree distribution for this graph?
(b) What is the mean degree?
(c) What is the clustering coefficient for vertex 1 and vertex 3?
(d) What is the clustering coefficient C(G) for the entire graph? What happens to the clustering coefficient as n → ∞?
(e) What is the transitivity T(G) for the graph? What happens to T(G) as n → ∞?
(f) What is the average path length for the graph?
(g) What is the betweenness value for node 1?
(h) What is the degree variance for the graph?
Figure 4.17. Graph for Q3.
Q4. Consider the graph in Figure 4.18. Compute the hub and authority score vectors. Which nodes are the hubs and which are the authorities?
Figure 4.18. Graph for Q4.
Q5. Prove that in the BA model at time-step t + 1, the probability π_t(k) that some node with degree k in G_t is chosen for preferential attachment is given as

$$\pi_t(k) = \frac{k \cdot n_t(k)}{\sum_i i \cdot n_t(i)}$$
CHAPTER 5 Kernel Methods
Before we can mine data, it is important to first find a suitable data representation that facilitates data analysis. For example, for complex data such as text, sequences, images, and so on, we must typically extract or construct a set of attributes or features, so that we can represent the data instances as multivariate vectors. That is, given a data instance x (e.g., a sequence), we need to find a mapping φ, so that φ(x) is the vector representation of x. Even when the input data is a numeric data matrix, if we wish to discover nonlinear relationships among the attributes, then a nonlinear mapping φ may be used, so that φ(x) represents a vector in the corresponding high-dimensional space comprising nonlinear attributes. We use the term input space to refer to the data space for the input data x and feature space to refer to the space of mapped vectors φ(x). Thus, given a set of data objects or instances xi, and given a mapping function φ, we can transform them into feature vectors φ (xi ), which then allows us to analyze complex data instances via numeric analysis methods.
Example 5.1 (Sequence-based Features). Consider a dataset of DNA sequences over the alphabet Σ = {A, C, G, T}. One simple feature space is to represent each sequence in terms of the probability distribution over symbols in Σ. That is, given a sequence x with length |x| = m, the mapping into feature space is given as
φ(x) = {P(A),P(C),P(G),P(T)}
where P(s) = n_s/m is the probability of observing symbol s ∈ Σ, and n_s is the number of times s appears in sequence x. Here the input space is the set of sequences Σ*, and
the feature space is R4. For example, if x = ACAGCAGTA, with m = |x| = 9, since A occurs four times, C and G occur twice, and T occurs once, we have
φ (x) = (4/9, 2/9, 2/9, 1/9) = (0.44, 0.22, 0.22, 0.11) Likewise, for another sequence y = AGCAAGCGAG, we have
φ (y) = (4/10, 2/10, 4/10, 0) = (0.4, 0.2, 0.4, 0)
The mapping φ now allows one to compute statistics over the data sample to make inferences about the population. For example, we may compute the mean
symbol composition. We can also define the distance between any two sequences, for example,
$$\delta(x, y) = \|\phi(x) - \phi(y)\|$$
$$= \sqrt{(0.44-0.4)^2 + (0.22-0.2)^2 + (0.22-0.4)^2 + (0.11-0)^2} = 0.22$$

We can compute larger feature spaces by considering, for example, the probability distribution over all substrings or words of size up to k over the alphabet Σ, and so on.
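A sketch of the symbol-composition feature map from Example 5.1 and the resulting distance; it reproduces φ(x), φ(y), and δ(x, y) ≈ 0.22 for the two sequences above.

```python
import math

def phi(seq, alphabet="ACGT"):
    """Map a sequence to its symbol probability distribution (Example 5.1)."""
    m = len(seq)
    return [seq.count(s) / m for s in alphabet]

x, y = "ACAGCAGTA", "AGCAAGCGAG"
fx, fy = phi(x), phi(y)
print([round(v, 2) for v in fx])          # [0.44, 0.22, 0.22, 0.11]
print([round(v, 2) for v in fy])          # [0.4, 0.2, 0.4, 0.0]
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(fx, fy)))
print(round(dist, 2))                     # 0.22
```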
Example 5.2 (Nonlinear Features). As an example of a nonlinear mapping consider the mapping φ that takes as input a vector x = (x1,x2)T ∈ R2 and maps it to a “quadratic” feature space via the nonlinear mapping
$$\phi(x) = \left( x_1^2, x_2^2, \sqrt{2}\,x_1 x_2 \right)^T \in \mathbb{R}^3$$

For example, the point x = (5.9, 3)^T is mapped to the vector

$$\phi(x) = \left( 5.9^2, 3^2, \sqrt{2}\cdot 5.9\cdot 3 \right)^T = (34.81, 9, 25.03)^T$$
The main benefit of this transformation is that we may apply well-known linear analysis methods in the feature space. However, because the features are nonlinear combinations of the original attributes, this allows us to mine nonlinear patterns and relationships.
Whereas mapping into feature space allows one to analyze the data via algebraic and probabilistic modeling, the resulting feature space is usually very high-dimensional; it may even be infinite dimensional. Thus, transforming all the input points into feature space can be very expensive, or even impossible. Because the dimensionality is high, we also run into the curse of dimensionality highlighted later in Chapter 6.
Kernel methods avoid explicitly transforming each point x in the input space into the mapped point φ(x) in the feature space. Instead, the input objects are represented via their n × n pairwise similarity values. The similarity function, called a kernel, is chosen so that it represents a dot product in some high-dimensional feature space, yet it can be computed without directly constructing φ(x). Let I denote the input space, which can comprise any arbitrary set of objects, and let D = {xi }ni=1 ⊂ I be a dataset comprising n objects in the input space. We can represent the pairwise similarity values between points in D via the n × n kernel matrix, defined as
$$K = \begin{pmatrix} K(x_1,x_1) & K(x_1,x_2) & \cdots & K(x_1,x_n) \\ K(x_2,x_1) & K(x_2,x_2) & \cdots & K(x_2,x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n,x_1) & K(x_n,x_2) & \cdots & K(x_n,x_n) \end{pmatrix}$$
where K : I × I → R is a kernel function on any two points in input space. However, we require that K corresponds to a dot product in some feature space. That is, for any
xi , xj ∈ I , the kernel function should satisfy the condition
K(xi,xj)=φ(xi)Tφ(xj) (5.1)
where φ : I → F is a mapping from the input space I to the feature space F . Intuitively, this means that we should be able to compute the value of the dot product using the original input representation x, without having recourse to the mapping φ(x). Obviously, not just any arbitrary function can be used as a kernel; a valid kernel function must satisfy certain conditions so that Eq. (5.1) remains valid, as discussed in Section 5.1.
It is important to remark that the transpose operator for the dot product applies only when F is a vector space. When F is an abstract vector space with an inner product, the kernel is written as K(xi,xj)=⟨φ(xi),φ(xj)⟩. However, for convenience we use the transpose operator throughout this chapter; when F is an inner product space it should be understood that
φ(xi)Tφ(xj)≡⟨φ(xi),φ(xj)⟩
Example 5.3 (Linear and Quadratic Kernels). Consider the identity mapping, φ(x) → x. This naturally leads to the linear kernel, which is simply the dot product between two input vectors, and thus satisfies Eq. (5.1):
φ(x)Tφ(y) = xTy = K(x,y)
For example, consider the first five points from the two-dimensional Iris dataset
shown in Figure 5.1a:
x_1 = (5.9, 3.0)^T, x_2 = (6.9, 3.1)^T, x_3 = (6.6, 2.9)^T, x_4 = (4.6, 3.2)^T, x_5 = (6.0, 2.2)^T

The kernel matrix for the linear kernel is shown in Figure 5.1b. For example,
K(x1,x2)=xT1x2 =5.9×6.9+3×3.1=40.71+9.3=50.01
Figure 5.1. (a) Example points, plotted in the X1–X2 plane. (b) Linear kernel matrix:

K      x1      x2      x3      x4      x5
x1   43.81   50.01   47.64   36.74   42.00
x2   50.01   57.22   54.53   41.66   48.22
x3   47.64   54.53   51.97   39.64   45.98
x4   36.74   41.66   39.64   31.40   34.64
x5   42.00   48.22   45.98   34.64   40.84
Consider the quadratic mapping φ : R2 → R3 from Example 5.2, that maps
x = (x1,x2)T as follows:
$$\phi(x) = \left( x_1^2, x_2^2, \sqrt{2}\,x_1 x_2 \right)^T$$

The dot product between the mapping for two input points x, y ∈ R² is given as

$$\phi(x)^T\phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2$$

We can rearrange the preceding to obtain the (homogeneous) quadratic kernel function as follows:

$$\phi(x)^T\phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 x_2 y_2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2 = K(x, y)$$

We can thus see that the dot product in feature space can be computed by evaluating the kernel in input space, without explicitly mapping the points into feature space. For example, we have

$$\phi(x_1) = \left( 5.9^2, 3^2, \sqrt{2}\cdot 5.9\cdot 3 \right)^T = (34.81, 9, 25.03)^T$$
$$\phi(x_2) = \left( 6.9^2, 3.1^2, \sqrt{2}\cdot 6.9\cdot 3.1 \right)^T = (47.61, 9.61, 30.25)^T$$
$$\phi(x_1)^T\phi(x_2) = 34.81\times 47.61 + 9\times 9.61 + 25.03\times 30.25 = 2501$$

We can verify that the homogeneous quadratic kernel gives the same value

$$K(x_1, x_2) = (x_1^T x_2)^2 = (50.01)^2 = 2501$$
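A short sketch computing the linear and homogeneous quadratic kernels on the five example points; it reproduces K(x1, x2) = 50.01 for the linear kernel (Figure 5.1b) and roughly 2501 for the quadratic kernel.

```python
import numpy as np

# The five 2-dimensional Iris points of Figure 5.1a.
X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])

K_lin = X @ X.T            # linear kernel matrix: K(x_i, x_j) = x_i^T x_j
K_quad = K_lin ** 2        # homogeneous quadratic kernel: (x_i^T x_j)^2

print(np.round(K_lin, 2))  # matches Figure 5.1b, e.g. K_lin[0, 1] = 50.01
print(K_quad[0, 1])        # (50.01)^2, approximately 2501
```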
We shall see that many data mining methods can be kernelized, that is, instead of mapping the input points into feature space, the data can be represented via the n × n kernel matrix K, and all relevant analysis can be performed over K. This is usually done via the so-called kernel trick, that is, show that the analysis task requires only dot products φ (xi )T φ (xj ) in feature space, which can be replaced by the corresponding kernel K(xi , xj ) = φ (xi )T φ (xj ) that can be computed efficiently in input space. Once the kernel matrix has been computed, we no longer even need the input points xi, as all operations involving only dot products in the feature space can be performed over the n × n kernel matrix K. An immediate consequence is that when the input data is the typical n × d numeric matrix D and we employ the linear kernel, the results obtained by analyzing K are equivalent to those obtained by analyzing D (as long as only dot products are involved in the analysis). Of course, kernel methods allow much more flexibility, as we can just as easily perform non-linear analysis by employing nonlinear kernels, or we may analyze (non-numeric) complex objects without explicitly constructing the mapping φ(x).
Example 5.4. Consider the five points from Example 5.3 along with the linear kernel matrix shown in Figure 5.1. The mean of the five points in feature space is simply the mean in input space, as φ is the identity function for the linear kernel:

$$\mu_\phi = \frac{1}{5}\sum_{i=1}^{5} \phi(x_i) = \frac{1}{5}\sum_{i=1}^{5} x_i = (6.00, 2.88)^T$$

Now consider the squared magnitude of the mean in feature space:

$$\|\mu_\phi\|^2 = \mu_\phi^T \mu_\phi = (6.0^2 + 2.88^2) = 44.29$$

Because this involves only a dot product in feature space, the squared magnitude can be computed directly from K. As we shall see later [see Eq. (5.12)] the squared norm of the mean vector in feature space is equivalent to the average value of the kernel matrix K. For the kernel matrix in Figure 5.1b we have

$$\frac{1}{5^2}\sum_{i=1}^{5}\sum_{j=1}^{5} K(x_i, x_j) = \frac{1107.36}{25} = 44.29$$

which matches the ||μ_φ||² value computed earlier. This example illustrates that operations involving dot products in feature space can be cast as operations over the kernel matrix K.
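The identity used in Example 5.4 can be checked numerically with the same five points: the average of the linear kernel matrix equals the squared norm of the mean point.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = X @ X.T                         # linear kernel matrix

mu = X.mean(axis=0)                 # mean in input space = mean in feature space
print(round(float(mu @ mu), 2))     # ||mu||^2 = 6.0^2 + 2.88^2 = 44.29
print(round(float(K.mean()), 2))    # average of K, also 44.29 by Eq. (5.12)
```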
Kernel methods offer a radically different view of the data. Instead of thinking of the data as vectors in input or feature space, we consider only the kernel values between pairs of points. The kernel matrix can also be considered as a weighted adjacency matrix for the complete graph over the n input points, and consequently there is a strong connection between kernels and graph analysis, in particular algebraic graph theory.
5.1 KERNEL MATRIX
Let I denote the input space, which can be any arbitrary set of data objects, and let D = {x1,x2,…,xn} ⊂ I denote a subset of n objects in the input space. Let φ : I → F be a mapping from the input space into the feature space F, which is endowed with a dot product and norm. Let K: I × I → R be a function that maps pairs of input objects to their dot product value in feature space, that is, K(xi , xj ) = φ (xi )T φ (xj ), and let K be the n × n kernel matrix corresponding to the subset D.
The function K is called a positive semidefinite kernel if and only if it is symmetric: K(xi,xj)=K(xj,xi)
and the corresponding kernel matrix K for any subset D ⊂ I is positive semidefinite, that is,
aTKa ≥ 0, for all vectors a ∈ Rn
which implies that

$$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j K(x_i, x_j) \ge 0, \quad \text{for all } a_i \in \mathbb{R}, i \in [1, n] \tag{5.2}$$
We first verify that if K(xi,xj) represents the dot product φ(xi)Tφ(xj) in some feature space, then K is a positive semidefinite kernel. Consider any dataset D, and let K = {K(xi , xj )} be the corresponding kernel matrix. First, K is symmetric since the dot product is symmetric, which also implies that K is symmetric. Second, K is positive semidefinite because
$$a^T K a = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j K(x_i, x_j) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j\, \phi(x_i)^T\phi(x_j) = \left( \sum_{i=1}^{n} a_i \phi(x_i) \right)^T \left( \sum_{j=1}^{n} a_j \phi(x_j) \right) = \left\| \sum_{i=1}^{n} a_i \phi(x_i) \right\|^2 \ge 0$$
Thus, K is a positive semidefinite kernel.
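A quick numerical check of positive semidefiniteness for the linear kernel matrix of Figure 5.1b: all eigenvalues are non-negative (cf. Example 5.5 below, where λ1 ≈ 223.95, λ2 ≈ 1.29 and the rest are zero), and a^T K a ≥ 0 for an arbitrary vector a.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = X @ X.T                                  # linear kernel matrix of Figure 5.1b

eigvals = np.linalg.eigvalsh(K)              # eigenvalues of the symmetric matrix K
print(np.round(eigvals, 2))                  # all >= 0 (up to numerical round-off)

a = np.random.default_rng(0).normal(size=5)  # arbitrary vector a
print(a @ K @ a >= -1e-9)                    # a^T K a >= 0, as in Eq. (5.2)
```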
We now show that if we are given a positive semidefinite kernel K : I × I → R,
then it corresponds to a dot product in some feature space F.

5.1.1 Reproducing Kernel Map
For the reproducing kernel map φ, we map each point x ∈ I into a function in a functional space {f : I → R} comprising functions that map points in I into R. Algebraically this space of functions is an abstract vector space where each point happens to be a function. In particular, any x ∈ I in the input space is mapped to the following function:
φ(x)=K(x,·)
where the · stands for any argument in I. That is, each object x in the input space gets mapped to a feature point φ(x), which is in fact a function K(x,·) that represents its similarity to all other points in the input space I.
Let F be the set of all functions or points that can be obtained as a linear combination of any subset of feature points, defined as
$$\mathcal{F} = \mathrm{span}\left\{ K(x, \cdot) \mid x \in \mathcal{I} \right\} = \left\{ f = f(\cdot) = \sum_{i=1}^{m} \alpha_i K(x_i, \cdot) \;\middle|\; m \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ \{x_1, \ldots, x_m\} \subseteq \mathcal{I} \right\}$$
We use the dual notation f and f (·) interchangeably to emphasize the fact that each point f in the feature space is in fact a function f (·). Note that by definition the feature point φ(x) = K(x,·) belongs to F.
Let f, g ∈ F be any two points in feature space:
$$f = f(\cdot) = \sum_{i=1}^{m_a} \alpha_i K(x_i, \cdot) \qquad g = g(\cdot) = \sum_{j=1}^{m_b} \beta_j K(x_j, \cdot)$$

Define the dot product between two points as

$$f^T g = f(\cdot)^T g(\cdot) = \sum_{i=1}^{m_a}\sum_{j=1}^{m_b} \alpha_i \beta_j K(x_i, x_j) \tag{5.3}$$
We emphasize that the notation fTg is only a convenience; it denotes the inner product ⟨f,g⟩ because F is an abstract vector space, with an inner product as defined above.
We can verify that the dot product is bilinear, that is, linear in both arguments, because
$$f^T g = \sum_{i=1}^{m_a} \alpha_i \left( \sum_{j=1}^{m_b} \beta_j K(x_i, x_j) \right) = \sum_{i=1}^{m_a} \alpha_i\, g(x_i) = \sum_{j=1}^{m_b} \beta_j\, f(x_j)$$

The fact that K is positive semidefinite implies that

$$\|f\|^2 = f^T f = \sum_{i=1}^{m_a}\sum_{j=1}^{m_a} \alpha_i \alpha_j K(x_i, x_j) \ge 0$$
Thus, the space F is a pre-Hilbert space, defined as a normed inner product space, because it is endowed with a symmetric bilinear dot product and a norm. By adding the limit points of all Cauchy sequences that are convergent, F can be turned into a Hilbert space, defined as a normed inner product space that is complete. However, showing this is beyond the scope of this chapter.
The space F has the so-called reproducing property, that is, we can evaluate a function f (·) = f at a point x ∈ I by taking the dot product of f with φ(x), that is,
$$f^T \phi(x) = f(\cdot)^T K(x, \cdot) = \sum_{i=1}^{m_a} \alpha_i K(x_i, x) = f(x)$$
For this reason, the space F is also called a reproducing kernel Hilbert space.
All we have to do now is to show that K(xi , xj ) corresponds to a dot product in the feature space F . This is indeed the case, because using Eq. (5.3) for any two feature
points φ(x_i), φ(x_j) ∈ F, their dot product is given as

$$\phi(x_i)^T\phi(x_j) = K(x_i, \cdot)^T K(x_j, \cdot) = K(x_i, x_j)$$
The reproducing kernel map shows that any positive semidefinite kernel corre- sponds to a dot product in some feature space. This means we can apply well known algebraic and geometric methods to understand and analyze the data in these spaces.
Empirical Kernel Map
The reproducing kernel map φ maps the input space into a potentially infinite dimensional feature space. However, given a dataset D = {xi}ni=1, we can obtain a finite
dimensional mapping by evaluating the kernel only on points in D. That is, define the
map φ as follows:
$$\phi(x) = \left( K(x_1, x), K(x_2, x), \ldots, K(x_n, x) \right)^T \in \mathbb{R}^n$$
which maps each point x ∈ I to the n-dimensional vector comprising the kernel values of x with each of the objects xi ∈ D. We can define the dot product in feature space as
$$\phi(x_i)^T\phi(x_j) = \sum_{k=1}^{n} K(x_k, x_i)\, K(x_k, x_j) = K_i^T K_j \tag{5.4}$$

where K_i denotes the ith column of K, which is also the same as the ith row of K (considered as a column vector), as K is symmetric. However, for φ to be a valid map, we require that φ(x_i)^T φ(x_j) = K(x_i, x_j), which is clearly not satisfied by Eq. (5.4). One solution is to replace K_i^T K_j in Eq. (5.4) with K_i^T A K_j for some positive semidefinite matrix A such that

$$K_i^T A K_j = K(x_i, x_j)$$

If we can find such an A, it would imply that over all pairs of mapped points we have

$$\left\{ K_i^T A K_j \right\}_{i,j=1}^{n} = \left\{ K(x_i, x_j) \right\}_{i,j=1}^{n}$$

which can be written compactly as

$$K A K = K$$

This immediately suggests that we take A = K^{-1}, the (pseudo) inverse of the kernel matrix K. The modified map φ, called the empirical kernel map, is then defined as

$$\phi(x) = K^{-1/2} \cdot \left( K(x_1, x), K(x_2, x), \ldots, K(x_n, x) \right)^T \in \mathbb{R}^n$$

so that the dot product yields

$$\phi(x_i)^T\phi(x_j) = \left( K^{-1/2} K_i \right)^T \left( K^{-1/2} K_j \right) = K_i^T \left( K^{-1/2} K^{-1/2} \right) K_j = K_i^T K^{-1} K_j$$

Over all pairs of mapped points, we have

$$\left\{ K_i^T K^{-1} K_j \right\}_{i,j=1}^{n} = K K^{-1} K = K$$
as desired. However, it is important to note that this empirical feature representation is valid only for the n points in D. If points are added to or removed from D, the kernel map will have to be updated for all points.
5.1.2 Mercer Kernel Map
In general different feature spaces can be constructed for the same kernel K. We now describe how to construct the Mercer map.
Data-specific Kernel Map
The Mercer kernel map is best understood starting from the kernel matrix for the dataset D in input space. Because K is a symmetric positive semidefinite matrix, it has real and non-negative eigenvalues, and it can be decomposed as follows:
$$K = U \Lambda U^T$$

where U is the orthonormal matrix of eigenvectors u_i = (u_{i1}, u_{i2}, ..., u_{in})^T ∈ R^n (for i = 1, ..., n), and Λ is the diagonal matrix of eigenvalues, with both arranged in non-increasing order of the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ 0:

$$U = \begin{pmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{pmatrix} \qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$

The kernel matrix K can therefore be rewritten as the spectral sum

$$K = \lambda_1 u_1 u_1^T + \lambda_2 u_2 u_2^T + \cdots + \lambda_n u_n u_n^T$$

In particular the kernel function between x_i and x_j is given as

$$K(x_i, x_j) = \lambda_1 u_{1i} u_{1j} + \lambda_2 u_{2i} u_{2j} + \cdots + \lambda_n u_{ni} u_{nj} = \sum_{k=1}^{n} \lambda_k u_{ki} u_{kj} \tag{5.5}$$

where u_{ki} denotes the ith component of eigenvector u_k. It follows that if we define the Mercer map φ as follows:

$$\phi(x_i) = \left( \sqrt{\lambda_1}\, u_{1i}, \sqrt{\lambda_2}\, u_{2i}, \ldots, \sqrt{\lambda_n}\, u_{ni} \right)^T \tag{5.6}$$

then K(x_i, x_j) is a dot product in feature space between the mapped points φ(x_i) and φ(x_j), because

$$\phi(x_i)^T\phi(x_j) = \left( \sqrt{\lambda_1}\, u_{1i}, \ldots, \sqrt{\lambda_n}\, u_{ni} \right) \left( \sqrt{\lambda_1}\, u_{1j}, \ldots, \sqrt{\lambda_n}\, u_{nj} \right)^T = \lambda_1 u_{1i} u_{1j} + \cdots + \lambda_n u_{ni} u_{nj} = K(x_i, x_j)$$

Noting that U_i = (u_{1i}, u_{2i}, ..., u_{ni})^T is the ith row of U, we can rewrite the Mercer map φ as

$$\phi(x_i) = \sqrt{\Lambda}\, U_i \tag{5.7}$$

Thus, the kernel value is simply the dot product between scaled rows of U:

$$\phi(x_i)^T\phi(x_j) = \left( \sqrt{\Lambda}\, U_i \right)^T \left( \sqrt{\Lambda}\, U_j \right) = U_i^T \Lambda\, U_j$$

The Mercer map, defined equivalently in Eqs. (5.6) and (5.7), is obviously restricted to the input dataset D, just like the empirical kernel map, and is therefore called the data-specific Mercer kernel map. It defines a data-specific feature space of dimensionality at most n, comprising the eigenvectors of K.
Example 5.5. Let the input dataset comprise the five points shown in Figure 5.1a, and let the corresponding kernel matrix be as shown in Figure 5.1b. Computing the eigen-decomposition of K, we obtain λ1 = 223.95, λ2 = 1.29, and λ3 = λ4 = λ5 = 0. The effective dimensionality of the feature space is 2, comprising the eigenvectors u1 and u2. Thus, the matrix U is given as follows:
$$U = \begin{pmatrix} -0.442 & 0.163 \\ -0.505 & -0.134 \\ -0.482 & -0.181 \\ -0.369 & 0.813 \\ -0.425 & -0.512 \end{pmatrix}$$

with columns u_1, u_2 and rows U_1, ..., U_5, and we have

$$\Lambda = \begin{pmatrix} 223.95 & 0 \\ 0 & 1.29 \end{pmatrix} \qquad \sqrt{\Lambda} = \begin{pmatrix} \sqrt{223.95} & 0 \\ 0 & \sqrt{1.29} \end{pmatrix} = \begin{pmatrix} 14.965 & 0 \\ 0 & 1.135 \end{pmatrix}$$

The kernel map is specified via Eq. (5.7). For example, for x_1 = (5.9, 3)^T and x_2 = (6.9, 3.1)^T we have

$$\phi(x_1) = \sqrt{\Lambda}\, U_1 = \begin{pmatrix} 14.965 & 0 \\ 0 & 1.135 \end{pmatrix} \begin{pmatrix} -0.442 \\ 0.163 \end{pmatrix} = \begin{pmatrix} -6.616 \\ 0.185 \end{pmatrix}$$
$$\phi(x_2) = \sqrt{\Lambda}\, U_2 = \begin{pmatrix} 14.965 & 0 \\ 0 & 1.135 \end{pmatrix} \begin{pmatrix} -0.505 \\ -0.134 \end{pmatrix} = \begin{pmatrix} -7.563 \\ -0.153 \end{pmatrix}$$

Their dot product is given as

$$\phi(x_1)^T\phi(x_2) = 6.616 \times 7.563 - 0.185 \times 0.153 = 50.038 - 0.028 = 50.01$$

which matches the kernel value K(x_1, x_2) in Figure 5.1b.
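A sketch of the data-specific Mercer map of Eqs. (5.6)–(5.7): eigendecompose K, scale the rows of U by √Λ, and check that the dot products of the mapped points reproduce K (for example, φ(x1)^T φ(x2) ≈ 50.01, as in Example 5.5).

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = X @ X.T                                    # linear kernel matrix

lam, U = np.linalg.eigh(K)                     # eigenvalues ascending, columns = eigenvectors
lam, U = lam[::-1], U[:, ::-1]                 # reorder to non-increasing eigenvalues
lam = np.clip(lam, 0, None)                    # clip tiny negative values from round-off

Phi = U * np.sqrt(lam)                         # row i is phi(x_i) = sqrt(Lambda) U_i, Eq. (5.7)
print(np.round(Phi @ Phi.T, 2))                # recovers K
print(round(float(Phi[0] @ Phi[1]), 2))        # 50.01 = K(x1, x2), as in Example 5.5
```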
Mercer Kernel Map
For compact continuous spaces, analogous to the discrete case in Eq. (5.5), the kernel value between any two points can be written as the infinite spectral decomposition
$$K(x_i, x_j) = \sum_{k=1}^{\infty} \lambda_k\, u_k(x_i)\, u_k(x_j)$$
where {λ1 , λ2 , . . .} is the infinite set of eigenvalues, and u1 (·), u2 (·), . . . is the corresponding set of orthogonal and normalized eigenfunctions, that is, each function ui (·) is a solution to the integral equation
$$\int K(x, y)\, u_i(y)\, dy = \lambda_i u_i(x)$$

and K is a continuous positive semidefinite kernel, that is, for all functions a(·) with a finite square integral (i.e., ∫ a(x)² dx < ∞), K satisfies the condition

$$\iint K(x_1, x_2)\, a(x_1)\, a(x_2)\, dx_1\, dx_2 \ge 0$$

We can see that this positive semidefinite kernel for compact continuous spaces is analogous to the discrete kernel in Eq. (5.2). Further, similarly to the data-specific Mercer map [Eq. (5.6)], the general Mercer kernel map is given as

$$\phi(x_i) = \left( \sqrt{\lambda_1}\, u_1(x_i), \sqrt{\lambda_2}\, u_2(x_i), \ldots \right)^T$$
with the kernel value being equivalent to the dot product between two mapped points:
K(xi,xj)=φ(xi)Tφ(xj)
5.2 VECTOR KERNELS
We now consider two of the most commonly used vector kernels in practice. Kernels that map an (input) vector space into another (feature) vector space are called vector kernels. For multivariate input data, the input vector space will be the d-dimensional real space Rd. Let D comprise n input points xi ∈ Rd, for i = 1,2,...,n. Commonly used (nonlinear) kernel functions over vector data include the polynomial and Gaussian kernels, as described next.
Polynomial Kernel
Polynomial kernels are of two types: homogeneous or inhomogeneous. Let x, y ∈ Rd . The homogeneous polynomial kernel is defined as
Kq(x,y)=φ(x)Tφ(y)=(xTy)q (5.8)
where q is the degree of the polynomial. This kernel corresponds to a feature space spanned by all products of exactly q attributes.
The most typical cases are the linear (with q = 1) and quadratic (with q = 2) kernels, given as
K1(x,y) = xTy
K2(x,y)=(xTy)2 The inhomogeneous polynomial kernel is defined as
Kq(x,y)=φ(x)Tφ(y)=(c+xTy)q (5.9)
where q is the degree of the polynomial, and c ≥ 0 is some constant. When c = 0 we obtain the homogeneous kernel. When c > 0, this kernel corresponds to the feature space spanned by all products of at most q attributes. This can be seen from the binomial expansion
$$K_q(x, y) = (c + x^T y)^q = \sum_{k=0}^{q} \binom{q}{k} c^{q-k} (x^T y)^k$$

For example, for the typical value of c = 1, the inhomogeneous kernel is a weighted sum of the homogeneous polynomial kernels for all powers up to q, that is,

$$(1 + x^T y)^q = 1 + q\, x^T y + \binom{q}{2} (x^T y)^2 + \cdots + q\, (x^T y)^{q-1} + (x^T y)^q$$
Example 5.6. Consider the points x1 and x2 in Figure 5.1.
x_1 = (5.9, 3)^T, x_2 = (6.9, 3.1)^T

The homogeneous quadratic kernel is given as

$$K(x_1, x_2) = (x_1^T x_2)^2 = 50.01^2 = 2501$$

The inhomogeneous quadratic kernel is given as

$$K(x_1, x_2) = (1 + x_1^T x_2)^2 = (1 + 50.01)^2 = 51.01^2 = 2602.02$$
For the polynomial kernel it is possible to construct a mapping φ from the input to the feature space. Let n_0, n_1, ..., n_d denote non-negative integers, such that Σ_{i=0}^{d} n_i = q. Further, let n = (n_0, n_1, ..., n_d), and let |n| = Σ_{i=0}^{d} n_i = q. Also, let $\binom{q}{\mathbf{n}}$ denote the multinomial coefficient

$$\binom{q}{\mathbf{n}} = \binom{q}{n_0, n_1, \ldots, n_d} = \frac{q!}{n_0!\, n_1! \cdots n_d!}$$

The multinomial expansion of the inhomogeneous kernel is then given as

$$K_q(x, y) = (c + x^T y)^q = \left( c + \sum_{k=1}^{d} x_k y_k \right)^q = (c + x_1 y_1 + \cdots + x_d y_d)^q$$
$$= \sum_{|\mathbf{n}|=q} \binom{q}{\mathbf{n}} c^{n_0} (x_1 y_1)^{n_1} (x_2 y_2)^{n_2} \cdots (x_d y_d)^{n_d}$$
$$= \sum_{|\mathbf{n}|=q} \binom{q}{\mathbf{n}} c^{n_0} \left( x_1^{n_1} x_2^{n_2} \cdots x_d^{n_d} \right) \left( y_1^{n_1} y_2^{n_2} \cdots y_d^{n_d} \right)$$
$$= \sum_{|\mathbf{n}|=q} \left( \sqrt{a_{\mathbf{n}}} \prod_{k=1}^{d} x_k^{n_k} \right) \left( \sqrt{a_{\mathbf{n}}} \prod_{k=1}^{d} y_k^{n_k} \right) = \phi(x)^T \phi(y)$$

where $a_{\mathbf{n}} = \binom{q}{\mathbf{n}} c^{n_0}$, and the summation is over all n = (n_0, n_1, ..., n_d) such that |n| = n_0 + n_1 + ... + n_d = q. Using the notation $x^{\mathbf{n}} = \prod_{k=1}^{d} x_k^{n_k}$, the mapping φ: R^d → R^m is given as the vector

$$\phi(x) = \left( \ldots, \sqrt{a_{\mathbf{n}}}\, x^{\mathbf{n}}, \ldots \right)^T = \left( \ldots, \sqrt{\binom{q}{\mathbf{n}} c^{n_0}} \prod_{k=1}^{d} x_k^{n_k}, \ldots \right)^T$$

where the variable n = (n_0, ..., n_d) ranges over all the possible assignments, such that |n| = q. It can be shown that the dimensionality of the feature space is given as

$$m = \binom{d+q}{q}$$
Example 5.7 (Quadratic Polynomial Kernel). Let x,y ∈ R2 and let c = 1. The inhomogeneous quadratic polynomial kernel is given as
K(x,y)=(1+xTy)2 =(1+x1y1 +x2y2)2
The set of all assignments n = (n0 , n1 , n2 ), such that |n| = q = 2, and the corresponding terms in the multinomial expansion are shown below.
Assignments n = (n0, n1, n2)   Coefficient a_n = (q choose n)·c^{n0}   Variables x^n y^n = Π_k (x_k y_k)^{n_k}
(1, 1, 0)                      2                                       x1 y1
(1, 0, 1)                      2                                       x2 y2
(0, 1, 1)                      2                                       x1 y1 x2 y2
(2, 0, 0)                      1                                       1
(0, 2, 0)                      1                                       (x1 y1)^2
(0, 0, 2)                      1                                       (x2 y2)^2
Thus, the kernel can be written as
$$K(x, y) = 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 y_1 x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
$$= \left( 1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2, x_1^2, x_2^2 \right) \left( 1, \sqrt{2}y_1, \sqrt{2}y_2, \sqrt{2}y_1 y_2, y_1^2, y_2^2 \right)^T$$
$$= \phi(x)^T \phi(y)$$

When the input space is R², the dimensionality of the feature space is given as

$$m = \binom{d+q}{q} = \binom{2+2}{2} = \binom{4}{2} = 6$$

In this case the inhomogeneous quadratic kernel with c = 1 corresponds to the mapping φ: R² → R⁶, given as

$$\phi(x) = \left( 1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2, x_1^2, x_2^2 \right)^T$$

For example, for x_1 = (5.9, 3)^T and x_2 = (6.9, 3.1)^T, we have

$$\phi(x_1) = \left( 1, \sqrt{2}\cdot 5.9, \sqrt{2}\cdot 3, \sqrt{2}\cdot 5.9\cdot 3, 5.9^2, 3^2 \right)^T = (1, 8.34, 4.24, 25.03, 34.81, 9)^T$$
$$\phi(x_2) = \left( 1, \sqrt{2}\cdot 6.9, \sqrt{2}\cdot 3.1, \sqrt{2}\cdot 6.9\cdot 3.1, 6.9^2, 3.1^2 \right)^T = (1, 9.76, 4.38, 30.25, 47.61, 9.61)^T$$
Thus, the inhomogeneous kernel value is φ(x1)Tφ(x2)=1+81.40+18.57+757.16+1657.30+86.49=2601.92
On the other hand, when the input space is R2, the homogeneous quadratic kernel corresponds to the mapping φ : R2 → R3, defined as
$$\phi(x) = \left( \sqrt{2}x_1 x_2, x_1^2, x_2^2 \right)^T$$

because only the degree 2 terms are considered. For example, for x_1 and x_2, we have

$$\phi(x_1) = \left( \sqrt{2}\cdot 5.9\cdot 3, 5.9^2, 3^2 \right)^T = (25.03, 34.81, 9)^T$$
$$\phi(x_2) = \left( \sqrt{2}\cdot 6.9\cdot 3.1, 6.9^2, 3.1^2 \right)^T = (30.25, 47.61, 9.61)^T$$

and thus

$$K(x_1, x_2) = \phi(x_1)^T\phi(x_2) = 757.16 + 1657.3 + 86.49 = 2500.95$$
These values essentially match those shown in Example 5.6 up to four significant digits.
Gaussian Kernel
The Gaussian kernel, also called the Gaussian radial basis function (RBF) kernel, is
defined as
$$K(x, y) = \exp\left\{ -\frac{\|x - y\|^2}{2\sigma^2} \right\} \tag{5.10}$$
where σ > 0 is the spread parameter that plays the same role as the standard deviation in a normal density function. Note that K(x, x) = 1, and further that the kernel value is inversely related to the distance between the two points x and y.
Example 5.8. Consider again the points x_1 and x_2 in Figure 5.1:

x_1 = (5.9, 3)^T, x_2 = (6.9, 3.1)^T

The squared distance between them is given as

$$\|x_1 - x_2\|^2 = \|(-1, -0.1)^T\|^2 = 1^2 + 0.1^2 = 1.01$$

With σ = 1, the Gaussian kernel is

$$K(x_1, x_2) = \exp\{-1.01/2\} = \exp\{-0.51\} = 0.6$$
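A sketch of the Gaussian kernel of Eq. (5.10); with σ = 1 it reproduces K(x1, x2) ≈ 0.6 from Example 5.8, and the same function can build the full kernel matrix later used in Example 5.10.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel of Eq. (5.10)."""
    d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2))

x1, x2 = [5.9, 3.0], [6.9, 3.1]
print(round(gaussian_kernel(x1, x2), 2))       # exp(-1.01/2) = 0.60

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
print(np.round(K, 2))                          # the 5x5 Gaussian kernel matrix
```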
It is interesting to note that a feature space for the Gaussian kernel has infinite dimensionality. To see this, note that the exponential function can be written as the
infinite expansion

$$\exp\{a\} = \sum_{n=0}^{\infty} \frac{a^n}{n!} = 1 + a + \frac{1}{2!}a^2 + \frac{1}{3!}a^3 + \cdots$$
Further, using γ = 1/(2σ²), and noting that ||x − y||² = ||x||² + ||y||² − 2x^T y, we can rewrite the Gaussian kernel as follows:

$$K(x, y) = \exp\{-\gamma \|x - y\|^2\} = \exp\{-\gamma \|x\|^2\} \cdot \exp\{-\gamma \|y\|^2\} \cdot \exp\{2\gamma\, x^T y\}$$

In particular, the last term is given as the infinite expansion

$$\exp\{2\gamma\, x^T y\} = \sum_{q=0}^{\infty} \frac{(2\gamma)^q}{q!} (x^T y)^q = 1 + (2\gamma)\, x^T y + \frac{(2\gamma)^2}{2!} (x^T y)^2 + \cdots$$

Using the multinomial expansion of (x^T y)^q, we can write the Gaussian kernel as

$$K(x, y) = \exp\{-\gamma\|x\|^2\}\, \exp\{-\gamma\|y\|^2\} \sum_{q=0}^{\infty} \frac{(2\gamma)^q}{q!} \sum_{|\mathbf{n}|=q} \binom{q}{\mathbf{n}} \prod_{k=1}^{d} (x_k y_k)^{n_k}$$
$$= \sum_{q=0}^{\infty} \sum_{|\mathbf{n}|=q} \left( \sqrt{a_{q,\mathbf{n}}}\, \exp\{-\gamma\|x\|^2\} \prod_{k=1}^{d} x_k^{n_k} \right) \left( \sqrt{a_{q,\mathbf{n}}}\, \exp\{-\gamma\|y\|^2\} \prod_{k=1}^{d} y_k^{n_k} \right) = \phi(x)^T\phi(y)$$

where $a_{q,\mathbf{n}} = \frac{(2\gamma)^q}{q!}\binom{q}{\mathbf{n}}$, and n = (n_1, n_2, ..., n_d), with |n| = n_1 + n_2 + ... + n_d = q. The mapping into feature space corresponds to the function φ: R^d → R^∞

$$\phi(x) = \left( \ldots, \sqrt{\frac{(2\gamma)^q}{q!}\binom{q}{\mathbf{n}}}\; \exp\{-\gamma\|x\|^2\} \prod_{k=1}^{d} x_k^{n_k}, \ldots \right)^T$$
with the dimensions ranging over all degrees q = 0,…,∞, and with the variable n = (n1 , . . . , nd ) ranging over all possible assignments such that |n| = q for each value of q. Because φ maps the input space into an infinite dimensional feature space, we obviously cannot explicitly transform x into φ(x), yet computing the Gaussian kernel K(x,y) is straightforward.
5.3 BASIC KERNEL OPERATIONS IN FEATURE SPACE
Let us look at some of the basic data analysis tasks that can be performed solely via
kernels, without instantiating φ(x).

Norm of a Point
We can compute the norm of a point φ(x) in feature space as follows: ∥φ(x)∥2 = φ(x)Tφ(x) = K(x,x)
which implies that ∥φ (x)∥ = √K(x, x).
Distance between Points
The distance between two points φ(xi) and φ(xj) can be computed as
$$\|\phi(x_i) - \phi(x_j)\|^2 = \|\phi(x_i)\|^2 + \|\phi(x_j)\|^2 - 2\phi(x_i)^T\phi(x_j)$$
$$= K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j) \tag{5.11}$$

which implies that

$$\delta\big(\phi(x_i), \phi(x_j)\big) = \|\phi(x_i) - \phi(x_j)\| = \sqrt{K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)}$$

Rearranging Eq. (5.11), we can see that the kernel value can be considered as a measure of the similarity between two points, as

$$\frac{1}{2}\left( \|\phi(x_i)\|^2 + \|\phi(x_j)\|^2 - \|\phi(x_i) - \phi(x_j)\|^2 \right) = K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$$

Thus, the greater the distance ||φ(x_i) − φ(x_j)|| between the two points in feature space, the smaller the kernel value, that is, the smaller the similarity.
Example 5.9. Consider the two points x_1 and x_2 in Figure 5.1:

x_1 = (5.9, 3)^T, x_2 = (6.9, 3.1)^T

Assuming the homogeneous quadratic kernel, the norm of φ(x_1) can be computed as

$$\|\phi(x_1)\|^2 = K(x_1, x_1) = (x_1^T x_1)^2 = 43.81^2 = 1919.32$$

which implies that the norm of the transformed point is ||φ(x_1)|| = √(43.81²) = 43.81.

The distance between φ(x_1) and φ(x_2) in feature space is given as

$$\delta\big(\phi(x_1), \phi(x_2)\big) = \sqrt{K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2)}$$
$$= \sqrt{1919.32 + 3274.13 - 2\cdot 2501} = \sqrt{191.45} = 13.84$$
Mean in Feature Space
The mean of the points in feature space is given as

$$\mu_\phi = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)$$
Because we do not, in general, have access to φ(xi), we cannot explicitly compute the mean point in feature space.
Nevertheless, we can compute the squared norm of the mean as follows:

$$\|\mu_\phi\|^2 = \mu_\phi^T \mu_\phi = \left( \frac{1}{n}\sum_{i=1}^{n}\phi(x_i) \right)^T \left( \frac{1}{n}\sum_{j=1}^{n}\phi(x_j) \right) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\phi(x_i)^T\phi(x_j) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} K(x_i, x_j) \tag{5.12}$$

The above derivation implies that the squared norm of the mean in feature space is simply the average of the values in the kernel matrix K.
Example 5.10. Consider the five points from Example 5.3, also shown in Figure 5.1. Example 5.4 showed the norm of the mean for the linear kernel. Let us consider the Gaussian kernel with σ = 1. The Gaussian kernel matrix is given as
$$K = \begin{pmatrix} 1.00 & 0.60 & 0.78 & 0.42 & 0.72 \\ 0.60 & 1.00 & 0.94 & 0.07 & 0.44 \\ 0.78 & 0.94 & 1.00 & 0.13 & 0.65 \\ 0.42 & 0.07 & 0.13 & 1.00 & 0.23 \\ 0.72 & 0.44 & 0.65 & 0.23 & 1.00 \end{pmatrix}$$

The squared norm of the mean in feature space is therefore

$$\|\mu_\phi\|^2 = \frac{1}{25}\sum_{i=1}^{5}\sum_{j=1}^{5} K(x_i, x_j) = \frac{14.98}{25} = 0.599$$

which implies that ||μ_φ|| = √0.599 = 0.774.

Total Variance in Feature Space

Let us first derive a formula for the squared distance of a point φ(x_i) to the mean μ_φ in feature space:

$$\|\phi(x_i) - \mu_\phi\|^2 = \|\phi(x_i)\|^2 - 2\phi(x_i)^T\mu_\phi + \|\mu_\phi\|^2$$
$$= K(x_i, x_i) - \frac{2}{n}\sum_{j=1}^{n} K(x_i, x_j) + \frac{1}{n^2}\sum_{a=1}^{n}\sum_{b=1}^{n} K(x_a, x_b)$$

The total variance [Eq. (1.4)] in feature space is obtained by taking the average squared deviation of points from the mean in feature space:

$$\sigma_\phi^2 = \frac{1}{n}\sum_{i=1}^{n} \|\phi(x_i) - \mu_\phi\|^2$$
$$= \frac{1}{n}\sum_{i=1}^{n}\left( K(x_i, x_i) - \frac{2}{n}\sum_{j=1}^{n} K(x_i, x_j) + \frac{1}{n^2}\sum_{a=1}^{n}\sum_{b=1}^{n} K(x_a, x_b) \right)$$
$$= \frac{1}{n}\sum_{i=1}^{n} K(x_i, x_i) - \frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} K(x_i, x_j) + \frac{n}{n^3}\sum_{a=1}^{n}\sum_{b=1}^{n} K(x_a, x_b)$$
$$= \frac{1}{n}\sum_{i=1}^{n} K(x_i, x_i) - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} K(x_i, x_j) \tag{5.13}$$
In other words, the total variance in feature space is given as the difference between the average of the diagonal entries and the average of the entire kernel matrix K. Also notice that by Eq. (5.12) the second term is simply ||μ_φ||².
Example 5.11. Continuing Example 5.10, the total variance in feature space for the five points, for the Gaussian kernel, is given as
$$\sigma_\phi^2 = \frac{1}{n}\sum_{i=1}^{n} K(x_i, x_i) - \|\mu_\phi\|^2 = \frac{1}{5}\times 5 - 0.599 = 0.401$$
The distance between φ(x1) and the mean μφ in feature space is given as
2 25 2 ∥φ(x1)−μφ∥ =K(x1,x1)−5 K(x1,xj)+ μφ
Centering in Feature Space
We can center each point in feature space by subtracting the mean from it, as follows:
φˆ(xi)=φ(xi)−μφ
Because we do not have explicit representation of φ(xi) or μφ, we cannot explicitly center the points. However, we can still compute the centered kernel matrix, that is, the kernel matrix over centered points.
The centered kernel matrix is given as
Kˆ =Kˆ(xi,xj)ni,j=1
where each cell corresponds to the kernel between centered points, that is
Kˆ(xi,xj)=φˆ(xi)Tφˆ(xj) =(φ(xi)−μφ)T(φ(xj)−μφ)
j=1
= 1 − 2 1 + 0.6 + 0.78 + 0.42 + 0.72 + 0.599
5 =1−1.410+0.599=0.189
=φ(xi)Tφ(xj)−φ(xi)Tμφ −φ(xj)Tμφ +μTφμφ
152
Kernel Methods
1 n
φ ( x i ) T φ ( x k ) − n φ ( xj ) T φ ( x k ) + ∥ μ φ ∥ 2
1 n
K(xi , xk ) − n
k=1
1 n
= K ( x i , xj ) − n
k=1
1 n
= K(xi , xj ) − n
k=1
k=1
K(xj , xk ) + n2 K(xa , xb )
1 n n a=1 b=1
In other words, we can compute the centered kernel matrix using only the kernel function. Over all the pairs of points, the centered kernel matrix can be written compactly as follows:
(5.14)
Kˆ =K− 11n×nK− 1K1n×n + 1 1n×nK1n×n n n n2
=I− 11n×nKI− 11n×n nn
where 1n×n is the n × n singular matrix, all of whose entries equal 1.
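Equation (5.14) translates directly into code. A minimal numpy sketch (not from the text):

```python
import numpy as np

def center_kernel(K):
    # K_hat = (I - 1/n 1) K (I - 1/n 1), as in Eq. (5.14)
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C
```

Applied to the linear kernel matrix K = X @ X.T of the five points above, this reproduces the centered kernel matrix of Example 5.12 below (e.g., entry (1, 2) ≈ −0.06).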
Example 5.12. Consider the first five points from the 2-dimensional Iris dataset shown in Figure 5.1a:

  x_1 = (5.9, 3)^T   x_2 = (6.9, 3.1)^T   x_3 = (6.6, 2.9)^T   x_4 = (4.6, 3.2)^T   x_5 = (6, 2.2)^T

Consider the linear kernel matrix shown in Figure 5.1b. We can center it by first computing

                       |  0.8  −0.2  −0.2  −0.2  −0.2 |
                       | −0.2   0.8  −0.2  −0.2  −0.2 |
  I − (1/5) 1_{5×5}  = | −0.2  −0.2   0.8  −0.2  −0.2 |
                       | −0.2  −0.2  −0.2   0.8  −0.2 |
                       | −0.2  −0.2  −0.2  −0.2   0.8 |

The centered kernel matrix [Eq. (5.14)] is given as

                             | 43.81  50.01  47.64  36.74  42.00 |
                             | 50.01  57.22  54.53  41.66  48.22 |
  K̂ = ( I − (1/5) 1_{5×5} ) · | 47.64  54.53  51.97  39.64  45.98 | · ( I − (1/5) 1_{5×5} )
                             | 36.74  41.66  39.64  31.40  34.64 |
                             | 42.00  48.22  45.98  34.64  40.84 |

      |  0.02  −0.06  −0.06   0.18  −0.08 |
      | −0.06   0.86   0.54  −1.19  −0.15 |
    = | −0.06   0.54   0.36  −0.83  −0.01 |
      |  0.18  −1.19  −0.83   2.06  −0.22 |
      | −0.08  −0.15  −0.01  −0.22   0.46 |

To verify that K̂ is the same as the kernel matrix for the centered points, let us first center the points by subtracting the mean μ = (6.0, 2.88)^T. The centered points in feature space are given as

  z_1 = (−0.1, 0.12)^T   z_2 = (0.9, 0.22)^T   z_3 = (0.6, 0.02)^T   z_4 = (−1.4, 0.32)^T   z_5 = (0.0, −0.68)^T

For example, the kernel between φ(z_1) and φ(z_2) is

  φ(z_1)^T φ(z_2) = z_1^T z_2 = −0.09 + 0.03 = −0.06

which matches K̂(x_1, x_2), as expected. The other entries can be verified in a similar manner. Thus, the kernel matrix obtained by centering the data and then computing the kernel is the same as that obtained via Eq. (5.14).
Normalizing in Feature Space
A common form of normalization is to ensure that points in feature space have unit length by replacing φ(x_i) with the corresponding unit vector φ_n(x_i) = φ(x_i)/‖φ(x_i)‖. The dot product in feature space then corresponds to the cosine of the angle between the two mapped points, because

  φ_n(x_i)^T φ_n(x_j) = φ(x_i)^T φ(x_j) / ( ‖φ(x_i)‖ · ‖φ(x_j)‖ ) = cos θ

If the mapped points are both centered and normalized, then a dot product corresponds to the correlation between the two points in feature space.
The normalized kernel matrix, K_n, can be computed using only the kernel function K, as

  K_n(x_i, x_j) = φ(x_i)^T φ(x_j) / ( ‖φ(x_i)‖ · ‖φ(x_j)‖ ) = K(x_i, x_j) / √( K(x_i, x_i) · K(x_j, x_j) )

K_n has all diagonal elements as 1.
Let W denote the diagonal matrix comprising the diagonal elements of K:

  W = diag(K) = diag( K(x_1, x_1), K(x_2, x_2), ..., K(x_n, x_n) )

The normalized kernel matrix can then be expressed compactly as

  K_n = W^{−1/2} · K · W^{−1/2}

where W^{−1/2} is the diagonal matrix defined as W^{−1/2}(x_i, x_i) = 1/√K(x_i, x_i), with all other elements being zero.
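A minimal numpy sketch of this normalization (not from the text; it assumes a precomputed kernel matrix K):

```python
import numpy as np

def normalize_kernel(K):
    # K_n(i, j) = K(i, j) / sqrt(K(i, i) * K(j, j))
    w = np.sqrt(np.diag(K))
    return K / np.outer(w, w)
```

Applying it to the linear kernel matrix of the five Iris points gives the matrix K_n of Example 5.13 below (e.g., entry (1, 2) ≈ 0.9988).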
Example 5.13. Consider the five points and the linear kernel matrix shown in Figure 5.1. We have

  W = diag( 43.81, 57.22, 51.97, 31.40, 40.84 )

The normalized kernel is given as

                              | 1.0000  0.9988  0.9984  0.9906  0.9929 |
                              | 0.9988  1.0000  0.9999  0.9828  0.9975 |
  K_n = W^{−1/2} · K · W^{−1/2} = | 0.9984  0.9999  1.0000  0.9812  0.9980 |
                              | 0.9906  0.9828  0.9812  1.0000  0.9673 |
                              | 0.9929  0.9975  0.9980  0.9673  1.0000 |

The same kernel is obtained if we first normalize the feature vectors to have unit length and then take the dot products. For example, with the linear kernel, the normalized point φ_n(x_1) is given as

  φ_n(x_1) = φ(x_1)/‖φ(x_1)‖ = x_1/‖x_1‖ = (1/√43.81) (5.9, 3)^T = (0.8914, 0.4532)^T

Likewise, we have φ_n(x_2) = (1/√57.22) (6.9, 3.1)^T = (0.9122, 0.4098)^T. Their dot product is

  φ_n(x_1)^T φ_n(x_2) = 0.8914 · 0.9122 + 0.4532 · 0.4098 = 0.9988

which matches K_n(x_1, x_2).
If we start with the centered kernel matrix K̂ from Example 5.12, and then normalize it, we obtain the normalized and centered kernel matrix K̂_n:

        |  1.00  −0.44  −0.61   0.80  −0.77 |
        | −0.44   1.00   0.98  −0.89  −0.24 |
  K̂_n = | −0.61   0.98   1.00  −0.97  −0.03 |
        |  0.80  −0.89  −0.97   1.00  −0.22 |
        | −0.77  −0.24  −0.03  −0.22   1.00 |

As noted earlier, the kernel value K̂_n(x_i, x_j) denotes the correlation between x_i and x_j in feature space, that is, it is the cosine of the angle between the centered points φ(x_i) and φ(x_j).
5.4 KERNELS FOR COMPLEX OBJECTS
We conclude this chapter with some examples of kernels defined for complex data such as strings and graphs. The use of kernels for dimensionality reduction is described in Section 7.3, for clustering in Section 13.2 and Chapter 16, for discriminant analysis in Section 20.2, and for classification in Sections 21.4 and 21.5.
5.4.1 Spectrum Kernel for Strings
Consider text or sequence data defined over an alphabet Σ. The l-spectrum feature map is the mapping φ: Σ* → R^{|Σ|^l} from the set of substrings over Σ to the |Σ|^l-dimensional space representing the number of occurrences of all possible substrings of length l, defined as

  φ(x) = ( ..., #(α), ... )^T,  α ∈ Σ^l

where #(α) is the number of occurrences of the l-length string α in x.
The (full) spectrum map is an extension of the l-spectrum map, obtained by considering all lengths from l = 0 to l = ∞, leading to an infinite dimensional feature map φ: Σ* → R^∞:

  φ(x) = ( ..., #(α), ... )^T,  α ∈ Σ*

where #(α) is the number of occurrences of the string α in x.
The (l-)spectrum kernel between two strings x_i, x_j is simply the dot product between their (l-)spectrum maps:

  K(x_i, x_j) = φ(x_i)^T φ(x_j)

A naive computation of the l-spectrum kernel takes O(|Σ|^l) time. However, for a given string x of length n, the vast majority of the l-length strings have an occurrence count of zero, which can be ignored. The l-spectrum map can be effectively computed in O(n) time for a string of length n (assuming n ≫ l) because there can be at most n − l + 1 substrings of length l, and the l-spectrum kernel can thus be computed in O(n + m) time for any two strings of length n and m, respectively.
The feature map for the (full) spectrum kernel is infinite dimensional, but once again, for a given string x of length n, the vast majority of the strings will have an occurrence count of zero. A straightforward implementation of the spectrum map for a string x of length n can be computed in O(n^2) time because x can have at most \sum_{l=1}^n (n − l + 1) = n(n + 1)/2 distinct nonempty substrings. The spectrum kernel can then be computed in O(n^2 + m^2) time for any two strings of length n and m, respectively. However, a much more efficient computation is enabled via suffix trees (see Chapter 10), with a total time of O(n + m).
Example 5.14. Consider sequences over the DNA alphabet Σ = {A, C, G, T}. Let x_1 = ACAGCAGTA, and let x_2 = AGCAAGCGAG. For l = 3, the feature space has dimensionality |Σ|^l = 4^3 = 64. Nevertheless, we do not have to map the input points into the full feature space; we can compute the reduced 3-spectrum mapping by counting the number of occurrences for only the length 3 substrings that occur in each input sequence, as follows:

  φ(x_1) = (ACA : 1, AGC : 1, AGT : 1, CAG : 2, GCA : 1, GTA : 1)
  φ(x_2) = (AAG : 1, AGC : 2, CAA : 1, CGA : 1, GAG : 1, GCA : 1, GCG : 1)

where the notation α : #(α) denotes that substring α has #(α) occurrences in x_i. We can then compute the dot product by considering only the common substrings, as follows:

  K(x_1, x_2) = 1 × 2 + 1 × 1 = 2 + 1 = 3

The first term in the dot product is due to the substring AGC, and the second is due to GCA, which are the only common length 3 substrings between x_1 and x_2.
The full spectrum can be computed by considering the occurrences of all common substrings over all possible lengths. For x_1 and x_2, the common substrings and their occurrence counts are given as

  α            A   C   G   AG   CA   AGC   GCA   AGCA
  #(α) in x_1  4   2   2   2    2    1     1     1
  #(α) in x_2  4   2   4   3    1    2     1     1

Thus, the full spectrum kernel value is given as

  K(x_1, x_2) = 16 + 4 + 8 + 6 + 2 + 2 + 1 + 1 = 40
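The sparse counting strategy for the l-spectrum kernel is easy to sketch in Python (not the book's code; it only stores substrings that actually occur):

```python
from collections import Counter

def spectrum_map(x, l):
    # Sparse l-spectrum map: counts of all length-l substrings of x
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))

def spectrum_kernel(x, y, l):
    # Dot product over the sparse l-spectrum maps (missing keys count as 0)
    cx, cy = spectrum_map(x, l), spectrum_map(y, l)
    return sum(cnt * cy[s] for s, cnt in cx.items())

print(spectrum_kernel("ACAGCAGTA", "AGCAAGCGAG", 3))  # 3, as in Example 5.14
```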
5.4.2 Diffusion Kernels on Graph Nodes
Let S be some symmetric similarity matrix between nodes of a graph G = (V, E). For instance, S can be the (weighted) adjacency matrix A [Eq. (4.1)] or the Laplacian matrix L = A − Δ (or its negation), where Δ is the degree matrix for an undirected graph G, defined as Δ(i, i) = d_i and Δ(i, j) = 0 for all i ≠ j, and d_i is the degree of node i.
Consider the similarity between any two nodes obtained by summing the product of the similarities over walks of length 2:

  S^(2)(x_i, x_j) = Σ_{a=1}^n S(x_i, x_a) S(x_a, x_j) = S_i^T S_j

where

  S_i = ( S(x_i, x_1), S(x_i, x_2), ..., S(x_i, x_n) )^T

denotes the (column) vector representing the ith row of S (and because S is symmetric, it also denotes the ith column of S). Over all pairs of nodes the similarity matrix over walks of length 2, denoted S^(2), is thus given as the square of the base similarity matrix S:

  S^(2) = S × S = S^2

In general, if we sum up the product of the base similarities over all l-length walks between two nodes, we obtain the l-length similarity matrix S^(l), which is simply the lth power of S, that is,

  S^(l) = S^l
Power Kernels
Even walk lengths lead to positive semidefinite kernels, but odd walk lengths are not guaranteed to do so, unless the base matrix S is itself a positive semidefinite matrix. In particular, K = S2 is a valid kernel. To see this, assume that the ith row of S denotes the feature map for xi , that is, φ (xi ) = Si . The kernel value between any two points is then a dot product in feature space:
  K(x_i, x_j) = S^(2)(x_i, x_j) = S_i^T S_j = φ(x_i)^T φ(x_j)

For a general walk length l, let K = S^l. Consider the eigen-decomposition of S:

  S = U Λ U^T = Σ_{i=1}^n u_i λ_i u_i^T

where U is the orthogonal matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues of S:

  U = ( u_1 | u_2 | ··· | u_n )        Λ = diag( λ_1, λ_2, ..., λ_n )

The eigen-decomposition of K can be obtained as follows:

  K = S^l = ( U Λ U^T )^l = U Λ^l U^T

where we used the fact that eigenvectors of S and S^l are identical, and further that eigenvalues of S^l are given as (λ_i)^l (for all i = 1, ..., n), where λ_i is an eigenvalue of S. For K = S^l to be a positive semidefinite matrix, all its eigenvalues must be non-negative, which is guaranteed for all even walk lengths. Because (λ_i)^l will be negative if l is odd and λ_i is negative, odd walk lengths lead to a positive semidefinite kernel only if S is positive semidefinite.

Exponential Diffusion Kernel
Instead of fixing the walk length a priori, we can obtain a new kernel between nodes of a graph by considering walks of all possible lengths, but by damping the contribution of longer walks, which leads to the exponential diffusion kernel, defined as

  K = Σ_{l=0}^∞ (1/l!) β^l S^l
    = I + βS + (1/2!) β^2 S^2 + (1/3!) β^3 S^3 + ···
    = exp{ βS }          (5.15)

where β is a damping factor, and exp{βS} is the matrix exponential. The series on the right hand side above converges for all β ≥ 0.
Substituting S = U Λ U^T = Σ_{i=1}^n λ_i u_i u_i^T in Eq. (5.15), and utilizing the fact that U U^T = Σ_{i=1}^n u_i u_i^T = I, we have

  K = I + βS + (1/2!) β^2 S^2 + ···
    = Σ_{i=1}^n u_i u_i^T + Σ_{i=1}^n u_i βλ_i u_i^T + Σ_{i=1}^n u_i (1/2!) β^2 λ_i^2 u_i^T + ···
    = Σ_{i=1}^n u_i ( 1 + βλ_i + (1/2!) β^2 λ_i^2 + ··· ) u_i^T
    = Σ_{i=1}^n u_i exp{βλ_i} u_i^T
    = U diag( exp{βλ_1}, exp{βλ_2}, ..., exp{βλ_n} ) U^T          (5.16)

Thus, the eigenvectors of K are the same as those for S, whereas its eigenvalues are given as exp{βλ_i}, where λ_i is an eigenvalue of S. Further, K is symmetric because S is symmetric, and its eigenvalues are real and non-negative because the exponential of a real number is non-negative. K is thus a positive semidefinite kernel matrix. The complexity of computing the diffusion kernel is O(n^3), corresponding to the complexity of computing the eigen-decomposition.
Von Neumann Diffusion Kernel
A related kernel based on powers of S is the von Neumann diffusion kernel, defined as

  K = Σ_{l=0}^∞ β^l S^l          (5.17)

where β ≥ 0. Expanding Eq. (5.17), we have

  K = I + βS + β^2 S^2 + β^3 S^3 + ···
    = I + βS ( I + βS + β^2 S^2 + ··· )
    = I + βSK

Rearranging the terms in the preceding equation, we obtain a closed form expression for the von Neumann kernel:

  K − βSK = I
  (I − βS) K = I
  K = (I − βS)^{−1}          (5.18)

Plugging in the eigen-decomposition S = U Λ U^T, and rewriting I = U U^T, we have

  K = ( U U^T − U (βΛ) U^T )^{−1} = ( U (I − βΛ) U^T )^{−1} = U (I − βΛ)^{−1} U^T

where (I − βΛ)^{−1} is the diagonal matrix whose ith diagonal entry is (1 − βλ_i)^{−1}. The eigenvectors of K and S are identical, but the eigenvalues of K are given as 1/(1 − βλ_i). For K to be a positive semidefinite kernel, all its eigenvalues should be non-negative, which in turn implies that

  (1 − βλ_i)^{−1} ≥ 0
  1 − βλ_i ≥ 0
  β ≤ 1/λ_i

Further, the inverse matrix (I − βΛ)^{−1} exists only if

  det(I − βΛ) = \prod_{i=1}^n (1 − βλ_i) ≠ 0

which implies that β ≠ 1/λ_i for all i. Thus, for K to be a valid kernel, we require that β < 1/λ_i for all i = 1, ..., n. The von Neumann kernel is therefore guaranteed to be positive semidefinite if |β| < 1/ρ(S), where ρ(S) = max_i {|λ_i|} is called the spectral radius of S, defined as the largest eigenvalue of S in absolute value.
Example 5.15. Consider the graph in Figure 5.2. Its adjacency and degree matrices are given as

      | 0 0 1 1 0 |        | 2 0 0 0 0 |
      | 0 0 1 0 1 |        | 0 2 0 0 0 |
  A = | 1 1 0 1 0 |    Δ = | 0 0 3 0 0 |
      | 1 0 1 0 1 |        | 0 0 0 3 0 |
      | 0 1 0 1 0 |        | 0 0 0 0 2 |

Figure 5.2. Graph diffusion kernel.

The negated Laplacian matrix for the graph is therefore

                   | −2   0   1   1   0 |
                   |  0  −2   1   0   1 |
  S = −L = A − Δ = |  1   1  −3   1   0 |
                   |  1   0   1  −3   1 |
                   |  0   1   0   1  −2 |

The eigenvalues of S are as follows:

  λ_1 = 0    λ_2 = −1.38    λ_3 = −2.38    λ_4 = −3.62    λ_5 = −4.62

and the eigenvectors of S are

        u_1     u_2     u_3     u_4     u_5
      | 0.45   −0.63    0.00    0.63    0.00 |
      | 0.45    0.51   −0.60    0.20   −0.37 |
  U = | 0.45   −0.20   −0.37   −0.51    0.60 |
      | 0.45   −0.20    0.37   −0.51   −0.60 |
      | 0.45    0.51    0.60    0.20    0.37 |

Assuming β = 0.2, the exponential diffusion kernel matrix is given as

  K = exp{0.2 S} = U diag( exp{0.2 λ_1}, ..., exp{0.2 λ_5} ) U^T

      | 0.70  0.01  0.14  0.14  0.01 |
      | 0.01  0.70  0.13  0.03  0.14 |
    = | 0.14  0.13  0.59  0.13  0.03 |
      | 0.14  0.03  0.13  0.59  0.13 |
      | 0.01  0.14  0.03  0.13  0.70 |

For the von Neumann diffusion kernel, we have

  (I − 0.2 Λ)^{−1} = diag( 1.00, 0.78, 0.68, 0.58, 0.52 )

For instance, because λ_2 = −1.38, we have 1 − βλ_2 = 1 + 0.2 × 1.38 = 1.28, and therefore the second diagonal entry is (1 − βλ_2)^{−1} = 1/1.28 = 0.78. The von Neumann kernel is given as

                              | 0.75  0.02  0.11  0.11  0.02 |
                              | 0.02  0.74  0.10  0.03  0.11 |
  K = U (I − 0.2 Λ)^{−1} U^T = | 0.11  0.10  0.66  0.10  0.03 |
                              | 0.11  0.03  0.10  0.66  0.10 |
                              | 0.02  0.11  0.03  0.10  0.74 |
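Both diffusion kernels follow directly from the eigen-decomposition of S. A minimal numpy sketch (not from the text) that reproduces the matrices of Example 5.15:

```python
import numpy as np

A = np.array([[0, 0, 1, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0]], dtype=float)
S = A - np.diag(A.sum(axis=1))       # negated Laplacian: S = A - Delta
lam, U = np.linalg.eigh(S)           # eigen-decomposition of the symmetric S

beta = 0.2
K_exp = U @ np.diag(np.exp(beta * lam)) @ U.T        # exponential diffusion kernel, Eq. (5.16)
K_von = U @ np.diag(1.0 / (1.0 - beta * lam)) @ U.T  # von Neumann kernel, Eq. (5.18)
print(np.round(K_exp, 2))
print(np.round(K_von, 2))
```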
5.5 FURTHER READING
Kernel methods have been extensively studied in machine learning and data mining. For an in-depth introduction and more advanced topics see Schölkopf and Smola (2002) and Shawe-Taylor and Cristianini (2004). For applications of kernel methods in bioinformatics see Schölkopf, Tsuda, and Vert (2004).

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press.
Schölkopf, B., Tsuda, K., and Vert, J.-P. (2004). Kernel Methods in Computational Biology. Cambridge, MA: MIT Press.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. New York: Cambridge University Press.
5.6 EXERCISES
Q1. Prove that the dimensionality of the feature space for the inhomogeneous polynomial kernel of degree q is

  m = \binom{d + q}{q}

Q2. Consider the data shown in Table 5.1. Assume the following kernel function: K(x_i, x_j) = ‖x_i − x_j‖^2. Compute the kernel matrix K.

  Table 5.1. Dataset for Q2
  i      x_i
  x_1    (4, 2.9)
  x_2    (2.5, 1)
  x_3    (3.5, 4)
  x_4    (2, 2.1)
Q3. Show that the eigenvectors of S and S^l are identical, and further that the eigenvalues of S^l are given as (λ_i)^l (for all i = 1, ..., n), where λ_i is an eigenvalue of S, and S is some n × n symmetric similarity matrix.

Q4. The von Neumann diffusion kernel is a valid positive semidefinite kernel if |β| < 1/ρ(S), where ρ(S) is the spectral radius of S. Can you derive better bounds for cases when β > 0 and when β < 0?
Q5. Given the three points x1 = (2.5, 1)T , x2 = (3.5, 4)T , and x3 = (2, 2.1)T .
(a) Compute the kernel matrix for the Gaussian kernel assuming that σ 2 = 5.
(b) Compute the distance of the point φ(x1) from the mean in feature space.
(c) Compute the dominant eigenvector and eigenvalue for the kernel matrix
from (a).
CHAPTER 6  High-dimensional Data
In data mining typically the data is very high dimensional, as the number of attributes can easily be in the hundreds or thousands. Understanding the nature of high-dimensional space, or hyperspace, is very important, especially because hyperspace does not behave like the more familiar geometry in two or three dimensions.
6.1 HIGH-DIMENSIONAL OBJECTS
Consider the n × d data matrix

      |        X_1    X_2    ···   X_d  |
      | x_1    x_11   x_12   ···   x_1d |
  D = | x_2    x_21   x_22   ···   x_2d |
      | ...    ...    ...          ...  |
      | x_n    x_n1   x_n2   ···   x_nd |

where each point x_i ∈ R^d and each attribute X_j ∈ R^n.
Hypercube
Let the minimum and maximum values for each attribute X_j be given as

  min(X_j) = min_i { x_ij }        max(X_j) = max_i { x_ij }

The data hyperspace can be considered as a d-dimensional hyper-rectangle, defined as

  R_d = \prod_{j=1}^d [ min(X_j), max(X_j) ]
      = { x = (x_1, x_2, ..., x_d)^T | x_j ∈ [min(X_j), max(X_j)], for j = 1, ..., d }

Assume the data is centered to have mean μ = 0. Let m denote the largest absolute value in D, given as

  m = max_{j=1}^d max_{i=1}^n { |x_ij| }

The data hyperspace can be represented as a hypercube, centered at 0, with all sides of length l = 2m, given as

  H_d(l) = { x = (x_1, x_2, ..., x_d)^T | ∀i, x_i ∈ [−l/2, l/2] }

The hypercube in one dimension, H_1(l), represents an interval, which in two dimensions, H_2(l), represents a square, and which in three dimensions, H_3(l), represents a cube, and so on. The unit hypercube has all sides of length l = 1, and is denoted as H_d(1).
Hypersphere
Assume that the data has been centered, so that μ = 0. Let r denote the largest magnitude among all points:

  r = max_i { ‖x_i‖ }

The data hyperspace can also be represented as a d-dimensional hyperball centered at 0 with radius r, defined as

  B_d(r) = { x | ‖x‖ ≤ r }    or    B_d(r) = { x = (x_1, x_2, ..., x_d)^T | Σ_{j=1}^d x_j^2 ≤ r^2 }

The surface of the hyperball is called a hypersphere, and it consists of all the points exactly at distance r from the center of the hyperball, defined as

  S_d(r) = { x | ‖x‖ = r }    or    S_d(r) = { x = (x_1, x_2, ..., x_d)^T | Σ_{j=1}^d (x_j)^2 = r^2 }

Because the hyperball consists of all the surface and interior points, it is also called a closed hypersphere.
Example 6.1. Consider the 2-dimensional, centered, Iris dataset, plotted in Figure 6.1. The largest absolute value along any dimension is m = 2.06, and the point with the largest magnitude is (2.06, 0.75), with r = 2.19. In two dimensions, the hypercube representing the data space is a square with sides of length l = 2m = 4.12. The hypersphere marking the extent of the space is a circle (shown dashed) with radius r = 2.19.
Figure 6.1. Iris data hyperspace: hypercube (solid; with l = 4.12) and hypersphere (dashed; with r = 2.19).

6.2 HIGH-DIMENSIONAL VOLUMES
Hypercube
The volume of a hypercube with edge length l is given as

  vol(H_d(l)) = l^d

Hypersphere
The volume of a hyperball and its corresponding hypersphere is identical because the volume measures the total content of the object, including all internal space. Consider the well known equations for the volume of a hypersphere in lower dimensions

  vol(S_1(r)) = 2r          (6.1)
  vol(S_2(r)) = πr^2        (6.2)
  vol(S_3(r)) = (4/3)πr^3   (6.3)

As per the derivation in Appendix 6.7, the general equation for the volume of a d-dimensional hypersphere is given as

  vol(S_d(r)) = K_d r^d = ( π^{d/2} / Γ(d/2 + 1) ) r^d          (6.4)

where

  K_d = π^{d/2} / Γ(d/2 + 1)          (6.5)

is a scalar that depends on the dimensionality d, and Γ is the gamma function [Eq. (3.17)], defined as (for α > 0)

  Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx          (6.6)

By direct integration of Eq. (6.6), we have

  Γ(1) = 1    and    Γ(1/2) = √π          (6.7)

The gamma function also has the following property for any α > 1:

  Γ(α) = (α − 1) Γ(α − 1)          (6.8)

For any integer n ≥ 1, we immediately have

  Γ(n) = (n − 1)!          (6.9)

Turning our attention back to Eq. (6.4), when d is even, then d/2 + 1 is an integer, and by Eq. (6.9) we have

  Γ(d/2 + 1) = (d/2)!

and when d is odd, then by Eqs. (6.8) and (6.7), we have

  Γ(d/2 + 1) = (d/2) ((d−2)/2) ((d−4)/2) ··· ((d−(d−1))/2) Γ(1/2) = √π ( d!! / 2^{(d+1)/2} )

where d!! denotes the double factorial (or multifactorial), given as

  d!! = 1 if d = 0 or d = 1,    and    d!! = d · (d−2)!! if d ≥ 2

Putting it all together we have

  Γ(d/2 + 1) = (d/2)!  if d is even
  Γ(d/2 + 1) = √π ( d!! / 2^{(d+1)/2} )  if d is odd          (6.10)

Plugging in values of Γ(d/2 + 1) in Eq. (6.4) gives us the equations for the volume of the hypersphere in different dimensions.
Example 6.2. By Eq. (6.10), we have for d = 1, d = 2 and d = 3:

  Γ(1/2 + 1) = (1/2)√π
  Γ(2/2 + 1) = 1! = 1
  Γ(3/2 + 1) = (3/4)√π

Thus, we can verify that the volume of a hypersphere in one, two, and three dimensions is given as

  vol(S_1(r)) = ( √π / ((1/2)√π) ) r = 2r
  vol(S_2(r)) = ( π / 1 ) r^2 = πr^2
  vol(S_3(r)) = ( π^{3/2} / ((3/4)√π) ) r^3 = (4/3)πr^3

which match the expressions in Eqs. (6.1), (6.2), and (6.3), respectively.
Surface Area  The surface area of the hypersphere can be obtained by differentiating its volume with respect to r, given as

  area(S_d(r)) = d/dr vol(S_d(r)) = ( π^{d/2} / Γ(d/2 + 1) ) d r^{d−1} = ( 2π^{d/2} / Γ(d/2) ) r^{d−1}

We can quickly verify that for two dimensions the surface area of a circle is given as 2πr, and for three dimensions the surface area of a sphere is given as 4πr^2.

Asymptotic Volume  An interesting observation about the hypersphere volume is that as dimensionality increases, the volume first increases up to a point, and then starts to decrease, and ultimately vanishes. In particular, for the unit hypersphere with r = 1,

  lim_{d→∞} vol(S_d(1)) = lim_{d→∞} π^{d/2} / Γ(d/2 + 1) → 0
Example 6.3. Figure 6.2 plots the volume of the unit hypersphere in Eq. (6.4) with increasing dimensionality. We see that initially the volume increases, and achieves the highest volume for d = 5 with vol(S5(1)) = 5.263. Thereafter, the volume drops rapidly and essentially becomes zero by d = 30.
Figure 6.2. Volume of a unit hypersphere.
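The volume formula in Eq. (6.4) is easy to evaluate numerically. A minimal Python sketch (not from the text; it assumes scipy for the gamma function) reproduces the behavior described in Example 6.3:

```python
import numpy as np
from scipy.special import gamma

def sphere_volume(d, r=1.0):
    # vol(S_d(r)) = pi^{d/2} / Gamma(d/2 + 1) * r^d, Eq. (6.4)
    return np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d

print(sphere_volume(2), sphere_volume(3))  # pi and 4*pi/3
print(sphere_volume(5))                    # ~5.263, the maximum over d
print(sphere_volume(30))                   # ~2e-5, essentially zero
```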
6.3 HYPERSPHERE INSCRIBED WITHIN HYPERCUBE
We next look at the space enclosed within the largest hypersphere that can be accommodated within a hypercube (which represents the dataspace). Consider a hypersphere of radius r inscribed in a hypercube with sides of length 2r. When we take the ratio of the volume of the hypersphere of radius r to the hypercube with side length l = 2r, we observe the following trends.
In two dimensions, we have

  vol(S_2(r)) / vol(H_2(2r)) = πr^2 / 4r^2 = π/4 = 78.5%

Thus, an inscribed circle occupies π/4 of the volume of its enclosing square, as illustrated in Figure 6.3a.
In three dimensions, the ratio is given as

  vol(S_3(r)) / vol(H_3(2r)) = (4/3)πr^3 / 8r^3 = π/6 = 52.4%

An inscribed sphere takes up only π/6 of the volume of its enclosing cube, as shown in Figure 6.3b, which is quite a sharp decrease over the 2-dimensional case.

Figure 6.3. Hypersphere inscribed inside a hypercube: in (a) two and (b) three dimensions.

For the general case, as the dimensionality d increases asymptotically, we get

  lim_{d→∞} vol(S_d(r)) / vol(H_d(2r)) = lim_{d→∞} π^{d/2} / ( 2^d Γ(d/2 + 1) ) → 0

This means that as the dimensionality increases, most of the volume of the hypercube is in the "corners," whereas the center is essentially empty. The mental picture that emerges is that high-dimensional space looks like a rolled-up porcupine, as illustrated in Figure 6.4.

Figure 6.4. Conceptual view of high-dimensional space: (a) two, (b) three, (c) four, and (d) higher dimensions. In d dimensions there are 2^d "corners" and 2^{d−1} diagonals. The radius of the inscribed circle accurately reflects the difference between the volume of the hypercube and the inscribed hypersphere in d dimensions.
6.4 VOLUME OF THIN HYPERSPHERE SHELL
Let us now consider the volume of a thin hypersphere shell of width ǫ bounded by an outer hypersphere of radius r, and an inner hypersphere of radius r − ǫ. The volume of the thin shell is given as the difference between the volumes of the two bounding hyperspheres, as illustrated in Figure 6.5.
Let S_d(r, ǫ) denote the thin hypershell of width ǫ. Its volume is given as

  vol(S_d(r, ǫ)) = vol(S_d(r)) − vol(S_d(r − ǫ)) = K_d r^d − K_d (r − ǫ)^d

Figure 6.5. Volume of a thin shell (for ǫ > 0).

Let us consider the ratio of the volume of the thin shell to the volume of the outer sphere:

  vol(S_d(r, ǫ)) / vol(S_d(r)) = ( K_d r^d − K_d (r − ǫ)^d ) / ( K_d r^d ) = 1 − ( 1 − ǫ/r )^d

Example 6.4. For example, for a circle in two dimensions, with r = 1 and ǫ = 0.01 the volume of the thin shell is 1 − (0.99)^2 = 0.0199 ≃ 2%. As expected, in two dimensions, the thin shell encloses only a small fraction of the volume of the original hypersphere. For three dimensions this fraction becomes 1 − (0.99)^3 = 0.0297 ≃ 3%, which is still a relatively small fraction.

Asymptotic Volume
As d increases, in the limit we obtain

  lim_{d→∞} vol(S_d(r, ǫ)) / vol(S_d(r)) = lim_{d→∞} 1 − ( 1 − ǫ/r )^d → 1
That is, almost all of the volume of the hypersphere is contained in the thin shell as d → ∞. This means that in high-dimensional spaces, unlike in lower dimensions, most of the volume is concentrated around the surface (within ǫ) of the hypersphere, and the center is essentially void. In other words, if the data is distributed uniformly in the d-dimensional space, then all of the points essentially lie on the boundary of the space (which is a d − 1 dimensional object). Combined with the fact that most of the hypercube volume is in the corners, we can observe that in high dimensions, data tends to get scattered on the boundary and corners of the space.
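The shell fraction 1 − (1 − ǫ/r)^d is a one-line computation. A minimal Python sketch (not from the text) that reproduces Example 6.4 and the asymptotic behavior:

```python
def shell_fraction(d, eps, r=1.0):
    # vol(S_d(r, eps)) / vol(S_d(r)) = 1 - (1 - eps/r)^d
    return 1.0 - (1.0 - eps / r) ** d

print(shell_fraction(2, 0.01))     # ~0.0199, about 2%
print(shell_fraction(3, 0.01))     # ~0.0297, about 3%
print(shell_fraction(1000, 0.01))  # ~0.99996: nearly all the volume lies in the shell
```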
6.5 DIAGONALS IN HYPERSPACE
Another counterintuitive behavior of high-dimensional spaces deals with the diagonals. Let us assume that we have a d-dimensional hypercube, with origin 0_d = (0_1, 0_2, ..., 0_d)^T, and bounded in each dimension in the range [−1, 1]. Then each "corner" of the hyperspace is a d-dimensional vector of the form (±1_1, ±1_2, ..., ±1_d)^T. Let e_i = (0_1, ..., 1_i, ..., 0_d)^T denote the d-dimensional canonical unit vector in dimension i, and let 1 denote the d-dimensional diagonal vector (1_1, 1_2, ..., 1_d)^T.
Consider the angle θ_d between the diagonal vector 1 and the first axis e_1, in d dimensions:

  cos θ_d = e_1^T 1 / ( ‖e_1‖ ‖1‖ ) = e_1^T 1 / ( √(e_1^T e_1) √(1^T 1) ) = 1 / ( √1 √d ) = 1/√d

Asymptotic Angle
As d increases, the angle between the d-dimensional diagonal vector 1 and the first axis vector e_1 is given as

  lim_{d→∞} cos θ_d = lim_{d→∞} 1/√d → 0

which implies that

  lim_{d→∞} θ_d → π/2 = 90°

Example 6.5. Figure 6.6 illustrates the angle between the diagonal vector 1 and e_1, for d = 2 and d = 3. In two dimensions, we have cos θ_2 = 1/√2, whereas in three dimensions, we have cos θ_3 = 1/√3.

Figure 6.6. Angle between diagonal vector 1 and e_1: in (a) two and (b) three dimensions.
This analysis holds for the angle between the diagonal vector 1_d and any of the d principal axis vectors e_i (i.e., for all i ∈ [1, d]). In fact, the same result holds for any diagonal vector and any principal axis vector (in both directions). This implies that in high dimensions all of the diagonal vectors are perpendicular (or orthogonal) to all the coordinate axes! Because there are 2^d corners in a d-dimensional hyperspace, there are 2^d diagonal vectors from the origin to each of the corners. Because the diagonal vectors in opposite directions define a new axis, we obtain 2^{d−1} new axes, each of which is essentially orthogonal to all of the d principal coordinate axes! Thus, in effect, high-dimensional space has an exponential number of orthogonal "axes." A consequence of this strange property of high-dimensional space is that if there is a point or a group of points, say a cluster of interest, near a diagonal, these points will get projected into the origin and will not be visible in lower dimensional projections.
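A short numerical check of cos θ_d = 1/√d (a sketch, not from the text):

```python
import numpy as np

for d in (2, 3, 10, 100, 10_000):
    cos_theta = 1.0 / np.sqrt(d)                 # angle between diagonal 1 and axis e1
    print(d, np.degrees(np.arccos(cos_theta)))   # approaches 90 degrees as d grows
```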
6.6 DENSITY OF THE MULTIVARIATE NORMAL
Let us consider how, for the standard multivariate normal distribution, the density of points around the mean changes in d dimensions. In particular, consider the probability of a point being within a fraction α > 0, of the peak density at the mean.
For a multivariate normal distribution [Eq. (2.33)], with μ = 0_d (the d-dimensional zero vector), and Σ = I_d (the d × d identity matrix), we have

  f(x) = ( 1 / (√(2π))^d ) exp{ −x^T x / 2 }          (6.11)

At the mean μ = 0_d, the peak density is f(0_d) = 1/(√(2π))^d. Thus, the set of points x with density at least α fraction of the density at the mean, with 0 < α < 1, is given as

  f(x) / f(0) ≥ α

which implies that

  exp{ −x^T x / 2 } ≥ α
  or  x^T x ≤ −2 ln(α)
  and thus  Σ_{i=1}^d (x_i)^2 ≤ −2 ln(α)          (6.12)

It is known that if the random variables X_1, X_2, ..., X_k are independent and identically distributed, and if each variable has a standard normal distribution, then their squared sum X_1^2 + X_2^2 + ··· + X_k^2 follows a χ^2 distribution with k degrees of freedom, denoted as χ_k^2. Because the projection of the standard multivariate normal onto any attribute X_j is a standard univariate normal, we conclude that x^T x = Σ_{i=1}^d (x_i)^2 has a χ^2 distribution with d degrees of freedom. The probability that a point x is within α times the density at the mean can be computed from the χ_d^2 density function using Eq. (6.12), as follows:

  P( f(x)/f(0) ≥ α ) = P( x^T x ≤ −2 ln(α) ) = ∫_0^{−2 ln(α)} f_{χ_d^2}(x^T x) d(x^T x) = F_{χ_d^2}( −2 ln(α) )          (6.13)

where f_{χ_q^2}(x) is the chi-squared probability density function [Eq. (3.16)] with q degrees of freedom:

  f_{χ_q^2}(x) = ( 1 / ( 2^{q/2} Γ(q/2) ) ) x^{q/2 − 1} e^{−x/2}

and F_{χ_q^2}(x) is its cumulative distribution function.
As dimensionality increases, this probability decreases sharply, and eventually tends to zero, that is,

  lim_{d→∞} P( x^T x ≤ −2 ln(α) ) → 0          (6.14)
Thus, in higher dimensions the probability density around the mean decreases very rapidly as one moves away from the mean. In essence the entire probability mass migrates to the tail regions.
Example 6.6. Consider the probability of a point being within 50% of the density at the mean, that is, α = 0.5. From Eq. (6.13) we have

  P( x^T x ≤ −2 ln(0.5) ) = F_{χ_d^2}(1.386)

We can compute the probability of a point being within 50% of the peak density by evaluating the cumulative χ^2 distribution for different degrees of freedom (the number of dimensions). For d = 1, we find that the probability is F_{χ_1^2}(1.386) = 76.1%. For d = 2 the probability decreases to F_{χ_2^2}(1.386) = 50%, and for d = 3 it reduces to 29.12%. Looking at Figure 6.7, we can see that only about 24% of the density is in the tail regions for one dimension, but for two dimensions more than 50% of the density is in the tail regions.
Figure 6.8 plots the χ_d^2 distribution and shows the probability P( x^T x ≤ 1.386 ) for two and three dimensions. This probability decreases rapidly with dimensionality; by d = 10, it decreases to 0.075%, that is, 99.925% of the points lie in the extreme or tail regions.
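Equation (6.13) can be evaluated directly with a chi-squared CDF. A minimal Python sketch (not from the text; it assumes scipy.stats) reproduces the numbers in Example 6.6:

```python
import numpy as np
from scipy.stats import chi2

alpha = 0.5
for d in (1, 2, 3, 10):
    # P(f(x) >= alpha f(0)) = F_{chi^2_d}(-2 ln(alpha)), Eq. (6.13)
    print(d, chi2.cdf(-2 * np.log(alpha), df=d))
# prints ~0.761, 0.500, 0.291, 0.00075
```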
Distance of Points from the Mean
Let us consider the average distance of a point x from the center of the standard multivariate normal. Let r^2 denote the square of the distance of a point x to the center μ = 0, given as

  r^2 = ‖x − 0‖^2 = x^T x = Σ_{i=1}^d x_i^2

Figure 6.7. Density contour for α fraction of the density at the mean: in (a) one and (b) two dimensions.
Figure 6.8. Probability P( x^T x ≤ −2 ln(α) ), with α = 0.5: (a) d = 2 and (b) d = 3.
x^T x follows a χ^2 distribution with d degrees of freedom, which has mean d and variance 2d. It follows that the mean and variance of the random variable r^2 are

  μ_{r^2} = d        σ_{r^2}^2 = 2d

By the central limit theorem, as d → ∞, r^2 is approximately normal with mean d and variance 2d, which implies that r^2 is concentrated about its mean value of d. As a consequence, the distance r of a point x to the center of the standard multivariate normal is likewise approximately concentrated around its mean √d.
Next, to estimate the spread of the distance r around its mean value, we need to derive the standard deviation of r from that of r^2. Assuming that σ_r is much smaller compared to r, then using the fact that d(log r)/dr = 1/r, after rearranging the terms, we have

  dr/r = d(log r) = (1/2) d(log r^2)

Using the fact that d(log r^2)/dr^2 = 1/r^2, and rearranging the terms, we obtain

  dr/r = (1/2) dr^2/r^2

which implies that dr = (1/(2r)) dr^2. Setting the change in r^2 equal to the standard deviation of r^2, we have dr^2 = σ_{r^2} = √(2d), and setting the mean radius r = √d, we have

  σ_r = dr = ( 1/(2√d) ) √(2d) = 1/√2

We conclude that for large d, the radius r (or the distance of a point x from the origin 0) follows a normal distribution with mean √d and standard deviation 1/√2. Nevertheless, the density at the mean distance √d is exponentially smaller than that at the peak density because

  f(x)/f(0) = exp{ −x^T x / 2 } = exp{ −d/2 }

Combined with the fact that the probability mass migrates away from the mean in high dimensions, we have another interesting observation, namely that, whereas the density of the standard multivariate normal is maximized at the center 0, most of the probability mass (the points) is concentrated in a small band around the mean distance of √d from the center.
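This concentration of distances is easy to verify by simulation. A minimal numpy sketch (not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    X = rng.standard_normal((10_000, d))   # samples from the standard multivariate normal
    r = np.linalg.norm(X, axis=1)          # distances from the origin
    # mean(r) should be close to sqrt(d), std(r) close to 1/sqrt(2) ~ 0.707
    print(d, r.mean(), np.sqrt(d), r.std())
```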
6.7 APPENDIX: DERIVATION OF HYPERSPHERE VOLUME
The volume of the hypersphere can be derived via integration using spherical polar coordinates. We consider the derivation in two and three dimensions, and then for a general d.
Volume in Two Dimensions
Figure 6.9. Polar coordinates in two dimensions.
As illustrated in Figure 6.9, in d = 2 dimensions, the point x = (x_1, x_2) ∈ R^2 can be expressed in polar coordinates as follows:

  x_1 = r cos θ_1 = r c_1
  x_2 = r sin θ_1 = r s_1

where r = ‖x‖, and we use the notation cos θ_1 = c_1 and sin θ_1 = s_1 for convenience. The Jacobian matrix for this transformation is given as

  J(θ_1) = | ∂x_1/∂r  ∂x_1/∂θ_1 | = | c_1  −r s_1 |
           | ∂x_2/∂r  ∂x_2/∂θ_1 |   | s_1   r c_1 |

The determinant of the Jacobian matrix is called the Jacobian. For J(θ_1), the Jacobian is given as

  det(J(θ_1)) = r c_1^2 + r s_1^2 = r (c_1^2 + s_1^2) = r          (6.15)

Using the Jacobian in Eq. (6.15), the volume of the hypersphere in two dimensions can be obtained by integration over r and θ_1 (with r > 0, and 0 ≤ θ_1 ≤ 2π):

  vol(S_2(r)) = ∫_r ∫_{θ_1} det(J(θ_1)) dr dθ_1
              = ∫_0^r ∫_0^{2π} r dr dθ_1 = ∫_0^r r dr ∫_0^{2π} dθ_1
              = (r^2/2) · 2π = πr^2
Figure 6.10. Polar coordinates in three dimensions.

Volume in Three Dimensions
As illustrated in Figure 6.10, in d = 3 dimensions, the point x = (x_1, x_2, x_3) ∈ R^3 can be expressed in polar coordinates as follows:

  x_1 = r cos θ_1 cos θ_2 = r c_1 c_2
  x_2 = r cos θ_1 sin θ_2 = r c_1 s_2
  x_3 = r sin θ_1 = r s_1

where r = ‖x‖, and we used the fact that the dotted vector that lies in the X_1–X_2 plane in Figure 6.10 has magnitude r cos θ_1.
The Jacobian matrix is given as

  J(θ_1, θ_2) = | c_1 c_2   −r s_1 c_2   −r c_1 s_2 |
                | c_1 s_2   −r s_1 s_2    r c_1 c_2 |
                | s_1        r c_1        0         |

The Jacobian is then given as

  det(J(θ_1, θ_2)) = s_1 (−r s_1)(c_1) det(J(θ_2)) − r c_1 c_1 c_1 det(J(θ_2))
                   = −r^2 c_1 (s_1^2 + c_1^2) = −r^2 c_1          (6.16)

In computing this determinant we made use of the fact that if a column of a matrix A is multiplied by a scalar s, then the resulting determinant is s det(A). We also relied on the fact that the (3,1)-minor of J(θ_1, θ_2), obtained by deleting row 3 and column 1, is actually J(θ_2) with the first column multiplied by −r s_1 and the second column multiplied by c_1. Likewise, the (3,2)-minor of J(θ_1, θ_2) is J(θ_2) with both the columns multiplied by c_1.
The volume of the hypersphere for d = 3 is obtained via a triple integral with r > 0, −π/2 ≤ θ_1 ≤ π/2, and 0 ≤ θ_2 ≤ 2π:

  vol(S_3(r)) = ∫_r ∫_{θ_1} ∫_{θ_2} det(J(θ_1, θ_2)) dr dθ_1 dθ_2
              = ∫_0^r ∫_{−π/2}^{π/2} ∫_0^{2π} r^2 cos θ_1 dr dθ_1 dθ_2
              = ∫_0^r r^2 dr ∫_{−π/2}^{π/2} cos θ_1 dθ_1 ∫_0^{2π} dθ_2
              = (r^3/3) · 2 · 2π = (4/3)πr^3
Volume in d Dimensions
Before deriving a general expression for the hypersphere volume in d dimensions, let us consider the Jacobian in four dimensions. Generalizing the polar coordinates from three dimensions in Figure 6.10 to four dimensions, we obtain

  x_1 = r cos θ_1 cos θ_2 cos θ_3 = r c_1 c_2 c_3
  x_2 = r cos θ_1 cos θ_2 sin θ_3 = r c_1 c_2 s_3
  x_3 = r cos θ_1 sin θ_2 = r c_1 s_2
  x_4 = r sin θ_1 = r s_1          (6.17)

The Jacobian matrix is given as

  J(θ_1, θ_2, θ_3) = | c_1 c_2 c_3   −r s_1 c_2 c_3   −r c_1 s_2 c_3   −r c_1 c_2 s_3 |
                     | c_1 c_2 s_3   −r s_1 c_2 s_3   −r c_1 s_2 s_3    r c_1 c_2 c_3 |
                     | c_1 s_2       −r s_1 s_2        r c_1 c_2        0             |
                     | s_1            r c_1            0                0             |

Utilizing the Jacobian in three dimensions [Eq. (6.16)], the Jacobian in four dimensions is given as

  det(J(θ_1, θ_2, θ_3)) = s_1 (−r s_1)(c_1)(c_1) det(J(θ_2, θ_3)) − r c_1 (c_1)(c_1)(c_1) det(J(θ_2, θ_3))
                        = r^3 s_1^2 c_1^2 c_2 + r^3 c_1^4 c_2 = r^3 c_1^2 c_2 (s_1^2 + c_1^2) = r^3 c_1^2 c_2

Jacobian in d Dimensions  By induction, we can obtain the d-dimensional Jacobian as follows:

  det(J(θ_1, θ_2, ..., θ_{d−1})) = (−1)^d r^{d−1} c_1^{d−2} c_2^{d−3} ··· c_{d−2}
The volume of the hypersphere is given by the d-dimensional integral with r > 0, −π/2 ≤ θ_i ≤ π/2 for all i = 1, ..., d−2, and 0 ≤ θ_{d−1} ≤ 2π:

  vol(S_d(r)) = ∫_r ∫_{θ_1} ∫_{θ_2} ··· ∫_{θ_{d−1}} det(J(θ_1, θ_2, ..., θ_{d−1})) dr dθ_1 dθ_2 ··· dθ_{d−1}
              = ∫_0^r r^{d−1} dr ∫_{−π/2}^{π/2} c_1^{d−2} dθ_1 ··· ∫_{−π/2}^{π/2} c_{d−2} dθ_{d−2} ∫_0^{2π} dθ_{d−1}          (6.18)

Consider one of the intermediate integrals:

  ∫_{−π/2}^{π/2} (cos θ)^k dθ = 2 ∫_0^{π/2} cos^k θ dθ          (6.19)

Let us substitute u = cos^2 θ; then we have θ = cos^{−1}(u^{1/2}), and the Jacobian is

  J = ∂θ/∂u = −(1/2) u^{−1/2} (1 − u)^{−1/2}          (6.20)

Substituting Eq. (6.20) in Eq. (6.19), we get the new integral:

  2 ∫_0^{π/2} cos^k θ dθ = ∫_0^1 u^{(k−1)/2} (1 − u)^{−1/2} du
                         = B( (k+1)/2, 1/2 ) = Γ((k+1)/2) Γ(1/2) / Γ(k/2 + 1)          (6.21)

where B(α, β) is the beta function, given as

  B(α, β) = ∫_0^1 u^{α−1} (1 − u)^{β−1} du

and it can be expressed in terms of the gamma function [Eq. (6.6)] via the identity

  B(α, β) = Γ(α) Γ(β) / Γ(α + β)

Using the fact that Γ(1/2) = √π, and Γ(1) = 1, plugging Eq. (6.21) into Eq. (6.18), we get

  vol(S_d(r)) = (r^d/d) · ( Γ((d−1)/2) Γ(1/2) / Γ(d/2) ) · ( Γ((d−2)/2) Γ(1/2) / Γ((d−1)/2) ) ··· ( Γ(1) Γ(1/2) / Γ(3/2) ) · 2π
              = ( 2π Γ(1/2)^{d−2} / ( d Γ(d/2) ) ) r^d
              = ( π^{d/2} / Γ(d/2 + 1) ) r^d
which matches the expression in Eq. (6.4).
6.8 FURTHER READING
For an introduction to the geometry of d-dimensional spaces see Kendall (1961) and also Scott (1992, Section 1.5). The derivation of the mean distance for the multivariate normal is from MacKay (2003, p. 130).
Kendall, M. G. (1961). A Course in the Geometry of n Dimensions. New York: Hafner.
MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. New York: Cambridge University Press.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley & Sons.
6.9 EXERCISES
Q1. Given the gamma function in Eq. (6.6), show the following:
(a) Γ(1) = 1
(b) Γ(1/2) = √π
(c) Γ(α) = (α − 1) Γ(α − 1)
Q2. Show that the asymptotic volume of the hypersphere Sd(r) for any value of radius r
eventually tends to zero as d increases.
Q3. The ball with center c ∈ R^d and radius r is defined as

  B_d(c, r) = { x ∈ R^d | δ(x, c) ≤ r }

where δ(x, c) is the distance between x and c, which can be specified using the L_p-norm:

  L_p(x, c) = ( Σ_{i=1}^d |x_i − c_i|^p )^{1/p}

where p ≠ 0 is any real number. The distance can also be specified using the L_∞-norm:

  L_∞(x, c) = max_i { |x_i − c_i| }
Answer the following questions:
(a) For d = 2, sketch the shape of the hyperball inscribed inside the unit square, using
the Lp -distance with p = 0.5 and with center c = (0.5, 0.5)T .
(b) With d = 2 and c = (0.5, 0.5)T , using the L∞ -norm, sketch the shape of the ball of
radius r = 0.25 inside a unit square.
(c) Compute the formula for the maximum distance between any two points in
the unit hypercube in d dimensions, when using the Lp-norm. What is the maximum distance for p = 0.5 when d = 2? What is the maximum distance for the L∞ -norm?
Figure 6.11. For Q4.
Q4. Consider the corner hypercubes of length ǫ ≤ 1 inside a unit hypercube. The 2-dimensional case is shown in Figure 6.11. Answer the following questions:
(a) Let ǫ = 0.1. What is the fraction of the total volume occupied by the corner cubes
in two dimensions?
(b) Derive an expression for the volume occupied by all of the corner hypercubes of
length ǫ < 1 as a function of the dimension d. What happens to the fraction of the
volume in the corners as d → ∞?
(c) What is the fraction of volume occupied by the thin hypercube shell of width ǫ < 1
as a fraction of the total volume of the outer (unit) hypercube, as d → ∞? For example, in two dimensions the thin shell is the space between the outer square (solid) and inner square (dashed).
Q5. Prove Eq. (6.14), that is, lim_{d→∞} P( x^T x ≤ −2 ln(α) ) → 0, for any α ∈ (0, 1) and x ∈ R^d.
Q6. Consider the conceptual view of high-dimensional space shown in Figure 6.4. Derive an expression for the radius of the inscribed circle, so that the area in the spokes accurately reflects the difference between the volume of the hypercube and the inscribed hypersphere in d dimensions. For instance, if the length of a half-diagonal is fixed at 1, then the radius of the inscribed circle is 1/√2 in Figure 6.4a.
Q7. Consider the unit hypersphere (with radius r = 1). Inside the hypersphere inscribe a hypercube (i.e., the largest hypercube you can fit inside the hypersphere). An example in two dimensions is shown in Figure 6.12. Answer the following questions:
Figure 6.12. For Q7.
(a) Derive an expression for the volume of the inscribed hypercube for any given dimensionality d. Derive the expression for one, two, and three dimensions, and then generalize to higher dimensions.
(b) What happens to the ratio of the volume of the inscribed hypercube to the volume of the enclosing hypersphere as d → ∞? Again, give the ratio in one, two and three dimensions, and then generalize.

Q8. Assume that a unit hypercube is given as [0, 1]^d, that is, the range is [0, 1] in each dimension. The main diagonal in the hypercube is defined as the vector from (0, ..., 0, 0) to (1, ..., 1, 1). For example, when d = 2, the main diagonal goes from (0, 0) to (1, 1). On the other hand, the main anti-diagonal is defined as the vector from (1, ..., 1, 0) to (0, ..., 0, 1). For example, for d = 2, the anti-diagonal is from (1, 0) to (0, 1).
(a) Sketch the diagonal and anti-diagonal in d = 3 dimensions, and compute the angle between them.
(b) What happens to the angle between the main diagonal and anti-diagonal as d → ∞? First compute a general expression for the d dimensions, and then take the limit as d → ∞.

Q9. Draw a sketch of a hypersphere in four dimensions.
CHAPTER 7  Dimensionality Reduction
We saw in Chapter 6 that high-dimensional data has some peculiar characteristics, some of which are counterintuitive. For example, in high dimensions the center of the space is devoid of points, with most of the points being scattered along the surface of the space or in the corners. There is also an apparent proliferation of orthogonal axes. As a consequence high-dimensional data can cause problems for data mining and analysis, although in some cases high-dimensionality can help, for example, for nonlinear classification. Nevertheless, it is important to check whether the dimensionality can be reduced while preserving the essential properties of the full data matrix. This can aid data visualization as well as data mining. In this chapter we study methods that allow us to obtain optimal lower-dimensional projections of the data.
7.1 BACKGROUND
Let the data D consist of n points over d attributes, that is, it is an n × d matrix,
given as
      |        X_1    X_2    ···   X_d  |
      | x_1    x_11   x_12   ···   x_1d |
  D = | x_2    x_21   x_22   ···   x_2d |
      | ...    ...    ...          ...  |
      | x_n    x_n1   x_n2   ···   x_nd |
Each point x_i = (x_i1, x_i2, ..., x_id)^T is a vector in the ambient d-dimensional vector space spanned by the d standard basis vectors e_1, e_2, ..., e_d, where e_i corresponds to the ith attribute X_i. Recall that the standard basis is an orthonormal basis for the data space, that is, the basis vectors are pairwise orthogonal, e_i^T e_j = 0, and have unit length ‖e_i‖ = 1.
As such, given any other set of d orthonormal vectors u_1, u_2, ..., u_d, with u_i^T u_j = 0 and ‖u_i‖ = 1 (or u_i^T u_i = 1), we can re-express each point x as the linear combination

  x = a_1 u_1 + a_2 u_2 + ··· + a_d u_d          (7.1)

where the vector a = (a_1, a_2, ..., a_d)^T represents the coordinates of x in the new basis.
The above linear combination can also be expressed as a matrix multiplication:
  x = Ua          (7.2)

where U is the d × d matrix, whose ith column comprises the ith basis vector u_i:

  U = ( u_1 | u_2 | ··· | u_d )

The matrix U is an orthogonal matrix, whose columns, the basis vectors, are orthonormal, that is, they are pairwise orthogonal and have unit length:

  u_i^T u_j = 1 if i = j, and 0 if i ≠ j

Because U is orthogonal, this means that its inverse equals its transpose:

  U^{−1} = U^T

which implies that U^T U = I, where I is the d × d identity matrix.
Multiplying Eq. (7.2) on both sides by U^T yields the expression for computing the coordinates of x in the new basis:

  U^T x = U^T U a
  a = U^T x          (7.3)
Example 7.1. Figure 7.1a shows the centered Iris dataset, with n = 150 points, in the d = 3 dimensional space comprising the sepal length (X_1), sepal width (X_2), and petal length (X_3) attributes. The space is spanned by the standard basis vectors

  e_1 = (1, 0, 0)^T        e_2 = (0, 1, 0)^T        e_3 = (0, 0, 1)^T

Figure 7.1b shows the same points in the space comprising the new basis vectors

  u_1 = (−0.390, 0.089, −0.916)^T    u_2 = (−0.639, −0.742, 0.200)^T    u_3 = (−0.663, 0.664, 0.346)^T

For example, the new coordinates of the centered point x = (−0.343, −0.754, 0.241)^T can be computed as

              | −0.390   0.089  −0.916 |   | −0.343 |   | −0.154 |
  a = U^T x = | −0.639  −0.742   0.200 | · | −0.754 | = |  0.828 |
              | −0.663   0.664   0.346 |   |  0.241 |   | −0.190 |

One can verify that x can be written as the linear combination

  x = −0.154 u_1 + 0.828 u_2 − 0.190 u_3
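The change of basis a = U^T x is a single matrix–vector product. A minimal numpy sketch (not from the text) using the numbers of Example 7.1:

```python
import numpy as np

# Columns are the new basis vectors u1, u2, u3 from Example 7.1
U = np.array([[-0.390, -0.639, -0.663],
              [ 0.089, -0.742,  0.664],
              [-0.916,  0.200,  0.346]])

x = np.array([-0.343, -0.754, 0.241])  # a centered Iris point
a = U.T @ x                            # coordinates in the new basis, Eq. (7.3)
print(a)                               # ~(-0.154, 0.828, -0.190)
print(U @ a)                           # reconstructs x (up to rounding), Eq. (7.2)
```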
Figure 7.1. Iris data: optimal basis in three dimensions. (a) Original basis; (b) optimal basis.
Because there are potentially infinite choices for the set of orthonormal basis vectors, one natural question is whether there exists an optimal basis, for a suitable notion of optimality. Further, it is often the case that the input dimensionality d is very large, which can cause various problems owing to the curse of dimensionality (see Chapter 6). It is natural to ask whether we can find a reduced dimensionality subspace that still preserves the essential characteristics of the data. That is, we are interested in finding the optimal r-dimensional representation of D, with r ≪ d. In other words, given a point x, and assuming that the basis vectors have been sorted in decreasing order of importance, we can truncate its linear expansion [Eq. (7.1)] to just r terms, to obtain
  x′ = a_1 u_1 + a_2 u_2 + ··· + a_r u_r = \sum_{i=1}^r a_i u_i          (7.4)

Here x′ is the projection of x onto the first r basis vectors, which can be written in matrix notation as follows:

  x′ = ( u_1 | u_2 | ··· | u_r ) (a_1, a_2, ..., a_r)^T = U_r a_r          (7.5)
where U_r is the matrix comprising the first r basis vectors, and a_r is a vector comprising the first r coordinates. Further, because a = U^T x from Eq. (7.3), restricting it to the first r terms, we get

  a_r = U_r^T x          (7.6)

Plugging this into Eq. (7.5), the projection of x onto the first r basis vectors can be compactly written as

  x′ = U_r a_r = U_r U_r^T x = P_r x          (7.7)

where P_r = U_r U_r^T is the orthogonal projection matrix for the subspace spanned by the first r basis vectors. That is, P_r is symmetric and P_r^2 = P_r. This is easy to verify because P_r^T = (U_r U_r^T)^T = U_r U_r^T = P_r, and P_r^2 = (U_r U_r^T)(U_r U_r^T) = U_r U_r^T = P_r, where we use the observation that U_r^T U_r = I_{r×r}, the r × r identity matrix. The projection matrix P_r can also be written as the decomposition

  P_r = U_r U_r^T = \sum_{i=1}^r u_i u_i^T          (7.8)

From Eqs. (7.1) and (7.4), the projection of x onto the remaining dimensions comprises the error vector

  ǫ = \sum_{i=r+1}^d a_i u_i = x − x′

It is worth noting that x′ and ǫ are orthogonal vectors:

  x′^T ǫ = \sum_{i=1}^r \sum_{j=r+1}^d a_i a_j u_i^T u_j = 0

This is a consequence of the basis being orthonormal. In fact, we can make an even stronger statement. The subspace spanned by the first r basis vectors

  S_r = span(u_1, ..., u_r)

and the subspace spanned by the remaining basis vectors

  S_{d−r} = span(u_{r+1}, ..., u_d)

are orthogonal subspaces, that is, all pairs of vectors x ∈ S_r and y ∈ S_{d−r} must be orthogonal. The subspace S_{d−r} is also called the orthogonal complement of S_r.
Example 7.2. Continuing Example 7.1, approximating the centered point x = (−0.343, −0.754, 0.241)^T by using only the first basis vector u_1 = (−0.390, 0.089, −0.916)^T, we have

  x′ = a_1 u_1 = −0.154 u_1 = (0.060, −0.014, 0.141)^T

The projection of x on u_1 could have been obtained directly from the projection matrix

                   |  0.152  −0.035   0.357 |
  P_1 = u_1 u_1^T = | −0.035   0.008  −0.082 |
                   |  0.357  −0.082   0.839 |

That is

  x′ = P_1 x = (0.060, −0.014, 0.141)^T

The error vector is given as

  ǫ = a_2 u_2 + a_3 u_3 = x − x′ = (−0.40, −0.74, 0.10)^T

One can verify that x′ and ǫ are orthogonal, i.e.,

  x′^T ǫ = (0.060, −0.014, 0.141) · (−0.40, −0.74, 0.10)^T = 0
The goal of dimensionality reduction is to seek an r-dimensional basis that gives the best possible approximation x′i over all the points xi ∈ D. Alternatively, we may seek to minimize the error ǫi = xi − x′i over all the points.
7.2 PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a technique that seeks an r-dimensional basis that best captures the variance in the data. The direction with the largest projected variance is called the first principal component. The orthogonal direction that captures the second largest projected variance is called the second principal component, and so on. As we shall see, the direction that maximizes the variance is also the one that minimizes the mean squared error.
7.2.1 Best Line Approximation
We will start with r = 1, that is, the one-dimensional subspace or line u that best approximates D in terms of the variance of the projected points. This will lead to the general PCA technique for the best 1 ≤ r ≤ d dimensional basis for D.
Without loss of generality, we assume that u has magnitude ‖u‖^2 = u^T u = 1; otherwise it is possible to keep on increasing the projected variance by simply increasing the magnitude of u. We also assume that the data has been centered so that it has mean μ = 0.
The projection of x_i on the vector u is given as

  x′_i = ( u^T x_i / u^T u ) u = (u^T x_i) u = a_i u

where the scalar

  a_i = u^T x_i

gives the coordinate of x′_i along u. Note that because the mean point is μ = 0, its coordinate along u is μ_u = 0.
We have to choose the direction u such that the variance of the projected points is maximized. The projected variance along u is given as

  σ_u^2 = (1/n) \sum_{i=1}^n (a_i − μ_u)^2
        = (1/n) \sum_{i=1}^n (u^T x_i)^2
        = (1/n) \sum_{i=1}^n u^T ( x_i x_i^T ) u
        = u^T ( (1/n) \sum_{i=1}^n x_i x_i^T ) u
        = u^T Σ u          (7.9)

where Σ is the covariance matrix for the centered data D.
To maximize the projected variance, we have to solve a constrained optimization problem, namely to maximize σ_u^2 subject to the constraint that u^T u = 1. This can be solved by introducing a Lagrangian multiplier α for the constraint, to obtain the unconstrained maximization problem

  max_u J(u) = u^T Σ u − α (u^T u − 1)          (7.10)

Setting the derivative of J(u) with respect to u to the zero vector, we obtain

  ∂/∂u ( u^T Σ u − α (u^T u − 1) ) = 0
  2 Σ u − 2 α u = 0
  Σ u = α u          (7.11)

This implies that α is an eigenvalue of the covariance matrix Σ, with the associated eigenvector u. Further, taking the dot product with u on both sides of Eq. (7.11) yields

  u^T Σ u = u^T α u

From Eq. (7.9), we then have

  σ_u^2 = α u^T u,    or    σ_u^2 = α          (7.12)

To maximize the projected variance σ_u^2, we should thus choose the largest eigenvalue of Σ. In other words, the dominant eigenvector u_1 specifies the direction of most variance, also called the first principal component, that is, u = u_1. Further, the largest eigenvalue λ_1 specifies the projected variance, that is, σ_u^2 = α = λ_1.
Minimum Squared Error Approach
We now show that the direction that maximizes the projected variance is also the one that minimizes the average squared error. As before, assume that the dataset D has been centered by subtracting the mean from each point. For a point xi ∈ D, let x′i denote its projection along the direction u, and let ǫi = xi − x′i denote the error vector. The mean squared error (MSE) optimization condition is defined as
  MSE(u) = (1/n) \sum_{i=1}^n ‖ǫ_i‖^2          (7.13)
         = (1/n) \sum_{i=1}^n ‖x_i − x′_i‖^2
         = (1/n) \sum_{i=1}^n (x_i − x′_i)^T (x_i − x′_i)
         = (1/n) \sum_{i=1}^n ( ‖x_i‖^2 − 2 x_i^T x′_i + (x′_i)^T x′_i )          (7.14)

Noting that x′_i = (u^T x_i) u, we have

         = (1/n) \sum_{i=1}^n ( ‖x_i‖^2 − 2 x_i^T (u^T x_i) u + (u^T x_i) u^T (u^T x_i) u )
         = (1/n) \sum_{i=1}^n ( ‖x_i‖^2 − 2 (u^T x_i)(x_i^T u) + (u^T x_i)(x_i^T u) u^T u )
         = (1/n) \sum_{i=1}^n ( ‖x_i‖^2 − (u^T x_i)(x_i^T u) )
         = (1/n) \sum_{i=1}^n ‖x_i‖^2 − (1/n) \sum_{i=1}^n u^T ( x_i x_i^T ) u
         = (1/n) \sum_{i=1}^n ‖x_i‖^2 − u^T ( (1/n) \sum_{i=1}^n x_i x_i^T ) u
         = \sum_{i=1}^n ‖x_i‖^2 / n − u^T Σ u          (7.15)
Note that by Eq. (1.4) the total variance of the centered data (i.e., with μ = 0) is given as

  var(D) = (1/n) \sum_{i=1}^n ‖x_i − 0‖^2 = (1/n) \sum_{i=1}^n ‖x_i‖^2

Further, by Eq. (2.28), we have

  var(D) = tr(Σ) = \sum_{i=1}^d σ_i^2

Thus, we may rewrite Eq. (7.15) as

  MSE(u) = var(D) − u^T Σ u = \sum_{i=1}^d σ_i^2 − u^T Σ u

Because the first term, var(D), is a constant for a given dataset D, the vector u that minimizes MSE(u) is thus the same one that maximizes the second term, the projected variance u^T Σ u. Because we know that u_1, the dominant eigenvector of Σ, maximizes the projected variance, we have

  MSE(u_1) = var(D) − u_1^T Σ u_1 = var(D) − u_1^T λ_1 u_1 = var(D) − λ_1          (7.16)

Thus, the principal component u_1, which is the direction that maximizes the projected variance, is also the direction that minimizes the mean squared error.
Example 7.3. Figure 7.2 shows the first principal component, that is, the best one-dimensional approximation, for the three dimensional Iris dataset shown in Figure 7.1a. The covariance matrix for this dataset is given as

      |  0.681  −0.039   1.265 |
  Σ = | −0.039   0.187  −0.320 |
      |  1.265  −0.320   3.092 |

The variance values σ_i^2 for each of the original dimensions are given along the main diagonal of Σ. For example, σ_1^2 = 0.681, σ_2^2 = 0.187, and σ_3^2 = 3.092. The largest eigenvalue of Σ is λ_1 = 3.662, and the corresponding dominant eigenvector is u_1 = (−0.390, 0.089, −0.916)^T. The unit vector u_1 thus maximizes the projected variance, which is given as J(u_1) = α = λ_1 = 3.662. Figure 7.2 plots the principal component u_1. It also shows the error vectors ǫ_i, as thin gray line segments.

Figure 7.2. Best one-dimensional or line approximation.

The total variance of the data is given as

  var(D) = (1/n) \sum_{i=1}^n ‖x_i‖^2 = (1/150) · 594.04 = 3.96

We can also directly obtain the total variance as the trace of the covariance matrix:

  var(D) = tr(Σ) = σ_1^2 + σ_2^2 + σ_3^2 = 0.681 + 0.187 + 3.092 = 3.96

Thus, using Eq. (7.16), the minimum value of the mean squared error is given as

  MSE(u_1) = var(D) − λ_1 = 3.96 − 3.662 = 0.298
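The first principal component is just the dominant eigenvector of the covariance matrix. A minimal numpy sketch (not from the text) using the covariance matrix of Example 7.3:

```python
import numpy as np

# Covariance matrix of the centered 3-d Iris data (Example 7.3)
Sigma = np.array([[ 0.681, -0.039,  1.265],
                  [-0.039,  0.187, -0.320],
                  [ 1.265, -0.320,  3.092]])

evals, evecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
lam1, u1 = evals[-1], evecs[:, -1]     # dominant eigenvalue and eigenvector
print(lam1)                            # ~3.662
print(u1)                              # ~(-0.390, 0.089, -0.916), up to sign
print(np.trace(Sigma) - lam1)          # MSE(u1) = var(D) - lambda_1 ~ 0.298
```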
7.2.2 Best 2-dimensional Approximation
We are now interested in the best two-dimensional approximation to D. As before, assume that D has already been centered, so that μ = 0. We already computed the direction with the most variance, namely u_1, which is the eigenvector corresponding to the largest eigenvalue λ_1 of Σ. We now want to find another direction v, which also maximizes the projected variance, but is orthogonal to u_1. According to Eq. (7.9) the projected variance along v is given as

  σ_v^2 = v^T Σ v

We further require that v be a unit vector orthogonal to u_1, that is,

  v^T u_1 = 0        v^T v = 1

The optimization condition then becomes

  max_v J(v) = v^T Σ v − α (v^T v − 1) − β (v^T u_1 − 0)          (7.17)

Taking the derivative of J(v) with respect to v, and setting it to the zero vector, gives

  2 Σ v − 2 α v − β u_1 = 0          (7.18)

If we multiply on the left by u_1^T we get

  2 u_1^T Σ v − 2 α u_1^T v − β u_1^T u_1 = 0
  2 v^T Σ u_1 − β = 0, which implies that
  β = 2 v^T λ_1 u_1 = 2 λ_1 v^T u_1 = 0

In the derivation above we used the fact that u_1^T Σ v = v^T Σ u_1, and that v is orthogonal to u_1. Plugging β = 0 into Eq. (7.18) gives us

  2 Σ v − 2 α v = 0
  Σ v = α v

This means that v is another eigenvector of Σ. Also, as in Eq. (7.12), we have σ_v^2 = α. To maximize the variance along v, we should choose α = λ_2, the second largest eigenvalue of Σ, with the second principal component being given by the corresponding eigenvector, that is, v = u_2.
Total Projected Variance
Let U2 be the matrix whose columns correspond to the two principal components, given as
| | U2=u1 u2
||
Given the point xi ∈ D its coordinates in the two-dimensional subspace spanned by u1
and u2 can be computed via Eq. (7.6), as follows: a i = U T2 x i
Assume that each point xi ∈ Rd in D has been projected to obtain its coordinates ai ∈ R2 , yielding the new dataset A. Further, because D is assumed to be centered, with μ = 0, the coordinates of the projected mean are also zero because UT2 μ = UT2 0 = 0.
The total variance for A is given as

var(A) = (1/n) ∑_{i=1}^n ∥ai − 0∥²
       = (1/n) ∑_{i=1}^n (U2ᵀxi)ᵀ(U2ᵀxi)
       = (1/n) ∑_{i=1}^n xiᵀ(U2U2ᵀ)xi
       = (1/n) ∑_{i=1}^n xiᵀP2xi    (7.19)

where P2 is the orthogonal projection matrix [Eq. (7.8)] given as

P2 = U2U2ᵀ = u1u1ᵀ + u2u2ᵀ

Substituting this into Eq. (7.19), the projected total variance is given as

var(A) = (1/n) ∑_{i=1}^n xiᵀP2xi    (7.20)
       = (1/n) ∑_{i=1}^n xiᵀ(u1u1ᵀ + u2u2ᵀ)xi
       = (1/n) ∑_{i=1}^n (u1ᵀxi)(xiᵀu1) + (1/n) ∑_{i=1}^n (u2ᵀxi)(xiᵀu2)
       = u1ᵀΣu1 + u2ᵀΣu2    (7.21)

Because u1 and u2 are eigenvectors of Σ, we have Σu1 = λ1u1 and Σu2 = λ2u2, so that

var(A) = u1ᵀΣu1 + u2ᵀΣu2 = u1ᵀλ1u1 + u2ᵀλ2u2 = λ1 + λ2    (7.22)

Thus, the sum of the eigenvalues is the total variance of the projected points, and the first two principal components maximize this variance.
Mean Squared Error
We now show that the first two principal components also minimize the mean squared error objective. The mean squared error objective is given as

MSE = (1/n) ∑_{i=1}^n ∥xi − xi′∥²
    = (1/n) ∑_{i=1}^n (∥xi∥² − 2xiᵀxi′ + (xi′)ᵀxi′), using Eq. (7.14)
    = var(D) + (1/n) ∑_{i=1}^n (−2xiᵀP2xi + (P2xi)ᵀP2xi), using Eq. (7.7) that xi′ = P2xi
    = var(D) − (1/n) ∑_{i=1}^n xiᵀP2xi
    = var(D) − var(A), using Eq. (7.20)    (7.23)

Thus, the MSE objective is minimized precisely when the total projected variance var(A) is maximized. From Eq. (7.22), we have

MSE = var(D) − λ1 − λ2
Example 7.4. For the Iris dataset from Example 7.1, the two largest eigenvalues are λ1 = 3.662 and λ2 = 0.239, with the corresponding eigenvectors:

u1 = (−0.390, 0.089, −0.916)ᵀ        u2 = (−0.639, −0.742, 0.200)ᵀ

The projection matrix is given as

P2 = U2U2ᵀ = u1u1ᵀ + u2u2ᵀ

   = [ 0.152  −0.035   0.357 ]   [ 0.408   0.474  −0.128 ]
     [−0.035   0.008  −0.082 ] + [ 0.474   0.551  −0.148 ]
     [ 0.357  −0.082   0.839 ]   [−0.128  −0.148   0.040 ]

   = [ 0.560   0.439   0.229 ]
     [ 0.439   0.558  −0.230 ]
     [ 0.229  −0.230   0.879 ]

Thus, each point xi can be approximated by its projection onto the first two principal components, xi′ = P2xi. Figure 7.3a plots this optimal 2-dimensional subspace spanned by u1 and u2. The error vector εi for each point is shown as a thin line segment. The gray points are behind the 2-dimensional subspace, whereas the white points are in front of it. The total variance captured by the subspace is given as

λ1 + λ2 = 3.662 + 0.239 = 3.901

The mean squared error is given as
MSE=var(D)−λ1 −λ2 =3.96−3.662−0.239=0.059
Figure 7.3b plots a nonoptimal 2-dimensional subspace. As one can see the optimal subspace maximizes the variance, and minimizes the squared error, whereas the nonoptimal subspace captures less variance, and has a high mean squared error value, which can be pictorially seen from the lengths of the error vectors (line segments). In fact, this is the worst possible 2-dimensional subspace; its MSE is 3.662.
Figure 7.3. Best two-dimensional approximation: (a) optimal basis; (b) nonoptimal basis.

7.2.3 Best r-dimensional Approximation
We are now interested in the best r-dimensional approximation to D, where 2 < r ≤ d. Assume that we have already computed the first j − 1 principal components or eigenvectors, u1, u2, ..., uj−1, corresponding to the j − 1 largest eigenvalues of Σ, for 1 ≤ j ≤ r. To compute the jth new basis vector v, we have to ensure that it is normalized to unit length, that is, vᵀv = 1, and is orthogonal to all previous components ui, that is, uiᵀv = 0, for 1 ≤ i < j. As before, the projected variance along v is given as

σv² = vᵀΣv
Combined with the constraints on v, this leads to the following maximization problem with Lagrange multipliers:

max_v J(v) = vᵀΣv − α(vᵀv − 1) − ∑_{i=1}^{j−1} βi(uiᵀv − 0)

Taking the derivative of J(v) with respect to v and setting it to the zero vector gives

2Σv − 2αv − ∑_{i=1}^{j−1} βiui = 0    (7.24)
If we multiply on the left by ukᵀ, for 1 ≤ k < j, we get

2ukᵀΣv − 2αukᵀv − βkukᵀuk − ∑_{i=1, i≠k}^{j−1} βi ukᵀui = 0
2vᵀΣuk − βk = 0
βk = 2vᵀλkuk = 2λkvᵀuk = 0

where we used the fact that Σuk = λkuk, as uk is the eigenvector corresponding to the kth largest eigenvalue λk of Σ. Thus, we find that βi = 0 for all i < j in Eq. (7.24), which implies that

Σv = αv

To maximize the variance along v, we set α = λj, the jth largest eigenvalue of Σ, with v = uj giving the jth principal component.
In summary, to find the best r-dimensional approximation to D, we compute the eigenvalues of Σ. Because Σ is positive semidefinite, its eigenvalues must all be non-negative, and we can thus sort them in decreasing order as follows:

λ1 ≥ λ2 ≥ ··· ≥ λr ≥ λr+1 ≥ ··· ≥ λd ≥ 0

We then select the r largest eigenvalues, and their corresponding eigenvectors, to form the best r-dimensional approximation.

Total Projected Variance
Let Ur be the r-dimensional basis vector matrix

Ur = [u1  u2  ···  ur]

with the projection matrix given as

Pr = UrUrᵀ = ∑_{i=1}^r uiuiᵀ

Let A denote the dataset formed by the coordinates of the projected points in the r-dimensional subspace, that is, ai = Urᵀxi, and let xi′ = Prxi denote the projected point in the original d-dimensional space. Following the derivation for Eqs. (7.19), (7.21), and (7.22), the projected variance is given as

var(A) = (1/n) ∑_{i=1}^n xiᵀPrxi = ∑_{i=1}^r uiᵀΣui = ∑_{i=1}^r λi

Thus, the total projected variance is simply the sum of the r largest eigenvalues of Σ.
Mean Squared Error
Based on the derivation for Eq. (7.23), the mean squared error objective in r dimensions can be written as

MSE = (1/n) ∑_{i=1}^n ∥xi − xi′∥²
    = var(D) − var(A)
    = var(D) − ∑_{i=1}^r uiᵀΣui
    = var(D) − ∑_{i=1}^r λi

The first r principal components maximize the projected variance var(A), and thus they also minimize the MSE.
Total Variance
Note that the total variance of D is invariant to a change in basis vectors. Therefore, we have the following identity:

var(D) = ∑_{i=1}^d σi² = ∑_{i=1}^d λi

Choosing the Dimensionality
Often we may not know how many dimensions, r, to use for a good approximation. One criterion for choosing r is to compute the fraction of the total variance captured by the first r principal components, computed as

f(r) = (λ1 + λ2 + ··· + λr)/(λ1 + λ2 + ··· + λd) = (∑_{i=1}^r λi)/(∑_{i=1}^d λi) = (∑_{i=1}^r λi)/var(D)    (7.25)

Given a certain desired variance threshold, say α, starting from the first principal component, we keep adding additional components, and stop at the smallest value r for which f(r) ≥ α. In other words, we select the fewest number of dimensions such that the subspace spanned by those r dimensions captures at least an α fraction of the total variance. In practice, α is usually set to 0.9 or higher, so that the reduced dataset captures at least 90% of the total variance.
Algorithm 7.1 gives the pseudo-code for the principal component analysis algorithm. Given the input data D ∈ R^{n×d}, it first centers it by subtracting the mean from each point. Next, it computes the eigenvectors and eigenvalues of the covariance matrix Σ. Given the desired variance threshold α, it selects the smallest set of dimensions r that captures at least an α fraction of the total variance. Finally, it computes the coordinates of each point in the new r-dimensional principal component subspace, to yield the new data matrix A ∈ R^{n×r}.
ALGORITHM 7.1. Principal Component Analysis

PCA (D, α):
1  μ = (1/n) ∑_{i=1}^n xi // compute mean
2  Z = D − 1·μᵀ // center the data
3  Σ = (1/n) ZᵀZ // compute covariance matrix
4  (λ1, λ2, ..., λd) = eigenvalues(Σ) // compute eigenvalues
5  U = [u1  u2  ···  ud] = eigenvectors(Σ) // compute eigenvectors
6  f(r) = (∑_{i=1}^r λi)/(∑_{i=1}^d λi), for all r = 1, 2, ..., d // fraction of total variance
7  Choose smallest r so that f(r) ≥ α // choose dimensionality
8  Ur = [u1  u2  ···  ur] // reduced basis
9  A = {ai | ai = Urᵀxi, for i = 1, ..., n} // reduced dimensionality data
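The following is a minimal numpy sketch of Algorithm 7.1; the function name, the synthetic input, and the returned values are illustrative assumptions, not from the text.

import numpy as np

def pca(D, alpha):
    n = len(D)
    Z = D - D.mean(axis=0)                    # center the data (lines 1-2)
    S = (Z.T @ Z) / n                         # covariance matrix (line 3)
    evals, evecs = np.linalg.eigh(S)          # eigen-decomposition (lines 4-5)
    order = np.argsort(evals)[::-1]           # sort eigenvalues decreasingly
    evals, evecs = evals[order], evecs[:, order]
    f = np.cumsum(evals) / evals.sum()        # fraction of total variance (line 6)
    r = int(np.searchsorted(f, alpha) + 1)    # smallest r with f(r) >= alpha (line 7)
    Ur = evecs[:, :r]                         # reduced basis (line 8)
    A = Z @ Ur                                # reduced-dimensionality data (line 9)
    return A, Ur, evals[:r]

# Usage on synthetic data; with alpha = 0.95 most of the variance is retained.
A, Ur, lambdas = pca(np.random.default_rng(1).normal(size=(150, 3)), alpha=0.95)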
Example 7.5. Given the 3-dimensional Iris dataset in Figure 7.1a, its covariance matrix is

Σ = [ 0.681  −0.039   1.265 ]
    [−0.039   0.187  −0.320 ]
    [ 1.265  −0.320   3.092 ]

The eigenvalues and eigenvectors of Σ are given as

λ1 = 3.662        λ2 = 0.239        λ3 = 0.059
u1 = (−0.390, 0.089, −0.916)ᵀ    u2 = (−0.639, −0.742, 0.200)ᵀ    u3 = (−0.663, 0.664, 0.346)ᵀ
The total variance is therefore λ1 + λ2 + λ3 = 3.662 + 0.239 + 0.059 = 3.96. The optimal 3-dimensional basis is shown in Figure 7.1b.
To find a lower dimensional approximation, let α = 0.95. The fraction of total variance for different values of r is given as

r       1       2       3
f(r)    0.925   0.985   1.0

For example, for r = 1, the fraction of total variance is given as f(1) = 3.662/3.96 = 0.925. Thus, we need at least r = 2 dimensions to capture 95% of the total variance. This optimal 2-dimensional subspace is shown as the shaded plane in Figure 7.3a. The reduced dimensionality dataset A is shown in Figure 7.4. It consists of the point coordinates ai = U2ᵀxi in the new 2-dimensional principal components basis comprising u1 and u2.
Figure 7.4. Reduced dimensionality dataset: Iris principal components.

7.2.4 Geometry of PCA
Geometrically, when r = d, PCA corresponds to an orthogonal change of basis, so that the total variance is captured by the sum of the variances along each of the principal directions u1, u2, ..., ud, and further, all covariances are zero. This can be seen by looking at the collective action of the full set of principal components, which can be arranged in the d × d orthogonal matrix
U = [u1  u2  ···  ud]
with U−1 = UT.
Each principal component ui corresponds to an eigenvector of the covariance matrix Σ, that is,

Σui = λiui    for all 1 ≤ i ≤ d

which can be written compactly in matrix notation as follows:

Σ[u1  u2  ···  ud] = [λ1u1  λ2u2  ···  λdud]
ΣU = UΛ    (7.26)

where Λ = diag(λ1, λ2, ..., λd) is the diagonal matrix of eigenvalues.
If we multiply Eq. (7.26) on the left by U⁻¹ = Uᵀ we obtain

UᵀΣU = UᵀUΛ = Λ

This means that if we change the basis to U, we change the covariance matrix Σ to a similar matrix Λ, which in fact is the covariance matrix in the new basis. The fact that Λ is diagonal confirms that after the change of basis, all of the covariances vanish, and we are left with only the variances along each of the principal components, with the variance along each new direction ui being given by the corresponding eigenvalue λi.
It is worth noting that in the new basis, the equation

xᵀΣ⁻¹x = 1    (7.27)

defines a d-dimensional ellipsoid (or hyper-ellipse). The eigenvectors ui of Σ, that is, the principal components, are the directions for the principal axes of the ellipsoid. The square roots of the eigenvalues, that is, √λi, give the lengths of the semi-axes.
Multiplying Eq. (7.26) on the right by U⁻¹ = Uᵀ, we have

Σ = UΛUᵀ    (7.28)

Assuming that Σ is invertible or nonsingular, we have

Σ⁻¹ = (UΛUᵀ)⁻¹ = (Uᵀ)⁻¹Λ⁻¹U⁻¹ = UΛ⁻¹Uᵀ

where Λ⁻¹ = diag(1/λ1, 1/λ2, ..., 1/λd).
Substituting Σ⁻¹ in Eq. (7.27), and using the fact that x = Ua from Eq. (7.2), where a = (a1, a2, ..., ad)ᵀ represents the coordinates of x in the new basis, we get

xᵀΣ⁻¹x = 1
aᵀUᵀ(UΛ⁻¹Uᵀ)Ua = 1
aᵀΛ⁻¹a = 1
∑_{i=1}^d ai²/λi = 1

which is precisely the equation for an ellipse centered at 0, with semi-axes lengths √λi. Thus xᵀΣ⁻¹x = 1, or equivalently aᵀΛ⁻¹a = 1 in the new principal components basis, defines an ellipsoid in d dimensions, where the semi-axes lengths equal the standard deviations √λi (the square root of the variance) along each axis. Likewise, the equation xᵀΣ⁻¹x = s, or equivalently aᵀΛ⁻¹a = s, for different values of the scalar s, represents concentric ellipsoids.
Example 7.6. Figure 7.5b shows the ellipsoid xᵀΣ⁻¹x = aᵀΛ⁻¹a = 1 in the new principal components basis. Each semi-axis length corresponds to the standard deviation √λi along that axis. Because all pairwise covariances are zero in the principal components basis, the ellipsoid is axis-parallel, that is, each of its axes coincides with a basis vector.
Figure 7.5. Iris data: standard and principal components basis in three dimensions: (a) elliptic contours in standard basis; (b) axis-parallel ellipsoid in principal components basis.
On the other hand, in the original standard d-dimensional basis for D, the ellipsoid will not be axis-parallel, as shown by the contours of the ellipsoid in Figure 7.5a. Here the semi-axis lengths correspond to half the value range in each direction; the length was chosen so that the ellipsoid encompasses most of the points.

7.3 KERNEL PRINCIPAL COMPONENT ANALYSIS

Principal component analysis can be extended to find nonlinear "directions" in the data using kernel methods. Kernel PCA finds the directions of most variance in the feature space instead of the input space. That is, instead of trying to find linear combinations of the input dimensions, kernel PCA finds linear combinations in the high-dimensional feature space obtained as some nonlinear transformation of the input dimensions. Thus, the linear principal components in the feature space correspond to nonlinear directions in the input space. As we shall see, using the kernel trick, all operations can be carried out in terms of the kernel function in input space, without having to transform the data into feature space.
Example 7.7. Consider the nonlinear Iris dataset shown in Figure 7.6, obtained via a nonlinear transformation applied on the centered Iris data. In particular, the sepal length (A1 ) and sepal width attributes (A2 ) were transformed as follows:
X1 = 0.2A1² + A2² + 0.1A1A2        X2 = A2
The points show a clear quadratic (nonlinear) relationship between the two variables. Linear PCA yields the following two directions of most variance:
λ1 = 0.197        λ2 = 0.087
u1 = (0.301, 0.953)ᵀ        u2 = (−0.953, 0.301)ᵀ

These two principal components are illustrated in Figure 7.6. Also shown in the figure are lines of constant projections onto the principal components, that is, the set of all points in the input space that have the same coordinates when projected onto u1 and u2, respectively. For instance, the lines of constant projections in Figure 7.6a correspond to the solutions of u1ᵀx = s for different values of the coordinate s. Figure 7.7 shows the coordinates of each point in the principal components space comprising u1 and u2. It is clear from the figures that u1 and u2 do not fully capture the nonlinear relationship between X1 and X2. We shall see later in this section that kernel PCA is able to capture this dependence better.

Figure 7.6. Nonlinear Iris dataset: PCA in input space: (a) λ1 = 0.197; (b) λ2 = 0.087.

Figure 7.7. Projection onto principal components.
Let φ correspond to a mapping from the input space to the feature space. Each point in feature space is given as the image φ(xi) of the point xi in input space. In the input space, the first principal component captures the direction with the most projected variance; it is the eigenvector corresponding to the largest eigenvalue of the
covariance matrix. Likewise, in feature space, we can find the first kernel principal component u1 (with uT1 u1 = 1), by solving for the eigenvector corresponding to the largest eigenvalue of the covariance matrix in feature space:
Σφ u1 = λ1 u1    (7.29)
where Σφ, the covariance matrix in feature space, is given as

Σφ = (1/n) ∑_{i=1}^n φ(xi)φ(xi)ᵀ    (7.30)

Here we assume that the points are centered, that is, φ(xi) = φ(xi) − μφ, where μφ is the mean in feature space.
Plugging in the expansion of Σφ from Eq. (7.30) into Eq. (7.29), we get

(1/n) ∑_{i=1}^n φ(xi)φ(xi)ᵀ u1 = λ1 u1
(1/n) ∑_{i=1}^n φ(xi)(φ(xi)ᵀu1) = λ1 u1
∑_{i=1}^n (φ(xi)ᵀu1)/(nλ1) φ(xi) = u1    (7.31)
∑_{i=1}^n ci φ(xi) = u1    (7.32)

where ci = φ(xi)ᵀu1/(nλ1) is a scalar value. From Eq. (7.32) we see that the best direction in the feature space, u1, is just a linear combination of the transformed points, where the scalars ci show the importance of each point toward the direction of most variance. We can now substitute Eq. (7.32) back into Eq. (7.31) to get
(1/n) ∑_{i=1}^n φ(xi)φ(xi)ᵀ (∑_{j=1}^n cjφ(xj)) = λ1 ∑_{i=1}^n ciφ(xi)

(1/n) ∑_{i=1}^n ∑_{j=1}^n cj φ(xi)φ(xi)ᵀφ(xj) = λ1 ∑_{i=1}^n ciφ(xi)

∑_{i=1}^n φ(xi) (∑_{j=1}^n cj φ(xi)ᵀφ(xj)) = nλ1 ∑_{i=1}^n ciφ(xi)
In the preceding equation, we can replace the dot product in feature space, namely φ(xi)Tφ(xj), by the corresponding kernel function in input space, namely K(xi,xj), which yields
∑_{i=1}^n φ(xi) (∑_{j=1}^n cj K(xi, xj)) = nλ1 ∑_{i=1}^n ciφ(xi)    (7.33)
Note that we assume that the points in feature space are centered, that is, we assume
that the kernel matrix K has already been centered using Eq. (5.14):
K = (I − (1/n)1_{n×n}) K (I − (1/n)1_{n×n})
where I is the n × n identity matrix, and 1n×n is the n × n matrix all of whose elements are 1.
We have so far managed to replace one of the dot products with the kernel function. To make sure that all computations in feature space are only in terms of dot products, we can take any point, say φ(xk) and multiply Eq.(7.33) by φ(xk)T on both sides to obtain
∑_{i=1}^n φ(xk)ᵀφ(xi) (∑_{j=1}^n cjK(xi, xj)) = nλ1 ∑_{i=1}^n ci φ(xk)ᵀφ(xi)

∑_{i=1}^n K(xk, xi) (∑_{j=1}^n cjK(xi, xj)) = nλ1 ∑_{i=1}^n ci K(xk, xi)    (7.34)
Further, let Ki denote row i of the centered kernel matrix, written as the column vector

Ki = (K(xi, x1), K(xi, x2), ···, K(xi, xn))ᵀ

Let c denote the column vector of weights

c = (c1, c2, ···, cn)ᵀ

We can plug Ki and c into Eq. (7.34), and rewrite it as

∑_{i=1}^n K(xk, xi) Kiᵀc = nλ1 Kkᵀc
In fact, because we can choose any of the n points, φ(xk), in the feature space, to obtain Eq. (7.34), we have a set of n equations:
∑_{i=1}^n K(x1, xi) Kiᵀc = nλ1 K1ᵀc
∑_{i=1}^n K(x2, xi) Kiᵀc = nλ1 K2ᵀc
    ⋮
∑_{i=1}^n K(xn, xi) Kiᵀc = nλ1 Knᵀc

We can compactly represent all of these n equations as follows:

K²c = nλ1 Kc

where K is the centered kernel matrix. Multiplying by K⁻¹ on both sides, we obtain

K⁻¹K²c = nλ1 K⁻¹Kc
Kc = nλ1 c
Kc = η1 c    (7.35)
where η1 = nλ1. Thus, the weight vector c is the eigenvector corresponding to the largest eigenvalue η1 of the kernel matrix K.
Once c is found, we can plug it back into Eq. (7.32) to obtain the first kernel principal component u1. The only constraint we impose is that u1 should be normalized to be a unit vector, as follows:
u1ᵀu1 = 1
∑_{i=1}^n ∑_{j=1}^n cicj φ(xi)ᵀφ(xj) = 1
cᵀKc = 1

Noting that Kc = η1c from Eq. (7.35), we get

cᵀ(η1c) = 1
η1 cᵀc = 1
∥c∥² = 1/η1

However, because c is an eigenvector of K it will have unit norm. Thus, to ensure that u1 is a unit vector, we have to scale the weight vector c so that its norm is ∥c∥ = 1/√η1, which can be achieved by multiplying c by 1/√η1.
In general, because we do not map the input points into the feature space via φ, it is not possible to directly compute the principal direction, as it is specified in terms of φ(xi), as seen in Eq. (7.32). However, what matters is that we can project any point φ(x) onto the principal direction u1, as follows:

u1ᵀφ(x) = ∑_{i=1}^n ci φ(xi)ᵀφ(x) = ∑_{i=1}^n ci K(xi, x)

which requires only kernel operations. When x = xi is one of the input points, the projection of φ(xi) onto the principal component u1 can be written as the dot product

ai = u1ᵀφ(xi) = Kiᵀc    (7.36)

where Ki is the column vector corresponding to the ith row in the kernel matrix.
Thus, we have shown that all computations, either for the solution of the principal component or for the projection of points, can be carried out using only the kernel function. Finally, we can obtain the additional principal components by solving for the other eigenvalues and eigenvectors of Eq. (7.35). In other words, if we sort the eigenvalues of K in decreasing order η1 ≥ η2 ≥ ··· ≥ ηn ≥ 0, we can obtain the jth principal component as the corresponding eigenvector cj, which has to be normalized so that its norm is ∥cj∥ = 1/√ηj, provided ηj > 0. Also, because ηj = nλj, the variance along the jth principal component is given as λj = ηj/n. Algorithm 7.2 gives the pseudo-code for the kernel PCA method.
ALGORITHM 7.2. Kernel Principal Component Analysis

KERNELPCA (D, K, α):
1  K = {K(xi, xj)}_{i,j=1,...,n} // compute n × n kernel matrix
2  K = (I − (1/n)1_{n×n}) K (I − (1/n)1_{n×n}) // center the kernel matrix
3  (η1, η2, ..., ηd) = eigenvalues(K) // compute eigenvalues
4  (c1, c2, ···, cn) = eigenvectors(K) // compute eigenvectors
5  λi = ηi/n for all i = 1, ..., n // compute variance for each component
6  ci = (1/√ηi)·ci for all i = 1, ..., n // ensure that uiᵀui = 1
7  f(r) = (∑_{i=1}^r λi)/(∑_{i=1}^d λi), for all r = 1, 2, ..., d // fraction of total variance
8  Choose smallest r so that f(r) ≥ α // choose dimensionality
9  Cr = [c1  c2  ···  cr] // reduced basis
10 A = {ai | ai = Crᵀ Ki, for i = 1, ..., n} // reduced dimensionality data
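A minimal numpy sketch of Algorithm 7.2 is given below, using the homogeneous quadratic kernel as in Example 7.8; the function name, the eigenvalue tolerance, and the synthetic input are illustrative assumptions, not from the text.

import numpy as np

def kernel_pca(X, alpha, kernel=lambda xi, xj: (xi @ xj) ** 2):
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # kernel matrix
    C = np.eye(n) - np.ones((n, n)) / n
    K = C @ K @ C                                   # center the kernel matrix
    eta, c = np.linalg.eigh(K)                      # eigen-decomposition of K
    order = np.argsort(eta)[::-1]
    eta, c = eta[order], c[:, order]
    pos = eta > 1e-10                               # keep positive eigenvalues only
    eta, c = eta[pos], c[:, pos]
    lam = eta / n                                   # variance along each component
    c = c / np.sqrt(eta)                            # rescale so that u_i^T u_i = 1
    f = np.cumsum(lam) / lam.sum()                  # fraction of total variance
    r = int(np.searchsorted(f, alpha) + 1)          # choose dimensionality
    A = K @ c[:, :r]                                # a_i = C_r^T K_i for each point
    return A, lam[:r]

A, lam = kernel_pca(np.random.default_rng(2).normal(size=(50, 2)), alpha=0.95)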
Example 7.8. Consider the nonlinear Iris data from Example 7.7 with n = 150 points. Let us use the homogeneous quadratic polynomial kernel in Eq. (5.8):

K(xi, xj) = (xiᵀxj)²

The kernel matrix K has three nonzero eigenvalues:

η1 = 31.0    η2 = 8.94    η3 = 2.76
λ1 = η1/150 = 0.2067    λ2 = η2/150 = 0.0596    λ3 = η3/150 = 0.0184

The corresponding eigenvectors c1, c2, and c3 are not shown because they lie in R¹⁵⁰. Figure 7.8 shows the contour lines of constant projection onto the first three kernel principal components. These lines are obtained by solving the equations uiᵀx = ∑_{j=1}^n cij K(xj, x) = s for different projection values s, for each of the eigenvectors ci = (ci1, ci2, ..., cin)ᵀ of the kernel matrix. For instance, for the first principal component this corresponds to the solutions x = (x1, x2)ᵀ, shown as contour lines, of the following equation:

1.0426x1² + 0.995x2² + 0.914x1x2 = s

for each chosen value of s. The principal components are also not shown in the figure, as it is typically not possible or feasible to map the points into feature space, and thus one cannot derive an explicit expression for ui. However, because the projection onto the principal components can be carried out via kernel operations via Eq. (7.36), Figure 7.9 shows the projection of the points onto the first two kernel principal components, which capture (λ1 + λ2)/(λ1 + λ2 + λ3) = 0.2663/0.2847 = 93.5% of the total variance.
Incidentally, the use of a linear kernel K(xi, xj) = xiᵀxj yields exactly the same principal components as shown in Figure 7.7.
Figure 7.8. Kernel PCA: homogeneous quadratic kernel: (a) λ1 = 0.2067; (b) λ2 = 0.0596; (c) λ3 = 0.0184.

Figure 7.9. Projected point coordinates: homogeneous quadratic kernel.
7.4 SINGULAR VALUE DECOMPOSITION
Principal components analysis is a special case of a more general matrix decomposition method called Singular Value Decomposition (SVD). We saw in Eq. (7.28) that PCA yields the following decomposition of the covariance matrix:
Σ = UΛUᵀ    (7.37)
where the covariance matrix Σ has been factorized into the orthogonal matrix U containing its eigenvectors, and a diagonal matrix Λ containing its eigenvalues (sorted in decreasing order). SVD generalizes the above factorization for any matrix. In particular, for an n × d data matrix D with n points and d columns, SVD factorizes D as follows:

D = LΔRᵀ    (7.38)

where L is an orthogonal n × n matrix, R is an orthogonal d × d matrix, and Δ is an n × d "diagonal" matrix. The columns of L are called the left singular vectors, and the columns of R (or rows of Rᵀ) are called the right singular vectors. The matrix Δ is defined as

Δ(i, j) = δi if i = j, and Δ(i, j) = 0 if i ≠ j

where i = 1, ..., n and j = 1, ..., d. The entries Δ(i, i) = δi along the main diagonal of Δ are called the singular values of D, and they are all non-negative. If the rank of D is r ≤ min(n, d), then there will be only r nonzero singular values, which we assume are ordered as follows:
δ1 ≥δ2 ≥···≥δr >0
One can discard those left and right singular vectors that correspond to zero singular values, to obtain the reduced SVD as

D = LrΔrRrᵀ    (7.39)

where Lr is the n × r matrix of the left singular vectors, Rr is the d × r matrix of the right singular vectors, and Δr is the r × r diagonal matrix containing the positive singular values. The reduced SVD leads directly to the spectral decomposition of D, given as
D = LrΔrRrᵀ
  = [l1  l2  ···  lr] diag(δ1, δ2, ..., δr) [r1  r2  ···  rr]ᵀ
  = δ1l1r1ᵀ + δ2l2r2ᵀ + ··· + δrlrrrᵀ
  = ∑_{i=1}^r δi li riᵀ

The spectral decomposition represents D as a sum of rank-one matrices of the form δiliriᵀ. By selecting the q largest singular values δ1, δ2, ..., δq and the corresponding left and right singular vectors, we obtain the best rank-q approximation to the original matrix D. That is, if Dq is the matrix defined as

Dq = ∑_{i=1}^q δi li riᵀ

then it can be shown that Dq is the rank-q matrix that minimizes the expression

∥D − Dq∥_F

where ∥A∥_F is called the Frobenius norm of the n × d matrix A, defined as

∥A∥_F = √( ∑_{i=1}^n ∑_{j=1}^d A(i, j)² )
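A minimal numpy sketch (not from the text; the synthetic data is an illustrative assumption) of the best rank-q approximation follows: it forms Dq from the top q singular triplets and checks that the Frobenius error equals the square root of the sum of the squared discarded singular values.

import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(size=(8, 5))

L, delta, Rt = np.linalg.svd(D, full_matrices=False)   # D = L @ diag(delta) @ Rt

q = 2
Dq = L[:, :q] @ np.diag(delta[:q]) @ Rt[:q, :]          # D_q = sum of q rank-1 terms

err = np.linalg.norm(D - Dq, ord='fro')
# The error equals the sqrt of the sum of the squared discarded singular values.
print(np.isclose(err, np.sqrt(np.sum(delta[q:] ** 2))))  # True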
7.4.1 Geometry of SVD
In general, any n × d matrix D represents a linear transformation, D : Rd → Rn , from the space of d-dimensional vectors to the space of n-dimensional vectors because for any x ∈ Rd there exists y ∈ Rn such that
Dx=y
The set of all vectors y∈Rn such that Dx=y over all possible x∈Rd is called the column space of D, and the set of all vectors x ∈ Rd , such that DT y = x over all y ∈ Rn , is called the row space of D, which is equivalent to the column space of DT. In other words, the column space of D is the set of all vectors that can be obtained as linear combinations of columns of D, and the row space of D is the set of all vectors that can
be obtained as linear combinations of the rows of D (or columns of DT). Also note that the set of all vectors x ∈ Rd , such that Dx = 0 is called the null space of D, and finally, the set of all vectors y ∈ Rn, such that DTy = 0 is called the left null space of D.
One of the main properties of SVD is that it gives a basis for each of the four fundamental spaces associated with the matrix D. If D has rank r, it means that it has only r independent columns, and also only r independent rows. Thus, the r left singular vectors l1,l2,…,lr corresponding to the r nonzero singular values of D in Eq. (7.38) represent a basis for the column space of D. The remaining n − r left singular vectors lr+1,…,ln represent a basis for the left null space of D. For the row space, the r right singular vectors r1 , r2 , . . . , rr corresponding to the r non-zero singular values, represent a basis for the row space of D, and the remaining d − r right singular vectors rj (j=r+1,…,d),representabasisforthenullspaceofD.
Consider the reduced SVD expression in Eq. (7.39). Right multiplying both sides of the equation by Rr and noting that RrᵀRr = Ir, where Ir is the r × r identity matrix, we have

DRr = LrΔrRrᵀRr
DRr = LrΔr
D[r1  r2  ···  rr] = [δ1l1  δ2l2  ···  δrlr]

From the above, we conclude that

Dri = δili    for all i = 1, ..., r
In other words, SVD is a special factorization of the matrix D, such that any basis vector ri for the row space is mapped to the corresponding basis vector li in the column space, scaled by the singular value δi . As such, we can think of the SVD as a mapping fromanorthonormalbasis(r1,r2,…,rr)inRd (therowspace)toanorthonormalbasis (l1,l2,…,lr) in Rn (the column space), with the corresponding axes scaled according to the singular values δ1,δ2,…,δr.
7.4.2 Connection between SVD and PCA
Assume that the matrix D has been centered, and assume that it has been factorized via SVD [Eq. (7.38)] as D = LΔRᵀ. Consider the scatter matrix for D, given as DᵀD. We have

DᵀD = (LΔRᵀ)ᵀ(LΔRᵀ)
    = RΔᵀLᵀLΔRᵀ
    = R(ΔᵀΔ)Rᵀ
    = RΔd²Rᵀ    (7.40)

where Δd² is the d × d diagonal matrix defined as Δd²(i, i) = δi², for i = 1, ..., d. Only r ≤ min(d, n) of these eigenvalues are positive, whereas the rest are all zero.
Because the covariance matrix of centered D is given as Σ = (1/n)DᵀD, and because it can be decomposed as Σ = UΛUᵀ via PCA [Eq. (7.37)], we have

DᵀD = nΣ = nUΛUᵀ = U(nΛ)Uᵀ    (7.41)

Equating Eq. (7.40) and Eq. (7.41), we conclude that the right singular vectors R are the same as the eigenvectors of Σ. Further, the corresponding singular values of D are related to the eigenvalues of Σ by the expression

nλi = δi², or λi = δi²/n, for i = 1, ..., d    (7.42)

Let us now consider the matrix DDᵀ. We have

DDᵀ = (LΔRᵀ)(LΔRᵀ)ᵀ
    = LΔRᵀRΔᵀLᵀ
    = L(ΔΔᵀ)Lᵀ
    = LΔn²Lᵀ

where Δn² is the n × n diagonal matrix given as Δn²(i, i) = δi², for i = 1, ..., n. Only r of these singular values are positive, whereas the rest are all zero. Thus, the left singular vectors in L are the eigenvectors of the n × n matrix DDᵀ, and the corresponding eigenvalues are given as δi².
Example 7.9. Let us consider the n × d centered Iris data matrix D from Example 7.1, with n = 150 and d = 3. In Example 7.5 we computed the eigenvectors and eigenvalues of the covariance matrix Σ as follows:

λ1 = 3.662        λ2 = 0.239        λ3 = 0.059
u1 = (−0.390, 0.089, −0.916)ᵀ    u2 = (−0.639, −0.742, 0.200)ᵀ    u3 = (−0.663, 0.664, 0.346)ᵀ

Computing the SVD of D yields the following nonzero singular values and the corresponding right singular vectors:

δ1 = 23.437        δ2 = 5.992        δ3 = 2.974
r1 = (−0.390, 0.089, −0.916)ᵀ    r2 = (0.639, 0.742, −0.200)ᵀ    r3 = (−0.663, 0.664, 0.346)ᵀ

We do not show the left singular vectors l1, l2, l3 because they lie in R¹⁵⁰. Using Eq. (7.42) one can verify that λi = δi²/n. For example,

λ1 = δ1²/n = 23.437²/150 = 549.29/150 = 3.662

Notice also that the right singular vectors are equivalent to the principal components or eigenvectors of Σ, up to isomorphism. That is, they may potentially be reversed in direction. For the Iris dataset, we have r1 = u1, r2 = −u2, and r3 = u3. Here the second right singular vector is reversed in sign when compared to the second principal component.
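The SVD-PCA connection is easy to verify numerically. The following is a minimal numpy check (not from the text; it uses synthetic data rather than Iris) that the right singular vectors of the centered data matrix match the eigenvectors of Σ up to sign, and that λi = δi²/n as in Eq. (7.42).

import numpy as np

rng = np.random.default_rng(4)
D = rng.normal(size=(150, 3))
Z = D - D.mean(axis=0)                      # center the data
n = len(Z)

S = (Z.T @ Z) / n                           # covariance matrix
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]              # sort in decreasing order

_, delta, Rt = np.linalg.svd(Z, full_matrices=False)

print(np.allclose(lam, delta ** 2 / n))     # eigenvalues vs singular values
# Columns of R agree with columns of U up to a possible sign flip.
print(np.allclose(np.abs(Rt.T), np.abs(U)))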
7.5 FURTHER READING
Principal component analysis was pioneered in Pearson (1901). For a comprehensive description of PCA see Jolliffe (2002). Kernel PCA was first introduced in Schölkopf, Smola, and Müller (1998). For further exploration of non-linear dimensionality reduction methods see Lee and Verleysen (2007). The requisite linear algebra background can be found in Strang (2006).
Jolliffe, I. (2002). Principal Component Analysis. 2nd ed. Springer Series in Statistics. New York: Springer Science + Business Media.
Lee, J. A. and Verleysen, M. (2007). Nonlinear dimensionality reduction. New York: Springer Science + Business Media.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2 (11): 559–572.
Schölkopf, B., Smola, A. J., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10 (5): 1299–1319.
Strang, G. (2006). Linear Algebra and Its Applications. 4th ed. Independence, KY: Thomson Brooks/Cole, Cengage Learning.
7.6 EXERCISES
Q1. Consider the following data matrix D:

X1    X2
 8   −20
 0    −1
10   −19
10   −20
 2     0

(a) Compute the mean μ and covariance matrix Σ for D.
(b) Compute the eigenvalues of Σ.
(c) What is the "intrinsic" dimensionality of this dataset (discounting some small amount of variance)?
(d) Compute the first principal component.
(e) If the μ and Σ from above characterize the normal distribution from which the points were generated, sketch the orientation/extent of the 2-dimensional normal density function.

Q2. Given the covariance matrix Σ = [[5, 4], [4, 5]], answer the following questions:
(a) Compute the eigenvalues of Σ by solving the equation det(Σ − λI) = 0.
(b) Find the corresponding eigenvectors by solving the equation Σui = λiui.

Q3. Compute the singular values and the left and right singular vectors of the following matrix:

A = [1  1  0]
    [0  0  1]
Q4. Consider the data in Table 7.1. Define the kernel function as follows: K(xi , xj ) = ∥xi − xj ∥2 . Answer the following questions:
(a) Compute the kernel matrix K.
(b) Find the first kernel principal component.
Table 7.1. Dataset for Q4

i     xi
x1    (4, 2.9)
x4    (2.5, 1)
x7    (3.5, 4)
x9    (2, 2.1)

Q5. Given the two points x1 = (1, 2)ᵀ and x2 = (2, 1)ᵀ, use the kernel function

K(xi, xj) = (xiᵀxj)²

to find the kernel principal component, by solving the equation Kc = η1c.
PART TWO FREQUENT PATTERN MINING
CHAPTER 8 Itemset Mining
In many applications one is interested in how often two or more objects of interest co-occur. For example, consider a popular website, which logs all incoming traffic to its site in the form of weblogs. Weblogs typically record the source and destination pages requested by some user, as well as the time, the return code indicating whether the request was successful or not, and so on. Given such weblogs, one might be interested in finding whether there are sets of web pages that many users tend to browse whenever they visit the website. Such "frequent" sets of web pages give clues to user browsing behavior and can be used for improving the browsing experience.
The quest to mine frequent patterns appears in many other domains. The prototypical application is market basket analysis, that is, to mine the sets of items that are frequently bought together at a supermarket by analyzing the customer shopping carts (the so-called “market baskets”). Once we mine the frequent sets, they allow us to extract association rules among the item sets, where we make some statement about how likely are two sets of items to co-occur or to conditionally occur. For example, in the weblog scenario frequent sets allow us to extract rules like, “Users who visit the sets of pages main, laptops and rebates also visit the pages shopping-cart and checkout”, indicating, perhaps, that the special rebate offer is resulting in more laptop sales. In the case of market baskets, we can find rules such as “Customers who buy milk and cereal also tend to buy bananas,” which may prompt a grocery store to co-locate bananas in the cereal aisle. We begin this chapter with algorithms to mine frequent itemsets, and then show how they can be used to extract association rules.
8.1 FREQUENT ITEMSETS AND ASSOCIATION RULES
Itemsets and Tidsets
Let I = {x1 , x2 , . . . , xm } be a set of elements called items. A set X ⊆ I is called an itemset. The set of items I may denote, for example, the collection of all products sold at a supermarket, the set of all web pages at a website, and so on. An itemset of cardinality (or size) k is called a k-itemset. Further, we denote by I(k) the set of all k-itemsets, that is, subsets of I with size k. Let T = {t1,t2,…,tn} be another set of elements called
217
218 Itemset Mining
transaction identifiers or tids. A set T ⊆ T is called a tidset. We assume that itemsets and tidsets are kept sorted in lexicographic order.
A transaction is a tuple of the form ⟨t,X⟩, where t ∈ T is a unique transaction identifier, and X is an itemset. The set of transactions T may denote the set of all customers at a supermarket, the set of all the visitors to a website, and so on. For convenience, we refer to a transaction ⟨t,X⟩ by its identifier t.
Database Representation
A binary database D is a binary relation on the set of tids and items, that is, D ⊆ T × I. We say that tid t ∈ T contains item x ∈ I iff (t, x) ∈ D. In other words, (t, x) ∈ D iff x ∈ X in the tuple ⟨t, X⟩. We say that tid t contains itemset X = {x1, x2, ..., xk} iff (t, xi) ∈ D for all i = 1, 2, ..., k.
For a set X, we denote by 2X the powerset of X, that is, the set of all subsets of X. Let i : 2T → 2I be a function, defined as follows:
i(T) = {x | ∀t ∈ T, t contains x} (8.1)
where T ⊆ T , and i(T) is the set of items that are common to all the transactions in the tidset T. In particular, i(t) is the set of items contained in tid t ∈ T . Note that in this chapter we drop the set notation for convenience (e.g., we write i(t) instead of i({t})). It is sometimes convenient to consider the binary database D, as a transaction database consisting of tuples of the form ⟨t , i(t )⟩, with t ∈ T . The transaction or itemset database can be considered as a horizontal representation of the binary database, where we omit items that are not contained in a given tid.
Let t : 2I → 2T be a function, defined as follows:
t(X) = {t | t ∈ T and t contains X}    (8.2)
where X ⊆ I, and t(X) is the set of tids that contain all the items in the itemset X. In particular, t(x) is the set of tids that contain the single item x ∈ I. It is also sometimes convenient to think of the binary database D, as a tidset database containing a collection of tuples of the form ⟨x,t(x)⟩, with x ∈ I. The tidset database is a vertical representation of the binary database, where we omit tids that do not contain a given item.
Example 8.1. Figure 8.1a shows an example binary database. Here I = {A,B,C,D,E}, and T = {1,2,3,4,5,6}. In the binary database, the cell in row t and column x is 1 iff (t,x) ∈ D, and 0 otherwise. We can see that transaction 1 contains item B, and it also contains the itemset BE, and so on.
Example 8.2. Figure 8.1b shows the corresponding transaction database for the binary database in Figure 8.1a. For instance, the first transaction is ⟨1,{A,B,D,E}⟩, where we omit item C since (1,C) ̸∈ D. Henceforth, for convenience, we drop the set notation for itemsets and tidsets if there is no confusion. Thus, we write ⟨1,{A,B,D,E}⟩ as ⟨1,ABDE⟩.
Figure 8.1. An example database: (a) binary database, (b) transaction database, (c) vertical database.

(a) Binary database D:

D   A  B  C  D  E
1   1  1  0  1  1
2   0  1  1  0  1
3   1  1  0  1  1
4   1  1  1  0  1
5   1  1  1  1  1
6   0  1  1  1  0

(b) Transaction database:

t   i(t)
1   ABDE
2   BCE
3   ABDE
4   ABCE
5   ABCDE
6   BCD

(c) Vertical database:

x   t(x)
A   1 3 4 5
B   1 2 3 4 5 6
C   2 4 5 6
D   1 3 5 6
E   1 2 3 4 5

Figure 8.1c shows the corresponding vertical database for the binary database in Figure 8.1a. For instance, the tuple corresponding to item A, shown in the first column, is ⟨A,{1,3,4,5}⟩, which we write as ⟨A,1345⟩ for convenience; we omit tids 2 and 6 because (2,A) ∉ D and (6,A) ∉ D.

Support and Frequent Itemsets
The support of an itemset X in a dataset D, denoted sup(X,D), is the number of transactions in D that contain X:
sup(X, D) = |{t | ⟨t, i(t)⟩ ∈ D and X ⊆ i(t)}| = |t(X)|
The relative support of X is the fraction of transactions that contain X:
rsup(X, D) = sup(X, D)/|D|
It is an estimate of the joint probability of the items comprising X.
An itemset X is said to be frequent in D if sup(X,D) ≥ minsup, where minsup is a user defined minimum support threshold. When there is no confusion about the database D, we write support as sup(X), and relative support as rsup(X). If minsup is specified as a fraction, then we assume that relative support is implied. We use the set F to denote the set of all frequent itemsets, and F(k) to denote the set of frequent
k-itemsets.
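As a concrete illustration, here is a minimal Python sketch (not from the text) that computes sup(X, D) and rsup(X, D) over the transaction database of Figure 8.1b; the dictionary encoding and helper names are illustrative assumptions.

# Transaction database of Figure 8.1b: tid -> itemset i(t)
D = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
     4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def sup(X, D):
    """Number of transactions whose itemset i(t) contains X."""
    return sum(1 for items in D.values() if set(X) <= items)

def rsup(X, D):
    return sup(X, D) / len(D)

print(sup("BCE", D), rsup("BCE", D))   # 3 and 0.5, so BCE is frequent for minsup = 3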
Example 8.3. Given the example dataset in Figure 8.1, let minsup = 3 (in relative support terms we mean minsup = 0.5). Table 8.1 shows all the 19 frequent itemsets in the database, grouped by their support value. For example, the itemset BCE is contained in tids 2, 4, and 5, so t(BCE) = 245 and sup(BCE) = |t(BCE)| = 3. Thus, BCE is a frequent itemset. The 19 frequent itemsets shown in the table comprise the set F. The sets of all frequent k-itemsets are
F(1) = {A,B,C,D,E}
F(2) = {AB, AD, AE, BC, BD, BE, CE, DE}
F(3) = {ABD, ABE, ADE, BCE, BDE}
F(4) ={ABDE}
Table 8.1. Frequent itemsets with minsup = 3

sup   itemsets
6     B
5     E, BE
4     A, C, D, AB, AE, BC, BD, ABE
3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE

Association Rules
An association rule is an expression X −→ Y, annotated with its support s and confidence c, where X and Y are itemsets and they are disjoint, that is, X, Y ⊆ I, and X ∩ Y = ∅. Let the itemset X ∪ Y be denoted as XY. The support of the rule is the number of transactions in which both X and Y co-occur as subsets:
s = sup(X −→ Y) = |t(XY)| = sup(XY)
The relative support of the rule is defined as the fraction of transactions where X and Y co-occur, and it provides an estimate of the joint probability of X and Y:

rsup(X −→ Y) = sup(XY)/|D| = P(X ∧ Y)
The confidence of a rule is the conditional probability that a transaction contains Y given that it contains X:

c = conf(X −→ Y) = P(Y|X) = P(X ∧ Y)/P(X) = sup(XY)/sup(X)

A rule is frequent if the itemset XY is frequent, that is, sup(XY) ≥ minsup, and a rule is strong if conf ≥ minconf, where minconf is a user-specified minimum confidence threshold.
Example 8.4. Consider the association rule BC −→ E. Using the itemset support values shown in Table 8.1, the support and confidence of the rule are as follows:
s = sup(BC −→ E) = sup(BCE) = 3
c = conf(BC −→ E) = sup(BCE)/sup(BC) = 3/4 = 0.75
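The rule measures follow directly from itemset supports. The sketch below (not from the text) reuses the sup() helper and the database D from the earlier snippet; the function name is an illustrative assumption.

def rule_metrics(X, Y, D):
    s = sup(set(X) | set(Y), D)              # support of the rule = sup(XY)
    c = s / sup(X, D)                        # confidence = sup(XY) / sup(X)
    return s, c

print(rule_metrics("BC", "E", D))            # (3, 0.75), as in Example 8.4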
Itemset and Rule Mining
From the definition of rule support and confidence, we can observe that to generate frequent and high confidence association rules, we need to first enumerate all the frequent itemsets along with their support values. Formally, given a binary database D and a user defined minimum support threshold minsup, the task of frequent itemset mining is to enumerate all itemsets that are frequent, i.e., those that have support at least minsup. Next, given the set of frequent itemsets F and a minimum confidence value minconf, the association rule mining task is to find all frequent and strong rules.
8.2 ITEMSET MINING ALGORITHMS
We begin by describing a naive or brute-force algorithm that enumerates all the possible itemsets X ⊆ I, and for each such subset determines its support in the input dataset D. The method comprises two main steps: (1) candidate generation and (2) support computation.
Candidate Generation
This step generates all the subsets of I, which are called candidates, as each itemset is potentially a candidate frequent pattern. The candidate itemset search space is clearly exponential because there are 2|I| potentially frequent itemsets. It is also instructive to note the structure of the itemset search space; the set of all itemsets forms a lattice structure where any two itemsets X and Y are connected by a link iff X is an immediate subset of Y, that is, X ⊆ Y and |X| = |Y| − 1. In terms of a practical search strategy, the itemsets in the lattice can be enumerated using either a breadth-first (BFS) or depth-first (DFS) search on the prefix tree, where two itemsets X, Y are connected by a link iff X is an immediate subset and prefix of Y. This allows one to enumerate itemsets starting with an empty set, and adding one more item at a time.
Support Computation
This step computes the support of each candidate pattern X and determines if it is frequent. For each transaction ⟨t,i(t)⟩ in the database, we determine if X is a subset of i(t). If so, we increment the support of X.
The pseudo-code for the brute-force method is shown in Algorithm 8.1. It enumerates each itemset X ⊆ I, and then computes its support by checking if X ⊆ i(t) for each t ∈ T .
ALGORITHM 8.1. Algorithm BRUTEFORCE

BRUTEFORCE (D, I, minsup):
1  F ← ∅ // set of frequent itemsets
2  foreach X ⊆ I do
3      sup(X) ← COMPUTESUPPORT (X, D)
4      if sup(X) ≥ minsup then
5          F ← F ∪ (X, sup(X))
6  return F

COMPUTESUPPORT (X, D):
7  sup(X) ← 0
8  foreach ⟨t, i(t)⟩ ∈ D do
9      if X ⊆ i(t) then
10         sup(X) ← sup(X) + 1
11 return sup(X)
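A minimal Python sketch of the brute-force method follows (not from the text); it reuses the sup() helper and database D from the earlier snippet and enumerates every candidate subset of I.

from itertools import combinations

I = "ABCDE"

def brute_force(D, I, minsup):
    F = {}
    for k in range(1, len(I) + 1):
        for X in combinations(I, k):           # candidate generation
            s = sup(X, D)                       # support computation
            if s >= minsup:
                F[frozenset(X)] = s
    return F

F = brute_force(D, I, minsup=3)
print(len(F))                                   # 19 frequent itemsets, as in Table 8.1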
Figure 8.2. Itemset lattice and prefix-based search tree (in bold).
Example 8.5. Figure 8.2 shows the itemset lattice for the set of items I = {A, B, C, D, E}. There are 2^|I| = 2⁵ = 32 possible itemsets including the empty set. The corresponding prefix search tree is also shown (in bold). The brute-force method explores the entire itemset search space, regardless of the minsup threshold employed. If minsup = 3, then the brute-force method would output the set of frequent itemsets shown in Table 8.1.
Computational Complexity
Support computation takes time O(|I| · |D|) in the worst case, and because there are O(2^|I|) possible candidates, the computational complexity of the brute-force method is O(|I| · |D| · 2^|I|). Because the database D can be very large, it is also important to measure the input/output (I/O) complexity. Because we make one complete database scan to compute the support of each candidate, the I/O complexity of BRUTEFORCE is O(2^|I|) database scans. Thus, the brute-force approach is computationally infeasible for even small itemset spaces, whereas in practice I can be very large (e.g., a supermarket carries thousands of items). The approach is impractical from an I/O perspective as well.
We shall see next how to systematically improve on the brute-force approach, by
improving both the candidate generation and support counting steps.
8.2.1 Level-wise Approach: Apriori Algorithm
The brute force approach enumerates all possible itemsets in its quest to determine the frequent ones. This results in a lot of wasteful computation because many of the candidates may not be frequent. Let X,Y ⊆ I be any two itemsets. Note that if X ⊆ Y, then sup(X) ≥ sup(Y), which leads to the following two observations: (1) if X is frequent, then any subset Y ⊆ X is also frequent, and (2) if X is not frequent, then any superset Y ⊇ X cannot be frequent. The Apriori algorithm utilizes these two properties to significantly improve the brute-force approach. It employs a level-wise or breadth-first exploration of the itemset search space, and prunes all supersets of any infrequent candidate, as no superset of an infrequent itemset can be frequent. It also avoids generating any candidate that has an infrequent subset. In addition to improving the candidate generation step via itemset pruning, the Apriori method also significantly improves the I/O complexity. Instead of counting the support for a single itemset, it explores the prefix tree in a breadth-first manner, and computes the support of all the valid candidates of size k that comprise level k in the prefix tree.
Example 8.6. Consider the example dataset in Figure 8.1; let minsup = 3. Figure 8.3 shows the itemset search space for the Apriori method, organized as a prefix tree where two itemsets are connected if one is a prefix and immediate subset of the other. Each node shows an itemset along with its support, thus AC(2) indicates that sup(AC) = 2. Apriori enumerates the candidate patterns in a level-wise manner, as shown in the figure, which also demonstrates the power of pruning the search space via the two Apriori properties. For example, once we determine that AC is infrequent, we can prune any itemset that has AC as a prefix, that is, the entire subtree under AC can be pruned. Likewise for CD. Also, the extension BCD from BC can be pruned, since it has an infrequent subset, namely CD.
Algorithm 8.2 shows the pseudo-code for the Apriori method. Let C(k) denote the prefix tree comprising all the candidate k-itemsets. The method begins by inserting the single items into an initially empty prefix tree to populate C(1). The while loop (lines 5–11) first computes the support for the current set of candidates at level k via the COMPUTESUPPORT procedure that generates k-subsets of each transaction in the database D, and for each such subset it increments the support of the corresponding candidate in C(k) if it exists. This way, the database is scanned only once per level, and the supports for all candidate k-itemsets are incremented during that scan. Next, we remove any infrequent candidate (line 9). The leaves of the prefix tree that survive comprise the set of frequent k-itemsets F(k), which are used to generate the candidate (k + 1)-itemsets for the next level (line 10). The EXTENDPREFIXTREE procedure employs prefix-based extension for candidate generation. Given two frequent k-itemsets Xa and Xb with a common k − 1 length prefix, that is, given two sibling leaf nodes with a common parent, we generate the (k + 1)-length candidate Xab =Xa ∪Xb.Thiscandidateisretainedonlyifithasnoinfrequentsubset.Finally,if a k-itemset Xa has no extension, it is pruned from the prefix tree, and we recursively
prune any of its ancestors with no k-itemset extension, so that in C(k) all leaves are at level k. If new candidates were added, the whole process is repeated for the next level. This process continues until no new candidates are added.

Figure 8.3. Apriori: prefix search tree and effect of pruning. Shaded nodes indicate infrequent itemsets, whereas dashed nodes and lines indicate all of the pruned nodes and branches. Solid lines indicate frequent itemsets.
Example 8.7. Figure 8.4 illustrates the Apriori algorithm on the example dataset from Figure 8.1 using minsup = 3. All the candidates C(1) are frequent (see Figure 8.4a). During extension all the pairwise combinations will be considered, since they all share the empty prefix ∅ as their parent. These comprise the new prefix tree C(2) in Figure 8.4b; because E has no prefix-based extensions, it is removed from the tree. After support computation AC(2) and CD(2) are eliminated (shown in gray) since they are infrequent. The next level prefix tree is shown in Figure 8.4c. The candidate BCD is pruned due to the presence of the infrequent subset CD. All of the candidates at level 3 are frequent. Finally, C(4) (shown in Figure 8.4d) has only one candidate Xab = ABDE, which is generated from Xa = ABD and Xb = ABE because this is the only pair of siblings. The mining process stops after this step, since no more extensions are possible.
The worst-case computational complexity of the Apriori algorithm is still O(|I| · |D| · 2^|I|), as all itemsets may be frequent. In practice, due to the pruning of the search
ALGORITHM 8.2. Algorithm APRIORI

APRIORI (D, I, minsup):
1  F ← ∅
2  C(1) ← {∅} // initial prefix tree with single items
3  foreach i ∈ I do Add i as child of ∅ in C(1) with sup(i) ← 0
4  k ← 1 // k denotes the level
5  while C(k) ≠ ∅ do
6      COMPUTESUPPORT (C(k), D)
7      foreach leaf X ∈ C(k) do
8          if sup(X) ≥ minsup then F ← F ∪ (X, sup(X))
9          else remove X from C(k)
10     C(k+1) ← EXTENDPREFIXTREE (C(k))
11     k ← k + 1
12 return F

COMPUTESUPPORT (C(k), D):
13 foreach ⟨t, i(t)⟩ ∈ D do
14     foreach k-subset X ⊆ i(t) do
15         if X ∈ C(k) then sup(X) ← sup(X) + 1

EXTENDPREFIXTREE (C(k)):
16 foreach leaf Xa ∈ C(k) do
17     foreach leaf Xb ∈ SIBLING(Xa), such that b > a do
18         Xab ← Xa ∪ Xb
           // prune candidate if there are any infrequent subsets
19         if Xj ∈ C(k), for all Xj ⊂ Xab, such that |Xj| = |Xab| − 1 then
20             Add Xab as child of Xa with sup(Xab) ← 0
21     if no extensions from Xa then
22         remove Xa, and all ancestors of Xa with no extensions, from C(k)
23 return C(k)
space the cost is much lower. However, in terms of I/O cost Apriori requires O(|I|) database scans, as opposed to the O(2^|I|) scans in the brute-force method. In practice, it requires only l database scans, where l is the length of the longest frequent itemset.
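The sketch below (not from the text) illustrates the level-wise Apriori idea in Python: one database scan per level, prefix-based candidate extension, and pruning of candidates with an infrequent subset. It reuses the transaction dictionary D from the earlier snippets; the helper names are illustrative assumptions.

from itertools import combinations

def apriori(D, I, minsup):
    F = {}
    level = [frozenset([i]) for i in sorted(I)]              # candidate 1-itemsets
    k = 1
    while level:
        counts = {X: 0 for X in level}
        for items in D.values():                             # one scan per level
            for X in level:
                if X <= items:
                    counts[X] += 1
        freq = {X for X, s in counts.items() if s >= minsup}
        F.update({X: counts[X] for X in freq})
        # prefix-based extension of frequent k-itemsets into (k+1)-candidates
        freq_sorted = sorted(tuple(sorted(X)) for X in freq)
        level = []
        for a, b in combinations(freq_sorted, 2):
            if a[:-1] == b[:-1]:                             # shared (k-1)-prefix
                cand = frozenset(a) | frozenset(b)
                # keep only candidates whose k-subsets are all frequent
                if all(frozenset(sub) in freq
                       for sub in combinations(sorted(cand), k)):
                    level.append(cand)
        k += 1
    return F

print(len(apriori(D, "ABCDE", minsup=3)))                    # 19, matching Table 8.1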
8.2.2 Tidset Intersection Approach: Eclat Algorithm
The support counting step can be improved significantly if we can index the database in such a way that it allows fast frequency computations. Notice that in the level-wise approach, to count the support, we have to generate subsets of each transaction and check whether they exist in the prefix tree. This can be expensive because we may end up generating many subsets that do not exist in the prefix tree.
Figure 8.4. Itemset mining: Apriori algorithm. The prefix search trees C(k) at each level are shown. Leaves (unshaded) comprise the set of frequent k-itemsets F(k).
The Eclat algorithm leverages the tidsets directly for support computation. The basic idea is that the support of a candidate itemset can be computed by intersecting the tidsets of suitably chosen subsets. In general, given t(X) and t(Y) for any two frequent itemsets X and Y, we have
t(XY) = t(X) ∩ t(Y)
The support of candidate XY is simply the cardinality of t(XY), that is, sup(XY) = |t(XY)|. Eclat intersects the tidsets only if the frequent itemsets share a common prefix, and it traverses the prefix search tree in a DFS-like manner, processing a group of itemsets that have the same prefix, also called a prefix equivalence class.
Example 8.8. For example, if we know that the tidsets for item A and C are t(A) = 1345 and t(C) = 2456, respectively, then we can determine the support of AC by intersecting the two tidsets, to obtain t(AC) = t(A) ∩ t(C) = 1345 ∩ 2456 = 45.
In this case, we have sup(AC) = |45| = 2. An example of a prefix equivalence class is the set PA = {AB, AC, AD, AE}, as all the elements of PA share A as the prefix.

The pseudo-code for Eclat is given in Algorithm 8.3. It employs a vertical representation of the binary database D. Thus, the input is the set of tuples ⟨i, t(i)⟩ for all frequent items i ∈ I, which comprise an equivalence class P (they all share the empty prefix); it is assumed that P contains only frequent itemsets. In general, given a prefix equivalence class P, for each frequent itemset Xa ∈ P, we try to intersect its tidset with the tidsets of all other itemsets Xb ∈ P. The candidate pattern is Xab = Xa ∪ Xb, and we check the cardinality of the intersection t(Xa) ∩ t(Xb) to determine whether it is frequent. If so, Xab is added to the new equivalence class Pa that contains all itemsets that share Xa as a prefix. A recursive call to Eclat then finds all extensions of the Xa branch in the search tree. This process continues until no extensions are possible over all branches.

ALGORITHM 8.3. Algorithm ECLAT

// Initial Call: F ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, |t(i)| ≥ minsup}
ECLAT (P, minsup, F):
1  foreach ⟨Xa, t(Xa)⟩ ∈ P do
2      F ← F ∪ (Xa, sup(Xa))
3      Pa ← ∅
4      foreach ⟨Xb, t(Xb)⟩ ∈ P, with Xb > Xa do
5          Xab = Xa ∪ Xb
6          t(Xab) = t(Xa) ∩ t(Xb)
7          if sup(Xab) ≥ minsup then
8              Pa ← Pa ∪ {⟨Xab, t(Xab)⟩}
9      if Pa ≠ ∅ then ECLAT (Pa, minsup, F)

Example 8.9. Figure 8.5 illustrates the Eclat algorithm. Here minsup = 3, and the initial prefix equivalence class is

P∅ = {⟨A, 1345⟩, ⟨B, 123456⟩, ⟨C, 2456⟩, ⟨D, 1356⟩, ⟨E, 12345⟩}

Eclat intersects t(A) with each of t(B), t(C), t(D), and t(E) to obtain the tidsets for AB, AC, AD, and AE, respectively. Out of these AC is infrequent and is pruned (marked gray). The frequent itemsets and their tidsets comprise the new prefix equivalence class

PA = {⟨AB, 1345⟩, ⟨AD, 135⟩, ⟨AE, 1345⟩}

which is recursively processed. On return, Eclat intersects t(B) with t(C), t(D), and t(E) to obtain the equivalence class

PB = {⟨BC, 2456⟩, ⟨BD, 1356⟩, ⟨BE, 12345⟩}
Other branches are processed in a similar manner; the entire search space that Eclat explores is shown in Figure 8.5. The gray nodes indicate infrequent itemsets, whereas the rest constitute the set of frequent itemsets.

Figure 8.5. Eclat algorithm: tidlist intersections (gray boxes indicate infrequent itemsets).
The computational complexity of Eclat is O(|D| · 2^|I|) in the worst case, since there can be 2^|I| frequent itemsets, and an intersection of two tidsets takes at most O(|D|) time. The I/O complexity of Eclat is harder to characterize, as it depends on the size of the intermediate tidsets. With t as the average tidset size, the initial database size is O(t · |I|), and the total size of all the intermediate tidsets is O(t · 2^|I|). Thus, Eclat requires (t · 2^|I|)/(t · |I|) = O(2^|I|/|I|) database scans in the worst case.
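A minimal Python sketch of the Eclat idea follows (not from the text): itemsets in a prefix equivalence class are extended by intersecting tidsets, recursing on each nonempty class. The vertical database V mirrors Figure 8.1c; the names are illustrative assumptions.

V = {"A": {1, 3, 4, 5}, "B": {1, 2, 3, 4, 5, 6}, "C": {2, 4, 5, 6},
     "D": {1, 3, 5, 6}, "E": {1, 2, 3, 4, 5}}

def eclat(P, minsup, F):
    # P is a list of (itemset, tidset) pairs sharing a common prefix
    for i, (Xa, Ta) in enumerate(P):
        F[Xa] = len(Ta)
        Pa = []
        for Xb, Tb in P[i + 1:]:
            Tab = Ta & Tb                         # t(Xab) = t(Xa) ∩ t(Xb)
            if len(Tab) >= minsup:
                Pa.append((Xa + Xb[-1], Tab))     # extend the prefix Xa
        if Pa:
            eclat(Pa, minsup, F)

F = {}
eclat([(x, t) for x, t in sorted(V.items()) if len(t) >= 3], minsup=3, F=F)
print(len(F))                                     # 19 frequent itemsets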
Diffsets: Difference of Tidsets
The Eclat algorithm can be significantly improved if we can shrink the size of the intermediate tidsets. This can be achieved by keeping track of the differences in the tidsets as opposed to the full tidsets. Formally, let Xk = {x1, x2, ..., xk−1, xk} be a k-itemset. Define the diffset of Xk as the set of tids that contain the prefix Xk−1 = {x1, ..., xk−1} but do not contain the item xk, given as

d(Xk) = t(Xk−1) \ t(Xk)

Consider two k-itemsets Xa = {x1, ..., xk−1, xa} and Xb = {x1, ..., xk−1, xb} that share the common (k−1)-itemset X = {x1, x2, ..., xk−1} as a prefix. The diffset of Xab = Xa ∪ Xb = {x1, ..., xk−1, xa, xb} is given as

d(Xab) = t(Xa) \ t(Xab) = t(Xa) \ t(Xb)    (8.3)

However, note that

t(Xa) \ t(Xb) = t(Xa) ∩ t(Xb)ᶜ

where t(·)ᶜ = T \ t(·) denotes the complement of a tidset, and taking the union of the above with the empty set t(X)ᶜ ∩ t(X), we can obtain an expression for d(Xab) in terms of d(Xa) and d(Xb) as follows:

d(Xab) = t(Xa) \ t(Xb)
       = t(Xa) ∩ t(Xb)ᶜ
       = (t(Xa) ∩ t(Xb)ᶜ) ∪ (t(X)ᶜ ∩ t(X))
       = (t(Xa) ∪ t(X)ᶜ) ∩ (t(Xb)ᶜ ∪ t(X)ᶜ) ∩ (t(Xa) ∪ t(X)) ∩ (t(Xb)ᶜ ∪ t(X))
       = (t(X) ∩ t(Xb)ᶜ) ∩ (t(X) ∩ t(Xa)ᶜ)ᶜ ∩ T
       = d(Xb) \ d(Xa)
Thus, the diffset of Xab can be obtained from the diffsets of its subsets Xa and Xb, which means that we can replace all intersection operations in Eclat with diffset operations. Using diffsets the support of a candidate itemset can be obtained by subtracting the diffset size from the support of the prefix itemset:
sup(Xab)=sup(Xa)−|d(Xab)|
which follows directly from Eq. (8.3).
The variant of Eclat that uses the diffset optimization is called dEclat, whose
pseudo-code is shown in Algorithm 8.4. The input comprises all the frequent single items i ∈ I along with their diffsets, which are computed as
d(i)=t(∅)\t(i)=T \t(i)
Given an equivalence class P, for each pair of distinct itemsets Xa and Xb we generate the candidate pattern Xab = Xa ∪ Xb and check whether it is frequent via the use of diffsets (lines 6–7). Recursive calls are made to find further extensions.
ALGORITHM 8.4. Algorithm DECLAT

// Initial Call: F ← ∅,
//   P ← {⟨i, d(i), sup(i)⟩ | i ∈ I, d(i) = T \ t(i), sup(i) ≥ minsup}
DECLAT (P, minsup, F):
  foreach ⟨Xa, d(Xa), sup(Xa)⟩ ∈ P do
    F ← F ∪ {(Xa, sup(Xa))}
    Pa ← ∅
    foreach ⟨Xb, d(Xb), sup(Xb)⟩ ∈ P, with Xb > Xa do
      Xab = Xa ∪ Xb
      d(Xab) = d(Xb) \ d(Xa)
      sup(Xab) = sup(Xa) − |d(Xab)|
      if sup(Xab) ≥ minsup then
        Pa ← Pa ∪ {⟨Xab, d(Xab), sup(Xab)⟩}
    if Pa ≠ ∅ then DECLAT (Pa, minsup, F)
It is important to note that the switch from tidsets to diffsets can be made during any recursive call to the method. In particular, if the initial tidsets have small cardinality, then the initial call should use tidset intersections, with a switch to diffsets starting with 2-itemsets. Such optimizations are not described in the pseudo-code for clarity.
Example 8.10. Figure 8.6 illustrates the dEclat algorithm. Here minsup = 3, and the initial prefix equivalence class comprises all frequent items and their diffsets, computed as follows:
d(A) = T \ 1345 = 26
d(B) = T \ 123456 = ∅
d(C) = T \ 2456 = 13
d(D) = T \ 1356 = 24
d(E) = T \ 12345 = 6

where T = 123456. To process candidates with A as a prefix, dEclat computes the diffsets for AB, AC, AD, and AE. For instance, the diffsets of AB and AC are given as

d(AB) = d(B) \ d(A) = ∅ \ {2,6} = ∅
d(AC) = d(C) \ d(A) = {1,3} \ {2,6} = 13
Figure 8.6. dEclat algorithm: diffsets (gray boxes indicate infrequent itemsets).
and their support values are

sup(AB) = sup(A) − |d(AB)| = 4 − 0 = 4
sup(AC) = sup(A) − |d(AC)| = 4 − 2 = 2
Whereas AB is frequent, we can prune AC because it is not frequent. The frequent itemsets and their diffsets and support values comprise the new prefix equivalence
class:
PA = {⟨AB, ∅, 4⟩, ⟨AD, 4, 3⟩, ⟨AE, ∅, 4⟩}
which is recursively processed. Other branches are processed in a similar manner. The entire search space for dEclat is shown in Figure 8.6. The support of an itemset is shown within brackets. For example, A has support 4 and diffset d(A) = 26.
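The diffset-based recursion of Algorithm 8.4 differs from the Eclat sketch above only in what is propagated. The following Python fragment is again just a sketch under the same assumed dictionary representation of the vertical database; note how the support must be carried along explicitly, since it can no longer be read off as a tidset cardinality.

def declat(P, minsup, F):
    """P: list of (itemset, diffset, support) triples sharing a common prefix."""
    for i, (Xa, dXa, supXa) in enumerate(P):
        F[Xa] = supXa
        Pa = []
        for Xb, dXb, supXb in P[i + 1:]:
            Xab = Xa + (Xb[-1],)
            dXab = dXb - dXa                  # d(Xab) = d(Xb) \ d(Xa)
            supXab = supXa - len(dXab)        # sup(Xab) = sup(Xa) - |d(Xab)|
            if supXab >= minsup:
                Pa.append((Xab, dXab, supXab))
        if Pa:
            declat(Pa, minsup, F)

T = {1, 2, 3, 4, 5, 6}
D = {'A': {1, 3, 4, 5}, 'B': {1, 2, 3, 4, 5, 6}, 'C': {2, 4, 5, 6},
     'D': {1, 3, 5, 6}, 'E': {1, 2, 3, 4, 5}}
P0 = sorted(((i,), T - t, len(t)) for i, t in D.items() if len(t) >= 3)
F = {}
declat(P0, 3, F)     # produces the same frequent itemsets as the tidset-based sketch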
8.2.3 Frequent Pattern Tree Approach: FPGrowth Algorithm
The FPGrowth method indexes the database for fast support computation via the use of an augmented prefix tree called the frequent pattern tree (FP-tree). Each node in the tree is labeled with a single item, and each child node represents a different item. Each node also stores the support information for the itemset comprising the items on the path from the root to that node. The FP-tree is constructed as follows. Initially the tree contains as root the null item ∅. Next, for each tuple ⟨t,X⟩ ∈ D, where X = i(t), we insert the itemset X into the FP-tree, incrementing the count of all nodes along the path that represents X. If X shares a prefix with some previously inserted transaction, then X will follow the same path until the common prefix. For the remaining items in X, new nodes are created under the common prefix, with counts initialized to 1. The FP-tree is complete when all transactions have been inserted.
The FP-tree can be considered as a prefix compressed representation of D. Because we want the tree to be as compact as possible, we want the most frequent items to be at the top of the tree. FPGrowth therefore reorders the items in decreasing order of support, that is, from the initial database, it first computes the support of all single items i ∈ I. Next, it discards the infrequent items, and sorts the frequent items by decreasing support. Finally, each tuple ⟨t,X⟩ ∈ D is inserted into the FP-tree after reordering X by decreasing item support.
Once the FP-tree has been constructed, it serves as an index in lieu of the original database. All frequent itemsets can be mined from the tree directly via the FPGROWTH method, whose pseudo-code is shown in Algorithm 8.5. The method accepts as input a FP-tree R constructed from the input database D, and the current itemset prefix P , which is initially empty.
Example 8.11. Consider the example database in Figure 8.1. We add each transaction one by one into the FP-tree, and keep track of the count at each node. For our example database the sorted item order is {B(6),E(5),A(4),C(4),D(4)}. Next, each transaction is reordered in this same order; for example, ⟨1,ABDE⟩ becomes ⟨1,BEAD⟩. Figure 8.7 illustrates step-by-step FP-tree construction as each sorted transaction is added to it. The final FP-tree for the database is shown in Figure 8.7f.
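The construction just described can be sketched in Python as follows. The FPNode class and the dictionary of children are our own representational choices; header links and other bookkeeping used by practical implementations are omitted, so this is a minimal sketch rather than a full FP-tree implementation.

from collections import Counter

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fptree(transactions, minsup):
    # compute single-item supports and keep only the frequent items
    support = Counter(i for t in transactions for i in set(t))
    freq = {i for i, s in support.items() if s >= minsup}
    root = FPNode()
    for t in transactions:
        # reorder each transaction by decreasing support (ties broken by item name)
        items = sorted((i for i in set(t) if i in freq),
                       key=lambda i: (-support[i], i))
        node = root
        for i in items:
            node = node.children.setdefault(i, FPNode(i, node))
            node.count += 1
    return root, support

# The example database of Figure 8.1, with transactions as sets of items
Dtx = [set('ABDE'), set('BCE'), set('ABDE'), set('ABCE'), set('ABCDE'), set('BCD')]
root, support = build_fptree(Dtx, minsup=3)
# root.children['B'].count == 6, as in the final tree of Figure 8.7f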
Figure 8.7. Frequent pattern tree: bold edges indicate current transaction. The panels show the tree after adding each sorted transaction in turn: (a) ⟨1,BEAD⟩, (b) ⟨2,BEC⟩, (c) ⟨3,BEAD⟩, (d) ⟨4,BEAC⟩, (e) ⟨5,BEACD⟩, (f) ⟨6,BCD⟩.
Given a FP-tree R, projected FP-trees are built for each frequent item i in R in increasing order of support. To project R on item i, we find all the occurrences of i in the tree, and for each occurrence, we determine the corresponding path from the root to i (line 13). The count of item i on a given path is recorded in cnt(i) (line 14), and the path is inserted into the new projected tree RX, where X is the itemset obtained by extending the prefix P with the item i. While inserting the path, the count of each node in RX along the given path is incremented by the path count cnt(i). We omit the item i from the path, as it is now part of the prefix. The resulting FP-tree is a projection of the itemset X that comprises the current prefix extended with item i (line 9). We then call FPGROWTH recursively with projected FP-tree RX and the new prefix itemset X as the parameters (line 16). The base case for the recursion happens when the input FP-tree R is a single path. FP-trees that are paths are handled by enumerating all itemsets that are subsets of the path, with the support of each such itemset being given by the least frequent item in it (lines 2–6).
ALGORITHM 8.5. Algorithm FPGROWTH

// Initial Call: R ← FP-tree(D), P ← ∅, F ← ∅
FPGROWTH (R, P, F, minsup):
  Remove infrequent items from R
  if ISPATH(R) then // insert subsets of R into F
    foreach Y ⊆ R do
      X ← P ∪ Y
      sup(X) ← min_{x∈Y} {cnt(x)}
      F ← F ∪ {(X, sup(X))}
  else // process projected FP-trees for each frequent item i
    foreach i ∈ R in increasing order of sup(i) do
      X ← P ∪ {i}
      sup(X) ← sup(i) // sum of cnt(i) for all nodes labeled i
      F ← F ∪ {(X, sup(X))}
      RX ← ∅ // projected FP-tree for X
      foreach path ∈ PATHFROMROOT(i) do
        cnt(i) ← count of i in path
        Insert path, excluding i, into FP-tree RX with count cnt(i)
      if RX ≠ ∅ then FPGROWTH (RX, X, F, minsup)
Example 8.12. We illustrate the FPGrowth method on the FP-tree R built in Example 8.11, as shown in Figure 8.7f. Let minsup = 3. The initial prefix is P = ∅, and the set of frequent items i in R are B(6), E(5), A(4), C(4), and D(4). FPGrowth creates a projected FP-tree for each item, but in increasing order of support.
The projected FP-tree for item D is shown in Figure 8.8c. Given the initial FP-tree R shown in Figure 8.7f, there are three paths from the root to a node labeled D, namely
BCD, cnt(D) = 1
BEACD, cnt(D) = 1
BEAD, cnt(D) = 2
These three paths, excluding the last item i = D, are inserted into the new FP-tree RD with the counts incremented by the corresponding cnt(D) values, that is, we insert into RD, the paths BC with count of 1, BEAC with count of 1, and finally BEA with count of 2, as shown in Figures 8.8a–c. The projected FP-tree for D is shown in Figure 8.8c, which is processed recursively.
When we process RD, we have the prefix itemset P = D, and after removing the infrequent item C (which has support 2), we find that the resulting FP-tree is a single path B(4)–E(3)–A(3). Thus, we enumerate all subsets of this path and prefix them
Figure 8.8. Projected frequent pattern tree for D: (a) add BC, cnt = 1; (b) add BEAC, cnt = 1; (c) add BEA, cnt = 2.
with D, to obtain the frequent itemsets DB(4), DE(3), DA(3), DBE(3), DBA(3), DEA(3), and DBEA(3). At this point the call from D returns.
In a similar manner, we process the remaining items at the top level. The projected trees for C, A, and E are all single-path trees, allowing us to generate the frequent itemsets {CB(4),CE(3),CBE(3)}, {AE(4),AB(4),AEB(4)}, and {EB(5)}, respectively. This process is illustrated in Figure 8.9.
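The recursion of Algorithm 8.5 can also be sketched without materializing compressed trees, by representing each (conditional) FP-tree simply as a list of prefix paths with counts. The Python function below is an illustrative sketch under that simplification: it keeps the pattern-growth logic visible but gives up the compression that an actual FP-tree provides, so it is not the algorithm as stated.

from collections import Counter

def pattern_growth(paths, minsup, prefix, F):
    """paths: list of (path, count); each path lists items by decreasing global support."""
    support = Counter()
    for path, cnt in paths:
        for item in path:
            support[item] += cnt
    for item, sup in support.items():
        if sup < minsup:
            continue
        X = prefix | {item}
        F[frozenset(X)] = sup
        # conditional pattern base: the part of each path that precedes 'item'
        cond = [(path[:path.index(item)], cnt)
                for path, cnt in paths
                if item in path and path.index(item) > 0]
        if cond:
            pattern_growth(cond, minsup, X, F)

# Sorted transactions of Example 8.11, each with an initial count of 1
base = [(list('BEAD'), 1), (list('BEC'), 1), (list('BEAD'), 1),
        (list('BEAC'), 1), (list('BEACD'), 1), (list('BCD'), 1)]
F = {}
pattern_growth(base, 3, frozenset(), F)
# e.g. F[frozenset('DBEA')] == 3, matching DBEA(3) found in Example 8.12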
8.3 GENERATING ASSOCIATION RULES
Given a collection of frequent itemsets F , to generate association rules we iterate over all itemsets Z ∈ F, and calculate the confidence of various rules that can be derived from the itemset. Formally, given a frequent itemset Z ∈ F, we look at all proper subsets X ⊂ Z to compute rules of the form
X −(s,c)−→ Y, where Y = Z \ X
where Z \ X = Z − X. The rule must be frequent because
s = sup(XY) = sup(Z) ≥ minsup
Thus, we have to only check whether the rule confidence satisfies the minconf threshold. We compute the confidence as follows:
c = sup(X ∪ Y)/sup(X) = sup(Z)/sup(X)
If c ≥ minconf, then the rule is a strong rule. On the other hand, if conf(X −→ Y) < minconf, then conf(W −→ Z \ W) < minconf for any subset W ⊂ X, because sup(W) ≥ sup(X) implies conf(W −→ Z \ W) = sup(Z)/sup(W) ≤ sup(Z)/sup(X) = conf(X −→ Y); we can therefore avoid checking rules whose antecedent is a subset of X.
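Given the frequent itemsets and their supports, rule generation reduces to the confidence computation above. The sketch below assumes the frequent itemsets are available as a dictionary from frozensets to supports (which, by the antimonotonicity of support, contains every subset of every frequent itemset); it is a simple brute-force illustration rather than an optimized implementation.

from itertools import combinations

def association_rules(F, minconf):
    """F: dict mapping frozenset itemsets to supports; returns the strong rules."""
    rules = []
    for Z, supZ in F.items():
        if len(Z) < 2:
            continue
        for k in range(1, len(Z)):
            for X in map(frozenset, combinations(Z, k)):
                conf = supZ / F[X]              # c = sup(Z) / sup(X)
                if conf >= minconf:
                    rules.append((set(X), set(Z - X), supZ, conf))
    return rules

# e.g. with F = {frozenset('A'): 4, frozenset('B'): 6, frozenset('AB'): 4},
# association_rules(F, 0.9) yields the single strong rule A -> B (support 4, confidence 1.0)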
The set of all closed frequent itemsets C is a condensed representation, as we can determine whether an itemset X is frequent, as well as the exact support of X using C alone. The itemset X is frequent if there exists a closed frequent itemset Z ∈ C such that X ⊆ Z. Further, the support of X is given as
sup(X) = max{sup(Z) | Z ∈ C, X ⊆ Z}
The following relationship holds between the sets of all, closed, and maximal frequent itemsets:

M ⊆ C ⊆ F

Minimal Generators
A frequent itemset X is a minimal generator if it has no subsets with the same support:

G = {X | X ∈ F and ∄ Y ⊂ X such that sup(X) = sup(Y)}
In other words, all subsets of X have strictly higher support, that is, sup(X) < sup(Y), for all Y ⊂ X. The concept of a minimal generator is closely related to the notion of closed itemsets. Given an equivalence class of itemsets that have the same tidset, a closed itemset is the unique maximum element of the class, whereas the minimal generators are the minimal elements of the class.
Example 9.2. Consider the example dataset in Figure 9.1a. The frequent closed (as well as maximal) itemsets using minsup = 3 are shown in Figure 9.2. We can see, for instance, that the itemsets AD, DE, ABD, ADE, BDE, and ABDE, occur in the same three transactions, namely 135, and thus constitute an equivalence class. The largest itemset among these, namely ABDE, is the closed itemset. Using the closure operator yields the same result; we have c(AD) = i(t(AD)) = i(135) = ABDE, which indicates that the closure of AD is ABDE. To verify that ABDE is closed note that c(ABDE) = i(t(ABDE)) = i(135) = ABDE. The minimal elements of the equivalence class, namely AD and DE, are the minimal generators. No subset of these itemsets shares the same tidset.
The set of all closed frequent itemsets, and the corresponding set of minimal generators, is as follows:
Tidset    C       G
1345      ABE     A
123456    B       B
1356      BD      D
12345     BE      E
2456      BC      C
135       ABDE    AD, DE
245       BCE     CE

Out of the closed itemsets, the maximal ones are ABDE and BCE. Consider itemset AB. Using C we can determine that

sup(AB) = max{sup(ABE), sup(ABDE)} = max{4, 3} = 4
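The closure operator and the tidset-based equivalence classes of Example 9.2 are easy to reproduce in a few lines of Python. The following sketch assumes the database is given as a map from items to tidsets; it enumerates all frequent itemsets by brute force, so it is meant only to illustrate the definitions, not to be efficient.

from itertools import combinations

t = {'A': {1, 3, 4, 5}, 'B': {1, 2, 3, 4, 5, 6}, 'C': {2, 4, 5, 6},
     'D': {1, 3, 5, 6}, 'E': {1, 2, 3, 4, 5}}
T = {1, 2, 3, 4, 5, 6}

def tidset(X):                         # t(X): tids common to every item of X
    s = set(T)
    for x in X:
        s &= t[x]
    return frozenset(s)

def closure(X):                        # c(X) = i(t(X)): items common to every tid of t(X)
    tX = tidset(X)
    return frozenset(x for x in t if tX <= t[x])

assert closure({'A', 'D'}) == frozenset('ABDE')   # as computed in Example 9.2

# Group frequent itemsets by tidset: the largest member of each class is the closed
# itemset, and the members with no proper subset in the class are the minimal generators.
minsup = 3
classes = {}
for k in range(1, len(t) + 1):
    for X in map(frozenset, combinations(t, k)):
        if len(tidset(X)) >= minsup:
            classes.setdefault(tidset(X), []).append(X)
for members in classes.values():
    closed = max(members, key=len)
    generators = [X for X in members if not any(Y < X for Y in members)]
    # e.g. the class with tidset {1,3,5} has closed itemset ABDE and generators AD and DE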
Figure 9.2. Frequent, closed, minimal generators, and maximal frequent itemsets. Itemsets that are boxed and shaded are closed, whereas those within boxes (but unshaded) are the minimal generators; maximal itemsets are shown boxed with double lines.
9.2 MINING MAXIMAL FREQUENT ITEMSETS: GENMAX ALGORITHM
Mining maximal itemsets requires additional steps beyond simply determining the frequent itemsets. Assuming that the set of maximal frequent itemsets is initially empty, that is, M = ∅, each time we generate a new frequent itemset X, we have to perform the following maximality checks
• Subset Check: ̸∃Y∈M, such that X⊂Y. If such a Y exists, then clearly X is not maximal. Otherwise, we add X to M, as a potentially maximal itemset.
• Superset Check: ̸ ∃Y ∈ M, such that Y ⊂ X. If such a Y exists, then Y cannot be maximal, and we have to remove it from M.
These two maximality checks take O(|M|) time, which can get expensive, especially as M grows; thus for efficiency reasons it is crucial to minimize the number of times these checks are performed. As such, any of the frequent itemset mining algorithms
from Chapter 8 can be extended to mine maximal frequent itemsets by adding the maximality checking steps. Here we consider the GenMax method, which is based on the tidset intersection approach of Eclat (see Section 8.2.2). We shall see that it never inserts a nonmaximal itemset into M. It thus eliminates the superset checks and requires only subset checks to determine maximality.
Algorithm 9.1 shows the pseudo-code for GenMax. The initial call takes as input the set of frequent items along with their tidsets, ⟨i,t(i)⟩, and the initially empty set of maximal itemsets, M. Given a set of itemset–tidset pairs, called IT-pairs, of the form ⟨X,t(X)⟩, the recursive GenMax method works as follows. In lines 1–3, we check if the entire current branch can be pruned by checking if the union of all the itemsets, Y = ⋃i Xi, is already subsumed by (or contained in) some maximal pattern Z ∈ M. If so, no maximal itemset can be generated from the current branch, and it is pruned. On the other hand, if the branch is not pruned, we intersect each IT-pair ⟨Xi, t(Xi)⟩ with all the other IT-pairs ⟨Xj, t(Xj)⟩, with j > i, to generate new candidates Xij, which are added to the IT-pair set Pi (lines 6–9). If Pi is not empty, a recursive call to GENMAX is made to find other potentially frequent extensions of Xi. On the other hand, if Pi is empty, it means that Xi cannot be extended, and it is potentially maximal. In this case, we add Xi to the set M, provided that Xi is not contained in any previously added maximal set Z ∈ M (line 12). Note also that, because of this check for maximality before inserting any itemset into M, we never have to remove any itemsets from it. In other words, all itemsets in M are guaranteed to be maximal. On termination of GenMax, the set M contains the final set of all maximal frequent itemsets. The GenMax approach also includes a number of other optimizations to reduce the maximality checks and to improve the support computations. Further, GenMax utilizes diffsets (differences of tidsets) for fast support computation, which were described in Section 8.2.2. We omit these optimizations here for clarity.
ALGORITHM 9.1. Algorithm GENMAX

// Initial Call: M ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}
GENMAX (P, minsup, M):
  Y ← ⋃ Xi
  if ∃ Z ∈ M, such that Y ⊆ Z then
    return // prune entire branch
  foreach ⟨Xi, t(Xi)⟩ ∈ P do
    Pi ← ∅
    foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
      Xij ← Xi ∪ Xj
      t(Xij) = t(Xi) ∩ t(Xj)
      if sup(Xij) ≥ minsup then Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
    if Pi ≠ ∅ then GENMAX (Pi, minsup, M)
    else if ∄ Z ∈ M, Xi ⊆ Z then
      M = M ∪ {Xi} // add Xi to maximal set
Example 9.3. Figure 9.3 shows the execution of GenMax on the example database from Figure 9.1a using minsup = 3. Initially the set of maximal itemsets is empty. The root of the tree represents the initial call with all IT-pairs consisting of frequent single items and their tidsets. We first intersect t(A) with the tidsets of the other items. The set of frequent extensions from A are
PA = {⟨AB, 1345⟩, ⟨AD, 135⟩, ⟨AE, 1345⟩}

Choosing Xi = AB leads to the next set of extensions, namely

PAB = {⟨ABD, 135⟩, ⟨ABE, 1345⟩}
Finally, we reach the left-most leaf corresponding to PABD = {⟨ABDE, 135⟩}. At this point, we add ABDE to the set of maximal frequent itemsets because it has no other extensions, so that M = {ABDE}.
The search then backtracks one level, and we try to process ABE, which is also a candidate to be maximal. However, it is contained in ABDE, so it is pruned. Likewise, when we try to process PAD = {⟨ADE,135⟩} it will get pruned because it is also subsumed by ABDE, and similarly for AE. At this stage, all maximal itemsets starting with A have been found, and we next proceed with the B branch. The left-most B branch, namely BCE, cannot be extended further. Because BCE is not a subset of any maximal itemset in M, we insert it as a maximal itemset, so that M = {ABDE, BCE}. Subsequently, all remaining branches are subsumed by one of these two maximal itemsets, and are thus pruned.
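A direct transcription of Algorithm 9.1 into Python is shown below as a sketch; the representation of IT-pairs as (frozenset, tidset) tuples and of M as a plain list are assumptions made for brevity, and the diffset optimization is omitted.

def genmax(P, minsup, M):
    """P: list of (itemset, tidset) IT-pairs; M: list collecting maximal itemsets."""
    Y = frozenset().union(*(X for X, _ in P))
    if any(Y <= Z for Z in M):
        return                                    # prune the entire branch
    for i, (Xi, tXi) in enumerate(P):
        Pi = []
        for Xj, tXj in P[i + 1:]:
            Xij, tXij = Xi | Xj, tXi & tXj
            if len(tXij) >= minsup:
                Pi.append((Xij, tXij))
        if Pi:
            genmax(Pi, minsup, M)
        elif not any(Xi <= Z for Z in M):
            M.append(Xi)                          # Xi is maximal

t = {'A': {1, 3, 4, 5}, 'B': {1, 2, 3, 4, 5, 6}, 'C': {2, 4, 5, 6},
     'D': {1, 3, 5, 6}, 'E': {1, 2, 3, 4, 5}}
M = []
genmax([(frozenset(i), tid) for i, tid in sorted(t.items())], 3, M)
# M == [frozenset('ABDE'), frozenset('BCE')], as in Example 9.3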
Figure 9.3. Mining maximal frequent itemsets. Maximal itemsets are shown as shaded ovals, whereas pruned branches are shown with the strike-through. Infrequent itemsets are not shown.
9.3 MINING CLOSED FREQUENT ITEMSETS: CHARM ALGORITHM
Mining closed frequent itemsets requires that we perform closure checks, that is, whether X = c(X). Direct closure checking can be very expensive, as we would have to verify that X is the largest itemset common to all the tids in t(X), that is, X = ⋂_{t ∈ t(X)} i(t). Instead, we will describe a vertical tidset intersection based method called CHARM that performs more efficient closure checking. Given a collection of IT-pairs {⟨Xi, t(Xi)⟩}, the following three properties hold:
Property (1) If t(Xi ) = t(Xj ), then c(Xi ) = c(Xj ) = c(Xi ∪ Xj ), which implies that we can replace every occurrence of Xi with Xi ∪ Xj and prune the branch under Xj because its closure is identical to the closure of Xi ∪ Xj .
Property (2) If t(Xi) ⊂ t(Xj), then c(Xi) ̸= c(Xj) but c(Xi) = c(Xi ∪Xj), which means that we can replace every occurrence of Xi with Xi ∪ Xj , but we cannot prune Xj because it generates a different closure. Note that if t(Xi ) ⊃ t(Xj ) then we simply interchange the role of Xi and Xj .
Property (3) If t(Xi ) ̸= t(Xj ), then c(Xi ) ̸= c(Xj ) ̸= c(Xi ∪ Xj ). In this case we cannot remove either Xi or Xj , as each of them generates a different closure.
Algorithm 9.2 presents the pseudo-code for Charm, which is also based on the Eclat algorithm described in Section 8.2.2. It takes as input the set of all frequent single items along with their tidsets. Also, initially the set of all closed itemsets, C, is empty. Given any IT-pair set P = {⟨Xi , t(Xi )⟩}, the method first sorts them in increasing order of support. For each itemset Xi we try to extend it with all other items Xj in the sorted order, and we apply the above three properties to prune branches where possible. First we make sure that Xij = Xi ∪ Xj is frequent, by checking the cardinality of t(Xij ). If yes, then we check properties 1 and 2 (lines 8 and 12). Note that whenever we replace Xi with Xij =Xi ∪Xj, we make sure to do so in the current set P, as well as the new set Pi . Only when property 3 holds do we add the new extension Xij to the set Pi (line 14). If the set Pi is not empty, then we make a recursive call to Charm. Finally, if Xi is not a subset of any closed set Z with the same support, we can safely add it to the set of closed itemsets, C (line 18). For fast support computation, Charm uses the diffset optimization described in Section 8.2.2; we omit it here for clarity.
Example 9.4. We illustrate the Charm algorithm for mining frequent closed itemsets from the example database in Figure 9.1a, using minsup = 3. Figure 9.4 shows the sequence of steps. The initial set of IT-pairs, after support based sorting, is shown at the root of the search tree. The sorted order is A, C, D, E, and B. We first process extensions from A, as shown in Figure 9.4a. Because AC is not frequent,
ALGORITHM 9.2. Algorithm CHARM

// Initial Call: C ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}
CHARM (P, minsup, C):
  Sort P in increasing order of support (i.e., by increasing |t(Xi)|)
  foreach ⟨Xi, t(Xi)⟩ ∈ P do
    Pi ← ∅
    foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
      Xij = Xi ∪ Xj
      t(Xij) = t(Xi) ∩ t(Xj)
      if sup(Xij) ≥ minsup then
        if t(Xi) = t(Xj) then // Property 1
          Replace Xi with Xij in P and Pi
          Remove ⟨Xj, t(Xj)⟩ from P
        else
          if t(Xi) ⊂ t(Xj) then // Property 2
            Replace Xi with Xij in P and Pi
          else // Property 3
            Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
    if Pi ≠ ∅ then CHARM (Pi, minsup, C)
    if ∄ Z ∈ C, such that Xi ⊆ Z and t(Xi) = t(Z) then
      C = C ∪ {Xi} // Add Xi to closed set
it is pruned. AD is frequent and because t(A) ̸= t(D), we add ⟨AD,135⟩ to the set PA (property 3). When we combine A with E, property 2 applies, and we simply replace all occurrences of A in both P and PA with AE, which is illustrated with the strike-through. Likewise, because t(A) ⊂ t(B) all current occurrences of A, actually AE, in both P and PA are replaced by AEB. The set PA thus contains only one itemset {⟨ADEB,135⟩}. When CHARM is invoked with PA as the IT-pair, it jumps straight to line 18, and adds ADEB to the set of closed itemsets C. When the call returns, we check whether AEB can be added as a closed itemset. AEB is a subset of ADEB, but it does not have the same support, thus AEB is also added to C. At this point all closed itemsets containing A have been found.
The Charm algorithm proceeds with the remaining branches as shown in Figure 9.4b. For instance, C is processed next. CD is infrequent and thus pruned. CE is frequent and it is added to PC as a new extension (via property 3). Because t(C) ⊂ t(B), all occurrences of C are replaced by CB, and PC = {⟨CEB, 245⟩}. CEB and CB are both found to be closed. The computation proceeds in this manner until all closed frequent itemsets are enumerated. Note that when we get to DEB and perform the closure check, we find that it is a subset of ADEB and also has the same support; thus DEB is not closed.
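The heart of Charm is the case analysis on the two tidsets, which the following small Python helper makes explicit. It is only a sketch of the decision step; the surrounding bookkeeping of Algorithm 9.2 (actually replacing Xi by Xij in P and Pi, and the final subsumption check against C) is not shown.

def charm_case(tXi, tXj):
    """Return which of the three Charm properties applies to the pair (Xi, Xj)."""
    if tXi == tXj:
        return "Property 1: replace Xi by Xij everywhere and remove Xj from P"
    if tXi < tXj:                    # proper subset
        return "Property 2: replace Xi by Xij (Xj must still be processed)"
    if tXj < tXi:
        return "Property 2 with roles swapped: replace Xj by Xij"
    return "Property 3: add Xij as a new extension in Pi"

# e.g. with t(A) = {1,3,4,5} and t(E) = {1,2,3,4,5}, charm_case reports Property 2,
# so every occurrence of A is replaced by AE, exactly as in Example 9.4.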
Figure 9.4. Mining closed frequent itemsets: (a) process A; (b) the full Charm search. Closed itemsets are shown as shaded ovals. Strike-through represents itemsets Xi replaced by Xi ∪ Xj during execution of the algorithm. Infrequent itemsets are not shown.
9.4 NONDERIVABLE ITEMSETS
An itemset is called nonderivable if its support cannot be deduced from the supports of its subsets. The set of all frequent nonderivable itemsets is a summary or condensed representation of the set of all frequent itemsets. Further, it is lossless with respect to support, that is, the exact support of all other frequent itemsets can be deduced from it.
Generalized Itemsets
Let T be a set of tids, let I be a set of items, and let X be a k-itemset, that is, X = {x1, x2, ..., xk}. Consider the tidsets t(xi) for each item xi ∈ X. These k tidsets induce a partitioning of the set of all tids into 2^k regions, some of which may be empty, where each partition contains the tids for some subset of items Y ⊆ X, but for none of the remaining items Z = X \ Y. Each such region is therefore the tidset of a generalized itemset comprising items in X or their negations. As such, a generalized itemset can be represented as YZ̄, where Y consists of regular items and Z̄ consists of negated items. We define the support of a generalized itemset YZ̄ as the number of transactions that
Figure 9.5. Tidset partitioning induced by t(A), t(C), and t(D).
contain all items in Y but no item in Z:

sup(YZ̄) = |{t ∈ T | Y ⊆ i(t) and Z ∩ i(t) = ∅}|
Example 9.5. Consider the example dataset in Figure 9.1a. Let X = ACD. We have t(A) = 1345, t(C) = 2456, and t(D) = 1356. These three tidsets induce a partitioning on the space of all tids, as illustrated in the Venn diagram shown in Figure 9.5. For example, the region labeled t(ACD̄) = 4 represents those tids that contain A and C but not D. Thus, the support of the generalized itemset ACD̄ is 1. The tids that belong to all the eight regions are shown. Some regions are empty, which means that the support of the corresponding generalized itemset is 0.
Inclusion–Exclusion Principle
Let YZ̄ be a generalized itemset, and let X = Y ∪ Z = YZ. The inclusion–exclusion principle allows one to directly compute the support of YZ̄ as a combination of the supports for all itemsets W, such that Y ⊆ W ⊆ X:

sup(YZ̄) = Σ_{Y⊆W⊆X} (−1)^|W\Y| · sup(W)   (9.2)
Example 9.6. Let us compute the support of the generalized itemset ĀCD̄ = CĀD̄, where Y = C, Z = AD, and X = YZ = ACD. In the Venn diagram shown in Figure 9.5, we start with all the tids in t(C), and remove the tids contained in t(AC) and t(CD). However, we realize that in terms of support this removes sup(ACD) twice, so we need to add it back. In other words, the support of CĀD̄ is given as

sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD) = 4 − 2 − 2 + 1 = 1

But this is precisely what the inclusion–exclusion formula gives:

sup(CĀD̄) = (−1)^0 sup(C)      W = C, |W \ Y| = 0
          + (−1)^1 sup(AC)     W = AC, |W \ Y| = 1
          + (−1)^1 sup(CD)     W = CD, |W \ Y| = 1
          + (−1)^2 sup(ACD)    W = ACD, |W \ Y| = 2
          = sup(C) − sup(AC) − sup(CD) + sup(ACD)

We can see that the support of CĀD̄ is a combination of the support values over all itemsets W such that C ⊆ W ⊆ ACD.
Support Bounds for an Itemset
Notice that the inclusion–exclusion formula in Eq. (9.2) for the support of YZ̄ has terms for all subsets between Y and X = YZ. Put differently, for a given k-itemset X, there are 2^k generalized itemsets of the form YZ̄, with Y ⊆ X and Z = X \ Y, and each such generalized itemset has a term for sup(X) in the inclusion–exclusion equation; this happens when W = X. Because the support of any (generalized) itemset must be non-negative, we can derive a bound on the support of X from each of the 2^k generalized itemsets by setting sup(YZ̄) ≥ 0. However, note that whenever |X \ Y| is even, the coefficient of sup(X) is +1, but when |X \ Y| is odd, the coefficient of sup(X) is −1 in Eq. (9.2). Thus, from the 2^k possible subsets Y ⊆ X, we derive 2^{k−1} lower bounds and 2^{k−1} upper bounds for sup(X), obtained after setting sup(YZ̄) ≥ 0 and rearranging the terms in the inclusion–exclusion formula, so that sup(X) is on the left hand side and the remaining terms are on the right hand side:
Upper Bounds (|X \ Y| is odd):  sup(X) ≤ Σ_{Y⊆W⊂X} (−1)^{|X\W|+1} sup(W)   (9.3)

Lower Bounds (|X \ Y| is even): sup(X) ≥ Σ_{Y⊆W⊂X} (−1)^{|X\W|+1} sup(W)   (9.4)
Note that the only difference in the two equations is the inequality, which depends on the starting subset Y.
Example 9.7. Consider Figure 9.5, which shows the partitioning induced by the tidsets of A, C, and D. We wish to determine the support bounds for X = ACD using each of the generalized itemsets YZ̄ where Y ⊆ X. For example, if Y = C, then the inclusion–exclusion principle [Eq. (9.2)] gives us

sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD)

Setting sup(CĀD̄) ≥ 0, and rearranging the terms, we obtain
sup(ACD) ≥ −sup(C) + sup(AC) + sup(CD)
which is precisely the expression from the lower-bound formula in Eq. (9.4) because |X\Y|=|ACD−C|=|AD|=2 is even.
As another example, let Y = ∅. Setting sup(ĀC̄D̄) ≥ 0, we have

sup(ĀC̄D̄) = sup(∅) − sup(A) − sup(C) − sup(D) + sup(AC) + sup(AD) + sup(CD) − sup(ACD) ≥ 0

=⇒ sup(ACD) ≤ sup(∅) − sup(A) − sup(C) − sup(D) + sup(AC) + sup(AD) + sup(CD)
Notice that this rule gives an upper bound on the support of ACD, which also follows from Eq. (9.3) because |X \ Y| = 3 is odd.
In fact, from each of the regions in Figure 9.5, we get one bound, and out of the eight possible regions, exactly four give upper bounds and the other four give lower bounds for the support of ACD:

sup(ACD) ≥ 0                                          when Y = ACD
sup(ACD) ≤ sup(AC)                                    when Y = AC
sup(ACD) ≤ sup(AD)                                    when Y = AD
sup(ACD) ≤ sup(CD)                                    when Y = CD
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)                 when Y = A
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)                 when Y = C
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D)                 when Y = D
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C) − sup(D) + sup(∅)   when Y = ∅
This derivation of the bounds is schematically summarized in Figure 9.6. For instance, at level 2 the inequality is ≥, which implies that if Y is any itemset at this level, we will obtain a lower bound. The signs at different levels indicate the coefficient of the corresponding itemset in the upper or lower bound computations via Eq. (9.3) and Eq. (9.4). Finally, the subset lattice shows which intermediate terms W have to be considered in the summation. For instance, if Y = A, then the intermediate terms are W ∈ {AC,AD,A}, with the corresponding signs {+1,+1,−1}, so that we obtain the lower bound rule:
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)
Figure 9.6. Support bounds from subsets.
Given an itemset X, and Y ⊆ X, let IE(Y) denote the summation

IE(Y) = Σ_{Y⊆W⊂X} (−1)^{|X\W|+1} · sup(W)
Then, the sets of all upper and lower bounds for sup(X) are given as
UB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is odd}
LB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is even}
An itemset X is called nonderivable if max{LB(X)} ̸= min{UB(X)}, which implies that the support of X cannot be derived from the support values of its subsets; we know only the range of possible values, that is,
sup(X) ∈ [max{LB(X)}, min{UB(X)}]
On the other hand, X is derivable if sup(X) = max{LB(X)} = min{UB(X)} because in this case sup(X) can be derived exactly using the supports of its subsets. Thus, the set of all frequent nonderivable itemsets is given as
N = {X ∈ F | max{LB(X)} ≠ min{UB(X)}}

where F is the set of all frequent itemsets.
Example 9.8. Consider the set of upper bound and lower bound formulas for sup(ACD) outlined in Example 9.7. Using the tidset information in Figure 9.5, the
support lower bounds are

sup(ACD) ≥ 0
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A) = 2 + 3 − 4 = 1
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C) = 2 + 2 − 4 = 0
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D) = 3 + 2 − 4 = 1

and the upper bounds are

sup(ACD) ≤ sup(AC) = 2
sup(ACD) ≤ sup(AD) = 3
sup(ACD) ≤ sup(CD) = 2
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C) − sup(D) + sup(∅) = 2 + 3 + 2 − 4 − 4 − 4 + 6 = 1

Thus, we have

LB(ACD) = {0, 1}    max{LB(ACD)} = 1
UB(ACD) = {1, 2, 3}    min{UB(ACD)} = 1
Because max{LB(ACD)} = min{UB(ACD)}, we conclude that ACD is derivable. Note that it is not essential to derive all the upper and lower bounds before one can conclude whether an itemset is derivable. For example, let X = ABDE.
Considering its immediate subsets, we can obtain the following upper bound values:
sup(ABDE) ≤ sup(ABD) = 3
sup(ABDE) ≤ sup(ABE) = 4
sup(ABDE) ≤ sup(ADE) = 3
sup(ABDE) ≤ sup(BDE) = 3
From these upper bounds, we know for sure that sup(ABDE) ≤ 3. Now, let us consider the lower bound derived from Y = AB:
sup(ABDE) ≥ sup(ABD) + sup(ABE) − sup(AB) = 3 + 4 − 4 = 3
At this point we know that sup(ABDE) ≥ 3, so without processing any further bounds, we can conclude that sup(ABDE) ∈ [3,3], which means that ABDE is derivable.
For the example database in Figure 9.1a, the set of all frequent nonderivable itemsets, along with their support bounds, is
N = {A[0,6], B[0,6], C[0,6], D[0,6], E[0,6], AD[2,4], AE[3,4], CE[3,4], DE[3,4]}
Notice that single items are always nonderivable by definition.
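The bound computations of Eqs. (9.3) and (9.4) are mechanical once the supports of all proper subsets of X are known. The sketch below assumes those supports are supplied in a dictionary keyed by frozensets; it simply evaluates IE(Y) for every Y ⊆ X and separates upper from lower bounds by the parity of |X \ Y|.

from itertools import combinations

def IE(Y, X, sup):
    """Inclusion-exclusion sum over all W with Y ⊆ W ⊂ X, as in Eqs. (9.3)/(9.4)."""
    total = 0
    extra = sorted(X - Y)
    for k in range(len(extra)):                    # never take all of X \ Y, so W ≠ X
        for add in combinations(extra, k):
            W = Y | frozenset(add)
            total += (-1) ** (len(X - W) + 1) * sup[W]
    return total

def support_bounds(X, sup):
    lb, ub = [], []
    for k in range(len(X) + 1):
        for Y in map(frozenset, combinations(sorted(X), k)):
            (ub if (len(X) - len(Y)) % 2 == 1 else lb).append(IE(Y, X, sup))
    return max(lb), min(ub)

# Supports of the proper subsets of ACD in the example database
sup = {frozenset(): 6, frozenset('A'): 4, frozenset('C'): 4, frozenset('D'): 4,
       frozenset('AC'): 2, frozenset('AD'): 3, frozenset('CD'): 2}
support_bounds(frozenset('ACD'), sup)    # returns (1, 1), so ACD is derivable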
9.5 FURTHER READING
The concept of closed itemsets is based on the elegant lattice theoretic framework of formal concept analysis in Ganter, Wille, and Franzke (1997). The Charm algorithm for mining frequent closed itemsets appears in Zaki and Hsiao (2005), and the GenMax method for mining maximal frequent itemsets is described in Gouda and Zaki (2005). For an Apriori style algorithm for maximal patterns, called MaxMiner, that uses very effective support lower bound based itemset pruning, see Bayardo Jr (1998). The notion of minimal generators was proposed in Bastide et al. (2000); they refer to them as key patterns. The nonderivable itemset mining task was introduced in Calders and Goethals (2007).
Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., and Lakhal, L. (2000). Mining frequent patterns with counting inference. ACM SIGKDD Explorations Newsletter, 2 (2): 66–75.
Bayardo Jr, R. J. (1998). Efficiently mining long patterns from databases. Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, pp. 85–93.
Calders, T. and Goethals, B. (2007). Non-derivable itemset mining. Data Mining and Knowledge Discovery, 14 (1): 171–206.
Ganter, B., Wille, R., and Franzke, C. (1997). Formal concept analysis: mathematical foundations. New York: Springer-Verlag.
Gouda, K. and Zaki, M. J. (2005). Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery, 11 (3): 223–242.
Zaki, M. J. and Hsiao, C.-J. (2005). Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17 (4): 462–478.
9.6 EXERCISES
Q1. True or False:
(a) Maximal frequent itemsets are sufficient to determine all frequent itemsets with
their supports.
(b) An itemset and its closure share the same set of transactions.
(c) The set of all maximal frequent sets is a subset of the set of all closed frequent itemsets.
(d) The set of all maximal frequent sets is the set of longest possible frequent itemsets.
Q2. Given the database in Table 9.1
(a) Show the application of the closure operator on AE, that is, compute c(AE). Is
AE closed?
(b) Find all frequent, closed, and maximal itemsets using minsup = 2/6.
Q3. Given the database in Table 9.2, find all minimal generators using minsup = 1.
Table 9.1. Dataset for Q2

Tid    Itemset
t1     ACD
t2     BCE
t3     ABCE
t4     BDE
t5     ABCE
t6     ABCD
Table 9.2. Dataset for Q3

Tid    Itemset
1      ACD
2      BCD
3      AC
4      ABD
5      ABCD
6      BCD
Figure 9.7. Closed itemset lattice for Q4. The closed itemsets and their supports are B(8), ABD(6), BC(5), and ABCD(3).
Q4. Consider the frequent closed itemset lattice shown in Figure 9.7. Assume that the item space is I = {A, B, C, D, E}. Answer the following questions:
(a) What is the frequency of CD?
(b) Find all frequent itemsets and their frequency, for itemsets in the subset interval
[B, ABD].
(c) Is ADE frequent? If yes, show its support. If not, why?
Q5. Let C be the set of all closed frequent itemsets and M the set of all maximal frequent itemsets for some database. Prove that M ⊆ C.
Q6. Prove that the closure operator c = i ◦ t satisfies the following properties (X and Y are some itemsets):
(a) Extensive: X ⊆ c(X)
(b) Monotonic: If X ⊆ Y then c(X) ⊆ c(Y)
(c) Idempotent: c(X) = c(c(X))
Table 9.3. Dataset for Q7

Tid    Itemset
1      ACD
2      BCD
3      ACD
4      ABD
5      ABCD
6      BC
Q7. Let δ be an integer. An itemset X is called a δ-free itemset iff for all subsets Y ⊂ X, we have sup(Y) − sup(X) > δ. For any itemset X, we define the δ-closure of X as follows:
δ-closure(X) = {Y | X ⊂ Y, sup(X) − sup(Y) ≤ δ, and Y is maximal}
Consider the database shown in Table 9.3. Answer the following questions:
(a) Given δ = 1, compute all the δ-free itemsets.
(b) For each of the δ-free itemsets, compute its δ-closure for δ = 1.
Q8. Given the lattice of frequent itemsets (along with their supports) shown in Figure 9.8, answer the following questions:
(a) List all the closed itemsets.
(b) Is BCD derivable? What about ABCD? What are the bounds on their supports?
Figure 9.8. Frequent itemset lattice for Q8, with supports: ∅(6), A(6), B(5), C(4), D(3), AB(5), AC(4), AD(3), BC(3), BD(2), CD(2), ABC(3), ABD(2), ACD(2), BCD(1), ABCD(1).
Q9. Prove that if an itemset X is derivable, then so is any superset Y ⊃ X. Using this observation describe an algorithm to mine all nonderivable itemsets.
CHAPTER 10 Sequence Mining
Many real-world applications such as bioinformatics, Web mining, and text mining have to deal with sequential and temporal data. Sequence mining helps discover patterns across time or positions in a given dataset. In this chapter we consider methods to mine frequent sequences, which allow gaps between elements, as well as methods to mine frequent substrings, which do not allow gaps between consecutive elements.
10.1 FREQUENT SEQUENCES
Let Σ denote an alphabet, defined as a finite set of characters or symbols, and let |Σ| denote its cardinality. A sequence or a string is defined as an ordered list of symbols, and is written as s = s1s2...sk, where si ∈ Σ is the symbol at position i, also denoted as s[i]. Here |s| = k denotes the length of the sequence. A sequence of length k is also called a k-sequence. We use the notation s[i : j] = si si+1 ··· sj−1 sj to denote the substring or sequence of consecutive symbols in positions i through j, where j > i. Define the prefix of a sequence s as any substring of the form s[1 : i] = s1s2...si, with 0 ≤ i ≤ n. Also, define the suffix of s as any substring of the form s[i : n] = si si+1 ...sn, with 1 ≤ i ≤ n + 1. Note that s[1 : 0] is the empty prefix, and s[n + 1 : n] is the empty suffix. Let Σ⋆ be the set of all possible sequences that can be constructed using the symbols in Σ, including the empty sequence ∅ (which has length zero).
Let s = s1s2...sn and r = r1r2...rm be two sequences over Σ. We say that r is a subsequence of s, denoted r ⊆ s, if there exists a one-to-one mapping φ : [1, m] → [1, n], such that r[i] = s[φ(i)] and for any two positions i, j in r, i < j =⇒ φ(i) < φ(j). In other words, each position in r is mapped to a different position in s, and the order of symbols is preserved, even though there may be intervening gaps between consecutive elements of r in the mapping. If r ⊆ s, we also say that s contains r. The sequence r is called a consecutive subsequence or substring of s provided r1r2...rm = sj sj+1 ...sj+m−1, i.e., r[1 : m] = s[j : j + m − 1], with 1 ≤ j ≤ n − m + 1. For substrings we do not allow any gaps between the elements of r in the mapping.
Example 10.1. Let Σ = {A,C,G,T}, and let s = ACTGAACG. Then r1 = CGAAG is a subsequence of s, and r2 = CTGA is a substring of s. The sequence r3 = ACT is a prefix of s, and so is r4 = ACTGA, whereas r5 = GAACG is one of the suffixes of s.
Given a database D = {s1,s2,...,sN} of N sequences, and given some sequence r, the support of r in the database D is defined as the total number of sequences in D that contain r
sup(r) = |{si ∈ D | r ⊆ si}|
The relative support of r is the fraction of sequences that contain r
rsup(r) = sup(r)/N
Given a user-specified minsup threshold, we say that a sequence r is frequent in database D if sup(r) ≥ minsup. A frequent sequence is maximal if it is not a subsequence of any other frequent sequence, and a frequent sequence is closed if it is not a subsequence of any other frequent sequence with the same support.
10.2 MINING FREQUENT SEQUENCES
For sequence mining the order of the symbols matters, and thus we have to consider all possible permutations of the symbols as the possible frequent candidates. Contrast this with itemset mining, where we had only to consider combinations of the items. The sequence search space can be organized in a prefix search tree. The root of the tree, at level 0, contains the empty sequence, with each symbol x ∈ Σ as one of its children. As such, a node labeled with the sequence s = s1s2...sk at level k has children of the form s′ = s1s2...sk sk+1 at level k + 1. In other words, s is a prefix of each child s′, which is also called an extension of s.
The subsequence search space is conceptually infinite because it comprises all sequences in Σ∗, that is, all sequences of length zero or more that can be created using symbols in Σ. In practice, the database D consists of bounded length sequences. Let l denote the length of the longest sequence in the database; then, in the worst case, we will have to consider all candidate sequences of length up to l, which gives the following bound on the size of the search space:

|Σ|^1 + |Σ|^2 + ··· + |Σ|^l = O(|Σ|^l)   (10.1)

since at level k there are |Σ|^k possible subsequences of length k.
Table 10.1. Example sequence database

Id    Sequence
s1    CAGAAGT
s2    TGACAG
s3    GAAGT

Example 10.2. Let Σ = {A,C,G,T} and let the sequence database D consist of the three sequences shown in Table 10.1. The sequence search space organized as a prefix search tree is illustrated in Figure 10.1. The support of each sequence is shown within brackets. For example, the node labeled A has three extensions AA, AG, and AT, out of which AT is infrequent if minsup = 3.
ALGORITHM 10.1. Algorithm GSP

GSP (D, Σ, minsup):
  F ← ∅
  C(1) ← {∅} // Initial prefix tree with single symbols
  foreach s ∈ Σ do Add s as child of ∅ in C(1) with sup(s) ← 0
  k ← 1 // k denotes the level
  while C(k) ≠ ∅ do
    COMPUTESUPPORT (C(k), D)
    foreach leaf r ∈ C(k) do
      if sup(r) ≥ minsup then F ← F ∪ {(r, sup(r))}
      else remove r from C(k)
    C(k+1) ← EXTENDPREFIXTREE (C(k))
    k ← k + 1
  return F

COMPUTESUPPORT (C(k), D):
  foreach si ∈ D do
    foreach r ∈ C(k) do
      if r ⊆ si then sup(r) ← sup(r) + 1

EXTENDPREFIXTREE (C(k)):
  foreach leaf ra ∈ C(k) do
    foreach leaf rb ∈ CHILDREN(PARENT(ra)) do
      rab ← ra + rb[k] // extend ra with the last item of rb
      // prune if there is any infrequent subsequence
      if rc ∈ C(k), for all rc ⊂ rab such that |rc| = |rab| − 1, then
        Add rab as child of ra with sup(rab) ← 0
    if no extensions from ra then
      remove ra, and all ancestors of ra with no extensions, from C(k)
  return C(k)
10.2.1 Level-wise Mining: GSP
We can devise an effective sequence mining algorithm that searches the sequence prefix tree using a level-wise or breadth-first search. Given the set of frequent sequences at level k, we generate all possible sequence extensions or candidates at level k + 1. We next compute the support of each candidate and prune those that are not frequent. The search stops when no more frequent extensions are possible.
Figure 10.1. Sequence search space: shaded ovals represent candidates that are infrequent; those without support in brackets can be pruned based on an infrequent subsequence. Unshaded ovals represent frequent sequences.
The pseudo-code for the level-wise, generalized sequential pattern (GSP) mining method is shown in Algorithm 10.1. It uses the antimonotonic property of support to prune candidate patterns, that is, no supersequence of an infrequent sequence can be frequent, and all subsequences of a frequent sequence must be frequent. The prefix search tree at level k is denoted C(k). Initially C(1) comprises all the symbols in Σ. Given the current set of candidate k-sequences C(k), the method first computes their support (line 6). For each database sequence si ∈ D, we check whether a candidate sequence r ∈ C(k) is a subsequence of si. If so, we increment the support of r. Once the frequent sequences at level k have been found, we generate the candidates for level k + 1 (line 10). For the extension, each leaf ra is extended with the last symbol of any other leaf rb that shares the same prefix (i.e., has the same parent), to obtain the new candidate (k + 1)-sequence rab = ra + rb[k] (line 18). If the new candidate rab contains any infrequent k-sequence, we prune it.
Example 10.3. For example, let us mine the database shown in Table 10.1 using minsup = 3. That is, we want to find only those subsequences that occur in all three database sequences. Figure 10.1 shows that we begin by extending the empty sequence ∅ at level 0, to obtain the candidates A, C, G, and T at level 1. Out of these C can be pruned because it is not frequent. Next we generate all possible candidates at level 2. Notice that using A as the prefix we generate all possible extensions AA, AG, and AT. A similar process is repeated for the other two symbols G and T. Some candidate extensions can be pruned without counting. For example, the extension GAAA obtained from GAA can be pruned because it has an infrequent subsequence AAA. The figure shows all the frequent sequences (unshaded), out of which GAAG(3) and T(3) are the maximal ones.
The computational complexity of GSP is O(|Σ|^l) as per Eq. (10.1), where l is the length of the longest frequent sequence. The I/O complexity is O(l · D) because we compute the support of an entire level in one scan of the database.
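A compact Python rendering of the level-wise search is given below. It keeps each level as a flat dictionary of candidate tuples instead of an explicit prefix tree, which is a simplification of Algorithm 10.1 made for readability; the subsequence test, candidate join, and subsequence-based pruning follow the description above.

def is_subseq(r, s):
    """True if r is a subsequence of s (gaps allowed, order preserved)."""
    it = iter(s)
    return all(c in it for c in r)

def gsp(D, alphabet, minsup):
    F = {}
    level = [(x,) for x in alphabet]                      # candidate 1-sequences
    while level:
        freq = {r: sum(is_subseq(r, s) for s in D) for r in level}
        freq = {r: c for r, c in freq.items() if c >= minsup}
        F.update(freq)
        nxt = []
        for ra in freq:                                   # join pairs sharing a (k-1)-prefix
            for rb in freq:
                if ra[:-1] == rb[:-1]:
                    cand = ra + (rb[-1],)
                    # prune if any subsequence of length k is infrequent
                    if all(cand[:i] + cand[i + 1:] in freq for i in range(len(cand))):
                        nxt.append(cand)
        level = nxt
    return F

D = ["CAGAAGT", "TGACAG", "GAAGT"]
F = gsp(D, "ACGT", minsup=3)
# F[('G', 'A', 'A', 'G')] == 3, the maximal frequent sequence GAAG of Example 10.3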
10.2.2 Vertical Sequence Mining: Spade
The Spade algorithm uses a vertical database representation for sequence mining. The idea is to record for each symbol the sequence identifiers and the positions where it occurs. For each symbol s ∈ Σ, we keep a set of tuples of the form ⟨i, pos(s)⟩, where pos(s) is the set of positions in the database sequence si ∈ D where symbol s appears. Let L(s) denote the set of such sequence-position tuples for symbol s, which we refer to as the poslist. The set of poslists for each symbol s ∈ Σ thus constitutes a vertical representation of the input database. In general, given a k-sequence r, its poslist L(r) maintains the list of positions for the occurrences of the last symbol r[k] in each database sequence si, provided r ⊆ si. The support of sequence r is simply the number of distinct sequences in which r occurs, that is, sup(r) = |L(r)|.
Example 10.4. In Table 10.1, the symbol A occurs in s1 at positions 2, 4, and 5. Thus, we add the tuple ⟨1,{2,4,5}⟩ to L(A). Because A also occurs at positions 3 and 5 in sequence s2, and at positions 2 and 3 in s3, the complete poslist for A is {⟨1,{2,4,5}⟩, ⟨2,{3,5}⟩, ⟨3,{2,3}⟩}. We have sup(A) = 3, as its poslist contains three tuples. Figure 10.2 shows the poslist for each symbol, as well as other sequences. For example, for sequence GT, we find that it is a subsequence of s1 and s3. Even though there are two occurrences of GT in s1, the last symbol T occurs at position 7 in both occurrences, thus the poslist for GT has the tuple ⟨1,7⟩. The full poslist for GT is L(GT) = {⟨1,7⟩, ⟨3,5⟩}. The support of GT is sup(GT) = |L(GT)| = 2.
Figure 10.2. Sequence mining via Spade: infrequent sequences with at least one occurrence are shown shaded; those with zero support are not shown.
Support computation in Spade is done via sequential join operations. Given the poslists for any two k-sequences ra and rb that share the same (k − 1) length prefix, the idea is to perform sequential joins on the poslists to compute the support for the new (k + 1) length candidate sequence rab = ra + rb[k]. Given a tuple ⟨i, pos(rb[k])⟩ ∈ L(rb), we first check if there exists a tuple ⟨i, pos(ra[k])⟩ ∈ L(ra), that is, both sequences must occur in the same database sequence si. Next, for each position p ∈ pos(rb[k]), we check whether there exists a position q ∈ pos(ra[k]) such that q < p. If yes, this means that the symbol rb[k] occurs after the last position of ra, and thus we retain p as a valid occurrence of rab. The poslist L(rab) comprises all such valid occurrences. Notice how we keep track of positions only for the last symbol in the candidate sequence. This is because we extend sequences from a common prefix, so there is no need to keep track of all the occurrences of the symbols in the prefix. We denote the sequential join as L(rab) = L(ra) ∩ L(rb).
The main advantage of the vertical approach is that it enables different search strategies over the sequence search space, including breadth or depth-first search. Algorithm 10.2 shows the pseudo-code for Spade. Given a set of sequences P that share the same prefix, along with their poslists, the method creates a new prefix equivalence class Pa for each sequence ra ∈ P by performing sequential joins with every sequence rb ∈ P , including self-joins. After removing the infrequent extensions, the new equivalence class Pa is then processed recursively.
ALGORITHM 10.2. Algorithm SPADE

// Initial Call: F ← ∅, k ← 0,
//   P ← {⟨s, L(s)⟩ | s ∈ Σ, sup(s) ≥ minsup}
SPADE (P, minsup, F, k):
  foreach ra ∈ P do
    F ← F ∪ {(ra, sup(ra))}
    Pa ← ∅
    foreach rb ∈ P do
      rab = ra + rb[k]
      L(rab) = L(ra) ∩ L(rb)
      if sup(rab) ≥ minsup then
        Pa ← Pa ∪ {⟨rab, L(rab)⟩}
    if Pa ≠ ∅ then SPADE (Pa, minsup, F, k + 1)
Example 10.5. Consider the poslists for A and G shown in Figure 10.2. To obtain L(AG), we perform a sequential join over the poslists L(A) and L(G). For the tuples ⟨1,{2,4,5}⟩ ∈ L(A) and ⟨1,{3,6}⟩ ∈ L(G), both positions 3 and 6 for G, occur after some occurrence of A, for example, at position 2. Thus, we add the tuple ⟨1, {3, 6}⟩ to L(AG). The complete poslist for AG is L(AG) = {⟨1, {3, 6}⟩, ⟨2, 6⟩, ⟨3, 4⟩}.
Figure 10.2 illustrates the complete working of the Spade algorithm, along with all the candidates and their poslists.
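The sequential join itself is only a few lines of Python. The sketch below assumes each poslist is a dictionary mapping a sequence id to the set of positions of the sequence's last symbol, which mirrors the ⟨i, pos⟩ tuples used above.

def temporal_join(La, Lb):
    """L(r_ab) = L(r_a) ∩ L(r_b): keep positions of r_b's last symbol that occur
    strictly after some occurrence of r_a's last symbol in the same sequence."""
    Lab = {}
    for sid, pos_b in Lb.items():
        if sid in La:
            first_a = min(La[sid])
            valid = {p for p in pos_b if p > first_a}
            if valid:
                Lab[sid] = valid
    return Lab

# Poslists for A and G in the database of Table 10.1
L_A = {1: {2, 4, 5}, 2: {3, 5}, 3: {2, 3}}
L_G = {1: {3, 6}, 2: {2, 6}, 3: {1, 4}}
L_AG = temporal_join(L_A, L_G)   # {1: {3, 6}, 2: {6}, 3: {4}}, so sup(AG) = len(L_AG) = 3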
10.2.3 Projection-Based Sequence Mining: PrefixSpan
Let D denote a database, and let s ∈ Σ be any symbol. The projected database with respect to s, denoted Ds, is obtained by finding the first occurrence of s in si, say at position p. Next, we retain in Ds only the suffix of si starting at position p + 1. Further, any infrequent symbols are removed from the suffix. This is done for each sequence si ∈ D.

Example 10.6. Consider the three database sequences in Table 10.1. Given that the symbol G first occurs at position 3 in s1 = CAGAAGT, the projection of s1 with respect to G is the suffix AAGT. The projected database for G, denoted DG, is therefore given as: {s1 : AAGT, s2 : AAG, s3 : AAGT}.
The main idea in PrefixSpan is to compute the support for only the individual symbols in the projected database Ds , and then to perform recursive projections on the frequent symbols in a depth-first manner. The PrefixSpan method is outlined in Algorithm 10.3. Here r is a frequent subsequence, and Dr is the projected dataset for r. Initially r is empty and Dr is the entire input dataset D. Given a database of (projected) sequences Dr, PrefixSpan first finds all the frequent symbols in the projected dataset. For each such symbol s, we extend r by appending s to obtain the new frequent subsequence rs . Next, we create the projected dataset Ds by projecting Dr on symbol s. A recursive call to PrefixSpan is then made with rs and Ds.
ALGORITHM 10.3. Algorithm PREFIXSPAN

// Initial Call: Dr ← D, r ← ∅, F ← ∅
PREFIXSPAN (Dr, r, minsup, F):
  foreach s ∈ Σ such that sup(s, Dr) ≥ minsup do
    rs = r + s // extend r by symbol s
    F ← F ∪ {(rs, sup(s, Dr))}
    Ds ← ∅ // create projected data for symbol s
    foreach si ∈ Dr do
      s′i ← projection of si w.r.t. symbol s
      Remove any infrequent symbols from s′i
      Add s′i to Ds if s′i ≠ ∅
    if Ds ≠ ∅ then PREFIXSPAN (Ds, rs, minsup, F)
Example 10.7. Figure 10.3 shows the projection-based PrefixSpan mining approach for the example dataset in Table 10.1 using minsup = 3. Initially we start with the whole database D, which can also be denoted as D∅. We compute the support of each symbol, and find that C is not frequent (shown crossed out). Among the frequent symbols, we first create a new projected dataset DA. For s1, we find that the first A occurs at position 2, so we retain only the suffix GAAGT. In s2, the first A occurs at position 3, so the suffix is CAG. After removing C (because it is infrequent), we are left with only AG as the projection of s2 on A. In a similar manner we obtain the projection for s3 as AGT. The left child of the root shows the final projected dataset DA. Now the mining proceeds recursively. Given DA, we count the symbol supports in DA, finding that only A and G are frequent, which will lead to the projection DAA and then DAG, and so on. The complete projection-based approach is illustrated in Figure 10.3.
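The projection-based recursion is short enough to sketch directly over plain strings; the function below follows Algorithm 10.3, with the projected databases represented as lists of suffix strings. Symbol frequencies are counted per projected database, and infrequent symbols are dropped from the suffixes, as in Example 10.6. It is a sketch meant to mirror the example, not an optimized implementation.

from collections import Counter

def prefixspan(Dr, r, minsup, F):
    """Dr: list of (projected) sequences as strings; r: the current prefix."""
    counts = Counter()
    for s in Dr:
        counts.update(set(s))                # per-sequence symbol supports in Dr
    for sym, sup in counts.items():
        if sup < minsup:
            continue
        rs = r + sym
        F[rs] = sup
        Ds = []
        for s in Dr:
            i = s.find(sym)                  # first occurrence of sym
            if i >= 0:
                suffix = "".join(c for c in s[i + 1:] if counts[c] >= minsup)
                if suffix:
                    Ds.append(suffix)
        if Ds:
            prefixspan(Ds, rs, minsup, F)

F = {}
prefixspan(["CAGAAGT", "TGACAG", "GAAGT"], "", 3, F)
# F["GAAG"] == 3, in agreement with the result of Example 10.7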
Figure 10.3. Projection-based sequence mining: PrefixSpan.
10.3 SUBSTRING MINING VIA SUFFIX TREES
We now look at efficient methods for mining frequent substrings. Let s be a sequence having length n; then there are at most O(n^2) possible distinct substrings contained in s. To see this consider substrings of length w, of which there are n − w + 1 possible ones in s. Adding over all substring lengths we get

Σ_{w=1}^{n} (n − w + 1) = n + (n − 1) + ··· + 2 + 1 = O(n^2)

This is a much smaller search space compared to subsequences, and consequently we can design more efficient algorithms for solving the frequent substring mining task. In fact, we can mine all the frequent substrings in worst case O(Nn^2) time for a dataset D = {s1, s2, ..., sN} with N sequences.
10.3.1 Suffix Tree
Let Σ denote the alphabet, and let $ ∉ Σ be a terminal character used to mark the end of a string. Given a sequence s, we append the terminal character so that s = s1s2...sn sn+1, where sn+1 = $, and the jth suffix of s is given as s[j : n + 1] = sj sj+1 ...sn+1. The suffix tree of the sequences in the database D, denoted T, stores all the suffixes for each si ∈ D in a tree structure, where suffixes that share a common prefix lie on the same path from the root of the tree. The substring obtained by concatenating all the symbols from the root node to a node v is called the node label of v, and is denoted as L(v). The substring that appears on an edge (va,vb) is called an edge label, and is denoted as L(va,vb). A suffix tree has two kinds of nodes: internal and leaf nodes. An internal node in the suffix tree (except for the root) has at least two children, where each edge label to a child begins with a different symbol. Because the terminal character is unique, there are as many leaves in the suffix tree as there are unique suffixes over all the sequences. Each leaf node corresponds to a suffix shared by one or more sequences in D.
It is straightforward to obtain a quadratic time and space suffix tree construction algorithm. Initially, the suffix tree T is empty. Next, for each sequence si ∈ D, with |si | = ni , we generate all its suffixes si [j : ni + 1], with 1 ≤ j ≤ ni , and insert each of them into the tree by following the path from the root until we either reach a leaf or there is a mismatch in one of the symbols along an edge. If we reach a leaf, we insert the pair (i,j) into the leaf, noting that this is the jth suffix of sequence si. If there is a mismatch in one of the symbols, say at position p ≥ j, we add an internal vertex just before the mismatch, and create a new leaf node containing (i,j) with edge label si[p:ni +1].
Example 10.8. Consider the database in Table 10.1 with three sequences. In particular, let us focus on s1 = CAGAAGT. Figure 10.4 shows what the suffix tree T looks like after inserting the j th suffix of s1 into T . The first suffix is the entire sequence s1 appended with the terminal symbol; thus the suffix tree contains a single leaf containing (1, 1) under the root (Figure 10.4a). The second suffix is AGAAGT$, and Figure 10.4b shows the resulting suffix tree, which now has two leaves. The third
268
Sequence Mining
(1,1)
(1,2)
(1,1)
(1,2)
(1,1)
(1,3)
(1,4)
(1,2)
(1,1)
(1,3)
(a) j = 1
(b) j = 2
(c) j = 3
(d) j = 4
(1,1)
(1,3)
(1,1)
(1,6)
(1,1)
(1,7)
(1,4)
(1,4)
(1,4)
(1,3)
(1,3)
(1,6)
(1,2)
(1,5)
(1,2)
(1,5)
(1,2)
(1,5)
(e) j = 5
(f) j = 6 (g) j = 7
Figure 10.4. Suffix tree construction: (a)–(g) show the successive changes to the tree, after we add the jth suffix of s1 = CAGAAGT$ for j = 1,...,7.
suffix GAAGT$ begins with G, which has not yet been observed, so it creates a new leaf in T under the root. The fourth suffix AAGT$ shares the prefix A with the second suffix, so it follows the path beginning with A from the root. However, because there is a mismatch at position 2, we create an internal node right before it and insert the leaf (1,4), as shown in Figure 10.4d. The suffix tree obtained after inserting all of the suffixes of s1 is shown in Figure 10.4g, and the complete suffix tree for all three sequences is shown in Figure 10.5.
Figure 10.5. Suffix tree for all three sequences in Table 10.1. Internal nodes store support information. Leaves also record the support (not shown).
In terms of the time and space complexity, the algorithm sketched above requires O(Nn²) time and space, where N is the number of sequences in D, and n is the longest sequence length. The time complexity follows from the fact that the method always inserts a new suffix starting from the root of the suffix tree. This means that in the worst case it compares O(n) symbols per suffix insertion, giving the worst case bound of O(n²) over all n suffixes. The space complexity comes from the fact that each suffix is explicitly represented in the tree, taking n + (n − 1) + ··· + 1 = O(n²) space. Over all the N sequences in the database, we obtain O(Nn²) as the worst case time and space bounds.
Frequent Substrings
Once the suffix tree is built, we can compute all the frequent substrings by checking how many different sequences appear in a leaf node or under an internal node. The node labels for the nodes with support at least minsup yield the set of frequent substrings; all the prefixes of such node labels are also frequent. The suffix tree can also support ad hoc queries for finding all the occurrences in the database for any query substring q. For each symbol in q, we follow the path from the root until all symbols in q have been seen, or until there is a mismatch at any position. If q is found, then the set of leaves under that path is the list of occurrences of the query q. On the other hand, if there is a mismatch, then the query does not occur in the database. In terms of the query time complexity, because we have to match each character in q, we immediately get O(|q|) as the time bound (assuming that |Σ| is a constant), which is independent of the size of the database. Listing all the matches takes additional time, for a total time complexity of O(|q| + k), if there are k matches.
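Both operations are easy to express over the sketch above. The functions below (names illustrative) compute the distinct-sequence support of a node and answer an ad hoc query q; the expected outputs assume the three sequences used earlier.

def node_support(node):
    """Set of distinct sequence ids appearing at or below this node."""
    ids = {i for (i, _) in node.leaves}
    for child in node.children.values():
        ids |= node_support(child)
    return ids

def find_occurrences(root, q):
    """All (sequence id, position) pairs at which the substring q occurs."""
    node, offset = root, 0                     # offset: how much of q is matched
    while offset < len(q):
        for label, child in node.children.items():
            if label[0] == q[offset]:
                k = 0                          # match q against this edge label
                while k < len(label) and offset + k < len(q):
                    if label[k] != q[offset + k]:
                        return []              # mismatch: q does not occur
                    k += 1
                node, offset = child, offset + k
                break
        else:
            return []                          # no edge starts with q[offset]
    occurrences, stack = [], [node]            # every leaf below node matches q
    while stack:
        v = stack.pop()
        occurrences.extend(v.leaves)
        stack.extend(v.children.values())
    return occurrences

print(find_occurrences(T, "GAA"))   # expected [(1, 3), (3, 1)], as in Example 10.9
print(len(node_support(T)))         # 3: all three sequences lie below the root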
Example 10.9. Consider the suffix tree shown in Figure 10.5, which stores all the suffixes for the sequence database in Table 10.1. To facilitate frequent substring enumeration, we store the support for each internal as well as leaf node, that is, we store the number of distinct sequence ids that occur at or under each node. For example, the leftmost child of the root node on the path labeled A has support 3 because there are three distinct sequences under that subtree. If minsup = 3, then the frequent substrings are A, AG, G, GA, and T. Out of these, the maximal ones are AG, GA, and T. If minsup = 2, then the maximal frequent substrings are GAAGT and CAG.
For ad hoc querying consider q = GAA. Searching for symbols in q starting from the root leads to the leaf node containing the occurrences (1,3) and (3,1), which means that GAA appears at position 3 in s1 and at position 1 in s3. On the other hand if q = CAA, then the search terminates with a mismatch at position 3 after following the branch labeled CAG from the root. This means that q does not occur in the database.
10.3.2 Ukkonen’s Linear Time Algorithm
We now present a linear time and space algorithm for constructing suffix trees. We first consider how to build the suffix tree for a single sequence s = s1 s2 . . . sn sn+1 , with sn+1 = $. The suffix tree for the entire dataset of N sequences can be obtained by inserting each sequence one by one.
Achieving Linear Space
Let us see how to reduce the space requirements of a suffix tree. If an algorithm stores all the symbols on each edge label, then the space complexity is O(n²), and we cannot achieve linear time construction either. The trick is to not explicitly store all the edge labels, but rather to use an edge-compression technique, where we store only the starting and ending positions of the edge label in the input string s. That is, if an edge label is given as s[i : j], then we represent it as the interval [i, j].
Example 10.10. Consider the suffix tree for s1 = CAGAAGT$ shown in Figure 10.4g. The edge label CAGAAGT$ for the suffix (1,1) can be represented via the interval [1,8] because the edge label denotes the substring s1[1 : 8]. Likewise, the edge label AAGT$ leading to suffix (1, 2) can be compressed as [4, 8] because AAGT$ = s1[4 : 8]. The complete suffix tree for s1 with compressed edge labels is shown in Figure 10.6.
In terms of space complexity, note that when we add a new suffix to the tree T, it can create at most one new internal node. As there are n suffixes, there are n leaves in T and at most n internal nodes. With at most 2n nodes, the tree has at most 2n − 1 edges, and thus the total space required to store an interval for each edge is 2(2n − 1) = 4n − 2 = O(n).
Figure 10.6. Suffix tree for s1 = CAGAAGT$ using edge-compression.

Achieving Linear Time
Ukkonen’s method is an online algorithm, that is, given a string s = s1s2 ...sn$ it constructs the full suffix tree in phases. Phase i builds the tree up to the i-th symbol in s, that is, it updates the suffix tree from the previous phase by adding the next symbol si . Let Ti denote the suffix tree up to the ith prefix s[1 : i], with 1 ≤ i ≤ n. Ukkonen’s algorithm constructs Ti from Ti−1, by making sure that all suffixes including the current character si are in the new intermediate tree Ti. In other words, in the ith phase, it inserts all the suffixes s[j : i] from j = 1 to j = i into the tree Ti . Each such insertion is called the jth extension of the ith phase. Once we process the terminal character at position n + 1 we obtain the final suffix tree T for s.
Algorithm 10.4 shows the code for a naive implementation of Ukkonen's approach. This method has cubic time complexity because obtaining Ti from Ti−1 takes O(i²) time, with the last phase requiring O(n²) time. With n phases, the total time is O(n³). Our goal is to show that this time can be reduced to just O(n) via the optimizations described in the following paragraphs.
ALGORITHM 10.4. Algorithm NAIVEUKKONEN

NAIVEUKKONEN (s):
    n ← |s|
    s[n + 1] ← $ // append terminal character
    T ← ∅ // add empty string as root
    foreach i = 1, ..., n + 1 do // phase i − construct Ti
        foreach j = 1, ..., i do // extension j for phase i
            // Insert s[j : i] into the suffix tree
            Find end of the path with label s[j : i − 1] in T
            Insert si at end of path
    return T

Implicit Suffixes This optimization states that, in phase i, if the jth extension s[j : i] is found in the tree, then any subsequent extensions will also be found, and consequently there is no need to process further extensions in phase i. Thus, the suffix tree Ti at the end of phase i has implicit suffixes corresponding to extensions j + 1 through i. It is important to note that all suffixes will become explicit the first time we encounter a new substring that does not already exist in the tree. This will surely happen in phase n + 1 when we process the terminal character $, as it cannot occur anywhere else in s (after all, $ ∉ Σ).
Implicit Extensions Let the current phase be i, and let l ≤ i − 1 be the last explicit suffix in the previous tree Ti−1. All explicit suffixes in Ti−1 have edge labels of the form [x , i − 1] leading to the corresponding leaf nodes, where the starting position x is node specific, but the ending position must be i − 1 because si−1 was added to the end of these paths in phase i − 1. In the current phase i, we would have to extend these paths by adding si at the end. However, instead of explicitly incrementing all the ending positions, we can replace the ending position by a pointer e which keeps track of the current phase being processed. If we replace [x,i − 1] with [x,e], then in phase i, if we set e = i, then immediately all the l existing suffixes get implicitly extended to [x,i]. Thus, in one operation of incrementing e we have, in effect, taken care of extensions 1 through l for phase i.
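A tiny sketch of this pointer trick (the End class and all names are illustrative assumptions): every leaf edge stores a reference to one shared, mutable end position, so a single assignment extends all open leaf edges at once.

class End:
    def __init__(self, value=0):
        self.value = value          # the current phase, shared by all leaf edges

e = End()
leaf_edges = [(1, e), (3, e), (5, e)]   # three [x, e] intervals on leaf edges

e.value = 7   # one assignment: every leaf edge is implicitly extended to position 7
print([(start, end.value) for (start, end) in leaf_edges])   # [(1, 7), (3, 7), (5, 7)]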
Example 10.11. Let s1 = CAGAAGT$. Assume that we have already performed the first six phases, which result in the tree T6 shown in Figure 10.7a. The last explicit suffix in T6 is l = 4. In phase i = 7 we have to execute the following extensions:
CAGAAGT    extension 1
AGAAGT     extension 2
GAAGT      extension 3
AAGT       extension 4
AGT        extension 5
GT         extension 6
T          extension 7
At the start of the seventh phase, we set e = 7, which yields implicit extensions for all suffixes explicitly in the tree, as shown in Figure 10.7b. Notice how symbol s7 = T is now implicitly on each of the leaf edges, for example, the label [5, e] = AG in T6 now becomes [5, e] = AGT in T7 . Thus, the first four extensions listed above are taken care of by simply incrementing e. To complete phase 7 we have to process the remaining extensions.
Figure 10.7. Implicit extensions in phase i = 7: (a) T6; (b) T7 after extensions j = 1,...,4. The last explicit suffix in T6 is l = 4 (shown double-circled). Edge labels are shown for convenience; only the intervals are stored.
Skip/Count Trick For the jth extension of phase i, we have to search for the substring s[j : i − 1] so that we can add si at the end. However, note that this string must exist in Ti−1 because we have already processed symbol si−1 in the previous phase. Thus, instead of searching for each character in s[j : i − 1] starting from the root, we first count the number of symbols on the edge beginning with character sj ; let this length be m. If m is longer than the length of the substring (i.e., if m > i − j), then the substring must end on this edge, so we simply jump to position i − j and insert si . On the other hand, if m ≤ i − j, then we can skip directly to the child node, say vc, and search for the remaining string s[j + m : i − 1] from vc using the same skip/count technique. With this optimization, the cost of an extension becomes proportional to the number of nodes on the path, as opposed to the number of characters in s[j :i−1].
Suffix Links We saw that with the skip/count optimization we can search for the substring s[j : i − 1] by following nodes from parent to child. However, we still have to start from the root node each time. We can avoid searching from the root via the use of suffix links. For each internal node va we maintain a link to the internal node vb, where L(vb) is the immediate suffix of L(va). In extension j − 1, let vp denote the internal node under which we find s[j − 1 : i], and let m be the length of the node label of vp. To insert the jth extension s[j : i], we follow the suffix link from vp to another node, say vs, and search for the remaining substring s[j + m − 1 : i − 1] from vs. The use of suffix links allows us to jump internally within the tree for different extensions, as opposed to searching from the root each time. As a final observation, if extension j creates a new internal node, then its suffix link will point to the new internal node that will be created during extension j + 1.

ALGORITHM 10.5. Algorithm UKKONEN

    UKKONEN (s):
1       n ← |s|
2       s[n + 1] ← $ // append terminal character
3       T ← ∅ // add empty string as root
4       l ← 0 // last explicit suffix
5       foreach i = 1, ..., n + 1 do // phase i − construct Ti
6           e ← i // implicit extensions
7           foreach j = l + 1, ..., i do // extension j for phase i
                // Insert s[j : i] into the suffix tree
8               Find end of s[j : i − 1] in T via skip/count and suffix links
9               if si ∈ T then // implicit suffixes
10                  break
11              else
12                  Insert si at end of path
13                  Set last explicit suffix l if needed
14      return T
The pseudo-code for the optimized Ukkonen’s algorithm is shown in Algorithm 10.5. It is important to note that it achieves linear time and space only with all of the optimizations in conjunction, namely implicit extensions (line 6), implicit suffixes (line 9), and skip/count and suffix links for inserting extensions in T (line 8).
Example 10.12. Let us look at the execution of Ukkonen’s algorithm on the sequence s1 = CAGAAGT$, as shown in Figure 10.8. In phase 1, we process character s1 = C and insert the suffix (1, 1) into the tree with edge label [1, e] (see Figure 10.8a). In phases 2 and 3, new suffixes (1,2) and (1,3) are added (see Figures 10.8b–10.8c). For phase 4, when we want to process s4 = A, we note that all suffixes up to l = 3 are already explicit. Setting e = 4 implicitly extends all of them, so we have only to make sure that the last extension (j = 4) consisting of the single character A is in the tree. Searching from the root, we find A in the tree implicitly, and we thus proceed to the next phase. In the next phase, we set e = 5, and the suffix (1,4) becomes explicit when we try to add the extension AA, which is not in the tree. For e = 6, we find the extension AG already in the tree and we skip ahead to the next phase. At this point the last explicit suffix is still (1,4). For e = 7, T is a previously unseen symbol, and so all suffixes will become explicit, as shown in Figure 10.8g.
Figure 10.8. Ukkonen's linear time algorithm for suffix tree construction. Steps (a) T1 through (g) T7 show the successive changes to the tree after the ith phase. The suffix links are shown with dashed lines. The double-circled leaf denotes the last explicit suffix in the tree. The last step is not shown because when e = 8, the terminal character $ does not alter the tree. All the edge labels are shown for ease of understanding, although the actual suffix tree keeps only the intervals for each edge.

It is instructive to see the extensions in the last phase (i = 7). As described in Example 10.11, the first four extensions will be done implicitly. Figure 10.9a shows the suffix tree after these four extensions. For extension 5, we begin at the last explicit
leaf, follow its parent’s suffix link, and begin searching for the remaining characters from that point. In our example, the suffix link points to the root, so we search for s[5 : 7] = AGT from the root. We skip to node vA, and look for the remaining string GT, which has a mismatch inside the edge [3,e]. We thus create a new internal node after G, and insert the explicit suffix (1,5), as shown in Figure 10.9b. The next extension s[6 : 7] = GT begins at the newly created leaf node (1,5). Following the closest suffix link leads back to the root, and a search for GT gets a mismatch on the edge out of the root to leaf (1, 3). We then create a new internal node vG at that point, add a suffix link from the previous internal node vAG to vG, and add a new explicit leaf (1, 6), as shown in Figure 10.9c. The last extension, namely j = 7, corresponding
to s[7 : 7] = T, results in making all the suffixes explicit because the symbol T has been seen for the first time. The resulting tree is shown in Figure 10.8g.

Figure 10.9. Extensions in phase i = 7: (a) extensions 1–4; (b) extension 5: AGT; (c) extension 6: GT. Initially the last explicit suffix is l = 4 and is shown double-circled. All the edge labels are shown for convenience; the actual suffix tree keeps only the intervals for each edge.
Once s1 has been processed, we can then insert the remaining sequences in the database D into the existing suffix tree. The final suffix tree for all three sequences is shown in Figure 10.5, with additional suffix links (not shown) from all the internal nodes.
Ukkonen's algorithm has time complexity of O(n) for a sequence of length n because it does only a constant amount of work (amortized) to make each suffix explicit. Note that, for each phase, a certain number of extensions are done implicitly just by incrementing e. Out of the i extensions from j = 1 to j = i, let us say that l are done implicitly. For the remaining extensions, we stop the first time some suffix is implicitly in the tree; let that extension be k. Thus, phase i needs to add explicit suffixes only for suffixes l + 1 through k − 1. For creating each explicit suffix, we perform a constant number of operations, which include following the closest suffix link, skip/counting to look for the first mismatch, and inserting, if needed, a new suffix leaf node. Because each leaf becomes explicit only once, and the number of skip/count steps is bounded by O(n) over the whole tree, we get a worst-case O(n) time algorithm. The total time over the entire database of N sequences is thus O(Nn), where n is the longest sequence length.

10.4 FURTHER READING
The level-wise GSP method for mining sequential patterns was proposed in Srikant and Agrawal (1996). Spade is described in Zaki (2001), and the PrefixSpan algorithm in Pei et al. (2004). Ukkonen’s linear time suffix tree construction method appears in Ukkonen (1995). For an excellent introduction to suffix trees and their numerous applications see Gusfield (1997); the suffix tree description in this chapter has been heavily influenced by it.
Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. New York: Cambridge University Press.
Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., and Hsu, M.-C. (2004). Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16 (11): 1424–1440.
Srikant, R. and Agrawal, R. (1996). Mining sequential patterns: Generalizations and performance improvements. Proceedings of the 5th International Conference on Extending Database Technology. New York: Springer-Verlag, pp. 1–17.
Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14 (3): 249–260.
Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42 (1–2): 31–60.

10.5 EXERCISES
Q1. Consider the database shown in Table 10.2. Answer the following questions:
(a) Let minsup = 4. Find all frequent sequences.
(b) Given that the alphabet is Σ = {A, C, G, T}, how many possible sequences of length k can there be?
Table 10.2. Sequence database for Q1

Id    Sequence
s1    AATACAAGAAC
s2    GTATGGTGAT
s3    AACATGGCCAA
s4    AAGCGTGGTCAA

Q2. Given the DNA sequence database in Table 10.3, answer the following questions using minsup = 4:
(a) Find the maximal frequent sequences.
(b) Find all the closed frequent sequences.
(c) Find the maximal frequent substrings.
(d) Show how Spade would work on this dataset.
(e) Show the steps of the PrefixSpan algorithm.
Table 10.3. Sequence database for Q2

Id    Sequence
s1    ACGTCACG
s2    TCGA
s3    GACTGCA
s4    CAGTC
s5    AGCT
s6    TGCAGCTC
s7    AGTCAG
Q3. Given s = AABBACBBAA and Σ = {A, B, C}. Define support as the number of occurrences of a subsequence in s. Using minsup = 2, answer the following questions:
(a) Show how the vertical Spade method can be extended to mine all frequent substrings (consecutive subsequences) in s.
(b) Construct the suffix tree for s using Ukkonen's method. Show all intermediate steps, including all suffix links.
(c) Using the suffix tree from the previous step, find all the occurrences of the query q = ABBA allowing for at most two mismatches.
(d) Show the suffix tree when we add another character A just before the $. That is, you must undo the effect of adding the $, add the new symbol A, and then add $ back again.
(e) Describe an algorithm to extract all the maximal frequent substrings from a suffix tree. Show all maximal frequent substrings in s.
Q4. Consider a bitvector based approach for mining frequent subsequences. For instance, in Table 10.2, for s1, the symbol C occurs at positions 5 and 11. Thus, the bitvector for C in s1 is given as 00001000001. Because C does not appear in s2 its bitvector can be omitted for s2. The complete set of bitvectors for symbol C is
(s1 , 00001000001) (s3 , 00100001100) (s4 , 000100000100)
Given the set of bitvectors for each symbol, show how we can mine all frequent subsequences by using bit operations on the bitvectors. Show the frequent subsequences and their bitvectors using minsup = 4.
Q5. Consider the database shown in Table 10.4. Each sequence comprises itemset events that happen at the same time. For example, sequence s1 can be considered to be a sequence of itemsets (AB)10(B)20(AB)30(AC)40, where symbols within brackets are considered to co-occur at the same time, which is given in the subscripts. Describe an algorithm that can mine all the frequent subsequences over itemset events. The
itemsets can be of any length as long as they are frequent. Find all frequent itemset sequences with minsup = 3.

Table 10.4. Sequences for Q5

Id    Time    Items
s1    10      A, B
      20      B
      30      A, B
      40      A, C
s2    20      A, C
      30      A, B, C
      50      B
s3    10      A
      30      B
      40      A
      50      C
      60      B
s4    30      A, B
      40      A
      50      B
      60      C
Q6. The suffix tree shown in Figure 10.5 contains all suffixes for the three sequences s1,s2,s3 in Table 10.1. Note that a pair (i,j) in a leaf denotes the jth suffix of sequence si.
(a) Add a new sequence s4 = GAAGCAGAA to the existing suffix tree, using the
Ukkonen algorithm. Show the last character position (e), along with the suffixes (l) as they become explicit in the tree for s4. Show the final suffix tree after all suffixes of s4 have become explicit.
(b) Find all closed frequent substrings with minsup = 2 using the final suffix tree.
Q7. Given the following three sequences:

s1: GAAGT
s2: CAGAT
s3: ACGT
Find all the frequent subsequences with minsup = 2, but allowing at most a gap of 1 position between successive sequence elements.
CHAPTER 11 Graph Pattern Mining
Graph data is becoming increasingly more ubiquitous in today’s networked world. Examples include social networks as well as cell phone networks and blogs. The Internet is another example of graph data, as is the hyperlinked structure of the World Wide Web (WWW). Bioinformatics, especially systems biology, deals with understanding interaction networks between various types of biomolecules, such as protein–protein interactions, metabolic networks, gene networks, and so on. Another prominent source of graph data is the Semantic Web, and linked open data, with graphs represented using the Resource Description Framework (RDF) data model.
The goal of graph mining is to extract interesting subgraphs from a single large graph (e.g., a social network), or from a database of many graphs. In different applications we may be interested in different kinds of subgraph patterns, such as subtrees, complete graphs or cliques, bipartite cliques, dense subgraphs, and so on. These may represent, for example, communities in a social network, hub and authority pages on the WWW, clusters of proteins involved in similar biochemical functions, and so on. In this chapter we outline methods to mine all the frequent subgraphs that appear in a database of graphs.
11.1 ISOMORPHISM AND SUPPORT
A graph is a pair G = (V, E) where V is a set of vertices, and E ⊆ V × V is a set of edges. We assume that edges are unordered, so that the graph is undirected. If (u, v) is an edge, we say that u and v are adjacent and that v is a neighbor of u, and vice versa. The set of all neighbors of u in G is given as N(u) = {v ∈ V | (u, v) ∈ E}. A labeled graph has labels associated with its vertices as well as edges. We use L(u) to denote the label of the vertex u, and L(u, v) to denote the label of the edge (u, v), with the set of vertex labels denoted as ΣV and the set of edge labels as ΣE. Given an edge (u, v) ∈ G, the tuple ⟨u, v, L(u), L(v), L(u, v)⟩ that augments the edge with the node and edge labels is called an extended edge.
Example 11.1. Figure 11.1a shows an example of an unlabeled graph, whereas Figure 11.1b shows the same graph, with labels on the vertices, taken from the vertex
label set ΣV = {a, b, c, d}. In this example, all edges are assumed to be unlabeled; therefore, edge labels are not shown. Considering Figure 11.1b, the label of vertex v4 is L(v4) = a, and its neighbors are N(v4) = {v1, v2, v3, v5, v7, v8}. The edge (v4, v1) leads to the extended edge ⟨v4, v1, a, a⟩, where we omit the edge label L(v4, v1) because it is empty.

Figure 11.1. An unlabeled (a) and labeled (b) graph with eight vertices.
A graph G′ = (V′, E′) is said to be a subgraph of G if V′ ⊆ V and E′ ⊆ E. Note that this definition allows for disconnected subgraphs. However, typically data mining applications call for connected subgraphs, defined as a subgraph G′ such that V′ ⊆ V, E′ ⊆ E, and for any two nodes u, v ∈ V′, there exists a path from u to v in G′.
Graph and Subgraph Isomorphism
A graph G′ = (V′,E′) is said to be isomorphic to another graph G = (V,E) if there exists a bijective function φ : V′ → V, i.e., both injective (into) and surjective (onto), such that
1. (u,v) ∈ E′ ⇐⇒ (φ(u),φ(v)) ∈ E
2. ∀u∈V′,L(u)=L(φ(u))
3. ∀(u,v) ∈ E′, L(u,v) = L(φ(u),φ(v))
In other words, the isomorphism φ preserves the edge adjacencies as well as the vertex and edge labels. Put differently, the extended tuple ⟨u, v, L(u), L(v), L(u, v)⟩ ∈ G′ if and only if ⟨φ(u), φ(v), L(φ(u)), L(φ(v)), L(φ(u), φ(v))⟩ ∈ G.
Example 11.2. The graph defined by the bold edges in Figure 11.2a is a subgraph of the larger graph; it has vertex set V′ = {v1,v2,v4,v5,v6,v8}. However, it is a disconnected subgraph. Figure 11.2b shows an example of a connected subgraph on the same vertex set V′.
Figure 11.2. A subgraph (a) and connected subgraph (b).

Figure 11.3. Graph and subgraph isomorphism: graphs G1, G2, G3, and G4.
If the function φ is only injective but not surjective, we say that the mapping φ is a subgraph isomorphism from G′ to G. In this case, we say that G′ is isomorphic to a subgraph of G, that is, G′ is subgraph isomorphic to G, denoted G′ ⊆ G; we also say that G contains G′.
Example 11.3. In Figure 11.3, G1 = (V1, E1) and G2 = (V2, E2) are isomorphic graphs. There are several possible isomorphisms between G1 and G2. An example of an isomorphism φ : V2 → V1 is

φ(v1) = u1    φ(v2) = u3    φ(v3) = u2    φ(v4) = u4
The inverse mapping φ−1 specifies the isomorphism from G1 to G2. For example, φ−1(u1) = v1, φ−1(u2) = v3, and so on. The set of all possible isomorphisms from G2 to G1 are as follows:
       v1   v2   v3   v4
φ1     u1   u3   u2   u4
φ2     u1   u4   u2   u3
φ3     u2   u3   u1   u4
φ4     u2   u4   u1   u3
The graph G3 is subgraph isomorphic to both G1 and G2. The set of all possible subgraph isomorphisms from G3 to G1 are as follows:
       w1   w2   w3
φ1     u1   u2   u3
φ2     u1   u2   u4
φ3     u2   u1   u3
φ4     u2   u1   u4
The graph G4 is not subgraph isomorphic to either G1 or G2, and it is also not isomorphic to G3 because the extended edge ⟨x1,x3,b,b⟩ has no possible mappings in G1, G2 or G3.
Subgraph Support
Given a database of graphs, D = {G1 , G2 , . . . , Gn }, and given some graph G, the support of G in D is defined as follows:
sup(G) = |{Gi ∈ D | G ⊆ Gi}|
The support is simply the number of graphs in the database that contain G. Given a minsup threshold, the goal of graph mining is to mine all frequent connected subgraphs with sup(G) ≥ minsup.
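Stated as code, the definition is a one-line count. In the sketch below (names illustrative), contains(Gi, G) stands for any subgraph isomorphism test, such as the enumeration procedure described later in this chapter (Algorithm 11.3).

def support(G, D, contains):
    """Number of database graphs Gi in D that contain the pattern graph G."""
    return sum(1 for Gi in D if contains(Gi, G))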
To mine all the frequent subgraphs, one has to search over the space of all possible graph patterns, which is exponential in size. If we consider subgraphs with m vertices, then there are m(m − 1)/2 = O(m²) possible edges. The number of possible subgraphs with m nodes is then O(2^(m(m−1)/2)) because we may decide either to include or exclude each of the edges. Many of these subgraphs will not be connected, but O(2^(m(m−1)/2)) is a convenient upper bound. When we add labels to the vertices and edges, the number of labeled graphs will be even larger. Assume that |ΣV| = |ΣE| = s; then there are s^m possible ways to label the vertices and s^(m(m−1)/2) ways to label the edges. Thus, the number of possible labeled subgraphs with m vertices is 2^(m(m−1)/2) · s^m · s^(m(m−1)/2) = O((2s)^(m²)). This is the worst case bound, as many of these subgraphs will be isomorphic to each other, with the number of distinct subgraphs being much less. Nevertheless, the search space is still enormous because we typically have to search for all subgraphs ranging from a single vertex to some maximum number of vertices given by the largest frequent subgraph.
There are two main challenges in frequent subgraph mining. The first is to system- atically generate candidate subgraphs. We use edge-growth as the basic mechanism for extending the candidates. The mining process proceeds in a breadth-first (level-wise) or a depth-first manner, starting with an empty subgraph (i.e., with no edge), and adding a new edge each time. Such an edge may either connect two existing vertices in the graph or it may introduce a new vertex as one end of a new edge. The key is to perform nonredundant subgraph enumeration, such that we do not generate the same graph candidate more than once. This means that we have to perform graph isomorphism checking to make sure that duplicate graphs are removed. The second challenge is to count the support of a graph in the database. This involves subgraph isomorphism checking, as we have to find the set of graphs that contain a given candidate.
11.2 CANDIDATE GENERATION
An effective strategy to enumerate subgraph patterns is the so-called rightmost path extension. Given a graph G, we perform a depth-first search (DFS) over its vertices, and create a DFS spanning tree, that is, one that covers or spans all the vertices. Edges that are included in the DFS tree are called forward edges, and all other edges are called backward edges. Backward edges create cycles in the graph. Once we have a DFS tree, define the rightmost path as the path from the root to the rightmost leaf, that is, to the leaf with the highest index in the DFS order.
Example 11.4. Consider the graph shown in Figure 11.4a. One of the possible DFS spanning trees is shown in Figure 11.4b (illustrated via bold edges), obtained by starting at v1 and then choosing the vertex with the smallest index at each step. Figure 11.5 shows the same graph (ignoring the dashed edges), rearranged to emphasize the DFS tree structure. For instance, the edges (v1,v2) and (v2,v3) are examples of forward edges, whereas (v3,v1), (v4,v1), and (v6,v1) are all backward edges. The bold edges (v1,v5), (v5,v7) and (v7,v8) comprise the rightmost path.
For generating new candidates from a given graph G, we extend it by adding a new edge to vertices only on the rightmost path. We can either extend G by adding backward edges from the rightmost vertex to some other vertex on the rightmost path (disallowing self-loops or multi-edges), or we can extend G by adding forward edges from any of the vertices on the rightmost path. A backward extension does not add a new vertex, whereas a forward extension adds a new vertex.
For systematic candidate generation we impose a total order on the extensions, as follows: First, we try all backward extensions from the rightmost vertex, and then we try forward extensions from vertices on the rightmost path. Among the backward edge extensions, if ur is the rightmost vertex, the extension (ur, vi) is tried before (ur, vj) if i < j. In other words, backward extensions closer to the root are considered before those farther away from the root along the rightmost path. Among the forward edge extensions, if vx is the new vertex to be added, the extension (vi, vx) is tried before (vj, vx) if i > j. In other words, the vertices farther from the root (those at greater depth) are extended before those closer to the root. Also note that the new vertex will be numbered x = r + 1, as it will become the new rightmost vertex after the extension.

Figure 11.4. A graph (a) and a possible depth-first spanning tree (b).

Figure 11.5. Rightmost path extensions. The bold path is the rightmost path in the DFS tree. The rightmost vertex is v8, shown double-circled. Solid black lines (thin and bold) indicate the forward edges, which are part of the DFS tree. The backward edges, which by definition are not part of the DFS tree, are shown in gray. The set of possible extensions on the rightmost path are shown with dashed lines. The precedence ordering of the extensions is also shown.
Example 11.5. Consider the order of extensions shown in Figure 11.5. Node v8 is the rightmost vertex; thus we try backward extensions only from v8. The first extension, denoted #1 in Figure 11.5, is the backward edge (v8,v1) connecting v8 to the root, and the next extension is (v8,v5), denoted #2, which is also backward. No other backward extensions are possible without introducing multiple edges between the same pair of vertices. The forward extensions are tried in reverse order, starting from the rightmost vertex v8 (extension denoted as #3) and ending at the root (extension denoted as #6). Thus, the forward extension (v8,vx), denoted #3, comes before the forward extension (v7 , vx ), denoted #4, and so on.
11.2.1 Canonical Code
When generating candidates using rightmost path extensions, it is possible that duplicate, that is, isomorphic, graphs are generated via different extensions. Among the isomorphic candidates, we need to keep only one for further extension, whereas the others can be pruned to avoid redundant computation. The main idea is that if we can somehow sort or rank the isomorphic graphs, we can pick the canonical representative, say the one with the least rank, and extend only that graph.
Figure 11.6. Canonical DFS code. G1 is canonical, whereas G2 and G3 are noncanonical. Vertex label set ΣV = {a, b}, and edge label set ΣE = {q, r}. The vertices are numbered in DFS order.
Let G be a graph and let TG be a DFS spanning tree for G. The DFS tree TG defines an ordering of both the nodes and edges in G. The DFS node ordering is obtained by numbering the nodes consecutively in the order they are visited in the DFS walk. We assume henceforth that for a pattern graph G the nodes are numbered according to their position in the DFS ordering, so that i < j implies that vi comes before vj in the DFS walk. The DFS edge ordering is obtained by following the edges between consecutive nodes in DFS order, with the condition that all the backward edges incident with vertex vi are listed before any of the forward edges incident with it. The DFS code for a graph G, for a given DFS tree TG, denoted DFScode(G), is defined as the sequence of extended edge tuples of the form ⟨vi, vj, L(vi), L(vj), L(vi, vj)⟩ listed in the DFS edge order.
DFScode(G1): t11 = ⟨v1,v2,a,a,q⟩   t12 = ⟨v2,v3,a,a,r⟩   t13 = ⟨v3,v1,a,a,r⟩   t14 = ⟨v2,v4,a,b,r⟩
DFScode(G2): t21 = ⟨v1,v2,a,a,q⟩   t22 = ⟨v2,v3,a,b,r⟩   t23 = ⟨v2,v4,a,a,r⟩   t24 = ⟨v4,v1,a,a,r⟩
DFScode(G3): t31 = ⟨v1,v2,a,a,q⟩   t32 = ⟨v2,v3,a,a,r⟩   t33 = ⟨v3,v1,a,a,r⟩   t34 = ⟨v1,v4,a,b,r⟩
Example 11.6. Figure 11.6 shows the DFS codes for three graphs, which are all isomorphic to each other. The graphs have node and edge labels drawn from the label sets ΣV = {a, b} and ΣE = {q, r}. The edge labels are shown centered on the edges. The bold edges comprise the DFS tree for each graph. For G1, the DFS node ordering is v1, v2, v3, v4, whereas the DFS edge ordering is (v1, v2), (v2, v3), (v3, v1), and (v2, v4). Based on the DFS edge ordering, the first tuple in the DFS code for G1 is therefore ⟨v1, v2, a, a, q⟩. The next tuple is ⟨v2, v3, a, a, r⟩, and so on. The full DFS codes for the three graphs are listed above.
Canonical DFS Code
A subgraph is canonical if it has the smallest DFS code among all possible isomorphic graphs, with the ordering between codes defined as follows. Let t1 and t2 be any two
DFS code tuples:

t1 = ⟨vi, vj, L(vi), L(vj), L(vi, vj)⟩     t2 = ⟨vx, vy, L(vx), L(vy), L(vx, vy)⟩

We say that t1 is smaller than t2, written t1 < t2, iff
i) (vi, vj) <e (vx, vy), or
ii) (vi, vj) = (vx, vy) and ⟨L(vi), L(vj), L(vi, vj)⟩ <l ⟨L(vx), L(vy), L(vx, vy)⟩          (11.1)

where <l is the standard lexicographic ordering on the vertex and edge labels, and <e is an ordering on the edges eij = (vi, vj) and exy = (vx, vy), defined as follows:

If eij and exy are both forward edges, then (a) j < y, or (b) j = y and i > x. That is, (a) a forward edge to a node earlier in the DFS node order is smaller, or (b) if both forward edges point to a node with the same DFS node order, then the forward edge originating from the node later in the DFS node order (i.e., deeper along the rightmost path) is smaller.

If eij and exy are both backward edges, then (a) i < x, or (b) i = x and j < y. That is, (a) a backward edge from a node earlier in the DFS node order is smaller, or (b) if both the backward edges originate from a node with the same DFS node order, then the backward edge to a node earlier in DFS node order (i.e., closer to the root along the rightmost path) is smaller.

If eij is a forward and exy is a backward edge, then j ≤ x. That is, a forward edge to a node earlier in the DFS node order is smaller than a backward edge from that node or any node that comes after it in DFS node order.

If eij is a backward and exy is a forward edge, then i < y. That is, a backward edge from a node earlier in DFS node order is smaller than a forward edge to any later node.

Given any two DFS codes, we can compare them tuple by tuple to check which is smaller. In particular, the canonical DFS code for a graph G is defined as

C = min_{G′} { DFScode(G′) | G′ is isomorphic to G }

Given a candidate subgraph G, we can first determine whether its DFS code is canonical or not. Only canonical graphs need to be retained for extension, whereas noncanonical candidates can be removed from further consideration.
Example 11.7. Consider the DFS codes for the three graphs shown in Figure 11.6. Comparing G1 and G2, we find that t11 = t21, but t12 < t22 because ⟨a, a, r⟩ <l ⟨a, b, r⟩, so DFScode(G1) < DFScode(G2). Comparing G1 and G3, the first three tuples are equal, but t14 < t34 because (v2, v4) <e (v1, v4): both are forward edges to the same node v4, and t14 originates from the deeper node v2. Thus DFScode(G1) is the smallest of the three codes, and G1 is the canonical representative.
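The comparisons in Example 11.7 can be checked mechanically. The following is a sketch (illustrative, not the book's code) of the tuple comparison of Eq. (11.1), with a tuple written as (i, j, Li, Lj, Lij) and an edge (i, j) taken to be forward iff i < j.

def edge_less(e1, e2):
    """True if edge e1 precedes e2 in the DFS edge order <e."""
    (i, j), (x, y) = e1, e2
    f1, f2 = i < j, x < y                      # forward-edge flags
    if f1 and f2:
        return j < y or (j == y and i > x)     # both forward
    if not f1 and not f2:
        return i < x or (i == x and j < y)     # both backward
    if f1:
        return j <= x                          # forward vs. backward
    return i < y                               # backward vs. forward

def tuple_less(t1, t2):
    """True if DFS code tuple t1 < t2 under Eq. (11.1)."""
    e1, e2 = (t1[0], t1[1]), (t2[0], t2[1])
    if e1 == e2:
        return t1[2:] < t2[2:]                 # lexicographic label comparison
    return edge_less(e1, e2)

# t14 vs t34 from Figure 11.6: both forward edges into node 4, and 2 > 1
print(tuple_less((2, 4, 'a', 'b', 'r'), (1, 4, 'a', 'b', 'r')))   # True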
11.3 THE GSPAN ALGORITHM
We describe the gSpan algorithm to mine all frequent subgraphs from a database of graphs. Given a database D = {G1,G2,…,Gn} comprising n graphs, and given a minimum support threshold minsup, the goal is to enumerate all (connected) subgraphs G that are frequent, that is, sup(G) ≥ minsup. In gSpan, each graph is represented by its canonical DFS code, so that the task of enumerating frequent subgraphs is equivalent to the task of generating all canonical DFS codes for frequent subgraphs. Algorithm 11.1 shows the pseudo-code for gSpan.
gSpan enumerates patterns in a depth-first manner, starting with the empty code. Given a canonical and frequent code C, gSpan first determines the set of possible edge extensions along the rightmost path (line 1). The function RIGHTMOSTPATH-EXTENSIONS returns the set of edge extensions along with their support values, E. Each extended edge t in E leads to a new candidate DFS code C′ = C ∪ {t}, with support sup(C′) = sup(t) (lines 3–4). For each new candidate code, gSpan checks whether it is frequent and canonical, and if so gSpan recursively extends C′ (lines 5–6). The algorithm stops when there are no more frequent and canonical extensions possible.
ALGORITHM 11.1. Algorithm GSPAN

    // Initial Call: C ← ∅
    GSPAN (C, D, minsup):
1       E ← RIGHTMOSTPATH-EXTENSIONS(C, D) // extensions and supports
2       foreach (t, sup(t)) ∈ E do
3           C′ ← C ∪ t // extend the code with extended edge tuple t
4           sup(C′) ← sup(t) // record the support of new extension
            // recursively call gSpan if code is frequent and canonical
5           if sup(C′) ≥ minsup and ISCANONICAL(C′) then
6               GSPAN (C′, D, minsup)
Figure 11.7. Example graph database, comprising graphs G1 (vertices a10, b20, a30, b40) and G2 (vertices b50, a60, b70, a80).
Example 11.8. Consider the example graph database comprising G1 and G2 shown in Figure 11.7. Let minsup = 2, that is, assume that we are interested in mining subgraphs that appear in both the graphs in the database. For each graph the node labels and node numbers are both shown, for example, the node a10 in G1 means that node 10 has label a.
Figure 11.8 shows the candidate patterns enumerated by gSpan. For each candidate the nodes are numbered in the DFS tree order. The solid boxes show frequent subgraphs, whereas the dotted boxes show the infrequent ones. The dashed boxes represent noncanonical codes. Subgraphs that do not occur even once are not shown. The figure also shows the DFS codes and their corresponding graphs.
The mining process begins with the empty DFS code C0 corresponding to the empty subgraph. The set of possible 1-edge extensions comprises the new set of candidates. Among these, C3 is pruned because it is not canonical (it is isomorphic to C2), whereas C4 is pruned because it is not frequent. The remaining two candidates, C1 and C2, are both frequent and canonical, and are thus considered for further extension. The depth-first search considers C1 before C2, with the rightmost path extensions of C1 being C5 and C6. However, C6 is not canonical; it is isomorphic to C5, which has the canonical DFS code. Further extensions of C5 are processed recursively. Once the recursion from C1 completes, gSpan moves on to C2, which will be recursively extended via rightmost edge extensions as illustrated by the subtree under C2. After processing C2, gSpan terminates because no other frequent and canonical extensions are found. In this example, C12 is a maximal frequent subgraph, that is, no supergraph of C12 is frequent.
This example also shows the importance of duplicate elimination via canonical checking. The groups of isomorphic subgraphs encountered during the execution of gSpan are as follows: {C2,C3}, {C5,C6,C17}, {C7,C19}, {C9,C25}, {C20,C21,C22,C24}, and {C12,C13,C14}. Within each group the first graph is canonical and thus the remaining codes are pruned.
For a complete description of gSpan we have to specify the algorithm for enumerating the rightmost path extensions and their support, so that infrequent patterns can be eliminated, and the procedure for checking whether a given DFS code is canonical, so that duplicate patterns can be pruned. These are detailed next.
Figure 11.8. Frequent graph mining with minsup = 2. Solid boxes indicate the frequent subgraphs, dotted the infrequent, and dashed the noncanonical subgraphs.

11.3.1 Extension and Support Computation
The support computation task is to find the number of graphs in the database D that contain a candidate subgraph, which is very expensive because it involves subgraph isomorphism checks. gSpan combines the tasks of enumerating candidate extensions and support computation.
Assume that D = {G1 , G2 , . . . , Gn } comprises n graphs. Let C = {t1 , t2 , . . . , tk } denote a frequent canonical DFS code comprising k edges, and let G(C) denote the graph corresponding to code C. The task is to compute the set of possible rightmost path extensions from C, along with their support values, which is accomplished via the pseudo-code in Algorithm 11.2.
Given code C, gSpan first records the nodes on the rightmost path (R), and the rightmost child (ur). Next, gSpan considers each graph Gi ∈ D. If C = ∅, then each distinct label tuple of the form ⟨L(x), L(y), L(x, y)⟩ for adjacent nodes x and y in Gi contributes a forward extension ⟨0, 1, L(x), L(y), L(x, y)⟩ (lines 6–8). On the other hand, if C is not empty, then gSpan enumerates all possible subgraph isomorphisms Φi between the code C and the graph Gi via the function SUBGRAPHISOMORPHISMS (line 10). Given a subgraph isomorphism φ ∈ Φi, gSpan finds all possible forward and backward edge extensions, and stores them in the extension set E.
Backward extensions (lines 12–15) are allowed only from the rightmost child ur in C to some other node on the rightmost path R. The method considers each neighbor x of φ(ur) in Gi and checks whether it is a mapping for some vertex v = φ−1(x) along the rightmost path R in C. If the edge (ur,v) does not already exist in C, it is a new extension, and the extended tuple b = ⟨ur , v, L(ur ), L(v), L(ur , v)⟩ is added to the set of extensions E, along with the graph id i that contributed to that extension.
Forward extensions (lines 16–19) are allowed only from nodes on the rightmost path R to new nodes. For each node u in R, the algorithm finds a neighbor x in Gi that is not in a mapping from some node in C. For each such node x, the forward extension f = ⟨u,ur + 1,L(φ(u)),L(x),L(φ(u),x)⟩ is added to E, along with the graph id i. Because a forward extension adds a new vertex to the graph G(C), the id of the new node in C must be ur + 1, that is, one more than the highest numbered node in C, which by definition is the rightmost child ur .
Once all the backward and forward extensions have been cataloged over all graphs Gi in the database D, we compute their support by counting the number of distinct graph ids that contribute to each extension. Finally, the method returns the set of all extensions and their supports in sorted order (increasing) based on the tuple comparison operator in Eq. (11.1).
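The rightmost path R and rightmost child ur used above can be recovered directly from the code C itself, since the forward edges of C form the DFS spanning tree. A small sketch (names illustrative):

def rightmost_path(C):
    """Return (R, ur): the rightmost-path node list and the rightmost child."""
    if not C:
        return [], None
    parent = {0: None}
    for (u, v, *labels) in C:
        if v > u:                  # forward edge: u is the DFS-tree parent of v
            parent[v] = u
    ur = max(parent)               # rightmost child = highest DFS number
    R, node = [], ur
    while node is not None:        # walk up from the rightmost child to the root
        R.append(node)
        node = parent[node]
    return list(reversed(R)), ur

# For the code C = {<0,1,a,a>, <1,2,a,b>} of Figure 11.9a:
print(rightmost_path([(0, 1, 'a', 'a'), (1, 2, 'a', 'b')]))   # ([0, 1, 2], 2)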
Example 11.9. Consider the canonical code C and the corresponding graph G(C) shown in Figure 11.9a. For this code all the vertices are on the rightmost path, that is, R = {0, 1, 2}, and the rightmost child is ur = 2.
The sets of all possible isomorphisms from C to the graphs G1 and G2 in the database (shown in Figure 11.7) are listed in Figure 11.9b as Φ1 and Φ2. For example, the first isomorphism φ1 : G(C) → G1 is defined as
φ1(0) = 10 φ1(1) = 30 φ1(2) = 20
ALGORITHM 11.2. Rightmost Path Extensions and Their Support

    RIGHTMOSTPATH-EXTENSIONS (C, D):
1       R ← nodes on the rightmost path in C
2       ur ← rightmost child in C // dfs number
3       E ← ∅ // set of extensions from C
4       foreach Gi ∈ D, i = 1, ..., n do
5           if C = ∅ then
                // add distinct label tuples in Gi as forward extensions
6               foreach distinct ⟨L(x), L(y), L(x, y)⟩ ∈ Gi do
7                   f = ⟨0, 1, L(x), L(y), L(x, y)⟩
8                   Add tuple f to E along with graph id i
9           else
10              Φi = SUBGRAPHISOMORPHISMS(C, Gi)
11              foreach isomorphism φ ∈ Φi do
                    // backward extensions from rightmost child
12                  foreach x ∈ NGi(φ(ur)) such that ∃ v ← φ−1(x) do
13                      if v ∈ R and (ur, v) ∉ G(C) then
14                          b = ⟨ur, v, L(ur), L(v), L(ur, v)⟩
15                          Add tuple b to E along with graph id i
                    // forward extensions from nodes on rightmost path
16                  foreach u ∈ R do
17                      foreach x ∈ NGi(φ(u)) such that ∄ φ−1(x) do
18                          f = ⟨u, ur + 1, L(φ(u)), L(x), L(φ(u), x)⟩
19                          Add tuple f to E along with graph id i
        // Compute the support of each extension
20      foreach distinct extension s ∈ E do
21          sup(s) = number of distinct graph ids that support tuple s
22      return set of pairs ⟨s, sup(s)⟩ for extensions s ∈ E, in tuple sorted order
The list of possible backward and forward extensions for each isomorphism is shown in Figure 11.9c. For example, there are two possible edge extensions from the isomorphism φ1. The first is a backward edge extension ⟨2, 0, b, a⟩, as (20, 10) is a valid backward edge in G1. That is, the node x = 10 is a neighbor of φ1(2) = 20 in G1, φ1−1(10) = 0 = v is on the rightmost path, and the edge (2, 0) is not already in G(C), which satisfies the backward extension steps in lines 12–15 in Algorithm 11.2. The second extension is a forward one, ⟨1, 3, a, b⟩, as ⟨30, 40, a, b⟩ is a valid extended edge in G1. That is, x = 40 is a neighbor of φ1(1) = 30 in G1, and node 40 has not already been mapped to any node in G(C), that is, φ1−1(40) does not exist. These conditions satisfy the forward extension steps in lines 16–19 in Algorithm 11.2.
Figure 11.9. Rightmost path extensions: (a) code C and graph G(C); (b) subgraph isomorphisms; (c) edge extensions; (d) extensions (sorted) and supports.

(a) C = {t1 = ⟨0,1,a,a⟩, t2 = ⟨1,2,a,b⟩}; G(C) is the path a0 − a1 − b2.

(b) Subgraph isomorphisms:

             0    1    2
Φ1    φ1    10   30   20
      φ2    10   30   40
      φ3    30   10   20
Φ2    φ4    60   80   70
      φ5    80   60   50
      φ6    80   60   70

(c) Edge extensions:

φ1    {⟨2,0,b,a⟩, ⟨1,3,a,b⟩}               (from G1)
φ2    {⟨1,3,a,b⟩, ⟨0,3,a,b⟩}
φ3    {⟨2,0,b,a⟩, ⟨0,3,a,b⟩}
φ4    {⟨2,0,b,a⟩, ⟨2,3,b,b⟩, ⟨0,3,a,b⟩}    (from G2)
φ5    {⟨2,3,b,b⟩, ⟨1,3,a,b⟩, ⟨0,3,a,b⟩}
φ6    {⟨2,0,b,a⟩, ⟨2,3,b,b⟩, ⟨1,3,a,b⟩}

(d) Extensions (sorted) and supports:

Extension       Support
⟨2,0,b,a⟩       2
⟨2,3,b,b⟩       1
⟨1,3,a,b⟩       2
⟨0,3,a,b⟩       2
Given the set of all the edge extensions, and the graph ids that contribute to them, we obtain support for each extension by counting how many graphs contribute to it. The final set of extensions, in sorted order, along with their support values is shown in Figure 11.9d. With minsup = 2, the only infrequent extension is ⟨2, 3, b, b⟩.
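The support-counting step (lines 20–21 of Algorithm 11.2) is a simple grouping of the (extension, graph id) pairs. The sketch below (names illustrative) reproduces the supports of Figure 11.9d from the per-isomorphism extensions of Figure 11.9c.

from collections import defaultdict

def extension_supports(pairs):
    """pairs: iterable of (extension tuple, graph id); returns extension -> support."""
    ids = defaultdict(set)
    for ext, gid in pairs:
        ids[ext].add(gid)
    return {ext: len(gids) for ext, gids in ids.items()}

pairs = [((2,0,'b','a'), 1), ((1,3,'a','b'), 1),                         # phi1
         ((1,3,'a','b'), 1), ((0,3,'a','b'), 1),                         # phi2
         ((2,0,'b','a'), 1), ((0,3,'a','b'), 1),                         # phi3
         ((2,0,'b','a'), 2), ((2,3,'b','b'), 2), ((0,3,'a','b'), 2),     # phi4
         ((2,3,'b','b'), 2), ((1,3,'a','b'), 2), ((0,3,'a','b'), 2),     # phi5
         ((2,0,'b','a'), 2), ((2,3,'b','b'), 2), ((1,3,'a','b'), 2)]     # phi6

print(extension_supports(pairs))
# supports 2, 2, 2 and 1 for <2,0,b,a>, <1,3,a,b>, <0,3,a,b> and <2,3,b,b>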
Subgraph Isomorphisms
The key step in listing the edge extensions for a given code C is to enumerate all the possible isomorphisms from C to each graph Gi ∈ D. The function SUBGRAPHISOMORPHISMS, shown in Algorithm 11.3, accepts a code C and a graph G, and returns the set of all isomorphisms between C and G. The set of isomorphisms Φ is initialized by mapping vertex 0 in C to each vertex x in G that shares the same label as 0, that is, if L(x) = L(0) (line 1). The method considers each tuple ti in C and extends the current set of partial isomorphisms. Let ti = ⟨u, v, L(u), L(v), L(u, v)⟩. We have to check if each isomorphism φ ∈ Φ can be extended in G using the information from ti (lines 5–12). If ti is a forward edge, then we seek a neighbor x of φ(u) in G such that x has not already been mapped to some vertex in C, that is, φ−1(x) should not exist, and the node and edge labels should match, that is, L(x) = L(v), and L(φ(u), x) = L(u, v). If so, φ can be extended with the mapping φ(v) → x. The new extended isomorphism, denoted φ′, is added to the initially empty set of isomorphisms Φ′. If ti is a backward edge, we have to check if φ(v) is a neighbor of φ(u) in G. If so, we add the current isomorphism φ to Φ′. Thus,
only those isomorphisms that can be extended in the forward case, or those that satisfy the backward edge, are retained for further checking. Once all the extended edges in C have been processed, the set Φ contains all the valid isomorphisms from C to G.

ALGORITHM 11.3. Enumerate Subgraph Isomorphisms

    SUBGRAPHISOMORPHISMS (C = {t1, t2, ..., tk}, G):
1       Φ ← {φ(0) → x | x ∈ G and L(x) = L(0)}
2       foreach ti ∈ C, i = 1, ..., k do
3           ⟨u, v, L(u), L(v), L(u, v)⟩ ← ti // expand extended edge ti
4           Φ′ ← ∅ // partial isomorphisms including ti
5           foreach partial isomorphism φ ∈ Φ do
6               if v > u then
                    // forward edge
7                   foreach x ∈ NG(φ(u)) do
8                       if ∄ φ−1(x) and L(x) = L(v) and L(φ(u), x) = L(u, v) then
9                           φ′ ← φ ∪ {φ(v) → x}
10                          Add φ′ to Φ′
11              else
                    // backward edge
12                  if φ(v) ∈ NG(φ(u)) then Add φ to Φ′ // valid isomorphism
13          Φ ← Φ′ // update partial isomorphisms
14      return Φ
Example 11.10. Figure 11.10 illustrates the subgraph isomorphism enumeration algorithm from the code C to each of the graphs G1 and G2 in the database shown in Figure 11.7.
For G1, the set of isomorphisms is initialized by mapping the first node of C to all nodes labeled a in G1 because L(0) = a. Thus, Φ = {φ1(0) → 10, φ2(0) → 30}. We next consider each tuple in C, and see which isomorphisms can be extended. The first tuple t1 = ⟨0, 1, a, a⟩ is a forward edge; thus for φ1 we consider neighbors x of 10 that are labeled a and not included in the isomorphism yet. The only other vertex that satisfies this condition is 30; thus the isomorphism is extended by mapping φ1(1) → 30. In a similar manner the second isomorphism φ2 is extended by adding φ2(1) → 10, as shown in Figure 11.10. For the second tuple t2 = ⟨1, 2, a, b⟩, the isomorphism φ1 has two possible extensions, as 30 has two neighbors labeled b, namely 20 and 40. The extended mappings are denoted φ1′ and φ1′′. For φ2 there is only one extension.

The isomorphisms of C in G2 can be found in a similar manner. The complete sets of isomorphisms in each database graph are shown in Figure 11.10.
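A compact sketch of Algorithm 11.3 follows. The graph layout (vlab for vertex labels, adj for labeled adjacency) is an illustrative assumption; the edges of G1 are those read off from Examples 11.9 and 11.10, and since the edges are unlabeled the edge label is taken to be None.

def subgraph_isomorphisms(C, vlab, adj):
    # line 1: map DFS vertex 0 to every graph vertex with a matching label
    isos = [{0: x} for x in vlab if vlab[x] == C[0][2]]
    for (u, v, Lu, Lv, Luv) in C:
        new_isos = []
        for phi in isos:
            if v > u:                                  # forward edge: map new vertex v
                used = set(phi.values())
                for x, elab in adj[phi[u]].items():
                    if x not in used and vlab[x] == Lv and elab == Luv:
                        new_isos.append({**phi, v: x})
            elif phi[v] in adj[phi[u]]:                # backward edge: check it exists
                new_isos.append(phi)
        isos = new_isos
    return isos

# Graph G1 of Figure 11.7, with the edges 10-20, 10-30, 20-30, 30-40
vlab = {10: 'a', 20: 'b', 30: 'a', 40: 'b'}
adj = {x: {} for x in vlab}
for a, b in [(10, 20), (10, 30), (20, 30), (30, 40)]:
    adj[a][b] = adj[b][a] = None

C = [(0, 1, 'a', 'a', None), (1, 2, 'a', 'b', None)]
print(subgraph_isomorphisms(C, vlab, adj))
# [{0: 10, 1: 30, 2: 20}, {0: 10, 1: 30, 2: 40}, {0: 30, 1: 10, 2: 20}]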
Figure 11.10. Subgraph isomorphisms.

11.3.2 Canonicality Checking
Given a DFS code C = {t1,t2,…,tk} comprising k extended edge tuples and the corresponding graph G(C), the task is to check whether the code C is canonical. This can be accomplished by trying to reconstruct the canonical code C∗ for G(C) in an iterative manner starting from the empty code and selecting the least rightmost path extension at each step, where the least edge extension is based on the extended tuple comparison operator in Eq. (11.1). If at any step the current (partial) canonical DFS code C∗ is smaller than C, then we know that C cannot be canonical and can thus be pruned. On the other hand, if no smaller code is found after k extensions then C must be canonical. The pseudo-code for canonicality checking is given in Algorithm 11.4. The method can be considered as a restricted version of gSpan in that the graph G(C) plays the role of a graph in the database, and C∗ plays the role of a candidate extension. The key difference is that we consider only the smallest rightmost path edge extension among all the possible candidate extensions.
ALGORITHM 11.4. Canonicality Checking: Algorithm ISCANONICAL
    ISCANONICAL (C):
        DC ← {G(C)} // graph corresponding to code C
        C∗ ← ∅ // initialize canonical DFS code
        for i = 1, ..., k do
            E ← RIGHTMOSTPATH-EXTENSIONS(C∗, DC) // extensions of C∗
            (si, sup(si)) ← min{E} // least rightmost edge extension of C∗
            if si < ti then
                return false // C∗ is smaller, thus C is not canonical
            C∗ ← C∗ ∪ si
        return true // no smaller code exists; C is canonical
Figure 11.11. Canonicality checking for the graph G with code C = {t1 = ⟨0,1,a,a⟩, t2 = ⟨1,2,a,b⟩, t3 = ⟨1,3,a,b⟩, t4 = ⟨3,0,b,a⟩}. The partial canonical code C∗ grows as follows: after Step 1, C∗ = {s1 = ⟨0,1,a,a⟩}; after Step 2, C∗ = {s1, s2 = ⟨1,2,a,b⟩}; after Step 3, C∗ = {s1, s2, s3 = ⟨2,0,b,a⟩}.
Example 11.11. Consider the subgraph candidate C14 from Figure 11.8, which is replicated as graph G in Figure 11.11, along with its DFS code C. From an initial canonical code C∗ = ∅, the smallest rightmost edge extension s1 is added in Step 1. Because s1 = t1, we proceed to the next step, which finds the smallest edge extension s2. Once again s2 = t2, so we proceed to the third step. The least possible edge extension for G∗ is the extended edge s3 . However, we find that s3 < t3 , which means that C cannot be canonical, and there is no need to try further edge extensions.
11.4 FURTHER READING
The gSpan algorithm was described in Yan and Han (2002), along with the notion of canonical DFS code. A different notion of canonical graphs using canonical adjacency matrices was described in Huan, Wang, and Prins (2003). Level-wise algorithms to mine frequent subgraphs appear in Kuramochi and Karypis (2001) and Inokuchi, Washio, and Motoda (2000). Markov chain Monte Carlo methods to sample a set of representative graph patterns were proposed in Al Hasan and Zaki (2009). For an efficient algorithm to mine frequent tree patterns see Zaki (2002).
Al Hasan, M. and Zaki, M. J. (2009). Output space sampling for graph patterns. Proceedings of the VLDB Endowment, 2 (1): 730–741.
Huan, J., Wang, W., and Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. Proceedings of the IEEE International Conference on Data Mining. IEEE, pp. 549–552.
Inokuchi, A., Washio, T., and Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery. Springer, pp. 13–23.
Kuramochi, M. and Karypis, G. (2001). Frequent subgraph discovery. Proceedings of the IEEE International Conference on Data Mining. IEEE, pp. 313–320.
Yan, X. and Han, J. (2002). gSpan: Graph-based substructure pattern mining. Proceedings of the IEEE International Conference on Data Mining. IEEE, pp. 721–724.
Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 71–80.
11.5 EXERCISES
Q1. Find the canonical DFS code for the graph in Figure 11.12. Try to eliminate some codes without generating the complete search tree. For example, you can eliminate a candidate code if you can show that it must be larger than some other code.
[Figure 11.12. Graph for Q1.]

Q2. Given the graph in Figure 11.13, mine all the frequent subgraphs with minsup = 1. For each frequent subgraph, also show its canonical code.

[Figure 11.13. Graph for Q2.]
Q3. Consider the graph shown in Figure 11.14. Show all its isomorphic graphs and their DFS codes, and find the canonical representative (you may omit isomorphic graphs that can definitely not have canonical codes).
[Figure 11.14. Graph for Q3.]
Q4. Given the graphs G1 through G7 in Figure 11.15, separate them into isomorphic groups.

[Figure 11.15. Data for Q4: graphs G1–G7.]
Q5. Given the graph in Figure 11.16, find the maximum DFS code for the graph, subject to the constraint that all extensions (whether forward or backward) are done only from the rightmost path.

[Figure 11.16. Graph for Q5.]

Q6. For an edge-labeled undirected graph G = (V, E), define its labeled adjacency matrix A as follows:

$$A(i, j) = \begin{cases} L(v_i) & \text{if } i = j \\ L(v_i, v_j) & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}$$
where L(vi ) is the label for vertex vi and L(vi , vj ) is the label for edge (vi , vj ). In other words, the labeled adjacency matrix has the node labels on the main diagonal, and it has the label of the edge (vi,vj) in cell A(i,j). Finally, a 0 in cell A(i,j) means that there is no edge between vi and vj .
[Figure 11.17. Graph for Q6: vertices v0, ..., v5, with vertex labels from {a, b} and edge labels from {x, y, z}.]

Given a particular permutation of the vertices, a matrix code for the graph is obtained by concatenating the lower triangular submatrix of A row-by-row. For example, one possible matrix corresponding to the default vertex permutation v0v1v2v3v4v5 for the graph in Figure 11.17 is given as

    a x 0 0 0 0
    x b y y 0 0
    0 y b y y 0
    0 y y b y 0
    0 0 y y b z
    0 0 0 0 z a

The code for the matrix above is axb0yb0yyb00yyb0000za. Given the total ordering on the labels
An itemset X is closed if all its supersets have strictly lower support, that is,

sup(X) > sup(Y), for all Y ⊃ X
An itemset X is a minimal generator if all its subsets have strictly higher support,
that is,
sup(X) < sup(Y), for all Y ⊂ X
If an itemset X is not a minimal generator, then it implies that it has some redundant items, that is, we can find some subset Y ⊂ X, which can be replaced with an even smaller subset W ⊂ Y without changing the support of X, that is, there exists a W ⊂ Y, such that
sup(X)=sup(Y∪(X\Y))=sup(W∪(X\Y))
One can show that all subsets of a minimal generator must themselves be minimal generators.
Table 12.16. Closed itemsets and minimal generators

  sup | Closed Itemset | Minimal Generators
   3  | ABDE           | AD, DE
   3  | BCE            | CE
   4  | ABE            | A
   4  | BC             | C
   4  | BD             | D
   5  | BE             | E
   6  | B              | B
Example 12.13. Consider the dataset in Table 12.1 and the set of frequent itemsets with minsup = 3 as shown in Table 12.2. There are only two maximal frequent itemsets, namely ABDE and BCE, which capture essential information about whether another itemset is frequent or not: an itemset is frequent only if it is a subset of one of these two.
Table 12.16 shows the seven closed itemsets and the corresponding minimal generators. Both of these sets allow one to infer the exact support of any other frequent itemset. The support of an itemset X is the maximum support among all closed itemsets that contain it. Alternatively, the support of X is the minimum support among all minimal generators that are subsets of X. For example, the itemset AE is a subset of the closed sets ABE and ABDE, and it is a superset of the minimal generators A, and E; we can observe that
sup(AE) = max{sup(ABE), sup(ABDE)} = 4
sup(AE) = min{sup(A), sup(E)} = 4
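As a small illustration (not from the text), the following Python sketch transcribes Table 12.16 into two dictionaries and recovers the support of an arbitrary frequent itemset both ways; the itemset AE from the example yields 4 under either rule.

# Support inference from closed itemsets and minimal generators (Table 12.16).
closed = {frozenset("ABDE"): 3, frozenset("BCE"): 3, frozenset("ABE"): 4,
          frozenset("BC"): 4, frozenset("BD"): 4, frozenset("BE"): 5,
          frozenset("B"): 6}
generators = {frozenset("AD"): 3, frozenset("DE"): 3, frozenset("CE"): 3,
              frozenset("A"): 4, frozenset("C"): 4, frozenset("D"): 4,
              frozenset("E"): 5, frozenset("B"): 6}

def support_from_closed(X):
    # sup(X) = maximum support over the closed itemsets that contain X
    return max(s for C, s in closed.items() if X <= C)

def support_from_generators(X):
    # sup(X) = minimum support over the minimal generators contained in X
    return min(s for G, s in generators.items() if G <= X)

X = frozenset("AE")
print(support_from_closed(X), support_from_generators(X))   # both print 4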
Productive Itemsets An itemset X is productive if its relative support is higher than the expected relative support over all of its bipartitions, assuming they are independent. More formally, let |X| ≥ 2, and let {X1, X2} be a bipartition of X. We say that X is productive provided

rsup(X) > rsup(X1) × rsup(X2), for all bipartitions {X1, X2} of X    (12.3)

This immediately implies that X is productive if its minimum lift is greater than one, as

$$\text{MinLift}(X) = \min_{X_1, X_2} \left\{ \frac{\text{rsup}(X)}{\text{rsup}(X_1) \cdot \text{rsup}(X_2)} \right\} > 1$$

In terms of leverage, X is productive if its minimum leverage is above zero because

$$\text{MinLeverage}(X) = \min_{X_1, X_2} \Big\{ \text{rsup}(X) - \text{rsup}(X_1) \times \text{rsup}(X_2) \Big\} > 0$$
Example 12.14. Considering the frequent itemsets in Table 12.2, the set ABDE is not productive because there exists a bipartition with lift value of 1. For instance, for its bipartition {B, ADE} we have

$$\text{lift}(B \longrightarrow ADE) = \frac{\text{rsup}(ABDE)}{\text{rsup}(B) \cdot \text{rsup}(ADE)} = \frac{3/6}{6/6 \cdot 3/6} = 1$$

On the other hand, ADE is productive because it has three distinct bipartitions and all of them have lift above 1:

$$\text{lift}(A \longrightarrow DE) = \frac{\text{rsup}(ADE)}{\text{rsup}(A) \cdot \text{rsup}(DE)} = \frac{3/6}{4/6 \cdot 3/6} = 1.5$$

$$\text{lift}(D \longrightarrow AE) = \frac{\text{rsup}(ADE)}{\text{rsup}(D) \cdot \text{rsup}(AE)} = \frac{3/6}{4/6 \cdot 4/6} = 1.125$$

$$\text{lift}(E \longrightarrow AD) = \frac{\text{rsup}(ADE)}{\text{rsup}(E) \cdot \text{rsup}(AD)} = \frac{3/6}{5/6 \cdot 3/6} = 1.2$$
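The minimum-lift computation can be sketched directly in Python by enumerating bipartitions; the relative supports below are the ones used in Example 12.14 (Table 12.2), and the function itself is an illustrative sketch rather than anything defined in the text.

from itertools import combinations

def min_lift(X, rsup):
    # rsup: dict mapping frozensets to relative supports
    X = frozenset(X)
    items = sorted(X)
    best = float("inf")
    # every nonempty proper subset X1 determines the bipartition {X1, X \ X1};
    # iterating over sizes 1 .. |X|-1 visits each bipartition twice, which is harmless
    for r in range(1, len(items)):
        for X1 in combinations(items, r):
            X1 = frozenset(X1)
            X2 = X - X1
            best = min(best, rsup[X] / (rsup[X1] * rsup[X2]))
    return best

rsup = {frozenset("ADE"): 3/6, frozenset("A"): 4/6, frozenset("DE"): 3/6,
        frozenset("D"): 4/6, frozenset("AE"): 4/6, frozenset("E"): 5/6,
        frozenset("AD"): 3/6}
print(min_lift("ADE", rsup) > 1)   # True: ADE is productive (minimum lift = 1.125)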
Comparing Rules
Given two rules R : X −→ Y and R′ : W −→ Y that have the same consequent, we say that R is more specific than R′, or equivalently, that R′ is more general than R, provided W ⊂ X.
Nonredundant Rules We say that a rule R : X −→ Y is redundant provided there exists a more general rule R′ : W −→ Y that has the same support, that is, W ⊂ X and sup(R) = sup(R′). On the other hand, if sup(R) < sup(R′) over all its generalizations R′, then R is nonredundant.
Improvement and Productive Rules Define the improvement of a rule X −→ Y as follows:
$$\text{imp}(X \longrightarrow Y) = \text{conf}(X \longrightarrow Y) - \max_{W \subset X} \big\{ \text{conf}(W \longrightarrow Y) \big\}$$
Improvement quantifies the minimum difference between the confidence of a rule and any of its generalizations. A rule R : X −→ Y is productive if its improvement is greater than zero, which implies that for all more general rules R′ : W −→ Y we have conf(R) > conf(R′). On the other hand, if there exists a more general rule R′ with conf(R′) ≥ conf(R), then R is unproductive. If a rule is redundant, it is also unproductive because its improvement is zero.
The smaller the improvement of a rule R : X −→ Y, the more likely it is to be unproductive. We can generalize this notion to consider rules that have at least some minimum level of improvement, that is, we may require that imp(X −→ Y) ≥ t, where t is a user-specified minimum improvement threshold.
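A direct way to compute improvement is to take the maximum confidence over all proper subsets of the antecedent, including the empty (trivial) rule. The sketch below does this for the setting of Example 12.15; the support dictionary is transcribed from Table 12.1 (via Table 12.16), and the helper names are ours, not the book's.

from itertools import combinations

def conf(X, Y, sup):
    X, Y = frozenset(X), frozenset(Y)
    return sup[X | Y] / sup[X]

def improvement(X, Y, sup):
    # imp(X -> Y) = conf(X -> Y) - max over proper subsets W of X of conf(W -> Y)
    X = frozenset(X)
    gens = []
    for r in range(len(X)):                  # r = 0 includes the trivial rule {} -> Y
        for W in combinations(sorted(X), r):
            gens.append(conf(frozenset(W), Y, sup))
    return conf(X, Y, sup) - max(gens)

# Example 12.15: R : BE -> C over the dataset of Table 12.1 (n = 6 transactions)
sup = {frozenset(): 6, frozenset("B"): 6, frozenset("E"): 5, frozenset("BE"): 5,
       frozenset("C"): 4, frozenset("BC"): 4, frozenset("CE"): 3, frozenset("BCE"): 3}
print(round(improvement("BE", "C", sup), 2))   # -0.07, so BE -> C is unproductive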
Example 12.15. Consider the example dataset in Table 12.1, and the set of frequent itemsets in Table 12.2. Consider rule R : BE −→ C, which has support 3, and confidence 3/5 = 0.60. It has two generalizations, namely
R′1 : E −→ C,  sup = 3, conf = 3/5 = 0.6
R′2 : B −→ C,  sup = 4, conf = 4/6 = 0.67
Thus, BE −→ C is redundant w.r.t. E −→ C because they have the same support, that is, sup(BCE) = sup(BC). Further, BE −→ C is also unproductive, since imp(BE −→ C) = 0.6 − max{0.6, 0.67} = −0.07; it has a more general rule, namely R′2 , with higher confidence.
12.2 SIGNIFICANCE TESTING AND CONFIDENCE INTERVALS
We now consider how to assess the statistical significance of patterns and rules, and
how to derive confidence intervals for a given assessment measure.
12.2.1 Fisher Exact Test for Productive Rules
We begin by discussing the Fisher exact test for rule improvement. That is, we directly test whether the rule R : X −→ Y is productive by comparing its confidence with that of each of its generalizations R′ : W −→ Y, including the default or trivial rule ∅ −→ Y.
Let R : X −→ Y be an association rule. Consider its generalization R′ : W −→ Y, where W = X \ Z is the new antecedent formed by removing from X the subset Z ⊆ X. Given an input dataset D, conditional on the fact that W occurs, we can create a 2 × 2 contingency table between Z and the consequent Y as shown in Table 12.17. The different cell values are as follows:
a = sup(WZY) = sup(XY) b = sup(WZ¬Y) = sup(X¬Y) c = sup(W¬ZY) d = sup(W¬Z¬Y)
Here, a denotes the number of transactions that contain both X and Y, b denotes the number of transactions that contain X but not Y, c denotes the number of transactions that contain W and Y but not Z, and finally d denotes the number of transactions that contain W but neither Z nor Y. The marginal counts are given as
row marginals: a + b = sup(WZ) = sup(X), c + d = sup(W¬Z)
column marginals: a + c = sup(WY), b + d = sup(W¬Y)
where the row marginals give the occurrence frequency of W with and without Z, and the column marginals specify the occurrence counts of W with and without Y. Finally, we can observe that the sum of all the cells is simply n = a + b + c + d = sup(W). Notice that when Z = X, we have W = ∅, and the contingency table defaults to the one shown in Table 12.8.
Given a contingency table conditional on W, we are interested in the odds ratio obtained by comparing the presence and absence of Z, that is,
$$\text{oddsratio} = \frac{a/(a+b)}{b/(a+b)} \bigg/ \frac{c/(c+d)}{d/(c+d)} = \frac{ad}{bc} \qquad (12.4)$$
Table 12.17. Contingency table for Z and Y, conditional on W = X \ Z

            |   Y   |  ¬Y   |
  Z         |   a   |   b   |  a + b
  ¬Z        |   c   |   d   |  c + d
            | a + c | b + d |  n = sup(W)
Recall that the odds ratio measures the odds of X, that is, W and Z, occurring with Y versus the odds of its subset W, but not Z, occurring with Y. Under the null hypothesis H0 that Z and Y are independent given W the odds ratio is 1. To see this, note that under the independence assumption the count in a cell of the contingency table is equal to the product of the corresponding row and column marginal counts divided by n, that is, under H0:
a = (a + b)(a + c)/n        b = (a + b)(b + d)/n
c = (c + d)(a + c)/n        d = (c + d)(b + d)/n

Plugging these values in Eq. (12.4), we obtain

$$\text{oddsratio} = \frac{ad}{bc} = \frac{(a+b)(a+c)\,(c+d)(b+d)}{(a+b)(b+d)\,(c+d)(a+c)} = 1$$
The null hypothesis therefore corresponds to H0 : oddsratio = 1, and the alternative hypothesis is Ha : oddsratio > 1. Under the null hypothesis, if we further assume that the row and column marginals are fixed, then a uniquely determines the other three values b, c, and d, and the probability mass function of observing the value a in the contingency table is given by the hypergeometric distribution. Recall that the hypergeometric distribution gives the probability of choosing s successes in t trials if we sample without replacement from a finite population of size T that has S successes in total, given as
$$P(s \mid t, S, T) = \frac{\binom{S}{s}\,\binom{T-S}{t-s}}{\binom{T}{t}}$$

In our context, we take the occurrence of Z as a success. The population size is T = sup(W) = n because we assume that W always occurs, and the total number of successes is the support of Z given W, that is, S = a + b. In t = a + c trials, the hypergeometric distribution gives the probability of s = a successes:

$$P\big(a \mid (a+c),(a+b),n\big) = \frac{\binom{a+b}{a}\binom{n-(a+b)}{(a+c)-a}}{\binom{n}{a+c}} = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{\dfrac{(a+b)!}{a!\,b!} \cdot \dfrac{(c+d)!}{c!\,d!}}{\dfrac{n!}{(a+c)!\,(n-(a+c))!}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!} \qquad (12.5)$$
Table 12.18. Contingency table: increase a by i

            |   Y   |  ¬Y   |
  Z         | a + i | b − i |  a + b
  ¬Z        | c − i | d + i |  c + d
            | a + c | b + d |  n = sup(W)
Our aim is to contrast the null hypothesis H0 that oddsratio = 1 with the alternative hypothesis Ha that oddsratio > 1. Because a determines the rest of the cells under fixed row and column marginals, we can see from Eq. (12.4) that the larger the value of a, the larger the odds ratio, and consequently the greater the evidence for Ha. We can obtain the p-value for a contingency table as extreme as that in Table 12.17 by summing Eq. (12.5) over all possible values a or larger:

$$\text{p-value}(a) = \sum_{i=0}^{\min(b,c)} P\big(a+i \mid (a+c),(a+b),n\big) = \sum_{i=0}^{\min(b,c)} \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,(a+i)!\,(b-i)!\,(c-i)!\,(d+i)!}$$

which follows from the fact that when we increase the count of a by i, then because the row and column marginals are fixed, b and c must decrease by i, and d must increase by i, as shown in Table 12.18. The lower the p-value the stronger the evidence that the odds ratio is greater than one, and thus, we may reject the null hypothesis H0 if p-value ≤ α, where α is the significance threshold (e.g., α = 0.01). This test is known as the Fisher Exact Test.

In summary, to check whether a rule R : X −→ Y is productive, we must compute p-value(a) = p-value(sup(XY)) of the contingency tables obtained from each of its generalizations R′ : W −→ Y, where W = X \ Z, for Z ⊆ X. If p-value(sup(XY)) > α for any of these comparisons, then we can reject the rule R : X −→ Y as nonproductive. On the other hand, if p-value(sup(XY)) ≤ α for all the generalizations, then R is productive. However, note that if |X| = k, then there are 2^k − 1 possible generalizations; to avoid this exponential complexity for large antecedents, we typically restrict our attention to only the immediate generalizations of the form R′ : X \ z −→ Y, where z ∈ X is one of the attribute values in the antecedent. However, we do include the trivial rule ∅ −→ Y because the conditional probability P(Y|X) = conf(X −→ Y) should also be higher than the prior probability P(Y) = conf(∅ −→ Y).
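For illustration, the Fisher exact test above can be transcribed almost literally into Python using exact integer binomials (math.comb); this is a plain sketch of Eq. (12.5) and the p-value sum, not an optimized implementation.

from math import comb

def hypergeom_pmf(a, b, c, d):
    # Eq. (12.5): P(a | (a+c), (a+b), n) for the table (a, b; c, d)
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_p_value(a, b, c, d):
    # sum P(a + i | (a+c), (a+b), n) for i = 0 .. min(b, c)
    return sum(hypergeom_pmf(a + i, b - i, c - i, d + i)
               for i in range(min(b, c) + 1))

# Example 12.16 below: rule pw2 -> c2 versus the default rule {} -> c2
print(fisher_p_value(49, 5, 1, 95))   # about 1.51e-32, so the rule is productive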
Example 12.16. Consider the rule R : pw2 −→ c2 obtained from the discretized Iris dataset. To test if it is productive, because there is only a single item in the antecedent, we compare it only with the default rule ∅ −→ c2. Using Table 12.17, the various cell values are
a = sup(pw2, c2) = 49        b = sup(pw2, ¬c2) = 5
c = sup(¬pw2, c2) = 1        d = sup(¬pw2, ¬c2) = 95

with the contingency table given as

            |  c2  | ¬c2 |
  pw2       |  49  |  5  |  54
  ¬pw2      |   1  | 95  |  96
            |  50  | 100 | 150

Thus the p-value is given as

$$\text{p-value} = \sum_{i=0}^{\min(b,c)} P\big(a+i \mid (a+c),(a+b),n\big) = P(49 \mid 50, 54, 150) + P(50 \mid 50, 54, 150) = \frac{\binom{54}{49}\binom{96}{1}}{\binom{150}{50}} + \frac{\binom{54}{50}\binom{96}{0}}{\binom{150}{50}} = 1.51 \times 10^{-32} + 1.57 \times 10^{-35} \approx 1.51 \times 10^{-32}$$
Since the p-value is extremely small, we can safely reject the null hypothesis that the odds ratio is 1. Instead, there is a strong relationship between X = pw2 and Y = c2, and we conclude that R : pw2 −→ c2 is a productive rule.
Example 12.17. Consider another rule {sw1,pw2} −→ c2, with X = {sw1,pw2} and Y = c2. Consider its three generalizations, and the corresponding contingency tables and p-values:
R′1 : pw2 −→ c2, with Z = {sw1} and W = X \ Z = {pw2}, p-value = 0.84:

  W = pw2  |  c2  | ¬c2 |
  sw1      |  34  |  4  |  38
  ¬sw1     |  15  |  1  |  16
           |  49  |  5  |  54

R′2 : sw1 −→ c2, with Z = {pw2} and W = X \ Z = {sw1}, p-value = 1.39 × 10⁻¹¹:

  W = sw1  |  c2  | ¬c2 |
  pw2      |  34  |  4  |  38
  ¬pw2     |   0  | 19  |  19
           |  34  | 23  |  57

R′3 : ∅ −→ c2, with Z = {sw1, pw2} and W = X \ Z = ∅, p-value = 3.55 × 10⁻¹⁷:

  W = ∅         |  c2  | ¬c2 |
  {sw1, pw2}    |  34  |   4 |   38
  ¬{sw1, pw2}   |  16  |  96 |  112
                |  50  | 100 |  150

We can see that whereas the p-value with respect to R′2 and R′3 is small, for R′1 we have p-value = 0.84, which is too high and thus we cannot reject the null hypothesis. We conclude that R : {sw1, pw2} −→ c2 is not productive. In fact, its generalization R′1 is the one that is productive, as shown in Example 12.16.
Multiple Hypothesis Testing

Given an input dataset D, there can be an exponentially large number of rules that need to be tested to check whether they are productive or not. We thus run into the multiple hypothesis testing problem, that is, just by the sheer number of hypothesis tests some unproductive rules will pass the p-value ≤ α threshold by random chance. A strategy for overcoming this problem is to use the Bonferroni correction of the significance level that explicitly takes into account the number of experiments performed during the hypothesis testing process. Instead of using the given α threshold, we should use an adjusted threshold α′ = α/#r, where #r is the number of rules to be tested or its estimate. This correction ensures that the rule false discovery rate is bounded by α, where a false discovery is to claim that a rule is productive when it is not.
Example 12.18. Consider the discretized Iris dataset, using the discretization shown in Table 12.10. Let us focus only on class-specific rules, that is, rules of the form X → ci . Since each example can take on only one value at a time for a given attribute, the maximum antecedent length is four, and the maximum number of class-specific rules that can be generated from the Iris dataset is given as
$$\#r = c \times \sum_{i=1}^{4} \binom{4}{i} b^i$$
where c is the number of Iris classes, and b is the maximum number of bins for any other attribute. The summation is over the antecedent size i, that is, the number of attributes to be used in the antecedent. Finally, there are bi possible combinations for the chosen set of i attributes. Because there are three Iris classes, and because each attribute has three bins, we have c = 3 and b = 3, and the number of possible rules is
$$\#r = 3 \times \sum_{i=1}^{4} \binom{4}{i} 3^i = 3\,(12 + 54 + 108 + 81) = 3 \cdot 255 = 765$$
Thus, if the input significance level is α = 0.01, then the adjusted significance level using the Bonferroni correction is α′ = α/#r = 0.01/765 = 1.31 × 10−5. The rule pw2 −→ c2 in Example 12.16 has p-value = 1.51 × 10−32, and thus it remains productive even when we use α′.
12.2.2 Permutation Test for Significance
A permutation or randomization test determines the distribution of a given test statistic by randomly modifying the observed data several times to obtain a random sample
of datasets, which can in turn be used for significance testing. In the context of pattern assessment, given an input dataset D, we first generate k randomly permuted datasets D1,D2,…,Dk. We can then perform different types of significance tests. For instance, given a pattern or rule we can check whether it is statistically significant by first computing the empirical probability mass function (EPMF) for the test statistic by computing its value θi in the ith randomized dataset Di for all i ∈ [1,k]. From these values we can generate the empirical cumulative distribution function
$$\hat{F}(x) = \hat{P}(\Theta \le x) = \frac{1}{k} \sum_{i=1}^{k} I(\theta_i \le x)$$
where I is an indicator variable that takes on the value 1 when its argument is true, and is 0 otherwise. Let θ be the value of the test statistic in the input dataset D, then p-value(θ), that is, the probability of obtaining a value as high as θ by random chance can be computed as
p-value(θ) = 1 − F̂(θ)
Given a significance level α, if p-value(θ) > α, then we accept the null hypothesis that the pattern/rule is not statistically significant. On the other hand, if p-value(θ) ≤ α, then we can reject the null hypothesis and conclude that the pattern is significant because a value as high as θ is highly improbable. The permutation test approach can also be used to assess an entire set of rules or patterns. For instance, we may test a collection of frequent itemsets by comparing the number of frequent itemsets in D with the distribution of the number of frequent itemsets empirically derived from the permuted datasets Di . We may also do this analysis as a function of minsup, and so on.
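A generic sketch of this permutation-test recipe in Python is given below; statistic and randomize are assumed user-supplied callables (randomize returns a randomized copy of the dataset, for instance via swap randomization), and the function simply evaluates the empirical CDF at the observed value.

import random

def permutation_p_value(D, statistic, randomize, k=100, seed=None):
    rng = random.Random(seed)
    theta = statistic(D)                                   # value on the input dataset
    thetas = [statistic(randomize(D, rng)) for _ in range(k)]
    # empirical CDF at theta, and p-value = 1 - F_hat(theta)
    f_hat = sum(1 for t in thetas if t <= theta) / k
    return 1.0 - f_hat

If the returned p-value is at most the chosen significance level α, the pattern or rule is deemed significant under this test.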
Swap Randomization
A key question in generating the permuted datasets Di is which characteristics of the input dataset D we should preserve. The swap randomization approach maintains as invariant the column and row margins for a given dataset, that is, the permuted datasets preserve the support of each item (the column margin) as well as the number of items in each transaction (the row margin). Given a dataset D, we randomly create k datasets that have the same row and column margins. We then mine frequent patterns in D and check whether the pattern statistics are different from those obtained using the randomized datasets. If the differences are not significant, we may conclude that the patterns arise solely from the row and column margins, and not from any interesting properties of the data.
Given a binary matrix D ⊆ T × I, the swap randomization method exchanges two nonzero cells of the matrix via a swap that leaves the row and column margins unchanged. To illustrate how swap works, consider any two transactions ta,tb ∈ T and any two items ia,ib ∈ I such that (ta,ia),(tb,ib) ∈ D and (ta,ib),(tb,ia) ̸∈ D, which corresponds to the 2 × 2 submatrix in D, given as
$$D(t_a, i_a; t_b, i_b) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
ALGORITHM 12.1. Generate Swap Randomized Dataset

SWAPRANDOMIZATION (t, D ⊆ T × I):
    while t > 0 do
        Select pairs (ta, ia), (tb, ib) ∈ D randomly
        if (ta, ib) ∉ D and (tb, ia) ∉ D then
            D ← D \ {(ta, ia), (tb, ib)} ∪ {(ta, ib), (tb, ia)}
            t ← t − 1
    return D
After a swap operation we obtain the new submatrix
$$D(t_a, i_b; t_b, i_a) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$
where we exchange the elements in D so that (ta,ib),(tb,ia) ∈ D, and (ta,ia),(tb,ib) ̸∈ D. We denote this operation as Swap(ta,ia;tb,ib). Notice that a swap does not affect the row and column margins, and we can thus generate a permuted dataset with the same row and column sums as D through a sequence of swaps. Algorithm 12.1 shows the pseudo-code for generating a swap randomized dataset. The algorithm performs t swap trials by selecting two pairs (ta,ia), (tb,ib) ∈ D at random; a swap is successful only if both (ta,ib), (tb,ia) ̸∈ D.
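The following Python sketch implements the swap trials of Algorithm 12.1 on a dataset represented as a set of (tid, item) pairs; the representation and the seeding are our choices for illustration, not the book's.

import random

def swap_randomize(D, t, seed=None):
    # t swap trials; each successful swap preserves the row and column margins
    rng = random.Random(seed)
    D = set(D)                       # work on a copy
    cells = list(D)
    while t > 0:
        (ta, ia), (tb, ib) = rng.sample(cells, 2)
        if (ta, ib) not in D and (tb, ia) not in D:
            D -= {(ta, ia), (tb, ib)}
            D |= {(ta, ib), (tb, ia)}
            cells = list(D)          # refresh the cell list after a successful swap
            t -= 1
    return D

# Example: Table 12.19a as (tid, item) pairs, followed by 150 swap trials
D = {(1,'A'),(1,'B'),(1,'D'),(1,'E'),(2,'B'),(2,'C'),(2,'E'),
     (3,'A'),(3,'B'),(3,'D'),(3,'E'),(4,'A'),(4,'B'),(4,'C'),(4,'E'),
     (5,'A'),(5,'B'),(5,'C'),(5,'D'),(5,'E'),(6,'B'),(6,'C'),(6,'D')}
Dr = swap_randomize(D, 150, seed=1)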
Example 12.19. Consider the input binary dataset D shown in Table 12.19a, whose row and column sums are also shown. Table 12.19b shows the resulting dataset after a single swap operation Swap(1,D;4,C), highlighted by the gray cells. When we apply another swap, namely Swap(2,C;4,A), we obtain the data in Table 12.19c. We can observe that the marginal counts remain invariant.
From the input dataset D in Table 12.19a we generated k = 100 swap randomized datasets, each of which is obtained by performing 150 swaps (the product of all possible transaction pairs and item pairs, that is, $\binom{6}{2} \cdot \binom{5}{2} = 150$). Let the test statistic be the total number of frequent itemsets using minsup = 3. Mining D results in |F| = 19 frequent itemsets. Likewise, mining each of the k = 100 permuted datasets results in the following empirical PMF for |F|:

P(|F| = 19) = 0.67        P(|F| = 17) = 0.33
Because p-value(19) = 0.67, we may conclude that the set of frequent itemsets is essentially determined by the row and column marginals.
Focusing on a specific itemset, consider ABDE, which is one of the maximal frequent itemsets in D, with sup(ABDE) = 3. The probability that ABDE is frequent is 17/100 = 0.17 because it is frequent in 17 of the 100 swapped datasets. As this probability is not very low, we may conclude that ABDE is not a statistically significant pattern; it has a relatively high chance of being frequent in random datasets.
Table 12.19. Input data D and swap randomization

(a) Input binary data D

  Tid |  A  B  C  D  E | Sum
   1  |  1  1  0  1  1 |  4
   2  |  0  1  1  0  1 |  3
   3  |  1  1  0  1  1 |  4
   4  |  1  1  1  0  1 |  4
   5  |  1  1  1  1  1 |  5
   6  |  0  1  1  1  0 |  3
  Sum |  4  6  4  4  5 |

(b) After Swap(1, D; 4, C)

  Tid |  A  B  C  D  E | Sum
   1  |  1  1  1  0  1 |  4
   2  |  0  1  1  0  1 |  3
   3  |  1  1  0  1  1 |  4
   4  |  1  1  0  1  1 |  4
   5  |  1  1  1  1  1 |  5
   6  |  0  1  1  1  0 |  3
  Sum |  4  6  4  4  5 |

(c) After Swap(2, C; 4, A)

  Tid |  A  B  C  D  E | Sum
   1  |  1  1  1  0  1 |  4
   2  |  1  1  0  0  1 |  3
   3  |  1  1  0  1  1 |  4
   4  |  0  1  1  1  1 |  4
   5  |  1  1  1  1  1 |  5
   6  |  0  1  1  1  0 |  3
  Sum |  4  6  4  4  5 |
Consider another itemset BCD that is not frequent in D because sup(BCD) = 2. The empirical PMF for the support of BCD is given as
P(sup = 2) = 0.54        P(sup = 3) = 0.44        P(sup = 4) = 0.02
In a majority of the datasets BCD is infrequent, and if minsup = 4, then p-value(sup = 4) = 0.02 implies that BCD is highly unlikely to be a frequent pattern.
Example 12.20. We apply the swap randomization approach to the discretized Iris dataset. Figure 12.3 shows the cumulative distribution of the number of frequent itemsets in D at various minimum support levels. We choose minsup = 10, for which we have F̂(10) = P(sup < 10) = 0.517. Put differently, P(sup ≥ 10) = 1 − 0.517 = 0.483, that is, 48.3% of the itemsets that occur at least once are frequent using minsup = 10.
Define the test statistic to be the relative lift, defined as the relative change in the lift value of itemset X when comparing the input dataset D and a randomized dataset Di, that is,

$$\text{rlift}(X, D, D_i) = \frac{\text{lift}(X, D) - \text{lift}(X, D_i)}{\text{lift}(X, D)}$$

For an m-itemset X = {x1, ..., xm}, by Eq. (12.2) note that

$$\text{lift}(X, D) = \frac{\text{rsup}(X, D)}{\prod_{j=1}^{m} \text{rsup}(x_j, D)}$$
[Figure 12.3. Cumulative distribution of the number of frequent itemsets as a function of minimum support.]
Because the swap randomization process leaves item supports (the column margins) intact, and does not change the number of transactions, we have rsup(xj,D) = rsup(xj , Di ), and |D| = |Di |. We can thus rewrite the relative lift statistic as
$$\text{rlift}(X, D, D_i) = \frac{\text{sup}(X, D) - \text{sup}(X, D_i)}{\text{sup}(X, D)} = 1 - \frac{\text{sup}(X, D_i)}{\text{sup}(X, D)}$$
We generate k = 100 randomized datasets and compute the average relative lift for each of the 140 frequent itemsets of size two or more in the input dataset, as lift values are not defined for single items. Figure 12.4 shows the cumulative distribution for average relative lift, which ranges from −0.55 to 0.998.

[Figure 12.4. Cumulative distribution for average relative lift.]

[Figure 12.5. Empirical PMF of the relative lift values for {sl1, pw2}.]

An average relative lift
close to 1 means that the corresponding frequent pattern hardly ever occurs in any of the randomized datasets. On the other hand, a larger negative average relative lift value means that the support in randomized datasets is higher than in the input dataset. Finally, a value close to zero means that the support of the itemset is the same in both the original and randomized datasets; it is mainly a consequence of the marginal counts, and thus of little interest.
Figure 12.4 indicates that 44% of the frequent itemsets have average relative lift values above 0.8. These patterns are likely to be of interest. The pattern with the highest lift value of 0.998 is {sl1,sw3,pl1,pw1,c1}. The itemset that has more or less the same support in the input and randomized datasets is {sl2,c3}; its average relative lift is −0.002. On the other hand, 5% of the frequent itemsets have average relative lift below −0.2. These are also of interest because they indicate more of a dis-association among the items, that is, the itemsets are more frequent by random chance. An example of such a pattern is {sl1,pw2}. Figure 12.5 shows the empirical probability mass function for its relative lift values across the 100 swap randomized datasets. Its average relative lift value is −0.55, and p-value(−0.2) = 0.069, which indicates a high probability that the itemset is disassociative.
12.2.3 Bootstrap Sampling for Confidence Interval
Typically the input transaction database D is just a sample from some population, and it is not enough to claim that a pattern X is frequent in D with support sup(X). What can we say about the range of possible support values for X? Likewise, for a rule R with a given lift value in D, what can we say about the range of lift values in different samples? In general, given a test assessment statistic Θ, bootstrap sampling allows one to infer the confidence interval for the possible values of Θ at a desired confidence level α.
The main idea is to generate k bootstrap samples from D using sampling with replacement, that is, assuming |D| = n, each sample Di is obtained by selecting at random n transactions from D with replacement. Given pattern X or rule R : X −→ Y, we can obtain the value of the test statistic in each of the bootstrap samples; let θi denote the value in sample Di. From these values we can generate the empirical cumulative distribution function for the statistic
$$\hat{F}(x) = \hat{P}(\Theta \le x) = \frac{1}{k} \sum_{i=1}^{k} I(\theta_i \le x)$$
where I is an indicator variable that takes on the value 1 when its argument is true, and 0 otherwise. Given a desired confidence level α (e.g., α = 0.95) we can compute the interval for the test statistic by discarding values from the tail ends of Fˆ on both sides that encompass (1 − α)/2 of the probability mass. Formally, let vt denote the critical value such that Fˆ (vt ) = t , which can be obtained from quantile function as vt = Fˆ −1 (t ). We then have
P∈[v(1−α)/2,v(1+α)/2]=Fˆ(1+α)/2−Fˆ(1−α)/2 =(1+α)/2−(1−α)/2=α
Thus, the α% confidence interval for the chosen test statistic is [v(1−α)/2,v(1+α)/2]
The pseudo-code for bootstrap sampling for estimating the confidence interval is shown in Algorithm 12.2.
ALGORITHM 12.2. Bootstrap Resampling Method

BOOTSTRAP-CONFIDENCEINTERVAL (X, α, k, D):
    for i ∈ [1, k] do
        Di ← sample of size n with replacement from D
        θi ← compute the test statistic Θ for X on Di
    F̂(x) = P(Θ ≤ x) = (1/k) Σ_{i=1}^{k} I(θi ≤ x)
    v_{(1−α)/2} ← F̂⁻¹((1 − α)/2)
    v_{(1+α)/2} ← F̂⁻¹((1 + α)/2)
    return [v_{(1−α)/2}, v_{(1+α)/2}]
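A compact Python sketch of Algorithm 12.2 is given below; statistic is an assumed user-supplied callable (here, the relative support of a fixed itemset in a list of transactions), and the quantiles are taken from the sorted bootstrap values using a simple empirical-quantile convention.

import math
import random

def bootstrap_ci(D, statistic, alpha=0.9, k=100, seed=None):
    rng = random.Random(seed)
    n = len(D)
    # evaluate the statistic on k samples of size n drawn with replacement
    thetas = sorted(statistic([rng.choice(D) for _ in range(n)]) for _ in range(k))
    # empirical quantiles at (1 - alpha)/2 and (1 + alpha)/2
    lo = thetas[max(0, math.ceil(k * (1 - alpha) / 2) - 1)]
    hi = thetas[min(k - 1, math.ceil(k * (1 + alpha) / 2) - 1)]
    return lo, hi

# Example usage: relative support of X = {'A', 'B'} in a toy transaction list
X = {'A', 'B'}
D = [{'A','B','D','E'}, {'B','C','E'}, {'A','B','D','E'},
     {'A','B','C','E'}, {'A','B','C','D','E'}, {'B','C','D'}]
rsup = lambda sample: sum(1 for t in sample if X <= t) / len(sample)
print(bootstrap_ci(D, rsup, alpha=0.9, k=200, seed=0))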
Example 12.21. Let the relative support rsup be the test statistic. Consider the itemset X = {sw1,pl3,pw3,cl3}, which has relative support rsup(X,D) = 0.113 (or sup(X, D) = 17) in the Iris dataset.
[Figure 12.6. Empirical PMF for the relative support of X.]

[Figure 12.7. Empirical cumulative distribution for the relative support of X, with the critical values v0.05 and v0.95 marked.]
Using k = 100 bootstrap samples, we first compute the relative support of X in each of the samples (rsup(X,Di)). The empirical probability mass function for the relative support of X is shown in Figure 12.6 and the corresponding empirical cumulative distribution is shown in Figure 12.7. Let the confidence level be α = 0.9. To obtain the confidence interval we have to discard the values that account for 0.05 of the probability mass at both ends of the relative support values. The critical values at the left and right ends are as follows:
v(1−α)/2 = v0.05 = 0.073        v(1+α)/2 = v0.95 = 0.16

Thus, the 90% confidence interval for the relative support of X is [0.073, 0.16], which corresponds to the interval [11, 24] for its absolute support. Note that the relative support of X in the input dataset is 0.113, which has p-value(0.113) = 0.45, and the expected relative support value of X is μrsup = 0.115.
12.3 FURTHER READING
Reviews of various measures for rule and pattern interestingness appear in Tan, Kumar, and Srivastava (2002) and Geng and Hamilton (2006) and Lallich, Teytaud, and Prudhomme (2007). Randomization and resampling methods for significance testing and confidence intervals are described in Megiddo and Srikant (1998) and Gionis et al. (2007). Statistical testing and validation approaches also appear in Webb (2006) and Lallich, Teytaud, and Prudhomme (2007).
Geng, L. and Hamilton, H. J. (2006). Interestingness measures for data mining: A survey. ACM Computing Surveys, 38 (3): 9.
Gionis, A., Mannila, H., Mielikäinen, T., and Tsaparas, P. (2007). Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1 (3): 14.
Lallich, S., Teytaud, O., and Prudhomme, E. (2007). Association rule interestingness: measure and statistical validation. In: Quality Measures in Data Mining. New York: Springer Science + Business Media, pp. 251–275.
Megiddo, N. and Srikant, R. (1998). Discovering predictive association rules. Proceedings of the 4th International Conference on Knowledge Discovery in Databases and Data Mining, pp. 274–278.
Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right interestingness measure for association patterns. Proceedings of the eighth ACM SIGKDD inter- national conference on Knowledge discovery and data mining. ACM, pp. 32–41.
Webb, G. I. (2006). Discovering significant rules. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 434–443.
12.4 EXERCISES
Q1. Show that if X and Y are independent, then conv(X −→ Y) = 1.
Q2. Show that if X and Y are independent then oddsratio(X −→ Y) = 1.
Q3. Show that for a frequent itemset X, the value of the relative lift statistic defined in Example 12.20 lies in the range [1 − |D|/minsup, 1].
Q4. Prove that all subsets of a minimal generator must themselves be minimal generators.
Table 12.20. Data for Q5

  Support        | 10,000 | 15,000 | 20,000 | 25,000 | 30,000 | 35,000 | 40,000 | 45,000
  No. of samples |    5   |   20   |   40   |   50   |   20   |   50   |    5   |   10
Q5. Let D be a binary database spanning one billion (10⁹) transactions. Because it is too time-consuming to mine it directly, we use Monte Carlo sampling to find the bounds on the frequency of a given itemset X. We run 200 sampling trials Di (i = 1, ..., 200), with each sample of size 100,000, and we obtain the support values for X in the various samples, as shown in Table 12.20. The table shows the number of samples where the support of the itemset was a given value. For instance, in 5 samples its support was 10,000. Answer the following questions:
(a) Draw a histogram for the table, and calculate the mean and variance of the support across the different samples.
(b) Find the lower and upper bound on the support of X at the 95% confidence level. The support values given should be for the entire database D.
(c) Assume that minsup = 0.25, and let the observed support of X in a sample be sup(X) = 32500. Set up a hypothesis testing framework to check if the support of X is significantly higher than the minsup value. What is the p-value?
Q6. Let A and B be two binary attributes. While mining association rules at 30% minimum support and 60% minimum confidence, the following rule was mined: A −→ B, with sup = 0.4, and conf = 0.66. Assume that there are a total of 10,000 customers, and that 4000 of them buy both A and B; 2000 buy A but not B, 3500 buy B but not A, and 500 buy neither A nor B.
Compute the dependence between A and B via the χ²-statistic from the corresponding contingency table. Do you think the discovered association is truly a strong rule, that is, does A predict B strongly? Set up a hypothesis testing framework, writing down the null and alternate hypotheses, to answer the above question at the 95% confidence level. Here are some values of the chi-squared statistic for the 95% confidence level for various degrees of freedom (df):

  df |  1    |  2    |  3    |  4    |  5     |  6
  χ² |  3.84 |  5.99 |  7.82 |  9.49 |  11.07 |  12.59
PART THREE CLUSTERING
CHAPTER 13 Representative-based Clustering
Given a dataset with n points in a d-dimensional space, D = {x1, x2, ..., xn}, and given the number of desired clusters k, the goal of representative-based clustering is to partition the dataset into k groups or clusters, which is called a clustering and is denoted as C = {C1, C2, ..., Ck}. Further, for each cluster Ci there exists a representative point that summarizes the cluster, a common choice being the mean (also called the centroid) μi of all points in the cluster, that is,

$$\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$$
where ni = |Ci | is the number of points in cluster Ci .
A brute-force or exhaustive algorithm for finding a good clustering is simply to
generate all possible partitions of n points into k clusters, evaluate some optimization score for each of them, and retain the clustering that yields the best score. The exact number of ways of partitioning n points into k nonempty and disjoint parts is given by the Stirling numbers of the second kind, given as
$$S(n, k) = \frac{1}{k!} \sum_{t=0}^{k} (-1)^t \binom{k}{t} (k - t)^n$$

Informally, each point can be assigned to any one of the k clusters, so there are at most k^n possible clusterings. However, any permutation of the k clusters within a given clustering yields an equivalent clustering; therefore, there are O(k^n / k!) clusterings of n points into k groups. It is clear that exhaustive enumeration and scoring of all possible clusterings is not practically feasible. In this chapter we describe two approaches for representative-based clustering, namely the K-means and expectation-maximization algorithms.
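For instance, evaluating the formula above with exact integer arithmetic shows how quickly the count grows (a small illustrative computation, not from the text):

from math import comb, factorial

def stirling2(n, k):
    # Stirling number of the second kind: partitions of n points into k nonempty clusters
    return sum((-1) ** t * comb(k, t) * (k - t) ** n for t in range(k + 1)) // factorial(k)

print(stirling2(10, 3))    # 9330
print(stirling2(100, 3))   # roughly 8.6e46 distinct 3-clusterings of only 100 points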
13.1 K-MEANS ALGORITHM
Given a clustering C = {C1, C2, ..., Ck} we need some scoring function that evaluates its quality or goodness. The sum of squared errors (SSE) scoring function is defined as

$$SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 \qquad (13.1)$$
The goal is to find the clustering that minimizes the SSE score:

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \{ SSE(\mathcal{C}) \}$$

K-means employs a greedy iterative approach to find a clustering that minimizes the SSE objective [Eq. (13.1)]. As such it can converge to a local optimum instead of the globally optimal clustering.
K-means initializes the cluster means by randomly generating k points in the data space. This is typically done by generating a value uniformly at random within the range for each dimension. Each iteration of K-means consists of two steps: (1) cluster assignment, and (2) centroid update. Given the k cluster means, in the cluster assignment step, each point xj ∈ D is assigned to the closest mean, which induces a clustering, with each cluster Ci comprising points that are closer to μi than any other cluster mean. That is, each point xj is assigned to cluster Cj∗, where
$$j^* = \arg\min_{i=1,\ldots,k} \Big\{ \| x_j - \mu_i \|^2 \Big\} \qquad (13.2)$$
Given a set of clusters Ci, i = 1, ..., k, in the centroid update step, new mean values are computed for each cluster from the points in Ci. The cluster assignment and centroid update steps are carried out iteratively until we reach a fixed point or local minimum. Practically speaking, one can assume that K-means has converged if the centroids do not change from one iteration to the next. For instance, we can stop if

$$\sum_{i=1}^{k} \| \mu_i^t - \mu_i^{t-1} \|^2 \le \epsilon$$

where ǫ > 0 is the convergence threshold, t denotes the current iteration, and μi^t denotes the mean for cluster Ci in iteration t.
The pseudo-code for K-means is given in Algorithm 13.1. Because the method
starts with a random guess for the initial centroids, K-means is typically run several times, and the run with the lowest SSE value is chosen to report the final clustering. It is also worth noting that K-means generates convex-shaped clusters because the region in the data space corresponding to each cluster can be obtained as the intersection of half-spaces resulting from hyperplanes that bisect and are normal to the line segments that join pairs of cluster centroids.
In terms of the computational complexity of K-means, we can see that the cluster assignment step takes O(nkd) time because for each of the n points we have to compute its distance to each of the k cluster means, which takes d operations in d dimensions. The centroid re-computation step takes O(nd) time because we have to add a total of n d-dimensional points. Assuming that there are t iterations, the total time for K-means is given as O(tnkd). In terms of the I/O cost it requires O(t) full database scans, because we have to read the entire database in each iteration.
ALGORITHM 13.1. K-means Algorithm

K-MEANS (D, k, ǫ):
    t ← 0
    Randomly initialize k centroids: μ1^t, μ2^t, ..., μk^t ∈ R^d
    repeat
        t ← t + 1
        Cj ← ∅ for all j = 1, ..., k
        // Cluster Assignment Step
        foreach xj ∈ D do
            j∗ ← argmin_i { ||xj − μi^{t−1}||² } // assign xj to the closest centroid
            Cj∗ ← Cj∗ ∪ {xj}
        // Centroid Update Step
        foreach i = 1 to k do
            μi^t ← (1/|Ci|) Σ_{xj ∈ Ci} xj
    until Σ_{i=1}^{k} ||μi^t − μi^{t−1}||² ≤ ǫ

Example 13.1. Consider the one-dimensional data shown in Figure 13.1a. Assume that we want to cluster the data into k = 2 groups. Let the initial centroids be μ1 = 2 and μ2 = 4. In the first iteration, we first compute the clusters, assigning each point
to the closest mean, to obtain
C1 = {2,3} C2 = {4,10,11,12,20,25,30}
We next update the means as follows:
μ1 = (2 + 3)/2 = 5/2 = 2.5
μ2 = (4 + 10 + 11 + 12 + 20 + 25 + 30)/7 = 112/7 = 16

The new centroids and clusters after the first iteration are shown in Figure 13.1b. For the second iteration, we repeat the cluster assignment and centroid update steps, as shown in Figure 13.1c, to obtain the new clusters:

C1 = {2, 3, 4}        C2 = {10, 11, 12, 20, 25, 30}

and the new means:

μ1 = (2 + 3 + 4)/3 = 9/3 = 3
μ2 = (10 + 11 + 12 + 20 + 25 + 30)/6 = 108/6 = 18
The complete process until convergence is illustrated in Figure 13.1. The final clusters are given as
C1 = {2,3,4,10,11,12} C2 = {20,25,30} with representatives μ1 = 7 and μ2 = 25.
[Figure 13.1. K-means in one dimension. Panels (a)–(f) show the data and the cluster means after each step: (a) initial dataset, μ1 = 2, μ2 = 4; (b) iteration t = 1, μ1 = 2.5, μ2 = 16; (c) iteration t = 2, μ1 = 3, μ2 = 18; (d) iteration t = 3, μ1 = 4.75, μ2 = 19.60; (e) iteration t = 4 and (f) iteration t = 5 (converged), with μ1 = 7 and μ2 = 25.]
Example 13.2 (K-means in Two Dimensions). In Figure 13.2 we illustrate the K-means algorithm on the Iris dataset, using the first two principal components as the two dimensions. Iris has n = 150 points, and we want to find k = 3 clusters, corresponding to the three types of Irises. A random initialization of the cluster means yields
μ1 =(−0.98,−1.24)T μ2 =(−2.96,1.16)T μ3 =(−1.69,−0.80)T
as shown in Figure 13.2a. With these initial clusters, K-means takes eight iterations
to converge. Figure 13.2b shows the clusters and their means after one iteration: μ1 =(1.56,−0.08)T μ2 =(−2.86,0.53)T μ3 =(−1.50,−0.05)T
[Figure 13.2. K-means in two dimensions: Iris principal components dataset. (a) random initialization, t = 0; (b) iteration t = 1; (c) iteration t = 8 (converged).]
Finally, Figure 13.2c shows the clusters on convergence. The final means are as follows:
μ1 =(2.64,0.19)T μ2 =(−2.35,0.27)T μ3 =(−0.66,−0.33)T
Figure 13.2 shows the cluster means as black points, and shows the convex regions of data space that correspond to each of the three clusters. The dashed lines (hyperplanes) are the perpendicular bisectors of the line segments joining two cluster centers. The resulting convex partition of the points comprises the clustering.
Figure 13.2c shows the final three clusters: C1 as circles, C2 as squares, and C3 as triangles. White points indicate a wrong grouping when compared to the known Iris types. Thus, we can see that C1 perfectly corresponds to iris-setosa, and the major- ity of the points in C2 correspond to iris-virginica, and in C3 to iris-versicolor. For example, three points (white squares) of type iris-versicolor are wrongly clustered in C2, and 14 points from iris-virginica are wrongly clustered in C3 (white triangles). Of course, because the Iris class label is not used in clustering, it is reasonable to expect that we will not obtain a perfect clustering.
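For readers who want to experiment, here is a compact NumPy sketch of the K-means procedure of Algorithm 13.1, applied to the data of Example 13.1. It initializes the centroids by sampling k data points, a common variation on the uniform random initialization described in the text, and it is only an illustration, not a reference implementation.

import numpy as np

def kmeans(D, k, eps=1e-6, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    mu = D[rng.choice(len(D), size=k, replace=False)]        # initial centroids
    for _ in range(max_iter):
        # cluster assignment: index of the closest centroid for each point
        dist = ((D[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # centroid update: mean of the points assigned to each cluster
        new_mu = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                           else mu[i] for i in range(k)])
        if ((new_mu - mu) ** 2).sum() <= eps:                 # convergence test
            return new_mu, labels
        mu = new_mu
    return mu, labels

# One-dimensional data of Example 13.1
D = np.array([[2.], [3.], [4.], [10.], [11.], [12.], [20.], [25.], [30.]])
mu, labels = kmeans(D, k=2, seed=0)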
13.2 KERNEL K-MEANS
In K-means, the separating boundary between clusters is linear. Kernel K-means allows one to extract nonlinear boundaries between clusters via the use of the kernel trick outlined in Chapter 5. This way the method can be used to detect nonconvex clusters.
In kernel K-means, the main idea is to conceptually map a data point xi in input space to a point φ(xi) in some high-dimensional feature space, via an appropriate nonlinear mapping φ. However, the kernel trick allows us to carry out the clustering in feature space purely in terms of the kernel function K(xi , xj ), which can be computed in input space, but corresponds to a dot (or inner) product φ (xi )T φ (xj ) in feature space.
Assume for the moment that all points xi ∈ D have been mapped to their corresponding images φ(xi) in feature space. Let K = K(xi,xj)i,j=1,…,n denote the n × n symmetric kernel matrix, where K(xi , xj ) = φ (xi )T φ (xj ). Let {C1 , . . . , Ck } specify the partitioning of the n points into k clusters, and let the corresponding cluster means in feature space be given as {μφ1 ,…,μφk }, where
$$\mu_i^\phi = \frac{1}{n_i} \sum_{x_j \in C_i} \phi(x_j)$$
is the mean of cluster Ci in feature space, with ni = |Ci |.
In feature space, the kernel K-means sum of squared errors objective can be written as

$$\min_{\mathcal{C}} \; SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \big\| \phi(x_j) - \mu_i^\phi \big\|^2$$
Expanding the kernel SSE objective in terms of the kernel function, we get

$$
\begin{aligned}
SSE(\mathcal{C}) &= \sum_{i=1}^{k} \sum_{x_j \in C_i} \big\| \phi(x_j) - \mu_i^\phi \big\|^2 \\
&= \sum_{i=1}^{k} \sum_{x_j \in C_i} \Big( \|\phi(x_j)\|^2 - 2\,\phi(x_j)^T \mu_i^\phi + \|\mu_i^\phi\|^2 \Big) \\
&= \sum_{i=1}^{k} \bigg( \sum_{x_j \in C_i} \|\phi(x_j)\|^2 - 2\,n_i \Big( \frac{1}{n_i} \sum_{x_j \in C_i} \phi(x_j) \Big)^{\!T} \mu_i^\phi + n_i \|\mu_i^\phi\|^2 \bigg) \\
&= \sum_{i=1}^{k} \sum_{x_j \in C_i} \phi(x_j)^T \phi(x_j) - \sum_{i=1}^{k} n_i \|\mu_i^\phi\|^2 \\
&= \sum_{i=1}^{k} \sum_{x_j \in C_i} K(x_j, x_j) - \sum_{i=1}^{k} \frac{1}{n_i} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) \\
&= \sum_{j=1}^{n} K(x_j, x_j) - \sum_{i=1}^{k} \frac{1}{n_i} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) \qquad (13.3)
\end{aligned}
$$
Thus, the kernel K-means SSE objective function can be expressed purely in terms of the kernel function. Like K-means, to minimize the SSE objective we adopt a greedy iterative approach. The basic idea is to assign each point to the closest mean in feature space, resulting in a new clustering, which in turn can be used to obtain new estimates for the cluster means. However, the main difficulty is that we cannot explicitly compute the mean of each cluster in feature space. Fortunately, explicitly obtaining the cluster means is not required; all operations can be carried out in terms of the kernel function K(xi, xj) = φ(xi)^T φ(xj).
Consider the distance of a point φ(xj ) to the mean μφi in feature space, which can be computed as
$$
\begin{aligned}
\big\| \phi(x_j) - \mu_i^\phi \big\|^2 &= \|\phi(x_j)\|^2 - 2\,\phi(x_j)^T \mu_i^\phi + \|\mu_i^\phi\|^2 \\
&= \phi(x_j)^T \phi(x_j) - \frac{2}{n_i} \sum_{x_a \in C_i} \phi(x_j)^T \phi(x_a) + \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} \phi(x_a)^T \phi(x_b) \\
&= K(x_j, x_j) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) + \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) \qquad (13.4)
\end{aligned}
$$
Thus, the distance of a point to a cluster mean in feature space can be computed using only kernel operations. In the cluster assignment step of kernel K-means, we assign a point to the closest cluster mean as follows:
$$
\begin{aligned}
C^*(x_j) &= \arg\min_{i} \big\{ \|\phi(x_j) - \mu_i^\phi\|^2 \big\} \\
&= \arg\min_{i} \Big\{ K(x_j, x_j) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) + \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) \Big\} \\
&= \arg\min_{i} \Big\{ \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) \Big\} \qquad (13.5)
\end{aligned}
$$
where we drop the K(xj,xj) term because it remains the same for all k clusters and does not impact the cluster assignment decision. Also note that the first term is simply the average pairwise kernel value for cluster Ci and is independent of the point xj . It is in fact the squared norm of the cluster mean in feature space. The second term is twice the average kernel value for points in Ci with respect to xj .
Algorithm 13.2 shows the pseudo-code for the kernel K-means method. It starts from an initial random partitioning of the points into k clusters. It then iteratively updates the cluster assignments by reassigning each point to the closest mean in feature space via Eq. (13.5). To facilitate the distance computation, it first computes the average kernel value, that is, the squared norm of the cluster mean, for each cluster (for loop in line 5). Next, it computes the average kernel value for each point xj with points in cluster Ci (for loop in line 7). The main cluster assignment step uses these values to compute the distance of xj from each of the clusters Ci and assigns xj to the closest mean. This reassignment information is used to re-partition the points into a new set of clusters. That is, all points xj that are closer to the mean for Ci make up the new cluster for the next iteration. This iterative process is repeated until convergence.
For convergence testing, we check if there is any change in the cluster assignments of the points. The number of points that do not change clusters is given as the sum $\sum_{i=1}^{k} |C_i^t \cap C_i^{t-1}|$, where t specifies the current iteration. The fraction of points reassigned to a different cluster in the current iteration is given as

$$\frac{n - \sum_{i=1}^{k} |C_i^t \cap C_i^{t-1}|}{n} = 1 - \frac{1}{n} \sum_{i=1}^{k} |C_i^t \cap C_i^{t-1}|$$
ALGORITHM 13.2. Kernel K-means Algorithm

KERNEL-KMEANS (K, k, ǫ):
 1  t ← 0
 2  C^t ← {C1^t, ..., Ck^t} // randomly partition the points into k clusters
 3  repeat
 4      t ← t + 1
 5      foreach Ci ∈ C^{t−1} do // compute squared norm of cluster means
 6          sqnorm_i ← (1/n_i²) Σ_{xa∈Ci} Σ_{xb∈Ci} K(xa, xb)
 7      foreach xj ∈ D do // average kernel value for xj and each Ci
 8          foreach Ci ∈ C^{t−1} do
 9              avg_ji ← (1/n_i) Σ_{xa∈Ci} K(xa, xj)
        // Find closest cluster for each point
10      foreach xj ∈ D do
11          foreach Ci ∈ C^{t−1} do
12              d(xj, Ci) ← sqnorm_i − 2 · avg_ji
13          j∗ ← argmin_i { d(xj, Ci) }
14          Cj∗^t ← Cj∗^t ∪ {xj} // cluster reassignment
15      C^t ← {C1^t, ..., Ck^t}
16  until 1 − (1/n) Σ_{i=1}^{k} |Ci^t ∩ Ci^{t−1}| ≤ ǫ
[Figure 13.3. Kernel K-means: linear versus Gaussian kernel. (a) linear kernel, t = 5 iterations; (b) Gaussian kernel, t = 4 iterations.]
Kernel K-means stops when the fraction of points with new cluster assignments falls below some threshold ǫ ≥ 0. For example, one can iterate until no points change clusters.
Computational Complexity

Computing the average kernel value for each cluster Ci takes O(n²) time across all clusters. Computing the average kernel value of each point with respect to each of the k clusters also takes O(n²) time. Finally, computing the closest mean for each point and the cluster reassignment takes O(kn) time. The total computational complexity of kernel K-means is thus O(tn²), where t is the number of iterations until convergence. The I/O complexity is O(t) scans of the kernel matrix K.
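The kernel K-means updates can likewise be written compactly against a precomputed kernel matrix. The NumPy sketch below evaluates the two kernel terms of Eq. (13.5) for all points and clusters at once; it is an illustrative sketch of Algorithm 13.2 under our own variable names, with a simple guard against empty clusters.

import numpy as np

def kernel_kmeans(K, k, eps=0.0, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)            # random initial partition
    for _ in range(max_iter):
        # one-hot membership matrix Z (n x k) and cluster sizes
        Z = np.zeros((n, k))
        Z[np.arange(n), labels] = 1.0
        sizes = np.maximum(Z.sum(axis=0), 1.0)     # guard against empty clusters
        # sqnorm_i = (1/n_i^2) sum_{a,b in C_i} K(a, b)
        sqnorm = np.einsum('ai,ab,bi->i', Z, K, Z) / sizes ** 2
        # avg[j, i] = (1/n_i) sum_{a in C_i} K(a, j)
        avg = (K @ Z) / sizes
        d = sqnorm[None, :] - 2.0 * avg            # Eq. (13.5), dropping K(x_j, x_j)
        new_labels = d.argmin(axis=1)
        changed = np.mean(new_labels != labels)    # fraction of reassigned points
        labels = new_labels
        if changed <= eps:
            break
    return labels

# Usage with a linear kernel, which reduces to standard K-means (Eq. (13.2))
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels = kernel_kmeans(X @ X.T, k=2, seed=0)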
Example 13.3. Figure 13.3 shows an application of the kernel K-means approach on a synthetic dataset with three embedded clusters. Each cluster has 100 points, for a total of n = 300 points in the dataset.
Using the linear kernel K(xi , xj ) = xTi xj is equivalent to the K-means algorithm because in this case Eq.(13.5) is the same as Eq.(13.2). Figure 13.3a shows the resulting clusters; points in C1 are shown as squares, in C2 as triangles, and in C3 as circles. We can see that K-means is not able to separate the three clusters due to the presence of the parabolic shaped cluster. The white points are those that are wrongly clustered, comparing with the ground truth in terms of the generated cluster labels.
Using the Gaussian kernel $K(x_i, x_j) = \exp\left\{ -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right\}$ from Eq. (5.10), with σ = 1.5, results in a near-perfect clustering, as shown in Figure 13.3b. Only four points (white triangles) are grouped incorrectly with cluster C2, whereas they should belong to cluster C1. We can see from this example that kernel K-means is able to handle nonlinear cluster boundaries. One caveat is that the value of the spread parameter σ has to be set by trial and error.
13.3 EXPECTATION-MAXIMIZATION CLUSTERING
The K-means approach is an example of a hard assignment clustering, where each point can belong to only one cluster. We now generalize the approach to consider soft assignment of points to clusters, so that each point has a probability of belonging to each cluster.
Let D consist of n points xj in d-dimensional space Rd. Let Xa denote the random variable corresponding to the ath attribute. We also use Xa to denote the ath column vector, corresponding to the n data samples from Xa. Let X = (X1,X2,…,Xd) denote the vector random variable across the d-attributes, with xj being a data sample from X.
Gaussian Mixture Model
We assume that each cluster Ci is characterized by a multivariate normal distribution, that is,
$$f_i(x) = f(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right\} \qquad (13.6)$$

where the cluster mean μi ∈ R^d and covariance matrix Σi ∈ R^{d×d} are both unknown parameters. fi(x) is the probability density at x attributable to cluster Ci. We assume that the probability density function of X is given as a Gaussian mixture model over all the k cluster normals, defined as
$$f(x) = \sum_{i=1}^{k} f_i(x)\, P(C_i) = \sum_{i=1}^{k} f(x \mid \mu_i, \Sigma_i)\, P(C_i) \qquad (13.7)$$

where the prior probabilities P(Ci) are called the mixture parameters, which must satisfy the condition

$$\sum_{i=1}^{k} P(C_i) = 1$$

The Gaussian mixture model is thus characterized by the mean μi, the covariance matrix Σi, and the mixture probability P(Ci) for each of the k normal distributions. We write the set of all the model parameters compactly as

θ = {μ1, Σ1, P(C1), ..., μk, Σk, P(Ck)}
Maximum Likelihood Estimation
Given the dataset D, we define the likelihood of θ as the conditional probability of the data D given the model parameters θ, denoted as P(D|θ). Because each of the n points xj is considered to be a random sample from X (i.e., independent and identically distributed as X), the likelihood of θ is given as
$$P(D \mid \theta) = \prod_{j=1}^{n} f(x_j)$$

The goal of maximum likelihood estimation (MLE) is to choose the parameters θ that maximize the likelihood, that is,

$$\theta^* = \arg\max_{\theta} \{ P(D \mid \theta) \}$$

It is typical to maximize the log of the likelihood function because it turns the product over the points into a summation and the maximum value of the likelihood and log-likelihood coincide. That is, MLE maximizes

$$\theta^* = \arg\max_{\theta} \{ \ln P(D \mid \theta) \}$$

where the log-likelihood function is given as

$$\ln P(D \mid \theta) = \sum_{j=1}^{n} \ln f(x_j) = \sum_{j=1}^{n} \ln\left( \sum_{i=1}^{k} f(x_j \mid \mu_i, \Sigma_i)\, P(C_i) \right) \qquad (13.8)$$
Directly maximizing the log-likelihood over θ is hard. Instead, we can use the expectation-maximization (EM) approach for finding the maximum likelihood estimates for the parameters θ. EM is a two-step iterative approach that starts from an initial guess for the parameters θ. Given the current estimates for θ, in the expectation step EM computes the cluster posterior probabilities P (Ci |xj ) via the Bayes theorem:
$$P(C_i \mid x_j) = \frac{P(C_i \text{ and } x_j)}{P(x_j)} = \frac{P(x_j \mid C_i)\, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a)\, P(C_a)}$$
Because each cluster is modeled as a multivariate normal distribution [Eq. (13.6)], the probability of xj given cluster Ci can be obtained by considering a small interval ǫ > 0 centered at xj, as follows:

P(xj | Ci) ≃ 2ǫ · f(xj | μi, Σi) = 2ǫ · fi(xj)

The posterior probability of Ci given xj is thus given as

$$P(C_i \mid x_j) = \frac{f_i(x_j) \cdot P(C_i)}{\sum_{a=1}^{k} f_a(x_j) \cdot P(C_a)} \qquad (13.9)$$

and P(Ci | xj) can be considered as the weight or contribution of the point xj to cluster Ci. Next, in the maximization step, using the weights P(Ci | xj) EM re-estimates θ, that is, it re-estimates the parameters μi, Σi, and P(Ci) for each cluster Ci. The re-estimated mean is given as the weighted average of all the points, the re-estimated covariance matrix is given as the weighted covariance over all pairs of dimensions, and the re-estimated prior probability for each cluster is given as the fraction of weights that contribute to that cluster. In Section 13.3.3 we formally derive the expressions for the MLE estimates for the cluster parameters, and in Section 13.3.4 we describe the generic EM approach in more detail. We begin with the application of the EM clustering algorithm for the one-dimensional and general d-dimensional cases.
13.3.1 EM in One Dimension
Consider a dataset D consisting of a single attribute X, where each point xj ∈ R (j = 1, . . . , n) is a random sample from X. For the mixture model [Eq. (13.7)], we use univariate normals for each cluster:
$$f_i(x) = f(x \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right\}$$

with the cluster parameters μi, σi², and P(Ci). The EM approach consists of three steps:
initialization, expectation step, and maximization step.
Initialization
For each cluster Ci, with i = 1, 2, ..., k, we can randomly initialize the cluster parameters μi, σi², and P(Ci). The mean μi is selected uniformly at random from the range of possible values for X. It is typical to assume that the initial variance is given as σi² = 1. Finally, the cluster prior probabilities are initialized to P(Ci) = 1/k, so that each cluster has an equal probability.
Expectation Step
Assume that for each of the k clusters we have an estimate for the parameters, namely the mean μi, variance σi², and prior probability P(Ci). Given these values, the cluster posterior probabilities are computed using Eq. (13.9):

$$P(C_i \mid x_j) = \frac{f(x_j \mid \mu_i, \sigma_i^2) \cdot P(C_i)}{\sum_{a=1}^{k} f(x_j \mid \mu_a, \sigma_a^2) \cdot P(C_a)}$$
For convenience, we use the notation wij = P(Ci | xj), treating the posterior probability as the weight or contribution of the point xj to cluster Ci. Further, let wi = (wi1, ..., win)^T
denote the weight vector for cluster Ci across all the n points.
Maximization Step
Assuming that all the posterior probability values or weights wij = P (Ci |xj ) are known, the maximization step, as the name implies, computes the maximum likelihood estimates of the cluster parameters by re-estimating μi , σi2 , and P (Ci ).
The re-estimated value for the cluster mean, μi , is computed as the weighted mean of all the points:
$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij} \cdot x_j}{\sum_{j=1}^{n} w_{ij}}$$

In terms of the weight vector wi and the attribute vector X = (x1, x2, ..., xn)^T, we can rewrite the above as

$$\mu_i = \frac{\mathbf{w}_i^T X}{\mathbf{w}_i^T \mathbf{1}}$$

The re-estimated value of the cluster variance is computed as the weighted variance across all the points:

$$\sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij}\,(x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}}$$
Let Zi = X − μi 1 = (x1 − μi, x2 − μi, ..., xn − μi)^T = (zi1, zi2, ..., zin)^T be the centered attribute vector for cluster Ci, and let Zi^s be the squared vector given as Zi^s = (z²i1, ..., z²in)^T. The variance can be expressed compactly in terms of the dot product between the weight vector and the squared centered vector:

$$\sigma_i^2 = \frac{\mathbf{w}_i^T Z_i^s}{\mathbf{w}_i^T \mathbf{1}}$$
Finally, the prior probability of cluster Ci is re-estimated as the fraction of the total
weight belonging to Ci , computed as
nj=1 wij nj=1 wij nj=1 wij
P(Ci)=ka=1nj=1waj = nj=11 = n where we made use of the fact that
(13.10)
P(Ci|xj)=1 In vector notation the prior probability can be written as
P ( C i ) = w Ti 1 n
k i=1
k i=1
wij =
346 Representative-based Clustering
Iteration
Starting from an initial set of values for the cluster parameters μi, σi² and P(Ci) for all i = 1,...,k, the EM algorithm applies the expectation step to compute the weights wij = P(Ci|xj). These values are then used in the maximization step to compute the updated cluster parameters μi, σi² and P(Ci). Both the expectation and maximization steps are iteratively applied until convergence, for example, until the means change very little from one iteration to the next.
Example 13.4 (EM in 1D). Figure 13.4 illustrates the EM algorithm on the one-dimensional dataset:

x1 = 1.0, x2 = 1.3, x3 = 2.2, x4 = 2.6, x5 = 2.8, x6 = 5.0, x7 = 7.3, x8 = 7.4, x9 = 7.5, x10 = 7.7, x11 = 7.9

We assume that k = 2. The initial random means are shown in Figure 13.4a, with the initial parameters given as

μ1 = 6.63, σ1² = 1, P(C1) = 0.5
μ2 = 7.57, σ2² = 1, P(C2) = 0.5

After repeated expectation and maximization steps, the EM method converges after five iterations. After t = 1 (see Figure 13.4b) we have

μ1 = 3.72, σ1² = 6.13, P(C1) = 0.71
μ2 = 7.4, σ2² = 0.69, P(C2) = 0.29

After the final iteration (t = 5), as shown in Figure 13.4c, we have

μ1 = 2.48, σ1² = 1.69, P(C1) = 0.55
μ2 = 7.56, σ2² = 0.05, P(C2) = 0.45

One of the main advantages of the EM algorithm over K-means is that it returns the probability P(Ci|xj) of each cluster Ci for each point xj. However, in this one-dimensional example, these values are essentially binary; assigning each point to the cluster with the highest posterior probability, we obtain the hard clustering

C1 = {x1, x2, x3, x4, x5, x6} (white points)
C2 = {x7, x8, x9, x10, x11} (gray points)

as illustrated in Figure 13.4c.
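The E and M steps above translate directly into code. The following Python/NumPy sketch runs univariate EM on the dataset of Example 13.4; it is an illustrative implementation of the preceding formulas rather than the authors' code, and names such as em_1d are our own. Because the means are initialized at random, intermediate values will differ from those reported in the example; the point is the structure of the two steps.

import numpy as np

def em_1d(X, k=2, eps=1e-6, max_iter=100):
    """Univariate EM for a mixture of k normals, following Eqs. (13.9)-(13.10)."""
    n = len(X)
    rng = np.random.default_rng(0)
    mu = rng.uniform(X.min(), X.max(), size=k)    # means drawn from the range of X
    var = np.ones(k)                              # initial variances set to 1
    prior = np.full(k, 1.0 / k)                   # equal cluster priors
    for t in range(max_iter):
        # Expectation step: posterior weights w[i, j] = P(C_i | x_j)
        dens = np.array([
            np.exp(-(X - mu[i]) ** 2 / (2 * var[i])) / np.sqrt(2 * np.pi * var[i])
            for i in range(k)
        ])
        w = dens * prior[:, None]
        w /= w.sum(axis=0)
        # Maximization step: re-estimate mu_i, sigma_i^2, P(C_i)
        old_mu = mu.copy()
        wsum = w.sum(axis=1)
        mu = (w @ X) / wsum
        var = (w * (X - mu[:, None]) ** 2).sum(axis=1) / wsum
        prior = wsum / n
        if np.sum((mu - old_mu) ** 2) <= eps:     # convergence test on the means
            break
    return mu, var, prior, w

X = np.array([1.0, 1.3, 2.2, 2.6, 2.8, 5.0, 7.3, 7.4, 7.5, 7.7, 7.9])
mu, var, prior, w = em_1d(X, k=2)
labels = w.argmax(axis=0)   # hard assignment by largest posterior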
13.3.2 EM in d Dimensions
We now consider the EM method in d dimensions, where each cluster is characterized by a multivariate normal distribution [Eq. (13.6)], with parameters μi, Σi, and P(Ci). For each cluster Ci, we thus need to estimate the d-dimensional mean vector:
$$\boldsymbol{\mu}_i = (\mu_{i1},\mu_{i2},\ldots,\mu_{id})^T$$
Figure 13.4. EM in one dimension: (a) initialization t = 0, (b) iteration t = 1, (c) iteration t = 5 (converged).

and the d × d covariance matrix:
$$\boldsymbol{\Sigma}_i = \begin{pmatrix} (\sigma_1^i)^2 & \sigma_{12}^i & \cdots & \sigma_{1d}^i \\ \sigma_{21}^i & (\sigma_2^i)^2 & \cdots & \sigma_{2d}^i \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1}^i & \sigma_{d2}^i & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$
Because the covariance matrix is symmetric, we have to estimate $\binom{d}{2} = \frac{d(d-1)}{2}$ pairwise covariances and d variances, for a total of $\frac{d(d+1)}{2}$ parameters for Σi. This may be too many parameters for practical purposes because we may not have enough data to estimate all of them reliably. For example, if d = 100, then we have to estimate 100 · 101/2 = 5050 parameters! One simplification is to assume that all dimensions are independent, which leads to a diagonal covariance matrix:
$$\boldsymbol{\Sigma}_i = \begin{pmatrix} (\sigma_1^i)^2 & 0 & \cdots & 0 \\ 0 & (\sigma_2^i)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$
Under the independence assumption we have only d parameters to estimate for the diagonal covariance matrix.
Initialization
For each cluster Ci, with i = 1,2,...,k, we randomly initialize the mean μi by selecting a value μia for each dimension Xa uniformly at random from the range of Xa. The covariance matrix is initialized as the d × d identity matrix, Σi = I. Finally, the cluster prior probabilities are initialized to P(Ci) = 1/k, so that each cluster has an equal probability.
Expectation Step
In the expectation step, we compute the posterior probability of cluster Ci given point xj using Eq. (13.9), with i = 1,...,k and j = 1,...,n. As before, we use the shorthand notation wij = P(Ci|xj) to denote the fact that P(Ci|xj) can be considered as the weight or contribution of point xj to cluster Ci, and we use the notation wi = (wi1, wi2, ..., win)T to denote the weight vector for cluster Ci across all the n points.
Maximization Step
Given the weights wij, in the maximization step, we re-estimate Σi, μi and P(Ci). The mean μi for cluster Ci can be estimated as
$$\boldsymbol{\mu}_i = \frac{\sum_{j=1}^{n} w_{ij}\cdot\mathbf{x}_j}{\sum_{j=1}^{n} w_{ij}}$$
which can be expressed compactly in matrix form as
$$\boldsymbol{\mu}_i = \frac{\mathbf{D}^T\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{1}} \tag{13.11}$$
Let $\mathbf{Z}_i = \mathbf{D} - \mathbf{1}\cdot\boldsymbol{\mu}_i^T$ be the centered data matrix for cluster Ci. Let $\mathbf{z}_{ji} = \mathbf{x}_j - \boldsymbol{\mu}_i \in \mathbb{R}^d$ denote the jth centered point in Zi. We can express Σi compactly using the outer-product form
$$\boldsymbol{\Sigma}_i = \frac{\sum_{j=1}^{n} w_{ij}\,\mathbf{z}_{ji}\mathbf{z}_{ji}^T}{\mathbf{w}_i^T\mathbf{1}} \tag{13.12}$$
Considering the pairwise attribute view, the covariance between dimensions Xa and Xb is estimated as
$$\sigma_{ab}^i = \frac{\sum_{j=1}^{n} w_{ij}(x_{ja}-\mu_{ia})(x_{jb}-\mu_{ib})}{\sum_{j=1}^{n} w_{ij}}$$
where xja and μia denote the values of the ath dimension for xj and μi, respectively.
Finally, the prior probability P(Ci) for each cluster is the same as in the one-dimensional case [Eq. (13.10)], given as
$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n} = \frac{\mathbf{w}_i^T\mathbf{1}}{n} \tag{13.13}$$
A formal derivation of these re-estimates for μi [Eq. (13.11)], Σi [Eq. (13.12)], and
P (Ci ) [Eq. (13.13)] is given in Section 13.3.3.
EM Clustering Algorithm
The pseudo-code for the multivariate EM clustering algorithm is given in Algorithm 13.3. After initialization of μi, Σi, and P(Ci) for all i = 1,...,k, the expectation and maximization steps are repeated until convergence. For the convergence test, we check whether $\sum_i \|\boldsymbol{\mu}_i^t - \boldsymbol{\mu}_i^{t-1}\|^2 \leq \epsilon$, where ǫ > 0 is the convergence threshold, and t denotes the iteration. In words, the iterative process continues until the change in the cluster means becomes very small.

ALGORITHM 13.3. Expectation-Maximization (EM) Algorithm

EXPECTATION-MAXIMIZATION (D, k, ǫ):
1   t ← 0
    // Initialization
2   Randomly initialize μ1^t, ..., μk^t
3   Σi^t ← I, ∀i = 1,...,k
4   P^t(Ci) ← 1/k, ∀i = 1,...,k
5   repeat
6       t ← t + 1
        // Expectation Step
7       for i = 1,...,k and j = 1,...,n do
8           wij ← f(xj|μi, Σi)·P(Ci) / Σ_{a=1}^{k} f(xj|μa, Σa)·P(Ca)      // posterior probability P^t(Ci|xj)
        // Maximization Step
9       for i = 1,...,k do
10          μi^t ← Σ_{j=1}^{n} wij·xj / Σ_{j=1}^{n} wij                    // re-estimate mean
11          Σi^t ← Σ_{j=1}^{n} wij (xj − μi)(xj − μi)^T / Σ_{j=1}^{n} wij  // re-estimate covariance matrix
12          P^t(Ci) ← Σ_{j=1}^{n} wij / n                                  // re-estimate priors
13  until Σ_{i=1}^{k} ‖μi^t − μi^{t−1}‖² ≤ ǫ
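As a companion to the pseudo-code, the following Python/NumPy sketch implements the expectation and maximization steps of Algorithm 13.3 with full covariance matrices. It is an illustrative translation of the formulas, not reference code from the text; the function name em_cluster and the use of scipy.stats.multivariate_normal for the density f(xj|μi, Σi) are our own choices.

import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(D, k, eps=1e-3, max_iter=100, seed=0):
    """Multivariate EM clustering (in the spirit of Algorithm 13.3), full covariances."""
    n, d = D.shape
    rng = np.random.default_rng(seed)
    # Initialization: random means, identity covariances, equal priors
    mu = rng.uniform(D.min(axis=0), D.max(axis=0), size=(k, d))
    sigma = np.array([np.eye(d) for _ in range(k)])
    prior = np.full(k, 1.0 / k)
    for t in range(max_iter):
        # Expectation step: w[i, j] = P(C_i | x_j)
        dens = np.array([multivariate_normal.pdf(D, mean=mu[i], cov=sigma[i])
                         for i in range(k)])          # shape (k, n)
        w = dens * prior[:, None]
        w /= w.sum(axis=0)
        # Maximization step: re-estimate mu_i, Sigma_i, P(C_i)
        old_mu = mu.copy()
        wsum = w.sum(axis=1)                          # total weight per cluster
        mu = (w @ D) / wsum[:, None]
        for i in range(k):
            Z = D - mu[i]                             # centered points
            sigma[i] = (w[i][:, None] * Z).T @ Z / wsum[i]
        prior = wsum / n
        if np.sum((mu - old_mu) ** 2) <= eps:         # convergence on the means
            break
    return mu, sigma, prior, w

For the diagonal-covariance variant one would keep only the per-dimension variances, for example by zeroing the off-diagonal entries of sigma[i] after each maximization step.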
Example 13.5 (EM in 2D). Figure 13.5 illustrates the EM algorithm for the two-dimensional Iris dataset, where the two attributes are its first two principal components. The dataset consists of n = 150 points, and EM was run using k = 3, with
full covariance matrix for each cluster. The initial cluster parameters are
$$\boldsymbol{\Sigma}_i = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \quad\text{and}\quad P(C_i) = 1/3$$
with the means chosen as
$$\boldsymbol{\mu}_1 = (-3.59, 0.25)^T \qquad \boldsymbol{\mu}_2 = (-1.09, -0.46)^T \qquad \boldsymbol{\mu}_3 = (0.75, 1.07)^T$$
The cluster means (shown in black) and the joint probability density function are shown in Figure 13.5a.
The EM algorithm took 36 iterations to converge (using ǫ = 0.001). An intermediate stage of the clustering is shown in Figure 13.5b, for t = 1. Finally at iteration t = 36, shown in Figure 13.5c, the three clusters have been correctly identified, with the following parameters:
$$\boldsymbol{\mu}_1 = (-2.02, 0.017)^T \qquad \boldsymbol{\Sigma}_1 = \begin{pmatrix} 0.56 & -0.29 \\ -0.29 & 0.23 \end{pmatrix} \qquad P(C_1) = 0.36$$
$$\boldsymbol{\mu}_2 = (-0.51, -0.23)^T \qquad \boldsymbol{\Sigma}_2 = \begin{pmatrix} 0.36 & -0.22 \\ -0.22 & 0.19 \end{pmatrix} \qquad P(C_2) = 0.31$$
$$\boldsymbol{\mu}_3 = (2.64, 0.19)^T \qquad \boldsymbol{\Sigma}_3 = \begin{pmatrix} 0.05 & -0.06 \\ -0.06 & 0.21 \end{pmatrix} \qquad P(C_3) = 0.33$$
To see the effect of a full versus diagonal covariance matrix, we ran the EM algorithm on the Iris principal components dataset under the independence assumption, which took t = 29 iterations to converge. The final cluster parameters were
$$\boldsymbol{\mu}_1 = (-2.1, 0.28)^T \qquad \boldsymbol{\Sigma}_1 = \begin{pmatrix} 0.59 & 0 \\ 0 & 0.11 \end{pmatrix} \qquad P(C_1) = 0.30$$
$$\boldsymbol{\mu}_2 = (-0.67, -0.40)^T \qquad \boldsymbol{\Sigma}_2 = \begin{pmatrix} 0.49 & 0 \\ 0 & 0.11 \end{pmatrix} \qquad P(C_2) = 0.37$$
$$\boldsymbol{\mu}_3 = (2.64, 0.19)^T \qquad \boldsymbol{\Sigma}_3 = \begin{pmatrix} 0.05 & 0 \\ 0 & 0.21 \end{pmatrix} \qquad P(C_3) = 0.33$$
Figure 13.6b shows the clustering results. Also shown are the contours of the normal density function for each cluster (plotted so that the contours do not intersect). The results for the full covariance matrix are shown in Figure 13.6a, which is a projection of Figure 13.5c onto the 2D plane. Points in C1 are shown as squares, in C2 as triangles, and in C3 as circles.
One can observe that the diagonal assumption leads to axis parallel contours for the normal density, contrasted with the rotated contours for the full covariance matrix. The full matrix yields much better clustering, which can be observed by considering the number of points grouped with the wrong Iris type (the white points). For the full covariance matrix only three points are in the wrong group, whereas for the diagonal covariance matrix 25 points are in the wrong cluster, 15 from iris-virginica (white triangles) and 10 from iris-versicolor (white squares). The points corresponding to iris-setosa are correctly clustered as C3 in both approaches.
Figure 13.5. EM algorithm in two dimensions: mixture of k = 3 Gaussians. (a) Iteration t = 0 (initialization); (b) iteration t = 1; (c) iteration t = 36 (converged).

Figure 13.6. Iris principal components dataset: full versus diagonal covariance matrix. (a) Full covariance matrix (t = 36); (b) diagonal covariance matrix (t = 29).
Computational Complexity
For the expectation step, to compute the cluster posterior probabilities, we need to invert Σi and compute its determinant |Σi|, which takes O(d³) time. Across the k clusters the time is O(kd³). For the expectation step, evaluating the density f(xj|μi, Σi) takes O(d²) time, for a total time of O(knd²) over the n points and k clusters. For the maximization step, the time is dominated by the update for Σi, which takes O(knd²) time over all k clusters. The computational complexity of the EM method is thus O(t(kd³ + nkd²)), where t is the number of iterations. If we use a diagonal covariance matrix, then the inverse and determinant of Σi can be computed in O(d) time. Density computation per point takes O(d) time, so that the time for the expectation step is O(knd). The maximization step also takes O(knd) time to re-estimate Σi. The total time for a diagonal covariance matrix is therefore O(tnkd). The I/O complexity for the EM algorithm is O(t) complete database scans because we read the entire set of points in each iteration.
K-means as Specialization of EM
Although we assumed a normal mixture model for the clusters, the EM approach can be applied with other models for the cluster density distribution P (xj |Ci ). For instance, K-means can be considered as a special case of the EM algorithm, obtained as follows:
$$P(\mathbf{x}_j|C_i) = \begin{cases} 1 & \text{if } C_i = \arg\min_{C_a}\|\mathbf{x}_j-\boldsymbol{\mu}_a\|^2 \\ 0 & \text{otherwise} \end{cases}$$
Using Eq. (13.9), the posterior probability P(Ci|xj) is given as
$$P(C_i|\mathbf{x}_j) = \frac{P(\mathbf{x}_j|C_i)P(C_i)}{\sum_{a=1}^{k} P(\mathbf{x}_j|C_a)P(C_a)}$$
One can see that if P(xj|Ci) = 0, then P(Ci|xj) = 0. Otherwise, if P(xj|Ci) = 1, then P(xj|Ca) = 0 for all a ≠ i, and thus P(Ci|xj) = (1·P(Ci))/(1·P(Ci)) = 1. Putting it all together, the posterior probability is given as
$$P(C_i|\mathbf{x}_j) = \begin{cases} 1 & \text{if } \mathbf{x}_j\in C_i, \text{ i.e., if } C_i = \arg\min_{C_a}\|\mathbf{x}_j-\boldsymbol{\mu}_a\|^2 \\ 0 & \text{otherwise} \end{cases}$$
It is clear that for K-means the cluster parameters are μi and P (Ci ); we can ignore the covariance matrix.
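To make the connection concrete, here is a minimal sketch (our own illustration, with hypothetical names such as kmeans_as_em) in which the expectation step produces hard 0/1 weights by nearest mean and the maximization step reduces to recomputing cluster means, which is exactly the K-means iteration. Empty clusters are not handled.

import numpy as np

def kmeans_as_em(D, mu, iters=10):
    """K-means viewed as EM with hard posteriors P(C_i|x_j) in {0, 1}."""
    k = len(mu)
    for _ in range(iters):
        # 'Expectation': hard assignment to the closest mean
        d2 = ((D[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances
        w = np.zeros((len(D), k))
        w[np.arange(len(D)), d2.argmin(axis=1)] = 1.0              # P(C_i|x_j) is 0 or 1
        # 'Maximization': weighted means reduce to plain cluster means
        mu = (w.T @ D) / w.sum(axis=0)[:, None]
    return mu, w.argmax(axis=1)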
13.3.3 Maximum Likelihood Estimation
In this section, we derive the maximum likelihood estimates for the cluster parameters μi, Σi and P(Ci). We do this by taking the derivative of the log-likelihood function with respect to each of these parameters and setting the derivative to zero.
The partial derivative of the log-likelihood function [Eq. (13.8)] with respect to some parameter θi for cluster Ci is given as
$$\frac{\partial}{\partial\theta_i}\ln P(\mathbf{D}|\theta) = \frac{\partial}{\partial\theta_i}\left(\sum_{j=1}^{n}\ln f(\mathbf{x}_j)\right) = \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\cdot\frac{\partial f(\mathbf{x}_j)}{\partial\theta_i} = \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\,\frac{\partial}{\partial\theta_i}\left(\sum_{a=1}^{k} f(\mathbf{x}_j|\boldsymbol{\mu}_a,\boldsymbol{\Sigma}_a)P(C_a)\right) = \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\cdot\frac{\partial}{\partial\theta_i}\Big(f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)P(C_i)\Big)$$
The last step follows from the fact that because θi is a parameter for the ith cluster, the mixture components for the other clusters are constants with respect to θi. Using the fact that $|\boldsymbol{\Sigma}_i| = \frac{1}{|\boldsymbol{\Sigma}_i^{-1}|}$, the multivariate normal density in Eq. (13.6) can be written as
$$f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = (2\pi)^{-\frac{d}{2}}\,|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\,\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\} \tag{13.14}$$
where
$$g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = -\frac{1}{2}(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i) \tag{13.15}$$
Thus, the derivative of the log-likelihood function can be written as
$$\frac{\partial}{\partial\theta_i}\ln P(\mathbf{D}|\theta) = \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}\cdot\frac{\partial}{\partial\theta_i}\Big[(2\pi)^{-\frac{d}{2}}\,|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\,\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\}\,P(C_i)\Big] \tag{13.16}$$
Below, we make use of the fact that
$$\frac{\partial}{\partial\theta_i}\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\} = \exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\}\cdot\frac{\partial}{\partial\theta_i}g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) \tag{13.17}$$

Estimation of Mean
To derive the maximum likelihood estimate for the mean μi, we have to take the derivative of the log-likelihood with respect to θi = μi. As per Eq. (13.16), the only term involving μi is exp{g(μi, Σi)}. Using the fact that
$$\frac{\partial}{\partial\boldsymbol{\mu}_i}g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i) \tag{13.18}$$
and making use of Eq. (13.17), the partial derivative of the log-likelihood [Eq. (13.16)] with respect to μi is
$$\frac{\partial}{\partial\boldsymbol{\mu}_i}\ln P(\mathbf{D}|\theta) = \sum_{j=1}^{n}\frac{1}{f(\mathbf{x}_j)}(2\pi)^{-\frac{d}{2}}|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\}P(C_i)\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i) = \sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)P(C_i)}{f(\mathbf{x}_j)}\cdot\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i) = \sum_{j=1}^{n} w_{ij}\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i)$$
where we made use of Eqs. (13.14) and (13.9), and the fact that
$$w_{ij} = P(C_i|\mathbf{x}_j) = \frac{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)P(C_i)}{f(\mathbf{x}_j)}$$
Setting the partial derivative of the log-likelihood to the zero vector, and multiplying both sides by Σi, we get
$$\sum_{j=1}^{n} w_{ij}(\mathbf{x}_j-\boldsymbol{\mu}_i) = \mathbf{0}, \quad\text{which implies that}\quad \sum_{j=1}^{n} w_{ij}\,\mathbf{x}_j = \boldsymbol{\mu}_i\sum_{j=1}^{n} w_{ij}, \quad\text{and therefore}$$
$$\boldsymbol{\mu}_i = \frac{\sum_{j=1}^{n} w_{ij}\,\mathbf{x}_j}{\sum_{j=1}^{n} w_{ij}} \tag{13.19}$$
which is precisely the re-estimation formula we used in Eq. (13.11).

Estimation of Covariance Matrix
To re-estimate the covariance matrix Σi, we take the partial derivative of Eq. (13.16) with respect to $\boldsymbol{\Sigma}_i^{-1}$ using the product rule for the differentiation of the term $|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\exp\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\}$.
Using the fact that for any square matrix A we have $\frac{\partial|\mathbf{A}|}{\partial\mathbf{A}} = |\mathbf{A}|\cdot(\mathbf{A}^{-1})^T$, the derivative of $|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}$ with respect to $\boldsymbol{\Sigma}_i^{-1}$ is
$$\frac{\partial|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}}{\partial\boldsymbol{\Sigma}_i^{-1}} = \frac{1}{2}\cdot|\boldsymbol{\Sigma}_i^{-1}|^{-\frac{1}{2}}\cdot|\boldsymbol{\Sigma}_i^{-1}|\cdot\boldsymbol{\Sigma}_i = \frac{1}{2}\cdot|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\cdot\boldsymbol{\Sigma}_i \tag{13.20}$$
Next, using the fact that for the square matrix $\mathbf{A}\in\mathbb{R}^{d\times d}$ and vectors $\mathbf{a},\mathbf{b}\in\mathbb{R}^{d}$ we have $\frac{\partial}{\partial\mathbf{A}}\mathbf{a}^T\mathbf{A}\mathbf{b} = \mathbf{a}\mathbf{b}^T$, the derivative of $\exp\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\}$ with respect to $\boldsymbol{\Sigma}_i^{-1}$ is obtained from Eq. (13.17) as follows:
$$\frac{\partial}{\partial\boldsymbol{\Sigma}_i^{-1}}\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\} = -\frac{1}{2}\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\}(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T \tag{13.21}$$
Using the product rule on Eqs. (13.20) and (13.21), we get
$$\frac{\partial}{\partial\boldsymbol{\Sigma}_i^{-1}}\Big[|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\}\Big] = \frac{1}{2}|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\boldsymbol{\Sigma}_i\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\} - \frac{1}{2}|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\}(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T = \frac{1}{2}\cdot|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\cdot\exp\big\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\big\}\Big(\boldsymbol{\Sigma}_i-(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\Big) \tag{13.22}$$
Plugging Eq. (13.22) into Eq. (13.16), the derivative of the log-likelihood function with respect to $\boldsymbol{\Sigma}_i^{-1}$ is given as
$$\frac{\partial}{\partial\boldsymbol{\Sigma}_i^{-1}}\ln P(\mathbf{D}|\theta) = \frac{1}{2}\sum_{j=1}^{n}\frac{(2\pi)^{-\frac{d}{2}}|\boldsymbol{\Sigma}_i^{-1}|^{\frac{1}{2}}\exp\{g(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\}P(C_i)}{f(\mathbf{x}_j)}\Big(\boldsymbol{\Sigma}_i-(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\Big) = \frac{1}{2}\sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)P(C_i)}{f(\mathbf{x}_j)}\cdot\Big(\boldsymbol{\Sigma}_i-(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\Big) = \frac{1}{2}\sum_{j=1}^{n} w_{ij}\Big(\boldsymbol{\Sigma}_i-(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\Big)$$
Setting the derivative to the d × d zero matrix $\mathbf{0}_{d\times d}$, we can solve for Σi:
$$\sum_{j=1}^{n} w_{ij}\Big(\boldsymbol{\Sigma}_i-(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\Big) = \mathbf{0}_{d\times d}, \quad\text{which implies that}$$
$$\boldsymbol{\Sigma}_i = \frac{\sum_{j=1}^{n} w_{ij}(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T}{\sum_{j=1}^{n} w_{ij}} \tag{13.23}$$
Thus, we can see that the maximum likelihood estimate for the covariance matrix is given as the weighted outer-product form in Eq. (13.12).
Estimating the Prior Probability: Mixture Parameters
To obtain a maximum likelihood estimate for the mixture parameters or the prior probabilities P(Ci), we have to take the partial derivative of the log-likelihood [Eq. (13.16)] with respect to P(Ci). However, we have to introduce a Lagrange multiplier α for the constraint that $\sum_{a=1}^{k} P(C_a) = 1$. We thus take the following derivative:
$$\frac{\partial}{\partial P(C_i)}\left(\ln P(\mathbf{D}|\theta) + \alpha\Big(\sum_{a=1}^{k} P(C_a)-1\Big)\right) \tag{13.24}$$
The partial derivative of the log-likelihood in Eq. (13.16) with respect to P(Ci) gives
$$\frac{\partial}{\partial P(C_i)}\ln P(\mathbf{D}|\theta) = \sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)}{f(\mathbf{x}_j)}$$
The derivative in Eq. (13.24) thus evaluates to
$$\left(\sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)}{f(\mathbf{x}_j)}\right) + \alpha$$
Setting the derivative to zero, and multiplying on both sides by P(Ci), we get
$$\sum_{j=1}^{n}\frac{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)P(C_i)}{f(\mathbf{x}_j)} = -\alpha\,P(C_i), \quad\text{that is,}\quad \sum_{j=1}^{n} w_{ij} = -\alpha\,P(C_i) \tag{13.25}$$
Taking the summation of Eq. (13.25) over all clusters yields
$$\sum_{i=1}^{k}\sum_{j=1}^{n} w_{ij} = -\alpha\sum_{i=1}^{k} P(C_i) \quad\text{or}\quad n = -\alpha \tag{13.26}$$
The last step follows from the fact that $\sum_{i=1}^{k} w_{ij} = 1$. Plugging Eq. (13.26) into Eq. (13.25) gives us the maximum likelihood estimate for P(Ci) as follows:
$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n} \tag{13.27}$$
which matches the formula in Eq. (13.13).
We can see that all three parameters μi, Σi, and P(Ci) for cluster Ci depend on the weights wij, which correspond to the cluster posterior probabilities P(Ci|xj). Equations (13.19), (13.23), and (13.27) thus do not represent a closed-form solution for maximizing the log-likelihood function. Instead, we use the iterative EM approach to compute the wij in the expectation step, and we then re-estimate μi, Σi and P(Ci) in the maximization step. Next, we describe the EM framework in some more detail.
13.3.4 EM Approach
Maximizing the log-likelihood function [Eq. (13.8)] directly is hard because the mixture term appears inside the logarithm. The problem is that for any point xj we do not know which normal, or mixture component, it comes from. Suppose that we knew this information, that is, suppose each point xj had an associated value indicating the cluster that generated the point. As we shall see, it is much easier to maximize the log-likelihood given this information.
The categorical attribute corresponding to the cluster label can be modeled as a vector random variable C = (C1, C2, ..., Ck), where Ci is a Bernoulli random variable (see Section 3.1.2 for details on how to model a categorical variable). If a given point is generated from cluster Ci, then Ci = 1, otherwise Ci = 0. The parameter P(Ci) gives the probability P(Ci = 1). Because each point can be generated from only one cluster, if Ca = 1 for a given point, then Ci = 0 for all i ≠ a. It follows that $\sum_{i=1}^{k} P(C_i) = 1$.
For each point xj, let its cluster vector be cj = (cj1, ..., cjk)T. Only one component of cj has value 1. If cji = 1, it means that Ci = 1, that is, the cluster Ci generates the point xj. The probability mass function of C is given as
$$P(\mathbf{C}=\mathbf{c}_j) = \prod_{i=1}^{k} P(C_i)^{c_{ji}}$$
Given the cluster information cj for each point xj, the conditional probability density function for X is given as
$$f(\mathbf{x}_j|\mathbf{c}_j) = \prod_{i=1}^{k} f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)^{c_{ji}}$$
Only one cluster can generate xj, say Ca, in which case cja = 1, and the above expression would simplify to f(xj|cj) = f(xj|μa, Σa).
The pair (xj, cj) is a random sample drawn from the joint distribution of vector random variables X = (X1, ..., Xd) and C = (C1, ..., Ck), corresponding to the d data attributes and k cluster attributes. The joint density function of X and C is given as
$$f(\mathbf{x}_j \text{ and } \mathbf{c}_j) = f(\mathbf{x}_j|\mathbf{c}_j)\,P(\mathbf{c}_j) = \prod_{i=1}^{k}\Big(f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\,P(C_i)\Big)^{c_{ji}}$$
The log-likelihood for the data given the cluster information is as follows:
$$\ln P(\mathbf{D}|\theta) = \ln\prod_{j=1}^{n} f(\mathbf{x}_j \text{ and } \mathbf{c}_j|\theta) = \sum_{j=1}^{n}\ln f(\mathbf{x}_j \text{ and } \mathbf{c}_j|\theta) = \sum_{j=1}^{n}\sum_{i=1}^{k} c_{ji}\Big(\ln f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) + \ln P(C_i)\Big) \tag{13.28}$$

Expectation Step
In the expectation step, we compute the expected value of the log-likelihood for the labeled data given in Eq. (13.28). The expectation is over the missing cluster information cj, treating μi, Σi, P(Ci), and xj as fixed. Owing to the linearity of expectation, the expected value of the log-likelihood is given as
$$E[\ln P(\mathbf{D}|\theta)] = \sum_{j=1}^{n}\sum_{i=1}^{k} E[c_{ji}]\Big(\ln f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) + \ln P(C_i)\Big) \tag{13.29}$$
The expected value E[cji] can be computed as
$$E[c_{ji}] = 1\cdot P(c_{ji}=1|\mathbf{x}_j) + 0\cdot P(c_{ji}=0|\mathbf{x}_j) = P(c_{ji}=1|\mathbf{x}_j) = P(C_i|\mathbf{x}_j) = \frac{P(\mathbf{x}_j|C_i)P(C_i)}{P(\mathbf{x}_j)} = \frac{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)P(C_i)}{f(\mathbf{x}_j)} = w_{ij}$$
Thus, in the expectation step we use the values of $\theta = \{\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i,P(C_i)\}_{i=1}^{k}$ to estimate the posterior probabilities or weights wij for each point for each cluster. Using E[cji] = wij, the expected value of the log-likelihood function can be rewritten as
$$E[\ln P(\mathbf{D}|\theta)] = \sum_{j=1}^{n}\sum_{i=1}^{k} w_{ij}\Big(\ln f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) + \ln P(C_i)\Big) \tag{13.30}$$
Maximization Step
In the maximization step, we maximize the expected value of the log-likelihood [Eq. (13.30)]. Taking the derivative with respect to μi, Σi or P(Ci), we can ignore the terms for all the other clusters.
The derivative of Eq. (13.30) with respect to μi is given as
$$\frac{\partial}{\partial\boldsymbol{\mu}_i}E[\ln P(\mathbf{D}|\theta)] = \frac{\partial}{\partial\boldsymbol{\mu}_i}\sum_{j=1}^{n} w_{ij}\ln f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = \sum_{j=1}^{n} w_{ij}\cdot\frac{1}{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)}\cdot\frac{\partial}{\partial\boldsymbol{\mu}_i}f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = \sum_{j=1}^{n} w_{ij}\cdot\frac{1}{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)}\cdot f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i) = \sum_{j=1}^{n} w_{ij}\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i)$$
where we used the observation that
$$\frac{\partial}{\partial\boldsymbol{\mu}_i}f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i)$$
which follows from Eqs. (13.14), (13.17), and (13.18). Setting the derivative of the expected value of the log-likelihood to the zero vector, and multiplying on both sides by Σi, we get
$$\boldsymbol{\mu}_i = \frac{\sum_{j=1}^{n} w_{ij}\,\mathbf{x}_j}{\sum_{j=1}^{n} w_{ij}}$$
matching the formula in Eq. (13.11).
Making use of Eqs. (13.22) and (13.14), we obtain the derivative of Eq. (13.30) with respect to $\boldsymbol{\Sigma}_i^{-1}$ as follows:
$$\frac{\partial}{\partial\boldsymbol{\Sigma}_i^{-1}}E[\ln P(\mathbf{D}|\theta)] = \sum_{j=1}^{n} w_{ij}\cdot\frac{1}{f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)}\cdot\frac{1}{2}\,f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\Big(\boldsymbol{\Sigma}_i-(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\Big) = \frac{1}{2}\sum_{j=1}^{n} w_{ij}\cdot\Big(\boldsymbol{\Sigma}_i-(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T\Big)$$
Setting the derivative to the d × d zero matrix and solving for Σi yields
$$\boldsymbol{\Sigma}_i = \frac{\sum_{j=1}^{n} w_{ij}(\mathbf{x}_j-\boldsymbol{\mu}_i)(\mathbf{x}_j-\boldsymbol{\mu}_i)^T}{\sum_{j=1}^{n} w_{ij}}$$
which is the same as that in Eq. (13.12).
Using the Lagrange multiplier α for the constraint $\sum_{i=1}^{k} P(C_i) = 1$, and noting that in the log-likelihood function [Eq. (13.30)] the term ln f(xj|μi, Σi) is a constant with respect to P(Ci), we obtain the following:
$$\frac{\partial}{\partial P(C_i)}\left(E[\ln P(\mathbf{D}|\theta)] + \alpha\Big(\sum_{i=1}^{k} P(C_i)-1\Big)\right) = \frac{\partial}{\partial P(C_i)}\left(\sum_{j=1}^{n} w_{ij}\ln P(C_i) + \alpha\,P(C_i)\right) = \left(\sum_{j=1}^{n} w_{ij}\cdot\frac{1}{P(C_i)}\right) + \alpha$$
Setting the derivative to zero, we get
$$\sum_{j=1}^{n} w_{ij} = -\alpha\cdot P(C_i)$$
Using the same derivation as in Eq. (13.26) we obtain
$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$$
which is identical to the re-estimation formula in Eq. (13.13).
13.4 FURTHER READING
The K-means algorithm was proposed in several contexts during the 1950s and 1960s; among the first works to develop the method are MacQueen (1967), Lloyd (1982), and Hartigan (1975). Kernel K-means was first proposed in Schölkopf, Smola, and Müller (1996). The EM algorithm was proposed in Dempster, Laird, and Rubin (1977). A good review of the EM method can be found in McLachlan and Krishnan (2008). For a scalable and incremental representative-based clustering method that can also generate hierarchical clusterings see Zhang, Ramakrishnan, and Livny (1996).
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39 (1): 1–38.
Hartigan, J. A. (1975). Clustering Algorithms. New York: John Wiley & Sons.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Informa- tion Theory, 28 (2): 129–137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Berkeley: University of California Press, pp. 281–297.
McLachlan, G. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd ed. Hoboken, NJ: John Wiley & Sons.
Schölkopf, B., Smola, A., and Müller, K.-R. (1996). Nonlinear component analysis as a kernel eigenvalue problem. Technical Report No. 44. Tübingen, Germany: Max-Planck-Institut für biologische Kybernetik.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record, 25 (2): 103–114.
13.5 EXERCISES
Q1. Given the following points: 2, 4, 10, 12, 3, 20, 30, 11, 25. Assume k = 3, and that we randomly pick the initial means μ1 = 2, μ2 = 4 and μ3 = 6. Show the clusters obtained using the K-means algorithm after one iteration, and show the new means for the next iteration.
Table 13.1. Dataset for Q2

x     P(C1|x)   P(C2|x)
2     0.9       0.1
3     0.8       0.1
7     0.3       0.7
9     0.1       0.9
2     0.9       0.1
1     0.8       0.2
Q2. Given the data points in Table 13.1, and their probability of belonging to two clusters. Assume that these points were produced by a mixture of two univariate normal distributions. Answer the following questions:
(a) Find the maximum likelihood estimate of the means μ1 and μ2.
(b) Assume that μ1 = 2, μ2 = 7, and σ1 = σ2 = 1. Find the probability that the point x = 5 belongs to cluster C1 and to cluster C2. You may assume that the prior probability of each cluster is equal (i.e., P(C1) = P(C2) = 0.5), and the prior probability P (x = 5) = 0.029.
Table 13.2. Dataset for Q3

      X1    X2
x1    0     2
x2    0     0
x3    1.5   0
x4    5     0
x5    5     2

Q3. Given the two-dimensional points in Table 13.2, assume that k = 2, and that initially the points are assigned to clusters as follows: C1 = {x1, x2, x4} and C2 = {x3, x5}. Answer the following questions:
(a) Apply the K-means algorithm until convergence, that is, the clusters do not change, assuming (1) the usual Euclidean distance or the L2-norm as the distance between points, defined as $\|\mathbf{x}_i-\mathbf{x}_j\|_2 = \big(\sum_{a=1}^{d}(x_{ia}-x_{ja})^2\big)^{1/2}$, and (2) the Manhattan distance or the L1-norm defined as $\|\mathbf{x}_i-\mathbf{x}_j\|_1 = \sum_{a=1}^{d}|x_{ia}-x_{ja}|$.
(b) Apply the EM algorithm with k = 2 assuming that the dimensions are independent. Show one complete execution of the expectation and the maximization steps. Start with the assumption that P(Ci|xja) = 0.5 for a = 1, 2 and j = 1, ..., 5.

Q4. Given the categorical database in Table 13.3. Find k = 2 clusters in this data using the EM method. Assume that each attribute is independent, and that the domain of each attribute is {A, C, T}. Initially assume that the points are partitioned as follows: C1 = {x1, x4}, and C2 = {x2, x3}. Assume that P(C1) = P(C2) = 0.5.
The probability of an attribute value given a cluster is given as
$$P(x_{ja}|C_i) = \frac{\text{No. of times the symbol } x_{ja} \text{ occurs in cluster } C_i}{\text{No. of objects in cluster } C_i}$$
for a = 1, 2. The probability of a point given a cluster is then given as
$$P(\mathbf{x}_j|C_i) = \prod_{a=1}^{2} P(x_{ja}|C_i)$$
Instead of computing the mean for each cluster, generate a partition of the objects by doing a hard assignment. That is, in the expectation step compute P(Ci|xj), and in the maximization step assign the point xj to the cluster with the largest P(Ci|xj) value, which gives a new partitioning of the points. Show one full iteration of the EM algorithm and show the resulting clusters.

Table 13.3. Dataset for Q4

      X1   X2
x1    A    T
x2    A    A
x3    C    C
x4    A    C

Table 13.4. Dataset for Q5

       X1    X2    X3
x1     0.5   4.5   2.5
x2     2.2   1.5   0.1
x3     3.9   3.5   1.1
x4     2.1   1.9   4.9
x5     0.5   3.2   1.2
x6     0.8   4.3   2.6
x7     2.7   1.1   3.1
x8     2.5   3.5   2.8
x9     2.8   3.9   1.5
x10    0.1   4.1   2.9
Q5. Given the points in Table 13.4, assume that there are two clusters: C1 and C2, with μ1 = (0.5, 4.5, 2.5)T and μ2 = (2.5, 2, 1.5)T. Initially assign each point to the closest mean, and compute the covariance matrices Σi and the prior probabilities P(Ci) for i = 1, 2. Next, answer which cluster is more likely to have produced x8?
Q6. Consider the data in Table 13.5. Answer the following questions:
(a) Compute the kernel matrix K between the points assuming the following kernel: K(xi, xj) = 1 + xi^T xj.
(b) Assume initial cluster assignments of C1 = {x1, x2} and C2 = {x3, x4}. Using kernel K-means, which cluster should x1 belong to in the next step?

Table 13.5. Data for Q6

      X1    X2    X3
x1    0.4   0.9   0.6
x2    0.5   0.1   0.6
x3    0.6   0.3   0.6
x4    0.4   0.8   0.5

Q7. Prove the following equivalence for the multivariate normal density function:
$$\frac{\partial}{\partial\boldsymbol{\mu}_i}f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = f(\mathbf{x}_j|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)\,\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}_j-\boldsymbol{\mu}_i)$$
CHAPTER 14 Hierarchical Clustering
Given n points in a d-dimensional space, the goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram. The clusters in the hierarchy range from the fine-grained to the coarse-grained: the lowest level of the tree (the leaves) consists of each point in its own cluster, whereas the highest level (the root) consists of all points in one cluster. Both of these may be considered to be trivial clusterings. At some intermediate level, we may find meaningful clusters. If the user supplies k, the desired number of clusters, we can choose the level at which there are k clusters.
There are two main algorithmic approaches to mine hierarchical clusters: agglomerative and divisive. Agglomerative strategies work in a bottom-up manner. That is, starting with each of the n points in a separate cluster, they repeatedly merge the most similar pair of clusters until all points are members of the same cluster. Divisive strategies do just the opposite, working in a top-down manner. Starting with all the points in the same cluster, they recursively split the clusters until all points are in separate clusters. In this chapter we focus on agglomerative strategies. We discuss some divisive strategies in Chapter 16, in the context of graph partitioning.
14.1 PRELIMINARIES
Given a dataset D = {x1,…,xn}, where xi ∈ Rd, a clustering C = {C1,…,Ck} is a partition of D, that is, each cluster is a set of points Ci ⊆ D, such that the clusters are pairwise disjoint Ci ∩ Cj = ∅ (for all i ̸= j), and ∪ki=1Ci = D. A clustering A = {A1,…,Ar} is said to be nested in another clustering B = {B1,…,Bs} if and only if r >s, and for each cluster Ai ∈A, there exists a cluster Bj ∈B, such that Ai ⊆Bj. Hierarchical clustering yields a sequence of n nested partitions C1,…,Cn, ranging from the trivial clustering C1 = {x1},…,{xn} where each point is in a separate cluster, to the other trivial clustering Cn = {x1,…,xn}, where all points are in one cluster. In general, the clustering Ct−1 is nested in the clustering Ct. The cluster dendrogram is a rooted binary tree that captures this nesting structure, with edges between cluster Ci ∈Ct−1 and cluster Cj ∈Ct if Ci is nested in Cj, that is, if Ci ⊂Cj. In this way the dendrogram captures the entire sequence of nested clusterings.
Figure 14.1. Hierarchical clustering dendrogram.
Example 14.1. Figure 14.1 shows an example of hierarchical clustering of five labeled points: A, B, C, D, and E. The dendrogram represents the following sequence of nested partitions:

Clustering    Clusters
C1            {A}, {B}, {C}, {D}, {E}
C2            {AB}, {C}, {D}, {E}
C3            {AB}, {CD}, {E}
C4            {ABCD}, {E}
C5            {ABCDE}

with Ct−1 ⊂ Ct for t = 2,...,5. We assume that A and B are merged before C and D.
Number of Hierarchical Clusterings
The number of different nested or hierarchical clusterings corresponds to the number of different binary rooted trees or dendrograms with n leaves with distinct labels. Any tree with t nodes has t − 1 edges. Also, any rooted binary tree with m leaves has m − 1 internal nodes. Thus, a dendrogram with m leaf nodes has a total of t = m + m − 1 = 2m − 1 nodes, and consequently t − 1 = 2m − 2 edges. To count the number of different dendrogram topologies, let us consider how we can extend a dendrogram with m leaves by adding an extra leaf, to yield a dendrogram with m + 1 leaves. Note that we can add the extra leaf by splitting (i.e., branching from) any of the 2m − 2 edges. Further, we can also add the new leaf as a child of a new root, giving 2m−2+1 = 2m−1 new dendrograms with m + 1 leaves. The total number of different dendrograms with n leaves is thus obtained by the following product:
$$\prod_{m=1}^{n-1}(2m-1) = 1\times 3\times 5\times 7\times\cdots\times(2n-3) = (2n-3)!! \tag{14.1}$$
Figure 14.2. Number of hierarchical clusterings: (a) n = 1, (b) n = 2, (c) n = 3.
The index m in Eq. (14.1) goes up to n − 1 because the last term in the product denotes the number of dendrograms one obtains when we extend a dendrogram with n − 1 leaves by adding one more leaf, to yield dendrograms with n leaves.
The number of possible hierarchical clusterings is thus given as (2n − 3)!!, which grows extremely rapidly. It is obvious that a naive approach of enumerating all possible hierarchical clusterings is simply infeasible.
Example 14.2. Figure 14.2 shows the number of trees with one, two, and three leaves. The gray nodes are the virtual roots, and the black dots indicate locations where a new leaf can be added. There is only one tree possible with a single leaf, as shown in Figure 14.2a. It can be extended in only one way to yield the unique tree with two leaves in Figure 14.2b. However, this tree has three possible locations where the third leaf can be added. Each of these cases is shown in Figure 14.2c. We can further see that each of the trees with m = 3 leaves has five locations where the fourth leaf can be added, and so on, which confirms the equation for the number of hierarchical clusterings in Eq. (14.1).
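As a quick check of Eq. (14.1), the double factorial can be computed directly; the small helper below (our own illustration) prints the count for a few values of n and confirms how rapidly it grows.

def num_dendrograms(n):
    """Number of distinct rooted binary dendrograms on n labeled leaves: (2n-3)!!"""
    count = 1
    for m in range(1, n):        # product of (2m - 1) for m = 1, ..., n-1
        count *= 2 * m - 1
    return count

for n in [3, 5, 10, 20]:
    print(n, num_dendrograms(n))
# n = 10 already gives 34,459,425 dendrograms; n = 20 gives about 8.2e21.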
14.2 AGGLOMERATIVE HIERARCHICAL CLUSTERING
In agglomerative hierarchical clustering, we begin with each of the n points in a separate cluster. We repeatedly merge the two closest clusters until all points are members of the same cluster, as shown in the pseudo-code given in Algorithm 14.1. Formally, given a set of clusters C = {C1,C2,..,Cm}, we find the closest pair of clusters Ci and Cj and merge them into a new cluster Cij = Ci ∪ Cj . Next, we update the set of clustersbyremovingCi andCj andaddingCij,asfollowsC=C\{Ci,Cj}∪{Cij}. We repeat the process until C contains only one cluster. Because the number of clusters decreases by one in each step, this process results in a sequence of n nested clusterings. If specified, we can stop the merging process when there are exactly k clusters remaining.
ALGORITHM 14.1. Agglomerative Hierarchical Clustering Algorithm

AGGLOMERATIVECLUSTERING (D, k):
1   C ← {Ci = {xi} | xi ∈ D}                   // Each point in separate cluster
2   Δ ← {δ(xi, xj): xi, xj ∈ D}                // Compute distance matrix
3   repeat
4       Find the closest pair of clusters Ci, Cj ∈ C
5       Cij ← Ci ∪ Cj                          // Merge the clusters
6       C ← (C \ {Ci, Cj}) ∪ {Cij}             // Update the clustering
7       Update distance matrix to reflect new clustering
8   until |C| = k

14.2.1 Distance between Clusters
The main step in the algorithm is to determine the closest pair of clusters. Several distance measures, such as single link, complete link, group average, and others discussed in the following paragraphs, can be used to compute the distance between any two clusters. The between-cluster distances are ultimately based on the distance between two points, which is typically computed using the Euclidean distance or L2-norm, defined as
$$\delta(\mathbf{x},\mathbf{y}) = \|\mathbf{x}-\mathbf{y}\|_2 = \left(\sum_{i=1}^{d}(x_i-y_i)^2\right)^{1/2}$$
However, one may use other distance metrics, or, if available, one may use a user-specified distance matrix.
Single Link
Given two clusters Ci and Cj, the distance between them, denoted δ(Ci, Cj), is defined as the minimum distance between a point in Ci and a point in Cj:
δ(Ci,Cj)=min{δ(x,y)|x∈Ci,y∈Cj}
The name single link comes from the observation that if we choose the minimum distance between points in the two clusters and connect those points, then (typically) only a single link would exist between those clusters because all other pairs of points would be farther away.
Complete Link
The distance between two clusters is defined as the maximum distance between a point in Ci and a point in Cj:
δ(Ci,Cj)=max{δ(x,y)|x∈Ci,y∈Cj}
The name complete link conveys the fact that if we connect all pairs of points from the two clusters with distance at most δ(Ci , Cj ), then all possible pairs would be connected, that is, we get a complete linkage.
Group Average
The distance between two clusters is defined as the average pairwise distance between points in Ci and Cj :
$$\delta(C_i,C_j) = \frac{\sum_{\mathbf{x}\in C_i}\sum_{\mathbf{y}\in C_j}\delta(\mathbf{x},\mathbf{y})}{n_i\cdot n_j}$$
where ni = |Ci | denotes the number of points in cluster Ci .
Mean Distance
The distance between two clusters is defined as the distance between the means or centroids of the two clusters:
$$\delta(C_i,C_j) = \delta(\boldsymbol{\mu}_i,\boldsymbol{\mu}_j) \tag{14.2}$$
where $\boldsymbol{\mu}_i = \frac{1}{n_i}\sum_{\mathbf{x}\in C_i}\mathbf{x}$.
Minimum Variance: Ward’s Method
The distance between two clusters is defined as the increase in the sum of squared errors (SSE) when the two clusters are merged. The SSE for a given cluster Ci is given as
$$SSE_i = \sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$$
which can also be written as
$$SSE_i = \sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 = \sum_{\mathbf{x}\in C_i}\mathbf{x}^T\mathbf{x} - 2\sum_{\mathbf{x}\in C_i}\mathbf{x}^T\boldsymbol{\mu}_i + \sum_{\mathbf{x}\in C_i}\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i = \left(\sum_{\mathbf{x}\in C_i}\mathbf{x}^T\mathbf{x}\right) - n_i\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i \tag{14.3}$$
The SSE for a clustering C = {C1,...,Cm} is given as
$$SSE = \sum_{i=1}^{m} SSE_i = \sum_{i=1}^{m}\sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2$$
Ward's measure defines the distance between two clusters Ci and Cj as the net change in the SSE value when we merge Ci and Cj into Cij, given as
$$\delta(C_i,C_j) = \Delta SSE_{ij} = SSE_{ij} - SSE_i - SSE_j \tag{14.4}$$
We can obtain a simpler expression for Ward's measure by plugging Eq. (14.3) into Eq. (14.4), and noting that because Cij = Ci ∪ Cj and Ci ∩ Cj = ∅, we have |Cij| = nij = ni + nj, and therefore
$$\delta(C_i,C_j) = \Delta SSE_{ij} = \sum_{\mathbf{z}\in C_{ij}}\|\mathbf{z}-\boldsymbol{\mu}_{ij}\|^2 - \sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 - \sum_{\mathbf{y}\in C_j}\|\mathbf{y}-\boldsymbol{\mu}_j\|^2 = \sum_{\mathbf{z}\in C_{ij}}\mathbf{z}^T\mathbf{z} - n_{ij}\,\boldsymbol{\mu}_{ij}^T\boldsymbol{\mu}_{ij} - \sum_{\mathbf{x}\in C_i}\mathbf{x}^T\mathbf{x} + n_i\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i - \sum_{\mathbf{y}\in C_j}\mathbf{y}^T\mathbf{y} + n_j\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j = n_i\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + n_j\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j - (n_i+n_j)\,\boldsymbol{\mu}_{ij}^T\boldsymbol{\mu}_{ij} \tag{14.5}$$
The last step follows from the fact that $\sum_{\mathbf{z}\in C_{ij}}\mathbf{z}^T\mathbf{z} = \sum_{\mathbf{x}\in C_i}\mathbf{x}^T\mathbf{x} + \sum_{\mathbf{y}\in C_j}\mathbf{y}^T\mathbf{y}$. Noting that
$$\boldsymbol{\mu}_{ij} = \frac{n_i\,\boldsymbol{\mu}_i + n_j\,\boldsymbol{\mu}_j}{n_i+n_j}$$
we obtain
$$\boldsymbol{\mu}_{ij}^T\boldsymbol{\mu}_{ij} = \frac{1}{(n_i+n_j)^2}\Big(n_i^2\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + 2\,n_i n_j\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j + n_j^2\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j\Big)$$
Plugging the above into Eq. (14.5), we finally obtain
$$\delta(C_i,C_j) = \Delta SSE_{ij} = n_i\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + n_j\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j - \frac{1}{n_i+n_j}\Big(n_i^2\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i + 2\,n_i n_j\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j + n_j^2\,\boldsymbol{\mu}_j^T\boldsymbol{\mu}_j\Big) = \frac{n_i n_j\big(\boldsymbol{\mu}_i^T\boldsymbol{\mu}_i - 2\,\boldsymbol{\mu}_i^T\boldsymbol{\mu}_j + \boldsymbol{\mu}_j^T\boldsymbol{\mu}_j\big)}{n_i+n_j} = \frac{n_i\,n_j\,\|\boldsymbol{\mu}_i-\boldsymbol{\mu}_j\|^2}{n_i+n_j}$$
Ward's measure is therefore a weighted version of the mean distance measure because if we use Euclidean distance, the mean distance in Eq. (14.2) can be rewritten as
$$\delta(\boldsymbol{\mu}_i,\boldsymbol{\mu}_j) = \|\boldsymbol{\mu}_i-\boldsymbol{\mu}_j\|^2 \tag{14.6}$$
We can see that the only difference is that Ward's measure weights the distance between the means by half of the harmonic mean of the cluster sizes, where the harmonic mean of two numbers n1 and n2 is given as $\frac{2}{\frac{1}{n_1}+\frac{1}{n_2}} = \frac{2\,n_1 n_2}{n_1+n_2}$.
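To make the final expression concrete, the following small Python helper (our own illustration, not from the text) computes Ward's distance directly from the cluster points and checks that it equals the increase in SSE from Eq. (14.4).

import numpy as np

def sse(points):
    """Sum of squared errors of a cluster around its mean."""
    mu = points.mean(axis=0)
    return ((points - mu) ** 2).sum()

def ward_distance(Ci, Cj):
    """Ward's measure: n_i n_j / (n_i + n_j) * ||mu_i - mu_j||^2."""
    ni, nj = len(Ci), len(Cj)
    diff = Ci.mean(axis=0) - Cj.mean(axis=0)
    return ni * nj / (ni + nj) * (diff @ diff)

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 3.0], [5.0, 3.0], [6.0, 3.0]])
delta_sse = sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)
assert np.isclose(ward_distance(Ci, Cj), delta_sse)   # both give the same value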
Example 14.3 (Single Link). Consider the single link clustering shown in Figure 14.3 on a dataset of five points, whose pairwise distances are also shown on the bottom left. Initially, all points are in their own cluster. The closest pair of points are (A, B) and (C, D), both with δ = 1. We choose to first merge A and B, and derive a new distance matrix for the merged cluster. Essentially, we have to compute the distances of the new cluster AB to all other clusters. For example, δ(AB, E) = 3 because δ(AB, E) = min{δ(A, E), δ(B, E)} = min{4, 3} = 3. In the next step we merge C and D because they are the closest clusters, and we obtain a new distance matrix for the resulting set of clusters. After this, AB and CD are merged, and finally, E is merged with ABCD. In the distance matrices, we have shown (circled) the minimum distance used at each iteration that results in a merging of the two closest pairs of clusters.

Figure 14.3. Single link agglomerative clustering. The distance matrices at each step are:

Initial distances:
δ    B   C   D   E
A    1   3   2   4
B        3   2   3
C            1   3
D                5

After merging A and B:
δ     C   D   E
AB    3   2   3
C         1   3
D             5

After merging C and D:
δ      CD   E
AB     2    3
CD          3

After merging AB and CD:
δ        E
ABCD     3
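The merge sequence of Example 14.3 can be reproduced in a few lines of Python. The sketch below is a naive single-link implementation over a precomputed distance dictionary (our own illustration; it favors clarity over the heap-based bookkeeping discussed in Section 14.2.3).

import itertools

# Pairwise distances from Example 14.3
dist = {frozenset(p): d for p, d in [
    (("A", "B"), 1), (("A", "C"), 3), (("A", "D"), 2), (("A", "E"), 4),
    (("B", "C"), 3), (("B", "D"), 2), (("B", "E"), 3),
    (("C", "D"), 1), (("C", "E"), 3), (("D", "E"), 5),
]}

def single_link(Ci, Cj):
    """Single-link distance: minimum pairwise distance across the two clusters."""
    return min(dist[frozenset((x, y))] for x in Ci for y in Cj)

clusters = [frozenset(p) for p in "ABCDE"]
while len(clusters) > 1:
    # find the closest pair of clusters under the single-link measure
    # (ties are broken by pair order, so A and B are merged before C and D)
    Ci, Cj = min(itertools.combinations(clusters, 2), key=lambda p: single_link(*p))
    print(f"merge {set(Ci)} and {set(Cj)} at distance {single_link(Ci, Cj)}")
    clusters = [C for C in clusters if C not in (Ci, Cj)] + [Ci | Cj]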
14.2.2 Updating Distance Matrix
Whenever two clusters Ci and Cj are merged into Cij, we need to update the distance matrix by recomputing the distances from the newly created cluster Cij to all other clusters Cr (r ≠ i and r ≠ j). The Lance–Williams formula provides a general equation to recompute the distances for all of the cluster proximity measures we considered earlier; it is given as
$$\delta(C_{ij},C_r) = \alpha_i\cdot\delta(C_i,C_r) + \alpha_j\cdot\delta(C_j,C_r) + \beta\cdot\delta(C_i,C_j) + \gamma\cdot\big|\delta(C_i,C_r)-\delta(C_j,C_r)\big| \tag{14.7}$$
The coefficients αi, αj, β, and γ differ from one measure to another. Let ni = |Ci| denote the cardinality of cluster Ci; then the coefficients for the different distance measures are as shown in Table 14.1.

Table 14.1. Lance–Williams formula for cluster proximity

Measure            αi                      αj                      β                      γ
Single link        1/2                     1/2                     0                      −1/2
Complete link      1/2                     1/2                     0                      1/2
Group average      ni/(ni+nj)              nj/(ni+nj)              0                      0
Mean distance      ni/(ni+nj)              nj/(ni+nj)              −(ni·nj)/(ni+nj)²      0
Ward's measure     (ni+nr)/(ni+nj+nr)      (nj+nr)/(ni+nj+nr)      −nr/(ni+nj+nr)         0

Figure 14.4. Iris dataset: complete link.
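The update in Eq. (14.7) is straightforward to express in code. The following sketch (our own illustration) computes the distance from a merged cluster to another cluster, using the single-link coefficients from Table 14.1 as an example.

def lance_williams(d_ir, d_jr, d_ij, alpha_i, alpha_j, beta, gamma):
    """Eq. (14.7): distance from the merged cluster C_ij to another cluster C_r."""
    return alpha_i * d_ir + alpha_j * d_jr + beta * d_ij + gamma * abs(d_ir - d_jr)

# Single link: alpha_i = alpha_j = 1/2, beta = 0, gamma = -1/2,
# which reduces to min(d_ir, d_jr).
d_ir, d_jr, d_ij = 3.0, 2.0, 1.0
print(lance_williams(d_ir, d_jr, d_ij, 0.5, 0.5, 0.0, -0.5))   # 2.0 == min(3.0, 2.0)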
Example 14.4. Consider the two-dimensional Iris principal components dataset shown in Figure 14.4, which also illustrates the results of hierarchical clustering using the complete-link method, with k = 3 clusters. Table 14.2 shows the contingency table comparing the clustering results with the ground-truth Iris types (which are not used in clustering). We can observe that 15 points are misclustered in total; these points are shown in white in Figure 14.4. Whereas iris-setosa is well separated, the other two Iris types are harder to separate.
Table 14.2. Contingency table: clusters versus Iris types

                  iris-setosa   iris-virginica   iris-versicolor
C1 (circle)       50            0                0
C2 (triangle)     0             1                36
C3 (square)       0             49               14

14.2.3 Computational Complexity

In agglomerative clustering, we need to compute the distance of each cluster to all other clusters, and at each step the number of clusters decreases by 1. Initially it takes O(n²) time to create the pairwise distance matrix, unless it is specified as an input to the algorithm.
At each merge step, the distances from the merged cluster to the other clusters have to be recomputed, whereas the distances between the other clusters remain the same. This means that in step t, we compute O(n − t) distances. The other main operation is to find the closest pair in the distance matrix. For this we can keep the n² distances in a heap data structure, which allows us to find the minimum distance in O(1) time; creating the heap takes O(n²) time. Deleting/updating distances from the heap takes O(log n) time for each operation, for a total time across all merge steps of O(n² log n). Thus, the computational complexity of hierarchical clustering is O(n² log n).
14.3 FURTHER READING
Hierarchical clustering has a long history, especially in taxonomy or classificatory systems, and phylogenetics; see, for example, Sokal and Sneath (1963). The generic Lance–Williams formula for distance updates appears in Lance and Williams (1967). Ward’s measure is from Ward (1963). Efficient methods for single-link and complete-link measures with O(n2) complexity are given in Sibson (1973) and Defays (1977), respectively. For a good discussion of hierarchical clustering, and clustering in general, see Jain and Dubes (1988).
Defays, D. (1977). An efficient algorithm for a complete link method. Computer Journal, 20 (4): 364–366.
Jain, A. K. and Dubes, R. C. (1988). Algorithms for clustering data. Upper Saddle River, NJ: Prentice-Hall.
Lance, G. N. and Williams, W. T. (1967). A general theory of classificatory sorting strategies 1. Hierarchical systems. The Computer Journal, 9 (4): 373–380.
Sibson, R. (1973). SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method. Computer Journal, 16 (1): 30–34.
Sokal, R. R. and Sneath, P. H. (1963). The Principles of Numerical Taxonomy. San Francisco: W.H. Freeman.
Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association, 58 (301): 236–244.
14.4 EXERCISES
Q1. Consider the 5-dimensional categorical data shown in Table 14.3.

Table 14.3. Data for Q1

Point    X1   X2   X3   X4   X5
x1       1    0    1    1    0
x2       1    1    0    1    0
x3       0    0    1    1    0
x4       0    1    0    1    0
x5       1    0    1    0    1
x6       0    1    1    0    0
The similarity between categorical data points can be computed in terms of the number of matches and mismatches for the different attributes. Let n11 be the number of attributes on which two points xi and xj assume the value 1, and let n10 denote the number of attributes where xi takes value 1, but xj takes on the value of 0. Define n01 and n00 in a similar manner. The contingency table for measuring the similarity is then given as

              xj = 1   xj = 0
xi = 1        n11      n10
xi = 0        n01      n00

Define the following similarity measures:
• Simple matching coefficient: SMC(xi, xj) = (n11 + n00)/(n11 + n10 + n01 + n00)
• Jaccard coefficient: JC(xi, xj) = n11/(n11 + n10 + n01)
• Rao's coefficient: RC(xi, xj) = n11/(n11 + n10 + n01 + n00)

Find the cluster dendrograms produced by the hierarchical clustering algorithm under the following scenarios:
(a) We use single link with RC.
(b) We use complete link with SMC.
(c) We use group average with JC.

Q2. Given the dataset in Figure 14.5, show the dendrogram resulting from the single-link hierarchical agglomerative clustering approach using the L1-norm as the distance between points
$$\delta(\mathbf{x},\mathbf{y}) = \sum_{a=1}^{2}|x_{ia}-y_{ia}|$$
Whenever there is a choice, merge the cluster that has the lexicographically smallest labeled point. Show the cluster merge order in the tree, stopping when you have k = 4 clusters. Show the full distance matrix at each step.
Figure 14.5. Dataset for Q2.
Table 14.4. Dataset for Q3

     A   B   C   D   E
A    0   1   3   2   4
B        0   3   2   3
C            0   1   3
D                0   5
E                    0
Q3. Using the distance matrix from Table 14.4, use the average link method to generate hierarchical clusters. Show the merging distance thresholds.
Q4. Prove that in the Lance–Williams formula [Eq. (14.7)]
(a) If αi = ni/(ni+nj), αj = nj/(ni+nj), β = 0 and γ = 0, then we obtain the group average measure.
(b) If αi = (ni+nr)/(ni+nj+nr), αj = (nj+nr)/(ni+nj+nr), β = −nr/(ni+nj+nr) and γ = 0, then we obtain Ward's measure.

Q5. If we treat each point as a vertex, and add edges between two nodes with distance less than some threshold value, then the single-link method corresponds to a well known graph algorithm. Describe this graph-based algorithm to hierarchically cluster the nodes via the single-link measure, using successively higher distance thresholds.
CHAPTER 15 Density-based Clustering
The representative-based clustering methods like K-means and expectation- maximization are suitable for finding ellipsoid-shaped clusters, or at best convex clusters. However, for nonconvex clusters, such as those shown in Figure 15.1, these methods have trouble finding the true clusters, as two points from different clusters may be closer than two points in the same cluster. The density-based methods we consider in this chapter are able to mine such nonconvex clusters.
15.1 THE DBSCAN ALGORITHM
Density-based clustering uses the local density of points to determine the clusters, rather than using only the distance between points. We define a ball of radius ǫ around a point x ∈ Rd , called the ǫ-neighborhood of x, as follows:
Nǫ(x)=Bd(x,ǫ)={y| δ(x,y)≤ǫ}
Here δ(x,y) represents the distance between points x and y, which is usually assumed to be the Euclidean distance, that is, δ(x,y) = ∥x−y∥2. However, other distance metrics can also be used.
For any point x ∈ D, we say that x is a core point if there are at least minpts points in its ǫ-neighborhood. In other words, x is a core point if |Nǫ(x)| ≥ minpts, where minpts is a user-defined local density or frequency threshold. A border point is defined as a point that does not meet the minpts threshold, that is, it has |Nǫ(x)| < minpts, but it belongs to the ǫ-neighborhood of some core point z, that is, x ∈ Nǫ(z). Finally, if a point is neither a core nor a border point, then it is called a noise point or an outlier.
We say that a point x is directly density reachable from another point y if x ∈ Nǫ(y) and y is a core point. We say that x is density reachable from y if there exists a chain of points, x0, x1, ..., xl, such that x = x0 and y = xl, and xi is directly density reachable from xi−1 for all i = 1,...,l. In other words, there is a set of core points leading from y to x. Note that density reachability is an asymmetric or directed relationship. Define any two points x and y to be density connected if there exists a core point z, such that both x and y are density reachable from z. A density-based cluster is defined as a maximal set of density connected points.

Figure 15.1. Density-based dataset.

Figure 15.2. (a) Neighborhood of a point. (b) Core, border, and noise points.

Example 15.1. Figure 15.2a shows the ǫ-neighborhood of the point x, using the Euclidean distance metric. Figure 15.2b shows the three different types of points, using minpts = 6. Here x is a core point because |Nǫ(x)| = 6, y is a border point because |Nǫ(y)| < minpts, but it belongs to the ǫ-neighborhood of the core point x, i.e., y ∈ Nǫ(x). Finally, z is a noise point.
The pseudo-code for the DBSCAN density-based clustering method is shown in Algorithm 15.1. First, DBSCAN computes the ǫ-neighborhood Nǫ(xi) for each point xi in the dataset D, and checks if it is a core point (lines 2–5). It also sets the cluster id id(xi) = ∅ for all points, indicating that they are not assigned to any cluster. Next, starting from each unassigned core point, the method recursively finds all its density connected points, which are assigned to the same cluster (line 10). Some border points may be reachable from core points in more than one cluster; they may either be arbitrarily assigned to one of the clusters or to all of them (if overlapping clusters are allowed). Those points that do not belong to any cluster are treated as outliers or noise.

ALGORITHM 15.1. Density-based Clustering Algorithm

DBSCAN (D, ǫ, minpts):
1    Core ← ∅
2    foreach xi ∈ D do                        // Find the core points
3        Compute Nǫ(xi)
4        id(xi) ← ∅                            // cluster id for xi
5        if |Nǫ(xi)| ≥ minpts then Core ← Core ∪ {xi}
6    k ← 0                                     // cluster id
7    foreach xi ∈ Core, such that id(xi) = ∅ do
8        k ← k + 1
9        id(xi) ← k                            // assign xi to cluster id k
10       DENSITYCONNECTED (xi, k)
11   C ← {Ci}_{i=1}^{k}, where Ci ← {x ∈ D | id(x) = i}
12   Noise ← {x ∈ D | id(x) = ∅}
13   Border ← D \ {Core ∪ Noise}
14   return C, Core, Border, Noise

DENSITYCONNECTED (x, k):
15   foreach y ∈ Nǫ(x) do
16       id(y) ← k                             // assign y to cluster id k
17       if y ∈ Core then DENSITYCONNECTED (y, k)
DBSCAN can also be considered as a search for the connected components in a graph where the vertices correspond to the core points in the dataset, and there exists an (undirected) edge between two vertices (core points) if the distance between them is less than ǫ, that is, each of them is in the ǫ-neighborhood of the other point. The connected components of this graph correspond to the core points of each cluster. Next, each core point incorporates into its cluster any border points in its neighborhood.
One limitation of DBSCAN is that it is sensitive to the choice of ǫ, in particular if clusters have different densities. If ǫ is too small, sparser clusters will be categorized as noise. If ǫ is too large, denser clusters may be merged together. In other words, if there are clusters with different local densities, then a single ǫ value may not suffice.
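The following Python sketch mirrors the structure of Algorithm 15.1 (our own illustrative implementation, not the book's code). It uses a brute-force O(n²) neighborhood computation and an explicit stack instead of recursion, so that large clusters do not exhaust the recursion limit; border points are assigned to whichever cluster reaches them first.

import numpy as np

def dbscan(D, eps, minpts):
    """Density-based clustering in the spirit of Algorithm 15.1.
    Returns cluster ids per point (0 means noise/unassigned) and the core set."""
    n = len(D)
    # eps-neighborhoods via brute-force pairwise Euclidean distances
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= minpts}
    ids = np.zeros(n, dtype=int)     # cluster id per point; 0 = unassigned
    k = 0
    for i in core:
        if ids[i] != 0:
            continue
        k += 1
        ids[i] = k
        stack = [i]
        while stack:                 # expand density-connected points
            x = stack.pop()
            for y in neighbors[x]:
                if ids[y] == 0:
                    ids[y] = k
                    if y in core:    # only core points propagate the cluster
                        stack.append(y)
    return ids, core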
Example 15.2. Figure 15.3 shows the clusters discovered by DBSCAN on the density-based dataset in Figure 15.1. For the parameter values ǫ = 15 and minpts = 10, found after parameter tuning, DBSCAN yields a near-perfect clustering comprising all nine clusters. Clusters are shown using different symbols and shading; noise points are shown as plus symbols.
Figure 15.3. Density-based clusters.

Figure 15.4. DBSCAN clustering: Iris dataset: (a) ǫ = 0.2, minpts = 5; (b) ǫ = 0.36, minpts = 3.
Example 15.3. Figure 15.4 shows the clusterings obtained via DBSCAN on the two-dimensional Iris dataset (over sepal length and sepal width attributes) for two different parameter settings. Figure 15.4a shows the clusters obtained with radius ǫ = 0.2 and core threshold minpts = 5. The three clusters are plotted using different shaped points, namely circles, squares, and triangles. Shaded points are core points, whereas the border points for each cluster are shown unshaded (white). Noise points are shown as plus symbols. Figure 15.4b shows the clusters obtained with a larger value of radius ǫ = 0.36, with minpts = 3. Two clusters are found, corresponding to the two dense regions of points.
For this dataset tuning the parameters is not that easy, and DBSCAN is not very effective in discovering the three Iris classes. For instance it identifies too many points (47 of them) as noise in Figure 15.4a. However, DBSCAN is able to find the two main dense sets of points, distinguishing iris-setosa (in triangles) from the other types of Irises, in Figure 15.4b. Increasing the radius more than ǫ = 0.36 collapses all points into a single large cluster.
Computational Complexity
The main cost in DBSCAN is for computing the ǫ-neighborhood for each point. If the dimensionality is not too high this can be done efficiently using a spatial index structure in O(nlogn) time. When dimensionality is high, it takes O(n2) to compute the neighborhood for each point. Once Nǫ (x) has been computed the algorithm needs only a single pass over all the points to find the density connected clusters. Thus, the overall complexity of DBSCAN is O(n2) in the worst-case.
15.2 KERNEL DENSITY ESTIMATION
There is a close connection between density-based clustering and density estimation. The goal of density estimation is to determine the unknown probability density function by finding the dense regions of points, which can in turn be used for clustering. Kernel density estimation is a nonparametric technique that does not assume any fixed probability model of the clusters, as in the case of K-means or the mixture model assumed in the EM algorithm. Instead, it tries to directly infer the underlying probability density at each point in the dataset.
15.2.1 Univariate Density Estimation
Assume that X is a continuous random variable, and let x1,x2,...,xn be a random sample drawn from the underlying probability density function f (x), which is assumed to be unknown. We can directly estimate the cumulative distribution function from the data by counting how many points are less than or equal to x:
$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)$$
where I is an indicator function that has value 1 only when its argument is true, and 0 otherwise. We can estimate the density function by taking the derivative of F̂(x), by considering a window of small width h centered at x, that is,
$$\hat{f}(x) = \frac{\hat{F}\big(x+\tfrac{h}{2}\big) - \hat{F}\big(x-\tfrac{h}{2}\big)}{h} = \frac{k/n}{h} = \frac{k}{nh} \tag{15.1}$$
where k is the number of points that lie in the window of width h centered at x, that is, within the closed interval [x − h/2, x + h/2]. Thus, the density estimate is the ratio of the fraction of the points in the window (k/n) to the volume of the window (h). Here h plays the role of "influence." That is, a large h estimates the probability density over a large window by considering many points, which has the effect of smoothing the estimate. On the other hand, if h is small, then only the points in close proximity to x are considered. In general we want a small value of h, but not too small, as in that case no points will fall in the window and we will not be able to get an accurate estimate of the probability density.
Kernel Estimator
Kernel density estimation relies on a kernel function K that is non-negative, symmetric, and integrates to 1, that is, K(x) ≥ 0, K(−x) = K(x) for all values x, and $\int K(x)\,dx = 1$. Thus, K is essentially a probability density function. Note that K should not be confused with the positive semidefinite kernel mentioned in Chapter 5.
Discrete Kernel  The density estimate f̂(x) from Eq. (15.1) can also be rewritten in terms of the kernel function as follows:
$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\Big(\frac{x-x_i}{h}\Big)$$
where the discrete kernel function K computes the number of points in a window of width h, and is defined as
$$K(z) = \begin{cases} 1 & \text{if } |z| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \tag{15.2}$$
We can see that if $|z| = \left|\frac{x-x_i}{h}\right| \le \frac{1}{2}$, then the point xi is within a window of width h centered at x, as
$$\left|\frac{x-x_i}{h}\right| \le \frac{1}{2} \;\text{ implies that }\; -\frac{1}{2} \le \frac{x_i-x}{h} \le \frac{1}{2}, \;\text{ or }\; -\frac{h}{2} \le x_i - x \le \frac{h}{2}, \;\text{ and finally }\; x-\frac{h}{2} \le x_i \le x+\frac{h}{2}$$
Example 15.4. Figure 15.5 shows the kernel density estimates using the discrete kernel for different values of the influence parameter h, for the one-dimensional Iris dataset comprising the sepal length attribute. The x-axis plots the n = 150 data points. Because several points have the same value, they are shown stacked, where the stack height corresponds to the frequency of that value.
When h is small, as shown in Figure 15.5a, the density function has many local maxima or modes. However, as we increase h from 0.25 to 2, the number of modes decreases, until h becomes large enough to yield a unimodal distribution, as shown in Figure 15.5d. We can observe that the discrete kernel yields a non-smooth (or jagged) density function.
Gaussian Kernel The width h is a parameter that denotes the spread or smoothness
of the density estimate. If the spread is too large we get a more averaged value. If it is
too small we do not have enough points in the window. Further, the kernel function in
Eq. (15.2) has an abrupt influence. For points within the window (|z| ≤ 1/2) there is a net contribution of 1/(hn) to the probability estimate f̂(x). On the other hand, points outside the window (|z| > 1/2) contribute 0.
Figure 15.5. Kernel density estimation: discrete kernel (varying h): (a) h = 0.25, (b) h = 0.5, (c) h = 1.0, (d) h = 2.0.
Instead of the discrete kernel, we can define a more smooth transition of influence via a Gaussian kernel:
$K(z) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{z^2}{2}\right\}$

Thus, we have

$K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(x - x_i)^2}{2h^2}\right\}$
Here x, which is at the center of the window, plays the role of the mean, and h acts as the standard deviation.
Example 15.5. Figure 15.6 shows the univariate density function for the 1-dimensional Iris dataset (over sepal length) using the Gaussian kernel. Plots are shown for increasing values of the spread parameter h. The data points are shown stacked along the x-axis, with the heights corresponding to the value frequencies.
As h varies from 0.1 to 0.5, we can see the smoothing effect of increasing h on the density function. For instance, for h = 0.1 there are many local maxima, whereas for h = 0.5 there is only one density peak. Compared to the discrete kernel case shown in Figure 15.5, we can clearly see that the Gaussian kernel yields much smoother estimates, without discontinuities.
Figure 15.6. Kernel density estimation: Gaussian kernel (varying h): (a) h = 0.1, (b) h = 0.15, (c) h = 0.25, (d) h = 0.5.
15.2.2 Multivariate Density Estimation
To estimate the probability density at a d-dimensional point x = (x1,x2,…,xd)T, we define the d-dimensional “window” as a hypercube in d dimensions, that is, a hypercube centered at x with edge length h. The volume of such a d-dimensional hypercube is given as
$\mathrm{vol}(H_d(h)) = h^d$
The density is then estimated as the fraction of the point weight lying within the d-dimensional window centered at x, divided by the volume of the hypercube:

$\hat{f}(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) \qquad (15.3)$

where the multivariate kernel function K satisfies the condition $\int K(\mathbf{z})\,d\mathbf{z} = 1$.

Discrete Kernel  For any d-dimensional vector $\mathbf{z} = (z_1, z_2, \ldots, z_d)^T$, the discrete kernel function in d dimensions is given as

$K(\mathbf{z}) = \begin{cases} 1 & \text{if } |z_j| \le \frac{1}{2}, \text{ for all dimensions } j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$
Figure 15.7. Density estimation: 2D Iris dataset (varying h): (a) h = 0.1, (b) h = 0.2, (c) h = 0.35, (d) h = 0.6.
For $\mathbf{z} = \frac{\mathbf{x} - \mathbf{x}_i}{h}$, we see that the kernel computes the number of points within the hypercube centered at $\mathbf{x}$ because $K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = 1$ if and only if $\left|\frac{x_j - x_{ij}}{h}\right| \le \frac{1}{2}$ for all dimensions j. Each point within the hypercube thus contributes a weight of $\frac{1}{n}$ to the density estimate.
Gaussian Kernel  The d-dimensional Gaussian kernel is given as

$K(\mathbf{z}) = \frac{1}{(2\pi)^{d/2}}\exp\left\{-\frac{\mathbf{z}^T\mathbf{z}}{2}\right\} \qquad (15.4)$

where we assume that the covariance matrix is the d × d identity matrix, that is, $\Sigma = \mathbf{I}_d$. Plugging $\mathbf{z} = \frac{\mathbf{x} - \mathbf{x}_i}{h}$ in Eq. (15.4), we have

$K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = \frac{1}{(2\pi)^{d/2}}\exp\left\{-\frac{(\mathbf{x} - \mathbf{x}_i)^T(\mathbf{x} - \mathbf{x}_i)}{2h^2}\right\}$
Each point contributes a weight to the density estimate inversely proportional to its distance from x tempered by the width parameter h.
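A minimal NumPy sketch of the multivariate estimate of Eq. (15.3) with the Gaussian kernel of Eq. (15.4); the data points shown are made up for illustration.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate at point x [Eq. (15.3)] using the
    d-dimensional Gaussian kernel of Eq. (15.4) with width h."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    z = (x - data) / h                                   # each row is (x - xi)/h
    k = np.exp(-0.5 * np.sum(z ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * h ** d)

# usage on made-up 2D points
pts = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [5.8, 2.7]])
print(gaussian_kde(np.array([5.0, 3.2]), pts, h=0.5))
```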
Example 15.6. Figure 15.7 shows the probability density function for the 2D Iris dataset comprising the sepal length and sepal width attributes, using the Gaussian kernel. As expected, for small values of h the density function has several local maxima, whereas for larger values the number of maxima reduce, and ultimately for a large enough value we obtain a unimodal distribution.
Figure 15.8. Density estimation: density-based dataset (axes X1, X2).
Example 15.7. Figure 15.8 shows the kernel density estimate for the density-based dataset in Figure 15.1, using a Gaussian kernel with h = 20. One can clearly discern that the density peaks closely correspond to regions with higher density of points.
15.2.3 Nearest Neighbor Density Estimation
In the preceding density estimation formulation we implicitly fixed the volume by fixing the width h, and we used the kernel function to find out the number or weight of points that lie inside the fixed volume region. An alternative approach to density estimation is to fix k, the number of points required to estimate the density, and allow the volume of the enclosing region to vary to accommodate those k points. This approach is called the k nearest neighbors (KNN) approach to density estimation. Like kernel density estimation, KNN density estimation is also a nonparametric approach.
Given k, the number of neighbors, we estimate the density at x as follows:

$\hat{f}(\mathbf{x}) = \frac{k}{n\,\mathrm{vol}(S_d(h_\mathbf{x}))}$

where $h_\mathbf{x}$ is the distance from $\mathbf{x}$ to its kth nearest neighbor, and $\mathrm{vol}(S_d(h_\mathbf{x}))$ is the volume of the d-dimensional hypersphere $S_d(h_\mathbf{x})$ centered at $\mathbf{x}$, with radius $h_\mathbf{x}$ [Eq. (6.4)]. In other words, the width (or radius) $h_\mathbf{x}$ is now a variable, which depends on $\mathbf{x}$ and the chosen value k.
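A small sketch of the KNN estimate, assuming NumPy; the hypersphere volume uses the standard formula $\pi^{d/2}/\Gamma(d/2+1)\cdot r^d$, which corresponds to Eq. (6.4).

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """k-nearest-neighbor density estimate: k / (n * vol(S_d(h_x))), where
    h_x is the distance from x to its kth nearest neighbor (a point counts
    as its own nearest neighbor if it belongs to the data)."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    h_x = dists[k - 1]                                    # distance to the kth neighbor
    vol = (pi ** (d / 2) / gamma(d / 2 + 1)) * h_x ** d   # hypersphere volume
    return k / (n * vol)
```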
15.3 DENSITY-BASED CLUSTERING: DENCLUE
Having laid the foundations of kernel density estimation, we can develop a general formulation of density-based clustering. The basic approach is to find the peaks in the density landscape via gradient-based optimization, and find the regions with density above a given threshold.
Density Attractors and Gradient
A point x∗ is called a density attractor if it is a local maximum of the probability density function f. A density attractor can be found via a gradient ascent approach starting at some point x. The idea is to compute the density gradient, the direction of the largest increase in the density, and to move in the direction of the gradient in small steps, until we reach a local maximum.
The gradient at a point x can be computed as the multivariate derivative of the probability density estimate in Eq. (15.3), given as
$\nabla\hat{f}(\mathbf{x}) = \frac{\partial\hat{f}(\mathbf{x})}{\partial\mathbf{x}} = \frac{1}{nh^d}\sum_{i=1}^{n}\frac{\partial}{\partial\mathbf{x}} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) \qquad (15.5)$

For the Gaussian kernel [Eq. (15.4)], we have

$\frac{\partial}{\partial\mathbf{x}} K(\mathbf{z}) = \frac{1}{(2\pi)^{d/2}}\exp\left\{-\frac{\mathbf{z}^T\mathbf{z}}{2}\right\}\cdot(-\mathbf{z})\cdot\frac{\partial\mathbf{z}}{\partial\mathbf{x}} = K(\mathbf{z})\cdot(-\mathbf{z})\cdot\frac{\partial\mathbf{z}}{\partial\mathbf{x}}$

Setting $\mathbf{z} = \frac{\mathbf{x} - \mathbf{x}_i}{h}$ above, we get

$\frac{\partial}{\partial\mathbf{x}} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\cdot\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)\cdot\frac{1}{h}$

which follows from the fact that $\frac{\partial}{\partial\mathbf{x}}\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = \frac{1}{h}$. Substituting the above in Eq. (15.5), the gradient at a point x is given as

$\nabla\hat{f}(\mathbf{x}) = \frac{1}{nh^{d+2}}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\cdot(\mathbf{x}_i - \mathbf{x}) \qquad (15.6)$

This equation can be thought of as having two parts: a vector $(\mathbf{x}_i - \mathbf{x})$ and a scalar influence value $K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$. For each point $\mathbf{x}_i$, we first compute the direction away from x, that is, the vector $(\mathbf{x}_i - \mathbf{x})$. Next, we scale it using the Gaussian kernel value as the weight $K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$. Finally, the vector $\nabla\hat{f}(\mathbf{x})$ is the net influence at x, as illustrated in Figure 15.9, that is, the weighted sum of the difference vectors.
We say that x∗ is a density attractor for x, or alternatively that x is density attracted to
x∗, if a hill climbing process started at x converges to x∗. That is, there exists a sequence of points $\mathbf{x} = \mathbf{x}_0 \to \mathbf{x}_1 \to \cdots \to \mathbf{x}_m$, starting from x and ending at $\mathbf{x}_m$, such that $\|\mathbf{x}_m - \mathbf{x}^*\| \le \varepsilon$, that is, $\mathbf{x}_m$ converges to the attractor x∗.
The typical approach is to use the gradient-ascent method to compute x∗, that is, starting from x, we iteratively update it at each step t via the update rule:
$\mathbf{x}_{t+1} = \mathbf{x}_t + \delta\cdot\nabla\hat{f}(\mathbf{x}_t)$
Figure 15.9. The gradient vector ∇f̂(x) (shown in thick black) obtained as the sum of the difference vectors xi − x (shown in gray).
where δ > 0 is the step size. That is, each intermediate point is obtained after a small move in the direction of the gradient vector. However, the gradient-ascent approach can be slow to converge. Instead, one can directly optimize the move direction by setting the gradient [Eq. (15.6)] to the zero vector:
$\nabla\hat{f}(\mathbf{x}) = \mathbf{0}$, which implies that

$\frac{1}{nh^{d+2}}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\cdot(\mathbf{x}_i - \mathbf{x}) = \mathbf{0}$

$\mathbf{x}\cdot\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = \sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\mathbf{x}_i$

$\mathbf{x} = \frac{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)}$

The point x is involved on both the left- and right-hand sides above; however, it can be used to obtain the following iterative update rule:

$\mathbf{x}_{t+1} = \frac{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)} \qquad (15.7)$
where t denotes the current iteration and xt+1 is the updated value for the current vector xt. This direct update rule is essentially a weighted average of the influence (computed via the kernel function K) of each point xi ∈ D on the current point xt . The direct update rule results in much faster convergence of the hill-climbing process.
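A minimal NumPy sketch of the hill climbing with the direct update rule of Eq. (15.7); the Gaussian kernel's normalizing constant cancels in the ratio, so unnormalized weights are used. Function and parameter names are my own.

```python
import numpy as np

def find_attractor(x, data, h, eps=1e-5, max_iter=100):
    """Hill climbing via the direct update rule of Eq. (15.7): each step moves
    x_t to the kernel-weighted mean of all points (a mean-shift step)."""
    data = np.asarray(data, dtype=float)
    x_t = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        # Gaussian kernel weights K((x_t - xi)/h), up to a constant factor
        w = np.exp(-0.5 * np.sum(((x_t - data) / h) ** 2, axis=1))
        x_next = (w[:, None] * data).sum(axis=0) / w.sum()
        if np.linalg.norm(x_next - x_t) <= eps:
            return x_next
        x_t = x_next
    return x_t
```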
Center-defined Cluster
A cluster C ⊆ D is called a center-defined cluster if all the points x ∈ C are density attracted to a unique density attractor x∗, such that $\hat{f}(\mathbf{x}^*) \ge \xi$, where ξ is a user-defined minimum density threshold. In other words,

$\hat{f}(\mathbf{x}^*) = \frac{1}{nh^d}\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}^* - \mathbf{x}_i}{h}\right) \ge \xi$

Density-based Cluster
An arbitrary-shaped cluster C ⊆ D is called a density-based cluster if there exists a set of density attractors x∗1, x∗2, ..., x∗m, such that
1. Each point x ∈ C is attracted to some attractor x∗i.
2. Each density attractor has density above ξ. That is, $\hat{f}(\mathbf{x}^*_i) \ge \xi$.
3. Any two density attractors x∗i and x∗j are density reachable, that is, there exists a path from x∗i to x∗j, such that for all points y on the path, $\hat{f}(\mathbf{y}) \ge \xi$.

ALGORITHM 15.2. DENCLUE Algorithm

DENCLUE (D, h, ξ, ε):
1  A ← ∅
2  foreach x ∈ D do // find density attractors
4      x∗ ← FINDATTRACTOR(x, D, h, ε)
5      if f̂(x∗) ≥ ξ then
7          A ← A ∪ {x∗}
9          R(x∗) ← R(x∗) ∪ {x}
11 C ← {maximal C ⊆ A | ∀ x∗i, x∗j ∈ C, x∗i and x∗j are density reachable}
12 foreach C ∈ C do // density-based clusters
13     foreach x∗ ∈ C do C ← C ∪ R(x∗)
14 return C

FINDATTRACTOR (x, D, h, ε):
16 t ← 0
17 xt ← x
18 repeat
20     $\mathbf{x}_{t+1} \leftarrow \dfrac{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{i=1}^{n} K\!\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)}$
21     t ← t + 1
22 until ∥xt − xt−1∥ ≤ ε
24 return xt

DENCLUE Algorithm
The pseudo-code for DENCLUE is shown in Algorithm 15.2. The first step is to compute the density attractor x∗ for each point x in the dataset (line 4). If the density at x∗ is above the minimum density threshold ξ, the attractor is added to the set of attractors A. The data point x is also added to the set of points R(x∗) attracted to x∗
(line 9). In the second step, DENCLUE finds all the maximal subsets of attractors C ⊆ A, such that any pair of attractors in C is density-reachable from each other (line 11). These maximal subsets of mutually reachable attractors form the seed for each density-based cluster. Finally, for each attractor x∗ ∈ C, we add to the cluster all of the points R(x∗) that are attracted to x∗, which results in the final set of clusters C.
The FINDATTRACTOR method implements the hill-climbing process using the direct update rule [Eq. (15.7)], which results in fast convergence. To further speed up the influence computation, it is possible to compute the kernel values for only the nearest neighbors of xt . That is, we can index the points in the dataset D using a spatial index structure, so that we can quickly compute all the nearest neighbors of xt within some radius r . For the Gaussian kernel, we can set r = h · z, where h is the influence parameter that plays the role of standard deviation, and z specifies the number of standard deviations. Let Bd(xt,r) denote the set of all points in D that lie within a d-dimensional ball of radius r centered at xt. The nearest neighbor based update rule can then be expressed as
$\mathbf{x}_{t+1} = \frac{\sum_{\mathbf{x}_i\in B_d(\mathbf{x}_t, r)} K\!\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{\mathbf{x}_i\in B_d(\mathbf{x}_t, r)} K\!\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)}$
which can be used in line 20 in Algorithm 15.2. When the data dimensionality is not high, this can result in a significant speedup. However, the effectiveness deteriorates rapidly with increasing number of dimensions. This is due to two effects. The first is that finding Bd(xt,r) reduces to a linear-scan of the data taking O(n) time for each query. Second, due to the curse of dimensionality (see Chapter 6), nearly all points appear to be equally close to xt, thereby nullifying any benefits of computing the nearest neighbors.
Example 15.8. Figure 15.10 shows the DENCLUE clustering for the 2-dimensional Iris dataset comprising the sepal length and sepal width attributes. The results were obtained with h = 0.2 and ξ = 0.08, using a Gaussian kernel. The clustering is obtained by thresholding the probability density function in Figure 15.7b at ξ = 0.08. The two peaks correspond to the two final clusters. Whereas iris setosa is well separated, it is hard to separate the other two types of Irises.
Example 15.9. Figure 15.11 shows the clusters obtained by DENCLUE on the density-based dataset from Figure 15.1. Using the parameters h = 10 and ξ = 9.5 × 10−5, with a Gaussian kernel, we obtain eight clusters. The figure is obtained by slicing the density function at the density value ξ; only the regions above that value are plotted. All the clusters are correctly identified, with the exception of the two semicircular clusters on the lower right that appear merged into one cluster.
Figure 15.10. DENCLUE: Iris 2D dataset.

Figure 15.11. DENCLUE: density-based dataset.

DENCLUE: Special Cases
It can be shown that DBSCAN is a special case of the general kernel density estimate based clustering approach, DENCLUE. If we let h = ε and ξ = minpts, then using a discrete kernel DENCLUE yields exactly the same clusters as DBSCAN. Each density attractor corresponds to a core point, and the set of connected core points defines the attractors of a density-based cluster. It can also be shown that K-means is a special case of density-based clustering for appropriate values of h and ξ, with the density attractors corresponding to the cluster centroids. Further, it is worth noting that the density-based approach can produce hierarchical clusters by varying the ξ threshold.
For example, decreasing ξ can result in the merging of several clusters found at higher threshold values. At the same time, it can also lead to new clusters if the peak density satisfies the lower ξ value.
Computational Complexity
The time for DENCLUE is dominated by the cost of the hill-climbing process. For each point x ∈ D, finding the density attractor takes O(nt) time, where t is the maximum number of hill-climbing iterations. This is because each iteration takes O(n) time for computing the sum of the influence function over all the points xi ∈ D. The total cost to compute density attractors is therefore O(n2t). We assume that for reasonable values of h and ξ, there are only a few density attractors, that is, |A| = m ≪ n. The cost of finding the maximal reachable subsets of attractors is O(m2), and the final clusters can be obtained in O(n) time.
15.4 FURTHER READING
Kernel density estimation was developed independently in Rosenblatt (1956) and Parzen (1962). For an excellent description of density estimation techniques see Silverman (1986). The density-based DBSCAN algorithm was introduced in Ester et al. (1996). The DENCLUE method was proposed in Hinneburg and Keim (1998), with the faster direct update rule appearing in Hinneburg and Gabriel (2007). However, the direct update rule is essentially the mean-shift algorithm first proposed in Fukunaga and Hostetler (1975). See Cheng (1995) for convergence properties and generalizations of the mean-shift method.
Cheng, Y. (1995). Mean shift, mode seeking, and clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17 (8): 790–799.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Palo Alto, CA: AAAI Press, pp. 226–231.
Fukunaga, K. and Hostetler, L. (1975). The estimation of the gradient of a density func- tion, with applications in pattern recognition. IEEE Transactions on Information Theory, 21 (1): 32–40.
Hinneburg, A. and Gabriel, H.-H. (2007). Denclue 2.0: Fast clustering based on kernel density estimation. Proceedings of the 7th International Symposium on Intelligent Data Analysis. New York: Springer Science + Business Media, pp. 70–80.
Hinneburg, A. and Keim, D. A. (1998). An Efficient Approach to Clustering in Large Multimedia Databases with Noise. Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Palo Alto, CA: AAAI Press, pp. 58–65.
Parzen, E. (1962). On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33 (3): 1065–1076.
Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27 (3): 832–837.
Silverman, B. (1986). Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman and Hall/CRC.
15.5 EXERCISES
Q1. Consider Figure 15.12 and answer the following questions, assuming that we use the Euclidean distance between points, and that ε = 2 and minpts = 3.
(a) List all the core points.
(b) Is a directly density reachable from d?
(c) Is o density reachable from i? Show the intermediate points on the chain or the point where the chain breaks.
(d) Is density reachable a symmetric relationship, that is, if x is density reachable from y, does it imply that y is density reachable from x? Why or why not?
(e) Is l density connected to x? Show the intermediate points that make them density connected or violate the property, respectively.
(f) Is density connected a symmetric relationship?
(g) Show the density-based clusters and the noise points.
Figure 15.12. Dataset for Q1 (labeled points a through x).

Q2. Consider the points in Figure 15.13. Define the following distance measures:

$L_\infty(\mathbf{x}, \mathbf{y}) = \max_{i=1}^{d} |x_i - y_i|$

$L_{\frac{1}{2}}(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^{\frac{1}{2}}\right)^2$

$L_{\min}(\mathbf{x}, \mathbf{y}) = \min_{i=1}^{d} |x_i - y_i|$

$L_{pow}(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} 2^{i-1}(x_i - y_i)^2\right)^{1/2}$

(a) Using ε = 2, minpts = 5, and the L∞ distance, find all core, border, and noise points.
(b) Show the shape of the ball of radius ε = 4 using the L½ distance. Using minpts = 3, show all the clusters found by DBSCAN.
(c) Using ε = 1, minpts = 6, and Lmin, list all core, border, and noise points.
(d) Using ε = 4, minpts = 3, and Lpow, show all clusters found by DBSCAN.

Figure 15.13. Dataset for Q2 and Q3 (labeled points a through k).
Q3. Consider the points shown in Figure 15.13. Define the following two kernels:
$K_1(\mathbf{z}) = \begin{cases} 1 & \text{if } L_\infty(\mathbf{z}, \mathbf{0}) \le 1 \\ 0 & \text{otherwise} \end{cases}$

$K_2(\mathbf{z}) = \begin{cases} 1 & \text{if } \sum_{j=1}^{d} |z_j| \le 1 \\ 0 & \text{otherwise} \end{cases}$
Using each of the two kernels K1 and K2, answer the following questions assuming that h = 2:
(a) What is the probability density at e?
(b) What is the gradient at e?
(c) List all the density attractors for this dataset.
Q4. The Hessian matrix is defined as the set of partial derivatives of the gradient vector with respect to x. What is the Hessian matrix for the Gaussian kernel? Use the gradient in Eq. (15.6).
Q5. Let us compute the probability density at a point x using the k-nearest neighbor approach, given as
$\hat{f}(x) = \frac{k}{nV_x}$
where k is the number of nearest neighbors, n is the total number of points, and Vx is the volume of the region encompassing the k nearest neighbors of x. In other words,
we fix k and allow the volume to vary based on those k nearest neighbors of x. Given
the following points
2, 2.5, 3, 4, 4.5, 5, 6.1
Find the peak density in this dataset, assuming k = 4. Keep in mind that this may happen at a point other than those given above. Also, a point is its own nearest neighbor.
CHAPTER 16  Spectral and Graph Clustering
In this chapter we consider clustering over graph data, that is, given a graph, the goal is to cluster the nodes by using the edges and their weights, which represent the similarity between the incident nodes. Graph clustering is related to divisive hierarchical clustering, as many methods partition the set of nodes to obtain the final clusters using the pairwise similarity matrix between nodes. As we shall see, graph clustering also has a very strong connection to spectral decomposition of graph-based matrices. Finally, if the similarity matrix is positive semidefinite, it can be considered as a kernel matrix, and graph clustering is therefore also related to kernel-based clustering.
16.1 GRAPHS AND MATRICES
Given a dataset $D = \{\mathbf{x}_i\}_{i=1}^{n}$ consisting of n points in $\mathbb{R}^d$, let A denote the n × n symmetric similarity matrix between the points, given as

$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ a_{21} & a_{22} & \cdots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} \qquad (16.1)$

where $A(i, j) = a_{ij}$ denotes the similarity or affinity between points $\mathbf{x}_i$ and $\mathbf{x}_j$. We require the similarity to be symmetric and non-negative, that is, $a_{ij} = a_{ji}$ and $a_{ij} \ge 0$, respectively. The matrix A may be considered to be a weighted adjacency matrix of the weighted (undirected) graph G = (V, E), where each vertex is a point and each edge joins a pair of points, that is,

$V = \{\mathbf{x}_i \mid i = 1, \ldots, n\} \qquad E = \{(\mathbf{x}_i, \mathbf{x}_j) \mid 1 \le i, j \le n\}$

Further, the similarity matrix A gives the weight on each edge, that is, $a_{ij}$ denotes the weight of the edge $(\mathbf{x}_i, \mathbf{x}_j)$. If all affinities are 0 or 1, then A represents the regular adjacency relationship between the vertices.
For a vertex $\mathbf{x}_i$, let $d_i$ denote the degree of the vertex, defined as

$d_i = \sum_{j=1}^{n} a_{ij}$

We define the degree matrix $\Delta$ of graph G as the n × n diagonal matrix:

$\Delta = \begin{pmatrix} d_1 & 0 & \cdots & 0\\ 0 & d_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & d_n \end{pmatrix} = \begin{pmatrix} \sum_{j=1}^{n} a_{1j} & 0 & \cdots & 0\\ 0 & \sum_{j=1}^{n} a_{2j} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sum_{j=1}^{n} a_{nj} \end{pmatrix}$

$\Delta$ can be compactly written as $\Delta(i, i) = d_i$ for all $1 \le i \le n$.
Example 16.1. Figure 16.1 shows the similarity graph for the Iris dataset, obtained as follows. Each of the n = 150 points xi ∈ R4 in the Iris dataset is represented by a node in G. To create the edges, we first compute the pairwise similarity between the points using the Gaussian kernel [Eq. (5.10)]:
$a_{ij} = \exp\left\{-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right\}$
using σ = 1. Each edge (xi , xj ) has the weight aij . Next, for each node xi we compute the top q nearest neighbors in terms of the similarity value, given as
$N_q(\mathbf{x}_i) = \{\mathbf{x}_j \in V : a_{ij} \ge a_{iq}\}$
where aiq represents the similarity value between xi and its qth nearest neighbor. We used a value of q = 16, as in this case each node records at least 15 nearest neighbors (not including the node itself), which corresponds to 10% of the nodes. An edge is added between nodes xi and xj if and only if both nodes are mutual nearest neighbors, that is, if xj ∈ Nq (xi ) and xi ∈ Nq (xj ). Finally, if the resulting graph is disconnected, we add the top q most similar (i.e., highest weighted) edges between any two connected components.
The resulting Iris similarity graph is shown in Figure 16.1. It has |V| = n = 150 nodes and |E| = m = 1730 edges. Edges with similarity aij ≥ 0.9 are shown in black, and the remaining edges are shown in gray. Although aii = 1.0 for all nodes, we do not show the self-edges or loops.
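A compact NumPy sketch of the construction described in Example 16.1 (Gaussian similarities plus mutual q-nearest-neighbor pruning); the final step of reconnecting disconnected components is omitted here, and the function name is my own.

```python
import numpy as np

def similarity_graph(X, sigma=1.0, q=16):
    """Gaussian similarities; keep edge (i, j) only if i and j are mutual
    q-nearest neighbors in terms of similarity (higher similarity = nearer)."""
    X = np.asarray(X, dtype=float)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq / (2 * sigma ** 2))            # full similarity matrix
    np.fill_diagonal(A, 0.0)                      # ignore self-loops
    order = np.argsort(-A, axis=1)                # neighbors sorted by similarity
    knn = np.zeros_like(A, dtype=bool)
    knn[np.arange(len(X))[:, None], order[:, :q]] = True   # top-q per node
    mutual = knn & knn.T                          # mutual nearest neighbors
    return np.where(mutual, A, 0.0)               # weighted adjacency matrix
```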
Figure 16.1. Iris similarity graph.

Normalized Adjacency Matrix
The normalized adjacency matrix is obtained by dividing each row of the adjacency matrix by the degree of the corresponding node. Given the weighted adjacency matrix A for a graph G, its normalized adjacency matrix is defined as

$M = \Delta^{-1}A = \begin{pmatrix} \frac{a_{11}}{d_1} & \frac{a_{12}}{d_1} & \cdots & \frac{a_{1n}}{d_1}\\ \frac{a_{21}}{d_2} & \frac{a_{22}}{d_2} & \cdots & \frac{a_{2n}}{d_2}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{a_{n1}}{d_n} & \frac{a_{n2}}{d_n} & \cdots & \frac{a_{nn}}{d_n} \end{pmatrix} \qquad (16.2)$

Because A is assumed to have non-negative elements, this implies that each element of M, namely $m_{ij}$, is also non-negative, as $m_{ij} = \frac{a_{ij}}{d_i} \ge 0$. Consider the sum of the ith row in M; we have

$\sum_{j=1}^{n} m_{ij} = \sum_{j=1}^{n}\frac{a_{ij}}{d_i} = \frac{d_i}{d_i} = 1 \qquad (16.3)$

Thus, each row in M sums to 1. This implies that 1 is an eigenvalue of M. In fact, $\lambda_1 = 1$ is the largest eigenvalue of M, and the other eigenvalues satisfy $|\lambda_i| \le 1$. Also, if G is connected, then the eigenvector corresponding to $\lambda_1$ is $\mathbf{u}_1 = \frac{1}{\sqrt{n}}(1, 1, \ldots, 1)^T = \frac{1}{\sqrt{n}}\mathbf{1}$. Because M is not symmetric, its eigenvectors are not necessarily orthogonal.
Figure 16.2. Example graph (seven vertices, labeled 1 through 7).
Example 16.2. Consider the graph in Figure 16.2. Its adjacency and degree matrices are given as
$A = \begin{pmatrix} 0&1&0&1&0&1&0\\ 1&0&1&1&0&0&0\\ 0&1&0&1&0&0&1\\ 1&1&1&0&1&0&0\\ 0&0&0&1&0&1&1\\ 1&0&0&0&1&0&1\\ 0&0&1&0&1&1&0 \end{pmatrix} \qquad \Delta = \mathrm{diag}(3, 3, 3, 4, 3, 3, 3)$

The normalized adjacency matrix is as follows:

$M = \Delta^{-1}A = \begin{pmatrix} 0&0.33&0&0.33&0&0.33&0\\ 0.33&0&0.33&0.33&0&0&0\\ 0&0.33&0&0.33&0&0&0.33\\ 0.25&0.25&0.25&0&0.25&0&0\\ 0&0&0&0.33&0&0.33&0.33\\ 0.33&0&0&0&0.33&0&0.33\\ 0&0&0.33&0&0.33&0.33&0 \end{pmatrix}$

The eigenvalues of M sorted in decreasing order are as follows:

$\lambda_1 = 1 \quad \lambda_2 = 0.483 \quad \lambda_3 = 0.206 \quad \lambda_4 = -0.045 \quad \lambda_5 = -0.405 \quad \lambda_6 = -0.539 \quad \lambda_7 = -0.7$

The eigenvector corresponding to $\lambda_1 = 1$ is

$\mathbf{u}_1 = \frac{1}{\sqrt{7}}(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T$
Graph Laplacian Matrices
The Laplacian matrix of a graph is defined as

$L = \Delta - A = \begin{pmatrix} \sum_{j\ne 1} a_{1j} & -a_{12} & \cdots & -a_{1n}\\ -a_{21} & \sum_{j\ne 2} a_{2j} & \cdots & -a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ -a_{n1} & -a_{n2} & \cdots & \sum_{j\ne n} a_{nj} \end{pmatrix} \qquad (16.4)$
It is interesting to note that L is a symmetric, positive semidefinite matrix, as for any $\mathbf{c} \in \mathbb{R}^n$, we have

$\mathbf{c}^T L\mathbf{c} = \mathbf{c}^T(\Delta - A)\mathbf{c} = \mathbf{c}^T\Delta\mathbf{c} - \mathbf{c}^T A\mathbf{c} = \sum_{i=1}^{n} d_i c_i^2 - \sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j a_{ij}$

$= \frac{1}{2}\left(\sum_{i=1}^{n} d_i c_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j a_{ij} + \sum_{j=1}^{n} d_j c_j^2\right)$

$= \frac{1}{2}\left(\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} c_i^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n} c_i c_j a_{ij} + \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} c_j^2\right)$

$= \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}(c_i - c_j)^2 \ge 0, \quad \text{because } a_{ij} \ge 0 \text{ and } (c_i - c_j)^2 \ge 0 \qquad (16.5)$
This means that L has n real, non-negative eigenvalues, which can be arranged in
decreasing order as follows: λ1 ≥ λ2 ≥ ··· ≥ λn ≥ 0. Because L is symmetric, its
eigenvectors are orthonormal. Further, from Eq. (16.4) we can see that the first column
(and the first row) is a linear combination of the remaining columns (rows). That is, if
Li denotes the ith column of L, then we can observe that L1 +L2 +L3 +···+Ln = 0.
This implies that the rank of L is at most n − 1, and the smallest eigenvalue is $\lambda_n = 0$, with the corresponding eigenvector given as $\mathbf{u}_n = \frac{1}{\sqrt{n}}(1, 1, \ldots, 1)^T = \frac{1}{\sqrt{n}}\mathbf{1}$, provided the graph is connected. If the graph is disconnected, then the number of eigenvalues equal to zero specifies the number of connected components in the graph.
Example 16.3. Consider the graph in Figure 16.2, whose adjacency and degree matrices are shown in Example 16.2. The graph Laplacian is given as
$L = \Delta - A = \begin{pmatrix} 3&-1&0&-1&0&-1&0\\ -1&3&-1&-1&0&0&0\\ 0&-1&3&-1&0&0&-1\\ -1&-1&-1&4&-1&0&0\\ 0&0&0&-1&3&-1&-1\\ -1&0&0&0&-1&3&-1\\ 0&0&-1&0&-1&-1&3 \end{pmatrix}$

The eigenvalues of L are as follows:

$\lambda_1 = 5.618 \quad \lambda_2 = 4.618 \quad \lambda_3 = 4.414 \quad \lambda_4 = 3.382 \quad \lambda_5 = 2.382 \quad \lambda_6 = 1.586 \quad \lambda_7 = 0$

The eigenvector corresponding to $\lambda_7 = 0$ is

$\mathbf{u}_7 = \frac{1}{\sqrt{7}}(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T$
The normalized symmetric Laplacian matrix of a graph is defined as

$L^s = \Delta^{-1/2} L\,\Delta^{-1/2} = \Delta^{-1/2}(\Delta - A)\Delta^{-1/2} = \Delta^{-1/2}\Delta\Delta^{-1/2} - \Delta^{-1/2}A\Delta^{-1/2} = I - \Delta^{-1/2}A\Delta^{-1/2} \qquad (16.6)$

where $\Delta^{1/2}$ is the diagonal matrix given as $\Delta^{1/2}(i,i) = \sqrt{d_i}$, and $\Delta^{-1/2}$ is the diagonal matrix given as $\Delta^{-1/2}(i,i) = \frac{1}{\sqrt{d_i}}$ (assuming that $d_i \ne 0$), for $1 \le i \le n$. In other words, the normalized Laplacian is given as

$L^s = \Delta^{-1/2}L\Delta^{-1/2} = \begin{pmatrix} \frac{\sum_{j\ne 1} a_{1j}}{\sqrt{d_1 d_1}} & -\frac{a_{12}}{\sqrt{d_1 d_2}} & \cdots & -\frac{a_{1n}}{\sqrt{d_1 d_n}}\\ -\frac{a_{21}}{\sqrt{d_2 d_1}} & \frac{\sum_{j\ne 2} a_{2j}}{\sqrt{d_2 d_2}} & \cdots & -\frac{a_{2n}}{\sqrt{d_2 d_n}}\\ \vdots & \vdots & \ddots & \vdots\\ -\frac{a_{n1}}{\sqrt{d_n d_1}} & -\frac{a_{n2}}{\sqrt{d_n d_2}} & \cdots & \frac{\sum_{j\ne n} a_{nj}}{\sqrt{d_n d_n}} \end{pmatrix} \qquad (16.7)$

Like the derivation in Eq. (16.5), we can show that $L^s$ is also positive semidefinite because for any $\mathbf{c} \in \mathbb{R}^n$, we get

$\mathbf{c}^T L^s\mathbf{c} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\left(\frac{c_i}{\sqrt{d_i}} - \frac{c_j}{\sqrt{d_j}}\right)^2 \ge 0 \qquad (16.8)$

Further, if $L^s_i$ denotes the ith column of $L^s$, then from Eq. (16.7) we can see that

$\sqrt{d_1}\,L^s_1 + \sqrt{d_2}\,L^s_2 + \sqrt{d_3}\,L^s_3 + \cdots + \sqrt{d_n}\,L^s_n = \mathbf{0}$

That is, the first column is a linear combination of the other columns, which means that $L^s$ has rank at most n − 1, with the smallest eigenvalue $\lambda_n = 0$, and the corresponding eigenvector given as

$\frac{1}{\sqrt{\sum_i d_i}}\left(\sqrt{d_1}, \sqrt{d_2}, \ldots, \sqrt{d_n}\right)^T = \frac{1}{\sqrt{\sum_i d_i}}\Delta^{1/2}\mathbf{1}$

Combined with the fact that $L^s$ is positive semidefinite, we conclude that $L^s$ has n (not necessarily distinct) real, non-negative eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n = 0$.
Example 16.4. We continue with Example 16.3. For the graph in Figure 16.2, its normalized symmetric Laplacian is given as
$L^s = \begin{pmatrix} 1&-0.33&0&-0.29&0&-0.33&0\\ -0.33&1&-0.33&-0.29&0&0&0\\ 0&-0.33&1&-0.29&0&0&-0.33\\ -0.29&-0.29&-0.29&1&-0.29&0&0\\ 0&0&0&-0.29&1&-0.33&-0.33\\ -0.33&0&0&0&-0.33&1&-0.33\\ 0&0&-0.33&0&-0.33&-0.33&1 \end{pmatrix}$

The eigenvalues of $L^s$ are as follows:

$\lambda_1 = 1.7 \quad \lambda_2 = 1.539 \quad \lambda_3 = 1.405 \quad \lambda_4 = 1.045 \quad \lambda_5 = 0.794 \quad \lambda_6 = 0.517 \quad \lambda_7 = 0$

The eigenvector corresponding to $\lambda_7 = 0$ is

$\mathbf{u}_7 = \frac{1}{\sqrt{22}}\left(\sqrt{3}, \sqrt{3}, \sqrt{3}, \sqrt{4}, \sqrt{3}, \sqrt{3}, \sqrt{3}\right)^T = (0.37, 0.37, 0.37, 0.43, 0.37, 0.37, 0.37)^T$
The normalized asymmetric Laplacian matrix is defined as

$L^a = \Delta^{-1}L = \Delta^{-1}(\Delta - A) = I - \Delta^{-1}A = \begin{pmatrix} \frac{\sum_{j\ne 1} a_{1j}}{d_1} & -\frac{a_{12}}{d_1} & \cdots & -\frac{a_{1n}}{d_1}\\ -\frac{a_{21}}{d_2} & \frac{\sum_{j\ne 2} a_{2j}}{d_2} & \cdots & -\frac{a_{2n}}{d_2}\\ \vdots & \vdots & \ddots & \vdots\\ -\frac{a_{n1}}{d_n} & -\frac{a_{n2}}{d_n} & \cdots & \frac{\sum_{j\ne n} a_{nj}}{d_n} \end{pmatrix} \qquad (16.9)$

Consider the eigenvalue equation for the symmetric Laplacian $L^s$:

$L^s\mathbf{u} = \lambda\mathbf{u}$

Left multiplying by $\Delta^{-1/2}$ on both sides, we get

$\Delta^{-1/2}L^s\mathbf{u} = \lambda\Delta^{-1/2}\mathbf{u}$

$\Delta^{-1/2}\left(\Delta^{-1/2}L\Delta^{-1/2}\right)\mathbf{u} = \lambda\Delta^{-1/2}\mathbf{u}$

$\Delta^{-1}L\left(\Delta^{-1/2}\mathbf{u}\right) = \lambda\left(\Delta^{-1/2}\mathbf{u}\right)$

$L^a\mathbf{v} = \lambda\mathbf{v}$

where $\mathbf{v} = \Delta^{-1/2}\mathbf{u}$ is an eigenvector of $L^a$, and $\mathbf{u}$ is an eigenvector of $L^s$. Further, $L^a$ has the same set of eigenvalues as $L^s$, which means that $L^a$ is a positive semidefinite matrix with n real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n = 0$. From Eq. (16.9) we can see that if $L^a_i$ denotes the ith column of $L^a$, then $L^a_1 + L^a_2 + \cdots + L^a_n = \mathbf{0}$, which implies that $\mathbf{v}_n = \frac{1}{\sqrt{n}}\mathbf{1}$ is the eigenvector corresponding to the smallest eigenvalue $\lambda_n = 0$.
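A compact NumPy helper (the function name is my own) that assembles the degree matrix and the three Laplacians discussed above from a weighted adjacency matrix. Applied to the adjacency matrix of Figure 16.2, it should reproduce the matrices of Examples 16.2 through 16.5 up to rounding.

```python
import numpy as np

def graph_matrices(A):
    """Build Delta, M, L, L^s and L^a from a symmetric, non-negative
    adjacency matrix A with no zero-degree vertices."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                  # vertex degrees
    D = np.diag(d)                     # degree matrix Delta
    M = A / d[:, None]                 # normalized adjacency, Eq. (16.2)
    L = D - A                          # Laplacian, Eq. (16.4)
    Dih = np.diag(1.0 / np.sqrt(d))    # Delta^{-1/2}
    Ls = Dih @ L @ Dih                 # normalized symmetric Laplacian, Eq. (16.6)
    La = L / d[:, None]                # normalized asymmetric Laplacian, Eq. (16.9)
    return D, M, L, Ls, La
```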
Example 16.5. For the graph in Figure 16.2, its normalized asymmetric Laplacian matrix is given as
$L^a = \Delta^{-1}L = \begin{pmatrix} 1&-0.33&0&-0.33&0&-0.33&0\\ -0.33&1&-0.33&-0.33&0&0&0\\ 0&-0.33&1&-0.33&0&0&-0.33\\ -0.25&-0.25&-0.25&1&-0.25&0&0\\ 0&0&0&-0.33&1&-0.33&-0.33\\ -0.33&0&0&0&-0.33&1&-0.33\\ 0&0&-0.33&0&-0.33&-0.33&1 \end{pmatrix}$

The eigenvalues of $L^a$ are identical to those for $L^s$, namely

$\lambda_1 = 1.7 \quad \lambda_2 = 1.539 \quad \lambda_3 = 1.405 \quad \lambda_4 = 1.045 \quad \lambda_5 = 0.794 \quad \lambda_6 = 0.517 \quad \lambda_7 = 0$

The eigenvector corresponding to $\lambda_7 = 0$ is

$\mathbf{u}_7 = \frac{1}{\sqrt{7}}(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T$
16.2 CLUSTERING AS GRAPH CUTS
A k-way cut in a graph is a partitioning or clustering of the vertex set, given as $\mathcal{C} = \{C_1, \ldots, C_k\}$, such that $C_i \ne \emptyset$ for all i, $C_i \cap C_j = \emptyset$ for all i, j, and $V = \bigcup_i C_i$. We require $\mathcal{C}$ to optimize some objective function that captures the intuition that nodes within a cluster should have high similarity, and nodes from different clusters should have low similarity.
Given a weighted graph G defined by its similarity matrix [Eq. (16.1)], let S, T ⊆ V be any two subsets of the vertices. We denote by W(S,T) the sum of the weights on all
edges with one vertex in S and the other in T, given as

$W(S, T) = \sum_{v_i\in S}\sum_{v_j\in T} a_{ij}$

Given $S \subseteq V$, we denote by $\overline{S}$ the complementary set of vertices, that is, $\overline{S} = V - S$. A (vertex) cut in a graph is defined as a partitioning of V into $S \subset V$ and $\overline{S}$. The weight of the cut or cut weight is defined as the sum of all the weights on edges between vertices in S and $\overline{S}$, given as $W(S, \overline{S})$.

Given a clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ comprising k clusters, the size of a cluster $C_i$ is the number of nodes in the cluster, given as $|C_i|$. The volume of a cluster $C_i$ is defined as the sum of all the weights on edges with one end in cluster $C_i$:

$\mathrm{vol}(C_i) = \sum_{v_j\in C_i} d_j = \sum_{v_j\in C_i}\sum_{v_r\in V} a_{jr} = W(C_i, V)$
Let ci ∈ {0, 1}n be the cluster indicator vector that records the cluster membership for cluster Ci , defined as
$c_{ij} = \begin{cases} 1 & \text{if } v_j \in C_i \\ 0 & \text{if } v_j \notin C_i \end{cases}$

Because a clustering creates pairwise disjoint clusters, we immediately have

$\mathbf{c}_i^T\mathbf{c}_j = 0$

Further, the cluster size can be written as

$|C_i| = \mathbf{c}_i^T\mathbf{c}_i = \|\mathbf{c}_i\|^2$

The following identities allow us to express the weight of a cut in terms of matrix operations. Let us derive an expression for the sum of the weights for all edges with one end in $C_i$. These edges include internal cluster edges (with both ends in $C_i$), as well as external cluster edges (with the other end in another cluster $C_{j\ne i}$):

$\mathrm{vol}(C_i) = W(C_i, V) = \sum_{v_r\in C_i} d_r = \sum_{v_r\in C_i} c_{ir} d_r c_{ir} = \sum_{r=1}^{n}\sum_{s=1}^{n} c_{ir}\Delta_{rs} c_{is} = \mathbf{c}_i^T\Delta\,\mathbf{c}_i \qquad (16.10)$

Consider the sum of weights of all internal edges:

$W(C_i, C_i) = \sum_{v_r\in C_i}\sum_{v_s\in C_i} a_{rs} = \sum_{r=1}^{n}\sum_{s=1}^{n} c_{ir} a_{rs} c_{is} = \mathbf{c}_i^T A\,\mathbf{c}_i \qquad (16.11)$

We can get the sum of weights for all the external edges, or the cut weight, by subtracting Eq. (16.11) from Eq. (16.10), as follows:

$W(C_i, \overline{C}_i) = \sum_{v_r\in C_i}\sum_{v_s\in V - C_i} a_{rs} = W(C_i, V) - W(C_i, C_i) = \mathbf{c}_i^T(\Delta - A)\,\mathbf{c}_i = \mathbf{c}_i^T L\,\mathbf{c}_i \qquad (16.12)$
Example 16.6. Consider the graph in Figure 16.2. Assume that C1 = {1,2,3,4} and C2 = {5, 6, 7} are two clusters. Their cluster indicator vectors are given as
c1 = (1,1,1,1,0,0,0)T c2 = (0,0,0,0,1,1,1)T
As required, we have $\mathbf{c}_1^T\mathbf{c}_2 = 0$, and $\mathbf{c}_1^T\mathbf{c}_1 = \|\mathbf{c}_1\|^2 = 4$ and $\mathbf{c}_2^T\mathbf{c}_2 = 3$ give the cluster sizes. Consider the cut weight between C1 and C2. Because there are three edges between the two clusters, we have $W(C_1, \overline{C}_1) = W(C_1, C_2) = 3$. Using the Laplacian matrix L from Example 16.3, by Eq. (16.12) we have

$W(C_1, \overline{C}_1) = \mathbf{c}_1^T L\mathbf{c}_1 = (1, 1, 1, 1, 0, 0, 0)\, L\, (1, 1, 1, 1, 0, 0, 0)^T = (1, 0, 1, 1, -1, -1, -1)(1, 1, 1, 1, 0, 0, 0)^T = 3$
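The identity of Eq. (16.12) is easy to check numerically; a short NumPy snippet for the graph of Figure 16.2 and the clustering of Example 16.6:

```python
import numpy as np

# Adjacency matrix of the graph in Figure 16.2
A = np.array([[0,1,0,1,0,1,0],
              [1,0,1,1,0,0,0],
              [0,1,0,1,0,0,1],
              [1,1,1,0,1,0,0],
              [0,0,0,1,0,1,1],
              [1,0,0,0,1,0,1],
              [0,0,1,0,1,1,0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A               # graph Laplacian
c1 = np.array([1,1,1,1,0,0,0], dtype=float)  # indicator vector for C1
print(c1 @ L @ c1)                           # cut weight W(C1, C1-bar) = 3.0
```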
16.2.1 Clustering Objective Functions: Ratio and Normalized Cut
The clustering objective function can be formulated as an optimization problem over the k-way cut C = {C1,…,Ck}. We consider two common minimization objectives, namely ratio and normalized cut. We consider maximization objectives in Section 16.2.3, after describing the spectral clustering algorithm.
Ratio Cut
The ratio cut objective is defined over a k-way cut as follows:
$\min_{\mathcal{C}} J_{rc}(\mathcal{C}) = \sum_{i=1}^{k}\frac{W(C_i, \overline{C}_i)}{|C_i|} = \sum_{i=1}^{k}\frac{\mathbf{c}_i^T L\mathbf{c}_i}{\mathbf{c}_i^T\mathbf{c}_i} = \sum_{i=1}^{k}\frac{\mathbf{c}_i^T L\mathbf{c}_i}{\|\mathbf{c}_i\|^2} \qquad (16.13)$

where we make use of Eq. (16.12), that is, $W(C_i, \overline{C}_i) = \mathbf{c}_i^T L\mathbf{c}_i$.
Ratio cut tries to minimize the sum of the similarities from a cluster Ci to other
points not in the cluster Ci , taking into account the size of each cluster. One can observe that the objective function has a lower value when the cut weight is minimized and when the cluster size is large.
Unfortunately, for binary cluster indicator vectors ci, the ratio cut objective is NP-hard. An obvious relaxation is to allow ci to take on any real value. In this case, we can rewrite the objective as
$\min_{\mathcal{C}} J_{rc}(\mathcal{C}) = \sum_{i=1}^{k}\frac{\mathbf{c}_i^T L\mathbf{c}_i}{\|\mathbf{c}_i\|^2} = \sum_{i=1}^{k}\left(\frac{\mathbf{c}_i}{\|\mathbf{c}_i\|}\right)^T L\left(\frac{\mathbf{c}_i}{\|\mathbf{c}_i\|}\right) = \sum_{i=1}^{k}\mathbf{u}_i^T L\mathbf{u}_i \qquad (16.14)$

where $\mathbf{u}_i = \frac{\mathbf{c}_i}{\|\mathbf{c}_i\|}$ is the unit vector in the direction of $\mathbf{c}_i \in \mathbb{R}^n$, that is, $\mathbf{c}_i$ is assumed to be an arbitrary real vector.

To minimize $J_{rc}$ we take its derivative with respect to $\mathbf{u}_i$ and set it to the zero vector. To incorporate the constraint that $\mathbf{u}_i^T\mathbf{u}_i = 1$, we introduce the Lagrange multiplier $\lambda_i$ for each cluster $C_i$. We have

$\frac{\partial}{\partial\mathbf{u}_i}\left(\sum_{i=1}^{k}\mathbf{u}_i^T L\mathbf{u}_i + \sum_{i=1}^{n}\lambda_i(1 - \mathbf{u}_i^T\mathbf{u}_i)\right) = \mathbf{0}$, which implies that

$2L\mathbf{u}_i - 2\lambda_i\mathbf{u}_i = \mathbf{0}$, and thus

$L\mathbf{u}_i = \lambda_i\mathbf{u}_i \qquad (16.15)$

This implies that $\mathbf{u}_i$ is one of the eigenvectors of the Laplacian matrix L, corresponding to the eigenvalue $\lambda_i$. Using Eq. (16.15), we can see that

$\mathbf{u}_i^T L\mathbf{u}_i = \mathbf{u}_i^T\lambda_i\mathbf{u}_i = \lambda_i$

which in turn implies that to minimize the ratio cut objective [Eq. (16.14)], we should choose the k smallest eigenvalues, and the corresponding eigenvectors, so that

$\min_{\mathcal{C}} J_{rc}(\mathcal{C}) = \mathbf{u}_n^T L\mathbf{u}_n + \cdots + \mathbf{u}_{n-k+1}^T L\mathbf{u}_{n-k+1} = \lambda_n + \cdots + \lambda_{n-k+1} \qquad (16.16)$

where we assume that the eigenvalues have been sorted so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Noting that the smallest eigenvalue of L is $\lambda_n = 0$, the k smallest eigenvalues are $0 = \lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_{n-k+1}$. The corresponding eigenvectors $\mathbf{u}_n, \mathbf{u}_{n-1}, \ldots, \mathbf{u}_{n-k+1}$ represent the relaxed cluster indicator vectors. However, because $\mathbf{u}_n = \frac{1}{\sqrt{n}}\mathbf{1}$, it does not provide any guidance on how to separate the graph nodes if the graph is connected.
Normalized Cut
Normalized cut is similar to ratio cut, except that it divides the cut weight of each cluster by the volume of a cluster instead of its size. The objective function is given as

$\min_{\mathcal{C}} J_{nc}(\mathcal{C}) = \sum_{i=1}^{k}\frac{W(C_i, \overline{C}_i)}{\mathrm{vol}(C_i)} = \sum_{i=1}^{k}\frac{\mathbf{c}_i^T L\mathbf{c}_i}{\mathbf{c}_i^T\Delta\mathbf{c}_i} \qquad (16.17)$

where we use Eqs. (16.12) and (16.10), that is, $W(C_i, \overline{C}_i) = \mathbf{c}_i^T L\mathbf{c}_i$ and $\mathrm{vol}(C_i) = \mathbf{c}_i^T\Delta\mathbf{c}_i$, respectively. The $J_{nc}$ objective function has lower values when the cut weight is low and when the cluster volume is high, as desired.
As in the case of ratio cut, we can obtain an optimal solution to the normalized cut objective if we relax the condition that $\mathbf{c}_i$ be a binary cluster indicator vector. Instead we assume $\mathbf{c}_i$ to be an arbitrary real vector. Using the observation that the diagonal degree matrix can be written as $\Delta = \Delta^{1/2}\Delta^{1/2}$, and using the fact that $I = \Delta^{1/2}\Delta^{-1/2}$ and $\Delta^T = \Delta$ (because $\Delta$ is diagonal), we can rewrite the normalized cut objective in terms of the normalized symmetric Laplacian, as follows:

$\min_{\mathcal{C}} J_{nc}(\mathcal{C}) = \sum_{i=1}^{k}\frac{\mathbf{c}_i^T L\mathbf{c}_i}{\mathbf{c}_i^T\Delta\mathbf{c}_i} = \sum_{i=1}^{k}\frac{\mathbf{c}_i^T\Delta^{1/2}\left(\Delta^{-1/2}L\Delta^{-1/2}\right)\Delta^{1/2}\mathbf{c}_i}{\mathbf{c}_i^T\Delta^{1/2}\Delta^{1/2}\mathbf{c}_i} = \sum_{i=1}^{k}\frac{\left(\Delta^{1/2}\mathbf{c}_i\right)^T L^s\left(\Delta^{1/2}\mathbf{c}_i\right)}{\left(\Delta^{1/2}\mathbf{c}_i\right)^T\left(\Delta^{1/2}\mathbf{c}_i\right)} = \sum_{i=1}^{k}\left(\frac{\Delta^{1/2}\mathbf{c}_i}{\|\Delta^{1/2}\mathbf{c}_i\|}\right)^T L^s\left(\frac{\Delta^{1/2}\mathbf{c}_i}{\|\Delta^{1/2}\mathbf{c}_i\|}\right) = \sum_{i=1}^{k}\mathbf{u}_i^T L^s\mathbf{u}_i$

where $\mathbf{u}_i = \frac{\Delta^{1/2}\mathbf{c}_i}{\|\Delta^{1/2}\mathbf{c}_i\|}$ is the unit vector in the direction of $\Delta^{1/2}\mathbf{c}_i$. Following the same approach as in Eq. (16.15), we conclude that the normalized cut objective is optimized by selecting the k smallest eigenvalues of the normalized Laplacian matrix $L^s$, namely $0 = \lambda_n \le \cdots \le \lambda_{n-k+1}$.

The normalized cut objective [Eq. (16.17)] can also be expressed in terms of the normalized asymmetric Laplacian, by differentiating Eq. (16.17) with respect to $\mathbf{c}_i$ and setting the result to the zero vector. Noting that all terms other than that for $\mathbf{c}_i$ are constant with respect to $\mathbf{c}_i$, we have:

$\frac{\partial}{\partial\mathbf{c}_i}\left(\sum_{j=1}^{k}\frac{\mathbf{c}_j^T L\mathbf{c}_j}{\mathbf{c}_j^T\Delta\mathbf{c}_j}\right) = \frac{\partial}{\partial\mathbf{c}_i}\left(\frac{\mathbf{c}_i^T L\mathbf{c}_i}{\mathbf{c}_i^T\Delta\mathbf{c}_i}\right) = \mathbf{0}$

which implies that

$\frac{L\mathbf{c}_i\left(\mathbf{c}_i^T\Delta\mathbf{c}_i\right) - \Delta\mathbf{c}_i\left(\mathbf{c}_i^T L\mathbf{c}_i\right)}{\left(\mathbf{c}_i^T\Delta\mathbf{c}_i\right)^2} = \mathbf{0}$

$L\mathbf{c}_i = \frac{\mathbf{c}_i^T L\mathbf{c}_i}{\mathbf{c}_i^T\Delta\mathbf{c}_i}\,\Delta\mathbf{c}_i$

$\Delta^{-1}L\mathbf{c}_i = \lambda_i\mathbf{c}_i$

$L^a\mathbf{c}_i = \lambda_i\mathbf{c}_i$

where $\lambda_i = \frac{\mathbf{c}_i^T L\mathbf{c}_i}{\mathbf{c}_i^T\Delta\mathbf{c}_i}$ is the eigenvalue corresponding to the ith eigenvector $\mathbf{c}_i$ of the asymmetric Laplacian matrix $L^a$. To minimize the normalized cut objective we therefore choose the k smallest eigenvalues of $L^a$, namely, $0 = \lambda_n \le \cdots \le \lambda_{n-k+1}$.

To derive the clustering, for $L^a$, we can use the corresponding eigenvectors $\mathbf{u}_n, \ldots, \mathbf{u}_{n-k+1}$, with $\mathbf{c}_i = \mathbf{u}_i$ representing the real-valued cluster indicator vectors.
However, note that for $L^a$, we have $\mathbf{c}_n = \mathbf{u}_n = \frac{1}{\sqrt{n}}\mathbf{1}$. Further, for the normalized symmetric Laplacian $L^s$, the real-valued cluster indicator vectors are given as $\mathbf{c}_i = \Delta^{-1/2}\mathbf{u}_i$, which again implies that $\mathbf{c}_n = \frac{1}{\sqrt{n}}\mathbf{1}$. This means that the eigenvector $\mathbf{u}_n$ corresponding to the smallest eigenvalue $\lambda_n = 0$ does not by itself contain any useful information for clustering if the graph is connected.
16.2.2 Spectral Clustering Algorithm
Algorithm 16.1 gives the pseudo-code for the spectral clustering approach. We assume that the underlying graph is connected. The method takes a dataset D as input and computes the similarity matrix A. Alternatively, the matrix A may be directly input as well. Depending on the objective function, we choose the corresponding matrix B. For instance, for normalized cut B is chosen to be either Ls or La, whereas for ratio cut we choose B = L. Next, we compute the k smallest eigenvalues and eigenvectors of B. However, the main problem we face is that the eigenvectors ui are not binary, and thus it is not immediately clear how we can assign points to clusters. One solution to this problem is to treat the n × k matrix of eigenvectors as a new data matrix:
$U = \begin{pmatrix} | & | & & |\\ \mathbf{u}_n & \mathbf{u}_{n-1} & \cdots & \mathbf{u}_{n-k+1}\\ | & | & & | \end{pmatrix} = \begin{pmatrix} u_{n,1} & u_{n-1,1} & \cdots & u_{n-k+1,1}\\ u_{n,2} & u_{n-1,2} & \cdots & u_{n-k+1,2}\\ \vdots & \vdots & \ddots & \vdots\\ u_{n,n} & u_{n-1,n} & \cdots & u_{n-k+1,n} \end{pmatrix} \qquad (16.18)$

Next, we normalize each row of U to obtain the unit vector:

$\mathbf{y}_i = \frac{1}{\sqrt{\sum_{j=1}^{k} u_{n-j+1,i}^2}}\left(u_{n,i}, u_{n-1,i}, \ldots, u_{n-k+1,i}\right)^T \qquad (16.19)$

which yields the new normalized data matrix $Y \in \mathbb{R}^{n\times k}$ comprising n points in a reduced k-dimensional space, with $\mathbf{y}_1^T, \mathbf{y}_2^T, \ldots, \mathbf{y}_n^T$ as its rows.

ALGORITHM 16.1. Spectral Clustering Algorithm

SPECTRAL CLUSTERING (D, k):
1 Compute the similarity matrix $A \in \mathbb{R}^{n\times n}$
2 if ratio cut then B ← L
3 else if normalized cut then B ← $L^s$ or $L^a$
4 Solve $B\mathbf{u}_i = \lambda_i\mathbf{u}_i$ for $i = n, \ldots, n-k+1$, where $\lambda_n \le \lambda_{n-1} \le \cdots \le \lambda_{n-k+1}$
5 $U \leftarrow \left(\mathbf{u}_n\ \mathbf{u}_{n-1}\ \cdots\ \mathbf{u}_{n-k+1}\right)$
6 Y ← normalize rows of U using Eq. (16.19)
7 $\mathcal{C} \leftarrow \{C_1, \ldots, C_k\}$ via K-means on Y
We can now cluster the new points in Y into k clusters via the K-means algorithm or any other fast clustering method, as it is expected that the clusters are well-separated in the k-dimensional eigen-space. Note that for L, Ls , and La , the cluster indicator vector corresponding to the smallest eigenvalue λn = 0 is a vector of all 1’s, which does not provide any information about how to separate the nodes. The real information for clustering is contained in eigenvectors starting from the second smallest eigenvalue. However, if the graph is disconnected, then even the eigenvector corresponding to λn can contain information valuable for clustering. Thus, we retain all k eigenvectors in U in Eq. (16.18).
Strictly speaking, the normalization step [Eq. (16.19)] is recommended only for the normalized symmetric Laplacian $L^s$. This is because the eigenvectors of $L^s$ and the cluster indicator vectors are related as $\Delta^{1/2}\mathbf{c}_i = \mathbf{u}_i$. The jth entry of $\mathbf{u}_i$, corresponding to vertex $v_j$, is given as

$u_{ij} = \frac{\sqrt{d_j}\, c_{ij}}{\sqrt{\sum_{r=1}^{n} d_r c_{ir}^2}}$
If vertex degrees vary a lot, vertices with small degrees would have very small values uij . This can cause problems for K-means for correctly clustering these vertices. The normalization step helps alleviate this problem for Ls, though it can also help other objectives.
Computational Complexity
The computational complexity of the spectral clustering algorithm is O(n3), because computing the eigenvectors takes that much time. However, if the graph is sparse, the complexity to compute the eigenvectors is O(mn) where m is the number of edges in the graph. In particular, if m = O(n), then the complexity reduces to O(n2). Running the K-means method on Y takes O(tnk2) time, where t is the number of iterations K-means takes to converge.
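A minimal sketch of Algorithm 16.1 for the symmetric choices of B (L for ratio cut, $L^s$ for normalized cut), assuming NumPy and scikit-learn are available; the function name and parameter defaults are my own.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k, objective="ratio"):
    """Take the eigenvectors of the k smallest eigenvalues of B, normalize
    the rows [Eq. (16.19)], and cluster the resulting points with K-means."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    L = np.diag(d) - A
    if objective == "ratio":
        B = L
    else:
        Dih = np.diag(1.0 / np.sqrt(d))
        B = Dih @ L @ Dih                          # normalized symmetric Laplacian
    evals, evecs = np.linalg.eigh(B)               # eigh returns ascending eigenvalues
    U = evecs[:, :k]                               # eigenvectors of the k smallest eigenvalues
    Y = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
```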
Example 16.7. Consider the normalized cut approach applied to the graph in Figure 16.2. Assume that we want to find k = 2 clusters. For the normalized asymmetric Laplacian matrix from Example 16.5, we compute the eigenvectors, v7 and v6, corresponding to the two smallest eigenvalues, λ7 = 0 and λ6 = 0.517. The matrix composed of both the eigenvectors is given as
$U = \begin{pmatrix} \mathbf{u}_1 & \mathbf{u}_2 \end{pmatrix} = \begin{pmatrix} -0.378 & -0.226\\ -0.378 & -0.499\\ -0.378 & -0.226\\ -0.378 & -0.272\\ -0.378 & 0.425\\ -0.378 & 0.444\\ -0.378 & 0.444 \end{pmatrix}$

We treat the ith component of $\mathbf{u}_1$ and $\mathbf{u}_2$ as the ith point $(u_{1i}, u_{2i}) \in \mathbb{R}^2$, and after normalizing all points to have unit length we obtain the new dataset:

$Y = \begin{pmatrix} -0.859 & -0.513\\ -0.604 & -0.797\\ -0.859 & -0.513\\ -0.812 & -0.584\\ -0.664 & 0.747\\ -0.648 & 0.761\\ -0.648 & 0.761 \end{pmatrix}$

For instance, the first point is computed as

$\mathbf{y}_1 = \frac{1}{\sqrt{(-0.378)^2 + (-0.226)^2}}(-0.378, -0.226)^T = (-0.859, -0.513)^T$

Figure 16.3. K-means on spectral dataset Y.
Figure 16.3 plots the new dataset Y. Clustering the points into k = 2 groups using K-means yields the two clusters C1 = {1, 2, 3, 4} and C2 = {5, 6, 7}.
Example 16.8. We apply spectral clustering on the Iris graph in Figure 16.1 using the normalized cut objective with the asymmetric Laplacian matrix La. Figure 16.4 shows the k = 3 clusters. Comparing them with the true Iris classes (not used in the clustering), we obtain the contingency table shown in Table 16.1, indicating the number of points clustered correctly (on the main diagonal) and incorrectly (off-diagonal). We can see that cluster C1 corresponds mainly to iris-setosa, C2 to iris-virginica, and C3 to iris-versicolor. The latter two are more difficult to separate. In total there are 18 points that are misclustered when compared to the true Iris types.
Figure 16.4. Normalized cut on Iris graph.

Table 16.1. Contingency table: clusters versus Iris types

                  iris-setosa   iris-virginica   iris-versicolor
C1 (triangle)          50              0                 4
C2 (square)             0             36                 0
C3 (circle)             0             14                46

16.2.3 Maximization Objectives: Average Cut and Modularity
We now discuss two clustering objective functions that can be formulated as maximization problems over the k-way cut C = {C1,…,Ck}. These include average weight and modularity. We also explore their connections with normalized cut and kernel K-means.
Average Weight
The average weight objective is defined as
$\max_{\mathcal{C}} J_{aw}(\mathcal{C}) = \sum_{i=1}^{k}\frac{W(C_i, C_i)}{|C_i|} = \sum_{i=1}^{k}\frac{\mathbf{c}_i^T A\mathbf{c}_i}{\mathbf{c}_i^T\mathbf{c}_i} \qquad (16.20)$

where we used the equivalence $W(C_i, C_i) = \mathbf{c}_i^T A\mathbf{c}_i$ established in Eq. (16.11). Instead of trying to minimize the weights on edges between clusters as in ratio cut, average weight tries to maximize the within-cluster weights. The problem of maximizing $J_{aw}$ for binary cluster indicator vectors is also NP-hard; we can obtain a solution by relaxing the constraint on $\mathbf{c}_i$, by assuming that it can take on any real values for its elements. This leads to the relaxed objective

$\max_{\mathcal{C}} J_{aw}(\mathcal{C}) = \sum_{i=1}^{k}\mathbf{u}_i^T A\mathbf{u}_i \qquad (16.21)$

where $\mathbf{u}_i = \frac{\mathbf{c}_i}{\|\mathbf{c}_i\|}$. Following the same approach as in Eq. (16.15), we can maximize the objective by selecting the k largest eigenvalues of A, and the corresponding eigenvectors:

$\max_{\mathcal{C}} J_{aw}(\mathcal{C}) = \mathbf{u}_1^T A\mathbf{u}_1 + \cdots + \mathbf{u}_k^T A\mathbf{u}_k = \lambda_1 + \cdots + \lambda_k$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.

If we assume that A is the weighted adjacency matrix obtained from a symmetric and positive semidefinite kernel, that is, with $a_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$, then A will be positive semidefinite and will have non-negative real eigenvalues. In general, if we threshold A or if A is the unweighted adjacency matrix for an undirected graph, then even though A is symmetric, it may not be positive semidefinite. This means that in general A can have negative eigenvalues, though they are all real. Because $J_{aw}$ is a maximization problem, this means that we must consider only the positive eigenvalues and the corresponding eigenvectors.
Example 16.9. For the graph in Figure 16.2, with the adjacency matrix shown in Example 16.3, its eigenvalues are as follows:
λ1 =3.18 λ2 =1.49 λ3 =0.62 λ4 =−0.15 λ5 = −1.27 λ6 = −1.62 λ7 = −2.25
We can see that the eigenvalues can be negative, as A is the unweighted adjacency matrix and is not positive semidefinite.
Average Weight and Kernel K-means The average weight objective leads to an interesting connection between kernel K-means and graph cuts. If the weighted adjacency matrix A represents the kernel value between a pair of points, so that aij = K(xi , xj ), then we may use the sum of squared errors objective [Eq. (13.3)] of kernel K-means for graph clustering. The SSE objective is given as
$\min_{\mathcal{C}} J_{sse}(\mathcal{C}) = \sum_{j=1}^{n} K(\mathbf{x}_j, \mathbf{x}_j) - \sum_{i=1}^{k}\frac{1}{|C_i|}\sum_{\mathbf{x}_r\in C_i}\sum_{\mathbf{x}_s\in C_i} K(\mathbf{x}_r, \mathbf{x}_s) = \sum_{j=1}^{n} a_{jj} - \sum_{i=1}^{k}\frac{1}{|C_i|}\sum_{v_r\in C_i}\sum_{v_s\in C_i} a_{rs} = \sum_{j=1}^{n} a_{jj} - \sum_{i=1}^{k}\frac{\mathbf{c}_i^T A\mathbf{c}_i}{\mathbf{c}_i^T\mathbf{c}_i} = \sum_{j=1}^{n} a_{jj} - J_{aw}(\mathcal{C}) \qquad (16.22)$

We can observe that because $\sum_{j=1}^{n} a_{jj}$ is independent of the clustering, minimizing the SSE objective is the same as maximizing the average weight objective. In particular, if $a_{ij}$ represents the linear kernel $\mathbf{x}_i^T\mathbf{x}_j$ between the nodes, then maximizing the average weight objective [Eq. (16.20)] is equivalent to minimizing the regular K-means SSE objective [Eq. (13.1)]. Thus, spectral clustering using $J_{aw}$ and kernel K-means represent two different approaches to solve the same problem. Kernel K-means tries to solve the NP-hard problem by using a greedy iterative approach to directly optimize the SSE objective, whereas the graph cut formulation tries to solve the same NP-hard problem by optimally solving a relaxed problem.

Modularity
Informally, modularity is defined as the difference between the observed and expected fraction of edges within a cluster. It measures the extent to which nodes of the same type (in our case, the same cluster) are linked to each other.
Unweighted Graphs Let us assume for the moment that the graph G is unweighted, and that A is its binary adjacency matrix. The number of edges within a cluster Ci is given as
$\frac{1}{2}\sum_{v_r\in C_i}\sum_{v_s\in C_i} a_{rs}$

where we divide by 2 because each edge is counted twice in the summation. Over all the clusters, the observed number of edges within the same cluster is given as

$\frac{1}{2}\sum_{i=1}^{k}\sum_{v_r\in C_i}\sum_{v_s\in C_i} a_{rs} \qquad (16.23)$

Let us compute the expected number of edges between any two vertices $v_r$ and $v_s$, assuming that edges are placed at random, and allowing multiple edges between the same pair of vertices. Let $|E| = m$ be the total number of edges in the graph. The probability that one end of an edge is $v_r$ is given as $\frac{d_r}{2m}$, where $d_r$ is the degree of $v_r$. The probability that one end is $v_r$ and the other $v_s$ is then given as

$p_{rs} = \frac{d_r}{2m}\cdot\frac{d_s}{2m} = \frac{d_r d_s}{4m^2}$

The number of edges between $v_r$ and $v_s$ follows a binomial distribution with success probability $p_{rs}$ over 2m trials (because we are selecting the two ends of m edges). The expected number of edges between $v_r$ and $v_s$ is given as

$2m\cdot p_{rs} = \frac{d_r d_s}{2m}$

The expected number of edges within a cluster $C_i$ is then

$\frac{1}{2}\sum_{v_r\in C_i}\sum_{v_s\in C_i}\frac{d_r d_s}{2m}$

and the expected number of edges within the same cluster, summed over all k clusters, is given as

$\frac{1}{2}\sum_{i=1}^{k}\sum_{v_r\in C_i}\sum_{v_s\in C_i}\frac{d_r d_s}{2m} \qquad (16.24)$

where we divide by 2 because each edge is counted twice. The modularity of the clustering $\mathcal{C}$ is defined as the difference between the observed and expected fraction of edges within the same cluster, obtained by subtracting Eq. (16.24) from Eq. (16.23), and dividing by the number of edges:

$Q = \frac{1}{2m}\sum_{i=1}^{k}\sum_{v_r\in C_i}\sum_{v_s\in C_i}\left(a_{rs} - \frac{d_r d_s}{2m}\right) \qquad (16.25)$

Because $2m = \sum_{i=1}^{n} d_i$, we can rewrite modularity as follows:

$Q = \sum_{i=1}^{k}\sum_{v_r\in C_i}\sum_{v_s\in C_i}\left(\frac{a_{rs}}{\sum_{j=1}^{n} d_j} - \frac{d_r d_s}{\left(\sum_{j=1}^{n} d_j\right)^2}\right)$
Weighted Graphs One advantage of the modularity formulation in Eq. (16.25) is that it directly generalizes to weighted graphs. Assume that A is the weighted adjacency matrix; we interpret the modularity of a clustering as the difference between the observed and expected fraction of weights on edges within the clusters.
From Eq. (16.11) we have

$\sum_{v_r\in C_i}\sum_{v_s\in C_i} a_{rs} = W(C_i, C_i)$

and from Eq. (16.10) we have

$\sum_{v_r\in C_i}\sum_{v_s\in C_i} d_r d_s = \left(\sum_{v_r\in C_i} d_r\right)\left(\sum_{v_s\in C_i} d_s\right) = W(C_i, V)^2$

Further, note that

$\sum_{j=1}^{n} d_j = W(V, V)$

Using the above equivalences, we can write the modularity objective [Eq. (16.25)] in terms of the weight function W as follows:

$\max_{\mathcal{C}} J_Q(\mathcal{C}) = \sum_{i=1}^{k}\left(\frac{W(C_i, C_i)}{W(V, V)} - \left(\frac{W(C_i, V)}{W(V, V)}\right)^2\right) \qquad (16.26)$
We now express the modularity objective [Eq.(16.26)] in matrix terms. From Eq. (16.11), we have
W(Ci,Ci)=cTi Aci
Also note that

$W(C_i, V) = \sum_{v_r\in C_i} d_r = \sum_{j=1}^{n} d_j c_{ij} = \mathbf{d}^T\mathbf{c}_i$

where $\mathbf{d} = (d_1, d_2, \ldots, d_n)^T$ is the vector of vertex degrees. Further, we have

$W(V, V) = \sum_{j=1}^{n} d_j = \mathrm{tr}(\Delta)$

where $\mathrm{tr}(\Delta)$ is the trace of $\Delta$, that is, the sum of the diagonal entries of $\Delta$.

The clustering objective based on modularity can then be written as

$\max_{\mathcal{C}} J_Q(\mathcal{C}) = \sum_{i=1}^{k}\left(\frac{\mathbf{c}_i^T A\mathbf{c}_i}{\mathrm{tr}(\Delta)} - \frac{(\mathbf{d}^T\mathbf{c}_i)^2}{\mathrm{tr}(\Delta)^2}\right) = \sum_{i=1}^{k}\mathbf{c}_i^T\left(\frac{A}{\mathrm{tr}(\Delta)} - \frac{\mathbf{d}\cdot\mathbf{d}^T}{\mathrm{tr}(\Delta)^2}\right)\mathbf{c}_i = \sum_{i=1}^{k}\mathbf{c}_i^T Q\mathbf{c}_i \qquad (16.27)$

where Q is the modularity matrix:

$Q = \frac{1}{\mathrm{tr}(\Delta)}\left(A - \frac{\mathbf{d}\cdot\mathbf{d}^T}{\mathrm{tr}(\Delta)}\right)$

Directly maximizing objective Eq. (16.27) for binary cluster vectors $\mathbf{c}_i$ is hard. We resort to the approximation that elements of $\mathbf{c}_i$ can take on real values. Further, we require that $\mathbf{c}_i^T\mathbf{c}_i = \|\mathbf{c}_i\|^2 = 1$ to ensure that $J_Q$ does not increase without bound. Following the approach in Eq. (16.15), we conclude that $\mathbf{c}_i$ is an eigenvector of Q. However, because this is a maximization problem, instead of selecting the k smallest eigenvalues, we select the k largest eigenvalues and the corresponding eigenvectors to obtain

$\max_{\mathcal{C}} J_Q(\mathcal{C}) = \mathbf{u}_1^T Q\mathbf{u}_1 + \cdots + \mathbf{u}_k^T Q\mathbf{u}_k = \lambda_1 + \cdots + \lambda_k$

where $\mathbf{u}_i$ is the eigenvector corresponding to $\lambda_i$, and the eigenvalues are sorted so that $\lambda_1 \ge \cdots \ge \lambda_n$. The relaxed cluster indicator vectors are given as $\mathbf{c}_i = \mathbf{u}_i$. Note that the modularity matrix Q is symmetric, but it is not positive semidefinite. This means that although it has real eigenvalues, they may be negative too. Also note that if $Q_i$ denotes the ith column of Q, then we have $Q_1 + Q_2 + \cdots + Q_n = \mathbf{0}$, which implies that 0 is an eigenvalue of Q with the corresponding eigenvector $\frac{1}{\sqrt{n}}\mathbf{1}$. Thus, for maximizing the modularity one should use only the positive eigenvalues.
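A short NumPy sketch of the modularity matrix of Eq. (16.27) (the function name is my own); run on the adjacency matrix of Figure 16.2 it should reproduce the eigenvalues listed in Example 16.10, up to rounding.

```python
import numpy as np

def modularity_matrix(A):
    """Q = (1/tr(Delta)) * (A - d d^T / tr(Delta)), with d the degree vector."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    t = d.sum()                            # tr(Delta); equals 2m for an unweighted graph
    return (A - np.outer(d, d) / t) / t

A = np.array([[0,1,0,1,0,1,0],
              [1,0,1,1,0,0,0],
              [0,1,0,1,0,0,1],
              [1,1,1,0,1,0,0],
              [0,0,0,1,0,1,1],
              [1,0,0,0,1,0,1],
              [0,0,1,0,1,1,0]], dtype=float)
evals = np.linalg.eigvalsh(modularity_matrix(A))
print(np.round(np.sort(evals)[::-1], 4))   # k largest (positive) ones drive the objective
```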
Example 16.10. Consider the graph in Figure 16.2. The degree vector is d = (3,3,3,4,3,3,3)T, and the sum of degrees is tr() = 22. The modularity matrix is given as
$Q = \frac{1}{\mathrm{tr}(\Delta)}A - \frac{1}{\mathrm{tr}(\Delta)^2}\mathbf{d}\cdot\mathbf{d}^T = \frac{1}{22}A - \frac{1}{484}\mathbf{d}\cdot\mathbf{d}^T$

$\;= \begin{pmatrix} -0.019&0.027&-0.019&0.021&-0.019&0.027&-0.019\\ 0.027&-0.019&0.027&0.021&-0.019&-0.019&-0.019\\ -0.019&0.027&-0.019&0.021&-0.019&-0.019&0.027\\ 0.021&0.021&0.021&-0.033&0.021&-0.025&-0.025\\ -0.019&-0.019&-0.019&0.021&-0.019&0.027&0.027\\ 0.027&-0.019&-0.019&-0.025&0.027&-0.019&0.027\\ -0.019&-0.019&0.027&-0.025&0.027&0.027&-0.019 \end{pmatrix}$

The eigenvalues of Q are as follows:

$\lambda_1 = 0.0678 \quad \lambda_2 = 0.0281 \quad \lambda_3 = 0 \quad \lambda_4 = -0.0068 \quad \lambda_5 = -0.0579 \quad \lambda_6 = -0.0736 \quad \lambda_7 = -0.1024$

The eigenvector corresponding to $\lambda_3 = 0$ is

$\mathbf{u}_3 = \frac{1}{\sqrt{7}}(1, 1, 1, 1, 1, 1, 1)^T = (0.38, 0.38, 0.38, 0.38, 0.38, 0.38, 0.38)^T$
Modularity as Average Weight  Consider what happens to the modularity matrix Q if we use the normalized adjacency matrix $M = \Delta^{-1}A$ in place of the standard adjacency matrix A in Eq. (16.27). In this case, we know by Eq. (16.3) that each row of M sums to 1, that is,

$\sum_{j=1}^{n} m_{ij} = d_i = 1, \text{ for all } i = 1, \ldots, n$

We thus have $\mathrm{tr}(\Delta) = \sum_{i=1}^{n} d_i = n$, and further $\mathbf{d}\cdot\mathbf{d}^T = \mathbf{1}_{n\times n}$, where $\mathbf{1}_{n\times n}$ is the n × n matrix of all 1's. The modularity matrix can then be written as

$Q = \frac{1}{n}M - \frac{1}{n^2}\mathbf{1}_{n\times n}$

For large graphs with many nodes, n is large and the second term practically vanishes, as $\frac{1}{n^2}$ will be very small. Thus, the modularity matrix can be reasonably approximated as

$Q \simeq \frac{1}{n}M \qquad (16.28)$

Substituting the above in the modularity objective [Eq. (16.27)], we get

$\max_{\mathcal{C}} J_Q(\mathcal{C}) = \sum_{i=1}^{k}\mathbf{c}_i^T Q\mathbf{c}_i = \sum_{i=1}^{k}\mathbf{c}_i^T M\mathbf{c}_i \qquad (16.29)$

where we dropped the $\frac{1}{n}$ factor because it is a constant for a given graph; it only scales the eigenvalues without affecting the eigenvectors.

In conclusion, if we use the normalized adjacency matrix, maximizing the modularity is equivalent to selecting the k largest eigenvalues and the corresponding eigenvectors of the normalized adjacency matrix M. Note that in this case modularity is also equivalent to the average weight objective and kernel K-means, as established in Eq. (16.22).

Normalized Modularity as Normalized Cut  Define the normalized modularity objective as follows:
$\max_{\mathcal{C}} J_{nQ}(\mathcal{C}) = \sum_{i=1}^{k}\frac{1}{W(C_i, V)}\left(\frac{W(C_i, C_i)}{W(V, V)} - \left(\frac{W(C_i, V)}{W(V, V)}\right)^2\right) \qquad (16.30)$

We can observe that the main difference from the modularity objective [Eq. (16.26)] is that we divide by $\mathrm{vol}(C_i) = W(C_i, V)$ for each cluster. Simplifying the above, we obtain

$J_{nQ}(\mathcal{C}) = \frac{1}{W(V, V)}\sum_{i=1}^{k}\left(\frac{W(C_i, C_i)}{W(C_i, V)} - \frac{W(C_i, V)}{W(V, V)}\right) = \frac{1}{W(V, V)}\left(\sum_{i=1}^{k}\frac{W(C_i, C_i)}{W(C_i, V)} - \sum_{i=1}^{k}\frac{W(C_i, V)}{W(V, V)}\right) = \frac{1}{W(V, V)}\left(\sum_{i=1}^{k}\frac{W(C_i, C_i)}{W(C_i, V)} - 1\right)$

Now consider the expression $(k-1) - W(V, V)\cdot J_{nQ}(\mathcal{C})$. We have

$(k-1) - W(V, V)\,J_{nQ}(\mathcal{C}) = (k-1) - \left(\sum_{i=1}^{k}\frac{W(C_i, C_i)}{W(C_i, V)} - 1\right) = k - \sum_{i=1}^{k}\frac{W(C_i, C_i)}{W(C_i, V)} = \sum_{i=1}^{k}\left(1 - \frac{W(C_i, C_i)}{W(C_i, V)}\right)$

$= \sum_{i=1}^{k}\frac{W(C_i, V) - W(C_i, C_i)}{W(C_i, V)} = \sum_{i=1}^{k}\frac{W(C_i, \overline{C}_i)}{W(C_i, V)} = \sum_{i=1}^{k}\frac{W(C_i, \overline{C}_i)}{\mathrm{vol}(C_i)} = J_{nc}(\mathcal{C})$

In other words, the normalized cut objective [Eq. (16.17)] is related to the normalized modularity objective [Eq. (16.30)] by the following equation:

$J_{nc}(\mathcal{C}) = (k-1) - W(V, V)\cdot J_{nQ}(\mathcal{C})$

Since W(V, V) is a constant for a given graph, we observe that minimizing normalized cut is equivalent to maximizing normalized modularity.
Spectral Clustering Algorithm
Both average weight and modularity are maximization objectives; therefore we have to slightly modify Algorithm 16.1 for spectral clustering to use these objectives. The matrix B is chosen to be A if we are maximizing average weight or Q for the modularity objective. Next, instead of computing the k smallest eigenvalues we have to select the k largest eigenvalues and their corresponding eigenvectors. Because both A and Q can have negative eigenvalues, we must select only the positive eigenvalues. The rest of the algorithm remains the same.
16.3 MARKOV CLUSTERING
We now consider a graph clustering method based on simulating a random walk on a weighted graph. The basic intuition is that if node transitions reflect the weights on the edges, then transitions from one node to another within a cluster are much more likely than transitions between nodes from different clusters. This is because nodes within a cluster have higher similarities or weights, and nodes across clusters have lower similarities.
Given the weighted adjacency matrix A for a graph G, the normalized adjacency matrix [Eq. (16.2)] is given as $M = \Delta^{-1}A$. The matrix M can be interpreted as the n × n transition matrix, where the entry $m_{ij} = \frac{a_{ij}}{d_i}$ can be interpreted as the probability of transitioning or jumping from node i to node j in the graph G. This is because M is a row stochastic or Markov matrix, which satisfies the following conditions: (1) elements of the matrix are non-negative, that is, $m_{ij} \ge 0$, which follows from the fact that A is non-negative, and (2) rows of M are probability vectors, that is, row elements add to 1, because

$\sum_{j=1}^{n} m_{ij} = \sum_{j=1}^{n}\frac{a_{ij}}{d_i} = 1$
The matrix M is thus the transition matrix for a Markov chain or a Markov random walk on graph G. A Markov chain is a discrete-time stochastic process over a set of
states, in our case the set of vertices V. The Markov chain makes a transition from one node to another at discrete timesteps t = 1, 2, . . . , with the probability of making a transition from node i to node j given as mij . Let the random variable Xt denote the state at time t. The Markov property means that the probability distribution of Xt over the states at time t depends only on the probability distribution of Xt−1, that is,
P(Xt =i|X0,X1,…,Xt−1)=P(Xt =i|Xt−1)
Further, we assume that the Markov chain is homogeneous, that is, the transition
probability
P(Xt =j|Xt−1 =i)=mij
is independent of the time step t.
Given node i the transition matrix M specifies the probabilities of reaching any
other node j in one time step. Starting from node i at t = 0, let us consider the probability of being at node j at t = 2, that is, after two steps. We denote by mij (2) the probability of reaching j from i in two time steps. We can compute this as follows:
    m_{ij}(2) = P(X_2 = j \mid X_0 = i) = \sum_{a=1}^{n} P(X_1 = a \mid X_0 = i) \, P(X_2 = j \mid X_1 = a) = \sum_{a=1}^{n} m_{ia} m_{aj} = m_i^T M_j    (16.31)

where m_i = (m_{i1}, m_{i2}, \ldots, m_{in})^T denotes the vector corresponding to the ith row of M and M_j = (m_{1j}, m_{2j}, \ldots, m_{nj})^T denotes the vector corresponding to the jth column of M.
Consider the product of M with itself:

    M^2 = M \cdot M = \begin{pmatrix} - \, m_1^T \, - \\ - \, m_2^T \, - \\ \vdots \\ - \, m_n^T \, - \end{pmatrix} \begin{pmatrix} | & | & & | \\ M_1 & M_2 & \cdots & M_n \\ | & | & & | \end{pmatrix} = \left\{ m_i^T M_j \right\}_{i,j=1}^{n} = \left\{ m_{ij}(2) \right\}_{i,j=1}^{n}    (16.32)
Equations (16.31) and (16.32) imply that M^2 is precisely the transition probability matrix for the Markov chain over two time-steps. Likewise, the three-step transition matrix is M^2 \cdot M = M^3. In general, the transition probability matrix for t time steps is given as

    M^{t-1} \cdot M = M^t    (16.33)
A random walk on G thus corresponds to taking successive powers of the transition matrix M. Let π0 specify the initial state probability vector at time t = 0, that is, π0i = P(X0 = i) is the probability of starting at node i, for all i = 1,…,n. Starting
from \pi_0, we can obtain the state probability vector for X_t, that is, the probability of being at node i at time-step t, as follows:

    \pi_t^T = \pi_{t-1}^T M = \pi_{t-2}^T M \cdot M = \pi_{t-2}^T M^2 = \pi_{t-3}^T M^2 \cdot M = \pi_{t-3}^T M^3 = \cdots = \pi_0^T M^t

Equivalently, taking the transpose on both sides, we get

    \pi_t = (M^t)^T \pi_0 = (M^T)^t \pi_0
The state probability vector thus converges to the dominant eigenvector of MT, reflecting the steady-state probability of reaching any node in the graph, regardless of the starting node. Note that if the graph is directed, then the steady-state vector is equivalent to the normalized prestige vector [Eq. (4.6)].
Transition Probability Inflation
We now consider a variation of the random walk, where the probability of transitioning from node i to j is inflated by taking each element mij to the power r ≥ 1. Given a transition matrix M, define the inflation operator Υ as follows:
    \Upsilon(M, r)_{ij} = \frac{(m_{ij})^r}{\sum_{a=1}^{n} (m_{ia})^r}    (16.34)
The inflation operation results in a transformed or inflated transition probability matrix because the elements remain non-negative, and each row is normalized to sum to 1. The net effect of the inflation operator is to increase the higher probability transitions and decrease the lower probability transitions.
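A direct sketch of the inflation operator of Eq. (16.34) in NumPy (an illustrative helper, not the book's implementation) raises every entry to the power r and renormalizes each row:

    import numpy as np

    def inflate(M, r):
        """Inflation operator: raise entries to power r and renormalize rows (Eq. 16.34)."""
        P = np.power(M, r)
        return P / P.sum(axis=1, keepdims=True)

For r = 1 the operator leaves M unchanged; as r grows, the probability mass in each row concentrates on the largest entries, which is exactly the effect described above.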
16.3.1 Markov Clustering Algorithm
The Markov clustering algorithm (MCL) is an iterative method that interleaves matrix expansion and inflation steps. Matrix expansion corresponds to taking successive powers of the transition matrix, leading to random walks of longer lengths. On the other hand, matrix inflation makes the higher probability transitions even more likely and reduces the lower probability transitions. Because nodes in the same cluster are expected to have higher weights, and consequently higher transition probabilities between them, the inflation operator makes it more likely to stay within the cluster. It thus limits the extent of the random walk.
The pseudo-code for MCL is given in Algorithm 16.2. The method works on the weighted adjacency matrix for a graph. Instead of relying on a user-specified value for k, the number of output clusters, MCL takes as input the inflation parameter r ≥ 1. Higher values lead to more, smaller clusters, whereas smaller values lead to fewer, but larger clusters. However, the exact number of clusters cannot be pre-determined. Given the adjacency matrix A, MCL first adds loops or self-edges to A if they do not exist.
i,j=1
ALGORITHM 16.2. Markov Clustering Algorithm (MCL)

MARKOV CLUSTERING (A, r, ε):
 1  t ← 0
 2  Add self-edges to A if they do not exist
 3  M_t ← Δ^{-1} A
 4  repeat
 5      t ← t + 1
 6      M_t ← M_{t-1} · M_{t-1}
 7      M_t ← Υ(M_t, r)
 8  until ||M_t − M_{t-1}||_F ≤ ε
 9  G_t ← directed graph induced by M_t
10  C ← {weakly connected components in G_t}
If A is a similarity matrix, then this is not required, as a node is most similar to itself, and thus A should have high values on the diagonal. For simple, undirected graphs, if A is the adjacency matrix, then adding self-edges associates return probabilities with each node.
The iterative MCL expansion and inflation process stops when the transition matrix converges, that is, when the difference between the transition matrix from two successive iterations falls below some threshold ǫ ≥ 0. The matrix difference is given in terms of the Frobenius norm:
    ||M_t - M_{t-1}||_F = \sqrt{ \sum_{i=1}^{n} \sum_{j=1}^{n} \left( M_t(i,j) - M_{t-1}(i,j) \right)^2 }

The MCL process stops when ||M_t − M_{t-1}||_F ≤ ε.
MCL Graph
The final clusters are found by enumerating the weakly connected components in the directed graph induced by the converged transition matrix Mt. The directed graph induced by Mt is denoted as Gt = (Vt , Et ). The vertex set is the same as the set of nodes in the original graph, that is, Vt = V, and the edge set is given as
    E_t = \{ (i, j) \mid M_t(i, j) > 0 \}
In other words, a directed edge (i,j) exists only if node i can transition to node j within t steps of the expansion and inflation process. A node j is called an attractor if M_t(j,j) > 0, and we say that node i is attracted to attractor j if M_t(i,j) > 0. The MCL process yields a set of attractor nodes, V_a ⊆ V, such that other nodes are attracted to at least one attractor in V_a. That is, for all nodes i there exists a node j ∈ V_a, such that (i,j) ∈ E_t. A strongly connected component in a directed graph is defined as a maximal subgraph such that there exists a directed path between all pairs of vertices in the subgraph. To extract the clusters from G_t, MCL first finds
the strongly connected components S1 , S2 , . . . , Sq over the set of attractors Va . Next, for each strongly connected set of attractors Sj , MCL finds the weakly connected components consisting of all nodes i ∈ Vt − Va attracted to an attractor in Sj . If a node i is attracted to multiple strongly connected components, it is added to each such cluster, resulting in possibly overlapping clusters.
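Putting the pieces together, the following NumPy/SciPy sketch (my own function and variable names, not the book's code) follows the pseudocode of Algorithm 16.2 and, like line 10 of that pseudocode, simply extracts the weakly connected components of the induced graph, so overlapping clusters are not reported separately:

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def mcl(A, r=2.0, eps=1e-3, max_iter=100):
        """Markov clustering sketch: expansion, inflation, and cluster extraction.
        A is the weighted adjacency matrix (n x n numpy array), r the inflation
        parameter, and eps the Frobenius-norm convergence threshold."""
        A = np.asarray(A, dtype=float).copy()
        # add self-edges if missing (weight 1 is an assumption of this sketch)
        np.fill_diagonal(A, np.maximum(A.diagonal(), 1.0))
        M = A / A.sum(axis=1, keepdims=True)          # M_0 = Delta^{-1} A
        for _ in range(max_iter):
            M_new = M @ M                             # expansion
            M_new = np.power(M_new, r)                # inflation ...
            M_new /= M_new.sum(axis=1, keepdims=True) # ... with row renormalization
            if np.linalg.norm(M_new - M, 'fro') <= eps:
                M = M_new
                break
            M = M_new
        # induced directed graph: edge (i, j) iff M_t(i, j) > 0
        G = csr_matrix(M > 1e-12)
        n_comp, labels = connected_components(G, directed=True, connection='weak')
        return labels, M

For instance, mcl(A, r=2.5, eps=0.001) run on the small graph of Example 16.11 below should separate nodes {1, 2, 3, 4} from {5, 6, 7}, although the exact converged matrix depends on numerical details.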
Example 16.11. We apply the MCL method to find k = 2 clusters for the graph shown in Figure 16.2. We add the self-loops to the graph to obtain the adjacency matrix:

    A =
        1 1 0 1 0 1 0
        1 1 1 1 0 0 0
        0 1 1 1 0 0 1
        1 1 1 1 1 0 0
        0 0 0 1 1 1 1
        1 0 0 0 1 1 1
        0 0 1 0 1 1 1

The corresponding Markov matrix is given as

    M_0 = Δ^{-1} A =
        0.25 0.25 0    0.25 0    0.25 0
        0.25 0.25 0.25 0.25 0    0    0
        0    0.25 0.25 0.25 0    0    0.25
        0.20 0.20 0.20 0.20 0.20 0    0
        0    0    0    0.25 0.25 0.25 0.25
        0.25 0    0    0    0.25 0.25 0.25
        0    0    0.25 0    0.25 0.25 0.25

In the first iteration, we apply expansion and then inflation (with r = 2.5) to obtain

    M_1 = M_0 · M_0 =
        0.237 0.175 0.113 0.175 0.113 0.125 0.062
        0.175 0.237 0.175 0.237 0.050 0.062 0.062
        0.113 0.175 0.237 0.175 0.113 0.062 0.125
        0.140 0.190 0.140 0.240 0.090 0.100 0.100
        0.113 0.050 0.113 0.113 0.237 0.188 0.188
        0.125 0.062 0.062 0.125 0.188 0.250 0.188
        0.062 0.062 0.125 0.125 0.188 0.188 0.250

    M_1 = Υ(M_1, 2.5) =
        0.404 0.188 0.062 0.188 0.062 0.081 0.014
        0.154 0.331 0.154 0.331 0.007 0.012 0.012
        0.062 0.188 0.404 0.188 0.062 0.014 0.081
        0.109 0.234 0.109 0.419 0.036 0.047 0.047
        0.060 0.008 0.060 0.060 0.386 0.214 0.214
        0.074 0.013 0.013 0.074 0.204 0.418 0.204
        0.013 0.013 0.074 0.074 0.204 0.204 0.418
Figure 16.5. MCL attractors and clusters.
MCL converges in 10 iterations (using ε = 0.001), with the final transition matrix

    M (rows and columns indexed by nodes 1, ..., 7):
        0 0 0 1 0 0   0
        0 0 0 1 0 0   0
        0 0 0 1 0 0   0
        0 0 0 1 0 0   0
        0 0 0 0 0 0.5 0.5
        0 0 0 0 0 0.5 0.5
        0 0 0 0 0 0.5 0.5
Figure 16.5 shows the directed graph induced by the converged M matrix, where an edge (i,j) exists if and only if M(i,j) > 0. The nonzero diagonal elements of M are the attractors (nodes with self-loops, shown in gray). We can observe that M(4,4), M(6,6), and M(7,7) are all greater than zero, making nodes 4, 6, and 7 the three attractors. Because both 6 and 7 can reach each other, the equivalence classes of attractors are {4} and {6,7}. Nodes 1,2, and 3 are attracted to 4, and node 5 is attracted to both 6 and 7. Thus, the two weakly connected components that make up the two clusters are C1 = {1,2,3,4} and C2 = {5,6,7}.
Example 16.12. Figure 16.6a shows the clusters obtained via the MCL algorithm on the Iris graph from Figure 16.1, using r = 1.3 in the inflation step. MCL yields three attractors (shown as gray nodes; self-loops omitted), which separate the graph into three clusters. The contingency table for the discovered clusters versus the true Iris types is given in Table 16.2. One point with class iris-versicolor is (wrongly) grouped with iris-setosa in C1 , but 14 points from iris-virginica are misclustered.
Notice that the only parameter for MCL is r, the exponent for the inflation step. The number of clusters is not explicitly specified, but higher values of r result in more clusters. The value of r = 1.3 was used above because it resulted in three clusters. Figure 16.6b shows the results for r = 2. MCL yields nine clusters, where one of the clusters (top-most) has two attractors.
Table 16.2. Contingency table: MCL clusters versus Iris types

                   iris-setosa   iris-virginica   iris-versicolor
    C1 (triangle)       50              0                1
    C2 (square)          0             36                0
    C3 (circle)          0             14               49
Figure 16.6. MCL on Iris graph: (a) r = 1.3; (b) r = 2.

Computational Complexity
The computational complexity of the MCL algorithm is O(tn3), where t is the number of iterations until convergence. This follows from the fact that whereas the inflation operation takes O(n2) time, the expansion operation requires matrix multiplication, which takes O(n3) time. However, the matrices become sparse very quickly, and it is possible to use sparse matrix multiplication to obtain O(n2) complexity for expansion in later iterations. On convergence, the weakly connected components in Gt can be found in O(n + m) time, where m is the number of edges. Because Gt is very sparse, with m = O(n), the final clustering step takes O(n) time.
16.4 FURTHER READING
Spectral partitioning of graphs was first proposed in Donath and Hoffman (1973). Properties of the second smallest eigenvalue of the Laplacian matrix, also called algebraic connectivity, were studied in Fiedler (1973). A recursive bipartitioning approach to find k clusters using the normalized cut objective was given in Shi and Malik (2000). The direct k-way partitioning approach for normalized cut, using the normalized symmetric Laplacian matrix, was proposed in Ng, Jordan, and Weiss
Exercises 423
(2001). The connection between spectral clustering objective and kernel K-means was established in Dhillon, Guan, and Kulis (2007). The modularity objective was introduced in Newman (2003), where it was called assortativity coefficient. The spectral algorithm using the modularity matrix was first proposed in White and Smyth (2005). The relationship between modularity and normalized cut was shown in Yu and Ding (2010). For an excellent tutorial on spectral clustering techniques see Luxburg (2007). The Markov clustering algorithm was originally proposed in Dongen (2000). For an extensive review of graph clustering methods see Fortunato (2010).
Dhillon, I. S., Guan, Y., and Kulis, B. (2007). Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (11): 1944–1957.
Donath, W. E. and Hoffman, A. J. (1973). Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17 (5): 420–425.
Dongen, S. M. van (2000). "Graph clustering by flow simulation". PhD thesis. The University of Utrecht, The Netherlands.
Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23 (2): 298–305.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486 (3): 75–174.
Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17 (4): 395–416.
Newman, M. E. (2003). Mixing patterns in networks. Physical Review E, 67 (2): 026126.
Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, pp. 849–856.
Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (8): 888–905.
White, S. and Smyth, P. (2005). A spectral clustering approach to finding communities in graphs. Proceedings of the 5th SIAM International Conference on Data Mining. Philadelphia: SIAM, pp. 76–84.
Yu, L. and Ding, C. (2010). Network community discovery: solving modularity clustering via normalized cut. Proceedings of the 8th Workshop on Mining and Learning with Graphs. ACM, pp. 34–36.
16.5 EXERCISES
Q1. Show that if Q_i denotes the ith column of the modularity matrix Q, then \sum_{i=1}^{n} Q_i = 0.

Q2. Prove that both the normalized symmetric and asymmetric Laplacian matrices L^s [Eq. (16.6)] and L^a [Eq. (16.9)] are positive semidefinite. Also show that the smallest eigenvalue is λ_n = 0 for both.

Q3. Prove that the largest eigenvalue of the normalized adjacency matrix M [Eq. (16.2)] is 1, and further that all eigenvalues satisfy the condition that |λ_i| ≤ 1.

Q4. Show that \sum_{v_r \in C_i} c_{ir} d_r c_{ir} = \sum_{r=1}^{n} \sum_{s=1}^{n} c_{ir} \Delta_{rs} c_{is}, where c_i is the cluster indicator vector for cluster C_i and \Delta is the degree matrix for the graph.

Q5. For the normalized symmetric Laplacian L^s, show that for the normalized cut objective the real-valued cluster indicator vector corresponding to the smallest eigenvalue λ_n = 0 is given as c_n = \frac{1}{\sqrt{\sum_{i=1}^{n} d_i}} \Delta^{1/2} 1.

Figure 16.7. Graph for Q6.

Q6. Given the graph in Figure 16.7, answer the following questions:
(a) Cluster the graph into two clusters using ratio cut and normalized cut.
(b) Use the normalized adjacency matrix M for the graph and cluster it into two clusters using average weight and kernel K-means, using K = M + I.
(c) Cluster the graph using the MCL algorithm with inflation parameters r = 2 and r = 2.5.

Q7. Consider the data in Table 16.3. Assuming these are nodes in a graph, define the weighted adjacency matrix A using the linear kernel

    A(i, j) = 1 + x_i^T x_j

Cluster the data into two groups using the modularity objective.

Table 16.3. Data for Q7

          X1    X2    X3
    x1    0.4   0.9   0.6
    x2    0.5   0.1   0.6
    x3    0.6   0.3   0.6
    x4    0.4   0.8   0.5
CHAPTER 17  Clustering Validation
There exist many different clustering methods, depending on the type of clusters sought and on the inherent data characteristics. Given the diversity of clustering algorithms and their parameters it is important to develop objective approaches to assess clustering results. Cluster validation and assessment encompasses three main tasks: clustering evaluation seeks to assess the goodness or quality of the clustering, clustering stability seeks to understand the sensitivity of the clustering result to various algorithmic parameters, for example, the number of clusters, and clustering tendency assesses the suitability of applying clustering in the first place, that is, whether the data has any inherent grouping structure. There are a number of validity measures and statistics that have been proposed for each of the aforementioned tasks, which can be divided into three main types:
External: External validation measures employ criteria that are not inherent to the dataset. This can be in the form of prior or expert-specified knowledge about the clusters, for example, class labels for each point.
Internal: Internal validation measures employ criteria that are derived from the data itself. For instance, we can use intracluster and intercluster distances to obtain measures of cluster compactness (e.g., how similar are the points in the same cluster?) and separation (e.g., how far apart are the points in different clusters?).
Relative: Relative validation measures aim to directly compare different clusterings, usually those obtained via different parameter settings for the same algorithm.
In this chapter we study some of the main techniques for clustering validation and assessment spanning all three types of measures.
17.1 EXTERNAL MEASURES
As the name implies, external measures assume that the correct or ground-truth
clustering is known a priori. The true cluster labels play the role of external information
that is used to evaluate a given clustering. In general, we would not know the correct clustering; however, external measures can serve as a way to test and validate different methods. For instance, classification datasets that specify the class for each point can be used to evaluate the quality of a clustering. Likewise, synthetic datasets with known cluster structure can be created to evaluate various clustering algorithms by quantifying the extent to which they can recover the known groupings.
Let D = {xi}ni=1 be a dataset consisting of n points in a d-dimensional space, partitioned into k clusters. Let yi ∈ {1, 2, . . . , k} denote the ground-truth cluster membership or label information for each point. The ground-truth clustering is given as T = {T1,T2,…,Tk}, where the cluster Tj consists of all the points with label j, i.e., Tj = {xi ∈ D|yi = j}. Also, let C = {C1,…,Cr} denote a clustering of the same dataset into r clusters, obtained via some clustering algorithm, and let yˆi ∈ {1,2,…,r} denote the cluster label for xi. For clarity, henceforth, we will refer to T as the ground-truth partitioning, and to each Ti as a partition. We will call C a clustering, with each Ci referred to as a cluster. Because the ground truth is assumed to be known, typically clustering methods will be run with the correct number of clusters, that is, with r = k. However, to keep the discussion more general, we allow r to be different from k.
External evaluation measures try to capture the extent to which points from the same partition appear in the same cluster, and the extent to which points from different partitions are grouped in different clusters. There is usually a trade-off between these two goals, which is either explicitly captured by a measure or is implicit in its computation. All of the external measures rely on the r × k contingency table N that is induced by a clustering C and the ground-truth partitioning T, defined as follows:

    N(i, j) = n_{ij} = |C_i \cap T_j|

In other words, the count n_{ij} denotes the number of points that are common to cluster C_i and ground-truth partition T_j. Further, for clarity, let n_i = |C_i| denote the number of points in cluster C_i, and let m_j = |T_j| denote the number of points in partition T_j. The contingency table can be computed from T and C in O(n) time by examining the partition and cluster labels, y_i and ŷ_i, for each point x_i ∈ D and incrementing the corresponding count n_{ŷ_i, y_i}.
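As a small illustration, the contingency table can be built in one pass over the two label vectors. The sketch below is my own helper (function and variable names are assumptions); the later measure sketches in this chapter take such a table N as their input:

    import numpy as np

    def contingency_table(y_true, y_pred):
        """Build the r x k contingency table N with N[i, j] = |C_i ∩ T_j|.
        Rows are indexed by cluster labels, columns by ground-truth partitions."""
        clusters = np.unique(y_pred)
        partitions = np.unique(y_true)
        N = np.zeros((len(clusters), len(partitions)), dtype=int)
        c_idx = {c: i for i, c in enumerate(clusters)}
        t_idx = {t: j for j, t in enumerate(partitions)}
        for yhat, y in zip(y_pred, y_true):
            N[c_idx[yhat], t_idx[y]] += 1
        return N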
17.1.1 Matching Based Measures
Purity
Purity quantifies the extent to which a cluster Ci contains entities from only one partition. In other words, it measures how “pure” each cluster is. The purity of cluster Ci is defined as
    purity_i = \frac{1}{n_i} \max_{j=1}^{k} \{ n_{ij} \}

The purity of clustering C is defined as the weighted sum of the clusterwise purity values:

    purity = \sum_{i=1}^{r} \frac{n_i}{n} \, purity_i = \frac{1}{n} \sum_{i=1}^{r} \max_{j=1}^{k} \{ n_{ij} \}

where the ratio n_i/n denotes the fraction of points in cluster C_i. The larger the purity of C, the better the agreement with the ground truth. The maximum value of purity is 1, when each cluster comprises points from only one partition. When r = k, a purity value of 1 indicates a perfect clustering, with a one-to-one correspondence between the clusters and partitions. However, purity can be 1 even for r > k, when each of the clusters is a subset of a ground-truth partition. When r < k, purity can never be 1, because at least one cluster must contain points from more than one partition.
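For instance, purity can be read directly off the contingency table; a minimal sketch (assuming N is a NumPy array with clusters as rows, as built above):

    import numpy as np

    def purity_scores(N):
        """Clusterwise purity and overall purity from the contingency table N."""
        purity_i = N.max(axis=1) / N.sum(axis=1)     # per-cluster purity
        overall = N.max(axis=1).sum() / N.sum()      # weighted sum over clusters
        return purity_i, overall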
Maximum Matching
The maximum matching measure selects the mapping between clusters and partitions, such that the sum of the number of common points (nij ) is maximized, provided that only one cluster can match with a given partition. This is unlike purity, where two different clusters may share the same majority partition.
Formally, we treat the contingency table as a complete weighted bipartite graph G = (V, E), where each partition and cluster is a node, that is, V = C ∪ T , and there existsanedge(Ci,Tj)∈E,withweightw(Ci,Tj)=nij,forallCi ∈CandTj ∈T.A matching M in G is a subset of E, such that the edges in M are pairwise nonadjacent, that is, they do not have a common vertex. The maximum matching measure is defined as the maximum weight matching in G:
    match = \max_{M} \left\{ \frac{w(M)}{n} \right\}

where the weight of a matching M is simply the sum of all the edge weights in M, given as w(M) = \sum_{e \in M} w(e). The maximum matching can be computed in time O(|V|^2 \cdot |E|) = O((r + k)^2 rk), which is equivalent to O(k^4) if r = O(k).
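One way to compute the maximum weight matching on this bipartite graph is the Hungarian algorithm; the sketch below (my own illustration, not the procedure used in the text) applies SciPy's linear_sum_assignment to the negated contingency table, which turns the maximization into the minimization that routine solves:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def max_matching(N):
        """Maximum matching measure: best one-to-one cluster/partition correspondence."""
        row_ind, col_ind = linear_sum_assignment(-N)   # negate to maximize matched weight
        return N[row_ind, col_ind].sum() / N.sum()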
F-Measure
Given cluster C_i, let j_i denote the partition that contains the maximum number of points from C_i, that is, j_i = \arg\max_{j=1}^{k} \{ n_{ij} \}. The precision of a cluster C_i is the same as its purity:

    prec_i = \frac{1}{n_i} \max_{j=1}^{k} \{ n_{ij} \} = \frac{n_{i j_i}}{n_i}

It measures the fraction of points in C_i from the majority partition T_{j_i}. The recall of cluster C_i is defined as

    recall_i = \frac{n_{i j_i}}{|T_{j_i}|} = \frac{n_{i j_i}}{m_{j_i}}

where m_{j_i} = |T_{j_i}|. It measures the fraction of points in partition T_{j_i} shared in common with cluster C_i.
The F-measure is the harmonic mean of the precision and recall values for each cluster. The F-measure for cluster C_i is therefore given as

    F_i = \frac{2}{\frac{1}{prec_i} + \frac{1}{recall_i}} = \frac{2 \cdot prec_i \cdot recall_i}{prec_i + recall_i} = \frac{2 \, n_{i j_i}}{n_i + m_{j_i}}    (17.1)
The F-measure for the clustering C is the mean of clusterwise F-measure values:

    F = \frac{1}{r} \sum_{i=1}^{r} F_i
F-measure thus tries to balance the precision and recall values across all the clusters. For a perfect clustering, when r = k, the maximum value of the F-measure is 1.
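A compact sketch of the clustering F-measure computed from the contingency table (illustrative code, not from the text):

    import numpy as np

    def f_measure(N):
        """Clustering F-measure from the contingency table N (Eq. 17.1)."""
        n_i = N.sum(axis=1)            # cluster sizes
        m_j = N.sum(axis=0)            # partition sizes
        ji = N.argmax(axis=1)          # majority partition for each cluster
        n_iji = N[np.arange(N.shape[0]), ji]
        F_i = 2.0 * n_iji / (n_i + m_j[ji])   # per-cluster F-measure (Eq. 17.1)
        return F_i.mean()

On the contingency table of Example 17.1 below, this returns the value 0.885 reported in the comparison table.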
Example 17.1. Figure 17.1 shows two different clusterings obtained via the K-means algorithm on the Iris dataset, using the first two principal components as the two dimensions. Here n = 150, and k = 3. Visual inspection confirms that Figure 17.1a is a better clustering than that in Figure 17.1b. We now examine how the different contingency table based measures can be used to evaluate these two clusterings.
Consider the clustering in Figure 17.1a. The three clusters are illustrated with different symbols; the gray points are in the correct partition, whereas the white ones are wrongly clustered compared to the ground-truth Iris types. For instance, C3 mainly corresponds to partition T3 (Iris-virginica), but it has three points (the white triangles) from T2. The complete contingency table is as follows:
The complete contingency table is as follows:

                    iris-setosa   iris-versicolor   iris-virginica
                        T1              T2               T3          n_i
    C1 (squares)         0              47               14          61
    C2 (circles)        50               0                0          50
    C3 (triangles)       0               3               36          39
    m_j                 50              50               50         n = 150

To compute purity, we first note for each cluster the partition with the maximum overlap. We have the correspondence (C1, T2), (C2, T1), and (C3, T3). Thus, purity is given as

    purity = \frac{1}{150}(47 + 50 + 36) = \frac{133}{150} = 0.887

For this contingency table, the maximum matching measure gives the same result, as the correspondence above is in fact a maximum weight matching. Thus, match = 0.887.
The cluster C1 contains n_1 = 47 + 14 = 61 points, whereas its corresponding partition T2 contains m_2 = 47 + 3 = 50 points. Thus, the precision and recall for C1 are given as

    prec_1 = \frac{47}{61} = 0.77        recall_1 = \frac{47}{50} = 0.94

The F-measure for C1 is therefore

    F_1 = \frac{2 \cdot 0.77 \cdot 0.94}{0.77 + 0.94} = \frac{1.45}{1.71} = 0.85
Figure 17.1. K-means: Iris principal components dataset: (a) K-means: good; (b) K-means: bad.
We can also directly compute F_1 using Eq. (17.1):

    F_1 = \frac{2 \cdot n_{12}}{n_1 + m_2} = \frac{2 \cdot 47}{61 + 50} = \frac{94}{111} = 0.85
Likewise, we obtain F_2 = 1.0 and F_3 = 0.81. Thus, the F-measure value for the clustering is given as

    F = \frac{1}{3}(F_1 + F_2 + F_3) = \frac{2.66}{3} = 0.88
For the clustering in Figure 17.1b, we have the following contingency table:

          iris-setosa   iris-versicolor   iris-virginica
              T1              T2               T3          n_i
    C1        30               0                0          30
    C2        20               4                0          24
    C3         0              46               50          96
    m_j       50              50               50         n = 150
For the purity measure, the partition with which each cluster shares the most points is given as (C1, T1), (C2, T1), and (C3, T3). Thus, the purity value for this clustering is

    purity = \frac{1}{150}(30 + 20 + 50) = \frac{100}{150} = 0.67

We can see that both C1 and C2 choose partition T1 as the maximum overlapping partition. However, the maximum weight matching is different; it yields the correspondence (C1, T1), (C2, T2), and (C3, T3), and thus

    match = \frac{1}{150}(30 + 4 + 50) = \frac{84}{150} = 0.56
The table below compares the different contingency based measures for the two clusterings shown in Figure 17.1.

                purity   match    F
    (a) Good    0.887    0.887   0.885
    (b) Bad     0.667    0.560   0.658

As expected, the good clustering in Figure 17.1a has higher scores for the purity, maximum matching, and F-measure.
17.1.2 Entropy-based Measures

Conditional Entropy
The entropy of a clustering C is defined as

    H(C) = -\sum_{i=1}^{r} p_{C_i} \log p_{C_i}

where p_{C_i} = n_i/n is the probability of cluster C_i. Likewise, the entropy of the partitioning T is defined as

    H(T) = -\sum_{j=1}^{k} p_{T_j} \log p_{T_j}

where p_{T_j} = m_j/n is the probability of partition T_j.
The cluster-specific entropy of T, that is, the conditional entropy of T with respect to cluster C_i, is defined as

    H(T|C_i) = -\sum_{j=1}^{k} \frac{n_{ij}}{n_i} \log \frac{n_{ij}}{n_i}

The conditional entropy of T given clustering C is then defined as the weighted sum:

    H(T|C) = \sum_{i=1}^{r} \frac{n_i}{n} H(T|C_i) = -\sum_{i=1}^{r} \sum_{j=1}^{k} \frac{n_{ij}}{n} \log \frac{n_{ij}}{n_i}
           = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \frac{p_{ij}}{p_{C_i}}    (17.2)

where p_{ij} = n_{ij}/n is the probability that a point in cluster i also belongs to partition j. The
more a cluster’s members are split into different partitions, the higher the conditional entropy. For a perfect clustering, the conditional entropy value is zero, whereas the worst possible conditional entropy value is log k. Further, expanding Eq. (17.2), we can see that
    H(T|C) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \left( \log p_{ij} - \log p_{C_i} \right)
           = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log p_{ij} + \sum_{i=1}^{r} \left( \log p_{C_i} \sum_{j=1}^{k} p_{ij} \right)
           = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log p_{ij} + \sum_{i=1}^{r} p_{C_i} \log p_{C_i}
           = H(C, T) - H(C)    (17.3)

where H(C, T) = -\sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log p_{ij} is the joint entropy of C and T. The conditional entropy H(T|C) thus measures the remaining entropy of T given the clustering C. In particular, H(T|C) = 0 if and only if T is completely determined by C, corresponding to the ideal clustering. On the other hand, if C and T are independent of each other, then H(T|C) = H(T), which means that C provides no information about T.
Normalized Mutual Information
The mutual information tries to quantify the amount of shared information between the clustering C and partitioning T, and it is defined as

    I(C, T) = \sum_{i=1}^{r} \sum_{j=1}^{k} p_{ij} \log \frac{p_{ij}}{p_{C_i} \cdot p_{T_j}}    (17.4)

It measures the dependence between the observed joint probability p_{ij} of C and T, and the expected joint probability p_{C_i} \cdot p_{T_j} under the independence assumption. When C and T are independent then p_{ij} = p_{C_i} \cdot p_{T_j}, and thus I(C, T) = 0. However, there is no upper bound on the mutual information.
Expanding Eq. (17.4) we observe that I(C, T) = H(C) + H(T) - H(C, T). Using Eq. (17.3), we obtain the two equivalent expressions:

    I(C, T) = H(T) - H(T|C)
    I(C, T) = H(C) - H(C|T)

Finally, because H(C|T) ≥ 0 and H(T|C) ≥ 0, we have the inequalities I(C, T) ≤ H(C) and I(C, T) ≤ H(T). We can obtain a normalized version of mutual information by considering the ratios I(C, T)/H(C) and I(C, T)/H(T), both of which can be at most one. The normalized mutual information (NMI) is defined as the geometric mean of these two ratios:

    NMI(C, T) = \sqrt{ \frac{I(C, T)}{H(C)} \cdot \frac{I(C, T)}{H(T)} } = \frac{I(C, T)}{\sqrt{H(C) \cdot H(T)}}
The NMI value lies in the range [0, 1]. Values close to 1 indicate a good clustering.
Variation of Information
This criterion is based on the mutual information between the clustering C and the ground-truth partitioning T , and their entropy; it is defined as
VI(C , T ) = (H(T ) − I(C , T )) + (H(C ) − I(C , T ))
= H(T ) + H(C ) − 2I(C , T ) (17.5)
Variation of information (VI) is zero only when C and T are identical. Thus, the lower the VI value the better the clustering C.
Using the equivalence I(C,T ) = H(T ) − H(T |C) = H(C) − H(C|T ), we can also express Eq. (17.5) as
VI(C , T ) = H(T |C ) + H(C |T )
Finally, noting that H(T |C ) = H(T , C ) − H(C ), another expression for VI is given as
VI(C , T ) = 2H(T , C ) − H(T ) − H(C )
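The entropy-based measures can all be computed from the joint probabilities p_ij = n_ij / n of the contingency table; the following sketch (my own, with an assumed log-base argument and assuming no empty clusters or partitions) returns H(T|C), NMI, and VI:

    import numpy as np

    def entropy_measures(N, base=2):
        """Conditional entropy H(T|C), NMI, and VI from the contingency table N."""
        n = N.sum()
        p_ij = N / n                                 # joint probabilities
        p_C = p_ij.sum(axis=1, keepdims=True)        # cluster marginals
        p_T = p_ij.sum(axis=0, keepdims=True)        # partition marginals
        log = lambda x: np.log(x) / np.log(base)
        nz = p_ij > 0                                # convention: 0 log 0 = 0
        H_C = -np.sum(p_C * log(p_C))
        H_T = -np.sum(p_T * log(p_T))
        H_T_given_C = -np.sum(p_ij[nz] * log((p_ij / p_C)[nz]))
        I = np.sum(p_ij[nz] * log((p_ij / (p_C * p_T))[nz]))
        NMI = I / np.sqrt(H_C * H_T)
        VI = H_T + H_C - 2 * I
        return H_T_given_C, NMI, VI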
Example 17.2. We continue with Example 17.1, which compares the two clusterings shown in Figure 17.1. For the entropy-based measures, we use base 2 for the logarithms; the formulas are valid for any base.
For the clustering in Figure 17.1a, we have the following contingency table:

          iris-setosa   iris-versicolor   iris-virginica
              T1              T2               T3          n_i
    C1         0              47               14          61
    C2        50               0                0          50
    C3         0               3               36          39
    m_j       50              50               50         n = 150
Consider the conditional entropy for cluster C1:

    H(T|C_1) = -\frac{0}{61} \log_2 \frac{0}{61} - \frac{47}{61} \log_2 \frac{47}{61} - \frac{14}{61} \log_2 \frac{14}{61}
             = -0 - 0.77 \log_2(0.77) - 0.23 \log_2(0.23) = 0.29 + 0.49 = 0.78

In a similar manner, we obtain H(T|C_2) = 0 and H(T|C_3) = 0.39. The conditional entropy for the clustering C is then given as

    H(T|C) = \frac{61}{150} \cdot 0.78 + \frac{50}{150} \cdot 0 + \frac{39}{150} \cdot 0.39 = 0.32 + 0 + 0.10 = 0.42
To compute the normalized mutual information, note that

    H(T) = -3 \cdot \frac{50}{150} \log_2 \frac{50}{150} = 1.585

    H(C) = -\left( \frac{61}{150} \log_2 \frac{61}{150} + \frac{50}{150} \log_2 \frac{50}{150} + \frac{39}{150} \log_2 \frac{39}{150} \right) = 0.528 + 0.528 + 0.505 = 1.561

    I(C, T) = \frac{47}{150} \log_2 \frac{47 \cdot 150}{61 \cdot 50} + \frac{14}{150} \log_2 \frac{14 \cdot 150}{61 \cdot 50} + \frac{50}{150} \log_2 \frac{50 \cdot 150}{50 \cdot 50}
            + \frac{3}{150} \log_2 \frac{3 \cdot 150}{39 \cdot 50} + \frac{36}{150} \log_2 \frac{36 \cdot 150}{39 \cdot 50}
            = 0.379 - 0.05 + 0.528 - 0.042 + 0.353 = 1.167

Thus, the NMI and VI values are

    NMI(C, T) = \frac{I(C, T)}{\sqrt{H(T) \cdot H(C)}} = \frac{1.167}{\sqrt{1.585 \times 1.561}} = 0.742

    VI(C, T) = H(T) + H(C) - 2 I(C, T) = 1.585 + 1.561 - 2 \cdot 1.167 = 0.812
We can likewise compute these measures for the other clustering in Figure 17.1b, whose contingency table is shown in Example 17.1.
The table below compares the entropy-based measures for the two clusterings shown in Figure 17.1.

                H(T|C)   NMI     VI
    (a) Good    0.418    0.742   0.812
    (b) Bad     0.743    0.587   1.200

As expected, the good clustering in Figure 17.1a has a higher score for normalized mutual information, and lower scores for conditional entropy and variation of information.
17.1.3 Pairwise Measures
Given clustering C and ground-truth partitioning T , the pairwise measures utilize the partition and cluster label information over all pairs of points. Let xi , xj ∈ D be any two points, with i ̸= j . Let yi denote the true partition label and let yˆi denote the cluster labelforpointxi.Ifbothxi andxj belongtothesamecluster,thatis,yˆi =yˆj,wecallit a positive event, and if they do not belong to the same cluster, that is, yˆi ̸= yˆj , we call that a negative event. Depending on whether there is agreement between the cluster labels and partition labels, there are four possibilities to consider:
• True Positives: xi and xj belong to the same partition in T , and they are also in the same cluster in C. This is a true positive pair because the positive event, yˆi = yˆj , corresponds to the ground truth, yi = yj . The number of true positive pairs is given as
    TP = \left| \{ (x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i = \hat{y}_j \} \right|
• False Negatives: xi and xj belong to the same partition in T , but they do not belong to the same cluster in C. That is, the negative event, yˆi ̸= yˆj , does not correspond to the truth, yi = yj . This pair is thus a false negative, and the number of all false negative pairs is given as
    FN = \left| \{ (x_i, x_j) : y_i = y_j \text{ and } \hat{y}_i \neq \hat{y}_j \} \right|
• False Positives: xi and xj do not belong to the same partition in T , but they do belong to the same cluster in C. This pair is a false positive because the positive event, yˆi = yˆj , is actually false, that is, it does not agree with the ground-truth partitioning, which indicates that yi ̸= yj . The number of false positive pairs is given as
    FP = \left| \{ (x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i = \hat{y}_j \} \right|
• True Negatives: xi and xj neither belong to the same partition in T , nor do they belong to the same cluster in C. This pair is thus a true negative, that is, yˆi ̸= yˆj and yi ̸= yj . The number of such true negative pairs is given as
    TN = \left| \{ (x_i, x_j) : y_i \neq y_j \text{ and } \hat{y}_i \neq \hat{y}_j \} \right|

Because there are N = \binom{n}{2} = \frac{n(n-1)}{2} pairs of points, we have the following identity:

    N = TP + FN + FP + TN    (17.6)
A naive computation of the preceding four cases requires O(n^2) time. However, they can be computed more efficiently using the contingency table N = \{n_{ij}\}, with 1 ≤ i ≤ r and 1 ≤ j ≤ k. The number of true positives is given as

    TP = \sum_{i=1}^{r} \sum_{j=1}^{k} \binom{n_{ij}}{2} = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{n_{ij}(n_{ij}-1)}{2} = \frac{1}{2} \left( \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 - \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij} \right)
       = \frac{1}{2} \left( \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 - n \right)    (17.7)

This follows from the fact that each pair of points among the n_{ij} share the same cluster label (i) and the same partition label (j). The last step follows from the fact that the sum of all the entries in the contingency table must add to n, that is, \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij} = n.
To compute the total number of false negatives, we remove the number of true positives from the number of pairs that belong to the same partition. Because two points x_i and x_j that belong to the same partition have y_i = y_j, if we remove the true positives, that is, pairs with \hat{y}_i = \hat{y}_j, we are left with pairs for whom \hat{y}_i \neq \hat{y}_j, that is, the false negatives. We thus have

    FN = \sum_{j=1}^{k} \binom{m_j}{2} - TP = \frac{1}{2} \left( \sum_{j=1}^{k} m_j^2 - \sum_{j=1}^{k} m_j - \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 + n \right)
       = \frac{1}{2} \left( \sum_{j=1}^{k} m_j^2 - \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 \right)    (17.8)

The last step follows from the fact that \sum_{j=1}^{k} m_j = n.
The number of false positives can be obtained in a similar manner by subtracting the number of true positives from the number of point pairs that are in the same cluster:

    FP = \sum_{i=1}^{r} \binom{n_i}{2} - TP = \frac{1}{2} \left( \sum_{i=1}^{r} n_i^2 - \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 \right)    (17.9)

Finally, the number of true negatives can be obtained via Eq. (17.6) as follows:

    TN = N - (TP + FN + FP) = \frac{1}{2} \left( n^2 - \sum_{i=1}^{r} n_i^2 - \sum_{j=1}^{k} m_j^2 + \sum_{i=1}^{r} \sum_{j=1}^{k} n_{ij}^2 \right)    (17.10)
Each of the four values can be computed in O(rk) time. Because the contingency table can be obtained in linear time, the total time to compute the four values is O(n + rk), which is much better than the naive O(n^2) bound. We next consider pairwise assessment measures based on these four values.
Jaccard Coefficient
The Jaccard Coefficient measures the fraction of true positive point pairs, but after ignoring the true negatives. It is defined as follows:
    Jaccard = \frac{TP}{TP + FN + FP}    (17.11)
For a perfect clustering C (i.e., total agreement with the partitioning T ), the Jaccard Coefficient has value 1, as in that case there are no false positives or false negatives. The Jaccard coefficient is asymmetric in terms of the true positives and negatives because it ignores the true negatives. In other words, it emphasizes the similarity in terms of the point pairs that belong together in both the clustering and ground-truth partitioning, but it discounts the point pairs that do not belong together.
Rand Statistic
The Rand statistic measures the fraction of true positives and true negatives over all point pairs; it is defined as
    Rand = \frac{TP + TN}{N}    (17.12)
The Rand statistic, which is symmetric, measures the fraction of point pairs where both C and T agree. A perfect clustering has a value of 1 for the statistic.
Fowlkes-Mallows Measure
Define the overall pairwise precision and pairwise recall values for a clustering C, as follows:
    prec = \frac{TP}{TP + FP}        recall = \frac{TP}{TP + FN}
Precision measures the fraction of true or correctly clustered point pairs compared to all the point pairs in the same cluster. On the other hand, recall measures the fraction of correctly labeled point pairs compared to all the point pairs in the same partition.
The Fowlkes–Mallows (FM) measure is defined as the geometric mean of the pairwise precision and recall:

    FM = \sqrt{prec \cdot recall} = \frac{TP}{\sqrt{(TP + FN)(TP + FP)}}    (17.13)
The FM measure is also asymmetric in terms of the true positives and negatives because it ignores the true negatives. Its highest value is also 1, achieved when there are no false positives or negatives.
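The four pair counts and the three pairwise measures follow directly from Eqs. (17.7)–(17.13); a sketch (my own illustration, assuming a NumPy contingency table N):

    import numpy as np

    def pairwise_measures(N):
        """Jaccard, Rand, and Fowlkes-Mallows from the contingency table N,
        using the O(rk) pair-count formulas of Eqs. (17.7)-(17.10)."""
        n = N.sum()
        n_i = N.sum(axis=1).astype(float)
        m_j = N.sum(axis=0).astype(float)
        sum_nij2 = np.sum(N.astype(float) ** 2)
        TP = 0.5 * (sum_nij2 - n)
        FN = 0.5 * (np.sum(m_j ** 2) - sum_nij2)
        FP = 0.5 * (np.sum(n_i ** 2) - sum_nij2)
        Npairs = n * (n - 1) / 2.0
        TN = Npairs - (TP + FN + FP)
        return dict(TP=TP, FN=FN, FP=FP, TN=TN,
                    Jaccard=TP / (TP + FN + FP),
                    Rand=(TP + TN) / Npairs,
                    FM=TP / np.sqrt((TP + FN) * (TP + FP)))

On the contingency table of Example 17.3 below, this reproduces TP = 3030, FN = 645, FP = 766, and TN = 6734.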
Example 17.3. Let us continue with Example 17.1. Consider again the contingency table for the clustering in Figure 17.1a:

          iris-setosa   iris-versicolor   iris-virginica
              T1              T2               T3
    C1         0              47               14
    C2        50               0                0
    C3         0               3               36

Using Eq. (17.7), we can obtain the number of true positives as follows:

    TP = \binom{47}{2} + \binom{14}{2} + \binom{50}{2} + \binom{3}{2} + \binom{36}{2}
       = 1081 + 91 + 1225 + 3 + 630 = 3030

Using Eqs. (17.8), (17.9), and (17.10), we obtain

    FN = 645        FP = 766        TN = 6734

Note that there are a total of N = \binom{150}{2} = 11175 point pairs.
We can now compute the different pairwise measures for clustering evaluation. The Jaccard coefficient [Eq. (17.11)], Rand statistic [Eq. (17.12)], and Fowlkes–Mallows measure [Eq. (17.13)] are given as

    Jaccard = \frac{3030}{3030 + 645 + 766} = \frac{3030}{4441} = 0.68

    Rand = \frac{3030 + 6734}{11175} = \frac{9764}{11175} = 0.87

    FM = \frac{3030}{\sqrt{3675 \cdot 3796}} = \frac{3030}{3735} = 0.81
Using the contingency table for the clustering in Figure 17.1b from Example 17.1, we obtain

    TP = 2891        FN = 784        FP = 2380        TN = 5120
The table below compares the different contingency based measures on the two clusterings in Figure 17.1.

                Jaccard   Rand    FM
    (a) Good    0.682     0.873   0.811
    (b) Bad     0.477     0.717   0.657

As expected, the clustering in Figure 17.1a has higher scores for all three measures.
17.1.4 Correlation Measures
Let X and Y be two symmetric n \times n matrices, and let N = \binom{n}{2}. Let x, y \in R^N denote the vectors obtained by linearizing the upper triangular elements (excluding the main diagonal) of X and Y (e.g., in a row-wise manner), respectively. Let \mu_X denote the element-wise mean of x, given as

    \mu_X = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i, j)

and let z_x denote the centered x vector, defined as

    z_x = x - 1 \cdot \mu_X

where 1 \in R^N is the vector of all ones. Likewise, let \mu_Y be the element-wise mean of y, and z_y the centered y vector.
The Hubert statistic is defined as the averaged element-wise product between X and Y:

    \Gamma = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i, j) \cdot Y(i, j) = \frac{1}{N} x^T y    (17.14)

The normalized Hubert statistic is defined as the element-wise correlation between X and Y:

    \Gamma_n = \frac{ \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( X(i, j) - \mu_X \right) \left( Y(i, j) - \mu_Y \right) }{ \sqrt{ \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( X(i, j) - \mu_X \right)^2 \; \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( Y(i, j) - \mu_Y \right)^2 } } = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \sigma_Y^2}}

where \sigma_X^2 and \sigma_Y^2 are the variances, and \sigma_{XY} the covariance, for the vectors x and y, defined as

    \sigma_X^2 = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( X(i, j) - \mu_X \right)^2 = \frac{1}{N} z_x^T z_x = \frac{1}{N} \|z_x\|^2

    \sigma_Y^2 = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( Y(i, j) - \mu_Y \right)^2 = \frac{1}{N} z_y^T z_y = \frac{1}{N} \|z_y\|^2

    \sigma_{XY} = \frac{1}{N} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( X(i, j) - \mu_X \right) \left( Y(i, j) - \mu_Y \right) = \frac{1}{N} z_x^T z_y

Thus, the normalized Hubert statistic can be rewritten as

    \Gamma_n = \frac{z_x^T z_y}{\|z_x\| \cdot \|z_y\|} = \cos \theta    (17.15)
where θ is the angle between the two centered vectors z_x and z_y. It follows immediately that \Gamma_n ranges from −1 to +1.
When X and Y are arbitrary n × n matrices the above expressions can be easily modified to range over all the n^2 elements of the two matrices. The (normalized) Hubert statistic can be used as an external evaluation measure, with appropriately defined matrices X and Y, as described next.
Discretized Hubert Statistic
Let T and C be the n \times n matrices defined as

    T(i, j) = 1 if y_i = y_j and i \neq j, and T(i, j) = 0 otherwise
    C(i, j) = 1 if \hat{y}_i = \hat{y}_j and i \neq j, and C(i, j) = 0 otherwise

Also, let t, c \in R^N denote the N-dimensional vectors comprising the upper triangular elements (excluding the diagonal) of T and C, respectively, where N = \binom{n}{2} denotes the number of distinct point pairs. Finally, let z_t and z_c denote the centered t and c vectors.
The discretized Hubert statistic is computed via Eq. (17.14), by setting x = t and y = c:

    \Gamma = \frac{1}{N} t^T c = \frac{TP}{N}    (17.16)
Because the ith element of t is 1 only when the ith pair of points belongs to the same partition, and, likewise, the ith element of c is 1 only when the ith pair of points also belongs to the same cluster, the dot product t^T c is simply the number of true positives, and thus the \Gamma value is equivalent to the fraction of all pairs that are true positives. It follows that the higher the agreement between the ground-truth partitioning T and clustering C, the higher the \Gamma value.
Normalized Discretized Hubert Statistic
The normalized version of the discretized Hubert statistic is simply the correlation between t and c [Eq. (17.15)]:

    \Gamma_n = \frac{z_t^T z_c}{\|z_t\| \cdot \|z_c\|} = \cos \theta    (17.17)

Note that \mu_T = \frac{1}{N} t^T t is the fraction of point pairs that belong to the same partition, that is, with y_i = y_j, regardless of whether \hat{y}_i matches \hat{y}_j or not. Thus, we have

    \mu_T = \frac{t^T t}{N} = \frac{TP + FN}{N}

Similarly, \mu_C = \frac{1}{N} c^T c is the fraction of point pairs that belong to the same cluster, that is, with \hat{y}_i = \hat{y}_j, regardless of whether y_i matches y_j or not, so that

    \mu_C = \frac{c^T c}{N} = \frac{TP + FP}{N}
Substituting these into the numerator in Eq. (17.17), we get

    z_t^T z_c = (t - 1 \cdot \mu_T)^T (c - 1 \cdot \mu_C)
              = t^T c - \mu_C t^T 1 - \mu_T c^T 1 + 1^T 1 \mu_T \mu_C
              = t^T c - N \mu_C \mu_T - N \mu_T \mu_C + N \mu_T \mu_C
              = t^T c - N \mu_T \mu_C
              = TP - N \mu_T \mu_C    (17.18)

where 1 \in R^N is the vector of all ones. We also made use of the identities t^T 1 = t^T t and c^T 1 = c^T c. Likewise, we can derive

    \|z_t\|^2 = z_t^T z_t = t^T t - N \mu_T^2 = N \mu_T - N \mu_T^2 = N \mu_T (1 - \mu_T)    (17.19)
    \|z_c\|^2 = z_c^T z_c = c^T c - N \mu_C^2 = N \mu_C - N \mu_C^2 = N \mu_C (1 - \mu_C)    (17.20)

Plugging Eqs. (17.18), (17.19), and (17.20) into Eq. (17.17), the normalized, discretized Hubert statistic can be written as

    \Gamma_n = \frac{ \frac{TP}{N} - \mu_T \mu_C }{ \sqrt{ \mu_T \mu_C (1 - \mu_T)(1 - \mu_C) } }    (17.21)
Because \mu_T = (TP + FN)/N and \mu_C = (TP + FP)/N, the normalized \Gamma_n statistic can be computed using
only the TP, FN, and FP values. The maximum value of \Gamma_n = +1 is obtained when there are no false positives or negatives, that is, when FN = FP = 0. The minimum value of \Gamma_n = −1 is when there are no true positives and negatives, that is, when TP = TN = 0.
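Given the pair counts, the discretized Hubert statistic and its normalized form need only TP, FN, FP, and n; a small sketch under those assumptions (my own helper name):

    import numpy as np

    def hubert_discretized(TP, FN, FP, n):
        """Discretized Hubert statistic and its normalized version (Eqs. 17.16, 17.21)."""
        N = n * (n - 1) / 2.0
        gamma = TP / N
        mu_T = (TP + FN) / N
        mu_C = (TP + FP) / N
        gamma_n = (TP / N - mu_T * mu_C) / np.sqrt(
            mu_T * mu_C * (1 - mu_T) * (1 - mu_C))
        return gamma, gamma_n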
Example 17.4. Continuing Example 17.3, for the good clustering in Figure 17.1a, we have

    TP = 3030        FN = 645        FP = 766        TN = 6734

From these values, we obtain

    \mu_T = \frac{TP + FN}{N} = \frac{3675}{11175} = 0.33

    \mu_C = \frac{TP + FP}{N} = \frac{3796}{11175} = 0.34

Using Eqs. (17.16) and (17.21), the Hubert statistic values are

    \Gamma = \frac{3030}{11175} = 0.271

    \Gamma_n = \frac{0.27 - 0.33 \cdot 0.34}{\sqrt{0.33 \cdot 0.34 \cdot (1 - 0.33) \cdot (1 - 0.34)}} = \frac{0.159}{0.222} = 0.717

Likewise, for the bad clustering in Figure 17.1b, we have

    TP = 2891        FN = 784        FP = 2380        TN = 5120

and the values for the discretized Hubert statistic are given as

    \Gamma = 0.258        \Gamma_n = 0.442

We observe that the good clustering has higher values, though the normalized statistic is more discerning than the unnormalized version, that is, the good clustering has a much higher value of \Gamma_n than the bad clustering, whereas the difference in \Gamma for the two clusterings is not that high.
17.2 INTERNAL MEASURES
Internal evaluation measures do not have recourse to the ground-truth partitioning, which is the typical scenario when clustering a dataset. To evaluate the quality of the clustering, internal measures therefore have to utilize notions of intracluster similarity or compactness, contrasted with notions of intercluster separation, with usually a trade-off in maximizing these two aims. The internal measures are based on the n × n distance matrix, also called the proximity matrix, of all pairwise distances among the n points:
    W = \left\{ \delta(x_i, x_j) \right\}_{i,j=1}^{n}    (17.22)

where

    \delta(x_i, x_j) = \| x_i - x_j \|_2
is the Euclidean distance between xi , xj ∈ D, although other distance metrics can also be used. Because W is symmetric and δ(xi , xi ) = 0, usually only the upper triangular elements of W (excluding the diagonal) are used in the internal measures.
The proximity matrix W can also be considered as the adjacency matrix of the weighted complete graph G over the n points, that is, with nodes V = {xi | xi ∈ D}, edges E = {(xi,xj ) | xi,xj ∈ D}, and edge weights wij = W(i,j) for all xi,xj ∈ D. There is thus a close connection between the internal evaluation measures and the graph clustering objectives we examined in Chapter 16.
For internal measures, we assume that we do not have access to a ground-truth partitioning. Instead, we assume that we are given a clustering C = {C_1, ..., C_k} comprising r = k clusters, with cluster C_i containing n_i = |C_i| points. Let ŷ_i ∈ {1, 2, ..., k} denote the cluster label for point x_i. The clustering C can be considered as a k-way cut in G because C_i ≠ ∅ for all i, C_i ∩ C_j = ∅ for all i, j, and \bigcup_i C_i = V. Given any subsets S, R ⊂ V, define W(S, R) as the sum of the weights on all edges with one vertex in S and the other in R, given as

    W(S, R) = \sum_{x_i \in S} \sum_{x_j \in R} w_{ij}

Also, given S ⊆ V, we denote by \overline{S} the complementary set of vertices, that is, \overline{S} = V − S. The internal measures are based on various functions over the intracluster and intercluster weights. In particular, note that the sum of all the intracluster weights over
all clusters is given as

    W_{in} = \frac{1}{2} \sum_{i=1}^{k} W(C_i, C_i)    (17.23)

We divide by 2 because each edge within C_i is counted twice in the summation given by W(C_i, C_i). Also note that the sum of all intercluster weights is given as

    W_{out} = \frac{1}{2} \sum_{i=1}^{k} W(C_i, \overline{C_i}) = \sum_{i=1}^{k-1} \sum_{j>i} W(C_i, C_j)    (17.24)

Here too we divide by 2 because each edge is counted twice in the summation across clusters. The number of distinct intracluster edges, denoted N_{in}, and intercluster edges, denoted N_{out}, are given as

    N_{in} = \sum_{i=1}^{k} \binom{n_i}{2} = \frac{1}{2} \sum_{i=1}^{k} n_i (n_i - 1)

    N_{out} = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_i \cdot n_j = \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1, j \neq i}^{k} n_i \cdot n_j

Note that the total number of distinct pairs of points N satisfies the identity

    N = N_{in} + N_{out} = \binom{n}{2} = \frac{1}{2} n(n-1)
BetaCV Measure
The BetaCV measure is the ratio of the mean intracluster distance to the mean intercluster distance:
    BetaCV = \frac{W_{in} / N_{in}}{W_{out} / N_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{W_{in}}{W_{out}} = \frac{N_{out}}{N_{in}} \cdot \frac{ \sum_{i=1}^{k} W(C_i, C_i) }{ \sum_{i=1}^{k} W(C_i, \overline{C_i}) }
The smaller the BetaCV ratio, the better the clustering, as it indicates that intracluster distances are on average smaller than intercluster distances.
C-index
Let Wmin(Nin) be the sum of the smallest Nin distances in the proximity matrix W, where Nin is the total number of intracluster edges, or point pairs. Let Wmax(Nin) be the sum of the largest Nin distances in W.
Example 17.5. Figure 17.2 shows the graphs corresponding to the two K-means clusterings shown in Figure 17.1. Here, each vertex corresponds to a point xi ∈ D, and an edge (xi , xj ) exists between each pair of points. However, only the intracluster edges are shown (with intercluster edges omitted) to avoid clutter. Because internal measures do not have access to a ground truth labeling, the goodness of a clustering is measured based on intracluster and intercluster statistics.
Figure 17.2. Clusterings as graphs: Iris: (a) K-means: good; (b) K-means: bad.
The C-index measures to what extent the clustering puts together the Nin points that are the closest across the k clusters. It is defined as
    C_{index} = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})}
where Win is the sum of all the intracluster distances [Eq. (17.23)]. The C-index lies in the range [0,1]. The smaller the C-index, the better the clustering, as it indicates more compact clusters with relatively smaller distances within clusters rather than between clusters.
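A sketch of BetaCV and the C-index computed from a full pairwise distance matrix and a label vector (my own helper names; it assumes the diagonal of D is zero and uses only the upper-triangular pairs, as in the text):

    import numpy as np

    def betacv_cindex(D, labels):
        """BetaCV and C-index from a pairwise distance matrix D and cluster labels."""
        labels = np.asarray(labels)
        n = D.shape[0]
        iu, ju = np.triu_indices(n, k=1)
        dists = D[iu, ju]
        same = labels[iu] == labels[ju]            # intracluster pairs
        W_in, W_out = dists[same].sum(), dists[~same].sum()
        N_in, N_out = int(same.sum()), int((~same).sum())
        betacv = (W_in / N_in) / (W_out / N_out)
        sorted_d = np.sort(dists)
        W_min = sorted_d[:N_in].sum()              # sum of the N_in smallest distances
        W_max = sorted_d[-N_in:].sum()             # sum of the N_in largest distances
        cindex = (W_in - W_min) / (W_max - W_min)
        return betacv, cindex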
Normalized Cut Measure
The normalized cut objective [Eq. (16.17)] for graph clustering can also be used as an internal clustering evaluation measure:
    NC = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{vol(C_i)} = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{W(C_i, V)}
where vol(Ci ) = W(Ci , V) is the volume of cluster Ci , that is, the total weights on edges with at least one end in the cluster. However, because we are using the proximity or distance matrix W, instead of the affinity or similarity matrix A, the higher the normalized cut value the better.
To see this, we make use of the observation that W(C_i, V) = W(C_i, C_i) + W(C_i, \overline{C_i}), so that

    NC = \sum_{i=1}^{k} \frac{W(C_i, \overline{C_i})}{W(C_i, C_i) + W(C_i, \overline{C_i})} = \sum_{i=1}^{k} \frac{1}{\frac{W(C_i, C_i)}{W(C_i, \overline{C_i})} + 1}

We can see that NC is maximized when the ratios \frac{W(C_i, C_i)}{W(C_i, \overline{C_i})} (across the k clusters) are as small as possible, which happens when the intracluster distances are much smaller compared to intercluster distances, that is, when the clustering is good. The maximum possible value of NC is k.
Modularity
The modularity objective for graph clustering [Eq. (16.26)] can also be used as an internal measure:
    Q = \sum_{i=1}^{k} \left( \frac{W(C_i, C_i)}{W(V, V)} - \left( \frac{W(C_i, V)}{W(V, V)} \right)^2 \right)

where

    W(V, V) = \sum_{i=1}^{k} W(C_i, V) = \sum_{i=1}^{k} W(C_i, C_i) + \sum_{i=1}^{k} W(C_i, \overline{C_i}) = 2 (W_{in} + W_{out})
The last step follows from Eqs. (17.23) and (17.24). Modularity measures the difference between the observed and expected fraction of weights on edges within the clusters. Since we are using the distance matrix, the smaller the modularity measure the better the clustering, which indicates that the intracluster distances are lower than expected.
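The graph-based internal measures can be computed from the same distance matrix by treating it as the weighted adjacency matrix of the complete graph; a sketch (my own illustration, using the W(·,·) conventions above, with a zero diagonal):

    import numpy as np

    def nc_modularity(D, labels):
        """Normalized cut and modularity treated as internal measures, computed
        from the pairwise distance matrix D."""
        labels = np.asarray(labels)
        W_VV = D.sum()                               # counts every edge twice
        nc, q = 0.0, 0.0
        for c in np.unique(labels):
            in_c = labels == c
            W_ii = D[np.ix_(in_c, in_c)].sum()       # W(C_i, C_i)
            W_iV = D[in_c, :].sum()                  # vol(C_i) = W(C_i, V)
            W_icut = W_iV - W_ii                     # W(C_i, complement of C_i)
            nc += W_icut / W_iV
            q += W_ii / W_VV - (W_iV / W_VV) ** 2
        return nc, q

Because D holds distances rather than similarities, higher NC and lower Q indicate the better clustering, as discussed above.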
Dunn Index
The Dunn index is defined as the ratio between the minimum distance between point pairs from different clusters and the maximum distance between point pairs from the same cluster. More formally, we have
    Dunn = \frac{W_{out}^{min}}{W_{in}^{max}}

where W_{out}^{min} is the minimum intercluster distance:

    W_{out}^{min} = \min_{i, j > i} \left\{ w_{ab} \mid x_a \in C_i, x_b \in C_j \right\}

and W_{in}^{max} is the maximum intracluster distance:

    W_{in}^{max} = \max_{i} \left\{ w_{ab} \mid x_a, x_b \in C_i \right\}
The larger the Dunn index the better the clustering because it means even the closest distance between points in different clusters is much larger than the farthest distance between points in the same cluster. However, the Dunn index may be insensitive because the minimum intercluster and maximum intracluster distances do not capture all the information about a clustering.
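A direct sketch of the Dunn index (illustrative code; it assumes a zero-diagonal distance matrix D and a label array):

    import numpy as np

    def dunn_index(D, labels):
        """Dunn index: minimum intercluster distance over maximum intracluster distance."""
        labels = np.asarray(labels)
        iu, ju = np.triu_indices(D.shape[0], k=1)
        same = labels[iu] == labels[ju]
        w_out_min = D[iu, ju][~same].min()     # closest pair from different clusters
        w_in_max = D[iu, ju][same].max()       # farthest pair within a cluster
        return w_out_min / w_in_max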
Davies–Bouldin Index
Let μi denote the cluster mean, given as
    \mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j    (17.25)

Further, let \sigma_{\mu_i} denote the dispersion or spread of the points around the cluster mean, given as

    \sigma_{\mu_i} = \sqrt{ \frac{ \sum_{x_j \in C_i} \delta(x_j, \mu_i)^2 }{ n_i } } = \sqrt{ var(C_i) }

where var(C_i) is the total variance [Eq. (1.4)] of cluster C_i.
The Davies–Bouldin measure for a pair of clusters C_i and C_j is defined as the ratio

    DB_{ij} = \frac{ \sigma_{\mu_i} + \sigma_{\mu_j} }{ \delta(\mu_i, \mu_j) }

DB_{ij} measures how compact the clusters are compared to the distance between the cluster means. The Davies–Bouldin index is then defined as

    DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \{ DB_{ij} \}
That is, for each cluster Ci, we pick the cluster Cj that yields the largest DBij ratio. The smaller the DB value the better the clustering, as it means that the clusters are well separated (i.e., the distance between cluster means is large), and each cluster is well represented by its mean (i.e., has a small spread).
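A sketch of the Davies–Bouldin index computed from the data matrix itself (illustrative code; the dispersion is the square root of the mean squared distance to the cluster mean, as defined above):

    import numpy as np

    def davies_bouldin(X, labels):
        """Davies-Bouldin index from the data matrix X (n x d) and cluster labels."""
        labels = np.asarray(labels)
        ks = np.unique(labels)
        mus = np.array([X[labels == c].mean(axis=0) for c in ks])          # cluster means
        sigmas = np.array([np.sqrt(((X[labels == c] - mu) ** 2).sum(axis=1).mean())
                           for c, mu in zip(ks, mus)])                     # dispersions
        k = len(ks)
        db = 0.0
        for i in range(k):
            ratios = [(sigmas[i] + sigmas[j]) / np.linalg.norm(mus[i] - mus[j])
                      for j in range(k) if j != i]
            db += max(ratios)                     # worst-case pairing for cluster i
        return db / k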
Silhouette Coefficient
The silhouette coefficient is a measure of both cohesion and separation of clusters, and is based on the difference between the average distance to points in the closest cluster and to points in the same cluster. For each point xi we calculate its silhouette coefficient si as
    s_i = \frac{ \mu_{out}^{min}(x_i) - \mu_{in}(x_i) }{ \max\{ \mu_{out}^{min}(x_i), \mu_{in}(x_i) \} }    (17.26)

where \mu_{in}(x_i) is the mean distance from x_i to points in its own cluster \hat{y}_i:

    \mu_{in}(x_i) = \frac{ \sum_{x_j \in C_{\hat{y}_i}, j \neq i} \delta(x_i, x_j) }{ n_{\hat{y}_i} - 1 }

and \mu_{out}^{min}(x_i) is the mean of the distances from x_i to points in the closest cluster:

    \mu_{out}^{min}(x_i) = \min_{j \neq \hat{y}_i} \left\{ \frac{ \sum_{y \in C_j} \delta(x_i, y) }{ n_j } \right\}
The si value of a point lies in the interval [−1,+1]. A value close to +1 indicates that xi is much closer to points in its own cluster and is far from other clusters. A value close to zero indicates that xi is close to the boundary between two clusters. Finally, a value close to −1 indicates that xi is much closer to another cluster than its own cluster, and therefore, the point may be mis-clustered.
The silhouette coefficient is defined as the mean s_i value across all the points:

    SC = \frac{1}{n} \sum_{i=1}^{n} s_i    (17.27)
A value close to +1 indicates a good clustering.
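A per-point sketch of the silhouette coefficient (illustrative code; it assumes every cluster has at least two points so that the n_{ŷ_i} − 1 denominator is nonzero, and a zero-diagonal distance matrix D):

    import numpy as np

    def silhouette(D, labels):
        """Silhouette coefficient per point and its mean (Eqs. 17.26, 17.27)."""
        labels = np.asarray(labels)
        ks = np.unique(labels)
        n = D.shape[0]
        s = np.zeros(n)
        for i in range(n):
            own = labels == labels[i]
            mu_in = D[i, own].sum() / (own.sum() - 1)      # excludes the point itself
            mu_out = min(D[i, labels == c].mean() for c in ks if c != labels[i])
            s[i] = (mu_out - mu_in) / max(mu_out, mu_in)
        return s, s.mean()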
Hubert Statistic
The Hubert \Gamma statistic [Eq. (17.14)], and its normalized version \Gamma_n [Eq. (17.15)], can both be used as internal evaluation measures by letting X = W be the pairwise distance matrix, and by defining Y as the matrix of distances between the cluster means:

    Y = \left\{ \delta(\mu_{\hat{y}_i}, \mu_{\hat{y}_j}) \right\}_{i,j=1}^{n}    (17.28)

Because both W and Y are symmetric, both \Gamma and \Gamma_n are computed over their upper triangular elements.
Example 17.6. Consider the two clusterings for the Iris principal components dataset shown in Figure 17.1, along with their corresponding graph representations in Figure 17.2. Let us evaluate these two clusterings using internal measures.
The good clustering shown in Figure 17.1a and Figure 17.2a has clusters with the following sizes:

    n_1 = 61        n_2 = 50        n_3 = 39

Thus, the number of intracluster and intercluster edges (i.e., point pairs) is given as

    N_{in} = \binom{61}{2} + \binom{50}{2} + \binom{39}{2} = 1830 + 1225 + 741 = 3796
    N_{out} = 61 \cdot 50 + 61 \cdot 39 + 50 \cdot 39 = 3050 + 2379 + 1950 = 7379

In total there are N = N_{in} + N_{out} = 3796 + 7379 = 11175 distinct point pairs.
The weights on edges within each cluster, W(C_i, C_i), and those from one cluster to another, W(C_i, C_j), are as given in the intercluster weight matrix

    W        C1          C2          C3
    C1     3265.69    10402.30     4418.62
    C2    10402.30     1523.10     9792.45        (17.29)
    C3     4418.62     9792.45     1252.36

Thus, the sum of all the intracluster and intercluster edge weights is

    W_{in} = \frac{1}{2}(3265.69 + 1523.10 + 1252.36) = 3020.57
    W_{out} = 10402.30 + 4418.62 + 9792.45 = 24613.37

The BetaCV measure can then be computed as

    BetaCV = \frac{N_{out} \cdot W_{in}}{N_{in} \cdot W_{out}} = \frac{7379 \times 3020.57}{3796 \times 24613.37} = 0.239

For the C-index, we first compute the sum of the N_{in} smallest and largest pairwise distances, given as

    W_{min}(N_{in}) = 2535.96        W_{max}(N_{in}) = 16889.57

Thus, the C-index is given as

    C_{index} = \frac{W_{in} - W_{min}(N_{in})}{W_{max}(N_{in}) - W_{min}(N_{in})} = \frac{3020.57 - 2535.96}{16889.57 - 2535.96} = \frac{484.61}{14535.61} = 0.0338

For the normalized cut and modularity measures, we compute W(C_i, \overline{C_i}), W(C_i, V) = \sum_{j=1}^{k} W(C_i, C_j), and W(V, V) = \sum_{i=1}^{k} W(C_i, V), using the intercluster weight matrix [Eq. (17.29)]:

    W(C_1, \overline{C_1}) = 10402.30 + 4418.62 = 14820.91
    W(C_2, \overline{C_2}) = 10402.30 + 9792.45 = 20194.75
    W(C_3, \overline{C_3}) = 4418.62 + 9792.45 = 14211.07

    W(C_1, V) = 3265.69 + W(C_1, \overline{C_1}) = 18086.61
    W(C_2, V) = 1523.10 + W(C_2, \overline{C_2}) = 21717.85
    W(C_3, V) = 1252.36 + W(C_3, \overline{C_3}) = 15463.43

    W(V, V) = W(C_1, V) + W(C_2, V) + W(C_3, V) = 55267.89

The normalized cut and modularity values are given as

    NC = \frac{14820.91}{18086.61} + \frac{20194.75}{21717.85} + \frac{14211.07}{15463.43} = 0.819 + 0.93 + 0.919 = 2.67

    Q = \left( \frac{3265.69}{55267.89} - \left( \frac{18086.61}{55267.89} \right)^2 \right) + \left( \frac{1523.10}{55267.89} - \left( \frac{21717.85}{55267.89} \right)^2 \right) + \left( \frac{1252.36}{55267.89} - \left( \frac{15463.43}{55267.89} \right)^2 \right)
      = -0.048 - 0.1269 - 0.0556 = -0.2305

The Dunn index can be computed from the minimum and maximum distances between pairs of points from two clusters C_i and C_j, computed as follows:

    W_min    C1      C2      C3          W_max    C1      C2      C3
    C1       0       1.62    0.198       C1       2.50    4.85    4.81
    C2       1.62    0       3.49        C2       4.85    2.33    7.06
    C3       0.198   3.49    0           C3       4.81    7.06    2.55

The Dunn index value for the clustering is given as

    Dunn = \frac{W_{out}^{min}}{W_{in}^{max}} = \frac{0.198}{2.55} = 0.078

To compute the Davies–Bouldin index, we compute the cluster mean and dispersion values:

    \mu_1 = (-0.664, -0.33)^T        \sigma_{\mu_1} = 0.723
    \mu_2 = (2.64, 0.19)^T           \sigma_{\mu_2} = 0.512
    \mu_3 = (-2.35, 0.27)^T          \sigma_{\mu_3} = 0.695

and the DB_ij values for pairs of clusters:

    DB_ij    C1       C2       C3
    C1       --       0.369    0.794
    C2       0.369    --       0.242
    C3       0.794    0.242    --

For example, DB_{12} = \frac{\sigma_{\mu_1} + \sigma_{\mu_2}}{\delta(\mu_1, \mu_2)} = \frac{1.235}{3.346} = 0.369. Finally, the DB index is given as

    DB = \frac{1}{3}(0.794 + 0.369 + 0.794) = 0.652

The silhouette coefficient [Eq. (17.26)] for a chosen point, say x_1, is given as

    s_1 = \frac{1.902 - 0.701}{\max\{1.902, 0.701\}} = \frac{1.201}{1.902} = 0.632

The average value across all points is SC = 0.598.
The Hubert statistic can be computed by taking the dot product over the upper triangular elements of the proximity matrix W [Eq. (17.22)] and the n \times n matrix of distances among cluster means Y [Eq. (17.28)], and then dividing by the number of distinct point pairs N:

    \Gamma = \frac{w^T y}{N} = \frac{91545.85}{11175} = 8.19

where w, y \in R^N are vectors comprising the upper triangular elements of W and Y. The normalized Hubert statistic can be obtained as the correlation between w and y [Eq. (17.15)]:

    \Gamma_n = \frac{z_w^T z_y}{\|z_w\| \cdot \|z_y\|} = 0.918

where z_w, z_y are the centered vectors corresponding to w and y, respectively.
The following table summarizes the various internal measure values for the good and bad clusterings shown in Figure 17.1 and Figure 17.2.

                        Lower better                       Higher better
                BetaCV   C_index    Q       DB       NC     Dunn   SC     Γ      Γ_n
    (a) Good    0.24     0.034    -0.23    0.65     2.67   0.08   0.60   8.19   0.92
    (b) Bad     0.33     0.08     -0.20    1.11     2.56   0.03   0.55   7.32   0.83

Despite the fact that these internal measures do not have access to the ground-truth partitioning, we can observe that the good clustering has higher values for normalized cut, Dunn, silhouette coefficient, and the Hubert statistics, and lower values for BetaCV, C-index, modularity, and Davies–Bouldin measures. These measures are thus capable of discerning good versus bad clusterings of the data.
17.3 RELATIVE MEASURES
Relative measures are used to compare different clusterings obtained by varying different parameters for the same algorithm, for example, to choose the number of clusters k.
Silhouette Coefficient
The silhouette coefficient [Eq. (17.26)] for each point sj , and the average SC value [Eq. (17.27)], can be used to estimate the number of clusters in the data. The approach consists of plotting the sj values in descending order for each cluster, and to note the overall SC value for a particular value of k, as well as clusterwise SC values:
    SC_i = \frac{1}{n_i} \sum_{x_j \in C_i} s_j
We can then pick the value k that yields the best clustering, with many points having high sj values within each cluster, as well as high values for SC and SCi (1 ≤ i ≤ k).
Figure 17.3. Iris K-means: silhouette coefficient plot. (a) k = 2, SC = 0.706: SC_1 = 0.706 (n_1 = 97), SC_2 = 0.662 (n_2 = 53). (b) k = 3, SC = 0.598: SC_1 = 0.466 (n_1 = 61), SC_2 = 0.818 (n_2 = 50), SC_3 = 0.52 (n_3 = 39). (c) k = 4, SC = 0.559: SC_1 = 0.376 (n_1 = 49), SC_2 = 0.534 (n_2 = 28), SC_3 = 0.787 (n_3 = 50), SC_4 = 0.484 (n_4 = 23).
Example 17.7. Figure 17.3 shows the silhouette coefficient plot for the best clustering results for the K-means algorithm on the Iris principal components dataset for three different values of k, namely k = 2, 3, 4. The silhouette coefficient values s_i for points
within each cluster are plotted in decreasing order. The overall average (SC) and clusterwise averages (SCi , for 1 ≤ i ≤ k) are also shown, along with the cluster sizes.
Figure 17.3a shows that k = 2 has the highest average silhouette coefficient, SC = 0.706. It shows two well separated clusters. The points in cluster C1 start out with high si values, which gradually drop as we get to border points. The second cluster C2 is even better separated, since it has a higher silhouette coefficient and the pointwise scores are all high, except for the last three points, suggesting that almost all the points are well clustered.
The silhouette plot in Figure 17.3b, with k = 3, corresponds to the “good” clustering shown in Figure 17.1a. We can see that cluster C1 from Figure 17.3a has been split into two clusters for k = 3, namely C1 and C3. Both of these have many bordering points, whereas C2 is well separated with high silhouette coefficients across all points.
Finally, the silhouette plot for k = 4 is shown in Figure 17.3c. Here C3 is the well separated cluster, corresponding to C2 above, and the remaining clusters are essentially subclusters of C1 for k = 2 (Figure 17.3a). Cluster C1 also has two points with negative si values, indicating that they are probably misclustered.
Because k = 2 yields the highest silhouette coefficient, and the two clusters are essentially well separated, in the absence of prior knowledge, we would choose k = 2 as the best number of clusters for this dataset.
Calinski–Harabasz Index
Given the dataset D = {x_i}_{i=1}^n, the scatter matrix for D is given as

S = nΣ = \sum_{j=1}^{n} (x_j - μ)(x_j - μ)^T

where μ = \frac{1}{n}\sum_{j=1}^{n} x_j is the mean and Σ is the covariance matrix. The scatter matrix can be decomposed into two matrices S = S_W + S_B, where S_W is the within-cluster scatter matrix and S_B is the between-cluster scatter matrix, given as

S_W = \sum_{i=1}^{k} \sum_{x_j \in C_i} (x_j - μ_i)(x_j - μ_i)^T

S_B = \sum_{i=1}^{k} n_i (μ_i - μ)(μ_i - μ)^T

where μ_i = \frac{1}{n_i}\sum_{x_j \in C_i} x_j is the mean for cluster C_i.

The Calinski–Harabasz (CH) variance ratio criterion for a given value of k is defined as follows:

CH(k) = \frac{tr(S_B)/(k-1)}{tr(S_W)/(n-k)} = \frac{n-k}{k-1} \cdot \frac{tr(S_B)}{tr(S_W)}
where tr(SW) and tr(SB) are the traces (the sum of the diagonal elements) of the within-cluster and between-cluster scatter matrices.
For a good value of k, we expect the within-cluster scatter to be smaller relative to the between-cluster scatter, which should result in a higher CH(k) value. On the other
hand, we do not desire a very large value of k; thus the term \frac{n-k}{k-1} penalizes larger values of k. We could choose a value of k that maximizes CH(k). Alternatively, we can plot the CH values and look for a large increase in the value followed by little or no gain. For instance, we can choose the value k ≥ 3 that minimizes the term

Δ(k) = \bigl(CH(k+1) - CH(k)\bigr) - \bigl(CH(k) - CH(k-1)\bigr)

The intuition is that we want to find the value of k for which CH(k) is much higher than CH(k − 1) and there is only a little improvement or a decrease in the CH(k + 1) value.
Figure 17.4. Calinski–Harabasz variance ratio criterion (CH(k) plotted for k = 2, ..., 9).
Example 17.8. Figure 17.4 shows the CH ratio for various values of k on the Iris principal components dataset, using the K-means algorithm, with the best results chosen from 200 runs.

For k = 3, the within-cluster and between-cluster scatter matrices are given as

S_W = \begin{pmatrix} 39.14 & -13.62 \\ -13.62 & 24.73 \end{pmatrix}        S_B = \begin{pmatrix} 590.36 & 13.62 \\ 13.62 & 11.36 \end{pmatrix}

Thus, we have

CH(3) = \frac{150-3}{3-1} \cdot \frac{590.36 + 11.36}{39.14 + 24.73} = \frac{147}{2} \cdot \frac{601.72}{63.87} = 73.5 \cdot 9.42 = 692.4

The successive CH(k) and Δ(k) values are as follows:

k        2        3        4        5        6        7        8        9
CH(k)  570.25   692.40   717.79   683.14   708.26   700.17   738.05   728.63
Δ(k)     –      −96.78   −60.03    59.78   −33.22    45.97   −47.30     –

If we choose the first large peak before a decrease we would choose k = 4. However, Δ(k) suggests k = 3 as the best (lowest) value, representing the "knee-of-the-curve". One limitation of the Δ(k) criterion is that values less than k = 3 cannot be evaluated, since Δ(2) depends on CH(1), which is not defined.
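The CH computation in this example can be sketched in Python as follows (an illustration only; X is the data matrix and labels is the cluster assignment). Applied to the k = 3 labeling above it should reproduce CH(3) ≈ 692.4.

import numpy as np

def calinski_harabasz(X, labels):
    """CH(k) = [(n-k)/(k-1)] * tr(S_B)/tr(S_W) for a given clustering."""
    X, labels = np.asarray(X), np.asarray(labels)
    n, d = X.shape
    mu = X.mean(axis=0)
    ks = np.unique(labels)
    k = len(ks)
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in ks:
        Xi = X[labels == c]
        mui = Xi.mean(axis=0)
        Zi = Xi - mui
        SW += Zi.T @ Zi                          # within-cluster scatter
        diff = (mui - mu).reshape(-1, 1)
        SB += len(Xi) * (diff @ diff.T)          # between-cluster scatter
    return (n - k) / (k - 1) * np.trace(SB) / np.trace(SW)

def delta(ch, k):
    """Delta(k) from a dict of CH values keyed by k."""
    return (ch[k + 1] - ch[k]) - (ch[k] - ch[k - 1])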
Gap Statistic
The gap statistic compares the sum of intracluster weights Win [Eq.(17.23)] for different values of k with their expected values assuming no apparent clustering structure, which forms the null hypothesis.
Let Ck be the clustering obtained for a specified value of k, using a chosen clustering algorithm. Let Wkin(D) denote the sum of intracluster weights (over all clusters) for Ck on the input dataset D. We would like to compute the probability of the observed Wkin value under the null hypothesis that the points are randomly placed in the same data space as D. Unfortunately, the sampling distribution of Win is not known. Further, it depends on the number of clusters k, the number of points n, and other characteristics of D.
To obtain an empirical distribution for Win, we resort to Monte Carlo simulations of the sampling process. That is, we generate t random samples comprising n randomly distributed points within the same d-dimensional data space as the input dataset D. That is, for each dimension of D, say Xj , we compute its range [min(Xj ), max(Xj )] and generate values for the n points (for the jth dimension) uniformly at random within the given range. Let Ri ∈ Rn×d, 1 ≤ i ≤ t denote the ith sample. Let Wkin(Ri) denote the sum of intracluster weights for a given clustering of Ri into k clusters. From each sample dataset Ri, we generate clusterings for different values of k using the same algorithm and record the intracluster values Wkin(Ri). Let μW(k) and σW(k) denote the mean and standard deviation of these intracluster weights for each value of k, given as
μ_W(k) = \frac{1}{t} \sum_{i=1}^{t} \log W^{in}_k(R_i)

σ_W(k) = \sqrt{\frac{1}{t} \sum_{i=1}^{t} \bigl(\log W^{in}_k(R_i) - μ_W(k)\bigr)^2}
where we use the logarithm of the Win values, as they can be quite large.
The gap statistic for a given k is then defined as

gap(k) = μ_W(k) - \log W^{in}_k(D)
It measures the deviation of the observed Wkin value from its expected value under the null hypothesis. We can select the value of k that yields the largest gap statistic because that indicates a clustering structure far away from the uniform distribution of points. A more robust approach is to choose k as follows:
k^* = \arg\min_k \bigl\{ gap(k) \ge gap(k+1) - σ_W(k+1) \bigr\}
That is, we select the least value of k such that the gap statistic is within one standard deviation of the gap at k + 1.
Figure 17.5. Gap statistic. (a) Randomly generated data (k = 3). (b) Intracluster weights for different values of k: expected μ_W(k) versus observed log_2 W^{in}_k. (c) Gap statistic as a function of k.
Example 17.9. To compute the gap statistic we have to generate t random samples of n points drawn from the same data space as the Iris principal components dataset. A random sample of n = 150 points is shown in Figure 17.5a, which does not have any apparent cluster structure. However, when we run K-means on this dataset it will output some clustering, an example of which is also shown, with k = 3. From this clustering, we can compute the log2 Wkin(Ri) value; we use base 2 for all logarithms.
For Monte Carlo sampling, we generate t = 200 such random datasets, and compute the mean or expected intracluster weight μW(k) under the null hypothesis, for each value of k. Figure 17.5b shows the expected intracluster weights for different values of k. It also shows the observed value of log2 Wkin computed from the K-means clustering of the Iris principal components dataset. For the Iris dataset, and each of the uniform random samples, we run K-means 100 times and select the best
Table 17.1. Gap statistic values as a function of k

k    gap(k)   σ_W(k)   gap(k) − σ_W(k)
1    0.093    0.0456   0.047
2    0.346    0.0486   0.297
3    0.679    0.0529   0.626
4    0.753    0.0701   0.682
5    0.586    0.0711   0.515
6    0.715    0.0654   0.650
7    0.808    0.0611   0.746
8    0.680    0.0597   0.620
9    0.632    0.0606   0.571
possible clustering, from which the Wkin(Ri) values are computed. We can see that the observed Wkin(D) values are smaller than the expected values μW(k).
From these values, we then compute the gap statistic gap(k) for different values of k, which are plotted in Figure 17.5c. Table 17.1 lists the gap statistic and standard deviation values. The optimal value for the number of clusters is k = 4 because
gap(4) = 0.753 > gap(5) − σW(5) = 0.515
However, if we had relaxed the gap test to be within two standard deviations, then
the optimal value would have been k = 3 because

gap(3) = 0.679 > gap(4) - 2σ_W(4) = 0.753 - 2 \cdot 0.0701 = 0.613
Essentially, there is still some subjectivity in selecting the right number of clusters, but the gap statistic plot can help in this task.
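A minimal Python sketch of this Monte Carlo procedure is shown below (illustrative; here the KMeans inertia, i.e., the sum of squared distances to the cluster means, stands in for the intracluster weight W_in, and base-2 logarithms are used as in the example). The helper choose_k implements the rule k* = arg min_k {gap(k) ≥ gap(k+1) − σ_W(k+1)}.

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, kmax=9, t=50, random_state=0):
    """Return gap(k) and sigma_W(k) for k = 1..kmax (illustrative sketch)."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def w_in(data, k):
        # KMeans inertia stands in for the intracluster weight W_in here.
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        return np.log2(km.inertia_)

    gaps, sigmas = {}, {}
    for k in range(1, kmax + 1):
        ref = [w_in(rng.uniform(lo, hi, size=X.shape), k) for _ in range(t)]
        gaps[k] = np.mean(ref) - w_in(X, k)      # mu_W(k) - log2 W_in(D)
        sigmas[k] = np.std(ref)                  # sigma_W(k)
    return gaps, sigmas

def choose_k(gaps, sigmas):
    # smallest k with gap(k) >= gap(k+1) - sigma_W(k+1)
    for k in sorted(gaps)[:-1]:
        if gaps[k] >= gaps[k + 1] - sigmas[k + 1]:
            return k
    return max(gaps)   # fall back to the largest k tried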
17.3.1 Cluster Stability
The main idea behind cluster stability is that the clusterings obtained from several datasets sampled from the same underlying distribution as D should be similar or “stable.” The cluster stability approach can be used to find good parameter values for a given clustering algorithm; we will focus on the task of finding a good value for k, the correct number of clusters.
The joint probability distribution for D is typically unknown. Therefore, to sample a dataset from the same distribution we can try a variety of methods, including random perturbations, subsampling, or bootstrap resampling. Let us consider the bootstrapping approach; we generate t samples of size n by sampling from D with replacement, which allows the same point to be chosen possibly multiple times, and thus each sample Di will be different. Next, for each sample Di we run the same clustering algorithm with different cluster values k ranging from 2 to kmax.
Let Ck(Di) denote the clustering obtained from sample Di, for a given value of k. Next, the method compares the distance between all pairs of clusterings Ck(Di) and Ck(Dj ) via some distance function. Several of the external cluster evaluation measures can be used as distance measures, by setting, for example, C = Ck (Di ) and T = Ck (Dj ),
or vice versa. From these values we compute the expected pairwise distance for each value of k. Finally, the value k^* that exhibits the least deviation between the clusterings obtained from the resampled datasets is the best choice for k because it exhibits the most stability.

There is, however, one complication when evaluating the distance between a pair of clusterings C_k(D_i) and C_k(D_j), namely that the underlying datasets D_i and D_j are different. That is, the set of points being clustered is different because each sample D_i is different. Before computing the distance between the two clusterings, we have to restrict the clusterings only to the points common to both D_i and D_j, denoted as D_ij. Because sampling with replacement allows multiple instances of the same point, we also have to account for this when creating D_ij. For each point x_a in the input dataset D, let m_a^i and m_a^j denote the number of occurrences of x_a in D_i and D_j, respectively. Define

D_ij = D_i ∩ D_j = { m_a instances of x_a | x_a ∈ D, m_a = min{m_a^i, m_a^j} }    (17.30)

That is, the common dataset D_ij is created by selecting the minimum number of instances of the point x_a in D_i or D_j.

Algorithm 17.1 shows the pseudo-code for the clustering stability method for choosing the best k value. It takes as input the clustering algorithm A, the number of samples t, the maximum number of clusters k_max, and the input dataset D.

ALGORITHM 17.1. Clustering Stability Algorithm for Choosing k

CLUSTERINGSTABILITY (A, t, k_max, D):
 1  n ← |D|
    // Generate t samples
 2  for i = 1, 2, ..., t do
 3      D_i ← sample n points from D with replacement
    // Generate clusterings for different values of k
 4  for i = 1, 2, ..., t do
 5      for k = 2, 3, ..., k_max do
 6          C_k(D_i) ← cluster D_i into k clusters using algorithm A
    // Compute mean difference between clusterings for each k
 7  foreach pair D_i, D_j with j > i do
 8      D_ij ← D_i ∩ D_j // create common dataset using Eq. (17.30)
 9      for k = 2, 3, ..., k_max do
10          d_ij(k) ← d(C_k(D_i), C_k(D_j), D_ij) // distance between clusterings
11  for k = 2, 3, ..., k_max do
12      μ_d(k) ← (2/(t(t−1))) Σ_{i=1}^{t} Σ_{j>i} d_ij(k) // expected pairwise distance
    // Choose best k
13  k^* ← arg min_k μ_d(k)
Figure 17.6. Clustering stability: Iris dataset. The expected pairwise similarity μ_s(k) for the FM measure and the expected pairwise distance μ_d(k) for the VI measure are plotted as a function of k.
It first generates the t bootstrap samples and clusters them using algorithm A. Next, it computes the distance between the clusterings for each pair of datasets D_i and D_j, for each value of k. Finally, the method computes the expected pairwise distance μ_d(k) in line 12. We assume that the clustering distance function d is symmetric. If d is not symmetric, then the expected difference should be computed over all ordered pairs, that is,

μ_d(k) = \frac{1}{t(t-1)} \sum_{i=1}^{t} \sum_{j \ne i} d_ij(k)

Instead of a distance function d, we can also evaluate clustering stability via a similarity measure, in which case, after computing the average similarity between pairs of clusterings for a given k, we can choose the best value k^* as the one that maximizes the expected similarity μ_s(k). In general, those external measures that yield lower values for better agreement between C_k(D_i) and C_k(D_j) can be used as distance functions, whereas those that yield higher values for better agreement can be used as similarity functions. Examples of distance functions include variation of information and conditional entropy (which is asymmetric). Examples of similarity functions include normalized mutual information, the Jaccard coefficient, the Fowlkes–Mallows measure, the Hubert Γ statistic, and so on.
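A compact Python sketch of the bootstrap stability procedure follows (illustrative; it uses the Fowlkes–Mallows score from scikit-learn as the similarity measure, and, for simplicity, forms the common dataset from the unique indices shared by two bootstrap samples rather than tracking multiplicities as in Eq. (17.30)).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def stability_choose_k(X, kmax=6, t=20, random_state=0):
    """Bootstrap clustering stability; returns expected FM similarity per k."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    samples = [rng.integers(0, n, size=n) for _ in range(t)]   # bootstrap index sets

    labels = {}
    for i, idx in enumerate(samples):
        for k in range(2, kmax + 1):
            lab = KMeans(n_clusters=k, n_init=10,
                         random_state=random_state).fit_predict(X[idx])
            labels[(i, k)] = dict(zip(idx, lab))   # one label kept per original point

    mu_s = {}
    for k in range(2, kmax + 1):
        sims = []
        for i in range(t):
            for j in range(i + 1, t):
                common = sorted(set(samples[i]) & set(samples[j]))  # shared point ids
                a = [labels[(i, k)][p] for p in common]
                b = [labels[(j, k)][p] for p in common]
                sims.append(fowlkes_mallows_score(a, b))
        mu_s[k] = float(np.mean(sims))
    return mu_s   # choose the k that maximizes mu_s[k]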
Example 17.10. We study the clustering stability for the Iris principal components dataset, with n = 150, using the K-means algorithm. We use t = 500 bootstrap samples. For each dataset Di, and each value of k, we run K-means with 100 initial starting configurations, and select the best clustering.
For the distance function, we used the variation of information [Eq.(17.5)] between each pair of clusterings. We also used the Fowlkes–Mallows measure [Eq. (17.13)] as an example of a similarity measure. The expected values of the pairwise distance μd (k) for the VI measure, and the pairwise similarity μs (k) for the FM measure are plotted in Figure 17.6. Both the measures indicate that k = 2 is the best value, as for the VI measure this leads to the least expected distance between pairs of clusterings, and for the FM measure this choice leads to the most expected similarity between clusterings.
17.3.2 Clustering Tendency
Clustering tendency or clusterability aims to determine whether the dataset D has any meaningful groups to begin with. This is usually a hard task given the different definitions of what it means to be a cluster, for example, partitional, hierarchical, density-based, graph-based and so on. Even if we fix the cluster type, it is still a hard task to define the appropriate null model (e.g., the one without any clustering structure) for a given dataset D. Furthermore, if we do determine that the data is clusterable, then we are still faced with the question of how many clusters there are. Nevertheless, it is still worthwhile to assess the clusterability of a dataset; we look at some approaches to answer the question whether the data is clusterable or not.
Spatial Histogram
One simple approach is to contrast the d-dimensional spatial histogram of the input dataset D with the histogram from samples generated randomly in the same data space. Let X1, X2, ..., Xd denote the d dimensions. Given b, the number of bins for each dimension, we divide each dimension Xj into b equi-width bins, and simply count how many points lie in each of the b^d d-dimensional cells. From this spatial histogram, we can obtain the empirical joint probability mass function (EPMF) for the dataset D, which is an approximation of the unknown joint probability density function. The EPMF is given as
f(i) = P(x_j \in \text{cell } i) = \frac{|\{x_j \in \text{cell } i\}|}{n}
where i = (i1,i2,…,id) denotes a cell index, with ij denoting the bin index along dimension Xj .
Next, we generate t random samples, each comprising n points within the same d-dimensional space as the input dataset D. That is, for each dimension Xj , we compute its range [min(Xj),max(Xj)], and generate values uniformly at random within the given range. Let Rj denote the jth such random sample. We can then compute the corresponding EPMF gj (i) for each Rj , 1 ≤ j ≤ t.
Finally, we can compute how much the distribution f differs from g_j (for j = 1, ..., t), using the Kullback–Leibler (KL) divergence from f to g_j, defined as

KL(f | g_j) = \sum_{i} f(i) \log \frac{f(i)}{g_j(i)}    (17.31)

The KL divergence is zero only when f and g_j are the same distributions. Using these divergence values, we can compute how much the dataset D differs from a random dataset.

The main limitation of this approach is that as dimensionality increases, the number of cells (b^d) increases exponentially, and with a fixed sample size n, most of the cells will be empty, or will have only one point, making it hard to estimate the divergence. The method is also sensitive to the choice of parameter b. Instead of histograms, and the corresponding EPMF, we can also use density estimation methods (see Section 15.2) to determine the joint probability density function (PDF) for the dataset D, and see how it differs from the PDF for the random datasets. However, the curse of dimensionality also causes problems for density estimation.
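The spatial-histogram test can be sketched in Python as follows (illustrative; b, t, and the dataset X are assumed inputs, and a small constant is added to the random-sample EPMFs so that the KL divergence stays finite, a smoothing choice not discussed in the text).

import numpy as np

def epmf(X, edges):
    """d-dimensional spatial histogram turned into an empirical PMF."""
    H, _ = np.histogramdd(X, bins=edges)
    return H.ravel() / len(X)

def spatial_histogram_kl(X, b=5, t=500, random_state=0, eps=1e-10):
    """Mean and std of KL(f | g_j) over t uniform random samples."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    edges = [np.linspace(lo[j], hi[j], b + 1) for j in range(X.shape[1])]
    f = epmf(X, edges)
    kls = []
    for _ in range(t):
        R = rng.uniform(lo, hi, size=X.shape)    # uniform sample in the same space
        g = epmf(R, edges) + eps                 # smoothing keeps KL finite (assumption)
        mask = f > 0
        kls.append(np.sum(f[mask] * np.log2(f[mask] / g[mask])))
    return np.mean(kls), np.std(kls)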
Figure 17.7. Iris dataset: spatial histogram. (a) Iris: spatial cells. (b) Uniform: spatial cells. (c) Empirical probability mass function for the Iris data (f) and a uniform sample (g_j). (d) KL-divergence distribution.
Example 17.11. Figure 17.7c shows the empirical joint probability mass function for the Iris principal components dataset that has n = 150 points in d = 2 dimensions. It also shows the EPMF for one of the datasets generated uniformly at random in the same data space. Both EPMFs were computed using b = 5 bins in each dimension, for a total of 25 spatial cells. The spatial grids/cells for the Iris dataset D, and the random sample R, are shown in Figures 17.7a and 17.7b, respectively. The cells are numbered starting from 0, from bottom to top, and then left to right. Thus, the bottom left cell is 0, top left is 4, bottom right is 19, and top right is 24. These indices are used along the x-axis in the EPMF plot in Figure 17.7c.
We generated t = 500 random samples from the null distribution, and computed the KL divergence from f to gj for each 1 ≤ j ≤ t (using logarithm with base 2). The distribution of the KL values is plotted in Figure 17.7d. The mean KL value was μKL = 1.17, with a standard deviation of σKL = 0.18, indicating that the Iris data is indeed far from the randomly generated data, and thus is clusterable.
Distance Distribution
Instead of trying to estimate the density, another approach to determine clusterability is to compare the pairwise point distances from D, with those from the randomly generated samples Ri from the null distribution. That is, we create the EPMF from the proximity matrix W for D [Eq.(17.22)] by binning the distances into b bins:
f(i) = P(w_{pq} \in \text{bin}_i \mid x_p, x_q \in D,\ p < q)

...

within a small interval ǫ > 0 centered at x:

P(x | c_i) = 2ǫ \cdot f_i(x)

where f_i(x) is the probability density at x for class c_i, modeled as a multivariate normal:

f_i(x) = f(x | μ_i, Σ_i) = \frac{1}{(\sqrt{2\pi})^d \sqrt{|Σ_i|}} \exp\left\{ -\frac{(x-μ_i)^T Σ_i^{-1} (x-μ_i)}{2} \right\}    (18.3)
The posterior probability is then given as

P(c_i | x) = \frac{2ǫ \cdot f_i(x) P(c_i)}{\sum_{j=1}^{k} 2ǫ \cdot f_j(x) P(c_j)} = \frac{f_i(x) P(c_i)}{\sum_{j=1}^{k} f_j(x) P(c_j)}    (18.4)

Further, because \sum_{j=1}^{k} f_j(x) P(c_j) remains fixed for x, we can predict the class for x by modifying Eq. (18.2) as follows:

ŷ = \arg\max_{c_i} \{ f_i(x) P(c_i) \}
To classify a numeric test point x, the Bayes classifier estimates the parameters via the sample mean and sample covariance matrix. The sample mean for the class c_i can be estimated as

μ̂_i = \frac{1}{n_i} \sum_{x_j \in D_i} x_j

and the sample covariance matrix for each class can be estimated using Eq. (2.30), as follows:

Σ̂_i = \frac{1}{n_i} Z_i^T Z_i

where Z_i is the centered data matrix for class c_i given as Z_i = D_i - 1 \cdot μ̂_i^T. These values can be used to estimate the probability density in Eq. (18.3) as f̂_i(x) = f(x | μ̂_i, Σ̂_i).
Algorithm 18.1 shows the pseudo-code for the Bayes classifier. Given an input dataset D, the method estimates the prior probability, mean, and covariance matrix for each class. For testing, given a test point x, it simply returns the class with the maximum posterior probability. The cost of training is dominated by the covariance matrix computation step, which takes O(nd²) time.
ALGORITHM 18.1. Bayes Classifier

BAYESCLASSIFIER (D = {(x_j, y_j)}_{j=1}^n):
for i = 1, ..., k do
    D_i ← {x_j | y_j = c_i, j = 1, ..., n} // class-specific subsets
    n_i ← |D_i| // cardinality
    P̂(c_i) ← n_i / n // prior probability
    μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j // mean
    Z_i ← D_i − 1_{n_i} μ̂_i^T // centered data
    Σ̂_i ← (1/n_i) Z_i^T Z_i // covariance matrix
return P̂(c_i), μ̂_i, Σ̂_i for all i = 1, ..., k

TESTING (x and P̂(c_i), μ̂_i, Σ̂_i, for all i ∈ [1, k]):
ŷ ← arg max_{c_i} { f(x | μ̂_i, Σ̂_i) · P̂(c_i) }
return ŷ
Figure 18.1. Iris data: X_1 (sepal length) versus X_2 (sepal width). The class means are shown in black; the density contours are also shown. The square represents the test point x = (6.75, 4.25)^T.
Example 18.1. Consider the 2-dimensional Iris data, with attributes sepal length and sepal width, shown in Figure 18.1. Class c1 , which corresponds to iris-setosa (shown as circles), has n1 = 50 points, whereas the other class c2 (shown as triangles) has n2 = 100 points. The prior probabilities for the two classes are
P̂(c_1) = \frac{n_1}{n} = \frac{50}{150} = 0.33        P̂(c_2) = \frac{n_2}{n} = \frac{100}{150} = 0.67

The means for c_1 and c_2 (shown as black circle and triangle) are given as

μ̂_1 = \begin{pmatrix} 5.01 \\ 3.42 \end{pmatrix}        μ̂_2 = \begin{pmatrix} 6.26 \\ 2.87 \end{pmatrix}

and the corresponding covariance matrices are as follows:

Σ̂_1 = \begin{pmatrix} 0.122 & 0.098 \\ 0.098 & 0.142 \end{pmatrix}        Σ̂_2 = \begin{pmatrix} 0.435 & 0.121 \\ 0.121 & 0.110 \end{pmatrix}
Figure 18.1 shows the contour or level curve (corresponding to 1% of the peak density) for the multivariate normal distribution modeling the probability density for both classes.
Let x = (6.75,4.25)T be a test point (shown as white square). The posterior probabilities for c1 and c2 can be computed using Eq. (18.4):
P̂(c_1 | x) ∝ f̂(x | μ̂_1, Σ̂_1) P̂(c_1) = (4.914 × 10^{-7}) × 0.33 = 1.622 × 10^{-7}
P̂(c_2 | x) ∝ f̂(x | μ̂_2, Σ̂_2) P̂(c_2) = (2.589 × 10^{-5}) × 0.67 = 1.735 × 10^{-5}

Because P̂(c_2 | x) > P̂(c_1 | x), the class for x is predicted as ŷ = c_2.
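A minimal Python sketch of the full Bayes classifier just described follows (illustrative; scipy's multivariate_normal supplies the density f(x | μ̂_i, Σ̂_i)). Applied to the estimates of Example 18.1 it should reproduce the comparison P̂(c_2 | x) > P̂(c_1 | x).

import numpy as np
from scipy.stats import multivariate_normal

def bayes_train(X, y):
    """Estimate prior, mean, and covariance per class (cf. Algorithm 18.1)."""
    params = {}
    for c in np.unique(y):
        Xi = X[y == c]
        prior = len(Xi) / len(X)
        mu = Xi.mean(axis=0)
        Z = Xi - mu
        cov = Z.T @ Z / len(Xi)          # sample covariance (1/n_i normalization)
        params[c] = (prior, mu, cov)
    return params

def bayes_predict(params, x):
    """Return the class maximizing f(x | mu_i, Sigma_i) * P(c_i)."""
    scores = {c: p * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for c, (p, mu, cov) in params.items()}
    return max(scores, key=scores.get)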
Categorical Attributes
If the attributes are categorical, the likelihood can be computed using the categorical data modeling approach presented in Chapter 3. Formally, let X_j be a categorical attribute over the domain dom(X_j) = {a_{j1}, a_{j2}, ..., a_{jm_j}}, that is, attribute X_j can take on m_j distinct categorical values. Each categorical attribute X_j is modeled as an m_j-dimensional multivariate Bernoulli random variable X_j that takes on m_j distinct vector values e_{j1}, e_{j2}, ..., e_{jm_j}, where e_{jr} is the r-th standard basis vector in R^{m_j} and corresponds to the r-th value or symbol a_{jr} ∈ dom(X_j). The entire d-dimensional dataset is modeled as the vector random variable X = (X_1, X_2, ..., X_d)^T. Let d' = \sum_{j=1}^{d} m_j; a categorical point x = (x_1, x_2, ..., x_d)^T is therefore represented as the d'-dimensional binary vector

v = \begin{pmatrix} v_1 \\ \vdots \\ v_d \end{pmatrix} = \begin{pmatrix} e_{1r_1} \\ \vdots \\ e_{dr_d} \end{pmatrix}

where v_j = e_{jr_j} provided x_j = a_{jr_j} is the r_j-th value in the domain of X_j. The probability of the categorical point x is obtained from the joint probability mass function (PMF) for the vector random variable X:

P(x | c_i) = f(v | c_i) = f(X_1 = e_{1r_1}, ..., X_d = e_{dr_d} | c_i)    (18.5)

The above joint PMF can be estimated directly from the data D_i for each class c_i as follows:

f̂(v | c_i) = \frac{n_i(v)}{n_i}

where n_i(v) is the number of times the value v occurs in class c_i. Unfortunately, if the probability mass at the point v is zero for one or both classes, it would lead to a zero value for the posterior probability. To avoid zero probabilities, one approach is to introduce a small prior probability for all the possible values of the vector random variable X. One simple approach is to assume a pseudo-count of 1 for each value, that is, to assume that each value of X occurs at least one time, and to augment this base count of 1 with the actual number of occurrences of the observed value v in class c_i. The adjusted probability mass at v is then given as

f̂(v | c_i) = \frac{n_i(v) + 1}{n_i + \prod_{j=1}^{d} m_j}    (18.6)

where \prod_{j=1}^{d} m_j gives the number of possible values of X. Extending the code in Algorithm 18.1 to incorporate categorical attributes is relatively straightforward; all that is required is to compute the joint PMF for each class using Eq. (18.6).
Table 18.1. Discretized sepal length and sepal width attributes

(a) Discretized sepal length
Bins          Domain
[4.3, 5.2]    Very Short (a_11)
(5.2, 6.1]    Short (a_12)
(6.1, 7.0]    Long (a_13)
(7.0, 7.9]    Very Long (a_14)

(b) Discretized sepal width
Bins          Domain
[2.0, 2.8]    Short (a_21)
(2.8, 3.6]    Medium (a_22)
(3.6, 4.4]    Long (a_23)

Table 18.2. Class-specific empirical (joint) probability mass function

Class: c_1
X_1 \ X_2            Short (e_21)   Medium (e_22)   Long (e_23)    f̂_X1
Very Short (e_11)      1/50           33/50           5/50         39/50
Short (e_12)           0               3/50           8/50         11/50
Long (e_13)            0               0              0            0
Very Long (e_14)       0               0              0            0
f̂_X2                   1/50           36/50          13/50

Class: c_2
X_1 \ X_2            Short (e_21)   Medium (e_22)   Long (e_23)    f̂_X1
Very Short (e_11)      6/100           0              0             6/100
Short (e_12)          24/100          15/100          0            39/100
Long (e_13)           13/100          30/100          0            43/100
Very Long (e_14)       3/100           7/100          2/100        12/100
f̂_X2                  46/100          52/100          2/100
Example 18.2. Assume that the sepal length and sepal width attributes in the Iris dataset have been discretized as shown in Table 18.1a and Table 18.1b, respectively. We have |dom(X_1)| = m_1 = 4 and |dom(X_2)| = m_2 = 3. These intervals are also illustrated in Figure 18.1 via the gray grid lines. Table 18.2 shows the empirical joint PMF for both the classes. Also, as in Example 18.1, the prior probabilities of the classes are given as P̂(c_1) = 0.33 and P̂(c_2) = 0.67.

Consider a test point x = (5.3, 3.0)^T corresponding to the categorical point (Short, Medium), which is represented as v = (e_{12}^T, e_{22}^T)^T. The likelihood and posterior probability for each class is given as

P̂(x | c_1) = f̂(v | c_1) = 3/50 = 0.06
P̂(x | c_2) = f̂(v | c_2) = 15/100 = 0.15
P̂(c_1 | x) ∝ 0.06 × 0.33 = 0.0198
P̂(c_2 | x) ∝ 0.15 × 0.67 = 0.1005

In this case the predicted class is ŷ = c_2.

On the other hand, the test point x = (6.75, 4.25)^T corresponding to the categorical point (Long, Long) is represented as v = (e_{13}^T, e_{23}^T)^T. Unfortunately the probability mass at v is zero for both classes. We adjust the PMF via pseudo-counts [Eq. (18.6)]; note that the number of possible values is m_1 × m_2 = 4 × 3 = 12. The likelihood and posterior probability can then be computed as

P̂(x | c_1) = f̂(v | c_1) = \frac{0 + 1}{50 + 12} = 1.61 × 10^{-2}
P̂(x | c_2) = f̂(v | c_2) = \frac{0 + 1}{100 + 12} = 8.93 × 10^{-3}
P̂(c_1 | x) ∝ (1.61 × 10^{-2}) × 0.33 = 5.32 × 10^{-3}
P̂(c_2 | x) ∝ (8.93 × 10^{-3}) × 0.67 = 5.98 × 10^{-3}

Thus, the predicted class is ŷ = c_2.
Challenges
The main problem with the Bayes classifier is the lack of enough data to reliably estimate the joint probability density or mass function, especially for high-dimensional data. For instance, for numeric attributes we have to estimate O(d²) covariances, and as the dimensionality increases, this requires us to estimate too many parameters. For categorical attributes we have to estimate the joint probability for all the possible values of v, given as \prod_j |dom(X_j)|. Even if each categorical attribute has only two values, we would need to estimate the probability for 2^d values. However, because there can be at most n distinct values for v, most of the counts will be zero. To address some of these concerns we can use a reduced set of parameters in practice, as described next.
18.2 NAIVE BAYES CLASSIFIER
We saw earlier that the full Bayes approach is fraught with estimation-related problems, especially with a large number of dimensions. The naive Bayes approach makes the simple assumption that all the attributes are independent. This leads to a much simpler, though surprisingly effective, classifier in practice. The independence assumption immediately implies that the likelihood can be decomposed into a product of dimension-wise probabilities:

P(x | c_i) = P(x_1, x_2, ..., x_d | c_i) = \prod_{j=1}^{d} P(x_j | c_i)    (18.7)
Numeric Attributes
For numeric attributes we make the default assumption that each of them is normally distributed for each class c_i. Let μ_{ij} and σ_{ij}^2 denote the mean and variance for attribute X_j, for class c_i. The likelihood for class c_i, for dimension X_j, is given as

P(x_j | c_i) ∝ f(x_j | μ_{ij}, σ_{ij}^2) = \frac{1}{\sqrt{2\pi}\, σ_{ij}} \exp\left\{ -\frac{(x_j - μ_{ij})^2}{2σ_{ij}^2} \right\}
Incidentally, the naive assumption corresponds to setting all the covariances to zero in Σ_i, that is,

Σ_i = diag(σ_{i1}^2, σ_{i2}^2, ..., σ_{id}^2)

This yields

|Σ_i| = det(Σ_i) = σ_{i1}^2 σ_{i2}^2 \cdots σ_{id}^2 = \prod_{j=1}^{d} σ_{ij}^2

Also, we have

Σ_i^{-1} = diag(1/σ_{i1}^2, 1/σ_{i2}^2, ..., 1/σ_{id}^2)

assuming that σ_{ij}^2 ≠ 0 for all j. Finally,

(x - μ_i)^T Σ_i^{-1} (x - μ_i) = \sum_{j=1}^{d} \frac{(x_j - μ_{ij})^2}{σ_{ij}^2}

Plugging these into Eq. (18.3) gives us

P(x | c_i) = \frac{1}{(\sqrt{2\pi})^d \sqrt{\prod_{j=1}^{d} σ_{ij}^2}} \exp\left\{ -\sum_{j=1}^{d} \frac{(x_j - μ_{ij})^2}{2σ_{ij}^2} \right\}
           = \prod_{j=1}^{d} \left( \frac{1}{\sqrt{2\pi}\, σ_{ij}} \exp\left\{ -\frac{(x_j - μ_{ij})^2}{2σ_{ij}^2} \right\} \right)
           = \prod_{j=1}^{d} P(x_j | c_i)

which is equivalent to Eq. (18.7). In other words, the joint probability has been decomposed into a product of the probability along each dimension, as required by the independence assumption.
The naive Bayes classifier uses the sample mean μ̂_i = (μ̂_{i1}, ..., μ̂_{id})^T and a diagonal sample covariance matrix Σ̂_i = diag(σ_{i1}^2, ..., σ_{id}^2) for each class c_i. Thus, in total 2d parameters have to be estimated, corresponding to the sample mean and sample variance for each dimension X_j.
Algorithm 18.2 shows the pseudo-code for the naive Bayes classifier. Given an input dataset D, the method estimates the prior probability and mean for each class. Next, it computes the variance σ̂_{ij}^2 for each of the attributes X_j, with all the d variances for class c_i stored in the vector σ̂_i. The variance for attribute X_j is obtained by first centering the data for class D_i via Z_i = D_i − 1 · μ̂_i^T. We denote by Z_{ij} the centered data for class c_i corresponding to attribute X_j. The variance is then given as σ̂_{ij}^2 = (1/n_i) Z_{ij}^T Z_{ij}.

Training the naive Bayes classifier is very fast, with O(nd) computational complexity. For testing, given a test point x, it simply returns the class with the maximum posterior probability, obtained as a product of the likelihood for each dimension and the class prior probability.

ALGORITHM 18.2. Naive Bayes Classifier

NAIVEBAYES (D = {(x_j, y_j)}_{j=1}^n):
for i = 1, ..., k do
    D_i ← {x_j | y_j = c_i, j = 1, ..., n} // class-specific subsets
    n_i ← |D_i| // cardinality
    P̂(c_i) ← n_i / n // prior probability
    μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j // mean
    Z_i ← D_i − 1 · μ̂_i^T // centered data for class c_i
    for j = 1, ..., d do // class-specific variance for X_j
        σ̂_{ij}^2 ← (1/n_i) Z_{ij}^T Z_{ij} // variance
    σ̂_i ← (σ̂_{i1}^2, ..., σ̂_{id}^2)^T // class-specific attribute variances
return P̂(c_i), μ̂_i, σ̂_i for all i = 1, ..., k

TESTING (x and P̂(c_i), μ̂_i, σ̂_i, for all i ∈ [1, k]):
ŷ ← arg max_{c_i} { P̂(c_i) \prod_{j=1}^{d} f(x_j | μ̂_{ij}, σ̂_{ij}^2) }
return ŷ
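A corresponding Python sketch of the naive (diagonal) variant follows (illustrative); only per-dimension means and variances are kept, and the product of univariate normals is evaluated in log space to avoid numerical underflow.

import numpy as np

def naive_bayes_train(X, y):
    """Per class: prior, per-dimension mean and variance (diagonal covariance)."""
    params = {}
    for c in np.unique(y):
        Xi = X[y == c]
        params[c] = (len(Xi) / len(X), Xi.mean(axis=0), Xi.var(axis=0))
    return params

def naive_bayes_predict(params, x):
    """argmax_c P(c) * prod_j N(x_j | mu_cj, var_cj), computed in log space."""
    best, best_score = None, -np.inf
    for c, (prior, mu, var) in params.items():
        logp = np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        if logp > best_score:
            best, best_score = c, logp
    return best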
Example 18.3. Consider Example 18.1. In the naive Bayes approach the prior probabilities Pˆ (ci ) and means μˆ i remain unchanged. The key difference is that the covariance matrices are assumed to be diagonal, as follows:
Σ̂_1 = \begin{pmatrix} 0.122 & 0 \\ 0 & 0.142 \end{pmatrix}        Σ̂_2 = \begin{pmatrix} 0.435 & 0 \\ 0 & 0.110 \end{pmatrix}
Figure 18.2 shows the contour or level curve (corresponding to 1% of the peak density) of the multivariate normal distribution for both classes. One can see that the diagonal assumption leads to contours that are axis-parallel ellipses; contrast these with the contours in Figure 18.1 for the full Bayes classifier.
For the test point x = (6.75, 4.25)T, the posterior probabilities for c1 and c2 are as follows:
P̂(c_1 | x) ∝ f̂(x | μ̂_1, Σ̂_1) P̂(c_1) = (3.99 × 10^{-7}) × 0.33 = 1.32 × 10^{-7}
P̂(c_2 | x) ∝ f̂(x | μ̂_2, Σ̂_2) P̂(c_2) = (9.597 × 10^{-5}) × 0.67 = 6.43 × 10^{-5}
Because Pˆ (c2|x) > Pˆ (c1|x) the class for x is predicted as yˆ = c2.
Figure 18.2. Naive Bayes: X_1 (sepal length) versus X_2 (sepal width). The class means are shown in black; the density contours are also shown. The square represents the test point x = (6.75, 4.25)^T.
Categorical Attributes
The independence assumption leads to a simplification of the joint probability mass function in Eq. (18.5), which can be rewritten as

P(x | c_i) = \prod_{j=1}^{d} P(x_j | c_i) = \prod_{j=1}^{d} f(X_j = e_{jr_j} | c_i)

where f(X_j = e_{jr_j} | c_i) is the probability mass function for X_j, which can be estimated from D_i as follows:

f̂(v_j | c_i) = \frac{n_i(v_j)}{n_i}

where n_i(v_j) is the observed frequency of the value v_j = e_{jr_j} corresponding to the r_j-th categorical value a_{jr_j} for the attribute X_j for class c_i. As in the full Bayes case, if the count is zero, we can use the pseudo-count method to obtain a prior probability. The adjusted estimates with pseudo-counts are given as

f̂(v_j | c_i) = \frac{n_i(v_j) + 1}{n_i + m_j}

where m_j = |dom(X_j)|. Extending the code in Algorithm 18.2 to incorporate categorical attributes is straightforward.
Example 18.4. Continuing Example 18.2, the class-specific PMF for each discretized attribute is shown in Table 18.2. In particular, these correspond to the row and column marginal probabilities f̂_X1 and f̂_X2, respectively.
The test point x = (6.75, 4.25)^T, corresponding to (Long, Long) or v = (e_{13}, e_{23}), is classified as follows:

P̂(v | c_1) = P̂(e_{13} | c_1) · P̂(e_{23} | c_1) = \frac{0 + 1}{50 + 4} · \frac{13}{50} = 4.81 × 10^{-3}
P̂(v | c_2) = P̂(e_{13} | c_2) · P̂(e_{23} | c_2) = \frac{43}{100} · \frac{2}{100} = 8.60 × 10^{-3}
P̂(c_1 | v) ∝ (4.81 × 10^{-3}) × 0.33 = 1.59 × 10^{-3}
P̂(c_2 | v) ∝ (8.6 × 10^{-3}) × 0.67 = 5.76 × 10^{-3}
Thus, the predicted class is yˆ = c2.
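A small Python sketch of categorical naive Bayes follows (illustrative; unlike the example above, which adjusts only zero counts, this sketch applies the pseudo-count to every value, i.e., standard Laplace smoothing, so its numbers will differ slightly).

import numpy as np
from collections import Counter

def categorical_nb_train(X, y):
    """Per class and attribute: smoothed PMF (n_i(v_j)+1)/(n_i+m_j)."""
    classes = sorted(set(y))
    d = len(X[0])
    domains = [sorted({row[j] for row in X}) for j in range(d)]   # dom(X_j)
    model = {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        n_i = len(rows)
        pmfs = []
        for j in range(d):
            counts = Counter(row[j] for row in rows)
            m_j = len(domains[j])
            # Laplace smoothing for every value (the text smooths only zero counts)
            pmfs.append({v: (counts[v] + 1) / (n_i + m_j) for v in domains[j]})
        model[c] = (n_i / len(X), pmfs)
    return model

def categorical_nb_predict(model, x):
    """argmax_c P(c) * prod_j f(x_j | c); assumes test values appear in training domains."""
    scores = {c: prior * np.prod([pmfs[j][x[j]] for j in range(len(x))])
              for c, (prior, pmfs) in model.items()}
    return max(scores, key=scores.get)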
18.3 K NEAREST NEIGHBORS CLASSIFIER
In the preceding sections we considered a parametric approach for estimating the likelihood P (x|ci ). In this section, we consider a non-parametric approach, which does not make any assumptions about the underlying joint probability density function. Instead, it directly uses the data sample to estimate the density, for example, using the density estimation methods from Chapter 15. We illustrate the non-parametric approach using nearest neighbors density estimation from Section 15.2.3, which leads to the K nearest neighbors (KNN) classifier.
Let D be a training dataset comprising n points xi ∈ Rd , and let Di denote the subset of points in D that are labeled with class ci , with ni = |Di |. Given a test point x ∈ Rd , and K, the number of neighbors to consider, let r denote the distance from x to its Kth nearest neighbor in D.
Consider the d-dimensional hyperball of radius r around the test point x, defined as

B_d(x, r) = { x_i ∈ D | δ(x, x_i) ≤ r }

Here δ(x, x_i) is the distance between x and x_i, which is usually assumed to be the Euclidean distance, i.e., δ(x, x_i) = ‖x − x_i‖_2. However, other distance metrics can also be used. We assume that |B_d(x, r)| = K.

Let K_i denote the number of points among the K nearest neighbors of x that are labeled with class c_i, that is

K_i = |{ x_j ∈ B_d(x, r) | y_j = c_i }|

The class conditional probability density at x can be estimated as the fraction of points from class c_i that lie within the hyperball divided by its volume, that is

f̂(x | c_i) = \frac{K_i / n_i}{V} = \frac{K_i}{n_i V}

where V = vol(B_d(x, r)) is the volume of the d-dimensional hyperball [Eq. (6.4)].

Using Eq. (18.4), the posterior probability P(c_i | x) can be estimated as

P(c_i | x) = \frac{f̂(x | c_i) P̂(c_i)}{\sum_{j=1}^{k} f̂(x | c_j) P̂(c_j)}

However, because P̂(c_i) = n_i / n, we have

f̂(x | c_i) P̂(c_i) = \frac{K_i}{n_i V} · \frac{n_i}{n} = \frac{K_i}{nV}

Thus the posterior probability is given as

P(c_i | x) = \frac{K_i / (nV)}{\sum_{j=1}^{k} K_j / (nV)} = \frac{K_i}{K}

Finally, the predicted class for x is

ŷ = arg max_{c_i} { P(c_i | x) } = arg max_{c_i} { K_i / K } = arg max_{c_i} { K_i }

Because K is fixed, the KNN classifier predicts the class of x as the majority class among its K nearest neighbors.
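A direct Python sketch of this rule (illustrative): compute Euclidean distances, take the K closest training points, and return the majority class.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K=5):
    """Predict the majority class among the K nearest neighbors of x."""
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances
    nearest = np.argsort(dists)[:K]                    # indices of the K closest points
    votes = Counter(y_train[i] for i in nearest)       # K_i counts per class
    return votes.most_common(1)[0][0]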
Example 18.5. Consider the 2D Iris dataset shown in Figure 18.3. The two classes are: c1 (circles) with n1 = 50 points and c2 (triangles) with n2 = 100 points.
Let us classify the test point x = (6.75, 4.25)^T using its K = 5 nearest neighbors. The distance from x to its 5th nearest neighbor, namely (6.2, 3.4)^T, is given as r = \sqrt{1.025} = 1.012. The enclosing ball or circle of radius r is shown in the figure. It encompasses K_1 = 1 point from class c_1 and K_2 = 4 points from class c_2. Therefore,
the predicted class for x is yˆ = c2.
Figure 18.3. Iris data: K nearest neighbors classifier. The ball of radius r around the test point x = (6.75, 4.25)^T is shown.
18.4 FURTHER READING
The naive Bayes classifier is surprisingly effective even though the independence assumption is usually violated in real datasets. Comparisons of the naive Bayes classifier against other classification approaches, and reasons for why it works well, have appeared in Langley, Iba, and Thompson (1992), Domingos and Pazzani (1997), Zhang (2005), Hand and Yu (2001), and Rish (2001). For the long history of naive Bayes in information retrieval see Lewis (1998). The K nearest neighbor classification approach was first proposed in Fix and Hodges (1951).
Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine learning, 29 (2-3): 103–130.
Fix, E. and Hodges Jr., J. L. (1951). Discriminatory Analysis–Nonparametric Discrim- ination: Consistency Properties. Tech. rep. USAF School of Aviation Medicine, Randolph Field, TX, Project 21-49-004, Report 4, Contract AF41(128)-31.
Hand, D. J. and Yu, K. (2001). Idiot’s Bayes-not so stupid after all? International Statistical Review, 69 (3): 385–398.
Langley, P., Iba, W., and Thompson, K. (1992). An analysis of Bayesian classifiers. Proceedings of the National Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, pp. 223–228.
Lewis, D. D. (1998). “Naive (Bayes) at forty: The independence assumption in information retrieval”. In: Proceedings of the 10th European Conference on Machine learning. New York: Springer Science + Business Media, pp. 4–15.
Rish, I. (2001). An empirical study of the naive Bayes classifier. Proceedings of the IJCAI Workshop on Empirical Methods in Artificial Intelligence, pp. 41–46.
Zhang, H. (2005). Exploring conditions for the optimality of naive Bayes. International Journal of Pattern Recognition and Artificial Intelligence, 19 (02): 183–198.
18.5 EXERCISES
Q1. Consider the dataset in Table 18.3. Classify the new point: (Age=23, Car=truck) via the full and naive Bayes approach. You may assume that the domain of Car is given as {sports, vintage, suv, truck}.
Table 18.3. Data for Q1

x_i    Age    Car        Class
x_1    25     sports     L
x_2    20     vintage    H
x_3    25     sports     L
x_4    45     suv        H
x_5    20     sports     H
x_6    25     suv        H
Table 18.4. Data for Q2

x_i    a_1    a_2    a_3    Class
x_1    T      T      5.0    Y
x_2    T      T      7.0    Y
x_3    T      F      8.0    N
x_4    F      F      3.0    Y
x_5    F      T      7.0    N
x_6    F      T      4.0    N
x_7    F      F      5.0    N
x_8    T      F      6.0    Y
x_9    F      T      1.0    N
Q2. Given the dataset in Table 18.4, use the naive Bayes classifier to classify the new point (T, F, 1.0).
Q3. Consider the class means and covariance matrices for classes c_1 and c_2:

μ_1 = (1, 3)^T        μ_2 = (5, 5)^T

Σ_1 = \begin{pmatrix} 5 & 3 \\ 3 & 2 \end{pmatrix}        Σ_2 = \begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}

Classify the point (3, 4)^T via the (full) Bayesian approach, assuming normally distributed classes, and P(c_1) = P(c_2) = 0.5. Show all steps. Recall that the inverse of a 2 × 2 matrix A = \begin{pmatrix} a & b \\ c & d \end{pmatrix} is given as A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}.
CHAPTER 19 Decision Tree Classifier
Let the training dataset D = {xi,yi}ni=1 consist of n points in a d-dimensional space, with yi being the class label for point xi. We assume that the dimensions or the attributes Xj are numeric or categorical, and that there are k distinct classes, so that yi ∈ {c1,c2,…,ck}. A decision tree classifier is a recursive, partition-based tree model that predicts the class yˆi for each point xi. Let R denote the data space that encompasses the set of input points D. A decision tree uses an axis-parallel hyperplane to split the data space R into two resulting half-spaces or regions, say R1 and R2, which also induces a partition of the input points into D1 and D2, respectively. Each of these regions is recursively split via axis-parallel hyperplanes until the points within an induced partition are relatively pure in terms of their class labels, that is, most of the points belong to the same class. The resulting hierarchy of split decisions constitutes the decision tree model, with the leaf nodes labeled with the majority class among points in those regions. To classify a new test point we have to recursively evaluate which half-space it belongs to until we reach a leaf node in the decision tree, at which point we predict its class as the label of the leaf.
Example 19.1. Consider the Iris dataset shown in Figure 19.1a, which plots the attributes sepal length (X1) and sepal width (X2). The classification task is to discriminate between c1, corresponding to iris-setosa (in circles), and c2, corresponding to the other two types of Irises (in triangles). The input dataset D has n = 150 points that lie in the data space which is given as the rectangle, R=range(X1)×range(X2)=[4.3,7.9]×[2.0,4.4].
The recursive partitioning of the space R via axis-parallel hyperplanes is illustrated in Figure 19.1a. In two dimensions a hyperplane is simply a line. The first split corresponds to hyperplane h0 shown as a black line. The resulting left and right half-spaces are further split via hyperplanes h2 and h3, respectively (shown as gray lines). The bottom half-space for h2 is further split via h4, and the top half-space for h3 is split via h5; these third level hyperplanes, h4 and h5, are shown as dashed lines. The set of hyperplanes and the set of six leaf regions, namely R1,…,R6, constitute the decision tree model. Note also the induced partitioning of the input points into these six regions.
Figure 19.1. Decision trees: recursive partitioning via axis-parallel hyperplanes. (a) Recursive splits: the hyperplanes h0, h2, h3, h4, and h5 partition the data space into the six leaf regions R1, ..., R6; the test point z = (6.75, 4.25)^T is shown as a white square. (b) Decision tree: the corresponding tree with decision nodes X1 ≤ 5.45, X2 ≤ 2.8, X2 ≤ 3.45, X1 ≤ 4.7, and X1 ≤ 6.5, and leaves R1, ..., R6 labeled with their class frequencies.
Consider the test point z = (6.75, 4.25)T (shown as a white square). To predict its class, the decision tree first checks which side of h0 it lies in. Because the point lies in the right half-space, the decision tree next checks h3 to determine that z is in the top half-space. Finally, we check and find that z is in the right half-space of h5, and we reach the leaf region R6. The predicted class is c2, as that leaf region has all points (three of them) with class c2 (triangles).
19.1 DECISION TREES
A decision tree consists of internal nodes that represent the decisions corresponding to the hyperplanes or split points (i.e., which half-space a given point lies in), and leaf nodes that represent regions or partitions of the data space, which are labeled with the majority class. A region is characterized by the subset of data points that lie in that region.
Axis-Parallel Hyperplanes
A hyperplane h(x) is defined as the set of all points x that satisfy the following equation
h(x):wTx+b=0 (19.1)
Here w ∈ R^d is a weight vector that is normal to the hyperplane, and b is the offset of the hyperplane from the origin. A decision tree considers only axis-parallel hyperplanes, that is, the weight vector must be parallel to one of the original dimensions or axes X_j. Put differently, the weight vector w is restricted a priori to one of the standard basis vectors {e_1, e_2, ..., e_d}, where e_j ∈ R^d has a 1 for the j-th dimension, and 0 for all other dimensions. If x = (x_1, x_2, ..., x_d)^T and assuming w = e_j, we can rewrite Eq. (19.1) as

h(x): e_j^T x + b = 0, which implies that

h(x): x_j + b = 0
where the choice of the offset b yields different hyperplanes along dimension Xj .
Split Points
A hyperplane specifies a decision or split point because it splits the data space R into two half-spaces. All points x such that h(x) ≤ 0 are on the hyperplane or to one side of the hyperplane, whereas all points such that h(x) > 0 are on the other side. The split point associated with an axis-parallel hyperplane can be written as h(x) ≤ 0, which implies that x_j + b ≤ 0, or x_j ≤ −b. Because x_j is some value from dimension X_j and the offset b can be chosen to be any value, the generic form of a split point for a numeric attribute X_j is given as
Xj ≤ v
where v = −b is some value in the domain of attribute Xj . The decision or split point Xj ≤ v thus splits the input data space R into two regions RY and RN, which denote the set of all possible points that satisfy the decision and those that do not.
Data Partition
Each split of R into RY and RN also induces a binary partition of the corresponding input data points D. That is, a split point of the form Xj ≤ v induces the data partition
DY ={x| x∈D,xj ≤v} DN ={x| x∈D,xj >v}
where D_Y is the subset of data points that lie in region R_Y and D_N is the subset of input points that lie in R_N.
Purity
The purity of a region Rj is defined in terms of the mixture of classes for points in the corresponding data partition Dj . Formally, purity is the fraction of points with the majority label in Dj , that is,
purity(D_j) = \max_i \left\{ \frac{n_{ji}}{n_j} \right\}    (19.2)

where n_j = |D_j| is the total number of data points in the region R_j, and n_{ji} is the number of points in D_j with class label c_i.
Example 19.2. Figure 19.1b shows the resulting decision tree that corresponds to the recursive partitioning of the space via axis-parallel hyperplanes illustrated in Figure 19.1a. The recursive splitting terminates when appropriate stopping conditions are met, usually taking into account the size and purity of the regions. In this example, we use a size threshold of 5 and a purity threshold of 0.95. That is, a region will be split further only if the number of points is more than five and the purity is less than 0.95.
The very first hyperplane to be considered is h1(x) : x1 − 5.45 = 0 which corresponds to the decision
X1 ≤ 5.45
at the root of the decision tree. The two resulting half-spaces are recursively split into smaller half-spaces.
For example, the region X1 ≤ 5.45 is further split using the hyperplane h2 (x) : x2 − 2.8 = 0 corresponding to the decision
X2 ≤ 2.8
which forms the left child of the root. Notice how this hyperplane is restricted only to the region X1 ≤ 5.45. This is because each region is considered independently after the split, as if it were a separate dataset. There are seven points that satisfy the condition X2 ≤ 2.8, out of which one is from class c1 (circle) and six are from class c2 (triangles). The purity of this region is therefore 6/7 = 0.857. Because the region has more than five points, and its purity is less than 0.95, it is further split via the hyperplane h4 (x) : x1 − 4.7 = 0 yielding the left-most decision node
X1 ≤ 4.7
in the decision tree shown in Figure 19.1b.
Returning back to the right half-space corresponding to h2, namely the region
X2 > 2.8, it has 45 points, of which only one is a triangle. The size of the region is 45, but the purity is 44/45 = 0.98. Because the region exceeds the purity threshold it is not split further. Instead, it becomes a leaf node in the decision tree, and the entire region (R1) is labeled with the majority class c1. The frequency for each class is also noted at a leaf node so that the potential error rate for that leaf can be computed. For example, we can expect that the probability of misclassification in region R1 is 1/45 = 0.022, which is the error rate for that leaf.
Categorical Attributes
In addition to numeric attributes, a decision tree can also handle categorical data. For a categorical attribute X_j, the split points or decisions are of the form X_j ∈ V, where V ⊂ dom(X_j), and dom(X_j) denotes the domain for X_j. Intuitively, this split can be considered to be the categorical analog of a hyperplane. It results in two "half-spaces," one region R_Y consisting of points x that satisfy the condition x_j ∈ V, and the other region R_N comprising points that satisfy the condition x_j ∉ V.
Decision Rules
One of the advantages of decision trees is that they produce models that are relatively easy to interpret. In particular, a tree can be read as a set of decision rules, with each rule's antecedent comprising the decisions on the internal nodes along a path to a leaf, and its consequent being the label of the leaf node. Further, because the regions are all disjoint and cover the entire space, the set of rules can be interpreted as a set of alternatives or disjunctions.
Example 19.3. Consider the decision tree in Figure 19.1b. It can be interpreted as the following set of disjunctive rules, one per leaf region R_i:

R_3: If X_1 ≤ 5.45 and X_2 ≤ 2.8 and X_1 ≤ 4.7, then class is c_1, or
R_4: If X_1 ≤ 5.45 and X_2 ≤ 2.8 and X_1 > 4.7, then class is c_2, or
R_1: If X_1 ≤ 5.45 and X_2 > 2.8, then class is c_1, or
R_2: If X_1 > 5.45 and X_2 ≤ 3.45, then class is c_2, or
R_5: If X_1 > 5.45 and X_2 > 3.45 and X_1 ≤ 6.5, then class is c_1, or
R_6: If X_1 > 5.45 and X_2 > 3.45 and X_1 > 6.5, then class is c_2
19.2 DECISION TREE ALGORITHM
The pseudo-code for decision tree model construction is shown in Algorithm 19.1. It takes as input a training dataset D, and two parameters η and π, where η is the leaf size and π the leaf purity threshold. Different split points are evaluated for each attribute in D. Numeric decisions are of the form X_j ≤ v for some value v in the value range for attribute X_j, and categorical decisions are of the form X_j ∈ V for some subset of values in the domain of X_j. The best split point is chosen to partition the data into two subsets, D_Y and D_N, where D_Y corresponds to all points x ∈ D that satisfy the split decision, and D_N corresponds to all points that do not satisfy the split decision. The decision tree method is then called recursively on D_Y and D_N. A number of stopping conditions can be used to stop the recursive partitioning process. The simplest condition is based on the size of the partition D. If the number of points n in D drops below the user-specified size threshold η, then we stop the partitioning process and make D a leaf. This condition prevents over-fitting the model to the training set, by avoiding the modeling of very small subsets of the data.
ALGORITHM 19.1. Decision Tree Algorithm

DECISIONTREE (D, η, π):
n ← |D| // partition size
n_i ← |{x_j | x_j ∈ D, y_j = c_i}| // size of class c_i
purity(D) ← max_i {n_i / n}
if n ≤ η or purity(D) ≥ π then // stopping condition
    c* ← arg max_{c_i} {n_i / n} // majority class
    create leaf node, and label it with class c*
    return
(split point*, score*) ← (∅, 0) // initialize best split point
foreach (attribute X_j) do
    if (X_j is numeric) then
        (v, score) ← EVALUATE-NUMERIC-ATTRIBUTE(D, X_j)
        if score > score* then (split point*, score*) ← (X_j ≤ v, score)
    else if (X_j is categorical) then
        (V, score) ← EVALUATE-CATEGORICAL-ATTRIBUTE(D, X_j)
        if score > score* then (split point*, score*) ← (X_j ∈ V, score)
// partition D into D_Y and D_N using split point*, and call recursively
D_Y ← {x ∈ D | x satisfies split point*}
D_N ← {x ∈ D | x does not satisfy split point*}
create internal node split point*, with two child nodes, D_Y and D_N
DECISIONTREE(D_Y); DECISIONTREE(D_N)
Size alone is not sufficient because if the partition is already pure then it does not make sense to split it further. Thus, the recursive partitioning is also terminated if the purity of D is above the purity threshold π. Details of how the split points are evaluated and chosen are given next.
19.2.1 Split Point Evaluation Measures
Given a split point of the form Xj ≤ v or Xj ∈ V for a numeric or categorical attribute, respectively, we need an objective criterion for scoring the split point. Intuitively, we want to select a split point that gives the best separation or discrimination between the different class labels.
Entropy
Entropy, in general, measures the amount of disorder or uncertainty in a system. In the classification setting, a partition has lower entropy (or low disorder) if it is relatively pure, that is, if most of the points have the same label. On the other hand, a partition has higher entropy (or more disorder) if the class labels are mixed, and there is no majority class as such.
The entropy of a set of labeled points D is defined as follows:

H(D) = -\sum_{i=1}^{k} P(c_i | D) \log_2 P(c_i | D)    (19.3)

where P(c_i | D) is the probability of class c_i in D, and k is the number of classes. If a region is pure, that is, has points from the same class, then the entropy is zero. On the other hand, if the classes are all mixed up, and each appears with equal probability P(c_i | D) = 1/k, then the entropy has the highest value, H(D) = \log_2 k.

Assume that a split point partitions D into D_Y and D_N. Define the split entropy as the weighted entropy of each of the resulting partitions, given as

H(D_Y, D_N) = \frac{n_Y}{n} H(D_Y) + \frac{n_N}{n} H(D_N)    (19.4)

where n = |D| is the number of points in D, and n_Y = |D_Y| and n_N = |D_N| are the number of points in D_Y and D_N.
To see if the split point results in a reduced overall entropy, we define the information gain for a given split point as follows:
Gain(D,DY,DN)=H(D)−H(DY,DN) (19.5)
The higher the information gain, the more the reduction in entropy, and the better the split point. Thus, given split points and their corresponding partitions, we can score each split point and choose the one that gives the highest information gain.
Gini Index
Another common measure to gauge the purity of a split point is the Gini index, defined as follows:

G(D) = 1 - \sum_{i=1}^{k} P(c_i | D)^2    (19.6)

If the partition is pure, then the probability of the majority class is 1 and the probability of all other classes is 0, and thus, the Gini index is 0. On the other hand, when each class is equally represented, with probability P(c_i | D) = 1/k, then the Gini index has value (k−1)/k. Thus, higher values of the Gini index indicate more disorder, and lower values indicate more order in terms of the class labels.

We can compute the weighted Gini index of a split point as follows:

G(D_Y, D_N) = \frac{n_Y}{n} G(D_Y) + \frac{n_N}{n} G(D_N)

where n, n_Y, and n_N denote the number of points in regions D, D_Y, and D_N, respectively. The lower the Gini index value, the better the split point.

Other measures can also be used instead of entropy and Gini index to evaluate the splits. For example, the Classification And Regression Trees (CART) measure is given as

CART(D_Y, D_N) = 2 \frac{n_Y}{n} \frac{n_N}{n} \sum_{i=1}^{k} \bigl| P(c_i | D_Y) - P(c_i | D_N) \bigr|    (19.7)
This measure thus prefers a split point that maximizes the difference between the class probability mass function for the two partitions; the higher the CART measure, the better the split point.
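The three measures can be computed from the class PMFs of a candidate partition, as in the following Python sketch (illustrative; y, y_yes, and y_no are the label arrays for D, D_Y, and D_N).

import numpy as np

def class_pmf(labels, classes):
    """Empirical class PMF P(c_i | D) for the given label array."""
    return np.array([np.sum(labels == c) for c in classes]) / len(labels)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

def split_scores(y, y_yes, y_no):
    """Information gain, weighted Gini, and CART measure for a split D -> (D_Y, D_N)."""
    y, y_yes, y_no = (np.asarray(v) for v in (y, y_yes, y_no))
    classes = np.unique(y)
    n, nY, nN = len(y), len(y_yes), len(y_no)
    p, pY, pN = (class_pmf(v, classes) for v in (y, y_yes, y_no))
    gain = entropy(p) - (nY / n) * entropy(pY) - (nN / n) * entropy(pN)
    wgini = (nY / n) * gini(pY) + (nN / n) * gini(pN)
    cart = 2 * (nY / n) * (nN / n) * np.sum(np.abs(pY - pN))
    return gain, wgini, cart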
19.2.2 Evaluating Split Points
All of the split point evaluation measures, such as entropy [Eq. (19.3)], Gini-index [Eq. (19.6)], and CART [Eq. (19.7)], considered in the preceding section depend on the class probability mass function (PMF) for D, namely, P (ci |D), and the class PMFs for the resulting partitions DY and DN, namely P(ci|DY) and P(ci|DN). Note that we have to compute the class PMFs for all possible split points; scoring each of them independently would result in significant computational overhead. Instead, one can incrementally compute the PMFs as described in the following paragraphs.
Numeric Attributes
If X is a numeric attribute, we have to evaluate split points of the form X ≤ v. Even if we restrict v to lie within the value range of attribute X, there are still an infinite number of choices for v. One reasonable approach is to consider only the midpoints between two successive distinct values for X in the sample D. This is because split points of the form X ≤ v, for v ∈ [xa,xb), where xa and xb are two successive distinct values of X in D, produce the same partitioning of D into DY and DN, and thus yield the same scores. Because there can be at most n distinct values for X, there are at most n − 1 midpoint values to consider.
Let {v1, ..., vm} denote the set of all such midpoints, such that v1 < v2 < ··· < vm. For each split point X ≤ v, we have to estimate the class PMFs:

P̂(ci|DY) = P̂(ci|X ≤ v)    (19.8)
P̂(ci|DN) = P̂(ci|X > v)    (19.9)

Let I() be an indicator variable that takes on the value 1 only when its argument is true, and is 0 otherwise. Using the Bayes theorem, we have

P̂(ci|X ≤ v) = P̂(X ≤ v|ci) P̂(ci) / P̂(X ≤ v) = P̂(X ≤ v|ci) P̂(ci) / Σ_{j=1}^k P̂(X ≤ v|cj) P̂(cj)    (19.10)

The prior probability for each class in D can be estimated as follows:

P̂(ci) = (1/n) Σ_{j=1}^n I(yj = ci) = ni/n    (19.11)

where yj is the class for point xj, n = |D| is the total number of points, and ni is the number of points in D with class ci. Define Nvi as the number of points xj ≤ v with class ci, where xj is the value of data point xj for the attribute X, given as

Nvi = Σ_{j=1}^n I(xj ≤ v and yj = ci)    (19.12)

We can then estimate P̂(X ≤ v|ci) as follows:

P̂(X ≤ v|ci) = P̂(X ≤ v and ci) / P̂(ci) = ( (1/n) Σ_{j=1}^n I(xj ≤ v and yj = ci) ) / (ni/n) = Nvi/ni    (19.13)

Plugging Eqs. (19.11) and (19.13) into Eq. (19.10), and using Eq. (19.8), we have

P̂(ci|DY) = P̂(ci|X ≤ v) = Nvi / Σ_{j=1}^k Nvj    (19.14)

We can estimate P̂(X > v|ci) as follows:

P̂(X > v|ci) = 1 − P̂(X ≤ v|ci) = 1 − Nvi/ni = (ni − Nvi)/ni    (19.15)

Using Eqs. (19.11) and (19.15), the class PMF P̂(ci|DN) is given as

P̂(ci|DN) = P̂(ci|X > v) = P̂(X > v|ci) P̂(ci) / Σ_{j=1}^k P̂(X > v|cj) P̂(cj) = (ni − Nvi) / Σ_{j=1}^k (nj − Nvj)    (19.16)
Algorithm 19.2 shows the split point evaluation method for numeric attributes. The for loop on line 4 iterates through all the points and computes the midpoint values v and the number of points Nvi from class ci such that xj ≤ v. The for loop on line 12 enumerates all possible split points of the form X ≤ v, one for each midpoint v, and scores them using the gain criterion [Eq. (19.5)]; the best split point and score are recorded and returned. Any of the other evaluation measures can also be used; note, however, that for the weighted Gini index a lower score is better, whereas for gain and the CART measure a higher score is better.
In terms of computational complexity, the initial sorting of values of X (line 1) takes time O(n log n). The cost of computing the midpoints and the class-specific counts Nvi takes time O(nk) (for loop on line 4). The cost of computing the score is also bounded by O(nk), because the total number of midpoints v can be at most n (for loop on line 12). The total cost of evaluating a numeric attribute is therefore O(n log n + nk). Ignoring k, because it is usually a small constant, the total cost of numeric split point evaluation is O(n log n).
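As a rough Python sketch of this idea (mirroring the incremental-count strategy of Algorithm 19.2, but not reproducing it verbatim; names are our own), one can sort the data once and sweep the midpoints while maintaining the running class counts:

import numpy as np

def evaluate_numeric_attribute(x, y):
    """Return (best midpoint v*, best information gain) for splits X <= v.

    x : 1-d array of attribute values, y : array of class labels.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    classes = np.unique(y)
    n = len(x)
    vals, cnts = np.unique(y, return_counts=True)
    H_D = -np.sum((cnts / n) * np.log2(cnts / n))   # entropy of D, Eq. (19.3)

    counts = {c: 0 for c in classes}                # running counts for x_j <= v
    totals = {c: int(np.sum(y == c)) for c in classes}
    best_v, best_score = None, 0.0
    for j in range(n - 1):
        counts[y[j]] += 1
        if x[j + 1] != x[j]:                        # midpoint between distinct values
            v = (x[j] + x[j + 1]) / 2
            nY = sum(counts.values())
            pY = np.array([counts[c] / nY for c in classes])
            pN = np.array([(totals[c] - counts[c]) / (n - nY) for c in classes])
            HY = -np.sum(pY[pY > 0] * np.log2(pY[pY > 0]))
            HN = -np.sum(pN[pN > 0] * np.log2(pN[pN > 0]))
            gain = H_D - (nY / n) * HY - ((n - nY) / n) * HN
            if gain > best_score:
                best_v, best_score = v, gain
    return best_v, best_score

On the Iris data used below, a sweep of this kind should recover the split X1 ≤ 5.45 discussed in Example 19.4.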
Example 19.4 (Numeric Attributes). Consider the 2-dimensional Iris dataset shown in Figure 19.1a. In the initial invocation of Algorithm 19.1, the entire dataset D with n = 150 points is considered at the root of the decision tree. The task is to find the best split point considering both the attributes, X1 (sepal length) and X2 (sepal width). There are n1 = 50 points labeled c1 (iris-setosa) and n2 = 100 points labeled c2 (other). We thus have

P̂(c1) = 50/150 = 1/3
P̂(c2) = 100/150 = 2/3
ALGORITHM 19.2. Evaluate Numeric Attribute (Using Gain)

EVALUATE-NUMERIC-ATTRIBUTE (D, X):
1  sort D on attribute X, so that xj ≤ xj+1, ∀j = 1,...,n−1
2  M ← ∅ // set of midpoints
3  for i = 1,...,k do ni ← 0
4  for j = 1,...,n−1 do
5      if yj = ci then ni ← ni + 1 // running count for class ci
6      if xj+1 ≠ xj then
7          v ← (xj+1 + xj)/2; M ← M ∪ {v} // midpoints
8          for i = 1,...,k do
9              Nvi ← ni // number of points such that xj ≤ v and yj = ci
10 if yn = ci then ni ← ni + 1
   // evaluate split points of the form X ≤ v
11 v∗ ← ∅; score∗ ← 0 // initialize best split point
12 forall v ∈ M do
13     for i = 1,...,k do
14         P̂(ci|DY) ← Nvi / Σ_{j=1}^k Nvj
15         P̂(ci|DN) ← (ni − Nvi) / Σ_{j=1}^k (nj − Nvj)
16     score(X ≤ v) ← Gain(D, DY, DN) // use Eq. (19.5)
17     if score(X ≤ v) > score∗ then
18         v∗ ← v; score∗ ← score(X ≤ v)
19 return (v∗, score∗)
The entropy [Eq. (19.3)] of the dataset D is therefore

H(D) = −( (1/3) log2 (1/3) + (2/3) log2 (2/3) ) = 0.918

Consider split points for attribute X1. To evaluate the splits we first compute the frequencies Nvi using Eq. (19.12), which are plotted in Figure 19.2 for both the classes. For example, consider the split point X1 ≤ 5.45. From Figure 19.2, we see that Nv1 = 45 and Nv2 = 7. Plugging these values into Eq. (19.14) we get

P̂(c1|DY) = Nv1 / (Nv1 + Nv2) = 45 / (45 + 7) = 0.865
P̂(c2|DY) = Nv2 / (Nv1 + Nv2) = 7 / (45 + 7) = 0.135
Figure 19.2. Iris: frequencies Nvi for classes c1 and c2 for attribute sepal length.
and using Eq. (19.16), we obtain

P̂(c1|DN) = (n1 − Nv1) / ((n1 − Nv1) + (n2 − Nv2)) = (50 − 45) / ((50 − 45) + (100 − 7)) = 0.051
P̂(c2|DN) = (n2 − Nv2) / ((n1 − Nv1) + (n2 − Nv2)) = (100 − 7) / ((50 − 45) + (100 − 7)) = 0.949

We can now compute the entropy of the partitions DY and DN as follows:

H(DY) = −(0.865 log2 0.865 + 0.135 log2 0.135) = 0.571
H(DN) = −(0.051 log2 0.051 + 0.949 log2 0.949) = 0.291

The entropy of the split point X1 ≤ 5.45 is given via Eq. (19.4) as

H(DY, DN) = (52/150) H(DY) + (98/150) H(DN) = 0.388
where nY = |DY| = 52 and nN = |DN| = 98. The information gain for the split point is therefore
Gain=H(D)−H(DY,DN)=0.918−0.388=0.53
In a similar manner, we can evaluate all of the split points for both attributes X1 and X2. Figure 19.3 plots the gain values for the different split points for the two attributes. We can observe that X1 ≤ 5.45 is the best split point, and it is thus chosen as the root of the decision tree in Figure 19.1b.
The recursive tree growth process continues and yields the final decision tree and the split points as shown in Figure 19.1b. In this example, we use a leaf size threshold of 5 and a purity threshold of 0.95.
Figure 19.3. Iris: gain for different split points, for sepal length and sepal width.

Categorical Attributes
If X is a categorical attribute we evaluate split points of the form X ∈ V, where V ⊂ dom(X) and V ≠ ∅. In words, all distinct partitions of the set of values of X are considered. Because the split point X ∈ V yields the same partition as X ∈ V̄, where V̄ = dom(X) \ V is the complement of V, the total number of distinct partitions is given as

Σ_{i=1}^{⌊m/2⌋} (m choose i) = O(2^{m−1})    (19.17)

where m is the number of values in the domain of X, that is, m = |dom(X)|. The number of possible split points to consider is therefore exponential in m, which can pose problems if m is large. One simplification is to restrict V to be of size one, so that there are only m split points of the form X ∈ {v}, where v ∈ dom(X).
To evaluate a given split point X ∈ V we have to compute the following class probability mass functions:

P(ci|DY) = P(ci|X ∈ V)    P(ci|DN) = P(ci|X ∉ V)

Making use of the Bayes theorem, we have

P(ci|X ∈ V) = P(X ∈ V|ci) P(ci) / P(X ∈ V) = P(X ∈ V|ci) P(ci) / Σ_{j=1}^k P(X ∈ V|cj) P(cj)

However, note that a given point x can take on only one value in the domain of X, and thus the values v ∈ dom(X) are mutually exclusive. Therefore, we have

P(X ∈ V|ci) = Σ_{v∈V} P(X = v|ci)

and we can rewrite P(ci|DY) as

P(ci|DY) = Σ_{v∈V} P(X = v|ci) P(ci) / Σ_{j=1}^k Σ_{v∈V} P(X = v|cj) P(cj)    (19.18)
Define nvi as the number of points xj ∈ D with value xj = v for attribute X and having class yj = ci:

nvi = Σ_{j=1}^n I(xj = v and yj = ci)    (19.19)

The class conditional empirical PMF for X is then given as

P̂(X = v|ci) = P̂(X = v and ci) / P̂(ci) = ( (1/n) Σ_{j=1}^n I(xj = v and yj = ci) ) / (ni/n) = nvi/ni    (19.20)

Note that the class prior probabilities can be estimated using Eq. (19.11) as discussed earlier, that is, P̂(ci) = ni/n. Thus, substituting Eq. (19.20) in Eq. (19.18), the class PMF for the partition DY for the split point X ∈ V is given as

P̂(ci|DY) = Σ_{v∈V} P̂(X = v|ci) P̂(ci) / Σ_{j=1}^k Σ_{v∈V} P̂(X = v|cj) P̂(cj) = Σ_{v∈V} nvi / Σ_{j=1}^k Σ_{v∈V} nvj    (19.21)

In a similar manner, the class PMF for the partition DN is given as

P̂(ci|DN) = P̂(ci|X ∉ V) = Σ_{v∉V} nvi / Σ_{j=1}^k Σ_{v∉V} nvj    (19.22)
Algorithm 19.3 shows the split point evaluation method for categorical attributes. The for loop on line 4 iterates through all the points and computes nvi, that is, the number of points having value v ∈ dom(X) and class ci. The for loop on line 7 enumerates all possible split points of the form X ∈ V for V ⊂ dom(X), such that |V| ≤ l, where l is a user-specified parameter denoting the maximum cardinality of V. For example, to control the number of split points, we can restrict V to be a single item, that is, l = 1, so that splits are of the form X ∈ {v}, with v ∈ dom(X). If l = ⌊m/2⌋, we have to consider all possible distinct partitions V. Given a split point X ∈ V, the method scores it using information gain [Eq. (19.5)], although any of the other scoring criteria can also be used. The best split point and score are recorded and returned.
In terms of computational complexity, computing the class-specific counts nvi takes O(n) time (for loop on line 4). With m = |dom(X)|, the maximum number of partitions V is O(2^{m−1}), and because each split point can be evaluated in time O(mk), the for loop on line 7 takes time O(mk 2^{m−1}). The total cost for categorical attributes is therefore O(n + mk 2^{m−1}). If we make the assumption that 2^{m−1} = O(n), that is, if we bound the maximum size of V to l = O(log n), then the cost of categorical splits is bounded as O(n log n), ignoring k.
ALGORITHM 19.3. Evaluate Categorical Attribute (Using Gain)

EVALUATE-CATEGORICAL-ATTRIBUTE (D, X, l):
1  for i = 1,...,k do
2      ni ← 0
3      forall v ∈ dom(X) do nvi ← 0
4  for j = 1,...,n do
5      if xj = v and yj = ci then nvi ← nvi + 1 // frequency statistics
   // evaluate split points of the form X ∈ V
6  V∗ ← ∅; score∗ ← 0 // initialize best split point
7  forall V ⊂ dom(X), such that 1 ≤ |V| ≤ l do
8      for i = 1,...,k do
9          P̂(ci|DY) ← Σ_{v∈V} nvi / Σ_{j=1}^k Σ_{v∈V} nvj
10         P̂(ci|DN) ← Σ_{v∉V} nvi / Σ_{j=1}^k Σ_{v∉V} nvj
11     score(X ∈ V) ← Gain(D, DY, DN) // use Eq. (19.5)
12     if score(X ∈ V) > score∗ then
13         V∗ ← V; score∗ ← score(X ∈ V)
14 return (V∗, score∗)
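A compact Python sketch of the same idea (our own illustrative code; l restricts the candidate subset size as in Algorithm 19.3):

import numpy as np
from itertools import combinations

def evaluate_categorical_attribute(x, y, l=1):
    """Return (best subset V*, best information gain) for splits X in V."""
    values = list(np.unique(x))
    classes = list(np.unique(y))
    n = len(x)
    # frequency table n_{vi}: count of points with value v and class c_i
    nvi = {(v, c): int(np.sum((x == v) & (y == c))) for v in values for c in classes}

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    H_D = H(np.array([np.sum(y == c) for c in classes]) / n)
    best_V, best_score = None, 0.0
    for size in range(1, l + 1):
        for V in combinations(values, size):
            cY = np.array([sum(nvi[(v, c)] for v in V) for c in classes])
            cN = np.array([sum(nvi[(v, c)] for v in values if v not in V)
                           for c in classes])
            nY, nN = cY.sum(), cN.sum()
            if nY == 0 or nN == 0:
                continue                       # degenerate split, skip it
            gain = H_D - (nY / n) * H(cY / nY) - (nN / n) * H(cN / nN)
            if gain > best_score:
                best_V, best_score = set(V), gain
    return best_V, best_score

On the discretized sepal length data of Table 19.1 (below), a sweep of this kind with l = 1 should reproduce the singleton rows of Table 19.2, with {a1} scoring highest.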
Example 19.5 (Categorical Attributes). Consider the 2-dimensional Iris dataset comprising the sepal length and sepal width attributes. Let us assume that sepal length has been discretized as shown in Table 19.1. The class frequencies nvi are also shown. For instance, na1,2 = 6 denotes the fact that there are 6 points in D with value v = a1 and class c2.
Consider the split point X1 ∈ {a1, a3}. From Table 19.1 we can compute the class PMF for partition DY using Eq. (19.21):

P̂(c1|DY) = (na1,1 + na3,1) / ((na1,1 + na3,1) + (na1,2 + na3,2)) = (39 + 0) / ((39 + 0) + (6 + 43)) = 0.443
P̂(c2|DY) = 1 − P̂(c1|DY) = 0.557

with the entropy given as

H(DY) = −(0.443 log2 0.443 + 0.557 log2 0.557) = 0.991

To compute the class PMF for DN [Eq. (19.22)], we sum up the frequencies over values v ∉ V = {a1, a3}, that is, we sum over v = a2 and v = a4, as follows:

P̂(c1|DN) = (na2,1 + na4,1) / ((na2,1 + na4,1) + (na2,2 + na4,2)) = (11 + 0) / ((11 + 0) + (39 + 12)) = 0.177
P̂(c2|DN) = 1 − P̂(c1|DN) = 0.823
Table 19.1. Discretized sepal length attribute: class frequencies

Bins          v: values          c1: iris-setosa    c2: other
[4.3, 5.2]    Very Short (a1)    39                 6
(5.2, 6.1]    Short (a2)         11                 39
(6.1, 7.0]    Long (a3)          0                  43
(7.0, 7.9]    Very Long (a4)     0                  12

Table 19.2. Categorical split points for sepal length

V           Split entropy    Info. gain
{a1}        0.509            0.410
{a2}        0.897            0.217
{a3}        0.711            0.207
{a4}        0.869            0.049
{a1, a2}    0.632            0.286
{a1, a3}    0.860            0.058
{a1, a4}    0.667            0.251
{a2, a3}    0.667            0.251
{a2, a4}    0.860            0.058
{a3, a4}    0.632            0.286
The entropy of DN is then given as

H(DN) = −(0.177 log2 0.177 + 0.823 log2 0.823) = 0.673

We can see from Table 19.1 that X1 ∈ {a1, a3} splits the input data D into partitions of size |DY| = 39 + 6 + 43 = 88 and |DN| = 150 − 88 = 62. The entropy of the split is therefore given as

H(DY, DN) = (88/150) H(DY) + (62/150) H(DN) = 0.86

As noted in Example 19.4, the entropy of the whole dataset D is H(D) = 0.918. The gain is then given as

Gain = H(D) − H(DY, DN) = 0.918 − 0.86 = 0.058

The split entropy and gain values for all the categorical split points are given in Table 19.2. We can see that X1 ∈ {a1} is the best split point on the discretized attribute X1.
19.2.3 Computational Complexity
To analyze the computational complexity of the decision tree method in Algorithm 19.1, we assume that the cost of evaluating all the split points for a numeric or categorical
attribute is O(n log n), where n = |D| is the size of the dataset. Given D, the decision tree algorithm evaluates all d attributes, with cost O(dn log n). The total cost depends on the depth of the decision tree. In the worst case, the tree can have depth n, and thus the total cost is O(dn^2 log n).
19.3 FURTHER READING
Among the earliest works on decision trees are Hunt, Marin, and Stone (1966); Breiman et al. (1984); and Quinlan (1986). The description in this chapter is largely based on the C4.5 method described in Quinlan (1993), which is an excellent reference for further details, such as how to prune decision trees to prevent overfitting, how to handle missing attribute values, and other implementation issues. A survey of methods for simplifying decision trees appears in Breslow and Aha (1997). Scalable implementation techniques are described in Mehta, Agrawal, and Rissanen (1996) and Gehrke et al. (1999).
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and regression trees. Boca Raton, FL: Chapman and Hall/CRC Press.
Breslow, L. A. and Aha, D. W. (1997). Simplifying decision trees: A survey. Knowledge Engineering Review, 12 (1): 1–40.
Gehrke, J., Ganti, V., Ramakrishnan, R., and Loh, W.-Y. (1999). BOAT-optimistic decision tree construction. ACM SIGMOD Record, 28 (2): 169–180.
Hunt, E. B., Marin, J., and Stone, P. J. (1966). Experiments in induction. New York: Academic Press.
Mehta, M., Agrawal, R., and Rissanen, J. (1996). “SLIQ: A fast scalable classifier for data mining”. In: Proceedings of the International Conference on Extending Database Technology. New York: Springer-Verlag, pp. 18–32.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1 (1): 81–106.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. New York: Morgan Kaufmann.
19.4 EXERCISES
Q1. True or False:
(a) High entropy means that the partitions in classification are “pure.”
(b) Multiway split of a categorical attribute generally results in more pure partitions
than a binary split.
Q2. Given Table 19.3, construct a decision tree using a purity threshold of 100%. Use information gain as the split point evaluation measure. Next, classify the point (Age=27,Car=Vintage).
Q3. What is the maximum and minimum value of the CART measure [Eq. (19.7)] and under what conditions?
Q4. Given the dataset in Table 19.4, answer the following questions:
Table 19.3. Data for Q2: Age is numeric and Car is categorical. Risk gives the class label for each point: high (H) or low (L)

Point   Age   Car       Risk
x1      25    Sports    L
x2      20    Vintage   H
x3      25    Sports    L
x4      45    SUV       H
x5      20    Sports    H
x6      25    SUV       H
Table 19.4. Data for Q4

Instance   a1   a2   a3    Class
1          T    T    5.0   Y
2          T    T    7.0   Y
3          T    F    8.0   N
4          F    F    3.0   Y
5          F    T    7.0   N
6          F    T    4.0   N
7          F    F    5.0   N
8          T    F    6.0   Y
9          F    T    1.0   N
(a) Show which decision will be chosen at the root of the decision tree using information gain [Eq. (19.5)], Gini index [Eq. (19.6)], and CART [Eq. (19.7)] measures. Show all split points for all attributes.
(b) What happens to the purity if we use Instance as another attribute? Do you think this attribute should be used for a decision in the tree?
Q5. Consider Table 19.5. Let us make a nonlinear split instead of an axis-parallel split, given as follows: AB − B^2 ≤ 0. Compute the information gain of this split based on entropy (use log2, i.e., log to the base 2).
Table 19.5. Data for Q5

      A     B     Class
x1    3.5   4     H
x2    2     4     H
x3    9.1   4.5   L
x4    2     6     H
x5    1.5   7     H
x6    7     6.5   H
x7    2.1   2.5   L
x8    8     4     L
CHAPTER 20 Linear Discriminant Analysis
Given labeled data consisting of d-dimensional points xi along with their classes yi, the goal of linear discriminant analysis (LDA) is to find a vector w that maximizes the separation between the classes after projection onto w. Recall from Chapter 7 that the first principal component is the vector that maximizes the projected variance of the points. The key difference between principal component analysis and LDA is that the former deals with unlabeled data and tries to maximize variance, whereas the latter deals with labeled data and tries to maximize the discrimination between the classes.
20.1 OPTIMAL LINEAR DISCRIMINANT
Let us assume that the dataset D consists of n labeled points {xi,yi}, where xi ∈ Rd and yi ∈ {c1,c2,…,ck}. Let Di denote the subset of points labeled with class ci, i.e., Di = {xj|yj = ci}, and let |Di| = ni denote the number of points with class ci. We assume that there are only k = 2 classes. Thus, the dataset D can be partitioned into D1 and D2.
Let w be a unit vector, that is, wTw = 1. By Eq. (1.7), the projection of any d-dimensional point xi onto the vector w is given as
x′i = ( (w^T xi) / (w^T w) ) w = (w^T xi) w = ai w

where ai specifies the offset or coordinate of x′i along the line w:

ai = w^T xi
Thus, the set of n scalars {a1,a2,…,an} represents the mapping from Rd to R, that is, from the original d-dimensional space to a 1-dimensional space (along w).
Example 20.1. Consider Figure 20.1, which shows the 2-dimensional Iris dataset with sepal length and sepal width as the attributes, and iris-setosa as class c1 (circles), and the other two Iris types as class c2 (triangles). There are n1 = 50 points in c1 and n2 = 100 points in c2. One possible vector w is shown, along with the projection of all the points onto w. The projected means of the two classes are shown in black. Here w has been translated so that it passes through the mean of the entire data. One can observe that w is not very good in discriminating between the two classes because the projections of the points onto w are all mixed up in terms of their class labels. The optimal linear discriminant direction is shown in Figure 20.2.

Figure 20.1. Projection onto w.
Each point coordinate ai has associated with it the original class label yi , and thus we can compute, for each of the two classes, the mean of the projected points as follows:
m1 = (1/n1) Σ_{xi∈D1} ai = (1/n1) Σ_{xi∈D1} w^T xi = w^T ( (1/n1) Σ_{xi∈D1} xi ) = w^T μ1

where μ1 is the mean of all points in D1. Likewise, we can obtain

m2 = w^T μ2
In other words, the mean of the projected points is the same as the projection of the mean.
Figure 20.2. Linear discriminant direction w.
To maximize the separation between the classes, it seems reasonable to maximize the difference between the projected means, |m1 − m2|. However, this is not enough. For good separation, the variance of the projected points for each class should also not be too large. A large variance would lead to possible overlaps among the points of the two classes due to the large spread of the points, and thus we may fail to have a good separation. LDA maximizes the separation by ensuring that the scatter si2 for the projected points within each class is small, where scatter is defined as
si^2 = Σ_{xj∈Di} (aj − mi)^2
Scatter is the total squared deviation from the mean, as opposed to the variance, which is the average deviation from mean. In other words
si2 =niσi2
where ni = |Di | is the size, and σi2 is the variance, for class ci .
We can incorporate the two LDA criteria, namely, maximizing the distance
between projected means and minimizing the sum of projected scatter, into a single maximization criterion called the Fisher LDA objective:
max_w J(w) = (m1 − m2)^2 / (s1^2 + s2^2)    (20.1)

The goal of LDA is to find the vector w that maximizes J(w), that is, the direction that maximizes the separation between the two means m1 and m2, and minimizes the total scatter s1^2 + s2^2 of the two classes. The vector w is also called the optimal linear discriminant (LD). The optimization objective [Eq. (20.1)] is in the projected space. To solve it, we have to rewrite it in terms of the input data, as described next.
Note that we can rewrite (m1 − m2)^2 as follows:

(m1 − m2)^2 = ( w^T (μ1 − μ2) )^2 = w^T (μ1 − μ2)(μ1 − μ2)^T w = w^T B w    (20.2)
where B = (μ1 − μ2 )(μ1 − μ2 )T is a d × d rank-one matrix called the between-class scatter matrix.
As for the projected scatter for class c1, we can compute it as follows:

s1^2 = Σ_{xi∈D1} (ai − m1)^2
     = Σ_{xi∈D1} (w^T xi − w^T μ1)^2
     = Σ_{xi∈D1} ( w^T (xi − μ1) )^2
     = w^T ( Σ_{xi∈D1} (xi − μ1)(xi − μ1)^T ) w
     = w^T S1 w    (20.3)

where S1 is the scatter matrix for D1. Likewise, we can obtain

s2^2 = w^T S2 w    (20.4)
Notice again that the scatter matrix is essentially the same as the covariance matrix, but instead of recording the average deviation from the mean, it records the total deviation, that is,

Si = ni Σi    (20.5)

where Σi is the covariance matrix for class ci. Combining Eqs. (20.3) and (20.4), the denominator in Eq. (20.1) can be rewritten as

s1^2 + s2^2 = w^T S1 w + w^T S2 w = w^T (S1 + S2) w = w^T S w    (20.6)
where S = S1 + S2 denotes the within-class scatter matrix for the pooled data. Because both S1 and S2 are d × d symmetric positive semidefinite matrices, S has the same properties.
Using Eqs. (20.2) and (20.6), we write the LDA objective function [Eq. (20.1)] as follows:
max J(w) = wTBw (20.7) w wTSw
Optimal Linear Discriminant 501
To solve for the best direction w, we differentiate the objective function with respect to w, and set the result to zero. We do not explicitly have to deal with the constraint that wT w = 1 because in Eq. (20.7) the terms related to the magnitude of w cancel out in the numerator and the denominator.
Recall that if f(x) and g(x) are two functions, then we have

d/dx ( f(x)/g(x) ) = ( f′(x) g(x) − g′(x) f(x) ) / g(x)^2

where f′(x) denotes the derivative of f(x). Taking the derivative of Eq. (20.7) with respect to the vector w, and setting the result to the zero vector, gives us

d/dw J(w) = ( 2Bw (w^T S w) − 2Sw (w^T B w) ) / (w^T S w)^2 = 0

which yields

Bw (w^T S w) = Sw (w^T B w)
Bw = Sw ( (w^T B w) / (w^T S w) )
Bw = J(w) Sw
Bw = λ Sw    (20.8)
where λ = J(w). Eq. (20.8) represents a generalized eigenvalue problem where λ is a generalized eigenvalue of B and S; the eigenvalue λ satisfies the equation det(B − λS) = 0. Because the goal is to maximize the objective [Eq. (20.7)], J(w) = λ should be chosen to be the largest generalized eigenvalue, and w to be the corresponding eigenvector. If S is nonsingular, that is, if S−1 exists, then Eq. (20.8) leads to the regular eigenvalue–eigenvector equation, as
Bw = λSw
S^{-1} B w = λ S^{-1} S w
(S^{-1} B) w = λ w    (20.9)
Thus, if S−1 exists, then λ = J(w) is an eigenvalue, and w is an eigenvector of the matrix S−1B. To maximize J(w) we look for the largest eigenvalue λ, and the corresponding dominant eigenvector w specifies the best linear discriminant vector.
Algorithm 20.1 shows the pseudo-code for linear discriminant analysis. Here, we assume that there are two classes, and that S is nonsingular (i.e., S^{-1} exists). The vector 1ni is the vector of all ones, with the appropriate dimension for each class, i.e., 1ni ∈ R^{ni} for class i = 1, 2. After dividing D into the two groups D1 and D2, LDA proceeds to compute the between-class and within-class scatter matrices, B and S. The optimal LD vector is obtained as the dominant eigenvector of S^{-1}B. In terms of computational complexity, computing S takes O(nd^2) time, and computing the dominant eigenvalue–eigenvector pair takes O(d^3) time in the worst case. Thus, the total time is O(d^3 + nd^2).
ALGORITHM 20.1. Linear Discriminant Analysis

LINEARDISCRIMINANT (D = {(xi, yi)}_{i=1}^n):
1 Di ← {xj | yj = ci, j = 1,...,n}, i = 1, 2 // class-specific subsets
2 μi ← mean(Di), i = 1, 2 // class means
3 B ← (μ1 − μ2)(μ1 − μ2)^T // between-class scatter matrix
4 Zi ← Di − 1ni μi^T, i = 1, 2 // center class matrices
5 Si ← Zi^T Zi, i = 1, 2 // class scatter matrices
6 S ← S1 + S2 // within-class scatter matrix
7 λ1, w ← eigen(S^{-1} B) // compute dominant eigenvector
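A minimal NumPy sketch of the two-class procedure (our own illustration; the variable names and the eigensolver choice are ours):

import numpy as np

def linear_discriminant(X, y, c1, c2):
    """Two-class LDA: return (lambda1, w), the dominant eigenpair of S^{-1} B."""
    D1, D2 = X[y == c1], X[y == c2]
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
    d = (mu1 - mu2).reshape(-1, 1)
    B = d @ d.T                                   # between-class scatter matrix
    Z1, Z2 = D1 - mu1, D2 - mu2                   # centered class matrices
    S = Z1.T @ Z1 + Z2.T @ Z2                     # within-class scatter matrix
    evals, evecs = np.linalg.eig(np.linalg.inv(S) @ B)
    i = np.argmax(evals.real)
    w = evecs[:, i].real
    return evals[i].real, w / np.linalg.norm(w)

On the 2-dimensional Iris data of Example 20.2 below, this should recover λ1 ≈ 0.11 and a direction proportional to (0.551, −0.834)^T, up to sign.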
Example 20.2 (Linear Discriminant Analysis). Consider the 2-dimensional Iris data (with attributes sepal length and sepal width) shown in Example 20.1. Class c1, corresponding to iris-setosa, has n1 = 50 points, whereas the other class c2 has n2 = 100 points. The means for the two classes c1 and c2, and their difference, are given as

μ1 = (5.01, 3.42)^T    μ2 = (6.26, 2.87)^T    μ1 − μ2 = (−1.256, 0.546)^T

The between-class scatter matrix is

B = (μ1 − μ2)(μ1 − μ2)^T = [1.587 −0.693; −0.693 0.303]

and the within-class scatter matrix is

S1 = [6.09 4.91; 4.91 7.11]    S2 = [43.5 12.09; 12.09 10.96]    S = S1 + S2 = [49.58 17.01; 17.01 18.08]

S is nonsingular, with its inverse given as

S^{-1} = [0.0298 −0.028; −0.028 0.0817]

Therefore, we have

S^{-1}B = [0.066 −0.029; −0.100 0.044]

The direction of most separation between c1 and c2 is the dominant eigenvector corresponding to the largest eigenvalue of the matrix S^{-1}B. The solution is

J(w) = λ1 = 0.11    w = (0.551, −0.834)^T

Figure 20.2 plots the optimal linear discriminant direction w, translated to the mean of the data. The projected means for the two classes are shown in black. We can clearly observe that along w the circles appear together as a group, and are quite well separated from the triangles. Except for one outlying circle corresponding to the point (4.5, 2.3)^T, all points in c1 are perfectly separated from points in c2.

For the two-class scenario, if S is nonsingular, we can directly solve for w without computing the eigenvalues and eigenvectors. Note that B = (μ1 − μ2)(μ1 − μ2)^T is a d × d rank-one matrix, and thus Bw must point in the same direction as (μ1 − μ2) because
Bw = (μ1 − μ2)(μ1 − μ2)^T w = (μ1 − μ2) ( (μ1 − μ2)^T w ) = b (μ1 − μ2)

where b = (μ1 − μ2)^T w is just a scalar multiplier.
We can then rewrite Eq. (20.9) as

Bw = λSw
b (μ1 − μ2) = λSw
w = (b/λ) S^{-1} (μ1 − μ2)

Because b/λ is just a scalar, we can solve for the best linear discriminant as

w = S^{-1} (μ1 − μ2)    (20.10)
Once the direction w has been found we can normalize it to be a unit vector. Thus, instead of solving for the eigenvalue/eigenvector, in the two class case, we immediately obtain the direction w using Eq. (20.10). Intuitively, the direction that maximizes the separation between the classes can be viewed as a linear transformation (by S−1 ) of the vector joining the two class means (μ1 − μ2 ).
Example 20.3. Continuing Example 20.2, we can directly compute w as follows:

w = S^{-1} (μ1 − μ2) = [0.0298 −0.028; −0.028 0.0817] (−1.256, 0.546)^T = (−0.0527, 0.0798)^T

After normalizing, we have

w = w/∥w∥ = (1/0.0956) (−0.0527, 0.0798)^T = (−0.551, 0.834)^T

Note that even though the sign is reversed for w, compared to that in Example 20.2, they represent the same direction; only the scalar multiplier is different.
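The closed form of Eq. (20.10) avoids the eigensolver entirely; a short illustrative snippet (ours, under the same assumptions as the previous sketch):

import numpy as np

def lda_direction_closed_form(D1, D2):
    """w = S^{-1}(mu1 - mu2), Eq. (20.10), returned as a unit vector."""
    mu1, mu2 = D1.mean(axis=0), D2.mean(axis=0)
    Z1, Z2 = D1 - mu1, D2 - mu2
    S = Z1.T @ Z1 + Z2.T @ Z2                 # within-class scatter matrix
    w = np.linalg.solve(S, mu1 - mu2)         # solve S w = mu1 - mu2
    return w / np.linalg.norm(w)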
20.2 KERNEL DISCRIMINANT ANALYSIS
Kernel discriminant analysis, like linear discriminant analysis, tries to find a direction that maximizes the separation between the classes. However, it does so in feature space via the use of kernel functions.
Given a dataset D = {(xi,yi)}ni=1, where xi is a point in input space and yi ∈ {c1,c2} is the class label, let Di = {xj |yj = ci } denote the data subset restricted to class ci , and let ni = |Di |. Further, let φ (xi ) denote the corresponding point in feature space, and let K be a kernel function.
The goal of kernel LDA is to find the direction vector w in feature space that maximizes
max_w J(w) = (m1 − m2)^2 / (s1^2 + s2^2)    (20.11)

where m1 and m2 are the projected means, and s1^2 and s2^2 are the projected scatter values in feature space. We first show that w can be expressed as a linear combination of the points in feature space, and then we transform the LDA objective in terms of the kernel matrix.
Optimal LD: Linear Combination of Feature Points
The mean for class ci in feature space is given as

μi^φ = (1/ni) Σ_{xj∈Di} φ(xj)    (20.12)

and the covariance matrix for class ci in feature space is

Σi^φ = (1/ni) Σ_{xj∈Di} ( φ(xj) − μi^φ )( φ(xj) − μi^φ )^T

Using a derivation similar to Eq. (20.2) we obtain an expression for the between-class scatter matrix in feature space

Bφ = (μ1^φ − μ2^φ)(μ1^φ − μ2^φ)^T = dφ dφ^T    (20.13)

where dφ = μ1^φ − μ2^φ is the difference between the two class mean vectors. Likewise, using Eqs. (20.5) and (20.6) the within-class scatter matrix in feature space is given as

Sφ = n1 Σ1^φ + n2 Σ2^φ

Sφ is a d × d symmetric, positive semidefinite matrix, where d is the dimensionality of the feature space. From Eq. (20.9), we conclude that the best linear discriminant vector w in feature space is the dominant eigenvector, which satisfies the expression

(Sφ^{-1} Bφ) w = λ w    (20.14)
where we assume that Sφ is nonsingular. Let δi denote the ith eigenvalue and ui the ith eigenvector of Sφ, for i = 1,...,d. The eigen-decomposition of Sφ yields Sφ = U Δ U^T, with the inverse of Sφ given as Sφ^{-1} = U Δ^{-1} U^T. Here U is the matrix whose columns are the eigenvectors of Sφ, and Δ is the diagonal matrix of eigenvalues of Sφ. The inverse Sφ^{-1} can thus be expressed as the spectral sum

Sφ^{-1} = Σ_{r=1}^d (1/δr) ur ur^T    (20.15)
Plugging Eqs. (20.13) and (20.15) into Eq. (20.14), we obtain

λ w = ( Σ_{r=1}^d (1/δr) ur ur^T ) dφ dφ^T w = Σ_{r=1}^d (1/δr) ur (ur^T dφ)(dφ^T w) = Σ_{r=1}^d br ur

where br = (1/δr)(ur^T dφ)(dφ^T w) is a scalar value. Using a derivation similar to that in Eq. (7.32), the rth eigenvector of Sφ can be expressed as a linear combination of the feature points, say ur = Σ_{j=1}^n crj φ(xj), where crj is a scalar coefficient. Thus, we can rewrite w as

w = (1/λ) Σ_{r=1}^d br Σ_{j=1}^n crj φ(xj) = Σ_{j=1}^n ( Σ_{r=1}^d br crj / λ ) φ(xj) = Σ_{j=1}^n aj φ(xj)

where aj = Σ_{r=1}^d br crj / λ is a scalar value for the feature point φ(xj). Therefore, the direction vector w can be expressed as a linear combination of the points in feature space.
LDA Objective via Kernel Matrix
We now rewrite the kernel LDA objective [Eq.(20.11)] in terms of the kernel matrix. Projecting the mean for class ci given in Eq. (20.12) onto the LD direction w, we have
mi = w^T μi^φ = ( Σ_{j=1}^n aj φ(xj) )^T ( (1/ni) Σ_{xk∈Di} φ(xk) )
   = (1/ni) Σ_{j=1}^n Σ_{xk∈Di} aj φ(xj)^T φ(xk)
   = (1/ni) Σ_{j=1}^n Σ_{xk∈Di} aj K(xj, xk)
   = a^T mi    (20.16)

where a = (a1, a2, ..., an)^T is the weight vector, and

mi = (1/ni) ( Σ_{xk∈Di} K(x1, xk), Σ_{xk∈Di} K(x2, xk), ..., Σ_{xk∈Di} K(xn, xk) )^T = (1/ni) K^{ci} 1ni    (20.17)

where K^{ci} is the n × ni subset of the kernel matrix, restricted to columns belonging to points only in Di, and 1ni is the ni-dimensional vector all of whose entries are one. The n-length vector mi thus stores, for each point in D, its average kernel value with respect to the points in Di.
We can rewrite the separation between the projected means in feature space as follows:

(m1 − m2)^2 = ( w^T μ1^φ − w^T μ2^φ )^2 = ( a^T m1 − a^T m2 )^2 = a^T (m1 − m2)(m1 − m2)^T a = a^T M a    (20.18)

where M = (m1 − m2)(m1 − m2)^T is the between-class scatter matrix.
We can also compute the projected scatter for each class, s1^2 and s2^2, purely in terms of the kernel function, as

s1^2 = Σ_{xi∈D1} ( w^T φ(xi) − w^T μ1^φ )^2
     = Σ_{xi∈D1} ( w^T φ(xi) )^2 − 2 Σ_{xi∈D1} w^T φ(xi) · w^T μ1^φ + Σ_{xi∈D1} ( w^T μ1^φ )^2
     = Σ_{xi∈D1} ( Σ_{j=1}^n aj φ(xj)^T φ(xi) )^2 − 2 · n1 · ( w^T μ1^φ )^2 + n1 · ( w^T μ1^φ )^2
     = Σ_{xi∈D1} a^T Ki Ki^T a − n1 · a^T m1 m1^T a    by using Eq. (20.16)
     = a^T ( Σ_{xi∈D1} Ki Ki^T − n1 m1 m1^T ) a
     = a^T N1 a

where Ki is the ith column of the kernel matrix, and N1 is the class scatter matrix for c1. Let K(xi, xj) = Kij. We can express N1 more compactly in matrix notation as follows:

N1 = Σ_{xi∈D1} Ki Ki^T − n1 m1 m1^T = (K^{c1}) ( I_{n1} − (1/n1) 1_{n1×n1} ) (K^{c1})^T    (20.19)
where I_{n1} is the n1 × n1 identity matrix and 1_{n1×n1} is the n1 × n1 matrix, all of whose entries are 1's.
In a similar manner we get s2^2 = a^T N2 a, where

N2 = (K^{c2}) ( I_{n2} − (1/n2) 1_{n2×n2} ) (K^{c2})^T

where I_{n2} is the n2 × n2 identity matrix and 1_{n2×n2} is the n2 × n2 matrix, all of whose entries are 1's.
The sum of projected scatter values is then given as

s1^2 + s2^2 = a^T (N1 + N2) a = a^T N a    (20.20)
where N is the n × n within-class scatter matrix.
Substituting Eqs. (20.18) and (20.20) in Eq. (20.11), we obtain the kernel LDA
maximization condition
max_w J(w) = max_a J(a) = (a^T M a) / (a^T N a)
Notice how all the terms in the expression above involve only kernel functions. The weight vector a is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem:

M a = λ1 N a    (20.21)

If N is nonsingular, a is the dominant eigenvector corresponding to the largest eigenvalue for the system

(N^{-1} M) a = λ1 a

As in the case of linear discriminant analysis [Eq. (20.10)], when there are only two classes we do not have to solve for the eigenvector because a can be obtained directly:

a = N^{-1} (m1 − m2)
Once a has been obtained, we can normalize w to be a unit vector by ensuring that w^T w = 1, which implies that

Σ_{i=1}^n Σ_{j=1}^n ai aj φ(xi)^T φ(xj) = 1, or a^T K a = 1

Put differently, we can ensure that w is a unit vector if we scale a by 1/√(a^T K a).
Finally, we can project any point x onto the discriminant direction, as follows:

w^T φ(x) = Σ_{j=1}^n aj φ(xj)^T φ(x) = Σ_{j=1}^n aj K(xj, x)    (20.22)
Algorithm 20.2 shows the pseudo-code for kernel discriminant analysis.
ALGORITHM 20.2. Kernel Discriminant Analysis

KERNELDISCRIMINANT (D = {(xi, yi)}_{i=1}^n, K):
1 K ← {K(xi, xj)}_{i,j=1,...,n} // compute n × n kernel matrix
2 K^{ci} ← {K(j, k) | yk = ci, 1 ≤ j, k ≤ n}, i = 1, 2 // class kernel matrix
3 mi ← (1/ni) K^{ci} 1ni, i = 1, 2 // class means
4 M ← (m1 − m2)(m1 − m2)^T // between-class scatter matrix
5 Ni ← K^{ci} ( I_{ni} − (1/ni) 1_{ni×ni} ) (K^{ci})^T, i = 1, 2 // class scatter matrices
6 N ← N1 + N2 // within-class scatter matrix
7 λ1, a ← eigen(N^{-1} M) // compute weight vector
8 a ← a / √(a^T K a) // normalize w to be unit vector
The method proceeds by computing the n × n kernel matrix K, and the n × ni class-specific kernel matrices K^{ci} for each class ci. After computing the between-class and within-class scatter matrices M and N, the weight vector a is obtained as the dominant eigenvector of N^{-1}M. The last step scales a so that w will be normalized to be unit length. The complexity of kernel discriminant analysis is O(n^3), with the dominant steps being the computation of N and solving for the dominant eigenvector of N^{-1}M, both of which take O(n^3) time.
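A rough NumPy sketch of the two-class procedure (our own illustration; the kernel is passed in as a function, and a small ridge term, which is not part of the algorithm above, is added in case N is singular):

import numpy as np

def kernel_discriminant(X, y, c1, c2, kernel):
    """Return the weight vector a and a projection function for kernel LDA."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # n x n kernel matrix
    idx1, idx2 = np.where(y == c1)[0], np.where(y == c2)[0]
    n1, n2 = len(idx1), len(idx2)
    K1, K2 = K[:, idx1], K[:, idx2]                            # n x n_i class kernels
    m1, m2 = K1.sum(axis=1) / n1, K2.sum(axis=1) / n2          # Eq. (20.17)
    N1 = K1 @ (np.eye(n1) - np.ones((n1, n1)) / n1) @ K1.T     # Eq. (20.19)
    N2 = K2 @ (np.eye(n2) - np.ones((n2, n2)) / n2) @ K2.T
    N = N1 + N2 + 1e-8 * np.eye(n)           # ridge for numerical stability (our addition)
    a = np.linalg.solve(N, m1 - m2)          # two-class shortcut: a = N^{-1}(m1 - m2)
    a = a / np.sqrt(a @ K @ a)               # scale so that w^T w = 1
    project = lambda x: sum(a[j] * kernel(X[j], x) for j in range(n))  # Eq. (20.22)
    return a, project

With the homogeneous quadratic kernel, kernel = lambda u, v: (u @ v) ** 2, this is the setting used in Example 20.4.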
Example 20.4 (Kernel Discriminant Analysis). Consider the 2-dimensional Iris dataset comprising the sepal length and sepal width attributes. Figure 20.3a shows the points projected onto the first two principal components. The points have been divided into two classes: c1 (circles) corresponds to Iris-versicolor and c2 (triangles) corresponds to the other two Iris types. Here n1 = 50 and n2 = 100, with a total of n = 150 points.
Because c1 is surrounded by points in c2, a good linear discriminant will not be found. Instead, we apply kernel discriminant analysis using the homogeneous quadratic kernel

K(xi, xj) = (xi^T xj)^2

Solving for a via Eq. (20.21) yields
λ1 = 0.0511
However, we do not show a because it lies in R150. Figure 20.3a shows the contours of constant projections onto the best kernel discriminant. The contours are obtained by solving Eq. (20.22), that is, by solving wT φ (x) = nj =1 aj K(xj , x) = c for different values of the scalars c. The contours are hyperbolic, and thus form pairs starting from the center. For instance, the first curve on the left and right of the origin (0,0)T forms the same contour, that is, points along both the curves have the same value when projected onto w. We can see that contours or pairs of curves starting with the fourth curve (on the left and right) from the center all relate to class c2, whereas the first three contours deal mainly with class c1, indicating good discrimination with the homogeneous quadratic kernel.
Figure 20.3. Kernel discriminant analysis: quadratic homogeneous kernel.
A better picture emerges when we plot the coordinates of all the points xi ∈ D when projected onto w, as shown in Figure 20.3b. We can observe that w is able to separate the two classes reasonably well; all the circles (c1) are concentrated on the left, whereas the triangles (c2) are spread out on the right. The projected means are shown in white. The scatters and means for both classes after projection are as follows:

m1 = 0.338    m2 = 4.476
s1^2 = 13.862    s2^2 = 320.934

The value of J(w) is given as

J(w) = (m1 − m2)^2 / (s1^2 + s2^2) = (0.338 − 4.476)^2 / (13.862 + 320.934) = 17.123 / 334.796 = 0.0511

which, as expected, matches λ1 = 0.0511 from above.
In general, it is not desirable or possible to obtain an explicit discriminant vector w, since it lies in feature space. However, because each point x = (x1, x2)^T ∈ R^2 in input space is mapped to the point φ(x) = (√2 x1x2, x1^2, x2^2)^T ∈ R^3 in feature space via the homogeneous quadratic kernel, for our example it is possible to visualize the feature space, as illustrated in Figure 20.4. The projection of each point φ(xi) onto the discriminant vector w is also shown, where

w = 0.511 x1x2 + 0.761 x1^2 − 0.4 x2^2

The projections onto w are identical to those shown in Figure 20.3b.

Figure 20.4. Homogeneous quadratic kernel feature space.

20.3 FURTHER READING
Linear discriminant analysis was introduced in Fisher (1936). Its extension to kernel discriminant analysis was proposed in Mika et al. (1999). The 2-class LDA approach can be generalized to k > 2 classes by finding the optimal (k − 1)-dimensional subspace projection that best discriminates between the k classes; see Duda, Hart, and Stork (2012) for details.
Duda, R. O., Hart, P. E., and Stork, D. G. (2012). Pattern classification. New York: Wiley-Interscience.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7 (2): 179–188.
Mika, S., Ratsch, G., Weston, J., Scholkopf, B., and Mullers, K. (1999). Fisher discriminant analysis with kernels. Proceedings of the IEEE Neural Networks for Signal Processing Workshop. IEEE, pp. 41–48.
20.4 EXERCISES
Q1. Consider the data shown in Table 20.1. Answer the following questions:
(a) Compute μ+1 and μ−1, and B, the between-class scatter matrix.
(b) Compute S+1 and S−1, and S, the within-class scatter matrix.
(c) Find the best direction w that discriminates between the classes. Use the fact that the inverse of the matrix A = [a b; c d] is given as A^{-1} = (1/det(A)) [d −b; −c a].
(d) Having found the direction w, find the point on w that best separates the two classes.
Table 20.1. Dataset for Q1

i     xi           yi
x1    (4, 2.9)     1
x2    (3.5, 4)     1
x3    (2.5, 1)     −1
x4    (2, 2.1)     −1

Q2. Given the labeled points (from two classes) shown in Figure 20.5, and given that the inverse of the within-class scatter matrix is

S^{-1} = [0.056 −0.029; −0.029 0.052]

find the best linear discriminant line w, and sketch it.

Figure 20.5. Dataset for Q2.

Q3. Maximize the objective in Eq. (20.7) by explicitly considering the constraint w^T w = 1, that is, by using a Lagrange multiplier for that constraint.

Q4. Prove the equality in Eq. (20.19). That is, show that

N1 = Σ_{xi∈D1} Ki Ki^T − n1 m1 m1^T = (K^{c1}) ( I_{n1} − (1/n1) 1_{n1×n1} ) (K^{c1})^T
CHAPTER 21 Support Vector Machines
In this chapter we describe Support Vector Machines (SVMs), a classification method based on maximum margin linear discriminants, that is, the goal is to find the optimal hyperplane that maximizes the gap or margin between the classes. Further, we can use the kernel trick to find the optimal nonlinear decision boundary between classes, which corresponds to a hyperplane in some high-dimensional “nonlinear” space.
21.1 SUPPORT VECTORS AND MARGINS
Let D = {(xi,yi)}ni=1 be a classification dataset, with n points in a d-dimensional space. Further, let us assume that there are only two class labels, that is, yi ∈ {+1,−1}, denoting the positive and negative classes.
Hyperplanes
A hyperplane in d dimensions is given as the set of all points x ∈ Rd that satisfy the equation h(x) = 0, where h(x) is the hyperplane function, defined as follows:
h(x) = w^T x + b = w1 x1 + w2 x2 + ··· + wd xd + b    (21.1)
Here, w is a d dimensional weight vector and b is a scalar, called the bias. For points that lie on the hyperplane, we have
h(x)=wTx+b=0 (21.2)
The hyperplane is thus defined as the set of all points such that w^T x = −b. To see the role played by b, assume that w1 ≠ 0 and set xi = 0 for all i > 1; then by Eq. (21.2) we obtain the offset where the hyperplane intersects the first axis:

w1 x1 = −b, or x1 = −b/w1

In other words, the point (−b/w1, 0, ..., 0) lies on the hyperplane. In a similar manner, we can obtain the offset where the hyperplane intersects each of the axes, which is given as −b/wi (provided wi ≠ 0).
Separating Hyperplane
A hyperplane splits the original d-dimensional space into two half-spaces. A dataset is said to be linearly separable if each half-space has points only from a single class. If the input dataset is linearly separable, then we can find a separating hyperplane h(x) = 0, such that for all points labeled yi = −1, we have h(xi ) < 0, and for all points labeled yi = +1, we have h(xi ) > 0. In fact, the hyperplane function h(x) serves as a linear classifier or a linear discriminant, which predicts the class y for any given point x, according to the decision rule:
y = +1 if h(x) > 0, and y = −1 if h(x) < 0    (21.3)
Let a1 and a2 be two arbitrary points that lie on the hyperplane. From Eq. (21.2) we have
h(a1) = w^T a1 + b = 0
h(a2) = w^T a2 + b = 0
Subtracting one from the other we obtain
wT(a1 −a2)=0
This means that the weight vector w is orthogonal to the hyperplane because it is orthogonal to any arbitrary vector (a1 − a2) on the hyperplane. In other words, the weight vector w specifies the direction that is normal to the hyperplane, which fixes the orientation of the hyperplane, whereas the bias b fixes the offset of the hyperplane in the d-dimensional space. Because both w and −w are normal to the hyperplane, we remove this ambiguity by requiring that h(xi ) > 0 when yi = 1, and h(xi ) < 0 when yi =−1.
Distance of a Point to the Hyperplane
Consider a point x ∈ R^d, such that x does not lie on the hyperplane. Let xp be the orthogonal projection of x on the hyperplane, and let r = x − xp; then as shown in Figure 21.1 we can write x as

x = xp + r = xp + r (w/∥w∥)    (21.4)

where r is the directed distance of the point x from xp, that is, r gives the offset of x from xp in terms of the unit weight vector w/∥w∥. The offset r is positive if r is in the same direction as w, and r is negative if r is in a direction opposite to w.
Figure 21.1. Geometry of a separating hyperplane in 2D. Points labeled +1 are shown as circles, and those labeled −1 are shown as triangles. The hyperplane h(x) = 0 divides the space into two half-spaces. The shaded region comprises all points x satisfying h(x) < 0, whereas the unshaded region consists of all points satisfying h(x) > 0. The unit weight vector w/∥w∥ (in gray) is orthogonal to the hyperplane. The directed distance of the origin to the hyperplane is b/∥w∥.
Plugging Eq. (21.4) into the hyperplane function [Eq. (21.1)], we get

h(x) = h( xp + r (w/∥w∥) )
     = w^T ( xp + r (w/∥w∥) ) + b
     = w^T xp + b + r (w^T w / ∥w∥)
     = h(xp) + r ∥w∥
     = r ∥w∥

The last step follows from the fact that h(xp) = 0 because xp lies on the hyperplane. Using the result above, we obtain an expression for the directed distance of a point to the hyperplane:

r = h(x) / ∥w∥
To obtain the distance, which must be non-negative, we can conveniently multiply r by the class label y of the point, because when h(x) < 0 the class is −1, and when h(x) > 0 the class is +1. The distance of a point x from the hyperplane h(x) = 0 is thus given as

δ = y r = y h(x) / ∥w∥    (21.5)

In particular, for the origin x = 0, the directed distance is

r = h(0) / ∥w∥ = (w^T 0 + b) / ∥w∥ = b / ∥w∥
as illustrated in Figure 21.1.
Example 21.1. Consider the example shown in Figure 21.1. In this 2-dimensional example, the hyperplane is just a line, defined as the set of all points x = (x1, x2)^T that satisfy the following equation:

h(x) = w^T x + b = w1 x1 + w2 x2 + b = 0

Rearranging the terms we get

x2 = −(w1/w2) x1 − b/w2

where −w1/w2 is the slope of the line, and −b/w2 is the intercept along the second dimension.
Consider any two points on the hyperplane, say p = (p1, p2) = (4, 0) and q = (q1, q2) = (2, 5). The slope is given as

−w1/w2 = (q2 − p2) / (q1 − p1) = (5 − 0) / (2 − 4) = −5/2

which implies that w1 = 5 and w2 = 2. Given any point on the hyperplane, say (4, 0), we can compute the offset b directly as follows:

b = −5 x1 − 2 x2 = −5 · 4 − 2 · 0 = −20

Thus, w = (5, 2)^T is the weight vector and b = −20 is the bias, and the equation of the hyperplane is given as

h(x) = w^T x + b = 5 x1 + 2 x2 − 20 = 0

One can verify that the distance of the origin 0 from the hyperplane is given as

δ = y r = (−1) r = −b / ∥w∥ = −(−20) / √29 = 3.71
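The quantities in this example are easy to check numerically; the following small snippet (ours, not from the text) evaluates h(x) and the distances of Eq. (21.5):

import numpy as np

w, b = np.array([5.0, 2.0]), -20.0           # hyperplane of Example 21.1

def h(x):                                    # hyperplane function, Eq. (21.1)
    return w @ x + b

def directed_distance(x):                    # r = h(x) / ||w||
    return h(x) / np.linalg.norm(w)

def distance(x, y):                          # delta = y * h(x) / ||w||, Eq. (21.5)
    return y * directed_distance(x)

print(distance(np.array([0.0, 0.0]), -1))    # about 3.71, as computed above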
Margin and Support Vectors of a Hyperplane
Given a training dataset of labeled points, D = {xi,yi}ni=1 with yi ∈ {+1,−1}, and given a separating hyperplane h(x) = 0, for each point xi we can find its distance to the hyperplane by Eq. (21.5):
δi = yi h(xi) / ∥w∥ = yi (w^T xi + b) / ∥w∥

Over all the n points, we define the margin of the linear classifier as the minimum distance of a point from the separating hyperplane, given as

δ∗ = min_{xi} { yi (w^T xi + b) / ∥w∥ }    (21.6)

Note that δ∗ ≠ 0, since h(x) is assumed to be a separating hyperplane, and Eq. (21.3) must be satisfied.
All the points (or vectors) that achieve this minimum distance are called support vectors for the hyperplane. In other words, a support vector x∗ is a point that lies precisely on the margin of the classifier, and thus satisfies the condition

δ∗ = y∗ (w^T x∗ + b) / ∥w∥

where y∗ is the class label for x∗. The numerator y∗ (w^T x∗ + b) gives the absolute distance of the support vector to the hyperplane, whereas the denominator ∥w∥ makes it a relative distance in terms of w.
Canonical Hyperplane
Consider the equation of the hyperplane [Eq. (21.2)]. Multiplying on both sides by some scalar s yields an equivalent hyperplane:
s h(x) = s wTx + s b = (sw)Tx + (sb) = 0
To obtain the unique or canonical hyperplane, we choose the scalar s such that the
absolute distance of a support vector from the hyperplane is 1. That is,

s y∗ (w^T x∗ + b) = 1

which implies

s = 1 / ( y∗ (w^T x∗ + b) ) = 1 / ( y∗ h(x∗) )    (21.7)

Henceforth, we will assume that any separating hyperplane is canonical. That is, it has already been suitably rescaled so that y∗ h(x∗) = 1 for a support vector x∗, and the margin is given as

δ∗ = y∗ h(x∗) / ∥w∥ = 1 / ∥w∥

For the canonical hyperplane, for each support vector x∗i (with label yi∗), we have yi∗ h(x∗i) = 1, and for any point that is not a support vector we have yi h(xi) > 1,
Figure 21.2. Margin of a separating hyperplane: 1/∥w∥ is the margin, and the shaded points are the support vectors.
because, by definition, it must be farther from the hyperplane than a support vector. Over all the n points in the dataset D, we thus obtain the following set of inequalities:
yi (wT xi + b) ≥ 1, for all points xi ∈ D (21.8)
Example 21.2. Figure 21.2 gives an illustration of the support vectors and the margin of a hyperplane. The equation of the separating hyperplane is

h(x) = (5, 2) x − 20 = 5 x1 + 2 x2 − 20 = 0

Consider the support vector x∗ = (2, 2)^T, with class y∗ = −1. To find the canonical hyperplane equation, we have to rescale the weight vector and bias by the scalar s, obtained using Eq. (21.7):

s = 1 / ( y∗ h(x∗) ) = 1 / ( −1 (5 · 2 + 2 · 2 − 20) ) = 1/6

Thus, the rescaled weight vector is

w = (1/6) (5, 2)^T = (5/6, 2/6)^T

and the rescaled bias is

b = −20/6

The canonical form of the hyperplane is therefore

h(x) = (5/6, 2/6) x − 20/6 = (0.833, 0.333) x − 3.33

and the margin of the canonical hyperplane is

δ∗ = y∗ h(x∗) / ∥w∥ = 1 / √( (5/6)^2 + (2/6)^2 ) = 6/√29 = 1.114
In this example there are five support vectors (shown as shaded points), namely, (2,2)T and (2.5,0.75)T with class y = −1 (shown as triangles), and (3.5,4.25)T, (4,3)T, and (4.5, 1.75)T with class y = +1 (shown as circles), as illustrated in Figure 21.2.
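Given a candidate separating hyperplane and the labeled points, the margin of Eq. (21.6) and the support vectors can be found directly; a short sketch (our own code, with a tolerance parameter of our choosing) is:

import numpy as np

def margin_and_support_vectors(X, y, w, b, tol=1e-6):
    """Return the margin delta* and the indices of the support vectors.

    X : (n, d) points, y : labels in {+1, -1}, (w, b) : separating hyperplane.
    """
    dist = y * (X @ w + b) / np.linalg.norm(w)   # per-point distances, Eq. (21.5)
    delta_star = dist.min()                      # margin, Eq. (21.6)
    support = np.where(dist <= delta_star + tol)[0]
    return delta_star, support

For the canonical hyperplane of Example 21.2 the margin should come out as 1/∥w∥ ≈ 1.114, with the five shaded points as the support vectors.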
21.2 SVM: LINEAR AND SEPARABLE CASE
Given a dataset D = {xi,yi}ni=1 with xi ∈ Rd and yi ∈ {+1,−1}, let us assume for the moment that the points are linearly separable, that is, there exists a separating hyperplane that perfectly classifies each point. In other words, all points labeled yi = +1 lie on one side (h(x) > 0) and all points labeled yi = −1 lie on the other side (h(x) < 0) of the hyperplane. It is obvious that in the linearly separable case, there are in fact an infinite number of such separating hyperplanes. Which one should we choose?
Maximum Margin Hyperplane
The fundamental idea behind SVMs is to choose the canonical hyperplane, specified by the weight vector w and the bias b, that yields the maximum margin among all possible separating hyperplanes. If δ∗h represents the margin for hyperplane h(x) = 0, then the goal is to find the optimal hyperplane h∗:

h∗ = argmax_h { δ∗h } = argmax_{w,b} { 1/∥w∥ }

The SVM task is to find the hyperplane that maximizes the margin 1/∥w∥, subject to the n constraints given in Eq. (21.8), namely, yi (w^T xi + b) ≥ 1, for all points xi ∈ D. Notice that instead of maximizing the margin 1/∥w∥, we can minimize ∥w∥. In fact, we can obtain an equivalent minimization formulation given as follows:

Objective Function: min_{w,b} { ∥w∥^2 / 2 }
Linear Constraints: yi (w^T xi + b) ≥ 1, ∀xi ∈ D

We can directly solve the above primal convex minimization problem with the n linear constraints using standard optimization algorithms, as outlined later in
Section 21.5. However, it is more common to solve the dual problem, which is obtained via the use of Lagrange multipliers. The main idea is to introduce a Lagrange multiplier αi for each constraint, which satisfies the Karush–Kuhn–Tucker (KKT) conditions at the optimal solution:
αi ( yi (w^T xi + b) − 1 ) = 0 and αi ≥ 0
Incorporating all the n constraints, the new objective function, called the Lagrangian, then becomes
min L = (1/2) ∥w∥^2 − Σ_{i=1}^n αi ( yi (w^T xi + b) − 1 )    (21.9)
L should be minimized with respect to w and b, and it should be maximized with respect to αi .
Taking the derivative of L with respect to w and b, and setting those to zero, we obtain
∂n n ∂wL=w− αiyixi =0 or w=
i=1 i=1 ∂ n
∂bL= αiyi =0 i=1
αiyixi
(21.10) (21.11)
The above equations give important intuition about the optimal weight vector w. In particular, Eq. (21.10) implies that w can be expressed as a linear combination of the data points xi, with the signed Lagrange multipliers, αiyi, serving as the coefficients. Further, Eq. (21.11) implies that the sum of the signed Lagrange multipliers, αi yi , must be zero.
Plugging these into Eq. (21.9), we obtain the dual Lagrangian objective function, which is specified purely in terms of the Lagrange multipliers:

Ldual = (1/2) w^T w − w^T ( Σ_{i=1}^n αi yi xi ) − b Σ_{i=1}^n αi yi + Σ_{i=1}^n αi
      = −(1/2) w^T w + Σ_{i=1}^n αi
      = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi^T xj

The dual objective is thus given as

Objective Function: max_α Ldual = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi^T xj
Linear Constraints: αi ≥ 0, ∀i ∈ D, and Σ_{i=1}^n αi yi = 0    (21.12)
where α = (α1,α2,...,αn)T is the vector comprising the Lagrange multipliers. Ldual is a convex quadratic programming problem (note the αi αj terms), which can be solved using standard optimization techniques. See Section 21.5 for a gradient-based method for solving the dual formulation.
Weight Vector and Bias
Once we have obtained the αi values for i = 1,...,n, we can solve for the weight vector w and the bias b. Note that according to the KKT conditions, we have

αi ( yi (w^T xi + b) − 1 ) = 0

which gives rise to two cases:
(1) αi = 0, or
(2) yi (w^T xi + b) − 1 = 0, which implies yi (w^T xi + b) = 1

This is a very important result because if αi > 0, then yi (w^T xi + b) = 1, and thus the point xi must be a support vector. On the other hand, if yi (w^T xi + b) > 1, then αi = 0, that is, if a point is not a support vector, then αi = 0.
Once we know αi for all points, we can compute the weight vector w using Eq. (21.10), but by taking the summation only over the support vectors:

w = Σ_{i, αi>0} αi yi xi    (21.13)

In other words, w is obtained as a linear combination of the support vectors, with the αi yi's representing the weights. The rest of the points (with αi = 0) are not support vectors and thus do not play a role in determining w.
To compute the bias b, we first compute one solution bi, per support vector, as follows:

αi ( yi (w^T xi + b) − 1 ) = 0
yi (w^T xi + b) = 1
bi = 1/yi − w^T xi = yi − w^T xi    (21.14)

We can take b as the average bias value over all the support vectors:

b = avg_{αi>0} {bi}    (21.15)
SVM Classifier

Given the optimal hyperplane function h(x) = w^T x + b, for any new point z, we predict its class as

ŷ = sign(h(z)) = sign(w^T z + b)    (21.16)
where the sign(·) function returns +1 if its argument is positive, and −1 if its argument is negative.
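Solving the dual is a quadratic program, and dedicated solvers (or the gradient method outlined in Section 21.5) are normally used. Purely as an illustration, the following rough sketch runs a projected coordinate-ascent on Ldual and then recovers the hyperplane via Eq. (21.13); the bias is absorbed by augmenting each point with a constant 1, which removes the equality constraint — a simplification of ours, not the formulation in the text, and the iteration count and C are arbitrary choices.

import numpy as np

def svm_dual_coordinate_ascent(X, y, C=10.0, iters=200):
    """Rough soft-margin SVM trainer via coordinate ascent on the dual.

    X : (n, d) points, y : labels in {+1, -1} as floats. Returns (w, b).
    """
    Xa = np.hstack([X, np.ones((len(X), 1))])      # augmented points (x, 1)
    K = Xa @ Xa.T                                  # linear kernel on augmented data
    n = len(Xa)
    alpha = np.zeros(n)
    for _ in range(iters):
        for k in range(n):
            # gradient of L_dual with respect to alpha_k
            grad = 1.0 - y[k] * np.sum(alpha * y * K[:, k])
            alpha[k] = np.clip(alpha[k] + grad / K[k, k], 0.0, C)
    w_aug = (alpha * y) @ Xa                       # w = sum_i alpha_i y_i x_i
    return w_aug[:-1], w_aug[-1]                   # split off the bias component

On the separable data of Table 21.1 this should produce a hyperplane close to the one derived in Example 21.3, although the numbers will not match exactly because the bias is regularized here.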
Table 21.1. Dataset corresponding to Figure 21.2

xi     xi1    xi2    yi
x1     3.5    4.25   +1
x2     4      3      +1
x3     4      4      +1
x4     4.5    1.75   +1
x5     4.9    4.5    +1
x6     5      4      +1
x7     5.5    2.5    +1
x8     5.5    3.5    +1
x9     0.5    1.5    −1
x10    1      2.5    −1
x11    1.25   0.5    −1
x12    1.5    1.5    −1
x13    2      2      −1
x14    2.5    0.75   −1
Example 21.3. Let us continue with the example dataset shown in Figure 21.2. The dataset has 14 points as shown in Table 21.1.
Solving the Ldual quadratic program yields the following nonzero values for the Lagrangian multipliers, which determine the support vectors:

xi     xi1    xi2    yi    αi
x1     3.5    4.25   +1    0.0437
x2     4      3      +1    0.2162
x4     4.5    1.75   +1    0.1427
x13    2      2      −1    0.3589
x14    2.5    0.75   −1    0.0437

All other points have αi = 0 and therefore they are not support vectors. Using Eq. (21.13), we can compute the weight vector for the hyperplane:

w = Σ_{i, αi>0} αi yi xi
  = 0.0437 (3.5, 4.25)^T + 0.2162 (4, 3)^T + 0.1427 (4.5, 1.75)^T − 0.3589 (2, 2)^T − 0.0437 (2.5, 0.75)^T
  = (0.833, 0.334)^T

The final bias is the average of the bias obtained from each support vector using Eq. (21.14):

xi     w^T xi    bi = yi − w^T xi
x1     4.332     −3.332
x2     4.331     −3.331
x4     4.331     −3.331
x13    2.333     −3.333
x14    2.332     −3.332

b = avg{bi} = −3.332

Thus, the optimal hyperplane is given as follows:

h(x) = (0.833, 0.334) x − 3.332 = 0

which matches the canonical hyperplane in Example 21.2.
21.3 SOFT MARGIN SVM: LINEAR AND NONSEPARABLE CASE
So far we have assumed that the dataset is perfectly linearly separable. Here we consider the case where the classes overlap to some extent so that a perfect separation is not possible, as depicted in Figure 21.3.
Figure 21.3. Soft margin hyperplane: the shaded points are the support vectors. The margin is 1/∥w∥ as illustrated, and points with positive slack values are also shown (thin black line).
Recall that when points are linearly separable we can find a separating hyperplane so that all points satisfy the condition yi (wT xi + b) ≥ 1. SVMs can handle non-separable points by introducing slack variables ξi in Eq. (21.8), as follows:
yi(wTxi +b)≥1−ξi
where ξi ≥ 0 is the slack variable for point xi, which indicates how much the point violates the separability condition, that is, the point may no longer be at least 1/∥w∥ away from the hyperplane. The slack values indicate three types of points. If ξi = 0, then the corresponding point xi is at least 1/∥w∥ away from the hyperplane. If 0 < ξi < 1, then the point is within the margin and still correctly classified, that is, it is on the correct side of the hyperplane. However, if ξi ≥ 1 then the point is misclassified and appears on the wrong side of the hyperplane.
In the nonseparable case, also called the soft margin case, the goal of SVM classification is to find the hyperplane with maximum margin that also minimizes the slack terms. The new objective function is given as

Objective Function: min_{w,b,ξi} { ∥w∥^2 / 2 + C Σ_{i=1}^n (ξi)^k }
Linear Constraints: yi (w^T xi + b) ≥ 1 − ξi, ∀xi ∈ D
                    ξi ≥ 0, ∀xi ∈ D    (21.17)
where C and k are constants that incorporate the cost of misclassification. The term Σ_{i=1}^n (ξi)^k gives the loss, that is, an estimate of the deviation from the separable case. The scalar C, which is chosen empirically, is a regularization constant that controls the trade-off between maximizing the margin (corresponding to minimizing ∥w∥^2/2) or minimizing the loss (corresponding to minimizing the sum of the slack terms Σ_{i=1}^n (ξi)^k). For example, if C → 0, then the loss component essentially disappears, and the objective defaults to maximizing the margin. On the other hand, if C → ∞, then the margin ceases to have much effect, and the objective function tries to minimize the loss. The constant k governs the form of the loss. Typically k is set to 1 or 2. When k = 1, called hinge loss, the goal is to minimize the sum of the slack variables, whereas when k = 2, called quadratic loss, the goal is to minimize the sum of the squared slack variables.
21.3.1 Hinge Loss
Assuming k = 1, we can compute the Lagrangian for the optimization problem in Eq. (21.17) by introducing Lagrange multipliers αi and βi that satisfy the following KKT conditions at the optimal solution:
αi ( yi (w^T xi + b) − 1 + ξi ) = 0 with αi ≥ 0
βi (ξi − 0) = 0 with βi ≥ 0    (21.18)

The Lagrangian is then given as

L = (1/2) ∥w∥^2 + C Σ_{i=1}^n ξi − Σ_{i=1}^n αi ( yi (w^T xi + b) − 1 + ξi ) − Σ_{i=1}^n βi ξi    (21.19)
We turn this into a dual Lagrangian by taking its partial derivatives with respect to w, b, and ξi, and setting those to zero:

∂L/∂w = w − Σ_{i=1}^n αi yi xi = 0   or   w = Σ_{i=1}^n αi yi xi
∂L/∂b = Σ_{i=1}^n αi yi = 0
∂L/∂ξi = C − αi − βi = 0   or   βi = C − αi                                  (21.20)

Plugging these values into Eq. (21.19), we get

L_dual = (1/2) w^T w − w^T (Σ_{i=1}^n αi yi xi) − b (Σ_{i=1}^n αi yi) + Σ_{i=1}^n αi + Σ_{i=1}^n (C − αi − βi) ξi
       = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi^T xj

where the second step follows because the first parenthesized sum equals w, Σ_{i=1}^n αi yi = 0, and C − αi − βi = 0.
The dual objective is thus given as
Objective Function:  max_α L_dual = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi^T xj
Linear Constraints:  0 ≤ αi ≤ C, ∀xi ∈ D, and Σ_{i=1}^n αi yi = 0            (21.21)

Notice that the objective is the same as the dual Lagrangian in the linearly separable case [Eq. (21.12)]. However, the constraints on the αi's are different because we now require that αi + βi = C with αi ≥ 0 and βi ≥ 0, which implies that 0 ≤ αi ≤ C. Section 21.5 describes a gradient ascent approach for solving this dual objective function.
Weight Vector and Bias
Once we solve for αi , we have the same situation as before, namely, αi = 0 for points that are not support vectors, and αi > 0 only for the support vectors, which comprise all points xi for which we have
yi(wTxi +b)=1−ξi (21.22)
Notice that the support vectors now include all points that are on the margin, which have zero slack (ξi = 0), as well as all points with positive slack (ξi > 0).
We can obtain the weight vector from the support vectors as before:
w = Σ_{i, αi>0} αi yi xi                                                     (21.23)

We can also solve for the βi using Eq. (21.20):

βi = C − αi
Replacing βi in the KKT conditions [Eq. (21.18)] with the expression from above, we obtain

(C − αi) ξi = 0                                                              (21.24)

Thus, for the support vectors with αi > 0, we have two cases to consider:

(1) ξi > 0, which implies that C − αi = 0, that is, αi = C, or
(2) C − αi > 0, that is, αi < C. In this case, from Eq. (21.24) we must have ξi = 0. In other words, these are precisely those support vectors that are on the margin.

Using those support vectors that are on the margin, that is, those that have 0 < αi < C and ξi = 0, we can solve for bi:

αi (yi (w^T xi + bi) − 1) = 0
yi (w^T xi + bi) = 1
bi = 1/yi − w^T xi = yi − w^T xi                                             (21.25)
To obtain the final bias b, we can take the average over all the bi values. From Eqs. (21.23) and (21.25), both the weight vector w and the bias term b can be computed without explicitly computing the slack terms ξi for each point.
Once the optimal hyperplane has been determined, the SVM model predicts the class for a new point z as follows:
yˆ = sign(h(z)) = sign(wTz + b)
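Since w and b are fully determined by the support vectors, the recovery step can be written compactly. The following is a minimal NumPy sketch assuming a dual solution α for the linear, hinge-loss case is already available (for example, from the gradient ascent method of Section 21.5); the function names and the tolerance parameter are our own, not part of the text.

```python
import numpy as np

def recover_hyperplane(X, y, alpha, C, tol=1e-8):
    """Recover (w, b) from dual variables for the linear soft-margin SVM.

    w uses all support vectors (alpha_i > 0); b is averaged only over the
    support vectors on the margin (0 < alpha_i < C), per Eq. (21.25)."""
    sv = alpha > tol
    w = (alpha[sv] * y[sv]) @ X[sv]               # w = sum_i alpha_i y_i x_i
    on_margin = sv & (alpha < C - tol)
    b = np.mean(y[on_margin] - X[on_margin] @ w)  # b_i = y_i - w^T x_i, averaged
    return w, b

def predict(X_new, w, b):
    # Class prediction: sign(w^T z + b)
    return np.sign(X_new @ w + b)
```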
Example 21.4. Let us consider the data points shown in Figure 21.3. There are four new points in addition to the 14 points from Table 21.1 that we considered in Example 21.3; these points are
    xi     xi1    xi2    yi
    x15     4      2     +1
    x16     2      3     +1
    x17     3      2     −1
    x18     5      3     −1

Let k = 1 and C = 1; then solving the L_dual yields the following support vectors and Lagrangian values αi:

    xi     xi1    xi2     yi      αi
    x1     3.5    4.25    +1    0.0271
    x2     4      3       +1    0.2162
    x4     4.5    1.75    +1    0.9928
    x13    2      2       −1    0.9928
    x14    2.5    0.75    −1    0.2434
    x15    4      2       +1    1
    x16    2      3       +1    1
    x17    3      2       −1    1
    x18    5      3       −1    1

All other points are not support vectors, having αi = 0. Using Eq. (21.23) we compute the weight vector for the hyperplane:

w = Σ_{i, αi>0} αi yi xi
  = 0.0271 (3.5, 4.25)^T + 0.2162 (4, 3)^T + 0.9928 (4.5, 1.75)^T − 0.9928 (2, 2)^T − 0.2434 (2.5, 0.75)^T
    + (4, 2)^T + (2, 3)^T − (3, 2)^T − (5, 3)^T
  = (0.834, 0.333)^T

The final bias is the average of the biases obtained from each support vector using Eq. (21.25). Note that we compute the per-point bias only for the support vectors that lie precisely on the margin. These support vectors have ξi = 0 and 0 < αi < C. Put another way, we do not compute the bias for support vectors with αi = C = 1, which include the points x15, x16, x17, and x18. From the remaining support vectors, we get

    xi     w^T xi     bi = yi − w^T xi
    x1     4.334         −3.334
    x2     4.334         −3.334
    x4     4.334         −3.334
    x13    2.334         −3.334
    x14    2.334         −3.334

    b = avg{bi} = −3.334

Thus, the optimal hyperplane is given as follows:

h(x) = (0.834, 0.333) x − 3.334 = 0
One can see that this is essentially the same as the canonical hyperplane we found in Example 21.3.
It is instructive to see what the slack variables are in this case. Note that ξi = 0 for all points that are not support vectors, and also for those support vectors that are on the margin. So the slack is positive only for the remaining support vectors, for whom the slack can be computed directly from Eq. (21.22), as follows:
ξi = 1 − yi (w^T xi + b)

Thus, for all support vectors not on the margin, we have

    xi     w^T xi    w^T xi + b    ξi = 1 − yi (w^T xi + b)
    x15    4.001        0.667              0.333
    x16    2.667       −0.667              1.667
    x17    3.167       −0.167              0.833
    x18    5.168        1.834              2.834

As expected, the slack variable ξi > 1 for those points that are misclassified (i.e., are on the wrong side of the hyperplane), namely x16 = (2, 3)^T and x18 = (5, 3)^T. The other two points are correctly classified, but lie within the margin, and thus satisfy 0 < ξi < 1. The total slack is given as

Σ_i ξi = ξ15 + ξ16 + ξ17 + ξ18 = 0.333 + 1.667 + 0.833 + 2.834 = 5.667
21.3.2 Quadratic Loss
For quadratic loss, we have k = 2 in the objective function [Eq. (21.17)]. In this case we can drop the positivity constraint ξi ≥ 0 due to the fact that (1) the sum of the slack terms Σ_{i=1}^n ξi² is always positive, and (2) a potential negative value of slack will be ruled out during optimization because a choice of ξi = 0 leads to a smaller value of the primary objective, and it still satisfies the constraint yi (w^T xi + b) ≥ 1 − ξi whenever ξi < 0. In other words, the optimization process will replace any negative slack variables by zero values. Thus, the SVM objective for quadratic loss is given as
Objective Function:  min_{w,b,ξi}  { ∥w∥²/2 + C Σ_{i=1}^n ξi² }
Linear Constraints:  yi (w^T xi + b) ≥ 1 − ξi,  ∀xi ∈ D                      (21.26)

The Lagrangian is then given as:

L = (1/2)∥w∥² + C Σ_{i=1}^n ξi² − Σ_{i=1}^n αi (yi (w^T xi + b) − 1 + ξi)

Differentiating with respect to w, b, and ξi and setting them to zero results in the following conditions, respectively:

w = Σ_{i=1}^n αi yi xi
Σ_{i=1}^n αi yi = 0
ξi = (1/2C) αi
Substituting these back into Eq. (21.26) yields the dual objective
L_dual = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj xi^T xj − (1/4C) Σ_{i=1}^n αi²
       = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ( xi^T xj + (1/2C) δij )

where δ is the Kronecker delta function, defined as δij = 1 if i = j, and δij = 0 otherwise. Thus, the dual objective is given as

max_α  L_dual = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj ( xi^T xj + (1/2C) δij )     (21.27)

subject to the constraints αi ≥ 0, ∀xi ∈ D, and Σ_{i=1}^n αi yi = 0.

Once we solve for the αi using the methods from Section 21.5, we can recover the weight vector and bias as follows:

w = Σ_{i, αi>0} αi yi xi
b = avg_{i, C>αi>0} { yi − w^T xi }

21.4 KERNEL SVM: NONLINEAR CASE
The linear SVM approach can be used for datasets with a nonlinear decision boundary via the kernel trick from Chapter 5. Conceptually, the idea is to map the original d-dimensional points xi in the input space to points φ(xi) in a high-dimensional feature space via some nonlinear transformation φ. Given the extra flexibility, it is more likely that the points φ(xi) might be linearly separable in the feature space. Note, however, that a linear decision surface in feature space actually corresponds to a nonlinear decision surface in the input space. Further, the kernel trick allows us to carry out all operations via the kernel function computed in input space, rather than having to map the points into feature space.
Figure 21.4. Nonlinear SVM: shaded points are the support vectors.
Example 21.5. Consider the set of points shown in Figure 21.4. There is no linear
classifier that can discriminate between the points. However, there exists a perfect
quadratic classifier that can separate the two classes. Given the input space over
the two dimensions X1 and X2 , if we transform each point x = (x1 , x2 )T into a
point in the feature space consisting of the dimensions (X1, X2, X1², X2², X1X2), via the transformation φ(x) = (√2 x1, √2 x2, x1², x2², √2 x1x2)^T, then it is possible to find a separating hyperplane in feature space. For this dataset, it is possible to map the hyperplane back to the input space, where it is seen as an ellipse (thick black line) that separates the two classes (circles and triangles). The support vectors are those points (shown in gray) that lie on the margin (dashed ellipses).
To apply the kernel trick for nonlinear SVM classification, we have to show that all operations require only the kernel function:
K(xi,xj)=φ(xi)Tφ(xj)
Let the original database be given as D = {(xi, yi)}_{i=1}^n. Applying φ to each point, we can obtain the new dataset in the feature space Dφ = {(φ(xi), yi)}_{i=1}^n.
The SVM objective function [Eq. (21.17)] in feature space is given as

Objective Function:  min_{w,b,ξi}  { ∥w∥²/2 + C Σ_{i=1}^n (ξi)^k }
Linear Constraints:  yi (w^T φ(xi) + b) ≥ 1 − ξi, and ξi ≥ 0,  ∀xi ∈ D       (21.28)

where w is the weight vector, b is the bias, and ξi are the slack variables, all in feature space.
Hinge Loss
For hinge loss, the dual Lagrangian [Eq. (21.21)] in feature space is given as
max_α  L_dual = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj φ(xi)^T φ(xj)
             = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj K(xi, xj)                 (21.29)
subject to the constraints that 0 ≤ αi ≤ C, and Σ_{i=1}^n αi yi = 0. Notice that the dual Lagrangian depends only on the dot product between two vectors in feature space φ(xi)^T φ(xj) = K(xi, xj), and thus we can solve the optimization problem using the kernel matrix K = {K(xi, xj)}_{i,j=1,...,n}. Section 21.5 describes a stochastic gradient-based approach for solving the dual objective function.
Quadratic Loss
For quadratic loss, the dual Lagrangian [Eq. (21.27)] corresponds to a change of kernel. Define a new kernel function Kq , as follows:
Kq(xi, xj) = φ(xi)^T φ(xj) + (1/2C) δij = K(xi, xj) + (1/2C) δij
which affects only the diagonal entries of the kernel matrix K, as δij = 1 iff i = j , and zero otherwise. Thus, the dual Lagrangian is given as
max_α  L_dual = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj Kq(xi, xj)               (21.30)

subject to the constraints that αi ≥ 0, and Σ_{i=1}^n αi yi = 0. The above optimization can be solved using the same approach as for hinge loss, with a simple change of kernel.
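In code, this change of kernel is a one-line adjustment to a precomputed kernel matrix. A minimal sketch (the helper name is our own), assuming K is an n × n NumPy array:

```python
import numpy as np

def quadratic_loss_kernel(K, C):
    """Shift the diagonal of a precomputed kernel matrix K by 1/(2C),
    turning a hinge-loss dual solver into a quadratic-loss one."""
    return K + np.eye(K.shape[0]) / (2.0 * C)
```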
Weight Vector and Bias
We can solve for w in feature space as follows:
w = Σ_{αi>0} αi yi φ(xi)                                                     (21.31)
Because w uses φ(xi) directly, in general, we may not be able or willing to compute w explicitly. However, as we shall see next, it is not necessary to explicitly compute w for classifying the points.
Let us now see how to compute the bias via kernel operations. Using Eq. (21.25), we compute b as the average over the support vectors that are on the margin, that is, those with 0 < αi < C:

bi = yi − w^T φ(xi) = yi − Σ_{αj>0} αj yj φ(xj)^T φ(xi)                      (21.32)
   = yi − Σ_{αj>0} αj yj K(xj, xi)                                           (21.33)
Notice that bi is a function of the dot product between two vectors in feature space and therefore it can be computed via the kernel function in the input space.
Kernel SVM Classifier
We can predict the class for a new point z as follows:

ŷ = sign( w^T φ(z) + b )
  = sign( Σ_{αi>0} αi yi φ(xi)^T φ(z) + b )
  = sign( Σ_{αi>0} αi yi K(xi, z) + b )
Once again we see that yˆ uses only dot products in feature space.
Based on the above derivations, we can see that, to train and test the SVM
classifier, the mapped points φ (xi ) are never needed in isolation. Instead, all operations can be carried out in terms of the kernel function K(xi,xj) = φ(xi)Tφ(xj). Thus, any nonlinear kernel function can be used to do nonlinear classification in the input space. Examples of such nonlinear kernels include the polynomial kernel [Eq. (5.9)], and the Gaussian kernel [Eq. (5.10)], among others.
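A small sketch of this prediction rule follows. The Gaussian kernel shown is just one possible choice [Eq. (5.10)], and the function names and argument layout are our own, not part of the text.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # One possible kernel choice; any valid kernel function works here.
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def predict_kernel_svm(z, X_sv, y_sv, alpha_sv, b, kernel):
    """Predict the class of z using only kernel evaluations against the
    support vectors: sign( sum_i alpha_i y_i K(x_i, z) + b )."""
    score = sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(score + b)
```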
Example 21.6. Let us consider the example dataset shown in Figure 21.4; it has 29 points in total. Although it is generally too expensive or infeasible (depending on the choice of the kernel) to compute an explicit representation of the hyperplane in feature space, and to map it back into input space, we will illustrate the application of SVMs in both input and feature space to aid understanding.
We use an inhomogeneous polynomial kernel [Eq. (5.9)] of degree q = 2, that is, we use the kernel:
K(xi,xj)=φ(xi)Tφ(xj)=(1+xTi xj)2
With C=4, solving the Ldual quadratic program [Eq.(21.30)] in input space yields the following six support vectors, shown as the shaded (gray) points in Figure 21.4.
    xi    (xi1, xi2)^T    φ(xi)                                    yi      αi
    x1    (1, 2)^T        (1, 1.41, 2.83, 1, 4, 2.83)^T            +1    0.6198
    x2    (4, 1)^T        (1, 5.66, 1.41, 16, 1, 5.66)^T           +1    2.069
    x3    (6, 4.5)^T      (1, 8.49, 6.36, 36, 20.25, 38.18)^T      +1    3.803
    x4    (7, 2)^T        (1, 9.90, 2.83, 49, 4, 19.80)^T          +1    0.3182
    x5    (4, 4)^T        (1, 5.66, 5.66, 16, 16, 22.63)^T         −1    2.9598
    x6    (6, 3)^T        (1, 8.49, 4.24, 36, 9, 25.46)^T          −1    3.8502

For the inhomogeneous quadratic kernel, the mapping φ maps an input point x = (x1, x2)^T into feature space as follows:

φ(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1x2)^T

The table above shows all the mapped points, which reside in feature space. For example, x1 = (1, 2)^T is transformed into

φ(x1) = (1, √2·1, √2·2, 1², 2², √2·1·2)^T = (1, 1.41, 2.83, 1, 4, 2.83)^T

We compute the weight vector for the hyperplane using Eq. (21.31):

w = Σ_{i, αi>0} αi yi φ(xi) = (0, −1.413, −3.298, 0.256, 0.82, −0.018)^T

and the bias is computed using Eq. (21.32), which yields b = −8.841.

For the quadratic polynomial kernel, the decision boundary in input space corresponds to an ellipse. For our example, the center of the ellipse is given as (4.046, 2.907), the semimajor axis length is 2.78, and the semiminor axis length is 1.55. The resulting decision boundary is the ellipse shown in Figure 21.4. We emphasize that in this example we explicitly transformed all the points into the feature space just for illustration purposes. The kernel trick allows us to achieve the same goal using only the kernel function.
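To see the kernel trick at work numerically, one can check that the explicit map and the kernel function agree on a pair of points. A short sketch using the points x1 = (1, 2)^T and x2 = (4, 1)^T from the table above (the code itself is ours, not from the text):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the inhomogeneous quadratic kernel (d = 2).
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

def K(x, z):
    return (1 + x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([4.0, 1.0])
print(phi(x) @ phi(z), K(x, z))   # both evaluate to 49.0
```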
21.5 SVM TRAINING ALGORITHMS
We now turn our attention to algorithms for solving the SVM optimization problems. We will consider simple optimization approaches for solving the dual as well as the primal formulations. It is important to note that these methods are not the most efficient. However, since they are relatively simple, they can serve as a starting point for more sophisticated methods.
For the SVM algorithms in this section, instead of dealing explicitly with the bias b, we map each point xi ∈ Rd to the point x′i ∈ Rd+1 as follows:
x′i =(xi1,…,xid,1)T (21.34)
Furthermore, we also map the weight vector to Rd+1, with wd+1 = b, so that w=(w1,…,wd,b)T (21.35)
The equation of the hyperplane [Eq. (21.1)] is then given as follows:
h(x′):  w^T x′ = 0
h(x′):  (w1 ··· wd  b) (xi1, ..., xid, 1)^T = 0
h(x′):  w1 xi1 + ··· + wd xid + b = 0
In the discussion below we assume that the bias term has been included in w, and that each point has been mapped to R^{d+1} as per Eqs. (21.34) and (21.35). Thus, the last component of w yields the bias b. Another consequence of mapping the points to R^{d+1} is that the constraint Σ_{i=1}^n αi yi = 0 does not apply in the SVM dual formulations given in Eqs. (21.21), (21.27), (21.29), and (21.30), as there is no explicit bias term b for the linear constraints in the SVM objective given in Eq. (21.17). The new set of constraints is given as

yi w^T xi ≥ 1 − ξi

21.5.1 Dual Solution: Stochastic Gradient Ascent
We consider only the hinge loss case because quadratic loss can be handled by a change of kernel, as shown in Eq.(21.30). The dual optimization objective for hinge loss [Eq. (21.29)] is given as
max_α  J(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj yi yj K(xi, xj)

subject to the constraints 0 ≤ αi ≤ C for all i = 1, ..., n. Here α = (α1, α2, ..., αn)^T ∈ R^n.

Let us consider the terms in J(α) that involve the Lagrange multiplier αk:

J(αk) = αk − (1/2) αk² yk² K(xk, xk) − αk yk Σ_{i=1, i≠k}^n αi yi K(xi, xk)

The gradient or the rate of change in the objective function at α is given as the partial derivative of J(α) with respect to α, that is, with respect to each αk:

∇J(α) = ( ∂J(α)/∂α1, ∂J(α)/∂α2, ..., ∂J(α)/∂αn )^T

where the kth component of the gradient is obtained by differentiating J(αk) with respect to αk:

∂J(α)/∂αk = ∂J(αk)/∂αk = 1 − yk Σ_{i=1}^n αi yi K(xi, xk)                    (21.36)
Because we want to maximize the objective function J(α), we should move in the direction of the gradient ∇J(α). Starting from an initial α, the gradient ascent approach successively updates it as follows:
αt+1 =αt +ηt∇J(αt)
where αt is the estimate at the tth step, and ηt is the step size.
Instead of updating the entire α vector in each step, in the stochastic gradient
ascent approach, we update each component αk independently and immediately use the new value to update other components. This can result in faster convergence. The update rule for the k-th component is given as
αk = αk + ηk ∂J(α)/∂αk = αk + ηk ( 1 − yk Σ_{i=1}^n αi yi K(xi, xk) )        (21.37)
where ηk is the step size. We also have to ensure that the constraints αk ∈ [0,C] are satisfied. Thus, in the update step above, if αk < 0 we reset it to αk = 0, and if αk > C we reset it to αk = C. The pseudo-code for stochastic gradient ascent is given in Algorithm 21.1.
ALGORITHM 21.1. Dual SVM Algorithm: Stochastic Gradient Ascent

SVM-DUAL (D, K, C, ǫ):
 1  foreach xi ∈ D do xi ← (xi^T, 1)^T   // map to R^{d+1}
 2  if loss = hinge then
 3      K ← {K(xi, xj)}_{i,j=1,...,n}   // kernel matrix, hinge loss
 4  else if loss = quadratic then
 5      K ← {K(xi, xj) + (1/2C) δij}_{i,j=1,...,n}   // kernel matrix, quadratic loss
 6  for k = 1, ..., n do ηk ← 1/K(xk, xk)   // set step size
 7  t ← 0
 8  α0 ← (0, ..., 0)^T
 9  repeat
10      α ← αt
11      for k = 1 to n do
12          αk ← αk + ηk (1 − yk Σ_{i=1}^n αi yi K(xi, xk))   // update kth component of α
13          if αk < 0 then αk ← 0
14          if αk > C then αk ← C
15      αt+1 ← α
16      t ← t + 1
17  until ∥αt − αt−1∥ ≤ ǫ
To determine the step size ηk, ideally, we would like to choose it so that the gradient at αk goes to zero, which happens when

ηk = 1 / K(xk, xk)                                                           (21.38)
To see why, note that when only αk is updated, the other αi do not change. Thus,
the new α has a change only in αk , and from Eq. (21.36) we get
∂J(α)/∂αk = 1 − yk Σ_{i≠k} αi yi K(xi, xk) − yk αk yk K(xk, xk)

Plugging in the value of αk from Eq. (21.37), we have

∂J(α)/∂αk = 1 − yk Σ_{i≠k} αi yi K(xi, xk) − ( αk + ηk (1 − yk Σ_{i=1}^n αi yi K(xi, xk)) ) K(xk, xk)
          = ( 1 − yk Σ_{i=1}^n αi yi K(xi, xk) ) − ηk K(xk, xk) ( 1 − yk Σ_{i=1}^n αi yi K(xi, xk) )
          = ( 1 − ηk K(xk, xk) ) ( 1 − yk Σ_{i=1}^n αi yi K(xi, xk) )

Substituting ηk from Eq. (21.38), we have

∂J(α)/∂αk = ( 1 − K(xk, xk)/K(xk, xk) ) ( 1 − yk Σ_{i=1}^n αi yi K(xi, xk) ) = 0
In Algorithm 21.1, for better convergence, we thus choose ηk according to Eq. (21.38). The method successively updates α and stops when the change falls below a given threshold ǫ. Since the above description assumes a general kernel function between any two points, we can recover the linear, nonseparable case by simply setting K(xi , xj ) = xTi xj . The computational complexity of the method is O(n2) per iteration.
Note that once we obtain the final α, we classify a new point z ∈ Rd+1 as follows:
ŷ = sign( h(φ(z)) ) = sign( w^T φ(z) ) = sign( Σ_{αi>0} αi yi K(xi, z) )
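For concreteness, here is a minimal NumPy rendering of Algorithm 21.1. The function name, the max_iter safeguard, and the kernel-as-callable interface are our own choices rather than part of the book's pseudocode; it assumes the kernel has a strictly positive diagonal so the step sizes are well defined.

```python
import numpy as np

def svm_dual_sgd(X, y, C, kernel, loss="hinge", eps=1e-4, max_iter=1000):
    """Stochastic gradient ascent on the dual (a sketch of Algorithm 21.1).

    X is n x d, y in {+1, -1}; the bias is folded in by appending a 1 to each point."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])       # map to R^{d+1}
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    if loss == "quadratic":
        K = K + np.eye(n) / (2.0 * C)                   # change of kernel, Eq. (21.30)
    eta = 1.0 / np.diag(K)                              # step sizes, Eq. (21.38)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        alpha_prev = alpha.copy()
        for k in range(n):
            # gradient component, Eq. (21.36), using the latest alpha values
            grad_k = 1.0 - y[k] * np.sum(alpha * y * K[:, k])
            alpha[k] = np.clip(alpha[k] + eta[k] * grad_k, 0.0, C)
        if np.linalg.norm(alpha - alpha_prev) <= eps:
            break
    return alpha

# Example usage with a linear kernel (recovers the linear, nonseparable case):
# alpha = svm_dual_sgd(X, y, C=10.0, kernel=lambda a, b: a @ b)
```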
Example 21.7 (Dual SVM: Linear Kernel). Figure 21.5 shows the n = 150 points from the Iris dataset, using sepal length and sepal width as the two attributes. The goal is to discriminate between Iris-setosa (shown as circles) and other types of Iris flowers (shown as triangles). Algorithm 21.1 was used to train the SVM classifier with a linear kernel K(xi , xj ) = xTi xj and convergence threshold ǫ = 0.0001, with hinge loss. Two different values of C were used; hyperplane h10 is obtained by using C = 10, whereas h1000 uses C = 1000; the hyperplanes are given as follows:
h10(x):    2.74 x1 − 3.74 x2 − 3.09 = 0
h1000(x):  8.56 x1 − 7.14 x2 − 23.12 = 0
Figure 21.5. SVM dual algorithm with linear kernel.
The hyperplane h10 has a larger margin, but it has a larger slack; it misclassifies one of the circles. On the other hand, the hyperplane h1000 has a smaller margin, but it minimizes the slack; it is a separating hyperplane. This example illustrates the fact that the higher the value of C the more the emphasis on minimizing the slack.
Example 21.8 (Dual SVM: Quadratic Kernel). Figure 21.6 shows the n = 150 points from the Iris dataset projected on the first two principal components. The task is to separate Iris-versicolor (in circles) from the other two types of Irises (in triangles). The figure plots the decision boundaries obtained when using the linear kernel K(xi, xj) = xi^T xj, and the inhomogeneous quadratic kernel K(xi, xj) = (1 + xi^T xj)², where xi ∈ R^{d+1}, as per Eq. (21.34). The optimal hyperplane in both cases was found via the gradient ascent approach in Algorithm 21.1, with C = 10, ǫ = 0.0001 and using hinge loss.
The optimal hyperplane hl (shown in gray) for the linear kernel is given as

hl(x):  0.16 x1 + 1.9 x2 + 0.8 = 0
As expected, hl is unable to separate the classes. On the other hand, the optimal hyperplane hq (shown as clipped black ellipse) for the quadratic kernel is given as
hq(x):  w^T φ(x) = 1.86 x1² + 1.87 x1x2 + 0.14 x1 + 0.85 x2² − 1.22 x2 − 3.25 = 0

where x = (x1, x2)^T, w = (1.86, 1.32, 0.099, 0.85, −0.87, −3.25)^T, and φ(x) = (x1², √2 x1x2, √2 x1, x2², √2 x2, 1)^T.
Figure 21.6. SVM dual algorithm with quadratic kernel.
The hyperplane hq is able to separate the two classes quite well. Here we explicitly reconstructed w for illustration purposes; note that the last element of w gives the bias term b = −3.25.
21.5.2 Primal Solution: Newton Optimization
The dual approach is the one most commonly used to train SVMs, but it is also possible to train using the primal formulation.
Consider the primal optimization function for the linear, but nonseparable case [Eq. (21.17)]. With w, xi ∈ Rd +1 as discussed earlier, we have to minimize the objective function:
min_w  J(w) = (1/2)∥w∥² + C Σ_{i=1}^n (ξi)^k                                 (21.39)

subject to the linear constraints:

yi (w^T xi) ≥ 1 − ξi  and  ξi ≥ 0  for all i = 1, ..., n

Rearranging the above, we obtain an expression for ξi:

ξi ≥ 1 − yi (w^T xi)  and  ξi ≥ 0,  which implies that
ξi = max{ 0, 1 − yi (w^T xi) }                                               (21.40)

Plugging Eq. (21.40) into the objective function [Eq. (21.39)], we obtain

J(w) = (1/2)∥w∥² + C Σ_{i=1}^n ( max{ 0, 1 − yi (w^T xi) } )^k
     = (1/2)∥w∥² + C Σ_{yi (w^T xi) < 1} ( 1 − yi (w^T xi) )^k               (21.41)
The last step follows from Eq. (21.40) because ξi > 0 if and only if 1 − yi(wTxi) > 0, that is, yi(wTxi) < 1. Unfortunately, the hinge loss formulation, with k = 1, is not differentiable. One could use a differentiable approximation to the hinge loss, but here we describe the quadratic loss formulation.
Quadratic Loss
For quadratic loss, we have k = 2, and the primal objective [Eq. (21.41)] can be written as

J(w) = (1/2)∥w∥² + C Σ_{yi (w^T xi) < 1} ( 1 − yi (w^T xi) )²

The gradient or the rate of change of the objective function at w is given as the partial derivative of J(w) with respect to w:

∇w = ∂J(w)/∂w = w − 2C Σ_{yi (w^T xi) < 1} yi xi ( 1 − yi (w^T xi) )
   = w − 2C Σ_{yi (w^T xi) < 1} yi xi + 2C Σ_{yi (w^T xi) < 1} xi xi^T w
   = w − 2Cv + 2CSw

where the vector v and the matrix S are given as

v = Σ_{yi (w^T xi) < 1} yi xi          S = Σ_{yi (w^T xi) < 1} xi xi^T

Note that S is the scatter matrix of the points that violate the margin condition, and v is m times the mean of the signed points yi xi that satisfy yi h(xi) < 1, where m is the number of such points.

The Hessian matrix is defined as the matrix of second-order partial derivatives of J(w) with respect to w, which is given as

Hw = ∂∇w/∂w = I + 2CS

Because we want to minimize the objective function J(w), we should move in the direction opposite to the gradient. The Newton optimization update rule for w is given as

w_{t+1} = w_t − η_t H_{w_t}^{−1} ∇_{w_t}                                     (21.42)
ALGORITHM 21.2. Primal SVM Algorithm: Newton Optimization, Quadratic Loss

SVM-PRIMAL (D, C, ǫ):
 1  foreach xi ∈ D do
 2      xi ← (xi^T, 1)^T   // map to R^{d+1}
 3  t ← 0
 4  w0 ← (0, ..., 0)^T   // initialize wt ∈ R^{d+1}
 5  repeat
 6      v ← Σ_{yi (wt^T xi) < 1} yi xi
 7      S ← Σ_{yi (wt^T xi) < 1} xi xi^T
 8      ∇ ← (I + 2CS) wt − 2Cv   // gradient
 9      H ← I + 2CS   // Hessian
10      wt+1 ← wt − ηt H^{−1} ∇   // Newton update rule [Eq. (21.42)]
11      t ← t + 1
12  until ∥wt − wt−1∥ ≤ ǫ
where ηt > 0 is a scalar value denoting the step size at iteration t. Normally one needs to use a line search method to find the optimal step size ηt , but the default value of ηt = 1 usually works for quadratic loss.
The Newton optimization algorithm for training linear, nonseparable SVMs in the primal is given in Algorithm 21.2. The step size ηt is set to 1 by default. After computing the gradient and Hessian at wt (lines 6–9), the Newton update rule is used to obtain the new weight vector wt+1 (line 10). The iterations continue until there is very little change in the weight vector. Computing S requires O(nd²) steps; computing the gradient ∇, the Hessian matrix H and updating the weight vector wt+1 takes time O(d²); and inverting the Hessian takes O(d³) operations, for a total computational complexity of O(nd² + d³) per iteration in the worst case.
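A minimal NumPy sketch of Algorithm 21.2 under the default step size ηt = 1 follows; the function name and the max_iter cap are our own additions, not part of the pseudocode.

```python
import numpy as np

def svm_primal_newton(X, y, C, eps=1e-4, eta=1.0, max_iter=100):
    """Newton optimization for the primal SVM with quadratic loss
    (a sketch of Algorithm 21.2); the bias is the last component of w."""
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # map to R^{d+1}
    n, d1 = X.shape
    w = np.zeros(d1)
    I = np.eye(d1)
    for _ in range(max_iter):
        viol = y * (X @ w) < 1                      # points with y_i w^T x_i < 1
        v = y[viol] @ X[viol]                       # v = sum of y_i x_i over violators
        S = X[viol].T @ X[viol]                     # scatter matrix of violators
        grad = (I + 2 * C * S) @ w - 2 * C * v
        H = I + 2 * C * S                           # Hessian
        w_new = w - eta * np.linalg.solve(H, grad)  # Newton update, Eq. (21.42)
        if np.linalg.norm(w_new - w) <= eps:
            return w_new
        w = w_new
    return w
```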
Example 21.9 (Primal SVM). Figure 21.7 plots the hyperplanes obtained using the dual and primal approaches for the 2-dimensional Iris dataset comprising the sepal length versus sepal width attributes. We used C = 1000 and ǫ = 0.0001 with the quadratic loss function. The dual solution hd (gray line) and the primal solution hp (thick black line) are essentially identical; they are as follows:
hd(x):  7.47 x1 − 6.34 x2 − 19.89 = 0
hp(x):  7.47 x1 − 6.34 x2 − 19.91 = 0
Figure 21.7. SVM primal algorithm with linear kernel.

Primal Kernel SVMs
In the preceding discussion we considered the linear, nonseparable case for primal SVM learning. We now generalize the primal approach to learn kernel-based SVMs, again for quadratic loss.
Let φ denote a mapping from the input space to the feature space; each input point xi is mapped to the feature point φ(xi). Let K(xi, xj) denote the kernel function, and let w denote the weight vector in feature space. The hyperplane in feature space is then given as
h(x): wTφ(x) = 0
Using Eqs. (21.28) and (21.40), the primal objective function in feature space can be written as

min_w  J(w) = (1/2)∥w∥² + C Σ_{i=1}^n L(yi, h(xi))                           (21.43)

where L(yi, h(xi)) = max{ 0, 1 − yi h(xi) }^k is the loss function.

The gradient at w is given as

∇w = w + C Σ_{i=1}^n ( ∂L(yi, h(xi))/∂h(xi) ) · ( ∂h(xi)/∂w )

where

∂h(xi)/∂w = ∂( w^T φ(xi) )/∂w = φ(xi)
At the optimal solution, the gradient vanishes, that is, ∇w = 0, which yields

w = −C Σ_{i=1}^n ( ∂L(yi, h(xi))/∂h(xi) ) · φ(xi) = Σ_{i=1}^n βi φ(xi)       (21.44)
where βi is the coefficient of the point φ(xi) in feature space. In other words, the optimal weight vector in feature space is expressed as a linear combination of the points φ(xi) in feature space.
Using Eq. (21.44), the distance to the hyperplane in feature space can be expressed as

yi h(xi) = yi w^T φ(xi) = yi Σ_{j=1}^n βj K(xj, xi) = yi Ki^T β              (21.45)

where K = {K(xi, xj)}_{i,j=1}^n is the n × n kernel matrix, Ki is the ith column of K, and β = (β1, ..., βn)^T is the coefficient vector.

Plugging Eqs. (21.44) and (21.45) into Eq. (21.43), with quadratic loss (k = 2), yields the primal kernel SVM formulation purely in terms of the kernel matrix:

min_β  J(β) = (1/2) Σ_{i=1}^n Σ_{j=1}^n βi βj K(xi, xj) + C Σ_{i=1}^n ( max{ 0, 1 − yi Ki^T β } )²
            = (1/2) β^T K β + C Σ_{yi Ki^T β < 1} ( 1 − yi Ki^T β )²
The gradient of J(β) with respect to β is given as

∇β = ∂J(β)/∂β = Kβ − 2C Σ_{yi Ki^T β < 1} yi Ki ( 1 − yi Ki^T β )
   = Kβ + 2C Σ_{yi Ki^T β < 1} (Ki Ki^T) β − 2C Σ_{yi Ki^T β < 1} yi Ki
   = (K + 2CS) β − 2Cv

where the vector v ∈ R^n and the matrix S ∈ R^{n×n} are given as

v = Σ_{yi Ki^T β < 1} yi Ki          S = Σ_{yi Ki^T β < 1} Ki Ki^T

Furthermore, the Hessian matrix is given as

Hβ = ∂∇β/∂β = K + 2CS
We can now minimize J(β) by Newton optimization using the following update rule:
β_{t+1} = β_t − η_t H_β^{−1} ∇_β
ALGORITHM 21.3. Primal Kernel SVM Algorithm: Newton Optimization, Quadratic Loss

SVM-PRIMAL-KERNEL (D, K, C, ǫ):
 1  foreach xi ∈ D do
 2      xi ← (xi^T, 1)^T   // map to R^{d+1}
 3  K ← {K(xi, xj)}_{i,j=1,...,n}   // compute kernel matrix
 4  t ← 0
 5  β0 ← (0, ..., 0)^T   // initialize βt ∈ R^n
 6  repeat
 7      v ← Σ_{yi (Ki^T βt) < 1} yi Ki
 8      S ← Σ_{yi (Ki^T βt) < 1} Ki Ki^T
 9      ∇ ← (K + 2CS) βt − 2Cv   // gradient
10      H ← K + 2CS   // Hessian
11      βt+1 ← βt − ηt H^{−1} ∇   // Newton update rule
12      t ← t + 1
13  until ∥βt − βt−1∥ ≤ ǫ
Note that if Hβ is singular, that is, if it does not have an inverse, then we add a small
ridge to the diagonal to regularize it. That is, we make H invertible as follows: Hβ = Hβ + λI
where λ > 0 is some small positive ridge value.
Once β has been found, it is easy to classify any test point z as follows:
ŷ = sign( w^T φ(z) ) = sign( Σ_{i=1}^n βi φ(xi)^T φ(z) ) = sign( Σ_{i=1}^n βi K(xi, z) )
The Newton optimization algorithm for kernel SVM in the primal is given in Algorithm 21.3. The step size ηt is set to 1 by default, as in the linear case. In each iteration, the method first computes the gradient and Hessian (lines 7–10). Next, the Newton update rule is used to obtain the updated coefficient vector βt+1 (line 11). The iterations continue until there is very little change in β. The computational complexity of the method is O(n3) per iteration in the worst case.
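The kernelized variant differs from the linear one only in that it works with the kernel matrix and the coefficient vector β. The sketch below is our own rendering of Algorithm 21.3 (the function name, max_iter cap, and the unconditionally added ridge are assumptions made for numerical safety).

```python
import numpy as np

def svm_primal_kernel_newton(K, y, C, eps=1e-4, eta=1.0, ridge=1e-8, max_iter=100):
    """Newton optimization for the primal kernel SVM with quadratic loss
    (a sketch of Algorithm 21.3). K is the n x n kernel matrix over the
    points already mapped to R^{d+1}; returns the coefficient vector beta."""
    n = K.shape[0]
    beta = np.zeros(n)
    for _ in range(max_iter):
        viol = y * (K @ beta) < 1                     # y_i K_i^T beta < 1
        Kv = K[:, viol]                               # columns K_i of the violators
        v = Kv @ y[viol]                              # v = sum_i y_i K_i
        S = Kv @ Kv.T                                 # S = sum_i K_i K_i^T
        grad = (K + 2 * C * S) @ beta - 2 * C * v
        H = K + 2 * C * S + ridge * np.eye(n)         # small ridge keeps H invertible
        beta_new = beta - eta * np.linalg.solve(H, grad)
        if np.linalg.norm(beta_new - beta) <= eps:
            return beta_new
        beta = beta_new
    return beta

# A test point z is then classified as sign(sum_i beta_i K(x_i, z)).
```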
Example 21.10 (Primal SVM: Quadratic Kernel). Figure 21.8 plots the hyperplanes obtained using the dual and primal approaches on the Iris dataset projected onto the first two principal components. The task is to separate iris versicolor from the others, the same as in Example 21.8. Because a linear kernel is not suitable for this task, we employ the quadratic kernel. We further set C = 10 and ǫ = 0.0001, with
Figure 21.8. SVM quadratic kernel: dual and primal.
the quadratic loss function. The dual solution hd (black contours) and the primal solution hp (gray contours) are given as follows:
hd(x):  1.4 x1² + 1.34 x1x2 − 0.05 x1 + 0.66 x2² − 0.96 x2 − 2.66 = 0
hp(x):  0.87 x1² + 0.64 x1x2 − 0.5 x1 + 0.43 x2² − 1.04 x2 − 2.398 = 0
Although the solutions are not identical, they are close, especially on the left decision boundary.
21.6 FURTHER READING
The origins of support vector machines can be found in V. N. Vapnik (1982). In particular, it introduced the generalized portrait approach for constructing an optimal separating hyperplane. The use of the kernel trick for SVMs was introduced in Boser, Guyon, and V. N. Vapnik (1992), and the soft margin SVM approach for nonseparable data was proposed in Cortes and V. Vapnik (1995). For a good introduction to support vector machines, including implementation techniques, see Cristianini and Shawe-Taylor (2000) and Schölkopf and Smola (2002). The primal training approach described in this chapter is from Chapelle (2007).
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on Computational learning theory. ACM, pp. 144–152.
Chapelle, O. (2007). Training a support vector machine in the primal. Neural Computation, 19 (5): 1155–1178.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20 (3): 273–297.
Cristianini, N. and Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
Schölkopf, B. and Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization and beyond. Cambridge, MA: MIT Press.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. Vol. 40. New York: Springer-Verlag.
21.7 EXERCISES
Q1. Consider the dataset in Figure 21.9, which has points from two classes c1 (triangles) and c2 (circles). Answer the questions below.
(a) Find the equations for the two hyperplanes h1 and h2.
(b) Show all the support vectors for h1 and h2.
(c) Which of the two hyperplanes is better at separating the two classes based on the margin computation?
(d) Find the equation of the best separating hyperplane for this dataset, and show the corresponding support vectors. You can do this without having to solve the Lagrangian by considering the convex hull of each class and the possible hyperplanes at the boundary of the two classes.
Figure 21.9. Dataset for Q1.
Table 21.2. Dataset for Q2

    i      xi1    xi2    yi     αi
    x1     4      2.9     1    0.414
    x2     4      4       1    0
    x3     1      2.5    −1    0
    x4     2.5    1      −1    0.018
    x5     4.9    4.5     1    0
    x6     1.9    1.9    −1    0
    x7     3.5    4       1    0.018
    x8     0.5    1.5    −1    0
    x9     2      2.1    −1    0.414
    x10    4.5    2.5     1    0
Q2. Given the 10 points in Table 21.2, along with their classes and their Lagrange multipliers (αi), answer the following questions:
(a) What is the equation of the SVM hyperplane h(x)?
(b) What is the distance of x6 from the hyperplane? Is it within the margin of the
classifier?
(c) Classify the point z = (3, 3)T using h(x) from above.
CHAPTER 22 Classification Assessment
We have seen different classifiers in the preceding chapters, such as decision trees, full and naive Bayes classifiers, nearest neighbors classifier, support vector machines, and so on. In general, we may think of the classifier as a model or function M that predicts the class label yˆ for a given input example x:
yˆ = M ( x )
where x = (x1,x2,…,xd)T ∈ Rd is a point in d-dimensional space and yˆ ∈ {c1,c2,…,ck} is its predicted class.
To build the classification model M we need a training set of points along with their known classes. Different classifiers are obtained depending on the assumptions used to build the model M. For instance, support vector machines use the maximum margin hyperplane to construct M. On the other hand, the Bayes classifier directly computes the posterior probability P(cj|x) for each class cj, and predicts the class of x as the one with the maximum posterior probability, yˆ = argmaxcj P(cj|x). Once the model M has been trained, we assess its performance over a separate testing set of points for which we know the true classes. Finally, the model can be deployed to predict the class for future points whose class we typically do not know.
In this chapter we look at methods to assess a classifier, and to compare multiple classifiers. We start by defining metrics of classifier accuracy. We then discuss how to determine bounds on the expected error. We finally discuss how to assess the performance of classifiers and compare them.
22.1 CLASSIFICATION PERFORMANCE MEASURES
Let D be the testing set comprising n points in a d-dimensional space, let {c1, c2, ..., ck} denote the set of k class labels, and let M be a classifier. For xi ∈ D, let yi denote its true class, and let ŷi = M(xi) denote its predicted class.
Error Rate
The error rate is the fraction of incorrect predictions for the classifier over the testing set, defined as
Error Rate = (1/n) Σ_{i=1}^n I(yi ≠ ŷi)                                      (22.1)
where I is an indicator function that has the value 1 when its argument is true, and 0 otherwise. Error rate is an estimate of the probability of misclassification. The lower the error rate the better the classifier.
Accuracy
The accuracy of a classifier is the fraction of correct predictions over the testing set:

Accuracy = (1/n) Σ_{i=1}^n I(yi = ŷi) = 1 − Error Rate                       (22.2)
Accuracy gives an estimate of the probability of a correct prediction; thus, the higher the accuracy, the better the classifier.
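Both measures can be computed directly from the predicted and true label vectors. The following minimal Python/NumPy sketch is ours (the function names are not from the text) and assumes the labels are array-like sequences of equal length.

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of incorrect predictions, Eq. (22.1)."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, Eq. (22.2)."""
    return 1.0 - error_rate(y_true, y_pred)
```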
Example 22.1. Figure 22.1 shows the 2-dimensional Iris dataset, with the two attributes being sepal length and sepal width. It has 150 points, and has three equal-sized classes: Iris-setosa (c1; circles), Iris-versicolor (c2; squares) and Iris-virginica (c3; triangles). The dataset is partitioned into training and testing sets, in the ratio 80:20. Thus, the training set has 120 points (shown in light gray), and the testing set D has n = 30 points (shown in black). One can see that whereas c1 is well separated from the other classes, c2 and c3 are not easy to separate. In fact, some points are labeled as both c2 and c3 (e.g., the point (6, 2.2)^T appears twice, labeled as c2 and c3).

Figure 22.1. Iris dataset: three classes.
We classify the test points using the full Bayes classifier (see Chapter 18). Each class is modeled using a single normal distribution, whose mean (in white) and density contours (corresponding to one and two standard deviations) are also plotted in Figure 22.1. The classifier misclassifies 8 out of the 30 test cases. Thus, we have
Error Rate = 8/30 = 0.267 Accuracy = 22/30 = 0.733
22.1.1 Contingency Table–based Measures
The error rate (and, thus also the accuracy) is a global measure in that it does not explicitly consider the classes that contribute to the error. More informative measures can be obtained by tabulating the class specific agreement and disagreement between the true and predicted labels over the testing set. Let D = {D1,D2,…,Dk} denote a partitioning of the testing points based on their true class labels, where
Dj ={xi ∈D|yi =cj}
Let ni = |Di | denote the size of true class ci .
Let R = {R1,R2,…,Rk} denote a partitioning of the testing points based on the
predicted labels, that is,
R j = { x i ∈ D | yˆ i = c j }
Let mj = |Rj | denote the size of the predicted class cj .
R and D induce a k × k contingency table N, also called a confusion matrix, defined
as follows:
N(i, j) = nij = |Ri ∩ Dj| = |{ xa ∈ D | ŷa = ci and ya = cj }|
where 1 ≤ i,j ≤ k. The count nij denotes the number of points with predicted class ci whose true label is cj . Thus, nii (for 1 ≤ i ≤ k) denotes the number of cases where the classifier agrees on the true label ci . The remaining counts nij , with i ̸= j , are cases where the classifier and true labels disagree.
Accuracy/Precision
The class-specific accuracy or precision of the classifier M for class ci is given as the fraction of correct predictions over all points predicted to be in class ci
acci = preci = nii / mi
where mi is the number of examples predicted as ci by classifier M. The higher the accuracy on class ci the better the classifier.
The overall precision or accuracy of the classifier is the weighted average of the class-specific accuracy:

Accuracy = Precision = Σ_{i=1}^k (mi/n) acci = (1/n) Σ_{i=1}^k nii

This is identical to the expression in Eq. (22.2).
Coverage/Recall
The class-specific coverage or recall of M for class ci is the fraction of correct predictions over all points in class ci :
coveragei = recalli = nii / ni
where ni is the number of points in class ci. The higher the coverage the better the classifier.
F-measure
Often there is a trade-off between the precision and recall of a classifier. For example, it is easy to make recalli = 1, by predicting all testing points to be in class ci . However, in this case preci will be low. On the other hand, we can make preci very high by predicting only a few points as ci, for instance, for those predictions where M has the most confidence, but in this case recalli will be low. Ideally, we would like both precision and recall to be high.
The class-specific F-measure tries to balance the precision and recall values, by computing their harmonic mean for class ci:

Fi = 2 / ( 1/preci + 1/recalli ) = 2 · preci · recalli / (preci + recalli) = 2 nii / (ni + mi)

The higher the Fi value the better the classifier.
The overall F-measure for the classifier M is the mean of the class-specific values:

F = (1/k) Σ_{i=1}^k Fi
For a perfect classifier, the maximum value of the F-measure is 1.
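The following NumPy sketch, with our own function names, builds the k × k confusion matrix with predicted classes as rows and true classes as columns (matching N(i, j) above) and derives the class-specific precision, recall, and F-measure. It assumes every class occurs at least once among both the true and the predicted labels, so no division by zero occurs.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    """N[i, j] = number of points predicted as classes[i] whose true label is classes[j]."""
    k = len(classes)
    N = np.zeros((k, k), dtype=int)
    for yt, yp in zip(y_true, y_pred):
        N[classes.index(yp), classes.index(yt)] += 1
    return N

def per_class_measures(N):
    m = N.sum(axis=1)            # predicted class sizes m_i (row sums)
    n = N.sum(axis=0)            # true class sizes n_i (column sums)
    diag = np.diag(N)
    prec = diag / m              # precision_i = n_ii / m_i
    recall = diag / n            # recall_i = n_ii / n_i
    F = 2 * diag / (n + m)       # F_i = 2 n_ii / (n_i + m_i)
    return prec, recall, F
```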
Example 22.2. Consider the 2-dimensional Iris dataset shown in Figure 22.1. In Example 22.1 we saw that the error rate was 26.7%. However, the error rate measure does not give much information about the classes or instances that are more difficult to classify. From the class-specific normal distribution in the figure, it is clear that the Bayes classifier should perform well for c1, but it is likely to have problems discriminating some test cases that lie close to the decision boundary between c2 and c3. This information is better captured by the confusion matrix obtained on the testing set, as shown in Table 22.1. We can observe that all 10 points in c1 are classified correctly. However, only 7 out of the 10 for c2 and 5 out of the 10 for c3 are classified correctly.
Table 22.1. Contingency table for Iris dataset: testing set

                                                  True
    Predicted               Iris-setosa (c1)   Iris-versicolor (c2)   Iris-virginica (c3)
    Iris-setosa (c1)              10                    0                     0             m1 = 10
    Iris-versicolor (c2)           0                    7                     5             m2 = 12
    Iris-virginica (c3)            0                    3                     5             m3 = 8
                                n1 = 10              n2 = 10               n3 = 10           n = 30
From the confusion matrix we can compute the class-specific precision (or accuracy) values:
prec1 = n11/m1 = 10/10 = 1.0
prec2 = n22/m2 = 7/12 = 0.583
prec3 = n33/m3 = 5/8 = 0.625

The overall accuracy tallies with that reported in Example 22.1:

Accuracy = (n11 + n22 + n33)/n = (10 + 7 + 5)/30 = 22/30 = 0.733

The class-specific recall (or coverage) values are given as

recall1 = n11/n1 = 10/10 = 1.0
recall2 = n22/n2 = 7/10 = 0.7
recall3 = n33/n3 = 5/10 = 0.5

From these we can compute the class-specific F-measure values:

F1 = 2·n11/(n1 + m1) = 20/20 = 1.0
F2 = 2·n22/(n2 + m2) = 14/22 = 0.636
F3 = 2·n33/(n3 + m3) = 10/18 = 0.556

Thus, the overall F-measure for the classifier is

F = (1/3)(1.0 + 0.636 + 0.556) = 2.192/3 = 0.731
Table 22.2. Confusion matrix for two classes

                                   True Class
    Predicted Class      Positive (c1)           Negative (c2)
    Positive (c1)        True Positive (TP)      False Positive (FP)
    Negative (c2)        False Negative (FN)     True Negative (TN)
22.1.2 Binary Classification: Positive and Negative Class
When there are only k = 2 classes, we call class c1 the positive class and c2 the negative class. The entries of the resulting 2 × 2 confusion matrix, shown in Table 22.2, are given special names, as follows:
• True Positives (TP): The number of points that the classifier correctly predicts as positive:
  TP = n11 = |{ xi | ŷi = yi = c1 }|

• False Positives (FP): The number of points the classifier predicts to be positive, which in fact belong to the negative class:
  FP = n12 = |{ xi | ŷi = c1 and yi = c2 }|

• False Negatives (FN): The number of points the classifier predicts to be in the negative class, which in fact belong to the positive class:
  FN = n21 = |{ xi | ŷi = c2 and yi = c1 }|

• True Negatives (TN): The number of points that the classifier correctly predicts as negative:
  TN = n22 = |{ xi | ŷi = yi = c2 }|
Error Rate
The error rate [Eq. (22.1)] for the binary classification case is given as the fraction of mistakes (or false predictions):

Error Rate = (FP + FN) / n

Accuracy
The accuracy [Eq. (22.2)] is the fraction of correct predictions:

Accuracy = (TP + TN) / n
The above are global measures of classifier performance. We can obtain class-specific measures as follows.
Class-specific Precision
The precision for the positive and negative class is given as

precP = TP / (TP + FP) = TP / m1
precN = TN / (TN + FN) = TN / m2

where mi = |Ri| is the number of points predicted by M as having class ci.

Sensitivity: True Positive Rate
The true positive rate, also called sensitivity, is the fraction of correct predictions with respect to all points in the positive class, that is, it is simply the recall for the positive class:

TPR = recallP = TP / (TP + FN) = TP / n1

where n1 is the size of the positive class.

Specificity: True Negative Rate
The true negative rate, also called specificity, is simply the recall for the negative class:

TNR = specificity = recallN = TN / (FP + TN) = TN / n2

where n2 is the size of the negative class.

False Negative Rate
The false negative rate is defined as

FNR = FN / (TP + FN) = FN / n1 = 1 − sensitivity

False Positive Rate
The false positive rate is defined as

FPR = FP / (FP + TN) = FP / n2 = 1 − specificity
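A small sketch that derives these class-specific measures from the four counts of Table 22.2; the function and key names are ours, and it assumes all denominators are nonzero.

```python
def binary_measures(TP, FP, FN, TN):
    """Class-specific measures for the two-class confusion matrix of Table 22.2."""
    n1, n2 = TP + FN, FP + TN              # positive / negative class sizes
    return {
        "precision_P": TP / (TP + FP),
        "precision_N": TN / (TN + FN),
        "sensitivity (TPR)": TP / n1,
        "specificity (TNR)": TN / n2,
        "FNR": FN / n1,
        "FPR": FP / n2,
    }

# For the Iris PC example that follows: binary_measures(TP=7, FP=7, FN=3, TN=13)
```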
Example 22.3. Consider the Iris dataset projected onto its first two principal components, as shown in Figure 22.2. The task is to separate Iris-versicolor (class c1; in circles) from the other two Irises (class c2; in triangles). The points from class c1 lie in-between the points from class c2, making this a hard problem for (linear) classification. The dataset has been randomly split into 80% training (in gray) and 20% testing points (in black). Thus, the training set has 120 points and the testing set has n = 30 points.
Figure 22.2. Iris principal component dataset: training and testing sets.
Applying the naive Bayes classifier (with one normal per class) on the training set yields the following estimates for the mean, covariance matrix and prior probability for each class:
P̂(c1) = 40/120 = 0.33     μ̂1 = (−0.641, −0.204)^T     Σ̂1 = diag(0.29, 0.18)
P̂(c2) = 80/120 = 0.67     μ̂2 = (0.27, 0.14)^T         Σ̂2 = diag(6.14, 0.206)
The mean (in white) and the contour plot of the normal distribution for each class are also shown in the figure; the contours are shown for one and two standard deviations along each axis.
For each of the 30 testing points, we classify them using the above parameter estimates (see Chapter 18). The naive Bayes classifier misclassified 10 out of the 30 test instances, resulting in an error rate and accuracy of
Error Rate = 10/30 = 0.33 Accuracy = 20/30 = 0.67
The confusion matrix for this binary classification problem is shown in Table 22.3. From this table, we can compute the various performance measures:
Table 22.3. Iris PC dataset: contingency table for binary classification

                          True
    Predicted        Positive (c1)    Negative (c2)
    Positive (c1)       TP = 7           FP = 7        m1 = 14
    Negative (c2)       FN = 3           TN = 13       m2 = 16
                        n1 = 10          n2 = 20       n = 30

precP = TP / (TP + FP) = 7/14 = 0.5
precN = TN / (TN + FN) = 13/16 = 0.8125
recallP = sensitivity = TPR = TP / (TP + FN) = 7/10 = 0.7
recallN = specificity = TNR = TN / (TN + FP) = 13/20 = 0.65
FNR = 1 − sensitivity = 1 − 0.7 = 0.3
FPR = 1 − specificity = 1 − 0.65 = 0.35
We can observe that the precision for the positive class is rather low. The true positive rate is also low, and the false positive rate is relatively high. Thus, the naive Bayes classifier is not particularly effective on this testing dataset.
22.1.3 ROC Analysis
Receiver Operating Characteristic (ROC) analysis is a popular strategy for assessing the performance of classifiers when there are two classes. ROC analysis requires that a classifier output a score value for the positive class for each point in the testing set. These scores can then be used to order points in decreasing order. For instance, we can use the posterior probability P(c1|xi) as the score, for example, for the Bayes classifiers. For SVM classifiers, we can use the signed distance from the hyperplane as the score because large positive distances are high confidence predictions for c1, and large negative distances are very low confidence predictions for c1 (they are, in fact, high confidence predictions for the negative class c2).
Typically, a binary classifier chooses some positive score threshold ρ, and classifies all points with score above ρ as positive, with the remaining points classified as negative. However, such a threshold is likely to be somewhat arbitrary. Instead, ROC analysis plots the performance of the classifier over all possible values of the threshold parameter ρ. In particular, for each value of ρ, it plots the false positive rate (1-specificity) on the x-axis versus the true positive rate (sensitivity) on the y-axis. The resulting plot is called the ROC curve or ROC plot for the classifier.
Let S(xi ) denote the real-valued score for the positive class output by a classifier M for the point xi . Let the maximum and minimum score thresholds observed on testing dataset D be as follows:
ρ^min = min_i { S(xi) }        ρ^max = max_i { S(xi) }
Table 22.4. Different cases for the 2 × 2 confusion matrix

(a) Initial: all negative
    Predicted \ True    Pos    Neg
    Pos                  0      0
    Neg                  FN     TN

(b) Final: all positive
    Predicted \ True    Pos    Neg
    Pos                  TP     FP
    Neg                  0      0

(c) Ideal classifier
    Predicted \ True    Pos    Neg
    Pos                  TP     0
    Neg                  0      TN
Initially, we classify all points as negative. Both TP and FP are thus initially zero (as shown in Table 22.4a), resulting in TPR and FPR rates of zero, which correspond to the point (0,0) at the lower left corner in the ROC plot. Next, for each distinct value of ρ in the range [ρmin,ρmax], we tabulate the set of positive points:
R1(ρ)={xi ∈D:S(xi)>ρ}
and we compute the corresponding true and false positive rates, to obtain a new point in the ROC plot. Finally, in the last step, we classify all points as positive. Both FN and TN are thus zero (as shown in Table 22.4b), resulting in TPR and FPR values of 1. This results in the point (1,1) at the top right-hand corner in the ROC plot. An ideal classifier corresponds to the top left point (0, 1), which corresponds to the case FPR = 0 and TPR = 1, that is, the classifier has no false positives, and identifies all true positives (as a consequence, it also correctly predicts all the points in the negative class). This case is shown in Table 22.4c. As such, a ROC curve indicates the extent to which the classifier ranks positive instances higher than the negative instances. An ideal classifier should score all positive points higher than any negative point. Thus, a classifier with a curve closer to the ideal case, that is, closer to the upper left corner, is a better classifier.
Area Under ROC Curve
The area under the ROC curve, abbreviated AUC, can be used as a measure of classifier performance. Because the total area of the plot is 1, the AUC lies in the interval [0, 1] – the higher the better. The AUC value is essentially the probability that the classifier will rank a random positive test case higher than a random negative test instance.
ROC/AUC Algorithm
Algorithm 22.1 shows the steps for plotting a ROC curve, and for computing the area under the curve. It takes as input the testing set D, and the classifier M. The first step is to predict the score S(xi ) for the positive class (c1 ) for each test point xi ∈ D. Next, we sort the (S(xi ), yi ) pairs, that is, the score and the true class pairs, in decreasing order of the scores (line 3). Initially, we set the positive score threshold ρ = ∞ (line 7). The for loop (line 8) examines each pair (S(xi),yi) in sorted order, and for each distinct value of the score, it sets ρ = S(xi ) and plots the point
(FPR, TPR) = ( FP/n2, TP/n1 )

As each test point is examined, the true and false positive values are adjusted based on the true class yi for the test point xi. If yi = c1, we increment the true positives,
ALGORITHM 22.1. ROC Curve and Area under the Curve

ROC-CURVE(D, M):
 1  n1 ← |{xi ∈ D | yi = c1}|   // size of positive class
 2  n2 ← |{xi ∈ D | yi = c2}|   // size of negative class
    // classify, score, and sort all test points
 3  L ← sort the set {(S(xi), yi) : xi ∈ D} by decreasing scores
 4  FP ← TP ← 0
 5  FPprev ← TPprev ← 0
 6  AUC ← 0
 7  ρ ← ∞
 8  foreach (S(xi), yi) ∈ L do
 9      if ρ > S(xi) then
10          plot point (FP/n2, TP/n1)
11          AUC ← AUC + TRAPEZOID-AREA((FPprev/n2, TPprev/n1), (FP/n2, TP/n1))
12          ρ ← S(xi)
13          FPprev ← FP
14          TPprev ← TP
15      if yi = c1 then TP ← TP + 1
16      else FP ← FP + 1
17  plot point (FP/n2, TP/n1)
18  AUC ← AUC + TRAPEZOID-AREA((FPprev/n2, TPprev/n1), (FP/n2, TP/n1))

TRAPEZOID-AREA((x1, y1), (x2, y2)):
19  b ← |x2 − x1|   // base of trapezoid
20  h ← (1/2)(y2 + y1)   // average height of trapezoid
21  return (b · h)
otherwise, we increment the false positives (lines 15-16). At the end of the for loop we plot the final point in the ROC curve (line 17).
The AUC value is computed as each new point is added to the ROC plot. The algorithm maintains the previous values of the false and true positives, FPprev and TPprev, for the previous score threshold ρ. Given the current FP and TP values, we compute the area under the curve defined by the four points

(x1, y1) = ( FPprev/n2, TPprev/n1 )        (x2, y2) = ( FP/n2, TP/n1 )
(x1, 0)  = ( FPprev/n2, 0 )                (x2, 0)  = ( FP/n2, 0 )
These four points define a trapezoid, whenever x2 > x1 and y2 > y1, otherwise, they define a rectangle (which may be degenerate, with zero area). The function TRAPEZOID-AREA computes the area under the trapezoid, which is given as b · h,
where b = |x2 − x1| is the length of the base of the trapezoid and h = (1/2)(y2 + y1) is the average height of the trapezoid.

Table 22.5. Sorted scores and true classes

S(xi)   0.93   0.82   0.80   0.77   0.74   0.71   0.69   0.67   0.66   0.61
yi       c2     c1     c2     c1     c1     c1     c2     c1     c2     c2

S(xi)   0.59   0.55   0.55   0.53   0.47   0.30   0.26   0.11   0.04   2.97e-03
yi       c2     c2     c1     c1     c1     c1     c1     c2     c2     c2

S(xi)   1.28e-03   2.55e-07   6.99e-08   3.11e-08   3.109e-08
yi         c2         c2         c2         c2         c2

S(xi)   1.53e-08   9.76e-09   2.08e-09   1.95e-09   7.83e-10
yi         c2         c2         c2         c2         c2
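The threshold sweep and the trapezoid-based AUC accumulation of Algorithm 22.1 can be sketched in a few lines of NumPy. The function below is our own rendering (the name and the pos argument are assumptions, not from the text); on the five scores used in Example 22.5 below it reproduces the AUC of 0.833.

```python
import numpy as np

def roc_curve_auc(scores, labels, pos):
    """ROC points and AUC via the threshold sweep of Algorithm 22.1."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                    # decreasing scores
    scores, labels = scores[order], labels[order]
    n1 = np.sum(labels == pos)                     # positive class size
    n2 = len(labels) - n1                          # negative class size
    TP = FP = 0
    points, auc = [(0.0, 0.0)], 0.0
    rho = np.inf
    for s, y in zip(scores, labels):
        if rho > s:                                # new distinct threshold
            x1, y1 = points[-1]
            x2, y2 = FP / n2, TP / n1
            auc += abs(x2 - x1) * (y1 + y2) / 2    # trapezoid area
            points.append((x2, y2))
            rho = s
        if y == pos:
            TP += 1
        else:
            FP += 1
    x1, y1 = points[-1]
    auc += abs(1.0 - x1) * (y1 + 1.0) / 2          # close the curve at (1, 1)
    points.append((1.0, 1.0))
    return points, auc
```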
Example 22.4. Consider the binary classification problem from Example 22.3 for the Iris principal components dataset. The test dataset D has n = 30 points, with n1 = 10 points in the positive class and n2 = 20 points in the negative class.
We use the naive Bayes classifier to compute the probability that each test point belongs to the positive class (c1; Iris-versicolor). The score of the classifier for test point xi is therefore S(xi) = P(c1|xi). The sorted scores (in decreasing order) along with the true class labels are shown in Table 22.5.
The ROC curve for the test dataset is shown in Figure 22.3. Consider the positive score threshold ρ = 0.71. If we classify all points with a score above this value as positive, then we have the following counts for the true and false positives:
TP=3 FP=2
The false positive rate is therefore FP/n2 = 2/20 = 0.1, and the true positive rate is TP/n1 = 3/10 = 0.3. This corresponds to the point (0.1, 0.3) in the ROC curve. Other points on
the ROC curve are obtained in a similar manner as shown in Figure 22.3. The total area under the curve is 0.775.
Example 22.5 (AUC). To see why we need to account for trapezoids when computing the AUC, consider the following sorted scores, along with the true class, for some testing dataset with n = 5, n1 = 3 and n2 = 2.
(0.9,c1),(0.8,c2),(0.8,c1),(0.8,c1),(0.1,c2)
Algorithm 22.1 yields the following points that are added to the ROC plot, along with the running AUC:

    ρ       FP    TP    (FPR, TPR)      AUC
    ∞        0     0    (0, 0)            0
    0.9      0     1    (0, 0.333)        0
    0.8      1     3    (0.5, 1)      0.333
    0.1      2     3    (1, 1)        0.833

Figure 22.4 shows the ROC plot, with the shaded region representing the AUC. We can observe that a trapezoid is obtained whenever there is at least one positive and one negative point with the same score. The total AUC is 0.833, obtained as the sum of the trapezoidal region on the left (0.333) and the rectangular region on the right (0.5).

Figure 22.3. ROC plot for Iris principal components dataset. The ROC curves for the naive Bayes (black) and random (gray) classifiers are shown.

Figure 22.4. ROC plot and AUC: trapezoid region.

Random Classifier

It is interesting to note that a random classifier corresponds to a diagonal line in the ROC plot. To see this, think of a classifier that randomly guesses the class of a point as positive half the time, and negative the other half. We then expect that half of the true positives and true negatives will be identified correctly, resulting in the point (TPR, FPR) = (0.5, 0.5) for the ROC plot. If, on the other hand, the classifier guesses the class of a point as positive 90% of the time and as negative 10% of the time, then we expect 90% of the true positives and 10% of the true negatives to be labeled correctly, resulting in TPR = 0.9 and FPR = 1 − TNR = 1 − 0.1 = 0.9, that is, we get the point (0.9, 0.9) in the ROC plot. In general, any fixed probability of prediction, say r, for the positive class yields the point (r, r) in ROC space. The diagonal line thus represents the performance of a random classifier, over all possible positive class prediction thresholds r. It follows that if the ROC curve for any classifier is below the diagonal, it indicates performance worse than random guessing. For such cases, inverting the class assignment will produce a better classifier. As a consequence of the diagonal ROC curve, the AUC value for a random classifier is 0.5. Thus, if any classifier has an AUC value less than 0.5, that also indicates performance worse than random.
Example 22.6. In addition to the ROC curve for the naive Bayes classifier, Figure 22.3 also shows the ROC plot for the random classifier (the diagonal line in gray). We can see that the ROC curve for the naive Bayes classifier is much better than random. Its AUC value is 0.775, which is much better than the 0.5 AUC for a random classifier. However, at the very beginning naive Bayes performs worse than the random classifier because the highest scored point is from the negative class. As such, the ROC curve should be considered as a discrete approximation of a smooth curve that would be obtained for a very large (infinite) testing dataset.
Class Imbalance
It is worth remarking that ROC curves are insensitive to class skew. This is because the TPR, interpreted as the probability of predicting a positive point as positive, and the FPR, interpreted as the probability of predicting a negative point as positive, do not depend on the ratio of the positive to negative class size. This is a desirable property, since the ROC curve will essentially remain the same whether the classes are balanced (have relatively the same number of points) or skewed (when one class has many more points than the other).
22.2 CLASSIFIER EVALUATION
In this section we discuss how to evaluate a classifier M using some performance measure θ. Typically, the input dataset D is randomly split into a disjoint training set and testing set. The training set is used to learn the model M, and the testing set is used to evaluate the measure θ. However, how confident can we be about the classification performance? The results may be due to an artifact of the random split, for example, by random chance the testing set may have particularly easy (or hard) to classify points, leading to good (or poor) classifier performance. As such, a fixed, pre-defined partitioning of the dataset is not a good strategy for evaluating classifiers. Also note that, in general, D is itself a d-dimensional multivariate random sample drawn from the true (unknown) joint probability density function f (x) that represents the population of interest. Ideally, we would like to know the expected value E[θ] of the performance measure over all possible testing sets drawn from f. However, because f is unknown, we have to estimate E[θ] from D. Cross-validation and resampling are two common approaches to compute the expected value and variance of a given performance measure; we discuss these methods in the following sections.
22.2.1 K-fold Cross-Validation
Cross-validation divides the dataset D into K equal-sized parts, called folds, namely D1, D2, ..., DK. Each fold Di is, in turn, treated as the testing set, with the remaining folds comprising the training set D \ Di = ∪_{j≠i} Dj. After training the model Mi on D \ Di, we assess its performance on the testing set Di to obtain the i-th estimate θi. The expected value of the performance measure can then be estimated as
\hat{\mu}_\theta = E[\theta] = \frac{1}{K}\sum_{i=1}^{K}\theta_i \qquad (22.3)

and its variance as

\hat{\sigma}_\theta^2 = \frac{1}{K}\sum_{i=1}^{K}(\theta_i - \hat{\mu}_\theta)^2 \qquad (22.4)
Algorithm 22.2 shows the pseudo-code for K-fold cross-validation. After randomly
shuffling the dataset D, we partition it into K equal folds (except for possibly the
last one). Next, each fold Di is used as the testing set on which we assess the
performance θi of the classifier Mi trained on D \ Di. The estimated mean and variance
of θ can then be reported. Note that the K-fold cross-validation can be repeated
multiple times; the initial random shuffling ensures that the folds are different each
time.
Usually K is chosen to be 5 or 10. The special case, when K = n, is called
leave-one-out cross-validation, where the testing set comprises a single point and the remaining data is used for training purposes.
ALGORITHM 22.2. K-fold Cross-Validation

CROSS-VALIDATION(K, D):
1  D ← randomly shuffle D
2  {D1, D2, ..., DK} ← partition D in K equal parts
3  foreach i ∈ [1, K] do
4      Mi ← train classifier on D \ Di
5      θi ← assess Mi on Di
6  μ̂_θ = (1/K) Σ_{i=1}^{K} θi
7  σ̂²_θ = (1/K) Σ_{i=1}^{K} (θi − μ̂_θ)²
8  return μ̂_θ, σ̂²_θ
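A minimal Python sketch of this procedure is given below; it assumes X and y are NumPy arrays and that make_classifier returns an object with scikit-learn-style fit and predict methods, with the error rate as the performance measure θ (these interface choices are ours, not the text's).

import numpy as np

def cross_validation(K, X, y, make_classifier, seed=0):
    # K-fold cross-validation estimate of the mean and variance of the error rate
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)   # shuffle D, partition into K folds
    errors = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = make_classifier()
        model.fit(X[train], y[train])                    # train M_i on D \ D_i
        errors.append(np.mean(model.predict(X[test]) != y[test]))   # theta_i on D_i
    errors = np.array(errors)
    return errors.mean(), errors.var()                   # Eq. (22.3) and Eq. (22.4)

Note that errors.var() uses the 1/K normalization, matching Eq. (22.4); any classifier object exposing fit and predict (for example, a scikit-learn estimator) can be plugged in for make_classifier.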
Example 22.7. Consider the 2-dimensional Iris dataset from Example 22.1 with k = 3 classes. We assess the error rate of the full Bayes classifier via 5-fold cross-validation, obtaining the following error rates when testing on each fold:
θ1 = 0.267    θ2 = 0.133    θ3 = 0.233    θ4 = 0.367    θ5 = 0.167

Using Eqs. (22.3) and (22.4), the mean and variance for the error rate are as follows:

\hat{\mu}_\theta = \frac{1.167}{5} = 0.233 \qquad \hat{\sigma}_\theta^2 = 0.00833
We can repeat the whole cross-validation approach multiple times, with a different permutation of the input points, and then we can compute the mean of the average error rate, and mean of the variance. Performing ten 5-fold cross-validation runs for the Iris dataset results in the mean of the expected error rate as 0.232, and the mean of the variance as 0.00521, with the variance in both these estimates being less than 10−3 .
22.2.2 Bootstrap Resampling
Another approach to estimate the expected performance of a classifier is to use the bootstrap resampling method. Instead of partitioning the input dataset D into disjoint folds, the bootstrap method draws K random samples of size n with replacement from D. Each sample Di is thus the same size as D, and has several repeated points. Consider the probability that a point xj ∈ D is not selected for the ith bootstrap sample Di. Due to sampling with replacement, the probability that a given point is selected is given as p = 1/n, and thus the probability that it is not selected is

q = 1 - p = 1 - \frac{1}{n}

Because Di has n points, the probability that xj is not selected even after n tries is given as

P(\mathbf{x}_j \notin D_i) = q^n = \left(1 - \frac{1}{n}\right)^n \simeq e^{-1} = 0.368
ALGORITHM 22.3. Bootstrap Resampling Method

BOOTSTRAP-RESAMPLING(K, D):
1  for i ∈ [1, K] do
2      Di ← sample of size n with replacement from D
3      Mi ← train classifier on Di
4      θi ← assess Mi on D
5  μ̂_θ = (1/K) Σ_{i=1}^{K} θi
6  σ̂²_θ = (1/K) Σ_{i=1}^{K} (θi − μ̂_θ)²
7  return μ̂_θ, σ̂²_θ
On the other hand, the probability that xj ∈ Di is given as

P(\mathbf{x}_j \in D_i) = 1 - P(\mathbf{x}_j \notin D_i) = 1 - 0.368 = 0.632
This means that each bootstrap sample contains approximately 63.2% of the points from D.
The bootstrap samples can be used to evaluate the classifier by training it on each of the samples Di and then using the full input dataset D as the testing set, as shown in Algorithm 22.3. The expected value and variance of the performance measure θ can be obtained using Eqs. (22.3) and (22.4). However, it should be borne in mind that the estimates will be somewhat optimistic owing to the fairly large overlap between the training and testing datasets (63.2%). The cross-validation approach does not suffer from this limitation because it keeps the training and testing sets disjoint.
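Under the same interface assumptions as the cross-validation sketch above, Algorithm 22.3 can be sketched as follows (again an illustration, not the book's code).

import numpy as np

def bootstrap_evaluation(K, X, y, make_classifier, seed=0):
    # Bootstrap resampling estimate of the mean and variance of the error rate
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(K):
        sample = rng.integers(0, n, size=n)          # size-n sample drawn with replacement
        # np.unique(sample).size / n will be close to 0.632, as derived above
        model = make_classifier()
        model.fit(X[sample], y[sample])              # train M_i on the bootstrap sample D_i
        errors.append(np.mean(model.predict(X) != y))   # assess M_i on the full dataset D
    errors = np.array(errors)
    return errors.mean(), errors.var()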
Example 22.8. We continue with the Iris dataset from Example 22.7. However, we now apply bootstrap sampling to estimate the error rate for the full Bayes classifier, using K = 50 samples. The sampling distribution of error rates is shown in Figure 22.5.
Figure 22.5. Sampling distribution of error rates (frequency versus error rate).
The expected value and variance of the error rate are

\hat{\mu}_\theta = 0.213 \qquad \hat{\sigma}_\theta^2 = 4.815 \times 10^{-4}
Due to the overlap between the training and testing sets, the estimates are more optimistic (i.e., lower) compared to those obtained via cross-validation in Example 22.7, where we had μˆ θ = 0.233 and σˆθ2 = 0.00833.
22.2.3 Confidence Intervals
Having estimated the expected value and variance for a chosen performance measure, we would like to derive confidence bounds on how much the estimate may deviate from the true value.
To answer this question we make use of the central limit theorem, which states that the sum of a large number of independent and identically distributed (IID) random variables has approximately a normal distribution, regardless of the distribution of the individual random variables. More formally, let θ1,θ2,…,θK be a sequence of IID random variables, representing, for example, the error rate or some other performance measure over the K-folds in cross-validation or K bootstrap samples. Assume that each θi has a finite mean E[θi] = μ and finite variance var(θi) = σ2.
Let μ̂ denote the sample mean:

\hat{\mu} = \frac{1}{K}(\theta_1 + \theta_2 + \cdots + \theta_K)

By linearity of expectation, we have

E[\hat{\mu}] = E\left[\frac{1}{K}(\theta_1 + \theta_2 + \cdots + \theta_K)\right] = \frac{1}{K}\sum_{i=1}^{K} E[\theta_i] = \frac{1}{K}(K\mu) = \mu

Utilizing the linearity of variance for independent random variables, and noting that var(aX) = a²·var(X) for a ∈ R, the variance of μ̂ is given as

\mathrm{var}(\hat{\mu}) = \mathrm{var}\left(\frac{1}{K}(\theta_1 + \theta_2 + \cdots + \theta_K)\right) = \frac{1}{K^2}\sum_{i=1}^{K}\mathrm{var}(\theta_i) = \frac{1}{K^2}(K\sigma^2) = \frac{\sigma^2}{K}

Thus, the standard deviation of μ̂ is given as

\mathrm{std}(\hat{\mu}) = \sqrt{\mathrm{var}(\hat{\mu})} = \frac{\sigma}{\sqrt{K}}
We are interested in the distribution of the z-score of μ̂, which is itself a random variable

Z_K = \frac{\hat{\mu} - E[\hat{\mu}]}{\mathrm{std}(\hat{\mu})} = \frac{\hat{\mu} - \mu}{\sigma/\sqrt{K}} = \sqrt{K}\left(\frac{\hat{\mu} - \mu}{\sigma}\right)

Z_K specifies the deviation of the estimated mean from the true mean in terms of its standard deviation. The central limit theorem states that as the sample size increases, the random variable Z_K converges in distribution to the standard normal distribution (which has mean 0 and variance 1). That is, as K → ∞, for any x ∈ R, we have

\lim_{K\to\infty} P(Z_K \le x) = \Phi(x)

where Φ(x) is the cumulative distribution function for the standard normal density function f(x|0, 1). Let z_{α/2} denote the z-score value that encompasses α/2 of the probability mass for a standard normal distribution, that is,
P(0 \le Z_K \le z_{\alpha/2}) = \Phi(z_{\alpha/2}) - \Phi(0) = \alpha/2

then, because the normal distribution is symmetric about the mean, we have

\lim_{K\to\infty} P(-z_{\alpha/2} \le Z_K \le z_{\alpha/2}) = 2 \cdot P(0 \le Z_K \le z_{\alpha/2}) = \alpha \qquad (22.5)

Note that

-z_{\alpha/2} \le Z_K \le z_{\alpha/2} \implies -z_{\alpha/2} \le \sqrt{K}\,\frac{\hat{\mu} - \mu}{\sigma} \le z_{\alpha/2}
\implies -z_{\alpha/2}\frac{\sigma}{\sqrt{K}} \le \hat{\mu} - \mu \le z_{\alpha/2}\frac{\sigma}{\sqrt{K}}
\implies \hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{K}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{K}}

Substituting the above into Eq. (22.5) we obtain bounds on the value of the true mean μ in terms of the estimated value μ̂, that is,

\lim_{K\to\infty} P\left(\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{K}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{K}}\right) = \alpha \qquad (22.6)
Thus, for any given level of confidence α, we can compute the probability that the true mean μ lies in the α% confidence interval \left(\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{K}},\; \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{K}}\right). In other words, even though we do not know the true mean μ, we can obtain a high-confidence estimate of the interval within which it must lie (e.g., by setting α = 0.95 or α = 0.99).
Unknown Variance
The analysis above assumes that we know the true variance σ², which is generally not the case. However, we can replace σ² by the sample variance

\hat{\sigma}^2 = \frac{1}{K}\sum_{i=1}^{K}(\theta_i - \hat{\mu})^2 \qquad (22.7)

because σ̂² is a consistent estimator for σ², that is, as K → ∞, σ̂² converges with probability 1, also called converges almost surely, to σ². The central limit theorem then states that the random variable Z*_K defined below converges in distribution to the standard normal distribution:

Z_K^* = \sqrt{K}\left(\frac{\hat{\mu} - \mu}{\hat{\sigma}}\right) \qquad (22.8)
and thus, we have

\lim_{K\to\infty} P\left(\hat{\mu} - z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}}\right) = \alpha \qquad (22.9)

In other words, we say that \left(\hat{\mu} - z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}},\; \hat{\mu} + z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{K}}\right) is the α% confidence interval for μ.
Example 22.9. Consider Example 22.7, where we applied 5-fold cross-validation (K = 5) to assess the error rate of the full Bayes classifier. The estimated expected value and variance for the error rate were as follows:
\hat{\mu}_\theta = 0.233 \qquad \hat{\sigma}_\theta^2 = 0.00833 \qquad \hat{\sigma}_\theta = \sqrt{0.00833} = 0.0913

Let α = 0.95 be the confidence value. It is known that the standard normal distribution has 95% of the probability density within z_{α/2} = 1.96 standard deviations from the mean. Thus, in the limit of large sample size, we have

P\left(\mu \in \left(\hat{\mu}_\theta - z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}},\; \hat{\mu}_\theta + z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}}\right)\right) = 0.95

Because z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}} = \frac{1.96 \times 0.0913}{\sqrt{5}} = 0.08, we have

P\big(\mu \in (0.233 - 0.08,\, 0.233 + 0.08)\big) = P\big(\mu \in (0.153,\, 0.313)\big) = 0.95

Put differently, with 95% confidence, the true expected error rate lies in the interval (0.153, 0.313).

If we want greater confidence, for example, for α = 0.99, then the corresponding z-score value is z_{α/2} = 2.58, and thus z_{\alpha/2}\frac{\hat{\sigma}_\theta}{\sqrt{K}} = \frac{2.58 \times 0.0913}{\sqrt{5}} = 0.105. The 99% confidence interval for μ is therefore wider: (0.128, 0.338).
Nevertheless, K = 5 is not a large sample size, and thus the above confidence intervals are not that reliable.
Small Sample Size
The confidence interval in Eq. (22.9) applies only when the sample size K → ∞. We would like to obtain more precise confidence intervals for small samples. Consider the random variables Vi, for i = 1,…,K, defined as
V_i = \frac{\theta_i - \hat{\mu}}{\sigma}

Further, consider the sum of their squares:

S = \sum_{i=1}^{K} V_i^2 = \sum_{i=1}^{K}\left(\frac{\theta_i - \hat{\mu}}{\sigma}\right)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{K}(\theta_i - \hat{\mu})^2 = \frac{K\hat{\sigma}^2}{\sigma^2} \qquad (22.10)
The last step follows from the definition of sample variance in Eq. (22.7).
If we assume that the Vi’s are IID with the standard normal distribution, then the sum S follows a chi-squared distribution with K − 1 degrees of freedom, denoted
χ²(K − 1), since S is the sum of the squares of K random variables Vi. There are only K − 1 degrees of freedom because each Vi depends on μ̂ and the sum of the θi's is thus fixed.
Consider the random variable Z*_K in Eq. (22.8). We have

Z_K^* = \sqrt{K}\left(\frac{\hat{\mu} - \mu}{\hat{\sigma}}\right) = \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{K}} \qquad (22.11)

Dividing the numerator and denominator in the expression above by σ/√K, we get

Z_K^* = \frac{(\hat{\mu} - \mu)/(\sigma/\sqrt{K})}{(\hat{\sigma}/\sqrt{K})/(\sigma/\sqrt{K})} = \frac{Z_K}{\hat{\sigma}/\sigma} = \frac{Z_K}{\sqrt{S/K}}

The last step follows from Eq. (22.10) because

S = \frac{K\hat{\sigma}^2}{\sigma^2} \quad \text{implies that} \quad \frac{\hat{\sigma}}{\sigma} = \sqrt{S/K}
Assuming that ZK follows a standard normal distribution, and noting that S follows a chi-squared distribution with K − 1 degrees of freedom, then the distribution of Z∗K is precisely the Student’s t distribution with K − 1 degrees of freedom. Thus, in the small sample case, instead of using the standard normal density to derive the confidence interval, we use the t distribution. In particular, we choose the value tα/2,K−1 such that the cumulative t distribution function with K − 1 degrees of freedom encompasses α/2 of the probability mass, that is,
P(0 \le Z_K^* \le t_{\alpha/2,K-1}) = T_{K-1}(t_{\alpha/2}) - T_{K-1}(0) = \alpha/2
where TK−1 is the cumulative distribution function for the Student’s t distribution with K − 1 degrees of freedom. Because the t distribution is symmetric about the mean, we have
P\left(\hat{\mu} - t_{\alpha/2,K-1}\frac{\hat{\sigma}}{\sqrt{K}} \le \mu \le \hat{\mu} + t_{\alpha/2,K-1}\frac{\hat{\sigma}}{\sqrt{K}}\right) = \alpha \qquad (22.12)

The α% confidence interval for the true mean μ is thus

\left(\hat{\mu} - t_{\alpha/2,K-1}\frac{\hat{\sigma}}{\sqrt{K}},\; \hat{\mu} + t_{\alpha/2,K-1}\frac{\hat{\sigma}}{\sqrt{K}}\right)

Note the dependence of the interval on both α and the sample size K.

Figure 22.6 shows the t distribution density function for different values of K. It also shows the standard normal density function. We can observe that the t distribution has more probability concentrated in its tails compared to the standard normal distribution. Further, as K increases, the t distribution very rapidly converges in distribution to the standard normal distribution, consistent with the large sample case. Thus, for large samples, we may use the usual z_{α/2} threshold.
Figure 22.6. Student's t distribution: K degrees of freedom. The thick solid line is the standard normal distribution f(x|0, 1); the other curves show t(10), t(4), and t(1).
Example 22.10. Consider Example 22.9. For 5-fold cross-validation, the estimated mean error rate is μ̂_θ = 0.233, and the estimated standard deviation is σ̂_θ = 0.0913.
Due to the small sample size (K = 5), we can get a better confidence interval by using the t distribution. For K − 1 = 4 degrees of freedom, for α = 0.95, we use the quantile function for the Student’s t-distribution to obtain tα/2,K−1 = 2.776. Thus,
t_{\alpha/2,K-1}\,\frac{\hat{\sigma}_\theta}{\sqrt{K}} = 2.776 \times \frac{0.0913}{\sqrt{5}} = 0.113
The 95% confidence interval is therefore
(0.233 − 0.113, 0.233 + 0.113) = (0.12, 0.346)
which is much wider than the overly optimistic confidence interval (0.153,0.313) obtained for the large sample case in Example 22.9.
For α = 0.99, we have t_{α/2,K−1} = 4.604, and thus

t_{\alpha/2,K-1}\,\frac{\hat{\sigma}_\theta}{\sqrt{K}} = 4.604 \times \frac{0.0913}{\sqrt{5}} = 0.188
and the 99% confidence interval is
(0.233 − 0.188, 0.233 + 0.188) = (0.045, 0.421)
This is also much wider than the 99% confidence interval (0.128, 0.338) obtained for the large sample case in Example 22.9.
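The critical values and intervals in Examples 22.9 and 22.10 can be reproduced numerically; the sketch below uses SciPy's quantile functions, which is our choice of tool and not something prescribed by the text.

import numpy as np
from scipy import stats

mu, sd, K, alpha = 0.233, 0.0913, 5, 0.95

z = stats.norm.ppf(0.5 + alpha / 2)           # 1.96: large-sample (normal) critical value
print(mu - z * sd / np.sqrt(K), mu + z * sd / np.sqrt(K))    # about (0.153, 0.313)

t = stats.t.ppf(0.5 + alpha / 2, df=K - 1)    # 2.776: Student's t critical value, K - 1 dof
print(mu - t * sd / np.sqrt(K), mu + t * sd / np.sqrt(K))    # about (0.120, 0.346)

Replacing alpha with 0.99 gives the wider 99% intervals reported above.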
22.2.4 Comparing Classifiers: Paired t-Test
In this section we look at a method that allows us to test for a significant difference in the classification performance of two alternative classifiers, MA and MB. We want to assess which of them has a superior classification performance on a given dataset D.
Following the evaluation methodology above, we can apply K-fold cross-validation (or bootstrap resampling) and tabulate their performance over each of the K folds, with identical folds for both classifiers. That is, we perform a paired test, with both classifiers trained and tested on the same data. Let θ1A,θ2A,…,θKA and θ1B,θ2B,…,θKB denote the performance values for MA and MB, respectively. To determine if the two classifiers have different or similar performance, define the random variable δi as the difference in their performance on the ith dataset:
\delta_i = \theta_i^A - \theta_i^B

Now consider the estimates for the expected difference and the variance of the differences:

\hat{\mu}_\delta = \frac{1}{K}\sum_{i=1}^{K}\delta_i \qquad\qquad \hat{\sigma}_\delta^2 = \frac{1}{K}\sum_{i=1}^{K}(\delta_i - \hat{\mu}_\delta)^2
We can set up a hypothesis testing framework to determine if there is a statistically significant difference between the performance of MA and MB. The null hypothesis H0 is that their performance is the same, that is, the true expected difference is zero, whereas the alternative hypothesis Ha is that they are not the same, that is, the true expected difference μδ is not zero:
H_0: \mu_\delta = 0 \qquad H_a: \mu_\delta \neq 0
Let us define the z-score random variable for the estimated expected difference as
Z_\delta^* = \sqrt{K}\left(\frac{\hat{\mu}_\delta - \mu_\delta}{\hat{\sigma}_\delta}\right)
Following a similar argument as in Eq. (22.11), Z∗δ follows a t distribution with K − 1 degrees of freedom. However, under the null hypothesis we have μδ = 0, and thus
Z_\delta^* = \frac{\sqrt{K}\,\hat{\mu}_\delta}{\hat{\sigma}_\delta} \sim t_{K-1}
where the notation Z∗δ ∼ tK−1 means that Z∗δ follows the t distribution with K − 1 degrees of freedom.
Given a desired confidence level α, we conclude that

P\left(-t_{\alpha/2,K-1} \le Z_\delta^* \le t_{\alpha/2,K-1}\right) = \alpha

Put another way, if Z_\delta^* \notin \left(-t_{\alpha/2,K-1},\, t_{\alpha/2,K-1}\right), then we may reject the null hypothesis with α% confidence. In this case, we conclude that there is a significant difference between the performance of MA and MB. On the other hand, if Z*_δ does lie in the above confidence interval, then we accept the null hypothesis that both MA and MB have essentially the same performance. The pseudo-code for the paired t-test is shown in Algorithm 22.4.
ALGORITHM 22.4. Paired t-Test via Cross-Validation

PAIRED t-TEST(α, K, D):
 1  D ← randomly shuffle D
 2  {D1, D2, ..., DK} ← partition D in K equal parts
 3  foreach i ∈ [1, K] do
 4      M_i^A, M_i^B ← train the two different classifiers on D \ Di
 5      θ_i^A, θ_i^B ← assess M_i^A and M_i^B on Di
 6      δ_i = θ_i^A − θ_i^B
 7  μ̂_δ = (1/K) Σ_{i=1}^{K} δ_i
 8  σ̂²_δ = (1/K) Σ_{i=1}^{K} (δ_i − μ̂_δ)²
 9  Z*_δ = √K μ̂_δ / σ̂_δ
10  if Z*_δ ∈ (−t_{α/2,K−1}, t_{α/2,K−1}) then
11      Accept H0; both classifiers have similar performance
12  else
13      Reject H0; classifiers have significantly different performance
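A compact Python rendering of this test is sketched below; as with the earlier sketches, the NumPy arrays, the fit/predict interface, the error rate as the performance measure, and the SciPy quantile call are our assumptions rather than the book's.

import numpy as np
from scipy import stats

def paired_t_test(alpha, K, X, y, make_A, make_B, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    deltas = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        errs = []
        for make in (make_A, make_B):                 # both classifiers use the same folds
            model = make()
            model.fit(X[train], y[train])
            errs.append(np.mean(model.predict(X[test]) != y[test]))
        deltas.append(errs[0] - errs[1])              # delta_i = theta_i^A - theta_i^B
    deltas = np.array(deltas)
    z_star = np.sqrt(K) * deltas.mean() / deltas.std()    # sigma_hat uses 1/K, as in the text
    t_crit = stats.t.ppf(0.5 + alpha / 2, df=K - 1)
    accept_h0 = abs(z_star) < t_crit                  # True: no significant difference
    return z_star, t_crit, accept_h0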
Example 22.11. Consider the 2-dimensional Iris dataset from Example 22.1, with k = 3 classes. We compare the naive Bayes (MA) with the full Bayes (MB) classifier via cross-validation using K = 5 folds. Using error rate as the performance measure, we obtain the following values for the error rates and their difference over each of the K folds:
i        1       2       3        4       5
θ_i^A    0.233   0.267   0.1      0.4     0.3
θ_i^B    0.2     0.2     0.167    0.333   0.233
δ_i      0.033   0.067   −0.067   0.067   0.067
The estimated expected difference and variance of the differences are
\hat{\mu}_\delta = \frac{0.167}{5} = 0.033 \qquad \hat{\sigma}_\delta^2 = 0.00333 \qquad \hat{\sigma}_\delta = \sqrt{0.00333} = 0.0577

The z-score value is given as

Z_\delta^* = \frac{\sqrt{K}\,\hat{\mu}_\delta}{\hat{\sigma}_\delta} = \frac{\sqrt{5} \times 0.033}{0.0577} = 1.28
From Example 22.10, for α = 0.95 and K − 1 = 4 degrees of freedom, we have
tα/2,K−1 = 2.776. Because
Z_\delta^* = 1.28 \in (-2.776, 2.776) = \left(-t_{\alpha/2,K-1},\, t_{\alpha/2,K-1}\right)
we cannot reject the null hypothesis. Instead, we accept the null hypothesis that μδ = 0, that is, there is no significant difference between the naive and full Bayes classifier for this dataset.
22.3 BIAS-VARIANCE DECOMPOSITION
Given a training set D = {x_i, y_i}_{i=1}^n, comprising n points x_i ∈ R^d, with their corresponding classes y_i, a learned classification model M predicts the class for a given test point x. The various performance measures we described above mainly focus on minimizing the prediction error by tabulating the fraction of misclassified points. However, in many applications, there may be costs associated with making wrong predictions. A loss function specifies the cost or penalty of predicting the class to be ŷ = M(x), when the true class is y. A commonly used loss function for classification is the zero-one loss, defined as
L(y, M(\mathbf{x})) = I(M(\mathbf{x}) \neq y) = \begin{cases} 0 & \text{if } M(\mathbf{x}) = y \\ 1 & \text{if } M(\mathbf{x}) \neq y \end{cases}
Thus, zero-one loss assigns a cost of zero if the prediction is correct, and one otherwise. Another commonly used loss function is the squared loss, defined as
L(y, M(\mathbf{x})) = (y - M(\mathbf{x}))^2
where we assume that the classes are discrete valued, and not categorical.
Expected Loss
An ideal or optimal classifier is the one that minimizes the loss function. Because the true class is not known for a test case x, the goal of learning a classification model can be cast as minimizing the expected loss:
E_y[L(y, M(\mathbf{x}))\,|\,\mathbf{x}] = \sum_{y} L(y, M(\mathbf{x})) \cdot P(y\,|\,\mathbf{x}) \qquad (22.13)
where P(y|x) is the conditional probability of class y given test point x, and Ey denotes that the expectation is taken over the different class values y.
Minimizing the expected zero–one loss corresponds to minimizing the error rate. This can be seen by expanding Eq. (22.13) with zero–one loss. Let M(x) = ci , then we have
E_y[L(y, M(\mathbf{x}))\,|\,\mathbf{x}] = E_y[I(y \neq M(\mathbf{x}))\,|\,\mathbf{x}] = \sum_{y} I(y \neq c_i)\cdot P(y\,|\,\mathbf{x}) = \sum_{y \neq c_i} P(y\,|\,\mathbf{x}) = 1 - P(c_i\,|\,\mathbf{x})
Thus, to minimize the expected loss we should choose ci as the class that maximizes the posterior probability, that is, c_i = \arg\max_y P(y\,|\,\mathbf{x}). Because, by definition [Eq. (22.1)], the error rate is simply an estimate of the expected zero-one loss, this choice also minimizes the error rate.
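As a tiny numeric illustration (with made-up posterior probabilities, not values from the text), the expected zero-one loss of predicting class ci is 1 − P(ci|x), so the argmax class minimizes it.

posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}              # hypothetical P(y | x)
expected_loss = {c: 1 - p for c, p in posterior.items()}   # zero-one loss: 1 - P(c | x)
best = min(expected_loss, key=expected_loss.get)           # same class as argmax_y P(y | x)
print(best, expected_loss[best])                           # c2 0.5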
Bias and Variance
The expected loss for the squared loss function offers important insight into the classification problem because it can be decomposed into bias and variance terms. Intuitively, the bias of a classifier refers to the systematic deviation of its predicted decision boundary from the true decision boundary, whereas the variance of a classifier refers to the deviation among the learned decision boundaries over different training sets. More formally, because M depends on the training set, given a test point x, we denote its predicted value as M(x, D). Consider the expected square loss:
\begin{aligned}
E_y\big[L(y, M(\mathbf{x},\mathbf{D})) \mid \mathbf{x},\mathbf{D}\big]
&= E_y\big[(y - M(\mathbf{x},\mathbf{D}))^2 \mid \mathbf{x},\mathbf{D}\big]\\
&= E_y\big[(y - E_y[y|\mathbf{x}] + E_y[y|\mathbf{x}] - M(\mathbf{x},\mathbf{D}))^2 \mid \mathbf{x},\mathbf{D}\big] && \text{add and subtract same term}\\
&= E_y\big[(y - E_y[y|\mathbf{x}])^2 \mid \mathbf{x},\mathbf{D}\big] + E_y\big[(M(\mathbf{x},\mathbf{D}) - E_y[y|\mathbf{x}])^2 \mid \mathbf{x},\mathbf{D}\big]\\
&\qquad + E_y\big[2\,(y - E_y[y|\mathbf{x}])\cdot(E_y[y|\mathbf{x}] - M(\mathbf{x},\mathbf{D})) \mid \mathbf{x},\mathbf{D}\big]\\
&= E_y\big[(y - E_y[y|\mathbf{x}])^2 \mid \mathbf{x},\mathbf{D}\big] + \big(M(\mathbf{x},\mathbf{D}) - E_y[y|\mathbf{x}]\big)^2\\
&\qquad + 2\,\big(E_y[y|\mathbf{x}] - M(\mathbf{x},\mathbf{D})\big)\cdot\underbrace{\big(E_y[y|\mathbf{x}] - E_y[y|\mathbf{x}]\big)}_{0}\\
&= \underbrace{E_y\big[(y - E_y[y|\mathbf{x}])^2 \mid \mathbf{x},\mathbf{D}\big]}_{\mathrm{var}(y|\mathbf{x})} + \underbrace{\big(M(\mathbf{x},\mathbf{D}) - E_y[y|\mathbf{x}]\big)^2}_{\text{squared-error}}
\end{aligned}
\qquad (22.14)
Above, we made use of the fact that for any random variables X and Y, and for any constant a, we have E[X + Y] = E[X] + E[Y], E[aX] = aE[X], and E[a] = a. The first term in Eq. (22.14) is simply the variance of y given x. The second term is the squared error between the predicted value M(x,D) and the expected value Ey[y|x]. Because this term depends on the training set, we can eliminate this dependence by averaging over all possible training tests of size n. The average or expected squared error for a given test point x over all training sets is then given as
\begin{aligned}
E_{\mathbf{D}}\big[(M(\mathbf{x},\mathbf{D}) - E_y[y|\mathbf{x}])^2\big]
&= E_{\mathbf{D}}\big[(M(\mathbf{x},\mathbf{D}) - E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})] + E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})] - E_y[y|\mathbf{x}])^2\big] && \text{add and subtract same term}\\
&= E_{\mathbf{D}}\big[(M(\mathbf{x},\mathbf{D}) - E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})])^2\big] + E_{\mathbf{D}}\big[(E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})] - E_y[y|\mathbf{x}])^2\big]\\
&\qquad + 2\,\big(E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})] - E_y[y|\mathbf{x}]\big)\cdot\underbrace{\big(E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})] - E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})]\big)}_{0}\\
&= \underbrace{E_{\mathbf{D}}\big[(M(\mathbf{x},\mathbf{D}) - E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})])^2\big]}_{\text{variance}} + \underbrace{\big(E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})] - E_y[y|\mathbf{x}]\big)^2}_{\text{bias}}
\end{aligned}
\qquad (22.15)
This means that the expected squared error for a given test point can be decomposed into bias and variance terms. Combining Eqs. (22.14) and (22.15) the expected squared loss over all test points x and over all training sets D of size n yields the following decomposition into noise, variance and bias terms:
\begin{aligned}
E_{\mathbf{x},\mathbf{D},y}\big[(y - M(\mathbf{x},\mathbf{D}))^2\big]
&= E_{\mathbf{x},\mathbf{D},y}\big[(y - E_y[y|\mathbf{x}])^2 \mid \mathbf{x},\mathbf{D}\big] + E_{\mathbf{x},\mathbf{D}}\big[(M(\mathbf{x},\mathbf{D}) - E_y[y|\mathbf{x}])^2\big]\\
&= \underbrace{E_{\mathbf{x},y}\big[(y - E_y[y|\mathbf{x}])^2\big]}_{\text{noise}} + \underbrace{E_{\mathbf{x},\mathbf{D}}\big[(M(\mathbf{x},\mathbf{D}) - E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})])^2\big]}_{\text{average variance}} + \underbrace{E_{\mathbf{x}}\big[(E_{\mathbf{D}}[M(\mathbf{x},\mathbf{D})] - E_y[y|\mathbf{x}])^2\big]}_{\text{average bias}}
\end{aligned}
\qquad (22.16)
Thus, the expected square loss over all test points and training sets can be decomposed into three terms: noise, average bias, and average variance. The noise term is the average variance var(y|x) over all test points x. It contributes a fixed cost to the loss independent of the model, and can thus be ignored when comparing different classifiers. The classifier specific loss can then be attributed to the variance and bias terms. In general, bias indicates whether the model M is correct or incorrect. It also reflects our assumptions about the domain in terms of the decision boundary. For example, if the decision boundary is nonlinear, and we use a linear classifier, then it is likely to have high bias, that is, it will be consistently incorrect over different training sets. On the other hand, a nonlinear (or a more complex) classifier is more likely to capture the correct decision boundary, and is thus likely to have a low bias. Nevertheless, this does not necessarily mean that a complex classifier will be a better one, since we also have to consider the variance term, which measures the inconsistency of the classifier decisions. A complex classifier induces a more complex decision boundary and thus may be prone to overfitting, that is, it may try to model all the small nuances in the training data, and thus may be susceptible to small changes in training set, which may result in high variance.
In general, the expected loss can be attributed to high bias or high variance, with typically a trade-off between these two terms. Ideally, we seek a balance between these opposing trends, that is, we prefer a classifier with an acceptable bias (reflecting domain or dataset specific assumptions) and as low a variance as possible.
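One way to see this decomposition empirically is to simulate many training sets, average the learned predictions per test point, and tabulate the average squared bias and variance. The sketch below does this for squared loss with a generic fit/predict learner and numeric class labels; all names and interface choices here are ours.

import numpy as np

def bias_variance(make_classifier, train_sets, X_test, y_ref):
    # train_sets: list of (X, y) training samples (e.g., bootstrap replicates of D)
    # y_ref: stands in for E_y[y | x] at each test point
    preds = []
    for X_tr, y_tr in train_sets:
        model = make_classifier()
        model.fit(X_tr, y_tr)
        preds.append(model.predict(X_test).astype(float))
    preds = np.array(preds)                           # shape: (num_training_sets, num_test_points)
    avg_pred = preds.mean(axis=0)                     # estimate of E_D[M(x, D)] per test point
    avg_variance = ((preds - avg_pred) ** 2).mean()   # average variance over x and D
    avg_bias2 = ((avg_pred - y_ref) ** 2).mean()      # average squared bias over x
    return avg_bias2, avg_variance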
Example 22.12. Figure 22.7 illustrates the trade-off between bias and variance, using the Iris principal components dataset, which has n = 150 points and k = 2 classes (c1 = +1, and c2 = −1). We construct K = 10 training datasets via bootstrap sampling, and use them to train SVM classifiers using a quadratic (homogeneous) kernel, varying the regularization constant C from 10−2 to 102.
Recall that C controls the weight placed on the slack variables, as opposed to the margin of the hyperplane (see Section 21.3). A small value of C emphasizes the margin, whereas a large value of C tries to minimize the slack terms. Figures 22.7a, 22.7b, and 22.7c show that the variance of the SVM model increases
as we increase C, as seen from the varying decision boundaries. Figure 22.7d plots the average variance and average bias for different values of C, as well as the expected loss. The bias-variance tradeoff is clearly visible, since as the bias reduces, the variance increases. The lowest expected loss is obtained when C = 1.

Figure 22.7. Bias-variance decomposition: SVM quadratic kernels. Decision boundaries plotted for K = 10 bootstrap samples. Panels: (a) C = 0.01, (b) C = 1, (c) C = 100, (d) bias-variance plot of loss, bias, and variance versus C (from 10⁻² to 10²).
22.3.1 Ensemble Classifiers
A classifier is called unstable if small perturbations in the training set result in large changes in the prediction or decision boundary. High variance classifiers are inherently unstable, since they tend to overfit the data. On the other hand, high bias methods typically underfit the data, and usually have low variance. In either case, the aim
of learning is to reduce classification error by reducing the variance or bias, ideally both. Ensemble methods create a combined classifier using the output of multiple base classifiers, which are trained on different data subsets. Depending on how the training sets are selected, and on the stability of the base classifiers, ensemble classifiers can help reduce the variance and the bias, leading to a better overall performance.
Bagging
Bagging, which stands for Bootstrap Aggregation, is an ensemble classification method that employs multiple bootstrap samples (with replacement) from the input training data D to create slightly different training sets Di, i = 1,2,…,K. Different base classifiers Mi are learned, with Mi trained on Di. Given any test point x, it is first classified using each of the K base classifiers, Mi. Let the number of classifiers that predict the class of x as cj be given as
v_j(\mathbf{x}) = \big|\big\{\, M_i(\mathbf{x}) = c_j \mid i = 1, \ldots, K \,\big\}\big|
The combined classifier, denoted M^K, predicts the class of a test point x by majority voting among the k classes:

\mathbf{M}^K(\mathbf{x}) = \arg\max_{c_j}\big\{\, v_j(\mathbf{x}) \mid j = 1, \ldots, k \,\big\}
For binary classification, assuming that the classes are given as {+1, −1}, the combined classifier M^K can be expressed more simply as

\mathbf{M}^K(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{K} M_i(\mathbf{x})\right)
Bagging can help reduce the variance, especially if the base classifiers are unstable, due to the averaging effect of majority voting. It does not, in general, have much effect on the bias.
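A minimal bagging sketch for the binary {+1, −1} case, matching the sign-of-sums form above, is given below; the decision-tree base learner is just one convenient choice and is not prescribed by the text.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, K=10, seed=0):
    # Train K base classifiers on bootstrap samples and combine them by majority vote
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.zeros(len(X_test))
    for _ in range(K):
        sample = rng.integers(0, n, size=n)        # bootstrap sample with replacement
        base = DecisionTreeClassifier().fit(X_train[sample], y_train[sample])
        votes += base.predict(X_test)              # labels are +1/-1, so the sum tallies votes
    return np.sign(votes)                          # combined classifier M^K(x); ties map to 0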
Example 22.13. Figure 22.8a shows the averaging effect of bagging for the Iris principal components dataset from Example 22.12. The figure shows the SVM decision boundaries for the quadratic kernel using C = 1. The base SVM classifiers are trained on K = 10 bootstrap samples. The combined (average) classifier is shown in bold.
Figure 22.8b shows the combined classifiers obtained for different values of K, keeping C = 1. The zero–one and squared loss for selected values of K are shown below
K               3       5      8      10      15
Zero-one loss   0.047   0.04   0.02   0.027   0.027
Squared loss    0.187   0.16   0.10   0.113   0.107
Figure 22.8. Bagging: combined classifiers. (a) uses K = 10 bootstrap samples. (b) shows average decision boundary for different values of K.
The worst training performance is obtained for K = 3 (in thick gray) and the best for K = 8 (in thick black).
Boosting
Boosting is another ensemble technique that trains the base classifiers on different samples. However, the main idea is to carefully select the samples to boost the performance on hard to classify instances. Starting from an initial training sample D1, we train the base classifier M1, and obtain its training error rate. To construct the next sample D2, we select the misclassified instances with higher probability, and after training M2, we obtain its training error rate. To construct D3, those instances that are hard to classify by M1 or M2 have a higher probability of being selected. This process is repeated for K iterations. Thus, unlike bagging that uses independent random samples from the input dataset, boosting employs weighted or biased samples to construct the different training sets, with the current sample depending on the previous ones. Finally, the combined classifier is obtained via weighted voting over the output of the K base classifiers M1, M2, ..., MK.
Boosting is most beneficial when the base classifiers are weak, that is, have an error rate that is slightly less than that for a random classifier. The idea is that whereas M1 may not be particularly good on all test instances, by design M2 may help classify some cases where M1 fails, and M3 may help classify instances where M1 and M2 fail, and so on. Thus, boosting has more of a bias reducing effect. Each of the weak learners is likely to have high bias (it is only slightly better than random guessing), but the final combined classifier can have much lower bias, since different weak learners learn to classify instances in different regions of the input space. Several variants of boosting can be obtained based on how the instance weights are computed for sampling, how the base classifiers are combined, and so on. We discuss Adaptive Boosting (AdaBoost), which is one of the most popular variants.
ALGORITHM 22.5. Adaptive Boosting Algorithm: AdaBoost

ADABOOST(K, D):
 1  w^0 ← (1/n) · 1 ∈ R^n
 2  t ← 1
 3  while t ≤ K do
 4      Dt ← weighted resampling with replacement from D using w^{t−1}
 5      Mt ← train classifier on Dt
 6      ǫt ← Σ_{i=1}^{n} w_i^{t−1} · I(Mt(xi) ≠ yi)    // weighted error rate on D
 7      if ǫt = 0 then break
 8      else if ǫt < 0.5 then
 9          αt = ln((1 − ǫt)/ǫt)    // classifier weight
10          foreach i ∈ [1, n] do    // update point weights
11              w_i^t = w_i^{t−1}    if Mt(xi) = yi
12              w_i^t = w_i^{t−1} · (1 − ǫt)/ǫt    if Mt(xi) ≠ yi
13          w^t ← w^t / (1^T w^t)    // normalize weights
14          t ← t + 1
15  return {M1, M2, ..., MK}
Adaptive Boosting: AdaBoost  Let D be the input training set, comprising n points xi ∈ R^d. The boosting process will be repeated K times. Let t denote the iteration and let αt denote the weight for the tth classifier Mt. Let w_i^t denote the weight for xi, with w^t = (w_1^t, w_2^t, ..., w_n^t)^T being the weight vector over all the points for the tth iteration. In fact, w is a probability vector, whose elements sum to one. Initially all points have equal weights, that is,

\mathbf{w}^0 = \left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)^T = \frac{1}{n}\mathbf{1}

where 1 ∈ R^n is the n-dimensional vector of all 1's.

The pseudo-code for AdaBoost is shown in Algorithm 22.5. During iteration t, the training sample Dt is obtained via weighted resampling using the distribution w^{t−1}, that is, we draw a sample of size n with replacement, such that the ith point is chosen according to its probability w_i^{t−1}. Next, we train the classifier Mt using Dt, and compute its weighted error rate ǫt on the entire input dataset D:

\epsilon_t = \sum_{i=1}^{n} w_i^{t-1} \cdot I\big(M_t(\mathbf{x}_i) \neq y_i\big)

where I is an indicator function that is 1 when its argument is true, that is, when Mt misclassifies xi, and is 0 otherwise.
The weight for the tth classifier is then set as

\alpha_t = \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)

and the weight for each point xi ∈ D is updated based on whether the point is misclassified or not:

w_i^t = w_i^{t-1} \cdot \exp\big(\alpha_t \cdot I(M_t(\mathbf{x}_i) \neq y_i)\big)

Thus, if the predicted class matches the true class, that is, if Mt(xi) = yi, then I(Mt(xi) ≠ yi) = 0, and the weight for point xi remains unchanged. On the other hand, if the point is misclassified, that is, Mt(xi) ≠ yi, then we have I(Mt(xi) ≠ yi) = 1, and

w_i^t = w_i^{t-1} \cdot \exp(\alpha_t) = w_i^{t-1}\exp\left(\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)\right) = w_i^{t-1}\left(\frac{1}{\epsilon_t} - 1\right)
We can observe that if the error rate ǫt is small, then there is a greater weight increment for xi. The intuition is that a point that is misclassified by a good classifier (with a low error rate) should be more likely to be selected for the next training dataset. On the other hand, if the error rate of the base classifier is close to 0.5, then there is only a small change in the weight, since a bad classifier (with a high error rate) is expected to misclassify many instances. Note that for a binary class problem, an error rate of 0.5 corresponds to a random classifier, that is, one that makes a random guess. Thus, we require that a base classifier has an error rate at least slightly better than random guessing, that is, ǫt < 0.5. If the error rate ǫt ≥ 0.5, then the boosting method discards the classifier, and returns to line 4 to try another data sample. Alternatively, one can simply invert the predictions for binary classification. It is worth emphasizing that for a multi-class problem (with k > 2), the requirement that ǫt < 0.5 is a significantly stronger requirement than for the binary (k = 2) class problem because in the multiclass case a random classifier is expected to have an error rate of (k − 1)/k. Note also that if the error rate of the base classifier ǫt = 0, then we can stop the boosting iterations.
Once the point weights have been updated, we re-normalize the weights so that w^t is a probability vector (line 13):

\mathbf{w}^t = \frac{\mathbf{w}^t}{\mathbf{1}^T\mathbf{w}^t} = \frac{1}{\sum_{j=1}^{n} w_j^t}\left(w_1^t, w_2^t, \ldots, w_n^t\right)^T
Combined Classifier Given the set of boosted classifiers, M1,M2,...,MK, along with their weights α1,α2,...,αK, the class for a test case x is obtained via weighted majority voting. Let vj (x) denote the weighted vote for class cj over the K classifiers, given as
v_j(\mathbf{x}) = \sum_{t=1}^{K} \alpha_t \cdot I\big(M_t(\mathbf{x}) = c_j\big)
Because I(Mt (x) = cj ) is 1 only when Mt (x) = cj , the variable vj (x) simply obtains the tally for class cj among the K base classifiers, taking into account the classifier weights. The combined classifier, denoted MK, then predicts the class for x as follows:
\mathbf{M}^K(\mathbf{x}) = \arg\max_{c_j}\big\{\, v_j(\mathbf{x}) \mid j = 1, \ldots, k \,\big\}
In the case of binary classification, with classes {+1, −1}, the combined classifier M^K can be expressed more simply as

\mathbf{M}^K(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{K} \alpha_t M_t(\mathbf{x})\right)
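The following Python sketch follows the structure of Algorithm 22.5 for the binary {+1, −1} case; the decision-stump weak learner, the cap-free retry loop, and the handling of a perfect base classifier are illustrative choices of ours, not the book's implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K, seed=0):
    # y must take values in {+1, -1}
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)                               # w^0 = (1/n) 1
    models, alphas = [], []
    while len(models) < K:
        sample = rng.choice(n, size=n, replace=True, p=w) # weighted resampling using w
        M = DecisionTreeClassifier(max_depth=1).fit(X[sample], y[sample])
        miss = (M.predict(X) != y)
        eps = np.sum(w * miss)                            # weighted error rate on D
        if eps == 0:                                      # perfect base classifier:
            models.append(M); alphas.append(np.log(1e12)) #   keep it with a large weight, stop
            break
        if eps >= 0.5:                                    # discard it and draw another sample
            continue
        alphas.append(np.log((1 - eps) / eps))            # classifier weight alpha_t
        w = w * np.where(miss, (1 - eps) / eps, 1.0)      # boost weights of misclassified points
        w = w / w.sum()                                   # re-normalize to a probability vector
        models.append(M)

    def predict(X_new):
        scores = sum(a * M.predict(X_new) for a, M in zip(alphas, models))
        return np.sign(scores)                            # weighted majority vote
    return predict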
Example 22.14. Figure 22.9a illustrates the boosting approach on the Iris principal components dataset, using linear SVMs as the base classifiers. The regularization constant was set to C = 1. The hyperplane learned in iteration t is denoted ht , thus, the classifier model is given as Mt (x) = sign(ht (x)). As such, no individual linear hyperplane can discriminate between the classes very well, as seen from their error rates on the training set:
Mt    h1      h2      h3      h4
ǫt    0.280   0.305   0.174   0.282
αt    0.944   0.826   1.559   0.935
However, when we combine the decisions from successive hyperplanes weighted by αt , we observe a marked drop in the error rate for the combined classifier MK (x) as K increases:
combined model        M1      M2      M3      M4
training error rate   0.280   0.253   0.073   0.047
We can see, for example, that the combined classifier M3, comprising h1, h2 and h3, has already captured the essential features of the nonlinear decision boundary between the two classes, yielding an error rate of 7.3%. Further reduction in the training error is obtained by increasing the number of boosting steps.
To assess the performance of the combined classifier on independent testing data, we employ 5-fold cross-validation, and plot the average testing and training
error rates as a function of K in Figure 22.9b. We can see that as the number of base classifiers K increases, both the training and testing error rates reduce. However, while the training error essentially goes to 0, the testing error does not reduce beyond 0.02, which happens at K = 110. This example illustrates the effectiveness of boosting in reducing the bias.

Figure 22.9. (a) Boosting SVMs with linear kernel. (b) Average testing and training error: 5-fold cross-validation.
Bagging as a Special Case of AdaBoost: Bagging can be considered as a special case of AdaBoost, where w^t = (1/n)·1, and αt = 1 for all K iterations. In this case, the weighted resampling defaults to regular resampling with replacement, and the predicted class for a test case also defaults to simple majority voting.
22.4 FURTHER READING
The application of ROC analysis to classifier performance was introduced in Provost and Fawcett (1997), with an excellent introduction to ROC analysis given in Fawcett (2006). For an in-depth description of the bootstrap, cross-validation, and other methods for assessing classification accuracy see Efron and Tibshirani (1993). For many datasets simple rules, like one-level decision trees, can yield good classification performance; see Holte (1993) for details. For a recent review and comparison of classifiers over multiple datasets see Demšar (2006). A discussion of bias, variance, and zero–one loss for classification appears in Friedman (1997), with a unified decomposition of bias and variance for both squared and zero–one loss given in Domingos (2000). The concept of bagging was proposed in Breiman (1996), and that of adaptive boosting in Freund and Schapire (1997). Random forests is a tree-based ensemble approach that can be very effective; see Breiman (2001) for details. For a comprehensive overview on the evaluation of classification algorithms see Japkowicz and Shah (2011).
Breiman, L. (1996). Bagging predictors. Machine learning, 24 (2): 123–140.
Breiman, L. (2001). Random forests. Machine learning, 45 (1): 5–32.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.
Domingos, P. (2000). A unified bias-variance decomposition for zero-one and squared loss. Proceedings of the National Conference on Artificial Intelligence, pp. 564–569.
Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Vol. 57. Boca Raton, FL: Chapman & Hall/CRC.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27 (8):
861–874.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of computer and system sciences,
55 (1): 119–139.
Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality.
Data mining and knowledge discovery, 1 (1): 55–77.
Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine learning, 11 (1): 63–90.
Japkowicz, N. and Shah, M. (2011). Evaluating learning algorithms: a classification perspective. New York: Cambridge University Press.
Provost, F. and Fawcett, T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press, pp. 43–48.
22.5 EXERCISES
Q1. True or False:
(a) A classification model must have 100% accuracy (overall) on the training dataset.
(b) A classification model must have 100% coverage (overall) on the training dataset.
Q2. Given the training database in Table 22.6a and the testing data in Table 22.6b, answer the following questions:
(a) Build the complete decision tree using binary splits and Gini index as the evaluation measure (see Chapter 19).
(b) Compute the accuracy of the classifier on the test data. Also show the per class accuracy and coverage.
Table 22.6. Data for Q2

(a) Training
X    Y   Z   Class
15   1   A   1
20   3   B   2
25   2   A   1
30   4   A   1
35   2   B   2
25   4   A   1
15   2   B   2
20   3   B   2

(b) Testing
X    Y   Z   Class
10   2   A   2
20   1   B   1
30   3   A   2
40   2   B   2
15   1   B   1
Q3. Show that for binary classification the majority voting for the combined classifier decision in boosting can be expressed as

\mathbf{M}^K(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{K} \alpha_t M_t(\mathbf{x})\right)
Q4. Consider the 2-dimensional dataset shown in Figure 22.10, with the labeled points belonging to two classes: c1 (triangles) and c2 (circles). Assume that the six hyperplanes were learned from different bootstrap samples. Find the error rate for each of the six hyperplanes on the entire dataset. Then, compute the 95% confidence
interval for the expected error rate, using the t-distribution critical values for different degrees of freedom (dof) given in Table 22.7.

Figure 22.10. For Q4. (The plot shows the labeled points of the two classes and the six hyperplanes h1, ..., h6.)

Table 22.7. Critical values for t-test

dof     1         2        3        4        5        6
tα/2    12.7065   4.3026   3.1824   2.7764   2.5706   2.4469
Q5. Consider the probabilities P(+1|xi) for the positive class obtained for some classifier, and given the true class labels yi:
            x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
yi          +1     −1     +1     +1     −1     +1     −1     +1     −1     −1
P(+1|xi)    0.53   0.86   0.25   0.95   0.87   0.86   0.76   0.94   0.44   0.86

Plot the ROC curve for this classifier.
Index
accuracy, 548
Apriori algorithm, 223 association rule, 220, 301
antecedent, 301
assessment measures, 301 Bonferroni correction, 320 confidence, 220, 302 consequent, 301
conviction, 306
Fisher exact test, 316
general, 315
improvement, 315
Jaccard coefficient, 305 leverage, 304
lift, 303
multiple hypothesis testing, 320 non-redundant, 315
odds ratio, 306
permutation test, 320
swap randomization, 321 productive, 315 randomization test, 320 redundant, 315 significance, 320
specific, 315 support, 220, 302
relative, 302
swap randomization, 321 unproductive, 315 bootstrap sampling, 325 confidence interval, 325
mining algorithm, 234
relative support, 220 association rule mining, 234 attribute
binary, 3 categorical, 3
nominal, 3
ordinal, 3 continuous, 3 discrete, 3 numeric, 3
interval-scaled, 3 ratio-scaled, 3
bagging, 575
Bayes classifier, 466
categorical attributes, 470
numeric attributes, 467 Bayes theorem, 466, 491 Bernoulli distribution
mean, 64
sample mean, 64 sample variance, 64 variance, 64
Bernoulli variable, 63
BetaCV measure, 441 bias-variance decomposition, 571 binary database, 218
vertical representation, 218 Binomial distribution, 65 bivariate analysis
585
586
categorical, 72
numeric, 42
Bonferroni correction, 320 boosting, 576
AdaBoost, 577
combined classifier, 578 bootstrap
resampling, 562 sampling, 325
C-index, 441 Calinski–Harabasz index, 450 categorical attributes
angle, 87
cosine similarity, 88 covariance matrix, 68, 83 distance, 87
Euclidean distance, 87 Hamming distance, 88 Jaccard coefficient, 88 mean, 67, 83
bivariate, 74 norm, 87
sample covariance matrix, 69 bivariate, 75
sample mean, 67 bivariate, 74
Cauchy-Schwartz inequality, 7 central limit theorem, 564 centroid, 333
Charm algorithm, 248
properties, 248
χ2 distribution, 80 chi-squared statistic, 80 χ2 statistic, 80, 85 classification, 29
accuracy, 548, 549, 552 area under ROC curve, 556 assessment measures, 547
contingency table based, 549 AUC, 556
bagging, 575
Bayes classifier, 466
bias, 572
bias-variance decomposition, 571 binary classes, 552
Index
boosting, 576 AdaBoost, 577
classifier evaluation, 561 confidence interval, 564 confusion matrix, 549 coverage, 550 cross-validation, 561 decision trees, 480 ensemble classifiers, 574 error rate, 548, 552 expected loss, 571 F-measure, 550
false negative, 552
false negative rate, 553
false positive, 552
false positive rate, 553
K nearest neighbors classifier, 476 KNN classifier, 476
loss function, 571
naive Bayes classifier, 472 overfitting, 573
paired t-test, 568
precision, 549, 553
recall, 550
sensitivity, 553
specificity, 553
true negative, 552
true negative rate, 553
true positive, 552
true positive rate, 553
unstable, 574
variance, 572
classifier evaluation, 561 bootstrap resampling, 562 confidence interval, 564 cross-validation, 561 paired t-test, 568
closed itemsets, 243 Charm algorithm, 248 equivalence class, 244
cluster stability, 454 clusterability, 457 clustering, 28
centroid, 333
curse of dimensionality, 388 DBSCAN, 375
Index
border point, 375
core point, 375
density connected, 376 density-based cluster, 376 directly density reachable, 375 ǫ-neighborhood, 375
noise point, 375 DENCLUE
density attractor, 385 dendrogram, 364 density-based
DBSCAN, 375
DENCLUE, 385
EM, see expectation maximization EM algorithm, see expectation
maximization algorithm evaluation, 425
expectation maximization, 342, 343 expectation step, 344, 348 Initialization, 344, 348 maximization step, 345, 348 multivariate data, 346
univariate data, 344 expectation maximization
algorithm, 349
external validation, 425 Gaussian mixture model, 342 graph cuts, 401
internal validation, 425 K-means, 334
specialization of EM, 353 kernel density estimation, 379 Kernel K-means, 338
Markov chain, 416
Markov clustering, 416 Markov matrix, 416
relative validation, 425 spectral clustering
computational complexity, 407 stability, 425
sum of squared errors, 333 tendency, 425
validation external, 425 internal, 425 relative, 425
587
clustering evaluation, 425 clustering stability, 425 clustering tendency, 425, 457
distance distribution, 459 Hopkins statistic, 459 spatial histogram, 457
clustering validation
BetaCV measure, 441
C-index, 441 Calinski-Harabasz index, 450 clustering tendency, 457 conditional entropy, 430 contingency table, 426 correlation measures, 436 Davies-Bouldin index, 444 distance distribution, 459 Dunn index, 443 entropy-based measures, 430 external measures, 425 F-measure, 427 Fowlkes-Mallows measure, 435 gap statistic, 452
Hopkins statistic, 459 Hubert statistic, 437, 445
discretized, 438
internal measures, 440 Jaccard coefficient, 435 matching based measures, 426 maximum matching, 427 modularity, 443
mutual information, 431
normalized, 431 normalized cut, 442 pair-wise measures, 433 purity, 426
Rand statistic, 435
relative measures, 448 silhouette coefficient, 444, 448 spatial histogram, 457 stability, 454
variation of information, 432
conditional entropy, 430 confidence interval, 325, 564
small sample, 566
unknown variance, 565 confusion matrix, 549
588
contingency table, 78
χ2 test, 85
clustering validation, 426 multi-way, 84
correlation, 45
cosine similarity, 7 covariance, 43 covariance matrix, 46, 49
bivariate, 74 determinant, 46 eigen-decomposition, 57 eigenvalues, 49
inner product, 50
outer product, 50 positive semi-definite, 49 trace, 46
cross-validation, 561 leave-one-out, 561
cumulative distribution binomial, 18
cumulative distribution function, 18 empirical CDF, 33
empirical inverse CDF, 34 inverse CDF, 34
joint CDF, 22, 23
quantile function, 34 curse of dimensionality
clustering, 388
data dimensionality, 2 extrinsic, 13 intrinsic, 13
data matrix, 1 centering, 10 column space, 12 mean, 9
rank, 13
row space, 12 symbolic, 63 total variance, 9
data mining, 25 data normalization
range normalization, 52
standard score normalization, 52 Davies–Bouldin index, 444 DBSCAN algorithm, 375
Index
decision tree algorithm, 484 decision trees, 480
axis-parallel hyperplane, 482 categorical attributes, 484 data partition, 482
decision rules, 484
entropy, 485
Gini-index, 486 information gain, 486 purity, 483
split-point, 482
split-point evaluation, 487
categorical attributes, 491 measures, 485
numeric attributes, 487
DENCLUE
center-defined cluster, 386 density attractor, 385 density reachable, 387 density-based cluster, 387
DENCLUE algorithm, 385 dendrogram, 364
density attractor, 385 density estimation, 379
nearest neighbor based, 384 density-based cluster, 387 density-based clustering
DBSCAN, 375
DENCLUE, 385 dimensionality reduction, 183 discrete random variable, 14 discretization, 89
equal-frequency intervals, 89
equal-width intervals, 89 dominant eigenvector, 105
power iteration method, 105 Dunn index, 443
Eclat algorithm, 225 computational complexity, 228 dEclat, 229
diffsets, 228
equivalence class, 226
empirical joint probability mass function, 457
ensemble classifiers, 574
Index
bagging, 575
boosting, 576 entropy, 485
split, 486
EPMF, see empirical joint probability
mass function error rate, 548
Euclidean distance, 7
expectation maximization, 342, 343,
357 expectation step, 358
maximization step, 359 expected value, 34 exploratory data analysis, 26
F-measure, 427
false negative, 552
false positive, 552
Fisher exact test, 316, 318 Fowlkes-Mallows measure, 435 FPGrowth algorithm, 231 frequent itemset, 219
frequent itemsets
mining, 221
frequent pattern mining, 27
Gamma function, 80, 166 gap statistic, 452
Gauss error function, 55 Gaussian mixture model, 342 generalized itemset, 250 GenMax algorithm, 245
maximality checks, 245 Gini index, 486
graph, 280
adjacency matrix, 96 weighted, 96
authority score, 110 average degree, 98
average path length, 98 Baraba ́ si-Albert model, 124
clustering coefficient, 131 degree distribution, 125 diameter, 131
centrality
authority score, 110
589
betwenness, 103 closeness, 103
degree, 102
eccentricity, 102 eigenvector centrality, 104 hub score, 110
pagerank, 108
prestige, 104
clustering coefficient, 100 clustering effect, 114 degree, 97
degree distribution, 94 degree sequence, 94 diameter, 98
eccentricity, 98
effective diameter, 99 efficiency, 101
Erdo ̈ s-Re ́ nyi model, 116 HITS, 110
hub score, 110
labeled, 280
pagerank, 108
preferential attachment, 124 radius, 98
random graphs, 116 scale-free property, 113 shortest path, 95 small-world property, 112 transitivity, 101 Watts-Strogatz model, 118
clustering coefficient, 119 degree distribution, 121 diameter, 119, 122
graph clustering
average weight, 409
degree matrix, 395
graph cut, 402
k-way cut, 401
Laplacian matrix, 398
Markov chain, 416
Markov clustering, 416
MCL algorithm, 418
modularity, 411
normalized adjacency matrix, 395 normalized asymmetric Laplacian,
400
590
normalized cut, 404
normalized modularity, 415 normalized symmetric Laplacian,
399
objective functions, 403, 409 ratio cut, 403
weighted adjacency matrix, 394
graph cut, 402
graph isomorphism, 281 graph kernel, 156
exponential, 157 power kernel, 157 von Neumann, 158
graph mining
canonical DFS code, 287 canonical graph, 286 canonical representative, 285 DFS code, 286
edge growth, 283
extended edge, 280
graph isomorphism, 281 gSpan algorithm, 288 rightmost path extension, 284 rightmost vertex, 285
search space, 283
subgraph isomorphism, 282
graph models, 112
Baraba ́ si-Albert model, 124 Erdo ̈ s-Re ́ nyi model, 116 Watts-Strogatz model, 118
graphs
degree matrix, 395
Laplacian matrix, 398
normalized adjacency matrix, 395 normalized asymmetric Laplacian,
400
normalized symmetric Laplacian,
399
weighted adjacency matrix, 394
GSP algorithm, 261 gSpan algorithm, 288
candidate extension, 291 canonicality checking, 295 subgraph isomorphisms, 293 support computation, 291
Index
hierarchical clustering, 364 agglomerative, 364 complete link, 367 dendrogram, 364, 365 distance measures, 367 divisive, 364
group average, 368 Lance-Williams formula, 370 mean distance, 368 minimum variance, 368 single link, 367
update distance matrix, 370 Ward’s method, 368
Hopkins statistic, 459 Hubert statistic, 437, 445 hyper-rectangle, 163 hyperball, 164
volume, 165 hypercube, 164
volume, 165 hyperspace, 163
density of multivariate normal, 172 diagonals, 171
angle, 171 hypersphere, 164
asymptotic volume, 167
closed, 164
inscribed within hypercube, 168 surface area, 167
volume of thin shell, 169 hypersphere volume, 175
Jacobian, 176–178 Jacobian matrix, 176–178
IID, see independent and identically distributed
inclusion-exclusion principle, 251 independent and identically
distributed, 24 information gain, 486
inter-quartile range, 38 itemset, 217
itemset mining, 217, 221
Apriori algorithm, 223 level-wise approach, 223
candidate generation, 221
Index
Charm algorithm, 248 computational complexity, 222 Eclat algorithm, 225
tidset intersection, 225 FPGrowth algorithm, 231
frequent pattern tree, 231 frequent pattern tree, 231 GenMax algorithm, 245 level-wise approach, 223 negative border, 240 partition algorithm, 238 prefix search tree, 221, 223 support computation, 221 tidset intersection, 225
itemsets
assessment measures, 309 closed, 313
maximal, 312
minimal generator, 313 minimum support threshold, 219 productive, 314
support, 309
relative, 309 closed, 243, 248 closure operator, 243
properties, 243
generalized, 250
maximal, 242, 245
minimal generators, 244 non-derivable, 250, 254
relative support, 219
rule-based assessment measures, 310 support, 219
Jaccard coefficient, 435 Jacobian matrix, 176–178
K nearest neighbors classifier, 476 K-means
algorithm, 334
kernel method, 338
k-way cut, 401
kernel density estimation, 379
discrete kernel, 380, 382 Gaussian kernel, 380, 383 multivariate, 382
591
univariate, 379
kernel discriminant analysis, 504 Kernel K-means, 338
kernel matrix, 135
centered, 151
normalized, 153 kernel methods
data-specific kernel map, 142 diffusion kernel, 156
exponential, 157 power kernel, 157 von Neumann, 158
empirical kernel map, 140 Gaussian kernel, 147 graph kernel, 156
Hilbert space, 140
kernel matrix, 135 kernel operations
centering, 151 distance, 149 mean, 149
norm, 148 normalization, 153 total variance, 150
kernel trick, 137 Mercer kernel map, 143 polynomial kernel
homogeneous, 144
inhomogeneous, 144
positive semi-definite kernel, 138 pre-Hilbert space, 140 reproducing kernel Hilbert space,
140
reproducing kernel map, 139 reproducing property, 140 spectrum kernel, 155
string kernel, 155
vector kernel, 144
kernel PCA, see kernel principal component analysis
kernel principal component analysis, 202
kernel trick, 338
KL divergence, see Kullback-Leibler
divergence KNN classifier, 476
592
Kullback-Leibler divergence, 457
linear discriminant analysis, 497 between-class scatter matrix, 500 Fisher objective, 499
optimal linear discriminant, 500 within-class scatter matrix, 500
loss function, 571 squared loss, 571 zero-one loss, 571
Mahalanobis distance, 56 Markov chain, 416 Markov clustering, 416 maximal itemsets, 242
GenMax algorithm, 245
maximum likelihood estimation, 343,
353
covariance matrix, 355
mean, 354
mixture parameters, 356 maximum matching, 427 mean, 34
median, 35
minimal generator, 244 mode, 36
modularity, 412, 443 multinomial distribution, 71
covariance, 72
mean, 72
sample covariance, 72 sample mean, 72
multiple hypothesis testing, 320 multivariate analysis
categorical, 82
numeric, 48
multivariate Bernoulli variable, 66, 82
covariance matrix, 68, 83 empirical PMF, 69
joint PMF, 73
mean, 67, 83
probability mass function, 66, 73 sample covariance matrix, 69 sample mean, 67
multivariate variable Bernoulli, 66
Index
mutual information, 431 normalized, 431
naive Bayes classifier, 472 categorical attributes, 475 numeric attributes, 472
nearest neighbor density estimation, 384
non-derivable itemsets, 250, 254 inclusion-exclusion principle, 251 support bounds, 252
normal distribution
Gauss error function, 55
normalized cut, 442
orthogonal complement, 186 orthogonal projection matrix, 186
error vector, 186 orthogonal subspaces, 186
pagerank, 108
paired t-test, 568
pattern assessment, 309
PCA, see principal component analysis permutation test, 320
swap randomization, 321 population, 24
power iteration method, 105 PrefixSpan algorithm, 265 principal component, 187
kernel PCA, 202
principal component analysis, 187
choosing the dimensionality, 197 connection with SVD, 211
mean squared error, 193, 197 minimum squared error, 189 total projected variance, 192, 196
probability distribution bivariate normal, 21 normal, 17
probability density function, 16 joint PDF, 20, 23
probability distribution Bernoulli, 15, 63 Binomial, 15 Gaussian, 17
Index
multivariate normal, 56
normal, 54
probability mass function, 15
empirical joint PMF, 43 empirical PMF, 34
joint PMF, 20, 23
purity, 426
quantile function, 34
quartile, 38
Rand statistic, 435
random graphs, 116
  average degree, 116
  clustering coefficient, 117
  degree distribution, 116
  diameter, 118
random sample, 24
  multivariate, 24
  statistic, 25
  univariate, 24
random variable, 14
  Bernoulli, 63
  bivariate, 19
  continuous, 14
  correlation, 45
  covariance, 43
  covariance matrix, 46, 49
  discrete, 14
  empirical joint PMF, 43
  expectation, 34
  expected value, 34
  generalized variance, 46, 49
  independent and identically distributed, 24
  inter-quartile range, 38
  mean, 34
    bivariate, 43
    multivariate, 48
  median, 35
  mode, 36
  moments about the mean, 39
  multivariate, 23
  standard deviation, 39
  standardized covariance, 45
  total variance, 43, 46, 49
  value range, 38
  variance, 38
  vector, 23
receiver operating characteristic curve, 555
ROC curve, see receiver operating characteristic curve
rule assessment, 301
sample covariance matrix
  bivariate, 75
sample mean, 25
sample space, 14
sample variance
  geometric interpretation, 40
sequence, 259
  closed, 260
  maximal, 260
sequence mining
  alphabet, 259
  GSP algorithm, 261
  prefix, 259
  PrefixSpan algorithm, 265
  relative support, 260
  search space, 261
  sequence, 259
  Spade algorithm, 263
  subsequence, 259
    consecutive, 259
  substring, 259
  substring mining, 267
  suffix, 259
  suffix tree, 267
  support, 260
silhouette coefficient, 444, 448
singular value decomposition, 208
  connection with PCA, 211
  Frobenius norm, 210
  left singular vector, 209
  reduced SVD, 209
  right singular vector, 209
  singular value, 209
  spectral decomposition, 210
Spade algorithm
  sequential joins, 263
spectral clustering
  average weight, 409
  computational complexity, 407
  degree matrix, 395
  k-way cut, 401
  Laplacian matrix, 398
  modularity, 411
  normalized adjacency matrix, 395
  normalized asymmetric Laplacian, 400
  normalized cut, 404
  normalized modularity, 415
  normalized symmetric Laplacian, 399
  objective functions, 403, 409
  ratio cut, 403
  weighted adjacency matrix, 394
spectral clustering algorithm, 406
standard deviation, 39
standard score, 39
statistic, 25
  robustness, 35
  sample correlation, 45
  sample covariance, 44
  sample covariance matrix, 46, 50
  sample inter-quartile range, 38
  sample mean, 25, 35
    bivariate, 43
    multivariate, 48
  sample median, 36
  sample mode, 36
  sample range, 38
  sample standard deviation, 39
  sample total variance, 43
  sample variance, 39
  standard score, 39
  trimmed mean, 35
  unbiased estimator, 35
  z-score, 39
statistical independence, 22
Stirling numbers
  second kind, 333
string, see sequence
string kernel
  spectrum kernel, 155
subgraph, 281
  connected, 281
  support, 283
subgraph isomorphism, 282
substring mining, 267
  suffix tree, 267
  Ukkonen’s algorithm, 270
suffix tree, 267
  Ukkonen’s algorithm, 270
support vector machines, 513
  bias, 513
  canonical hyperplane, 517
  classifier, 521
  directed distance, 514
  dual algorithm, 534
  dual objective, 520
  hinge loss, 524, 531
  hyperplane, 513
  Karush-Kuhn-Tucker conditions, 520
  kernel SVM, 529
  linearly separable, 514
  margin, 517
  maximum margin hyperplane, 519
  Newton optimization algorithm, 538
  non-separable case, 523
  nonlinear case, 529
  primal algorithm, 538
  primal kernel SVM algorithm, 541
  primal objective, 519
  quadratic loss, 528, 531
  regularization constant, 524
  separable case, 519
  separating hyperplane, 514
  slack variables, 524
  soft margin, 524
  stochastic gradient ascent algorithm, 534
  support vectors, 517
  training algorithms, 533
  weight vector, 513
SVD, see singular value decomposition
SVM, see support vector machines
swap randomization, 321
tids, 218
tidset, 218
total variance, 9, 43
transaction, 218
transaction database, 218
transaction identifiers, 218
true negative, 552
true positive, 552
Ukkonen’s algorithm
  computational cost, 271
  implicit extensions, 272
  implicit suffixes, 271
  skip/count trick, 272
  space requirement, 270
  suffix links, 273
  time complexity, 276
univariate analysis
  categorical, 63
  numeric, 33
variance, 38
variation of information, 432
vector
  dot product, 6
  Euclidean norm, 6
  length, 6
  linear combination, 4
  Lp-norm, 7
  normalization, 7
  orthogonal decomposition, 10
  orthogonal projection, 11
  orthogonality, 8
  perpendicular distance, 11
  standard basis, 4
  unit vector, 6
vector kernel, 144
  Gaussian, 147
  polynomial, 144
vector random variable, 23
vector space
  basis, 13
  column space, 12
  dimension, 13
  linear combination, 12
  linear dependence, 13
  linear independence, 13
  orthogonal basis, 13
  orthonormal basis, 13
  row space, 12
  span, 12
  spanning set, 12
  standard basis, 13
Watts-Strogatz model
  clustering coefficient, 122
z-score, 39