Introduction to data mining
CMT207 Information modelling & database systems
Statistical modelling
statisticians were the first to use the term data mining
statisticians view data mining as the construction of a statistical model, i.e. an underlying distribution from which the visible data is drawn
e.g. suppose the data is a set of numbers with a normal distribution
the mean and standard deviation would then become the model of the data
Machine learning
some say that data mining = machine learning
some data mining uses algorithms from machine learning
supervised machine learning uses data as a training set to train an algorithm, e.g. Bayes nets, support vector machines, etc.
good approach when we have little idea of what we are looking for in the data
e.g. it is not very clear what it is about movies that makes certain viewers like or dislike them
... but machine learning algorithms, trained on a sample of viewers' responses, have proved quite successful in predicting movie ratings
Computational approaches to modelling
more recently, computer scientists have looked at data mining as an algorithmic problem
the model of the data is simply the answer to a complex query about it
many different approaches, but most can be described as either:
summarising the data succinctly and approximately, or
extracting the most prominent features of the data and ignoring the rest
Lecture
in this lecture, we begin with the essence of data mining
we cover Bonferroni’s principle, which is a warning about overusing the ability to mine data
Data mining
data mining is the discovery of “models” for data
What is a “model”?
a “model” can be interpreted in several different ways
statistical modelling
machine learning
computational approaches to modelling
summarisation
feature extraction
Summarisation
one of the most interesting forms of summarisation is the PageRank idea
in this form of Web mining, the entire complex structure of the Web is summarised by a single number for each page
a remarkable property of this ranking is that it reflects very well the “importance” of the page
it relates to the degree to which typical searchers would want to see a page returned as an answer to their search query
Bonferroni’s principle
a theorem in statistics known as the Bonferroni correction
gives a statistically sound way to avoid treating random occurrences as if they were real
calculate the expected number of occurrences of the events you are looking for under the assumption that the data is random
if this number is significantly larger than the number of real instances you hope to find, then you must expect that most of the occurrences are bogus
Feature extraction
looks for the most extreme examples of a phenomenon and represents the data by these examples
two important types of feature extraction:
1. frequent itemsets – looking for small sets of items that frequently appear together, e.g. hamburger and ketchup
   these sets are the characterisation of the data
2. similar items – often, data looks like a collection of sets, e.g. customers can be viewed as the set of items they bought
   the objective is to find pairs of sets that have a relatively large fraction of their elements in common
   e.g. Amazon can recommend products that many “similar” customers have bought
Example
trying to identify people who cheat on their spouses we know that the percentage of those who cheat on
their spouses is 5%
you decide that people who go out with co-workers
more than three times a month are cheating on their spouses
your method discovers that 20% of people qualify as cheaters
20% >> 5%, therefore, in the very best case only 1⁄4 will actually be cheaters (true positives), and the rest will be bogus (false positives)
11
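The best-case arithmetic above can be checked directly; the following minimal Python sketch is illustrative only and uses just the 5% and 20% figures from the example.

```python
# Best-case precision of the "co-worker outings" rule from the example above.
true_rate = 0.05      # fraction of people who actually cheat
flagged_rate = 0.20   # fraction of people the rule flags as cheaters

# Even if every real cheater is flagged, at most true_rate / flagged_rate
# of the flagged people can be true positives.
best_case_precision = true_rate / flagged_rate
print(best_case_precision)  # 0.25, i.e. at best 1/4 are real and the rest are bogus
```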
Bonferroni’s principle
a warning against overzealous use of data mining
suppose you look for occurrences of certain events in a dataset
even if the data is completely random, you can still expect to find such occurrences
the number of these occurrences will grow as the dataset grows in size
… but these occurrences may be “bogus”, i.e. random, with no particular cause
Summary
Frequent itemsets
CMT207 Information modelling & database systems
Market basket
market basket (shopping cart) brought to the register for checkout
contains items, i.e. different products that the store sells
a major chain might sell 100,000 different items and collect data about millions of market baskets
Lecture
in the previous lecture we introduced three approaches to data mining
one of these was a computational approach, which includes:
summarisation
feature extraction
we learnt that mining of frequent itemsets is an important type of feature extraction
in this lecture we will learn how to find frequent itemsets
Itemset
itemset = a set of items
frequent itemset = an itemset that appears in many market baskets
a retailer can learn what is commonly bought together, which is an opportunity to do some clever marketing
Market–basket data model
Example: diapers & beer
one chain store analysed their basket data and discovered that people who buy diapers are unusually likely to buy beer
one would hardly expect these two items to be related
explanation:
if you buy diapers, you probably have a baby at home
if you have a baby, then you are unlikely to be drinking at a pub
instead, you are more likely to bring beer home
marketing ploy: put diapers on sale and raise the price of beer
Other applications
items = words
baskets = documents, e.g. Web pages, blogs, tweets
a basket (document) contains those items (words) that are present in the document
we would hope to find among the frequent pairs some pairs of words that represent a joint concept (relationship)
e.g. a pair like {Brad, Angelina} appears with high frequency
Brad + Angelina = Brangelina
Example
documents returned by googling cat and dog:
1. {cat, and, dog, bites}
2. {yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}
3. {cat, killer, likely, is, a, big, dog}
4. {professional, free, advice, on, dog, training, puppy, training}
5. {cat, and, kitten, training, and, behaviour}
6. {dog, &, cat, provides, dog, training, in, Eugene, Oregon}
7. {dog, and, cat, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}
8. {shop, for, your, show, dog, grooming, and, pet, supplies}
The market–basket data model
used to describe a common form of many–to–many relationship between two types of objects:
on the one hand, we have items
on the other we have baskets (aka transactions)
each basket consists of a set of items (an itemset)
we assume that:
the number of items in a basket is small
the number of baskets is very large
Support
the empty set ∅ is a subset of any set and, therefore, its support is 8 (it appears in every one of the eight baskets)
we shall not concern ourselves with the empty set, since it tells us nothing
next:
1. singletons: {cat}, {and}, {dog}, {bites}, {yahoo}, {news}, {claims}, {a}, {mated}, {with}, {produced}, {viable}, {offspring}, …
2. doubletons: {cat, and}, {and, dog}, {cat, dog}, {dog, bites}, {cat, bites}, {yahoo, news}, …
3. triples, etc.
Frequent itemsets
intuitively, a set of items that appears in many baskets is said to be frequent
How many is many? 10, 100, 1K, 1M?
formally:
s is a number called the support threshold
I is a set of items
support for I = the number of baskets that contain it as a subset
we say that I is frequent if its support ≥ s
Example
the eight baskets from the cat-and-dog example, with support threshold s = 3:
1. {cat, and, dog, bites}
2. {yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}
3. {cat, killer, likely, is, a, big, dog}
4. {professional, free, advice, on, dog, training, puppy, training}
5. {cat, and, kitten, training, and, behaviour}
6. {dog, &, cat, provides, dog, training, in, Eugene, Oregon}
7. {dog, and, cat, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}
8. {shop, for, your, show, dog, grooming, and, pet, supplies}

singleton supports:

itemset       support
{cat}         6
{dog}         7
{and}         5
{a}           3
{training}    3
{for}         2
{is}          2

with s = 3, the frequent singletons are {cat}, {dog}, {and}, {a} and {training}
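The singleton supports above can be recomputed in a few lines of Python; the sketch below is illustrative (the basket list is transcribed from the example) rather than part of the course material.

```python
# Recompute singleton supports for the eight cat-and-dog baskets.
from collections import Counter

baskets = [
    {"cat", "and", "dog", "bites"},
    {"yahoo", "news", "claims", "a", "cat", "mated", "with", "a", "dog", "and",
     "produced", "viable", "offspring"},
    {"cat", "killer", "likely", "is", "a", "big", "dog"},
    {"professional", "free", "advice", "on", "dog", "training", "puppy", "training"},
    {"cat", "and", "kitten", "training", "and", "behaviour"},
    {"dog", "&", "cat", "provides", "dog", "training", "in", "Eugene", "Oregon"},
    {"dog", "and", "cat", "is", "a", "slang", "term", "used", "by", "police",
     "officers", "for", "a", "male-female", "relationship"},
    {"shop", "for", "your", "show", "dog", "grooming", "and", "pet", "supplies"},
]

support = Counter(item for basket in baskets for item in basket)
s = 3  # support threshold used in the example
frequent_singletons = {item for item, count in support.items() if count >= s}
print(support["cat"], support["dog"], support["and"], support["a"], support["training"])
# 6 7 5 3 3 -- matching the table above
```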
Example continued
now, let us look at the doubletons
a doubleton cannot be frequent unless both items in the set are frequent by themselves
therefore, we only need to look at combinations of frequent singletons
Example continued
candidates for frequent triples:
{cat, dog, and} in baskets (1), (2) and (7)
{cat, dog, a} in baskets (2), (3) and (7)
support ≥ 3, so both are frequent triples
there can be no frequent quadruples or larger sets
this approach will be summarised later as the a-priori algorithm
Example
doubleton counts for the frequent singletons (using the same eight baskets):

             {dog}  {and}  {a}  {training}
{cat}          5      4     3       2
{dog}                 4     3       2
{and}                       2       1
{a}                                 0
Example continued
five frequent doubletons (support ≥ 3):
{cat, dog}, {cat, and}, {cat, a}, {dog, and}, {dog, a}
next, let us see if there are frequent triples
in order to be a frequent triple, each pair of elements in the triple must be a frequent doubleton
by combining frequent doubletons, we get two candidates for frequent triples:
{cat, dog, and}
{cat, dog, a}
Association rules
information about frequent itemsets is often presented as a collection of if–then rules, called association rules
the form of an association rule is I → i, where I is an itemset and i is an item
interpretation: if I appears in some basket, then i is “likely” to appear in that basket as well
we formalise the notion of “likely” by defining the confidence of the rule as the following ratio:
support(I ∪ { i }) / support(I)
e.g. confidence({cat, dog} → and) = support({cat, dog, and}) / support({cat, dog}) = 3/5
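A small sketch of how support and confidence could be computed for such rules; the `support` and `confidence` helpers are assumptions for illustration, and `baskets` refers to the list of eight sets from the earlier sketch.

```python
# Support and confidence of an association rule I -> i.
def support(itemset, baskets):
    """Number of baskets that contain itemset as a subset."""
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, i, baskets):
    """confidence(I -> i) = support(I ∪ {i}) / support(I)"""
    return support(I | {i}, baskets) / support(I, baskets)

# Using the `baskets` list from the earlier sketch:
# confidence({cat, dog} -> and) = support({cat, dog, and}) / support({cat, dog}) = 3/5
print(confidence({"cat", "dog"}, "and", baskets))  # 0.6
```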
Confidence
confidence of the association rule I → i: confidence(I → i) = support(I ∪ { i }) / support(I)
confidence can be useful, provided the support for the left side (I) is fairly large, e.g.
{hotdog} → mustard
there is nothing “unusual” about people buying hot dogs and mustard together
as long as we know that many people buy hot dogs and that many people who buy hot dogs will buy mustard as well, we can use this to inform the sales strategy
Finding rules with high confidence
having already calculated the support of both I and I \ {i}, their ratio is the confidence of the rule I \ {i} → i:
confidence(I \ {i} → i) = support(I) / support(I \ {i})
we can now focus on those with high confidence
assumption: there are not too many frequent itemsets and thus not too many high–confidence association rules
the reason is that such association rules need to be acted upon
e.g. if we give a store manager a million association rules, they cannot even read them, let alone act on them
the support threshold should be adjusted so that we do not get too many frequent itemsets
Association rules with high confidence
we want the confidence of the rule to be reasonably high, e.g. 50%, or else the rule has little practical effect
confidence(I → i) ≥ 0.5
support(I ∪ {i}) / support(I) ≥ 0.5
support(I ∪ {i}) ≥ 0.5 x support(I)
if I has high support, then I ∪ {i} will also have fairly high support
finding association rules with high confidence is not much harder than finding frequent itemsets
assume that it is possible to find those frequent itemsets whose support ≥ threshold s
A–priori algorithm
Finding rules with high confidence
suppose we have found all itemsets with support ≥ threshold s
within them, we can find all association rules that have both high support and high confidence
let I be a frequent itemset of n items
there are n possible association rules involving I:
I \ {i} → i for each item i in I
I is frequent and I \ {i} is at least as frequent
therefore, I \ {i} is also a frequent itemset
Representation of market–basket data
we assume that market–basket data is stored in a file basket by basket, e.g.
{23,456,1001}{3,18,92,145}{…
the character { begins a basket and the character } ends a basket
the items in a basket are represented by integers and are separated by commas
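A possible way to read such a file in Python is sketched below; the function name and the exact handling of whitespace are assumptions, not part of the slides.

```python
# Parse a basket-per-braces file such as "{23,456,1001}{3,18,92,145}".
import re

def read_baskets(path):
    """Yield each basket as a set of integer item ids."""
    with open(path) as f:
        data = f.read()
    for group in re.findall(r"\{([^}]*)\}", data):
        yield {int(item) for item in group.split(",") if item.strip()}

# e.g. for a file containing "{23,456,1001}{3,18,92,145}" this yields
# {23, 456, 1001} and then {3, 18, 92, 145}
```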
Finding frequent pairs
let us concentrate on finding frequent pairs
if we have enough main memory to count all pairs, then it is simple to read the file of baskets in a single pass
for each basket, we use a double loop to generate all pairs
each time we generate a pair, we add 1 to its count
at the end, we examine all pairs to see which have counts ≥ the support threshold s
these are the frequent pairs
A note on the power law
many types of data follow a power law distribution, which is characterised by:
1. a very small number of events that have a very high probability of appearing, and
2. a very large number of events that have a very low probability of appearing
e.g. a few words (e.g. the) with enormously high probability vs. many words (e.g. database) with low probability of appearing in a randomly chosen sentence
represented by a continuously decreasing curve
Finding frequent pairs
however, a single pass approach will fail if there are too many pairs to count them all in main memory
the a–priori algorithm is designed to reduce the number of pairs that must be counted
… but at the expense of performing two passes over data, rather than a single pass
Between the passes of a–priori
after the first pass, we examine the item counts to determine which ones are frequent singletons
because of the power law, we do not usually expect many singletons to be frequent
anyway, by setting the support threshold s sufficiently high, we can ensure not to get too many frequent sets
a typical support threshold s would be 1% of the baskets
for the second pass of a–priori, we create a new numbering from 1 to m for frequent items
The first pass of a–priori
in the first pass, we create an array of counts
the i-th element counts the occurrences of the item numbered i
initially, the counts for all the items are set to 0
as we read baskets, we add 1 to the corresponding element of the array
The second pass of a–priori
a pair cannot be frequent unless both its members are frequent
so, we count all pairs that consist of two frequent items
the space required is of the order m², e.g. an m × m matrix of counts
the second pass steps:
1. for each basket, look in the frequent–items table to see which of its items are frequent
2. in a double loop, generate all pairs of frequent items in that basket
3. for each such pair, add 1 to its count in the data structure used to store counts
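The two passes just described could be sketched in Python as follows; this is an illustrative implementation of the idea (not the course's reference code), using `itertools.combinations` to generate pairs.

```python
# Two-pass a-priori for frequent pairs.
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # pass 1: count singletons and keep the frequent ones
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # pass 2: count only pairs whose members are both frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in basket if item in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1

    return {pair for pair, c in pair_counts.items() if c >= s}

# With the `baskets` list from the earlier sketch and s = 3, this returns the five
# frequent doubletons: {cat, dog}, {cat, and}, {cat, a}, {dog, and}, {dog, a}.
```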
this notion of similarity is called Jaccard similarity
we will then examine some of the uses of finding similar sets, e.g.
finding textually similar documents
collaborative filtering
Document similarity
an important class of problems that Jaccard similarity addresses well is that of finding textually similar documents in a large corpus such as the Web
documents are represented as "bags" of words and we compare documents by measuring the overlap of their bag representations
applications:
plagiarism
mirror pages
news aggregation
Similarity and distance
CMT207 Information modelling & database systems
Jaccard similarity
Jaccard similarity of sets A and B is defined as the ratio of the size of the intersection of A and B to the size of their union, i.e.
sim(A, B) = |A ∩ B| / |A ∪ B|
e.g. if A and B have 3 elements in common, and 2 and 3 elements respectively that are not shared, then sim(A, B) = 3 / (2 + 3 + 3) = 3/8 = 0.375
note that sim(A, B) ∈ [0, 1]
sim(A, B) = 0 if A ∩ B = ∅
sim(A, B) = 1 if A = B
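A direct Python rendering of the definition; the behaviour for two empty sets (returning 1.0) is an assumption, since the slides do not define that case.

```python
# Jaccard similarity of two Python sets.
def jaccard(a: set, b: set) -> float:
    """sim(A, B) = |A ∩ B| / |A ∪ B| (defined here as 1.0 when both sets are empty)."""
    return len(a & b) / len(a | b) if a or b else 1.0

print(jaccard({1, 2, 3, 4, 5}, {3, 4, 5, 6, 7, 8}))  # 3/8 = 0.375, as in the example
```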
Lecture
a fundamental data mining problem is to find "similar" items in the data
initially, we will focus on a particular notion of similarity based on the relative size of the intersection between the sets compared
Similarity
Plagiarism
finding plagiarised documents tests our ability to find textual similarity
the plagiariser may …
copy some parts of a document
alter a few words
alter the order in which sentences appear
yet the resulting document may still contain > 50% of the original material
comparing documents character by character will not detect sophisticated plagiarism, but Jaccard similarity can
Mirror pages
it is common for an important or popular Web site to be duplicated at a number of hosts to improve its availability
these mirror sites are quite similar, but are rarely identical
e.g. they might contain information associated with their particular host
it is important to be able to detect mirror pages, because search engines should avoid showing nearly identical pages within the first page of results
News aggregation
the same news gets reported by different publishers
each article is somewhat different
news aggregators, such as Google News, try to find all versions in order to group them together
this requires finding Web pages that are textually similar, although not identical
Online purchases
Amazon has millions of customers and sells millions of products
its database records which products have been bought by which customers
we can say that two customers are similar if their sets of purchased products have a high Jaccard similarity
likewise, two products that have sets of purchasers with high Jaccard similarity could be considered similar
Movie ratings
NetFlix records which movies each of its customers watched, together with their ratings
movies are similar if they were watched or rated highly by many of the same customers
customers are similar if they watched or rated highly many of the same movies
examples …
Collaborative filtering
the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating)
the underlying assumption of collaborative filtering is that those who agreed in the past tend to agree again in the future, e.g.
online purchases
movie ratings
…
Distance vs. similarity
typically, given a similarity measure, one can "revert" it to serve as the distance measure and vice versa
conversions may differ, e.g. if d is a distance measure, then one can use:
sim(x, y) = 1 / (1 + d(x, y))
or
sim(x, y) = e^(−d(x, y))
if sim is the similarity measure that ranges between 0 and 1, then the corresponding distance measure can be defined as:
d(x, y) = 1 − sim(x, y)
Distance
Distance axioms
formally, distance is a measure that satisfies the following conditions:
1. d(x, y) ≥ 0 (non-negativity)
2. d(x, y) = 0 iff x = y (coincidence)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
these conditions express intuitive notions about the concept of distance
Distance vs. similarity
similarity is a measure of how close to each other two instances are
the closer the instances are to each other, the larger the similarity value
distance is also a measure of how close to each other two instances are
the closer the instances are to each other, the smaller the distance value
Euclidean distance
the most familiar distance measure is the one we normally think of as "distance"
an n-dimensional Euclidean space is one where points are vectors of n real numbers
e.g. a point in a 3D space is represented as (x, y, z)
Pythagorean theorem
Exercise: d(A, B) = ? for points A = (x1, y1) and B = (x2, y2) in the plane (figure)
1. square the distance in each dimension
2. sum the squares
3. take the square root
using the Pythagorean theorem:
d(A, B)² = (x1 − x2)² + (y1 − y2)²
d(A, B) = √((x1 − x2)² + (y1 − y2)²)

Euclidean distance
Euclidean distance in an n-dimensional space between X = (x1, x2, …, xn) and Y = (y1, y2, …, yn):
d(X, Y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)

Cosine similarity
the cosine similarity between two vectors in a Euclidean space is a measure that calculates the cosine of the angle between them
this metric is a measurement of orientation and not magnitude

value        meaning
−1           exactly opposite
1            exactly the same
0            orthogonal
in between   intermediate similarity
Cosine similarity
we can calculate the cosine similarity as follows:
sim(A, B) = cos(θ) = (A · B) / (|A| |B|)
cosine similarity ranges from −1 to 1
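Both measures can be written in a few lines of plain Python; the sketch below assumes tuple- or list-like vectors of equal length.

```python
# Euclidean distance and cosine similarity for plain Python sequences.
from math import sqrt

def euclidean(x, y):
    """d(X, Y) = sqrt(sum_i (x_i - y_i)^2)"""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    """cos(theta) = (X . Y) / (|X| |Y|); ranges from -1 to 1."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = sqrt(sum(xi * xi for xi in x))
    norm_y = sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(euclidean((0, 0), (3, 4)))          # 5.0
print(cosine_similarity((1, 0), (0, 1)))  # 0.0 -- orthogonal vectors
```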
Cosine distance
cosinedistanceisatermoftenusedforthemeasure defined by the following formula:
d(A,B) = 1 – sim(A,B)
it is important to note that this is not a proper distance measure!
it does not satisfy the triangle inequality property
it violates the coincidence property
Edit distance
informally, edit distance is defined as the minimal number (or cost) of changes needed to transform one string into the other
these changes may include the following edit operations:
1. insertion of a single character
2. deletion of a single character
3. replacement (substitution) of two corresponding characters in the two strings being compared
4. transposition (reversal or swap) of two adjacent characters in one of the strings
Edit operations

operation       example
insertion       …ac… → …abc…
deletion        …abc… → …ac…
replacement     …abc… → …adc…
transposition   …abc… → …acb…
Edit distance
edit distance has been widely applied in natural language processing for approximate string matching, where:
1. the distance between identical strings is equal to zero
2. the distance increases as the strings get more dissimilar with respect to:
   the symbols they contain, and
   the order in which they appear
Applications
successfully utilised in NLP applications to deal with:
alternate spellings
misspellings
the use of white spaces as a means of formatting
UPPER– and lower–case letters
other orthographic variations
e.g. 80% of spelling mistakes can be identified and corrected automatically by considering a single omission, insertion, substitution or reversal
the cost of an alignment is calculated by summing the costs of the individual edit operations it is comprised of
when all edit operations have the same cost, the cost of an alignment is equivalent to the total number of operations in the alignment
Applications
apart from NLP, the most popular application area of edit distance is molecular biology
edit distance is used to compare DNA sequences in order to infer information about:
common ancestry
functional equivalence
possible mutations
etc.
Variants
depending on the types of edit operations allowed, a number of specialised variants of edit distance have been identified:
Hamming distance allows only replacement
the longest common subsequence problem allows only insertion and deletion, both at the same cost
Levenshtein distance allows insertion, deletion and replacement of individual characters, where individual edit operations may have different costs
Damerau distance extends Levenshtein distance by permitting the transposition of two adjacent characters
Notation and terminology
let x = x1 … xm and y = y1 … yn be two strings of lengths m and n respectively, where 0 < m ≤ n
a sequence of edit operations transforming x into y is referred to as an alignment (also edit sequence, edit script or edit path)
Levenshtein distance
well suited for a number of practical applications
most of the existing algorithms have been developed for the simple Levenshtein distance
many of them can easily be adapted for:
general Levenshtein distance, where different costs are used for different operations
Damerau distance, where transposition is an allowed edit operation
transposition is important in some applications such as text searching, where transpositions are typical typing errors
note that a transposition could be simulated by using an insertion followed by a deletion, but the total cost would be different in that case!
Edit distance
formally, the value of edit distance between x and y, ed(x, y), corresponds to the minimal alignment cost over all possible alignments for x and y
when all edit operations incur the same cost, edit distance is referred to as simple edit distance
simple edit distance is a distance measure, i.e. ed(x, y) ≥ 0, ed(x, y) = 0 iff x = y, ed(x, y) = ed(y, x), ed(x, z) ≤ ed(x, y) + ed(y, z)
general edit distance permits different costs for different operations or even symbols
the choice of the operation costs influences the "meaning" of the corresponding alignments, and thus they depend on a specific application
Dynamic programming
a class of algorithms based on the idea of:
breaking a problem down into sub-problems so that optimal solutions can be obtained for sub-problems
combining sub-solutions to produce an optimal solution to the overall problem
the same idea is applied incrementally to sub-problems
by saving and re-using the results obtained for the sub-problems, unnecessary re-computation is avoided for recurring sub-problems, thus facilitating the computation of the overall solution
Wagner–Fischer algorithm
a dynamic programming approach to the computation of Levenshtein distance, which relies on the following reasoning:
at each step of an alignment, i.e. after aligning two leading substrings of the two strings, there are only three possibilities:
1. delete the next symbol from the first string (delete)
2. delete the next symbol from the second string (insert)
3. match the next symbol in the first string to the next symbol in the second string (exact match, or replace otherwise)
Wagner–Fischer algorithm
for the cost C(i, j) of aligning the leading substrings x1 … xi and y1 … yj, the cost of their alignment is calculated as follows:
C(0, 0) = 0
1 ≤ i ≤ m, j = 0:        C(i, 0) = C(i−1, 0) + IC(xi)
i = 0, 1 ≤ j ≤ n:        C(0, j) = C(0, j−1) + DC(yj)
1 ≤ i ≤ m, 1 ≤ j ≤ n:    C(i, j) = min { C(i−1, j) + IC(xi),  C(i, j−1) + DC(yj),  C(i−1, j−1) + RC(xi, yj) }
where IC, DC and RC are the costs of the insert, delete and replace operations

Wagner–Fischer algorithm
if the cost values are represented by a cost matrix, then the matrix needs to be filled so that the values needed for the calculation of C(i, j) are available: C(i−1, j−1) (upper-left), C(i−1, j) (upper) and C(i, j−1) (left)
it suffices to fill the cost matrix row-wise left-to-right, column-wise top-to-bottom, or diagonally upper-left to lower-right
the edit distance between x and y is then obtained as C(m, n)

Wagner–Fischer algorithm
let x[1..m], y[1..n] be two arrays of char
let ed[0..m, 0..n] be a 2D array of int
// distance to an empty string
for i in [0..m] ed[i, 0] = i;
for j in [0..n] ed[0, j] = j;
for j in [1..n]
  for i in [1..m]
    if x[i] = y[j]
    then ed[i, j] = ed[i–1, j–1];       // match, so no operation required
    else ed[i, j] = minimum of (
           ed[i–1, j] + 1,              // delete
           ed[i, j–1] + 1,              // insert
           ed[i–1, j–1] + 1             // replace
         );
return ed[m, n];

Example
a cost matrix calculated for the strings weather and sweeter provides the value 3 for their edit distance
an optimal alignment can be extracted from the cost matrix by tracing back whether each cell was reached by an insert from y into x, a delete from x, or a replace or match, e.g.:
– w e a t h e r
s w e e t – e r

String similarity
two strings are regarded similar if their edit distance is lower than a certain threshold: ed(x, y) ≤ t → x ∼ y
an absolute threshold t does not take into account the lengths m and n (m ≤ n) of the strings x and y
the same threshold should not be used for very long strings and very short ones, e.g. doppelgänger vs. doppelgaenger and hot vs. dog
a relative threshold r = t × m proportional to the length of the shorter string is suggested instead
otherwise, short strings would be erroneously regarded similar to non-similar long strings
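A runnable Python translation of the Wagner–Fischer pseudocode above, using unit costs (i.e. simple Levenshtein distance); variable names follow the slides where possible.

```python
# Wagner-Fischer edit distance with unit costs.
def edit_distance(x: str, y: str) -> int:
    m, n = len(x), len(y)
    # ed[i][j] = distance between x[:i] and y[:j]
    ed = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        ed[i][0] = i          # distance to an empty string: i deletions
    for j in range(n + 1):
        ed[0][j] = j          # distance to an empty string: j insertions
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]           # match, no operation required
            else:
                ed[i][j] = min(ed[i - 1][j] + 1,      # delete
                               ed[i][j - 1] + 1,      # insert
                               ed[i - 1][j - 1] + 1)  # replace
    return ed[m][n]

print(edit_distance("weather", "sweeter"))  # 3, as in the example
```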
Clustering
CMT207 Information modelling & database systems
Clustering
solution: divide & conquer
cluster analysis divides data into groups (clusters) that are meaningful, useful or both
Lecture
in the previous lecture we learnt how to compare objects using distance and similarity measures
in this lecture, we will learn about clustering, which is used to automatically group similar objects together
Clustering for understanding
classes are conceptually meaningful groups of objects that share common characteristics
classes play an important role in how we analyse and describe the world
in the context of understanding data, clusters are potential classes
Information overload
a difficulty a person may have when trying to deal with more information than they are able to process to make sensible decisions
results in delaying decisions or making the wrong decisions
an increasing problem in the context of big data
Clustering for utility
some clustering techniques characterise each cluster in terms of a cluster prototype
cluster prototype is a data object that is representative of other objects in the cluster
cluster prototypes can be used as the basis for a number of data analysis or data processing techniques
What is clustering?
How many clusters? 2, 4 or 6?
the notion of a cluster is ambiguous
it is difficult to decide what constitutes a cluster
the best definition of a cluster depends on the nature of the data and the desired outputs
Clustering
clustering – a data mining technique that automatically groups objects with similar properties into clusters
objects within a cluster should be similar (or related) to one another and different from (or unrelated to) the objects in other clusters
the greater the similarity within a cluster and the greater the difference between clusters, the better the clustering
unsupervised learning – no annotations/training required!
Hierarchical vs. partitional clustering
two basic types of clustering:
1. hierarchical
2. non–hierarchical (partitional)
Clustering
the goal is to:
1. minimise intra–cluster distances
2. maximise inter–cluster distances
Hierarchical clustering
clusters are permitted to have subclusters
a set of nested clusters that are organised in a tree
each node (cluster) in the tree (except the leaves) is the union of its children (subclusters)
the root of the tree contains all objects
Partitional clustering
partitional clustering is simply a division of the set of objects into non–overlapping subsets (clusters)
note that hierarchical clustering can be viewed as a sequence of partitional clusterings
a partitional clustering can be obtained by cutting the tree at a particular level
Types of clusters
clustering aims to find useful groups of objects
usefulness is defined by the goals of data analysis
there are several different notions of a cluster that prove useful in practice:
1. well–separated
2. prototype–based
3. graph–based
4. density–based
5. shared–property (conceptual)
6. described by an objective function
Complete vs. partial clustering
a complete clustering assigns every object to a cluster
a partial clustering does not
the motivation for partial clustering is that some objects may not belong to well defined groups
many times data objects may represent noise, outliers or uninteresting background
Well–separated clusters
each object is closer (or more similar) to every object in the cluster than to any other object not in the cluster
a threshold can be used to specify that all objects in a cluster must be sufficiently close to one another
idealistic definition of a cluster
satisfied only when the data contains natural clusters that are quite far from one another
well-separated clusters do not need to be spherical, but can have any shape
Clusters
Prototype–based clusters
each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster
for many types of data, the prototype can be regarded as the most central point
in such instances, we commonly refer to prototype– based clusters as centre–based clusters
such clusters tend to be spherical
Graph–based clusters
if the data is represented as a graph, where the nodes are objects and the links represent connections between objects, then a cluster can be defined as a connected component
a group of objects that are connected to one another, but are not connected to objects outside of the group
an important example of graph–based clusters are contiguity–based clusters, where two objects are connected if they are within a specified distance from each other
Clusters described by an objective function
finds clusters that minimise or maximise an objective function
enumerate all possible ways of dividing the points into clusters and evaluate them using the objective function (NP hard!)
can have global or local objectives
hierarchical clustering algorithms typically have local objectives
partitional clustering algorithms typically have global objectives
Density–based clusters
a cluster is a dense region of objects that is surrounded by a region of low density
a density-based definition of a cluster is often used when the clusters are irregular or intertwined and when noise and outliers are present
a contiguity-based definition of a cluster would not work well in such cases, since the noise would tend to form connections between clusters
Shared–property (conceptual) clusters
a group of objects that share some property
encompasses all previous definitions of clusters
... but also covers new types of clusters
points in a cluster share some general property that derives from the entire set of objects
too sophisticated
takes us into the area of pattern recognition
Clustering strategies

Hierarchical clustering
hierarchical clustering = a method of cluster analysis which seeks to build a hierarchy of clusters
two strategies:
1. agglomerative (bottom-up approach)
   each instance starts separately in its own cluster
   pairs of clusters are merged recursively
2. divisive (top-down approach)
   all instances start together in a single cluster
   clusters are split recursively
the result is usually presented as a dendrogram
… which may correspond to a meaningful taxonomy
Hierarchical clustering
nested cluster diagram and dendrogram (figure)
Cluster proximity
MIN defines cluster proximity as the proximity between the closest two points in different clusters
this yields contiguity-based clusters
MAX defines cluster proximity as the proximity between the farthest two points in different clusters
alternative names:
MIN = single link
MAX = complete link
Agglomerative hierarchical clustering
starting with individual points as clusters
successively merge the two closest clusters until only one cluster remains
basic algorithm:
1. Compute the proximity matrix.
2. repeat
3.   Merge the closest two clusters.
4.   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5. until only one cluster remains.
Cluster proximity
the group average technique defines cluster proximity to be the average pair–wise proximities of all pairs of points from different clusters
Cluster proximity
the key operation of the algorithm is the computation of the proximity between clusters
the definition of cluster proximity differentiates the various agglomerative hierarchical techniques
cluster proximity is typically defined with a particular type of cluster in mind
e.g. many agglomerative hierarchical clustering techniques, such as MIN, MAX and group average, are defined with graph-based clusters in mind
Cluster proximity
if, instead of a graph–based view, we take a prototype–based view...
each cluster is represented by a centroid
proximity is commonly defined as the proximity between cluster centroids
Single link (MIN) hierarchical clustering
cluster proximity is defined as the proximity between the closest two points in different clusters
similarity matrix (for points 1–5):

     1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00

worked example with points a–f (figures):
1. start with all points as singleton clusters: 6 clusters {a}, {b}, {c}, {d}, {e}, {f}
   at each step merge the two clusters with the two closest points
2. c and f are the two closest points: 5 clusters {a}, {b}, {c, f}, {d}, {e}
3. {b} and {e} are the two closest clusters: 4 clusters {a}, {b, e}, {c, f}, {d}
4. {c, f} and {b, e} are the two closest clusters because of the proximity between b and c: 3 clusters {a}, {b, e, c, f}, {d}
5. d is closer to c than a is to b: 2 clusters {a}, {b, c, d, e, f}
6. merge the only two remaining clusters: 1 cluster {a, b, c, d, e, f}
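A small, illustrative sketch of single-link merging driven by the 5 × 5 similarity matrix above (pure Python, merging until a single cluster remains); the printing of intermediate clusterings is just for inspection.

```python
# Single-link (MIN) agglomerative clustering over a similarity matrix.
sim = {
    (1, 1): 1.00, (1, 2): 0.90, (1, 3): 0.10, (1, 4): 0.65, (1, 5): 0.20,
    (2, 2): 1.00, (2, 3): 0.70, (2, 4): 0.60, (2, 5): 0.50,
    (3, 3): 1.00, (3, 4): 0.40, (3, 5): 0.30,
    (4, 4): 1.00, (4, 5): 0.80,
    (5, 5): 1.00,
}

def s(a, b):
    return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

def single_link(points):
    clusters = [frozenset([p]) for p in points]
    while len(clusters) > 1:
        # MIN: with similarities, the "closest" pair of clusters is the pair
        # whose most similar cross-cluster points have the highest similarity
        best = max(
            ((c1, c2) for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]),
            key=lambda pair: max(s(a, b) for a in pair[0] for b in pair[1]),
        )
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
        print(clusters)

single_link([1, 2, 3, 4, 5])
# merges {1,2} (0.90), then {4,5} (0.80), then {3} joins {1,2} (0.70), then all (0.65)
```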
Strength/limitation of MIN
strength: can handle non-elliptical shapes
limitation: sensitive to noise and outliers

Complete link (MAX) hierarchical clustering
cluster proximity is defined as the proximity between the farthest two points in different clusters
using the same similarity matrix, the proximity of two clusters is given by their least similar pair of points, e.g.
sim({3}, {1, 2}) = sim(3, 1) = 0.10
sim({3}, {4, 5}) = sim(3, 5) = 0.30
sim({1, 2}, {4, 5}) = sim(1, 5) = 0.20
Complete link (MAX) hierarchical clustering: successive merge steps (figures)
Strength/limitation of MAX
strength: less susceptible to noise and outliers
limitation: tends to break large clusters
Hierarchical clustering: group average
cluster proximity is defined as the average of pair-wise proximity between points in the two clusters
similarity matrix:

     1     2     3     4     5
1  1.00  0.90  0.10  0.65  0.20
2  0.90  1.00  0.70  0.60  0.50
3  0.10  0.70  1.00  0.40  0.30
4  0.65  0.60  0.40  1.00  0.80
5  0.20  0.50  0.30  0.80  1.00

example merge steps (figures)
Hierarchical clustering: summary
comparison of MIN, MAX and group average (figure)
Strength/limitation of group average
group average is a compromise between MIN and MAX
strength: less susceptible to noise and outliers
limitation: biased towards spherical clusters
Time and space requirements
n = number of points
space: O(n²), since a proximity matrix of size n × n is used
time: O(n³)
there are n steps, and at each step the proximity matrix of size n × n needs to be searched and updated
this can be reduced to O(n² log n) in some approaches
Limitations
once a decision has been made to combine two clusters, it cannot be undone
no objective function is directly optimised
different schemes have problems with one or more of the following:
sensitivity to noise and outliers
difficulty handling clusters of different sizes and convex shapes
breaking large clusters
K–means algorithm
1. Select k points as the initial centroids.
2. repeat
3.   Form k clusters by assigning all points to the closest centroid.
4.   Recompute the centroid for each cluster.
5. until The centroids don't change.
initial centroids are often chosen randomly
the stopping condition is often changed to "until relatively few points change clusters"
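A minimal k-means sketch for 2D points, following the five steps above; the random initialisation and the fixed iteration cap are assumptions for illustration.

```python
# Minimal k-means for a list of (x, y) tuples.
import random

def kmeans(points, k, iterations=100):
    """Return the final centroids after running k-means on 2D points."""
    centroids = random.sample(points, k)   # 1. select k points as the initial centroids
    for _ in range(iterations):
        # 3. form k clusters by assigning all points to the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            clusters[closest].append(p)
        # 4. recompute the centroid for each cluster (keep the old one if a cluster is empty)
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:     # 5. stop when the centroids don't change
            break
        centroids = new_centroids
    return centroids
```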
K–means clustering
works well when a given set is composed of a number of distinct classes
the number of clusters k must be specified
problem: how to choose k objectively?
K-means will converge for most common similarity measures in the first few iterations
less computationally intensive than hierarchical methods: O(n ⋅ k ⋅ i ⋅ a), where n, k, i and a are the numbers of points, clusters, iterations and attributes
modest space requirements because only points and centroids are stored: O((n + k) ⋅ a)
Non-hierarchical clustering
non-hierarchical clustering attempts to directly partition the dataset into a set of disjoint clusters
k-means clustering aims to partition dataset into k clusters in which each instance belongs to the cluster with the nearest mean
basic principles:
each cluster is associated with its centre (centroid)
each point is assigned to the cluster with the closest centroid
Two different k–means clusterings of the same original points: an optimal clustering and a sub–optimal clustering (figures)
Importance of choosing initial centroids (figures showing iterations 1–6)
Evaluation
Problems with choosing initial centroids
if there are k "real" clusters then the chance of selecting one centroid from each cluster is small
… very small when k is large!
sometimes the initial centroids will readjust themselves in the "right" way and sometimes they will not
possible "solutions":
multiple runs may help
sample and use hierarchical clustering to determine initial centroids
select the most widely separated points
…
Evaluation
evaluating clustering algorithms is complex because it is difficult to find an objective measure of quality of clusters
typical objective functions in clustering formalise the goal of attaining:
1. cohesion – high intra-cluster similarity (instances within a cluster are similar)
2. separation – low inter-cluster similarity (instances from different clusters are dissimilar)
Limitations of k–means
has problems when clusters are of differing:
sizes
densities
non–spherical shapes
Graph–based view of cohesion and separation
mathematically, cohesion and separation for a graph-based cluster can be expressed as follows:
cohesion(Ci) = the sum of proximity(x, y) over all pairs of points x, y in cluster Ci
separation(Ci, Cj) = the sum of proximity(x, y) over all points x in Ci and y in Cj
Prototype–based view of cohesion and separation
mathematically, cohesion and separation for a prototype-based cluster can be expressed in terms of centroids as follows:
cohesion(Ci) = the sum of proximity(x, ci) over all points x in cluster Ci, where ci is the centroid of Ci
separation(Ci, Cj) = proximity(ci, cj)
Cluster validity
previous definitions of cohesion and separation give us some simple and well–defined measures of cluster validity
they can be combined into an overall validity measure by using a weighted sum:
overall validity = Σ wi × validity(Ci), summed over the clusters Ci
the weights will vary depending on the cluster validity measure
in some cases, the weights are simply 1 or the cluster sizes
Classification
CMT207 Information modelling & database systems
Classification
classification is used to map an object to one (or more) classes from a predefined set of classes
class = a set of objects having some property or attribute in common and differentiated from others by kind, type or quality
class – also referred to as a type, category, kind, sort...
Lecture
rule-based systems: 'if-then' (deduction)
supervised machine learning (induction):
k–nearest neighbour
inducing decision trees
naïve Bayes
support vector machines
inter–annotator agreement
evaluation: precision, recall, F-measure
cross–validation
platforms: Weka, Mechanical Turk, Crowdflower
Examples
classifying ...
e–mail as spam or not spam
credit card transactions as legitimate or fraudulent
tumour cells as benign or malignant
news stories as finance, weather, entertainment, sports, etc.
Dealing with complex information
'divide' & conquer
'compartmentalise'
grouping similar or related things together
associating information
classifying objects into different categories
Classification task
the input data for a classification task is a collection of records
each record, also known as an instance or example, is characterised by a tuple (x, y), where
x is the set of attributes
y is a special attribute designated as the class label
classification is the task of mapping an input attribute set x into its class label y, i.e. (x, ?) → (x, y)
attributes can be discrete or continuous
the class label must be a discrete attribute
Classification task
formally, classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y, i.e. f(x) = y
the target function is also known informally as a classification model
input: attribute set (x1, x2, ... , xn) → classification model → output: class label y
Rule–based classification
deductive reasoning: applying a series of rules to reach a conclusion
modus ponens:
If A then B.
A.
Therefore B.
e.g. junk e-mail filtering rule:
If (Subject contains 'viagra') then (e-mail is junk).
Subject contains 'viagra'.
Therefore e-mail is junk.
Example: training records with attributes and a class label (figure)
Rule–based classification: attribute set → classification model → class label (figure)
Classification
when dealing with large volumes of information, we would like to take advantage of computers' processing power to classify objects efficiently
Can this process be automated? Yes, various classification methods are used in data mining!
classifier – an algorithm that implements classification
two basic types of classification methods:
1. rule–based (top–down approach, deductive reasoning): specific conclusion is derived from general principles
2. machine learning (bottom–up approach, inductive reasoning): general conclusion is derived from specific instances
Example: e-mail filter rules (figure)
Examples of classification problems
assigning an object a label from a predefined set of labels, e.g.:
text categorisation, e.g. spam filtering
optical character recognition
natural language processing, e.g. part-of-speech tagging
medical diagnosis
Quiz: give your own examples
Example
a rule covers an instance x if the attributes of x satisfy the condition of the rule
R1: if (Give Birth = no) & (Can Fly = yes), then bird
R2: if (Give Birth = no) & (Live in Water = yes), then fish
R3: if (Give Birth = yes) & (Blood Type = warm), then mammal
R4: if (Give Birth = no) & (Can Fly = no), then reptile
R5: if (Live in Water = sometimes), then amphibian
rule R1 covers a hawk → bird
rule R3 covers a grizzly bear → mammal
Rule–based systems
model expert knowledge
the concept of an expert system is this: the knowledge of an expert is encoded into the rule set
rule-based systems are fairly simplistic, consisting of little more than a set of if-then statements
a relatively simple model that can be adapted to any number of problems
Example
R1: if (Give Birth = no) & (Can Fly = yes), then bird
R2: if (Give Birth = no) & (Live in Water = yes), then fish
R3: if (Give Birth = yes) & (Blood Type = warm), then mammal
R4: if (Give Birth = no) & (Can Fly = no), then reptile
R5: if (Live in Water = sometimes), then amphibian
a lemur triggers rule R3, so it is classified as a mammal
a turtle triggers both R4 and R5: reptile or amphibian
a dogfish shark triggers none of the rules
Example
R1: if (Give Birth = no) & (Can Fly = yes), then bird
R2: if (Give Birth = no) & (Live in Water = yes), then fish
R3: if (Give Birth = yes) & (Blood Type = warm), then mammal
R4: if (Give Birth = no) & (Can Fly = no), then reptile
R5: if (Live in Water = sometimes), then amphibian
Rule–based classification systems
each rule has a left-hand side and a right-hand side
the left-hand side contains information about certain facts, which must be true in order for the rule to potentially fire (execute)
If more than one rule can fire, which one to choose?
different heuristics can be applied for conflict resolution:
first applicable
random
most specific
least recently used
best rule
Conflict resolution strategy
first applicable: rules are ordered and the first applicable rule is chosen
simple, allows control, predictable
potential problem: an infinite loop
random: a single random rule from the conflict set is chosen
not predictable
most specific: the rule with the most conditions satisfied is chosen
Rule–based system
How to convert expert knowledge into rules?
Do such rules exist after all?
rule-based systems are only feasible for problems for which the knowledge in the problem area can be written in the form of if–then rules
… and for which this problem area is not large!
problems: often very complex, rules can interact, inefficient, cannot learn (i.e. improve with experience), do not work well with subjective problems…
Conflict resolution strategy
least recently used: each rule is accompanied by a time stamp, and the least recently used rule is chosen
maximises the number of rules that are fired at least once
best rule: each rule is given a weight, which specifies how much it should be considered over the alternatives
the rule with the most preferable outcome is chosen based on this weight
Rule–based systems
the concept of an expert system: the knowledge of an expert is encoded into the rule set
… but expert knowledge is often tacit knowledge, i.e. locked in people’s heads
typical bottleneck in expert systems: knowledge elicitation
knowledge elicitation = the process of explicating domain-specific knowledge underlying human performance
explicit = fully & clearly expressed or demonstrated, leaving nothing merely implied
Supervised machine learning

Machine learning
machine learning – studies how to automatically learn to make accurate predictions based on past observations
supervised machine learning:
a human expert has determined into what classes an object may be categorised (classification scheme)
a human expert has provided a training set: a set of sample objects with known classes (labels or annotations)
e.g. a set of images labelled as apple or not an apple (figure)
Training data – example
features (gender, mask, cape, tie, ears, smokes) and class labels:

name      gender  mask  cape  tie  ears  smokes  class
Batman    male    yes   yes   no   yes   no      Good
Robin     male    yes   yes   no   no    no      Good
Alfred    male    no    no    yes  no    no      Good
Penguin   male    no    no    yes  no    yes     Bad
Catwoman  female  yes   no    no   yes   no      Bad
Joker     male    no    no    no   no    no      Bad

Supervised machine learning
1. training phase: the training set (labelled data) is used to automatically build a classification model
2. testing phase: the test set (also labelled data) is used to evaluate the classification model
3. the classification model is used to predict class(es) for new (unseen) examples

k–nearest neighbour (k–NN)
basic idea: if it walks like a duck and quacks like a duck, then it is probably a duck
given an unknown record, compute its distance to the training records and choose the k nearest neighbours (figure)
Popular supervised approaches
naïve Bayes – a probabilistic classifier based on Bayes theorem, which analyses the relationship between features & the labelled class to estimate a conditional probability that links feature values & the class
k-nearest neighbour (k-NN) – computes distances from a test instance to available training instances & assigns the majority class label among the k nearest neighbours
C4.5 decision tree – C4.5 builds a decision tree from training data by choosing features that most effectively split the data between observed classes
support vector machine (SVM) – constructs a hyperplane that optimally separates the training data into two classes
k-nearest neighbour
requires three things:
1. the set of training records
2. distance metric to compute distance between records
3. the value of k, i.e. the number of nearest neighbours to retrieve
to classify an unknown record:
1. compute distance to other training records
2. identify k nearest neighbours
3. use the class labels of the nearest neighbours to determine the class label of the unknown record (e.g. by taking a majority vote)
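A compact k-NN sketch matching the three ingredients above (training records, a distance metric and k); the toy data and the use of Euclidean distance are illustrative assumptions.

```python
# k-nearest neighbour classification by majority vote.
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_classify(unknown, training, k=3):
    """training is a list of (point, label) pairs; returns the majority label
    among the k nearest neighbours of `unknown`."""
    neighbours = sorted(training, key=lambda rec: dist(rec[0], unknown))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((1, 1), "Good"), ((1, 2), "Good"),
            ((8, 8), "Bad"), ((9, 8), "Bad"), ((8, 9), "Bad")]
print(knn_classify((2, 1), training, k=3))  # Good
```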
k-nearest neighbour
k–NN classifiers are lazy learners
they do not build models explicitly
… unlike eager learners such as decision tree induction and rule–based systems
classifying unknown records is relatively expensive
Example: decision tree
the root node uses the attribute body temperature to separate warm-blooded from cold-blooded vertebrates
since all cold-blooded vertebrates are non-mammals, a leaf node labelled non-mammals is created
if the vertebrate is warm-blooded, a subsequent attribute, gives birth, is used to distinguish mammals from other warm-blooded creatures
Decision tree learning
Example: classification
to classify a test record, start from the root node
apply the test condition and follow the appropriate branch based on the outcome of the test
this leads either to another internal node, for which a new test condition is applied, or to a leaf node
the class label associated with the leaf node is assigned to the record
Decision tree
a decision tree is a flowchart-like structure in which:
each internal node represents a test on an attribute
each branch represents the outcome of the test
each leaf node represents a class label (decision)
the paths from the root to the leaves represent classification rules
How to build a decision tree?
there are many possible decision trees for a given set of attributes
some decision trees are more accurate than others
finding the optimal tree is computationally infeasible because of the exponential size of the search space
efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree
they usually employ a greedy strategy that grows a decision tree by making a series of locally optimal decisions about which attribute to use to partition the data
Hunt’s algorithm
one such greedy algorithm is Hunt’s algorithm
it is the basis of many existing decision tree induction algorithms, including ID3, C4.5 and CART
a decision tree is grown recursively by partitioning the training records into successively smaller subsets
notation:
let Dt be the set of training records that are associated with node t
let C = {c1, c2, ..., cn} be the class labels
Hunt’s algorithm
if ( all records in Dt belong to the same class ci ) then
  { t is a leaf node labelled as ci }
else {
  1. an attribute test condition is selected to partition the records into smaller subsets
  2. a child node is created for each outcome of the test condition
  3. the records in Dt are distributed across the children based on the outcomes
  4. the algorithm is then recursively applied to each child node
}
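A compact Python sketch of this recursive partitioning idea; for simplicity it splits on attributes in a fixed order, whereas a real implementation would select the attribute test condition using an impurity-based measure (discussed later), and it uses a majority-class leaf as a fallback.

```python
# A simplified Hunt-style recursive tree grower for categorical attributes.
from collections import Counter

def grow_tree(records, attributes):
    """records: list of (attribute_dict, label); returns a nested dict tree or a label."""
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                       # all records in the same class -> leaf
        return labels[0]
    if not attributes:                              # nothing left to split on -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                            # naive choice of test condition
    tree = {"attribute": attr, "children": {}}
    for value in {attrs[attr] for attrs, _ in records}:
        subset = [(a, l) for a, l in records if a[attr] == value]
        tree["children"][value] = grow_tree(subset, attributes[1:])
    return tree
```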
Hunt’s algorithm – further modification
the algorithm will work if:
1. every combination of attribute values is present in the training data, and
2. each combination has a unique class label
these assumptions are too stringent for use in most practical situations!
the algorithm needs to be modified so that it can handle these cases:
1. some of the child nodes created are empty
2. all records have identical attribute values except for the class label
Example
deciding whether a loan applicant will default on loan payments
classes: defaulted = yes vs. defaulted = no
a training set of records of previous borrowers
each record contains personal information of a borrower along with a class label indicating whether the borrower has defaulted on loan payments
Hunt’s algorithm – further modification
case 1: no records associated with a child node
this can happen if none of the training records have the combination of attribute values associated with such node
if this is the case, then the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node
Hunt’s algorithm – further modification
case 2: all records associated with Dt have identical attribute values except for the class label
it is not possible to split these records any further
if this is the case, then the node is declared a leaf node with the same class label as the majority class of training records associated with this node
2. How should the splitting procedure stop?
a stopping condition is needed to terminate the tree–growing process
a possible strategy is to continue expanding a node until:
1. all records belong to the same class, or
2. all records have identical attribute values
both conditions are sufficient to stop any decision tree induction algorithm
… but other criteria can be imposed to allow the tree– growing procedure to terminate early
the advantages of early termination will be discussed later
Inducing decision trees
a learning algorithm for inducing decision trees must address the following two questions:
1. How should the training records be split?
2. How should the splitting procedure stop?
Attribute test conditions
decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types:
binary
nominal (categorical)
ordinal
continuous
1. How should the training records be split?
each recursive step of the tree–growing process must select an attribute test condition to divide the records into smaller subsets
to implement this step, the algorithm must provide:
1. a method for specifying the test condition for different attribute types, and
2. an objective measure for evaluating the goodness of each test condition
Binary attributes
e.g. body temperature:
warm–blooded vs. cold–blooded
the test condition for a binary attribute generates two potential outcomes
Nominal (categorical) attributes
e.g. marital status: single, married or divorced
test condition can be expressed in two ways:
1. multi–way split: the number of outcomes depends on the number of distinct values for the attribute
2. binary split: considers all 2^(k−1) − 1 ways of creating a binary partition of k attribute values
Continuous attributes
for the binary split, the decision tree algorithm must consider all possible split positions v and select the one that produces the best partition
for the multi–way split, the algorithm must consider all possible ranges of continuous values
Ordinal attributes
e.g. shirt size: small < medium < large < extra large
ordinal attributes can also produce binary or multi-way splits
... but ordinal attribute values can be grouped only if the grouping does not violate the order property
Selecting the best split
many measures can be used to determine the best split these measures are defined in terms of the class
distribution of the records before and after splitting
they are often based on the degree of impurity of the
child nodes, e.g.
where p(i|t) is the fraction of records belonging to class Ci at a given node t
classification errors can arise due to the presence of noise, lack of representative instances, etc.
a good model must not only fit the training data well, but also accurately classify records it has never seen
i.e. a good model must have low training error and low generalisation error
Selecting the best split
decision tree induction algorithms often choose the test condition that maximises the gain Δ = I(parent) − Σj [N(vj)/N] · I(vj), where I(·) is the chosen impurity measure, N is the number of records at the parent node and N(vj) is the number of records associated with child node vj
since I(parent) is the same for all test conditions, maximising the gain is equivalent to minimising the weighted average impurity of the child nodes
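a short sketch of this gain computation, assuming Gini as the impurity measure I (the class counts below are invented for illustration):

```python
# Gain of a candidate split: I(parent) minus the weighted average impurity of
# the child nodes.

def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini(child) for child in children_counts)
    return gini(parent_counts) - weighted

# parent node: 10 positive / 10 negative; a split into [7, 3] and [3, 7]
print(gain([10, 10], [[7, 3], [3, 7]]))   # ~0.08
```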
Tree pruning
if a decision tree is fully grown, it may lose some generalisation capability
large decision trees are prone to overfitting
a tree–pruning step can be performed to reduce its size
Decision tree induction
input: the training records E and the attribute set F
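the pseudocode on the slide is not reproduced in this handout; below is a hedged Python skeleton of the recursive induction procedure in the spirit of Hunt's algorithm, using the inputs named above (training records E and attribute set F). The toy loan records, the use of Gini with multi–way splits, and all helper names are my own simplifications, not the lecture's.

```python
# A minimal decision tree inducer: E is a list of (attributes, label) records,
# F a set of nominal attribute names.
from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(E, attr):
    groups = defaultdict(list)
    for attrs, label in E:
        groups[attrs[attr]].append((attrs, label))
    return groups

def best_attribute(E, F):
    def weighted_impurity(attr):
        groups = partition(E, attr)
        return sum(len(g) / len(E) * gini([l for _, l in g]) for g in groups.values())
    return min(F, key=weighted_impurity)

def majority(E):
    return Counter(label for _, label in E).most_common(1)[0][0]

def tree_growth(E, F, parent_majority=None):
    if not E:                                     # case 1: empty node -> parent's majority class
        return parent_majority
    labels = [label for _, label in E]
    if len(set(labels)) == 1:                     # pure node -> leaf
        return labels[0]
    if not F or all(a == E[0][0] for a, _ in E):  # case 2: identical attribute values -> majority class
        return majority(E)
    attr = best_attribute(E, F)                   # attribute test condition (see above)
    children = {v: tree_growth(sub, F - {attr}, majority(E))
                for v, sub in partition(E, attr).items()}
    return {"split_on": attr, "children": children}

# toy loan data: (attributes, defaulted on loan payments?)
E = [({"home_owner": "yes", "marital": "single"}, "no"),
     ({"home_owner": "no", "marital": "married"}, "no"),
     ({"home_owner": "no", "marital": "single"}, "yes"),
     ({"home_owner": "no", "marital": "divorced"}, "yes")]
print(tree_growth(E, {"home_owner", "marital"}))
```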
Naïve–Bayesian classification
naïve Bayes classifier – a probabilistic learning method based on Bayes’ theorem
Bayes’ theorem – uses evidence e to estimate the probability of a hypothesis h:
P(h|e) = P(e|h) · P(h) / P(e)
where P(h) is the prior probability of h, P(e|h) is the probability of observing e given h, P(e) is the probability of observing e, and P(h|e) is the posterior probability of h
when multiple hypotheses are considered, Bayes’ theorem provides the probability of each hypothesis being true given the evidence
Naïve–Bayesian classification
given a record with a single attribute A, the goal is to predict the class C
class probability: P(C|A) = P(A|C) · P(C) / P(A)
suggest the class with the highest probability for the given attribute
What about multiple attributes A1, A2, … , An?
Naïve–Bayesian classification
given a record with attributes A1, A2, … , An, the goal is to predict the class C
class probability: P(C|A1, A2, … , An) = P(A1, A2, … , An|C) · P(C) / P(A1, A2, … , An)
as before, suggest the class with the highest probability for the given attributes
Q: How to calculate P(A1, A2, … , An|C)?
if we assume independence among the attributes, then P(A1, A2, … , An|C) = P(A1|C) · P(A2|C) · … · P(An|C)
the individual probabilities are easy to calculate
Naïve–Bayesian classification
training step: the training data are used to estimate the parameters of the probability distributions (the right–hand side of the equation)
testing step: these estimates are used to compute the posterior probability (the left–hand side of the equation) for a given test instance
each test instance is classified using the hypothesis with the largest posterior probability
Example
classes: mammal, non–mammal
attributes: give birth, can fly, live in water, have legs
training data: a table of animals labelled as mammal or non–mammal
A – the given attribute values of a test record, M – mammal, N – non–mammal
since P(A|M) · P(M) > P(A|N) · P(N), the test record is classified as a mammal
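the full calculation is not reproduced in this handout, but the overall procedure can be sketched as follows; the tiny dataset, attribute names and function names are my own stand-ins, not the lecture's (a real implementation would also smooth zero counts, e.g. with Laplace smoothing).

```python
# Naive Bayes with categorical attributes: the training step estimates P(C)
# and P(Ai|C) from counts, the testing step picks the class with the largest
# posterior score P(C) * product_i P(Ai|C).
from collections import Counter, defaultdict

def train(records):
    class_counts = Counter(label for _, label in records)
    cond_counts = defaultdict(Counter)          # (attribute, class) -> value counts
    for attrs, label in records:
        for attr, value in attrs.items():
            cond_counts[(attr, label)][value] += 1
    return class_counts, cond_counts

def classify(attrs, class_counts, cond_counts):
    n = sum(class_counts.values())
    best, best_score = None, 0.0
    for label, count in class_counts.items():
        score = count / n                       # prior P(C)
        for attr, value in attrs.items():       # likelihoods P(Ai|C); unseen values give 0
            score *= cond_counts[(attr, label)][value] / count
        if score > best_score:
            best, best_score = label, score
    return best

data = [({"gives_birth": "yes", "can_fly": "no"}, "mammal"),
        ({"gives_birth": "yes", "can_fly": "no"}, "mammal"),
        ({"gives_birth": "no", "can_fly": "yes"}, "non-mammal"),
        ({"gives_birth": "no", "can_fly": "no"}, "non-mammal")]
priors, likelihoods = train(data)
print(classify({"gives_birth": "yes", "can_fly": "no"}, priors, likelihoods))  # mammal
```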
Naïve–Bayesian classification
robust to irrelevant attributes
robust to isolated noise points
handles missing values by ignoring them during probability calculations
… but the independence assumption may not hold for some attributes
Support vector machine (SVM)
the goal of an SVM is to construct a hyperplane (decision boundary) that optimally separates the training data into two classes
Example: two classes of training points can typically be separated by many different hyperplanes, e.g. B1 and B2
Support vector machine (SVM)
Separation
Which hyperplane better separates the data? B1 or B2?
How do we define better?
hyperplane that maximises the margin
B1 is better than B2
Margin
intuitively, we are more certain of the class of points that lie far from the separating hyperplane
thus, it is desirable for all training points to be as far from the hyperplane as possible
… but on the correct side of that hyperplane
Problem
How do we find the hyperplane that maximises the separation margin?
Solution
we want to maximise the margin, which for a hyperplane w · x + b = 0 is 2 / ||w||
this is equivalent to minimising ||w||^2 / 2
… but subject to the following constraints: yi (w · xi + b) ≥ 1 for every training point (xi, yi), where yi ∈ {−1, +1}
this is a constrained optimisation problem
there are numerical approaches to solve it, e.g. quadratic programming
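in practice one would usually call an existing solver rather than write the quadratic program by hand; the sketch below uses scikit-learn's SVC with a linear kernel on made-up points (the library and the toy data are my own assumptions, not part of the lecture).

```python
# Fit a maximum-margin linear SVM on a tiny linearly separable toy dataset.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],      # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C approximates hard-margin separation
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
print("prediction for (2, 4):", clf.predict([[2.0, 4.0]])[0])
```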
Issues
some problems are not linearly separable
some problems may have a non–linear decision boundary
… but non–linear decision boundaries are more expensive to find
Inter–annotator agreement
Annotations
supervised machine learning (SML) depends on manually annotated training data
learning from mistakes does not work in SML!
an SML approach will learn to make the same mistakes as the human annotators!
therefore, the crucial issue is: Are the annotations correct?
if the annotations differ between annotators, then who do we trust?
Validity vs. reliability
validity: Is an annotation correct?
problem: there is no “ground truth”
i.e. we do not know what is correct!
we cannot check directly if it is correct
reliability: Are human annotators consistent with one another? i.e. do they consistently make the same annotations?
assumption: high reliability implies validity
Disagreement
disagreement can stem from the task (difficult, complex, …) or from the annotator (tiredness, laziness, subjectivity, malice, gaming, …)
Inter-annotator agreement
How can we measure reliability?
use multiple annotators and measure inter-annotator agreement
e.g. two annotators, A and B, each judge six items: Is this an apple?
A     B     Agree?
yes   yes   yes
no    yes   no
yes   no    no
yes   yes   yes
no    no    yes
no    no    yes
observed agreement: 4 out of 6 = 0.6666… ≈ 67%
Is 67% good enough? What is good enough?
Inter–annotator agreement
the agreement measure must be corrected for chance agreement
Ao = observed (percentage) agreement
Ae = expected agreement by chance
chance–corrected agreement: κ = (Ao − Ae) / (1 − Ae)
perfect agreement: Ao = 1, so κ = 1
chance agreement: Ao = Ae, so κ = 0
perfect disagreement: Ao = 0, so κ = −Ae / (1 − Ae), i.e. κ is negative
Contingency tables
the same six judgements can be summarised in a contingency table of counts (annotator A in rows, annotator B in columns):
A \ B   yes   no
yes      2     1
no       1     2
contingency table of proportions: divide each count by the total number of items, giving 2/6, 1/6, 1/6 and 2/6
Chance agreement
agreement purely by chance
suppose annotator decisions are like coin tosses, e.g. for: Is this an apple?
P(randomly choosing yes) = 0.5
P(randomly choosing no) = 0.5
P(both choosing yes) = 0.5 × 0.5 = 0.25
P(both choosing no) = 0.5 × 0.5 = 0.25
P(both agree) = 0.25 + 0.25 = 0.5 = 50%
the observed 67% agreement is better than the 50% expected by chance
Calculating agreement
expected chance agreement: Ae = 0.5
observed agreement: Ao ≈ 0.67
Cohen’s kappa coefficient: κ = (Ao − Ae) / (1 − Ae) = (0.67 − 0.5) / (1 − 0.5) ≈ 0.33
Cohen’s kappa is a statistical measure of inter–annotator agreement
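a compact Python sketch of this calculation, using the six judgements as reconstructed above (the exact pairs are an assumption on my part); here the expected agreement is computed from each annotator's marginal label proportions.

```python
# Cohen's kappa for two annotators over the same set of items.
from collections import Counter

def cohen_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # expected agreement by chance, from each annotator's marginal proportions
    pa, pb = Counter(a), Counter(b)
    expected = sum((pa[c] / n) * (pb[c] / n) for c in set(a) | set(b))
    return (observed - expected) / (1 - expected)

annotator_a = ["yes", "no", "yes", "yes", "no", "no"]
annotator_b = ["yes", "yes", "no", "yes", "no", "no"]
print(cohen_kappa(annotator_a, annotator_b))   # Ao = 0.667, Ae = 0.5 -> kappa ~ 0.33
```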
What agreement is good enough?
Landis and Koch (1977)
Krippendorff (1980)
Green (1997)
Group exercise
How can Cohen’s kappa coefficient be used in the context of crowdsourcing annotations for supervised machine learning?
homework: Calculate Cohen’s kappa coefficient for this example →
Evaluation
true positive: an instance is correctly predicted to belong to the given class
true negative: an instance is correctly predicted not to belong to the given class
false positive: an instance is incorrectly predicted to belong to the given class
false negative: an instance is incorrectly predicted not to belong to the given class
i.e. there are two types of errors: FP and FN
Evaluation
precision: the proportion of predicted positives that are actually positive, i.e. precision = TP / (TP + FP)
recall: the proportion of actual positives that were predicted positive, i.e. recall = TP / (TP + FN)
Evaluation
F–measure: a combined measure that assesses the precision/recall trade–off using their harmonic mean:
F = 2 × precision × recall / (precision + recall)
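a tiny Python sketch of these measures; the counts below are invented purely for illustration.

```python
# Precision, recall and F-measure from the confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 * p * r / (p + r)      # harmonic mean of precision and recall

tp, fp, fn = 8, 2, 4                # illustrative counts
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_measure(p, r))        # 0.8  0.667  0.727
```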
Which algorithm?
supervised machine learning is about classifying an object based on past experience (training data)
we learnt about k–NN, inducing decision trees, naïve Bayesian classification, SVM
… and there are others, e.g. neural networks
Which algorithm? No free lunch!
the no free lunch theorem (David H. Wolpert, 1996):
any two algorithms are equivalent when their performance is averaged across all possible problems
a general–purpose universal learning strategy is impossible
one strategy can outperform another only if it is specialised to the structure of the specific problem under consideration
Weka
data mining software in Java
http://www.cs.waikato.ac.nz/ml/weka/
a collection of machine learning algorithms for data mining tasks
the algorithms can either be applied directly to a dataset or called from your own Java code
open source software issued under the GNU General Public License
Cross–validation
a model validation technique for assessing how well the results of a model will generalise to an independent data set
it estimates how accurately a predictive model will perform in practice
e.g. in k–fold cross–validation, the labelled data is split into k folds; each fold is used once for testing while the remaining k−1 folds are used for training, and the results are averaged
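a hedged sketch of k-fold cross-validation follows; `train` and `classify` stand for any supervised learner (for example, the naive Bayes sketch earlier), and the parity demo data with its trivial majority-class learner are invented for illustration.

```python
# Minimal k-fold cross-validation: `records` is a list of (attributes, label)
# pairs; `train` builds a model from labelled records and
# `classify(model, attributes)` predicts a label.
import random
from collections import Counter

def k_fold_accuracy(records, k, train, classify):
    records = records[:]               # copy, then shuffle to randomise the folds
    random.shuffle(records)
    folds = [records[i::k] for i in range(k)]
    accuracies = []
    for i, test_fold in enumerate(folds):
        train_records = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train(train_records)
        correct = sum(classify(model, attrs) == label for attrs, label in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k         # average accuracy over the k rounds

# demo with a trivial "majority class" learner on invented data
data = [({"x": i % 2}, "even" if i % 2 == 0 else "odd") for i in range(20)]
majority_train = lambda recs: Counter(label for _, label in recs).most_common(1)[0][0]
majority_classify = lambda model, attrs: model
print(k_fold_accuracy(data, 5, majority_train, majority_classify))  # roughly 0.5
```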
Supervised machine learning
advantages:
data driven, so often more accurate than human–crafted rules
humans are often incapable of expressing what they know (e.g. English grammar rules), but can easily classify examples
disadvantages:
need a lot of labelled data
lack of explanatory power
in the era of Web 2.0, crowdsourcing can address this problem
Summary
this lecture covered supervised classification: k–NN, decision tree induction, naïve Bayesian classification and support vector machines, along with annotation and inter–annotator agreement (Cohen’s kappa), evaluation measures (precision, recall, F–measure), the no free lunch theorem, Weka and cross–validation