
Introduction to data mining
CMT207 Information modelling & database systems
Statistical modelling
• statisticians were the first to use the term data mining
• statisticians view data mining as the construction of a statistical model
• an underlying distribution from which the visible data is drawn
• e.g. suppose the data is a set of numbers with a normal distribution
• the mean and standard deviation would become the model of the data
Machine learning
• some say that data mining = machine learning
• some data mining uses algorithms from machine learning
• supervised machine learning uses data as a training set to train an algorithm, e.g. Bayes nets, support vector machines, etc.
• a good approach when we have little idea of what we are looking for in the data
• e.g. it is not very clear what it is about movies that makes certain viewers like or dislike them
• … but when using a sample of their responses, machine learning algorithms have proved quite successful in predicting movie ratings
Computational approaches to modelling
• more recently, computer scientists have looked at data mining as an algorithmic problem
• the model of the data is simply the answer to a complex query about it
• there are many different approaches, but most can be described as either:
• summarising the data succinctly and approximately, or
• extracting the most prominent features of the data and ignoring the rest
Lecture
• in this lecture, we begin with the essence of data mining
• we cover Bonferroni’s principle, which is a warning about overusing the ability to mine data
Data mining
• data mining is the discovery of “models” for data
• What is a “model”?
• a “model” can be interpreted in several different ways:
• statistical modelling
• machine learning
• computational approaches to modelling
• summarisation
• feature extraction
Summarisation
• one of the most interesting forms of summarisation is the PageRank idea
• in this form of Web mining, the entire complex structure of the Web is summarised by a single number for each page
• a remarkable property of this ranking is that it reflects very well the “importance” of the page
• it relates to the degree to which typical searchers would want to see a page returned as an answer to their search query
Bonferroni’s principle
• a theorem in statistics known as the Bonferroni correction gives a statistically sound way to avoid treating random occurrences as if they were real
• calculate the expected number of occurrences of the events you are looking for under the assumption that the data is random
• if this number is significantly larger than the number of real instances you hope to find, then you must expect that most of the occurrences are bogus
Feature extraction
• look for the most extreme examples of a phenomenon and represent the data by these examples
• two important types of feature extraction:
1. frequent itemsets – looking for small sets of items that frequently appear together, e.g. hamburger and ketchup
• these sets are the characterisation of the data
2. similar items – often, data looks like a collection of sets, e.g. customers can be viewed as the set of items they bought
• the objective is to find pairs of sets that have a relatively large fraction of their elements in common
• e.g. Amazon can recommend products that many “similar” customers have bought
Example
• trying to identify people who cheat on their spouses
• we know that the percentage of those who cheat on their spouses is 5%
• you decide that people who go out with co-workers more than three times a month are cheating on their spouses
• your method discovers that 20% of people qualify as cheaters
• 20% >> 5%, therefore, in the very best case only 1/4 of those flagged will actually be cheaters (true positives), and the rest will be bogus (false positives)
Bonferroni’s principle
• a warning against overzealous use of data mining
• suppose you look for occurrences of certain events in a dataset
• even if the data is completely random, you can still expect to find such occurrences
• the number of these occurrences will grow as the dataset grows in size
• … but these occurrences may be “bogus”, i.e. random with no particular cause
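• as a minimal sketch (not from the slides), the arithmetic behind the cheating example can be written out in Python; the 5% base rate and 20% flag rate come from the example above, while the function name is just illustrative:

def best_case_precision(base_rate, flagged_rate):
    # even if every real positive is flagged, genuine cases can make up
    # at most base_rate / flagged_rate of the flagged set
    if flagged_rate <= base_rate:
        return 1.0
    return base_rate / flagged_rate

print(best_case_precision(0.05, 0.20))  # 0.25 -> at best 1/4 are true positives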
Summary

Frequent itemsets
CMT207 Information modelling & database systems
Market basket
• a market basket (shopping cart) is brought to the register for checkout
• it contains items, i.e. different products that the store sells
• a major chain might sell 100,000 different items and collect data about millions of market baskets
Lecture
• in the previous lecture we introduced three approaches to data mining
• one of these was a computational approach, which includes:
• summarisation
• feature extraction
• we learnt that mining of frequent itemsets is an important type of feature extraction
• in this lecture we will learn how to find frequent itemsets
Itemset
• itemset = a set of items
• frequent itemset = an itemset that appears in many market baskets
• a retailer can learn what is commonly bought together
• an opportunity to do some clever marketing
Market–basket data model
Example: diapers & beer
• one chain store analysed their basket data and discovered that people who buy diapers are unusually likely to buy beer
• one would hardly expect these two items to be related
• explanation:
• if you buy diapers, you probably have a baby at home
• if you have a baby, then you are unlikely to be drinking at a pub
• instead, you are more likely to bring beer home
• marketing ploy: put diapers on sale and raise the price of beer

Other applications
• items = words
• baskets = documents, e.g. Web pages, blogs, tweets
• a basket (document) contains those items (words) that are present in the document
• we would hope to find among the frequent pairs some pairs of words that represent a joint concept (relationship)
• e.g. a pair like {Brad, Angelina} appears with high frequency
• Brad + Angelina = Brangelina
Example
• documents returned by googling cat and dog:
1. {cat, and, dog, bites}
2. {yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}
3. {cat, killer, likely, is, a, big, dog}
4. {professional, free, advice, on, dog, training, puppy, training}
5. {cat, and, kitten, training, and, behaviour}
6. {dog, &, cat, provides, dog, training, in, Eugene, Oregon}
7. {dog, and, cat, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}
8. {shop, for, your, show, dog, grooming, and, pet, supplies}
The market–basket data model
• used to describe a common form of many–to–many relationship between two types of objects:
• on the one hand, we have items
• on the other, we have baskets (aka transactions)
• each basket consists of a set of items (an itemset)
• we assume that:
• the number of items in a basket is small
• the number of baskets is very large
Support
• the empty set ∅ is a subset of any set, therefore its support is 8
• we shall not concern ourselves with the empty set, since it tells us nothing
• next:
1. singletons: {cat}, {and}, {dog}, {bites}, {yahoo}, {news}, {claims}, {a}, {mated}, {with}, {produced}, {viable}, {offspring}, …
2. doubletons: {cat, and}, {and, dog}, {cat, dog}, {dog, bites}, {cat, bites}, {yahoo, news}, …
3. triples, etc.
Frequent itemsets
• intuitively, a set of items that appears in many baskets is said to be frequent
• How many is many? 10, 100, 1K, 1M?
• formally:
• s is a number called the support threshold
• I is a set of items
• support for I = the number of baskets that contain it as a subset
• we say that I is frequent if its support ≥ s
Example
• baskets (support threshold s = 3):
1. {cat, and, dog, bites}
2. {yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}
3. {cat, killer, likely, is, a, big, dog}
4. {professional, free, advice, on, dog, training, puppy, training}
5. {cat, and, kitten, training, and, behaviour}
6. {dog, &, cat, provides, dog, training, in, Eugene, Oregon}
7. {dog, and, cat, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}
8. {shop, for, your, show, dog, grooming, and, pet, supplies}

  itemset       support
  {cat}         6
  {dog}         7
  {and}         5
  {a}           3
  {training}    3
  {for}         2
  {is}          2

• frequent itemsets (support ≥ 3): {cat}, {dog}, {and}, {a}, {training}
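• the support counts in the table above can be reproduced with a short Python sketch (illustrative only; the baskets are copied from the example):

baskets = [
    {"cat", "and", "dog", "bites"},
    {"yahoo", "news", "claims", "a", "cat", "mated", "with", "dog", "and", "produced", "viable", "offspring"},
    {"cat", "killer", "likely", "is", "a", "big", "dog"},
    {"professional", "free", "advice", "on", "dog", "training", "puppy"},
    {"cat", "and", "kitten", "training", "behaviour"},
    {"dog", "&", "cat", "provides", "training", "in", "Eugene", "Oregon"},
    {"dog", "and", "cat", "is", "a", "slang", "term", "used", "by", "police", "officers", "for", "a", "male–female", "relationship"},
    {"shop", "for", "your", "show", "dog", "grooming", "and", "pet", "supplies"},
]

def support(itemset, baskets):
    # number of baskets that contain the itemset as a subset
    return sum(1 for b in baskets if set(itemset) <= b)

s = 3  # support threshold
for item in ["cat", "dog", "and", "a", "training", "for", "is"]:
    print(item, support({item}, baskets), support({item}, baskets) >= s)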

Example continued
• now, let us look at the doubletons
• a doubleton cannot be frequent unless both items in the set are frequent by themselves
• therefore, we only need to look at combinations of frequent singletons
Example continued
• candidates for frequent triples:
• {cat, dog, and} in baskets (1), (2) and (7)
• {cat, dog, a} in baskets (2), (3) and (7)
• support ≥ 3, so both are frequent triples
• there can be no frequent quadruples or larger sets
• this approach will be summarised later as the a–priori algorithm
Example
• counts of pairs of frequent singletons:

            {dog}  {and}  {a}  {training}
  {cat}       5      4     3       2
  {dog}              4     3       2
  {and}                    2       1
  {a}                              0

• baskets:
1. {cat, and, dog, bites}
2. {yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}
3. {cat, killer, likely, is, a, big, dog}
4. {professional, free, advice, on, dog, training, puppy, training}
5. {cat, and, kitten, training, and, behaviour}
6. {dog, &, cat, provides, dog, training, in, Eugene, Oregon}
7. {dog, and, cat, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}
8. {shop, for, your, show, dog, grooming, and, pet, supplies}
Association rules
Example continued
• five frequent doubletons (support ≥ 3):
  {cat, dog}, {cat, and}, {cat, a}, {dog, and}, {dog, a}
• next, let us see if there are frequent triples
• in order to be a frequent triple, each pair of elements in the triple must be a frequent doubleton
• by combining frequent doubletons, we get two candidates for frequent triples:
• {cat, dog, and}
• {cat, dog, a}
Association rules
• information about frequent itemsets is often presented as a collection of if–then rules, called association rules
• the form of an association rule is I → i, where I is an itemset and i is an item
• interpretation: if I appears in some basket, then i is “likely” to appear in that basket as well
• we formalise the notion of “likely” by defining the confidence of the rule as the following ratio:
  confidence(I → i) = support(I ∪ {i}) / support(I)
• e.g. confidence({cat, dog} → and) = support({cat, dog, and}) / support({cat, dog}) = 3 / 5
Confidence
• confidence of the association rule I → i: confidence(I → i) = support(I ∪ {i}) / support(I)
• confidence can be useful, provided the support for the left side (I) is fairly large, e.g.
• {hotdog} → mustard
• there is nothing “unusual” about people buying hot dogs and mustard together
• but as long as we know that many people buy hot dogs, and that many people who buy hot dogs also buy mustard, we can use this rule to inform sales strategy
Finding rules with high confidence
• having already calculated the support of both I and I \ {i}, their ratio is the confidence of the rule I \ {i} → i:
  confidence(I \ {i} → i) = support(I) / support(I \ {i})
• we can now focus on those rules with high confidence
• assumption: there are not too many frequent itemsets and thus not too many high–confidence association rules
• the reason is that such association rules should be acted upon
• e.g. if we give a store manager a million association rules, they cannot even read them, let alone act on them
• the support threshold should be adjusted so that we do not get too many frequent itemsets
Association rules with high confidence
• we want the confidence of the rule to be reasonably high, e.g. 50%, or else the rule has little practical effect
• confidence(I → i) ≥ 0.5
• support(I ∪ {i}) / support(I) ≥ 0.5
• support(I ∪ {i}) ≥ 0.5 × support(I)
• if I has high support, then I ∪ {i} will also have fairly high support
• finding association rules with high confidence is not much harder than finding frequent itemsets
• assume that it is possible to find those frequent itemsets whose support ≥ threshold s
A–priori algorithm
Finding rules with high confidence
• suppose we have found all itemsets with support ≥ threshold s
• within them, we can find all association rules that have both high support and high confidence
• let I be a frequent itemset of n items
• there are n possible association rules involving I:
• I \ {i} → i for each item i in I
• I is frequent and I \ {i} is at least as frequent
• therefore, I \ {i} is also a frequent itemset
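• a small sketch of this rule-generation step (illustrative Python only; the supports are taken from the running cat/dog example):

def rules_from_itemset(itemset, support):
    # generate rules I \ {i} -> i with their confidence;
    # `support` maps frozensets to support counts for the itemset and its subsets
    itemset = frozenset(itemset)
    for i in itemset:
        left = itemset - {i}
        yield left, i, support[itemset] / support[left]

support = {
    frozenset({"cat", "dog", "and"}): 3,
    frozenset({"cat", "dog"}): 5,
    frozenset({"cat", "and"}): 4,
    frozenset({"dog", "and"}): 4,
}
for left, right, conf in rules_from_itemset({"cat", "dog", "and"}, support):
    print(set(left), "->", right, round(conf, 2))  # e.g. {cat, dog} -> and : 0.6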
Representation of market–basket data
• we assume that market–basket data is stored in a file basket by basket, e.g.
  {23,456,1001}{3,18,92,145}{…
• the character { begins a basket and the character } ends a basket
• the items in a basket are represented by integers
• … and are separated by commas
Finding frequent pairs
• let us concentrate on finding frequent pairs
• if we have enough main memory to count all pairs, then it is simple to read the file of baskets in a single pass
• for each basket, we use a double loop to generate all pairs
• each time we generate a pair, we add 1 to its count
• at the end, we examine all pairs to see which have counts ≥ support threshold s
• these are the frequent pairs
A note on the power law
• many types of data follow a power law distribution, which is characterised by:
1. a very small number of events that have a very high probability of appearing, and
2. a very large number of events that have a very low probability of appearing
• e.g. few words (e.g. the) with enormously high probability vs. many words (e.g. database) with low probability of appearing in a randomly chosen sentence
• represented by a continuously decreasing curve
Finding frequent pairs
• however, a single pass approach will fail if there are too many pairs to count them all in main memory
• the a–priori algorithm is designed to reduce the number of pairs that must be counted
• … but at the expense of performing two passes over the data, rather than a single pass
Between the passes of a–priori
• after the first pass, we examine the item counts to determine which ones are frequent singletons
• because of the power law, we do not usually expect many singletons to be frequent
• anyway, by setting the support threshold s sufficiently high, we can ensure not to get too many frequent sets
• a typical support threshold s would be 1% of the baskets
• for the second pass of a–priori, we create a new numbering from 1 to m for the frequent items
The first pass of a–priori
• in the first pass, we create an array of counts
• the i–th element counts the occurrences of the item numbered i
• initially, the counts for all the items are set to 0
• as we read baskets, we add 1 to the corresponding element of the array
The second pass of a–priori
• a pair cannot be frequent unless both its members are frequent
• so, we count all pairs that consist of two frequent items
• the space required is m² bytes, e.g. an m × m matrix of counts
• the second pass steps:
1. for each basket, look in the frequent–items table to see which of its items are frequent
2. in a double loop, generate all pairs of frequent items in that basket
3. for each such pair, add 1 to its count in the data structure used to store counts
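• the two passes for frequent pairs can be sketched in Python as follows (illustrative only, not the course's reference implementation; `baskets` is assumed to be an iterable of item sets and `s` the support threshold):

from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # pass 1: count individual items and keep the frequent singletons
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # pass 2: count only those pairs whose members are both frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= s}

# with the cat/dog baskets and s = 3 this returns pairs such as ('cat', 'dog')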

between the sets compared
• this notion of similarity is called Jaccard similarity
• we will then examine some of the uses of finding similar sets, e.g.
• finding textually similar documents
• collaborative filtering
Document similarity
• an important class of problems that Jaccard similarity addresses well is that of finding textually similar documents in a large corpus such as the Web
• documents are represented as “bags” of words and we compare documents by measuring the overlap of their bag representations
• applications:
• plagiarism
• mirror pages
• news aggregation
Similarity and distance
CMT207 Information modelling & database systems
Jaccard similarity
• the Jaccard similarity of sets A and B is defined as the ratio of the size of the intersection of A and B to the size of their union, i.e.
  sim(A, B) = |A ∩ B| / |A ∪ B|
• e.g. sim(A, B) = 3 / (2 + 3 + 3) = 3/8 = 0.375
• note that sim(A, B) ∈ [0, 1]
• sim(A, B) = 0 if A ∩ B = ∅
• sim(A, B) = 1 if A = B
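• a one-function Python sketch of Jaccard similarity over sets (illustrative only):

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; returns 1.0 for two empty sets
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# 3 shared elements and 8 elements in the union -> 0.375, as in the example above
print(jaccard({1, 2, 3, 4, 5}, {3, 4, 5, 6, 7, 8}))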
Lecture
• a fundamental data mining problem is to find “similar” items in the data
• initially, we will focus on a particular notion of similarity based on the relative size of the intersection
Similarity
Plagiarism
• finding plagiarised documents tests our ability to find textual similarity
• the plagiariser may…
• copy some parts of a document
• alter a few words
• alter the order in which sentences appear
• yet the resulting document may still contain >50% of the original material
• comparing documents character by character will not detect sophisticated plagiarism, but Jaccard similarity can

Mirror pages
• it is common for an important or popular Web site to be duplicated at a number of hosts to improve its availability
• these mirror sites are quite similar, but are rarely identical
• e.g. they might contain information associated with their particular host
• it is important to be able to detect mirror pages, because search engines should avoid showing nearly identical pages within the first page of results
News aggregation
• the same news gets reported by different publishers
• each article is somewhat different
• news aggregators, such as Google News, try to find all versions in order to group them together
• this requires finding Web pages that are textually similar, although not identical
Online purchases
• Amazon has millions of customers and sells millions of products
• its database records which products have been bought by which customers
• we can say that two customers are similar if their sets of purchased products have a high Jaccard similarity
• likewise, two products that have sets of purchasers with high Jaccard similarity could be considered similar
Movie ratings
• Netflix records which movies each of its customers watched, together with their ratings
• movies are similar if they were watched or rated highly by many of the same customers
• customers are similar if they watched or rated highly many of the same movies
• examples…
Collaborative filtering
• the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating)
• the underlying assumption of collaborative filtering is that those who agreed in the past tend to agree again in the future, e.g.
• online purchases
• movie ratings
• …

Distance vs. similarity
• typically, given a similarity measure, one can “revert” it to serve as the distance measure and vice versa
• conversions may differ, e.g. if d is a distance measure, then one can use:
  … or …
• if sim is a similarity measure that ranges between 0 and 1, then the corresponding distance measure can be defined as:
  d = 1 – sim
Distance
Distance axioms
• formally, distance is a measure that satisfies the following conditions:
1. d(x, y) ≥ 0                     non–negativity
2. d(x, y) = 0 iff x = y           coincidence
3. d(x, y) = d(y, x)               symmetry
4. d(x, z) ≤ d(x, y) + d(y, z)     triangle inequality
• these conditions express intuitive notions about the concept of distance
Distance vs. similarity
• similarity is a measure of how close to each other two instances are
• the closer the instances are to each other, the larger the similarity value
• distance is also a measure of how close to each other two instances are
• the closer the instances are to each other, the smaller the distance value
Euclidean distances
• the most familiar distance measure is the one we normally think of as “distance”
• an n–dimensional Euclidean space is one where points are vectors of n real numbers
• e.g. a point in a 3D space is represented as (x, y, z)

Pythagorean theorem
• Exercise: d(A, B) = ?
• [figure: points A = (x1, y1) and B = (x2, y2) in the plane, with legs of length x1 – x2 and y1 – y2]
• using the Pythagorean theorem:
  d(A, B)² = (x1 – x2)² + (y1 – y2)²
  d(A, B) = √((x1 – x2)² + (y1 – y2)²)
Euclidean distance
• Euclidean distance in an n–dimensional space between X = (x1, x2, …, xn) and Y = (y1, y2, …, yn):
1. square the distance in each dimension
2. sum the squares
3. take the square root
  d(X, Y) = √((x1 – y1)² + (x2 – y2)² + … + (xn – yn)²)
Cosine similarity
• the cosine similarity between two vectors in a Euclidean space is a measure that calculates the cosine of the angle between them
• this metric is a measurement of orientation and not magnitude

  Value         Meaning
  1             exactly the same
  0             orthogonal
  −1            exactly opposite
  in between    intermediate similarity
Cosine similarity
• we can calculate the cosine similarity as follows:
  sim(A, B) = (A · B) / (‖A‖ ‖B‖)
• cosine similarity ranges from –1 to 1
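• both measures can be sketched for plain Python lists of numbers (illustrative only, no external libraries assumed):

import math

def euclidean(x, y):
    # square per dimension, sum, square root
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    # cosine of the angle between two vectors: dot(x, y) / (|x| * |y|)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(euclidean([0, 0], [3, 4]))          # 5.0
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 -> orthogonal vectors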

Cosine distance
• cosine distance is a term often used for the measure defined by the following formula:
  d(A, B) = 1 – sim(A, B)
• it is important to note that this is not a proper distance measure!
• it does not satisfy the triangle inequality property
• it violates the coincidence property
Edit distance
• informally, edit distance is defined as the minimal number (or cost) of changes needed to transform one string into the other
• these changes may include the following edit operations:
1. insertion of a single character
2. deletion of a single character
3. replacement (substitution) of two corresponding characters in the two strings being compared
4. transposition (reversal or swap) of two adjacent characters in one of the strings
Edit distance
Edit operations
• insertion:      … ac …   →  … abc …
• deletion:       … abc …  →  … ac …
• replacement:    … abc …  →  … adc …
• transposition:  … abc …  →  … acb …
Edit distance
• edit distance has been widely applied in natural language processing for approximate string matching, where:
1. the distance between identical strings is equal to zero
2. the distance increases as the strings get more dissimilar with respect to:
• the symbols they contain, and
• the order in which they appear
Applications
• successfully utilised in NLP applications to deal with:
• alternate spellings
• misspellings
• the use of white spaces as a means of formatting
• UPPER– and lower–case letters
• other orthographic variations
• e.g. 80% of spelling mistakes can be identified and corrected automatically by considering a single omission, insertion, substitution or reversal

script or edit path)
• the cost of an alignment is calculated by summing the costs of the individual edit operations it is comprised of
• when all edit operations have the same cost, the cost of an alignment is equivalent to the total number of operations in the alignment
Applications
• apart from NLP, the most popular application area of edit distance is molecular biology
• edit distance is used to compare DNA sequences in order to infer information about:
• common ancestry
• functional equivalence
• possible mutations
• etc.
Variants
• depending on the types of edit operations allowed, a number of specialised variants of edit distance have been identified
• Hamming distance allows only replacement
• the longest common subsequence problem allows only insertion and deletion, both at the same cost
• Levenshtein distance allows insertion, deletion and replacement of individual characters, where individual edit operations may have different costs
• Damerau distance extends Levenshtein distance by permitting the transposition of two adjacent characters
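• as a concrete companion to these variants (and to the Wagner–Fischer pseudocode further below), here is a minimal Python sketch of simple Levenshtein distance with unit costs; it is illustrative only:

def levenshtein(x, y):
    m, n = len(x), len(y)
    # ed[i][j] = distance between x[:i] and y[:j]
    ed = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        ed[i][0] = i          # distance to an empty string: delete everything
    for j in range(n + 1):
        ed[0][j] = j          # distance from an empty string: insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                ed[i][j] = ed[i - 1][j - 1]          # match, no operation
            else:
                ed[i][j] = 1 + min(ed[i - 1][j],     # delete
                                   ed[i][j - 1],     # insert
                                   ed[i - 1][j - 1]) # replace
    return ed[m][n]

print(levenshtein("weather", "sweeter"))  # 3, as in the example below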
Notation and terminology
• let x = x1 … xm and y = y1 … yn be two strings of lengths m and n respectively, where 0 < m ≤ n
• a sequence of edit operations transforming x into y is referred to as an alignment (also edit sequence, edit script or edit path)
Levenshtein distance
• well suited for a number of practical applications
• most of the existing algorithms have been developed for the simple Levenshtein distance
• many of them can easily be adapted for:
• general Levenshtein distance, where different costs are used for different operations
• Damerau distance, where transposition is an allowed edit operation
• transposition is important in some applications such as text searching, where transpositions are typical typing errors
• note that a transposition could be simulated by using an insertion followed by a deletion, but the total cost would be different in that case!
Edit distance
• formally, the value of edit distance between x and y, ed(x, y), corresponds to the minimal alignment cost over all possible alignments for x and y
• when all edit operations incur the same cost, edit distance is referred to as simple edit distance
• simple edit distance is a distance measure, i.e. ed(x, y) ≥ 0, ed(x, y) = 0 iff x = y, ed(x, y) = ed(y, x), ed(x, z) ≤ ed(x, y) + ed(y, z)
• general edit distance permits different costs for different operations or even symbols
• the choice of the operation costs influences the “meaning” of the corresponding alignments, and thus they depend on a specific application
Dynamic programming
• a class of algorithms based on the idea of:
• breaking a problem down into sub–problems so that optimal solutions can be obtained for the sub–problems
• combining sub–solutions to produce an optimal solution to the overall problem
• the same idea is applied incrementally to sub–problems
• by saving and re–using the results obtained for the sub–problems, unnecessary re–computation is avoided for recurring sub–problems, thus facilitating the computation of the overall solution
Wagner–Fischer algorithm
• a dynamic programming approach to the computation of Levenshtein distance, which relies on the following reasoning:
• at each step of an alignment, i.e. after aligning two leading substrings of the two strings, there are only three possibilities:
1. delete the next symbol from the first string (delete)
2. delete the next symbol from the second string (insert)
3. match the next symbol in the first string to the next symbol in the second string (exact match, or replace otherwise)
Wagner–Fischer algorithm
• for the cost C(i, j) of aligning the leading substrings x1 … xi and y1 … yj, the cost of their alignment is calculated as follows:
• 1 ≤ i ≤ m, j = 0:        C(i, 0) = C(i – 1, 0) + IC(xi)
• i = 0, 1 ≤ j ≤ n:        C(0, j) = C(0, j – 1) + DC(yj)
• 1 ≤ i ≤ m, 1 ≤ j ≤ n:    C(i, j) = min( C(i – 1, j) + IC(xi),
                                          C(i, j – 1) + DC(yj),
                                          C(i – 1, j – 1) + RC(xi, yj) )
• where C(0, 0) = 0 and IC, DC and RC are the costs of the insert, delete and replace operations
Wagner–Fischer algorithm
• if the cost values are represented by a cost matrix, then the matrix needs to be filled so that the values needed for the calculation of C(i, j) are available:
• C(i – 1, j – 1)   upper–left
• C(i – 1, j)       upper
• C(i, j – 1)       left
• it suffices to fill the cost matrix row–wise left–to–right, column–wise top–to–bottom, or diagonally upper–left to lower–right
• the edit distance between x and y is then obtained as C(m, n)
Wagner–Fischer algorithm

let x[1..m], y[1..n] be two arrays of char
let ed[0..m, 0..n] be a 2D array of int

// distance to an empty string
for i in [0..m] ed[i, 0] = i;
for j in [0..n] ed[0, j] = j;

for j in [1..n]
  for i in [1..m]
    if x[i] = y[j] then
      ed[i, j] = ed[i–1, j–1];   // match, so no operation required
    else
      ed[i, j] = minimum of (
        ed[i–1, j] + 1,          // delete
        ed[i, j–1] + 1,          // insert
        ed[i–1, j–1] + 1         // replace
      );
return ed[m, n];

Example
• a cost matrix calculated for the strings weather and sweeter, providing the value 3 for their edit distance
• extracting an optimal alignment from the cost matrix: the arrows in the matrix indicate insert from y into x, replace or match, and delete from x
• alignment:
  – w e a t h e r
  s w e e t – e r
String similarity
• two strings are regarded as similar if their edit distance is lower than a certain threshold: ed(x, y) ≤ t → x ∼ y
• an absolute threshold t does not take into account the lengths m and n (m ≤ n) of the strings x and y
• the same threshold should not be used for very long strings and very short ones, e.g. doppelgänger vs. doppelgaenger and hot vs. dog
• a relative threshold r = t × m proportional to the length of the shorter string is suggested instead
• otherwise, short strings would be erroneously regarded as similar to non–similar long strings
• classes play an important role in how we analyse and describe the world
• in the context of understanding data, clusters are potential classes
Clustering
CMT207 Information modelling & database systems
Clustering
• solution: divide & conquer
• cluster analysis divides data into groups (clusters) that are meaningful, useful or both
Lecture
• in the previous lecture we learnt how to compare objects using distance and similarity measures
• in this lecture, we will learn about clustering, which is used to automatically group similar objects together
Clustering for understanding
• classes are conceptually meaningful groups of objects that share common characteristics
Information overload
• a difficulty a person may have when trying to deal with more information than they are able to process to make sensible decisions
• results in delaying making decisions or making the wrong decisions
• an increasing problem in the context of big data
Clustering for utility
• some clustering techniques characterise each cluster in terms of a cluster prototype
• a cluster prototype is a data object that is representative of the other objects in the cluster
• cluster prototypes can be used as the basis for a number of data analysis or data processing techniques
objects in other clusters
• the greater the similarity within a cluster and the greater the difference between clusters, the better the clustering
• unsupervised learning – no annotations/training required!
What is clustering?
How many clusters?
• 2, 4 or 6?
􏰁 notion of a cluster is ambiguous 􏰁 difficult to decide what constitutes a cluster 􏰁 the best definition of a cluster depends on the nature of data and desired outputs 10 Clustering 􏰁 clustering – a data mining technique that automatically groups objects with similar properties into clusters 􏰁 objects within a cluster should be similar (or related) to one another and different from (or unrelated to) the Hierarchical vs. partitional clustering 􏰁 two basic types of clustering: 1. hierarchical 2. non–hierarchical (partitional) 8 11 Clustering 􏰁 the goal is to: 1. minimise intra–cluster distances 2. maximise inter–cluster distances 9 Hierarchical clustering 􏰁 clusters are permitted to have subclusters 􏰁 a set of nested clusters that are organised in a tree 􏰁 each node (cluster) in the tree (except the leaves) is the union of its children (subclusters) 􏰁 the root of the tree contains all objects 12 2 3/11/2016 Partitional clustering 􏰁 partitional clustering is simply a division of the set of objects into non–overlapping subsets (clusters) 􏰁 note that hierarchical clustering can be viewed as a sequence of partitional clusterings 􏰁 a partitional clustering can be obtained by cutting the tree at a particular level Types of clusters 􏰁 clustering aims to find useful groups of objects 􏰁 usefulness is defined by the goals of data analysis 􏰁 there are several different notions of a cluster that prove useful in practice: 1. well–separated 2. prototype–based 3. graph–based 4. density–based 5. shared–property (conceptual) 6. described by an objective function 16 Complete vs. partial clustering 􏰁 a complete clustering assigns every object to a cluster 􏰁 a partial clustering does not 􏰁 the motivation for partial clustering is that some objects may not belong to well defined groups 􏰁 many times data objects may represent noise, outliers of uninteresting background 14 Well–separated clusters 􏰁 each object is closer (or more similar) to every object in the cluster than to any other object not in the cluster 􏰁 a threshold can be used to specify that all objects in a cluster must be sufficiently close to one another 􏰁 idealistic definition of a cluster 􏰁 satisfied only when the data contains natural clusters that are quite far from one another 􏰁 well–separated clusters do not need to be spherical, but can have any shape 17 Clusters Prototype–based clusters 􏰁 each object is closer (more similar) to the prototype that defines the cluster than the prototype of any other cluster 􏰁 for many types of data, the prototype can be regarded as the most central point 􏰁 in such instances, we commonly refer to prototype– based clusters as centre–based clusters 􏰁 such clusters tend to be spherical 18 3 3/11/2016 Graph–based clusters 􏰁 if the data is represented as a graph, where the nodes are objects and the links represent connections between objects, then a cluster can be defined as a connected component 􏰁 a group of objects that are connected to one another, but are not connected to objects outside of the group 􏰁 an important example of graph–based clusters are contiguity–based clusters, where two objects are connected if they are within a specified distance from each other 19 Clusters described by an objective function 􏰁 finds clusters that minimise or maximise an objective function 􏰁 enumerate all possible ways of dividing the points into clusters and evaluate them using the objective function (NP hard!) 
􏰁 can have global or local objectives 􏰁 hierarchical clustering algorithms typically have local objectives 􏰁 partitional clustering algorithms typically have global objectives 22 Density–based clusters 􏰁 a cluster is a dense region of objects that is surrounded by a region of low density 􏰁 a density–based definition of a cluster is often used when the clusters are irregular or intertwined and Clustering strategies when noise and outliers are present 􏰁 a contiguity–based definition of a cluster would not work well since the noise would tend to form connections between clusters 20 Shared–property (conceptual) clusters 􏰁 a group of objects that share some property 􏰁 encompasses all previous definitions of clusters 􏰁 ... but also covers new types of clusters 􏰁 points in a cluster share some general property that derives from the entire set of objects 􏰁 too sophisticated 􏰁 takes us into the area of pattern recognition 21 Hierarchical clustering 􏰁 hierarchical clustering = a method of cluster analysis which seeks to build a hierarchy of clusters 􏰁 two strategies: 1. 2. agglomerative (bottom-up approach) 􏰁 each instance starts separately in its own cluster 􏰁 pairs of clusters are merged recursively divisive (top-down approach) 􏰁 all instances start together in a single cluster 􏰁 clusters are split performed recursively 􏰁 the result is usually presented as a dendrogram 􏰁 ... may respond to a meaningful taxonomy 24 4 1. Compute the proximity matrix. 2. repeat 3. 4. 5. Merge the closest two clusters. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters. until Only one cluster remains. 3/11/2016 Hierarchical clustering nested cluster diagram dendrogram 25 Cluster proximity 􏰁 MIN defines cluster proximity as the proximity between the closest two points in different clusters 􏰁 this yields contiguity-based clusters 􏰁 MAX defines cluster proximity as the proximity between the farthest two points in different clusters 􏰁 alternative names: 􏰁 MIN = single link 􏰁 MAX = complete link MIN MAX 28 Agglomerative hierarchical clustering 􏰁 starting with individual points as clusters 􏰁 successively merge the two closest clusters until only one cluster remains 􏰁 basic algorithm: Cluster proximity 􏰁 the group average technique defines cluster proximity to be the average pair–wise proximities of all pairs of points from different clusters 26 29 Cluster proximity 􏰁 the key operation of the algorithm is the computation of the proximity between clusters 􏰁 the definition of cluster proximity differentiates the various agglomerative hierarchical techniques 􏰁 cluster proximity is typically defined with a particular type of cluster in mind 􏰁 e.g. many agglomerative hierarchical clustering techniques, such as MIN, MAX & group average are defined with a graph–based clusters in mind proximity? 27 Cluster proximity 􏰁 if, instead of a graph–based view, we take a prototype–based view... 
􏰁 each cluster is represented by a centroid 􏰁 proximity is commonly defined as the proximity between cluster centroids 30 5 2 1 1.00 0.90 2 0.90 1.00 3 0.10 0.70 4 0.65 0.60 5 0.20 Single link (MIN) hierarchical clustering 3/11/2016 Single–link (MIN) hierarchical clustering Single link (MIN) hierarchical clustering 􏰁 c and f are the two closest points 􏰁 5 clusters: {a}, {b}, {c, f}, {d}, {e} 34 Single link (MIN) hierarchical clustering 􏰁 cluster proximity is defined as the proximity between the closest two points in different clusters 􏰁 similarity matrix: 1 0.50 3 0.10 0.70 1.00 0.40 0.30 4 0.65 0.60 0.40 1.00 0.80 􏰁 {b} and {e} are the two closest clusters 􏰁 4 clusters: {a}, {b, e}, {c, f}, {d} 5 0.20 0.50 0.30 0.80 1.00 32 35 Single link (MIN) hierarchical clustering 􏰁 start with all points as singleton clusters: 􏰁 6 clusters: {a}, {b}, {c}, {d}, {e}, {f} 􏰁 at each step merge two clusters with two closest points 33 Single link (MIN) hierarchical clustering 􏰁 {c, f} and {b, e} are two closest clusters because of the proximity between b and c 􏰁 3 clusters: {a}, {b, e, c, f}, {d} 36 6 Single link (MIN) hierarchical clustering 􏰁 merge the only two remaining clusters 􏰁 1cluster:{a,b,c,d,e,f} 2 3 0.90 0.10 0.65 0.20 1.00 0.70 0.60 0.50 0.70 1.00 0.40 0.30 0.60 0.40 1.00 0.80 0.50 3/11/2016 Single link (MIN) hierarchical clustering 􏰁 disclosertocthanaistob 􏰁 2 clusters: {a}, {b, c, d, e, f} 37 Complete–link (MAX) hierarchical clustering Complete link (MAX) hierarchical clustering 􏰁 cluster proximity is defined as the proximity between the farthest two points in different clusters 􏰁 similarity matrix: 1 2 3 4 5 1 1.00 0.30 4 0.80 5 1.00 􏰁 sim({3}, {1, 2}) 􏰁 sim({3}, {4, 5}) 􏰁 sim({1, 2}, {4, 5}) = sim(3, 1) = 0.10 = sim(3, 5) = 0.30 = sim(1, 5) = 0.20 41 0.90 0.10 0.65 0.20 38 Strength/limitation of MIN 􏰁 strength: can handle 􏰁 limitation: sensitive to non–elliptical shapes noise and outliers 39 Complete link (MAX) hierarchical clustering 42 7 Complete link (MAX) hierarchical clustering 3/11/2016 Complete link (MAX) hierarchical clustering 43 Complete link (MAX) hierarchical clustering 46 Complete link (MAX) hierarchical clustering 44 47 Complete link (MAX) hierarchical clustering 45 Strength/limitation of MAX 􏰁 strength: less susceptible 􏰁 limitation: tends to to noise and outliers break large clusters 48 8 􏰁 similarity matrix: Hierarchical clustering: group average 3/11/2016 Hierarchical clustering: group average Hierarchical clustering: group average 52 Hierarchical clustering: group average 􏰁 cluster proximity is defined as the average of pair–wise proximity between points in the two clusters 1 2 3 4 5 1 1.00 0.90 0.10 0.65 0.20 2 0.90 1.00 0.70 0.60 0.50 3 0.10 0.70 1.00 0.40 0.30 4 0.65 0.60 0.40 1.00 0.80 5 0.20 0.50 0.30 0.80 1.00 50 53 Hierarchical clustering: group average 51 Hierarchical clustering: group average 54 9 3/11/2016 Hierarchical clustering: group average 55 Hierarchical clustering: Summary Hierarchical clustering: group average Comparison MIN MAX group average 56 Strength/limitation of group average 􏰁 group average is a compromise between MIN and MAX 􏰁 strength: less susceptible to noise and outliers 􏰁 limitation: biased towards spherical clusters 57 Time and space requirements 􏰁 n = number of points 􏰁 space: O(n2) 􏰁 it uses a proximity matrix n x n 􏰁 time: O(n3) 􏰁 there are n steps 􏰁 at each step the proximity matrix of size n x n needs to be searched and updated 􏰁 can be reduced to O(n2log(n)) in some approaches 60 10 3/11/2016 Limitations 􏰁 once a decision has been made 
to combine two clusters, it cannot be undone 􏰁 no objective function is directly optimised 􏰁 different schemes have problems with one or more of the following: 􏰁 sensitivity to noise and outliers 􏰁 difficulty handling different sized clusters and convex shapes 􏰁 breaking large clusters 61 K–means algorithm 1. Select k points as the initial centroids. 2. repeat 3. 4. 5. Form k clusters by assigning all points to the closest centroid. Recompute the centroid for each cluster. until The centroids don't change. 􏰁 initial centroids are often chosen randomly 􏰁 stopping condition is often changed to "until relatively few points change clusters" 64 K–means clustering K–means clustering 􏰁 works well when a given set is composed of a number of distinct classes 􏰁 the number of clusters k must be specified 􏰁 problem: How to choose k objectively? 􏰁 K-means will converge for most common similarity measures in the first few iterations 􏰁 less computationally intensive than hierarchical methods: O(n ⋅ k ⋅ i ⋅ a), where n, k, i and a are the numbers of points, clusters, iterations and attributes 􏰁 modest space requirements because only points and centroids are stored: O((n + k) ⋅ a) 65 Non-hierarchical clustering 􏰁 non-hierarchical clustering attempts to directly partition the dataset into a set of disjoint clusters 􏰁 k-means clustering aims to partition dataset into k clusters in which each instance belongs to the cluster with the nearest mean 􏰁 basic principles: 􏰁 each cluster is associated with its centre (centroid) 􏰁 each point is assigned to the cluster with the closest centroid 63 Two different k–means clusterings original points optimal clustering sub–optimal clustering 66 11 in the "right" way and sometimes they will not 􏰁 possible "solutions": 􏰁 multiple runs may help 􏰁 sample and use hierarchical clustering to determine initial centroids 􏰁 select most widely separated 􏰁 ... 68 goal of attaining: 1. cohesion – high intra-cluster similarity (instances within a cluster are similar) 2. separation – low inter-cluster similarity (instances from different clusters are dissimilar) 3/11/2016 Importance of choosing initial centroids 123 456 67 Evaluation Problems with choosing initial centroids 􏰁 if there are k "real" clusters then the chance of selecting one centroid from each cluster is small 􏰁 ... very small when k is large! 
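• the k–means loop described above can be sketched in a few lines of Python; this is an illustrative toy implementation (random initial centroids, Euclidean distance, plain tuples), not the course's reference code:

import math
import random

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)          # 1. select k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # 2. assign each point to the closest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [                          # 3. recompute the centroid of each cluster
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:            # 4. stop when the centroids no longer change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (0, 1), (8, 8), (9, 8), (8, 9)]
print(kmeans(points, k=2)[0])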
􏰁 sometimes the initial centroids will readjust themselves Evaluation 􏰁 evaluating clustering algorithms is complex because it is difficult to find an objective measure of quality of clusters 􏰁 typical objective functions in clustering formalise the 71 Limitations of k–means 􏰁 has problems when clusters are of differing: 􏰁 sizes 􏰁 densities 􏰁 non–spherical shapes 69 Graph–based view of cohesion and separation 􏰁 mathematically, cohesion and separation for a graph–based cluster can be expressed as follows: cohesion separation 72 12 3/11/2016 Prototype–based view of cohesion and separation 􏰁 mathematically, cohesion and separation for a prototype–based cluster can be expressed as follows: centroids cohesion separation 73 Cluster validity 􏰁 previous definitions of cohesion and separation give us some simple and well–defined measures of cluster validity 􏰁 they can be combined into an overall validity measure by using a weighted sum: 􏰁 the weights will vary depending on the cluster validity measure 􏰁 in some cases, the weights are simply 1 or the cluster sizes 74 75 13 􏰁 naïve Bayes 􏰁 support vector machines 􏰁 inter–annotator agreement 􏰁 evaluation: precision, recall, F-measure 􏰁 cross–validation 􏰁 platforms: Weka, Mechanical Turk, Crowdflower 􏰁 tumour cells as benign or malignant 􏰁 news stories as finance, weather, entertainment, sports, etc. 3/11/2016 Classification CMT207 Information modelling & database systems 1 Classification 􏰁 classification is used to map an object to one (or more) classes from a predefined set of classes 􏰁 class = a set of objects having some property or attribute in common and differentiated from others by kind, type or quality 􏰁 class – also referred to as a type, category, kind, sort... 4 Lecture 􏰁 rule-based systems: 'if-then' (deduction) 􏰁 supervised machine learning (induction) 􏰁 k–nearest neighbour 􏰁 inducing decision trees Examples 􏰁 classifying ... 􏰁 e–mail as spam or not spam 􏰁 credit card transactions as legitimate or fraudulent 2 5 Dealing with complex information 􏰁 'divide' & conquer 􏰁 'compartmentalise' 􏰁 grouping similar or related things together 􏰁 associating information 􏰁 classifying objects into different categories 3 Classification task 􏰁 the input data for a classification task is a collection of records 􏰁 each record, also known as an instance or example, is characterised by a tuple (x, y), where 􏰁 x is the set of attributes 􏰁 y is a special attribute designated as the class label 􏰁 classification is the task of mapping an input attribute set x into its class label y, i.e. (x, ?) → (x, y) 􏰁 attributes can be discrete or continuous 􏰁 the class label must be a discrete attribute 1 Classification task 􏰁 formally, classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y, i.e. f(x) = y 􏰁 the target function is also known informally as a classification model input: attribute set (x1, x2, ... , xn) output: class label y Rule–based classification 􏰁 deductive reasoning: applying a series of rules to reach a conclusion 􏰁 modus ponens: 􏰁 IfAthenB. 􏰁A 􏰁 Therefore B. 􏰁 e.g. junk e-mail filtering rule: 􏰁 If (Subject contains 'viagra') then (e-mail is junk). 􏰁 Subject contains 'viagra' 􏰁 Therefore e-mail is junk. 11 3/11/2016 Example attributes class 7 Rule–based classification classification model 8 Classification 􏰁 when dealing with large volumes of information, we would like to take advantage of computers' processing power to classify objects efficiently 􏰁 Can this process be automated? 
Yes, various classification methods are used in data mining! 􏰁 classifier – an algorithm that implements classification 􏰁 two basic types of classification methods: 1. rule–based (top–down approach, deductive reasoning): specific conclusion is derived from general principles 2. machine learning (bottom–up approach, inductive reasoning): general conclusion is derived from specific instances 9 Example: E-mail filter rules 12 2 from a predefined set of labels 3/11/2016 Examples of classification problems 􏰁 text categorisation, e.g. spam filtering 􏰁 optical character recognition 􏰁 natural language processing, e.g. part-of-speech tagging 􏰁 medical diagnosis 􏰁 Quiz: give your own examples 13 Example 􏰁 a rule covers an instance x if the attributes of x satisfy the condition of the rule R1: if (Give Birth = no) & (Can Fly = yes), then bird R2: if (Give Birth = no) & (Live in Water = yes), then fish R3: if (Give Birth = yes) & (Blood Type = warm), then mammal R4: if (Give Birth = no) & (Can Fly = no), then reptile R5: if (Live in Water = sometimes), then amphibian 􏰁 rule R1 covers a hawk → bird 􏰁 rule R3 covers a grizzly bear → mammal 16 Rule–based systems 􏰁 model expert knowledge 􏰁 the concept of an expert system is this: the knowledge of an expert is encoded into the rule set 􏰁 rule–based systems are fairly simplistic, consisting of little more than a set of if–then statements 􏰁 a relatively simple model that can be adapted to any number of problems 14 Example R1: if (Give Birth = no) & (Can Fly = yes), then bird R2: if (Give Birth = no) & (Live in Water = yes), then fish R3: if (Give Birth = yes) & (Blood Type = warm), then mammal R4: if (Give Birth = no) & (Can Fly = no), then reptile R5: if (Live in Water = sometimes), then amphibian 􏰁 lemur triggers rule R3, so it is classified as a mammal 􏰁 turtle triggers both R4 and R5: reptile or amphibian 􏰁 dogfish shark triggers none of the rules 17 Example R1: if (Give Birth = no) & (Can Fly = yes), then bird R2: if (Give Birth = no) & (Live in Water = yes), then fish R3: if (Give Birth = yes) & (Blood Type = warm), then mammal R4: if (Give Birth = no) & (Can Fly = no), then reptile R5: if (Live in Water = sometimes), then amphibian 15 Rule–based classification systems 􏰁 each rule has a left–hand side & a ride–hand side 􏰁 the left–hand side contains information about certain facts, which must be true in order for the rule to potentially fire (execute) 􏰁 If >1 rules can fire, which one to chose?!?
• different heuristics can be applied for conflict resolution:
• first applicable
• random
• most specific
• least recently used
• best rule
Conflict resolution strategy
􏰁 first applicable: rules are ordered and the first applicable rule is chosen
􏰁 simple, allows control, predictable
􏰁 potential problem: an infinite loop
􏰁 random: a single random rule from the conflict set is
chosen
􏰁 not predictable
􏰁 most specific: the rule with the most conditions satisfied is chosen
19
Rule–based system
􏰁 How to convert expert knowledge into rules?
􏰁 Do such rules exists after all?
􏰁 rule-based systems are only feasible for problems for which the knowledge in the problem area can be written in the form of if–then rules
􏰁 … and for which this problem area is not large!
􏰁 problems: often very complex, rules can interact, inefficient, cannot learn (i.e. improve with experience), do not work well with subjective problems…
22
Conflict resolution strategy
􏰁 least recently used: each rule is accompanied by a time stamp, and the least recently used rule is chosen
􏰁 maximises the number of rules that are fired at least once
Supervised machine learning
􏰁 best rule: each rule is given a weight, which specifies how much it should be considered over the alternatives
􏰁 the rule with the most preferable outcome is chosen based on this weight
20
Rule–based systems
􏰁 the concept of an expert system: the knowledge of an expert is encoded into the rule set
􏰁 … but expert knowledge is often tacit knowledge, i.e. locked in people’s heads
􏰁 typical bottleneck in expert systems: knowledge elicitation
􏰁 knowledge elicitation = the process of explicating domain-specific knowledge underlying human performance
􏰁 explicit = fully & clearly expressed or demonstrated, leaving nothing merely implied
21
Machine learning
• machine learning – studies how to automatically learn to make accurate predictions based on past observations
• supervised machine learning:
• a human expert has determined into what classes an object may be categorised (classification scheme)
• a human expert has provided a training set: a set of sample objects with known classes (labels or annotations)
• [figure: example images labelled “apple” and “not an apple” used as a training set]

Training data – example
• the attributes gender, mask, cape, tie, ears and smokes are features; class is the label

  name       gender  mask  cape  tie  ears  smokes | class
  Batman     male    yes   yes   no   yes   no     | Good
  Robin      male    yes   yes   no   no    no     | Good
  Alfred     male    no    no    yes  no    no     | Good
  Penguin    male    no    no    yes  no    yes    | Bad
  Catwoman   female  yes   no    no   yes   no     | Bad
  Joker      male    no    no    no   no    no     | Bad

Supervised machine learning
1. training phase: the training set (labelled data) is used to automatically build a classification model
2. testing phase: the test set (also labelled data) is used to evaluate the classification model
3. the classification model is used to predict class(es) for new (unseen) examples
k–nearest neighbour (k–NN)
Basic idea:
if it walks like a duck, quacks like a duck, then it is probably a duck
• [figure: training records plotted around an unknown record; choose the k nearest neighbours]
Popular supervised approaches
􏰁 naïve Bayes – a probabilistic classifier based on Bayes theorem, which analyses the relationship between features & the labelled class to estimate a conditional probability that links feature values & the class
􏰁 k-nearest neighbour (k-NN) – computes distances from a test instance to available training instances & assigns the majority class label among the k nearest neighbours
􏰁 C4.5 decision tree – C4.5 builds a decision tree from training data by choosing features that most effectively split the data between observed classes
􏰁 support vector machine (SVM) – constructs a hyperplane that optimally separates the training data into two classes
27
k-nearest neighbour
• requires three things:
1. the set of training records
2. a distance metric to compute the distance between records
3. the value of k, i.e. the number of nearest neighbours to retrieve
• to classify an unknown record:
1. compute the distance to the training records
2. identify the k nearest neighbours
3. use the class labels of the nearest neighbours to determine the class label of the unknown record (e.g. by taking a majority vote)
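• a minimal k–NN sketch following these three steps (illustrative Python only; records are (attribute-vector, label) pairs and Euclidean distance is assumed):

import math
from collections import Counter

def knn_classify(training_records, unknown, k=3):
    # 1. compute the distance to every training record
    by_distance = sorted(training_records,
                         key=lambda rec: math.dist(rec[0], unknown))
    # 2. identify the k nearest neighbours
    neighbours = by_distance[:k]
    # 3. majority vote over their class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "Good"), ((1.2, 0.8), "Good"),
            ((5.0, 5.0), "Bad"), ((5.5, 4.5), "Bad"), ((4.8, 5.2), "Bad")]
print(knn_classify(training, (5.1, 5.0), k=3))  # -> "Bad"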
5

3/11/2016
k-nearest neighbour
􏰁 k–NN classifiers are lazy learners
􏰁 they do not build models explicitly
􏰁 … unlike eager learners such as decision tree induction and rule–based systems
􏰁 classifying unknown records is relatively expensive
31
Example: decision tree
􏰁 the root node uses the attribute body temperature to separate warm–blooded from cold–blooded vertebrates
􏰁 since all cold–blooded vertebrates are non–mammals, a leaf node labelled non– mammals is created
􏰁 if the vertebrate is warm– blooded, a subsequent attribute, gives birth, is used to distinguish mammals from other warm–blooded creatures
34
Decision tree learning
Example: classification
􏰁 to classify a test record start from the root node
􏰁 apply the test condition and follow the appropriate branch based on the outcome of the
test
􏰁 this leads either to another internal node, for which a new test condition is applied, or to a leaf node
􏰁 the class label associated with the leaf node is assigned to the record
35
Decision tree
􏰁 decision tree is a flowchart–like structure in which:
􏰁 each internal node represents a test on an attribute 􏰁 each branch represents the outcome of the test
􏰁 each leaf node represents a class label (decision)
􏰁 the paths from root to the leaves represent classification rules
33
How to build a decision tree?
􏰁 many decision trees for a given set of attributes
􏰁 some decision trees are more accurate than others
􏰁 finding the optimal tree is computationally infeasible because of the exponential size of the search space
􏰁 efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree
􏰁 they usually employ a greedy strategy that grows a decision tree by making a series of locally optimal decisions about which attribute to use to partition the data
36
6

3/11/2016
Hunt’s algorithm
􏰁 one such greedy algorithm is Hunt’s algorithm
􏰁 the basis of many existing decision tree induction
algorithms, including ID3, C4.5 and CART
􏰁 a decision tree is grown recursively by partitioning the training records into successively smaller subsets
􏰁 notation:
􏰁 let Dt be the set of training records that are
associated with node t
􏰁 let C = {c1, c2, . . . , cn} be the class labels
37
40
Hunt’s algorithm
if ( all records in Dt belong to the same class ci ) then
  { t is a leaf node labelled as ci }
else {
1. an attribute test condition is selected to partition the records into smaller subsets
2. a child node is created for each outcome of the test condition
3. the records in Dt are distributed across the children based on the outcomes
4. the algorithm is then recursively applied to each child node
}
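• a recursive sketch of this procedure (illustrative Python; the test-condition selection `choose_split` and the record representation are placeholders, not part of the slides):

from collections import Counter

def hunt(records, choose_split):
    # records: list of (attributes, label) pairs
    # choose_split: placeholder returning a function mapping attributes to a
    # branch outcome, or None when no useful split exists
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                     # all records in the same class
        return {"leaf": labels[0]}
    split = choose_split(records)
    if split is None:                             # cannot split further: majority class
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    branches = {}
    for attrs, label in records:                  # distribute records across children
        branches.setdefault(split(attrs), []).append((attrs, label))
    children = {outcome: hunt(subset, choose_split)   # recurse on each child node
                for outcome, subset in branches.items()}
    return {"split": split, "children": children}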
38
Hunt’s algorithm – further modification
􏰁 the algorithm will work if:
1. every combination of attribute values is present
in the training data, and
2. each combination has a unique class label
􏰁 too stringent for use in most practical situations!
􏰁 algorithm needs to be modified so that it can handle
these cases:
1. some of the child nodes created are empty
2. all records have identical attribute values except for the class label
41
Example
􏰁 deciding whether a loan applicant will default on loan payments
􏰁 classes: defaulted = yes vs. defaulted = no
􏰁 a training set of records of previous borrowers
􏰁 each record contains personal information of a borrower along with a class label indicating whether the borrower has defaulted on loan payments
39
Hunt’s algorithm – further modification
􏰁 case 1: no records associated with a child node
􏰁 this can happen if none of the training records have the combination of attribute values associated with such node
􏰁 if this is the case, then the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node
42
7

3/11/2016
Hunt’s algorithm – further modification
􏰁 case 2: all records associated with Dt have identical attribute values except for the class label
􏰁 it is not possible to split these records any further
􏰁 if this is the case, then the node is declared a leaf node with the same class label as the majority class of training records associated with this node
43
2. How should the splitting procedure stop?
􏰁 a stopping condition is needed to terminate the tree–growing process
􏰁 a possible strategy is to continue expanding a node until:
1. all records belong to the same class, or
2. all records have identical attribute values
􏰁 both conditions are sufficient to stop any decision tree induction algorithm
􏰁 … but other criteria can be imposed to allow the tree– growing procedure to terminate early
􏰁 the advantages of early termination will be discussed later 46
Inducing decision trees
􏰁 a learning algorithm for inducing decision trees must address the following two questions:
1. How should the training records be split? 2. How should the splitting procedure stop?
44
Attribute test conditions
􏰁 decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types:
􏰁 binary
􏰁 nominal (categorical) 􏰁 ordinal
􏰁 continuous
47
1. How should the training records be split?
􏰁 each recursive step of the tree–growing process must select an attribute test condition to divide the records into smaller subsets
􏰁 to implement this step, the algorithm must provide:
1. a method for specifying the test condition for
different attribute types, and
2. an objective measure for evaluating the goodness of each test condition
45
Binary attributes
• e.g. body temperature: warm–blooded vs. cold–blooded
• the test condition for a binary attribute generates two potential outcomes
Nominal (categorical) attributes
• e.g. marital status: single, married or divorced
• the test condition can be expressed in two ways:
  1. multi–way split: the number of outcomes depends on the number of distinct values of the attribute
  2. binary split: considers all 2^(k−1) − 1 ways of creating a binary partition of k attribute values
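• as an illustration of where the 2^(k−1) − 1 figure comes from, the sketch below enumerates the candidate binary splits for a small set of nominal values (the function name and the marital-status example are illustrative only):

from itertools import combinations

def binary_partitions(values):
    """Yield every way of splitting a set of nominal values into two non-empty groups.

    Fixing one value on the left-hand side avoids counting mirror-image splits twice,
    giving 2**(k-1) - 1 partitions for k distinct values.
    """
    values = sorted(values)
    anchor, rest = values[0], values[1:]
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            left = {anchor, *subset}
            right = set(values) - left
            if right:                       # skip the split that leaves one side empty
                yield left, right

splits = list(binary_partitions({"single", "married", "divorced"}))
print(len(splits))                          # 3 == 2**(3-1) - 1
for left, right in splits:
    print(left, "|", right)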
Continuous attributes
• for the binary split, the decision tree algorithm must consider all possible split positions v and select the one that produces the best partition
• for the multi–way split, the algorithm must consider all possible ranges of continuous values
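• one common way to realise "all possible split positions" for a numeric attribute is to try the midpoints between consecutive distinct values; the sketch below does this for a binary split, scoring each candidate threshold with a caller-supplied impurity function such as the Gini index sketched further below (the names and the weighted-impurity scoring are illustrative assumptions):

def best_threshold(values, labels, impurity):
    """Pick the binary split position v that minimises the weighted child impurity.

    values   -- numeric attribute values, one per record
    labels   -- corresponding class labels
    impurity -- function mapping a list of labels to an impurity score
    """
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(v for v, _ in pairs))
    best_v, best_score = None, float("inf")
    # candidate split positions: midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2
        left = [label for value, label in pairs if value < v]
        right = [label for value, label in pairs if value >= v]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score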
Ordinal attributes
• e.g. shirt size: small < medium < large < extra large
• ordinal attributes can also produce binary or multi–way splits
• … but ordinal attribute values can be grouped only if the grouping does not violate the order property, e.g. {small, medium} vs. {large, extra large} is allowed, whereas {small, large} vs. {medium, extra large} is not

Selecting the best split
• many measures can be used to determine the best split
• these measures are defined in terms of the class distribution of the records before and after splitting
• they are often based on the degree of impurity of the child nodes, e.g.
  Entropy(t) = − Σi p(i|t) log2 p(i|t)
  Gini(t) = 1 − Σi p(i|t)^2
  where p(i|t) is the fraction of records belonging to class ci at a given node t

Continuous attributes
• e.g. annual income
• the test condition can be expressed as:
  1. a comparison test A < v (or A ≥ v) with binary outcomes, or
  2. a range query with outcomes of the form vi ≤ A < vi+1, for i = 1, …, k

Model overfitting
• a model h overfits the training data if some alternative model h′ has a higher training set error but a lower testing set error, i.e. training set error(h) < training set error(h′) but testing set error(h) > testing set error(h′)
• due to the presence of noise, lack of representative instances, etc.
• a good model must not only fit the training data well, but also accurately classify records it has never seen
• i.e. a good model must have low training error and low generalisation error
Selecting the best split
• decision tree induction algorithms often choose the test condition that maximises the gain
  Δ = I(parent) − Σj ( N(vj) / N ) · I(vj)
  where I(·) is the impurity measure, N is the number of records at the parent node, and N(vj) is the number of records associated with child node vj
• since I(parent) is the same for all test conditions, maximising the gain is equivalent to minimising the weighted average impurity of the child nodes
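• a small sketch of the impurity and gain calculations using the Gini index (the choice of Gini rather than entropy and the helper names are illustrative assumptions):

from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gain(parent_labels, children_labels):
    """Impurity gain of a candidate split.

    parent_labels   -- class labels at the parent node
    children_labels -- one list of labels per child produced by the split
    """
    n = len(parent_labels)
    weighted = sum(len(child) / n * gini(child) for child in children_labels)
    return gini(parent_labels) - weighted

# e.g. splitting 6 records (4 "yes", 2 "no") into two children
parent = ["yes"] * 4 + ["no"] * 2
children = [["yes", "yes", "yes"], ["yes", "no", "no"]]
print(round(gini(parent), 3), round(gain(parent, children), 3))   # 0.444 0.222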
Tree pruning
• if a decision tree is fully grown, it may lose some generalisation capability
• large decision trees are prone to overfitting
• a tree–pruning step can be performed to reduce the size of the tree
Decision tree induction
• input: the training records E and the attribute set F
Naïve–Bayesian classification
Example
• classes: mammal, non–mammal
• attributes: give birth, can fly, live in water, have legs
• training data: a table of animals described by these four attributes, each labelled mammal or non–mammal
Naïve–Bayesian classification
• naïve Bayes classifier – a probabilistic learning method based on Bayes' theorem
• Bayes' theorem – uses evidence e to estimate the probability of a hypothesis h:
  P(h|e) = P(e|h) · P(h) / P(e)
  where P(h) is the prior probability of h, P(e|h) is the probability of observing e given h, P(e) is the probability of observing e, and P(h|e) is the posterior probability of h
• when multiple hypotheses are considered, Bayes' theorem provides the probability of each hypothesis being true given the evidence
Naïve–Bayesian classification
• given a record with attributes A1, A2, …, An, the goal is to predict class C
• class probability: P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)
• as before, suggest the class with the highest probability for the given attributes
• Q: How do we calculate P(A1, A2, …, An | C)?
• if we assume independence among the attributes, then:
  P(A1, A2, …, An | C) = P(A1 | C) · P(A2 | C) · … · P(An | C)
• the individual probabilities P(Ai | C) are easy to calculate
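• a compact sketch of these calculations for categorical attributes (the data layout and the unsmoothed maximum-likelihood estimates are assumptions; a practical implementation would normally add Laplace smoothing to avoid zero probabilities):

from collections import Counter, defaultdict

def train_naive_bayes(records):
    """Estimate P(C) and P(Ai = v | C) from (attribute_dict, class_label) pairs."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)            # (attribute, class) -> counts of values
    for attrs, label in records:
        for attribute, value in attrs.items():
            value_counts[(attribute, label)][value] += 1
    priors = {c: n / len(records) for c, n in class_counts.items()}
    return priors, value_counts, class_counts

def classify(attrs, priors, value_counts, class_counts):
    """Return the class with the highest P(C) * product of P(Ai | C)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attribute, value in attrs.items():
            score *= value_counts[(attribute, c)][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class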
Naïve–Bayesian classification
• training step: the training data is used to estimate the parameters of a probability distribution (the right–hand side of the equation)
• testing phase: these estimations are used to compute the posterior probability (the left–hand side of the equation) for a given test instance
• each test instance is classified using the hypothesis with the largest posterior probability
Naïve–Bayesian classification
• given a record with a single attribute A, the goal is to predict class C
• class probability: P(C | A) = P(A | C) · P(C) / P(A)
• suggest the class with the highest probability for the given attribute
• What about multiple attributes A1, A2, …, An?
Example
• A – the given attribute values, M – mammal, N – non–mammal
• compare P(A | M) · P(M) with P(A | N) · P(N) and choose the class with the larger value
• in this example the mammal class scores higher → mammal
Naïve–Bayesian classification
• robust to irrelevant attributes
• robust to isolated noise points
• handles missing values by ignoring them during probability calculations
• the independence assumption may not hold for some attributes
Example
• (figure: training data with two candidate separating hyperplanes, B1 and B2)
Support vector machine (SVM)

Separation
• Which hyperplane better separates the data? B1 or B2?
• How do we define "better"?
• the hyperplane that maximises the margin
• B1 is better than B2
Support vector machine (SVM)
• the goal of an SVM is to construct a hyperplane (decision boundary) that optimally separates the training data into two classes
Margin
• intuitively, we are more certain of the class of points that lie far from the separating hyperplane
• thus, it is desirable for all training points to be as far from the hyperplane as possible
• … but on the correct side of that hyperplane
Problem
• How do we find the hyperplane that maximises the separation margin?
Inter–annotator agreement
Solution
• we want to maximise the margin, which for a separating hyperplane w·x + b = 0 is 2 / ||w||
• this is equivalent to minimising ||w||^2 / 2
• … but subject to the constraints that every training point lies on the correct side of the margin: yi (w·xi + b) ≥ 1 for all training points (xi, yi), where yi ∈ {−1, +1}
Annotations
• supervised machine learning (SML) depends on manually annotated training data
• learning from mistakes does not work in SML!
• an SML approach will learn to make the same mistakes as the human annotators!
• therefore, the crucial issue is: Are the annotations correct?
• if the annotations differ between annotators, then who do we trust?
• this is a constrained optimisation problem
• there are numerical approaches to solve it, e.g. quadratic programming
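• as a quick illustration of handing this constrained optimisation problem to an off-the-shelf numerical solver, the sketch below uses scikit-learn's SVC with a linear kernel; the library choice and the toy data are assumptions for illustration, not something prescribed by the slides:

import numpy as np
from sklearn.svm import SVC

# toy training data: two well-separated classes in 2-D, labels -1 / +1
X = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 8.5]])
y = np.array([-1, -1, 1, 1])

# SVC solves the margin-maximisation problem numerically
# (minimise ||w||^2 / 2 subject to y_i (w . x_i + b) >= 1, softened by C)
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("w =", clf.coef_[0])               # normal vector of the separating hyperplane
print("b =", clf.intercept_[0])          # offset of the hyperplane
print("support vectors:", clf.support_vectors_)
print(clf.predict([[1.5, 1.0], [8.5, 9.0]]))   # expected: [-1  1]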
Issues
• some problems are not linearly separable
• some problems may have a non–linear decision boundary
• … but such boundaries are more expensive to find
Validity vs. reliability
• validity: Is an annotation correct?
  • problem: there is no "ground truth", i.e. we do not know what is correct!
  • we cannot check directly whether an annotation is correct
• reliability: Are human annotators consistent with one another?
  • i.e. do they consistently make the same annotations?
• assumption: high reliability implies validity
Disagreement
• sources of disagreement include:
  • the task: difficult, complex, …
  • the annotator: tiredness, laziness, subjectivity, malice, gaming the task, …
Inter-annotator agreement
• How can we measure reliability?
• use multiple annotators and measure inter-annotator agreement
• e.g. Is this an apple?

  A    B    Agree?
  yes  yes  yes
  no   yes  no
  yes  no   no
  yes  yes  yes
  no   no   yes
  no   no   yes

• observed agreement: 4 out of 6 = 0.6666… ≈ 67%
• Is 67% good enough? What is good enough?
Inter–annotator agreement
• the agreement measure must be corrected for chance agreement
• Ao = observed (percentage) agreement
• Ae = expected agreement by chance
• chance–corrected agreement: κ = (Ao − Ae) / (1 − Ae)
• perfect agreement: Ao = 1, so κ = 1
• agreement purely by chance: Ao = Ae, so κ = 0
• perfect disagreement: Ao = 0, so κ < 0
Contingency tables
• contingency table of proportions: each cell holds the proportion of items that annotator A and annotator B placed in a given pair of categories
Chance agreement
• agreement purely by chance: annotator decisions are like coin tosses
• e.g. Is this an apple?
  • P(randomly choosing yes) = 0.5
  • P(randomly choosing no) = 0.5
  • P(both choosing yes) = 0.5 × 0.5 = 0.25
  • P(both choosing no) = 0.5 × 0.5 = 0.25
  • P(both agree) = 0.25 + 0.25 = 0.5 = 50%
• 67% observed agreement is better than 50% chance agreement
Calculating agreement
• expected chance agreement: Ae = P(A=yes)·P(B=yes) + P(A=no)·P(B=no) = 0.5 × 0.5 + 0.5 × 0.5 = 0.5
• observed agreement: Ao = 4/6 ≈ 0.67
• Cohen's kappa coefficient: κ = (Ao − Ae) / (1 − Ae) = (0.67 − 0.5) / (1 − 0.5) ≈ 0.33
• a statistical measure of inter–annotator agreement
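• the same calculation as a short sketch in code, using the six "Is this an apple?" judgements from the earlier table (the function and variable names are illustrative):

def cohens_kappa(annotations_a, annotations_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(annotations_a)
    observed = sum(a == b for a, b in zip(annotations_a, annotations_b)) / n
    categories = set(annotations_a) | set(annotations_b)
    # expected chance agreement from each annotator's marginal proportions
    expected = sum((annotations_a.count(c) / n) * (annotations_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

a = ["yes", "no", "yes", "yes", "no", "no"]
b = ["yes", "yes", "no", "yes", "no", "no"]
print(round(cohens_kappa(a, b), 2))   # prints 0.33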
What agreement is good enough?
• several interpretation scales for kappa have been proposed, e.g. Landis and Koch (1977), Krippendorff (1980) and Green (1997)

Group exercise
• How can Cohen's kappa coefficient be used in the context of crowdsourcing annotations for supervised machine learning?
• homework: calculate Cohen's kappa coefficient for this example →
Evaluation
• a classifier can make two types of errors: false positives (FP) and false negatives (FN)
• true positive (TP): an instance is correctly predicted to belong to the given class
• true negative (TN): an instance is correctly predicted not to belong to the given class
• false positive (FP): an instance is incorrectly predicted to belong to the given class
• false negative (FN): an instance is incorrectly predicted not to belong to the given class

Evaluation
• precision: the proportion of predicted positives that are actually positive
  P = TP / (TP + FP)
• recall: the proportion of actual positives that were predicted positive
  R = TP / (TP + FN)
Evaluation
• F–measure: a combined measure that assesses the precision/recall trade–off using their harmonic mean
  F = 2 · P · R / (P + R)
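• a minimal sketch of the three measures computed from true and predicted labels for one class of interest (the names and the toy data are illustrative):

def precision_recall_f(true_labels, predicted_labels, positive_class):
    """Precision, recall and F-measure for one class of interest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_class and p == positive_class for t, p in pairs)
    fp = sum(t != positive_class and p == positive_class for t, p in pairs)
    fn = sum(t == positive_class and p != positive_class for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

true = ["yes", "yes", "no", "no", "yes", "no"]
pred = ["yes", "no", "no", "yes", "yes", "no"]
print(precision_recall_f(true, pred, "yes"))   # roughly (0.67, 0.67, 0.67)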
Which algorithm?
• supervised machine learning is about classifying an object based on past experience (training data)
• we have learnt about k–NN, decision tree induction, naïve Bayesian classification and SVMs
• … and there are others, e.g. neural networks
No free lunch!
• Which algorithm?
• the no free lunch theorem (David H. Wolpert, 1996): any two algorithms are equivalent when their performance is averaged across all possible problems
• a general–purpose universal learning strategy is impossible
• one strategy can outperform another only if it is specialised to the structure of the specific problem under consideration
Weka
• data mining software written in Java
• http://www.cs.waikato.ac.nz/ml/weka/
• a collection of machine learning algorithms for data mining tasks
• the algorithms can either be applied directly to a dataset or called from your own Java code
• open source software issued under the GNU General Public License
Cross–validation
• a model validation technique for assessing how well the results will generalise to an independent data set
• estimates how accurately a predictive model will perform in practice
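• a bare-bones k-fold cross-validation sketch (the fold construction, the accuracy metric and the train/classify interface are assumptions chosen to match the earlier sketches; it assumes at least k records):

def cross_validate(records, train, classify, k=10):
    """Estimate accuracy by k-fold cross-validation over (attributes, label) pairs."""
    folds = [records[i::k] for i in range(k)]       # simple round-robin folds
    accuracies = []
    for i, test_fold in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = train(training)
        correct = sum(classify(model, attrs) == label for attrs, label in test_fold)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / len(accuracies)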
Supervised machine learning
• advantages:
  • data driven, so often more accurate than human–crafted rules
  • humans are often incapable of expressing what they know (e.g. English grammar rules), but can easily classify examples
• disadvantages:
  • needs a lot of labelled data (in the era of Web 2.0, crowdsourcing can address this problem)
  • lack of explanatory power
Summary