
Frequent itemsets


Information modelling
& database systems

in the previous lecture we introduced three approaches to data mining
one of these was a computational approach,
which includes:
summarisation
feature extraction
we learnt that mining of frequent itemsets is an important type of feature extraction
in this lecture we will learn how to find frequent itemsets

Market–basket data model

Market basket
market basket (shopping cart) brought to the register for checkout
contains items, i.e. different products that the store sells
a major chain might sell 100,000 different items and collect data about millions of market baskets

itemset = a set of items
frequent itemset = an itemset that appears in many market baskets
a retailer can learn what is commonly bought together
an opportunity to do some clever marketing

Example: diapers & beer
one chain store analysed their basket data and discovered that people who buy diapers are unusually likely to buy beer
one would hardly expect these two items to be related
explanation:
if you buy diapers, you probably
have a baby at home
if you have a baby, then you are
unlikely to be drinking at a pub
instead, you are more likely to bring beer home
marketing ploy: put diapers on sale and raise the price of beer

Other applications
items = words
baskets = documents, e.g. Web pages, blogs, tweets
a basket (document) contains those items (words) that are present in the document
we would hope to find among the frequent pairs some pairs of words that represent a joint concept (relationship)
e.g. a pair like {Brad, Angelina} appears with high frequency
Brad + Angelina = Brangelina

The market–basket data model
used to describe a common form of many–to–many relationship between two types of objects:
on the one hand, we have items
on the other we have baskets (aka transactions)
each basket consists of a set of items (an itemset)
we assume that:
the number of items in a basket is small
the number of baskets is very large

Frequent itemsets
intuitively, a set of items that appears in many baskets is said to be frequent
How many is many?! 10, 100, 1K, 1M?
s is a number called the support threshold
I is a set of items
support for I = the number of baskets that
contain it as a subset
we say that I is frequent if its support ≥ s
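the definition translates directly into code; a minimal Python sketch (the helper name, the baskets and the threshold below are illustrative, not part of the lecture):

```python
# minimal sketch of the support definition; data and names are illustrative

def support(itemset, baskets):
    """Number of baskets that contain itemset as a subset."""
    return sum(1 for basket in baskets if itemset <= basket)

baskets = [{"cat", "and", "dog", "bites"},
           {"cat", "killer", "likely", "is", "a", "big", "dog"}]
s = 2  # hypothetical support threshold

I = {"cat", "dog"}
print(support(I, baskets))        # 2
print(support(I, baskets) >= s)   # True, so I is frequent here
```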

documents returned by googling cat and dog:
{cat, and, dog, bites}
{yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}
{cat, killer, likely, is, a, big, dog}
{professional, free, advice, on, dog, training, puppy, training}
{cat, and, kitten, training, and, behaviour}
{dog, &, cat, provides, dog, training, in, Eugene, Oregon}
{dog, and, cat, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}
{shop, for, your, show, dog, grooming, and, pet, supplies}

the empty set ∅ is a subset of any set
therefore, its support is 8
we shall not concern ourselves with the empty set, since it tells us nothing
singletons: {cat}, {and}, {dog}, {bites}, {yahoo}, {news}, {claims}, {a}, {mated}, {with}, {produced},
{viable}, {offspring}, …
doubletons: {cat, and}, {and, dog}, {cat, dog},
{dog, bites}, {cat, bites}, {yahoo, news}, …
triples, etc.

counting the occurrences of each word in the eight baskets above, the frequent singletons (support ≥ 3) are:

Itemset      Support
{dog}           7
{cat}           6
{and}           5
{a}             3
{training}      3

Example continued
now, let us look at the doubletons
a doubleton cannot be frequent unless both items in the set are frequent by themselves
therefore, we only need to look at combinations of frequent singletons
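this pruning is easy to express in code; a small sketch, using the frequent singletons found above:

```python
from itertools import combinations

# the frequent singletons from the example (support >= 3)
frequent_items = ["cat", "dog", "and", "a", "training"]

# a doubleton can only be frequent if both its items are frequent,
# so these ten pairs are the only candidates we need to count
candidates = [frozenset(pair) for pair in combinations(frequent_items, 2)]
print(len(candidates))  # 10
```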


counts of the candidate doubletons in the eight baskets:

            {dog}   {and}   {a}   {training}
{cat}         5       4      3        2
{dog}                 4      3        2
{and}                        2        1
{a}                                   0

Example continued
five frequent doubletons (support ≥ 3):
{cat, dog}, {cat, and}, {cat, a}, {dog, and}, {dog, a}
next, let us see if there are frequent triples
in order to be a frequent triple, each pair of elements in the triple must be a frequent doubleton
by combining frequent doubletons, we get two candidates for frequent triples:
{cat, dog, and}
{cat, dog, a}

Example continued
candidates for frequent triples:
{cat, dog, and} in baskets (1), (2) and (7)
{cat, dog, a} in baskets (2), (3) and (7)
support ≥ 3, so both are frequent triples
there can be no frequent quadruples or larger sets, since a frequent quadruple would require all four of the triples it contains to be frequent, and we only found two
this approach will be summarised later as the a–priori algorithm

Association rules

Association rules
information about frequent itemsets is often presented as a collection of if–then rules, called association rules
the form of an association rule is I → i, where I is an itemset and i is an item
interpretation: if I appears in some basket, then i is “likely” to appear in that basket as well
we formalise the notion of “likely” by defining the confidence of the rule as the following ratio:
support(I ∪ { i }) / support(I)
e.g. confidence({cat, dog} → and) =
support({cat, dog, and}) / support({cat, dog}) = 3 / 5
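as a small sketch, the confidence calculation in Python (the function names are illustrative; support() is the helper from the earlier sketch, repeated here so the snippet is self-contained):

```python
def support(itemset, baskets):
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, i, baskets):
    """confidence(I -> i) = support(I | {i}) / support(I)"""
    return support(I | {i}, baskets) / support(I, baskets)

# with the eight example baskets above,
# confidence({"cat", "dog"}, "and", baskets) evaluates to 3 / 5 = 0.6
```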

Confidence
confidence of the association rule I → i:
confidence(I → i) = support(I ∪ { i }) / support(I)
confidence can be useful, provided the support for the left side (I) is fairly large, e.g.
{hotdog} → mustard
there is nothing “unusual” about people buying hot dogs and mustard together
as long as we know that many people buy hot dogs, and that many people who buy hot dogs will buy mustard as well, we can use this rule to inform sales strategy

Association rules with high confidence
we want the confidence of the rule to be reasonably high, e.g. 50%, or else the rule has little practical effect
confidence(I → i) ≥ 0.5
support(I ∪ {i}) / support(I) ≥ 0.5
support(I ∪ {i}) ≥ 0.5 × support(I)
if I has high support, then I ∪ {i} will also have fairly high support
finding association rules with high confidence is not much harder than finding frequent itemsets
assume that it is possible to find those frequent itemsets whose support ≥ threshold s

Finding rules with high confidence
suppose we have found all itemsets with support ≥ threshold s
within them, we can find all association rules that have both high support and high confidence
let I be a frequent itemset of n items
there are n possible association rules involving I:
I \ {i} → i for each item i in I
I is frequent and I \ {i} is at least as frequent
therefore, I \ {i} is also a frequent itemset

Finding rules with high confidence
having already calculated the support of both I and I \ {i}, their ratio is the confidence of the rule I \ {i} → i:
confidence(I \ {i} → i) = support(I) / support(I \ {i})
we can now focus on those with high confidence
assumption: there are not too many frequent itemsets and thus not too many high–confidence association rules
the reason is that such association rules need to be acted upon
e.g. if we give a store manager a million association rules, they cannot even read them, let alone act on them
the support threshold should be adjusted so that we do not get too many frequent itemsets
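a sketch of this rule–generation step, assuming we already hold a dictionary mapping each frequent itemset to its support (the function name and the small supports dictionary are illustrative):

```python
def high_confidence_rules(supports, min_conf=0.5):
    """Yield (I - {i}, i, confidence) for every rule with confidence >= min_conf."""
    for I, supp_I in supports.items():
        if len(I) < 2:          # singletons yield no rules
            continue
        for i in I:
            left = I - {i}
            conf = supp_I / supports[left]  # support(I) / support(I \ {i})
            if conf >= min_conf:
                yield left, i, conf

# illustrative supports taken from the cat-and-dog example
supports = {frozenset({"cat"}): 6,
            frozenset({"dog"}): 7,
            frozenset({"cat", "dog"}): 5}
for left, i, conf in high_confidence_rules(supports):
    print(set(left), "->", i, round(conf, 2))  # {dog} -> cat 0.71, {cat} -> dog 0.83
```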

A–priori algorithm

Representation of market–basket data
we assume that market–basket data is stored in a file basket by basket, e.g.
{23,456,1001}{3,18,92,145}{…

the character { begins a basket
the character } ends a basket
the items in a basket are represented by integers
… and are separated by commas
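a minimal parser sketch for this format (the function name and file handling are illustrative):

```python
def read_baskets(path):
    """Yield each basket as a set of integer item numbers."""
    with open(path) as f:
        data = f.read()
    # each basket looks like {23,456,1001}; split the stream on '}'
    for chunk in data.split("}"):
        chunk = chunk.strip().lstrip("{")
        if chunk:
            yield {int(item) for item in chunk.split(",")}

# e.g. the string {23,456,1001}{3,18,92,145} parses into the two
# baskets {23, 456, 1001} and {3, 18, 92, 145}
```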

Finding frequent pairs
let us concentrate on finding frequent pairs
if we have enough main memory to count all pairs, then it is simple to read the file of baskets in a single pass
for each basket, we use a double loop to generate all pairs
each time we generate a pair, we add 1 to its count
at the end, we examine all pairs to see which have counts ≥ support threshold s
these are the frequent pairs
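a sketch of this naive single-pass counting (names are illustrative; it assumes all pair counts fit in main memory):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, s):
    """Count every pair in one pass; only viable while memory suffices."""
    counts = Counter()
    for basket in baskets:                            # one pass over the file
        for pair in combinations(sorted(basket), 2):  # the double loop
            counts[pair] += 1
    # keep the pairs whose count reaches the support threshold s
    return {pair: c for pair, c in counts.items() if c >= s}
```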

Finding frequent pairs
however, a single-pass approach will fail if there are too many pairs to count them all in main memory
the a–priori algorithm is designed to reduce the number of pairs that must be counted
… but at the expense of performing two passes over data, rather than a single pass

The first pass of a–priori
in the first pass, we create an array of counts
the i–th element counts the occurrences of the item numbered i
initially, the counts for all the items are set to 0
as we read baskets, we add 1 to the corresponding element of the array
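a minimal sketch of this first pass, assuming the items are numbered 1 to n:

```python
def first_pass(baskets, n):
    """Return counts[i] = number of baskets containing item i."""
    counts = [0] * (n + 1)  # index 0 unused; all counts start at 0
    for basket in baskets:
        for i in basket:
            counts[i] += 1
    return counts
```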

A note on the power law
many types of data follow power law distribution, which is characterised by:
a very small number of events that have a very high probability of appearing, and
a very large number of events that have a very low probability of appearing
e.g. few words (e.g. the) with enormously high probability vs. many words (e.g. database) with low probability of appearing in a randomly chosen sentence
represented by a continuously decreasing curve

Between the passes of a–priori
after the first pass, we examine the item counts to determine which ones are frequent singletons
because of the power law, we do not usually expect many singletons to be frequent
anyway, by setting the support threshold s sufficiently high, we can ensure not to get too many frequent sets
a typical support threshold s would be 1% of the baskets
for the second pass of a–priori, we create a new numbering from 1 to m for frequent items

The second pass of a–priori
a pair cannot be frequent unless both its members are frequent
so, we count all pairs that consist of two frequent items
the space required is proportional to m², e.g. an m × m matrix of counts
the second pass steps:
for each basket, look in the frequent–items table to see which of its items are frequent
in a double loop, generate all pairs of frequent items in that basket
for each such pair, add 1 to its count in the data structure used to store counts
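putting the renumbering and the counting together, a sketch of the second pass (names are illustrative, and a plain dictionary stands in for the m × m count structure):

```python
from itertools import combinations

def second_pass(baskets, item_counts, s):
    """Count pairs of frequent items only; item_counts comes from the first pass."""
    # between the passes: give the m frequent items a new numbering 0..m-1
    frequent = [i for i, c in enumerate(item_counts) if c >= s]
    index = {item: k for k, item in enumerate(frequent)}

    pair_counts = {}
    for basket in baskets:
        # look up which items of this basket are frequent
        kept = sorted(index[i] for i in basket if i in index)
        # double loop: generate all pairs of frequent items in the basket
        for pair in combinations(kept, 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    # translate back to the original item numbers and apply the threshold
    return {(frequent[a], frequent[b]): c
            for (a, b), c in pair_counts.items() if c >= s}
```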

