www.cardiff.ac.uk/medic/irg-clinicalepidemiology


Introduction to data mining


Information modelling & database systems

in this lecture, we begin with the essence of data mining
we cover Bonferroni’s principle, which is a warning about overusing the ability to mine data

Data mining
data mining is the discovery of “models” for data
What is a “model”?
a “model” can be interpreted in several different ways
statistical modelling
machine learning
computational approaches to modelling
summarisation
feature extraction

Statistical modelling
statisticians were the first to use the term data mining
statisticians view data mining as the construction of a statistical model
an underlying distribution from which the visible data is drawn
e.g. suppose the data is a set of numbers with a normal distribution
the mean and standard deviation would become the model of the data
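The normal-distribution example can be sketched in a few lines of Python. This is only an illustration: the `observations` list is made up, and the population standard deviation (`statistics.pstdev`) is one reasonable choice of estimator, not something the lecture prescribes.

```python
import statistics

def fit_normal_model(data):
    """Summarise the data by two numbers: its mean and standard deviation."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # population standard deviation (an assumption)
    return mu, sigma

# hypothetical observations, assumed to be drawn from a normal distribution
observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu, sigma = fit_normal_model(observations)

# the two parameters are the whole "model"; the raw data is no longer needed
model = statistics.NormalDist(mu, sigma)
print(mu, sigma, model.cdf(mu))
```

Once fitted, the model can answer questions about the data (e.g. how likely a value below some threshold is) without going back to the individual numbers.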

Machine learning
some say that data mining = machine learning
some data mining uses algorithms from machine learning
supervised machine learning uses data as a training set to train an algorithm, e.g. Bayes nets, support vector machines, etc.
good approach when we have little idea of what we are looking for in the data
e.g. it is not very clear what it is about movies that makes certain viewers (dis)like them
… but when using a sample of their responses, machine learning algorithms have proved quite successful in predicting movie ratings
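To make the idea of a training set concrete, here is a minimal sketch of supervised learning for rating prediction. It uses a k-nearest-neighbours predictor rather than the Bayes nets or support vector machines named above, simply because it fits in a few lines. The two movie features (action and romance scores) and all the ratings are hypothetical.

```python
import math

def predict_rating(training_set, features, k=3):
    """Predict a rating as the mean rating of the k nearest training examples."""
    by_distance = sorted(training_set, key=lambda ex: math.dist(ex[0], features))
    nearest = by_distance[:k]
    return sum(rating for _, rating in nearest) / len(nearest)

# hypothetical training set: (action score, romance score) -> one viewer's rating
training_set = [
    ((0.9, 0.1), 5),
    ((0.8, 0.2), 4),
    ((0.7, 0.0), 5),
    ((0.1, 0.9), 1),
    ((0.2, 0.8), 2),
    ((0.0, 0.7), 1),
]

# predictions for two unseen movies
action_heavy = predict_rating(training_set, (0.85, 0.1))
romance_heavy = predict_rating(training_set, (0.1, 0.85))
print(action_heavy, romance_heavy)
```

The predictor never states why this viewer prefers action movies; it only generalises from the sample of their responses.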

Computational approaches to modelling
more recently, computer scientists have looked at data mining as an algorithmic problem
the model of the data is simply the answer to a complex query about it
many different approaches, but most can be described as either:
summarising the data succinctly and approximately, or
extracting the most prominent features of the data and ignoring the rest

Summarisation
one of the most interesting forms of summarisation is the PageRank idea
in this form of Web mining, the entire complex structure of the Web is summarised by a single number for each page
a remarkable property of this ranking is that it reflects very well the “importance” of the page
it relates to the degree to which typical searchers would want to see a page returned as an answer to their search query
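A simplified sketch of how such a per-page number can be computed is given below. It is a textbook-style power iteration, not the production algorithm of any search engine. The damping factor of 0.85, the iteration count and the four-page `web` graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # every page receives a small baseline share of rank
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # a page passes its rank on equally to the pages it links to
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # a page with no outgoing links spreads its rank over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# hypothetical four-page web
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C is linked to by every other page and ends up with the highest rank, while page D, which nobody links to, ends up with the lowest.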

Feature extraction
look for the most extreme examples of a phenomenon and represent the data by these examples
two important types of feature extraction:
frequent itemsets – looking for small sets of items that frequently appear together, e.g. hamburger and ketchup
these sets are the characterisation of the data
similar items – often, data looks like a collection of sets, e.g. customers can be viewed as the set of items they bought
the objective is to find pairs of sets that have a relatively large fraction of their elements in common
e.g. Amazon can recommend products that many “similar” customers have bought
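Both types of feature extraction can be sketched on a toy dataset. The code below counts pairs of items by brute force, which only works for small data, and measures the similarity of two sets by the Jaccard similarity (size of the intersection divided by size of the union). Jaccard is one common way to quantify the "fraction of elements in common" and is an assumption here. The baskets and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return the item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_support}

def jaccard(a, b):
    """Fraction of elements two sets have in common: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# hypothetical customers, each viewed as the set of items they bought
baskets = [
    {"hamburger", "ketchup", "cola"},
    {"hamburger", "ketchup"},
    {"hamburger", "ketchup", "bread"},
    {"bread", "milk"},
]

pairs = frequent_pairs(baskets, min_support=3)
similarity = jaccard(baskets[0], baskets[2])
print(pairs, similarity)
```

On real data the number of candidate pairs is far too large to count this way, which is why dedicated algorithms exist for both problems.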

Bonferroni’s principle
a warning against overzealous use of data mining
suppose you look for occurrences of certain events in a dataset
even if the data is completely random, you can still expect to find such occurrences
the number of these occurrences will grow as the dataset grows in size
… but these occurrences may be “bogus”, i.e. be random with no particular cause
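A small simulation illustrates the point. The "event" chosen here, a run of ten heads in a row in a sequence of fair coin flips, is a hypothetical stand-in for any pattern one might go looking for. The data is random by construction, so every occurrence found is bogus.

```python
import random

def count_runs(flips, run_length):
    """Count the windows of run_length consecutive heads (True values)."""
    count = 0
    streak = 0
    for flip in flips:
        streak = streak + 1 if flip else 0
        if streak >= run_length:
            count += 1  # a window of run_length heads ends at this flip
    return count

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
for n in (1_000, 10_000, 100_000):
    flips = [rng.random() < 0.5 for _ in range(n)]
    print(n, count_runs(flips, run_length=10))
```

For a fair coin, the expected count is roughly the number of flips divided by 2^10, so it grows in proportion to the size of the dataset even though nothing is causing the runs.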

Bonferroni’s principle
a theorem in statistics known as the Bonferroni correction gives a statistically sound way to avoid treating random occurrences as if they were real
calculate the expected number of occurrences of the events you are looking for under the assumption that data is random
if this number is significantly larger than the number of real instances you hope to find, then you must expect that most of the occurrences are bogus
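The rule of thumb above can be written as a small check. All names and numbers below are hypothetical, and the `factor` that decides what counts as "significantly larger" is an arbitrary assumption: the principle itself does not fix a threshold.

```python
from fractions import Fraction

def expected_random_occurrences(num_opportunities, probability_by_chance):
    """Expected number of occurrences if the data were completely random."""
    return num_opportunities * probability_by_chance

def mostly_bogus(expected_random, hoped_real, factor=10):
    """True if random occurrences are expected to swamp the real ones."""
    return expected_random >= factor * hoped_real

# hypothetical: a billion places where the event could occur,
# each with a one-in-a-million probability of occurring by chance
expected = expected_random_occurrences(10**9, Fraction(1, 10**6))

# hypothetical: we hope to find about ten real instances
flag = mostly_bogus(expected, hoped_real=10)
print(expected, flag)
```

With these numbers about a thousand occurrences are expected by chance alone, so the ten real instances would be lost among them.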

e.g. suppose we are trying to identify people who cheat on their spouses
we know that the percentage of those who cheat on their spouses is 5%
you decide that people who go out with co-workers more than three times a month are cheating on their spouses
your method discovers that 20% of people qualify as cheaters
20% >> 5%, therefore, in the very best case only ¼ of those flagged will actually be cheaters (true positives), and the rest will be bogus (false positives)
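The arithmetic behind the ¼ can be checked directly. The best case assumes that every actual cheater is among those flagged, so the true positives are at most 5% of the population out of the 20% flagged. The function name is hypothetical.

```python
from fractions import Fraction

def best_case_precision(true_rate, flagged_rate):
    """Upper bound on the fraction of flagged people who are true positives."""
    if flagged_rate <= 0:
        raise ValueError("flagged_rate must be positive")
    # at best, every real case is among those flagged
    return min(Fraction(1), Fraction(true_rate) / Fraction(flagged_rate))

true_rate = Fraction(5, 100)      # 5% of people actually cheat
flagged_rate = Fraction(20, 100)  # 20% of people are flagged by the method

precision = best_case_precision(true_rate, flagged_rate)
bogus = 1 - precision
print(precision, bogus)
```

In practice the method will also miss some actual cheaters, so the true fraction of correct flags is lower than this bound.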
