www.cardiff.ac.uk/medic/irg-clinicalepidemiology
Classification
Information modelling
& database systems
rule-based systems: ‘if-then’ (deduction)
supervised machine learning (induction)
k–nearest neighbour
inducing decision trees
naïve Bayes
support vector machines
inter–annotator agreement
evaluation: precision, recall, F-measure
cross–validation
platforms: Weka, Mechanical Turk, Crowdflower
Dealing with complex information
‘divide’ & conquer
‘compartmentalise’
grouping similar or related things together
associating information
classifying objects into different categories
Classification
classification is used to map an
object to one (or more) classes
from a predefined set of classes
class = a set of objects having some property or attribute in common and differentiated from others by kind, type or quality
class – also referred to as a type, category, kind, sort…
classifying …
e–mail as spam or not spam
credit card transactions as legitimate or fraudulent
tumour cells as benign or malignant
news stories as finance, weather, entertainment, sports, etc.
Classification task
the input data for a classification task is a collection of records
each record, also known as an instance or example, is characterised by a tuple (x, y), where
x is the set of attributes
y is a special attribute designated as the class label
classification is the task of mapping an input attribute set x into its class label y, i.e. (x, ?) → (x, y)
attributes can be discrete or continuous
the class label must be a discrete attribute
[figure: example data set shown as a table, with attribute columns and a class label column whose values come from a predefined set of labels]
Classification task
formally, classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y, i.e. f(x) = y
the target function is also known informally as a classification model
[figure: a classification model maps an attribute set (x1, x2, …, xn) to a class label y]
Classification
when dealing with large volumes of
information, we would like to take
advantage of computers’ processing
power to classify objects efficiently
Can this process be automated? Yes, various classification methods are used in data mining!
classifier – an algorithm that implements classification
two basic types of classification methods:
rule–based (top–down approach, deductive reasoning): specific conclusion is derived from general principles
machine learning (bottom–up approach, inductive reasoning): general conclusion is derived from specific instances
Rule–based classification
deductive reasoning: applying a series of rules to reach a conclusion
modus ponens:
If A then B.
A.
Therefore B.
e.g. junk e-mail filtering rule:
If (Subject contains ‘viagra’) then (e-mail is junk).
Subject contains ‘viagra’
Therefore e-mail is junk.
Example: E-mail filter rules
Examples of classification problems
text categorisation, e.g. spam filtering
optical character recognition
natural language processing, e.g. part-of-speech tagging
medical diagnosis
Quiz: give your own examples
Rule–based systems
model expert knowledge
the concept of an expert system is this: the knowledge of an expert is encoded into the rule set
rule–based systems are fairly simplistic, consisting of little more than a set of if–then statements
a relatively simple model that can be adapted to any number of problems
R1: if (Give Birth = no) & (Can Fly = yes), then bird
R2: if (Give Birth = no) & (Live in Water = yes), then fish
R3: if (Give Birth = yes) & (Blood Type = warm), then mammal
R4: if (Give Birth = no) & (Can Fly = no), then reptile
R5: if (Live in Water = sometimes), then amphibian
a rule covers an instance x if the attributes of x satisfy the condition of the rule
rule R1 covers a hawk → bird
rule R3 covers a grizzly bear → mammal
lemur triggers rule R3, so it is classified as a mammal
turtle triggers both R4 and R5: reptile or amphibian
dogfish shark triggers none of the rules
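To make the rule matching concrete, here is a minimal sketch in Python (not part of the original material; the attribute names are invented for illustration) that applies rules R1–R5 with a first-applicable strategy:

# A minimal rule-based classifier: each rule is (conditions, class label).
# First-applicable strategy: rules are tried in order, and the first rule
# whose conditions all hold fires and determines the class.
RULES = [
    ({"gives_birth": "no",  "can_fly": "yes"},         "bird"),       # R1
    ({"gives_birth": "no",  "lives_in_water": "yes"},  "fish"),       # R2
    ({"gives_birth": "yes", "blood_type": "warm"},     "mammal"),     # R3
    ({"gives_birth": "no",  "can_fly": "no"},          "reptile"),    # R4
    ({"lives_in_water": "sometimes"},                  "amphibian"),  # R5
]

def classify(instance, rules=RULES):
    """Return the class of the first rule that covers the instance, else None."""
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # e.g. a dogfish shark triggers none of the rules

lemur = {"gives_birth": "yes", "blood_type": "warm",
         "can_fly": "no", "lives_in_water": "no"}
print(classify(lemur))  # mammal (R3 fires)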
Rule–based classification systems
each rule has a left–hand side & a right–hand side
the left–hand side contains information about certain facts, which must be true in order for the rule to potentially fire (execute)
If more than one rule can fire, which one do we choose?
different heuristics can be applied for conflict resolution:
first applicable
most specific
least recently used
Conflict resolution strategy
first applicable: rules are ordered and the first applicable rule is chosen
simple, allows control, predictable
potential problem: an infinite loop
random: a single random rule from the conflict set is chosen
not predictable
most specific: the rule with the most conditions satisfied is chosen
Conflict resolution strategy
least recently used: each rule is accompanied by a time stamp, and the least recently used rule is chosen
maximises the number of rules that are fired at least once
best rule: each rule is given a weight, which specifies how much it should be considered over the alternatives
the rule with the most preferable outcome is chosen based on this weight
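As an illustration (again only a sketch, reusing the rule format from the earlier example), the 'most specific' strategy can be implemented by gathering the conflict set and keeping the rule with the most satisfied conditions:

def classify_most_specific(instance, rules):
    """Among all rules whose conditions the instance satisfies (the conflict
    set), pick the rule with the most conditions, i.e. the most specific one."""
    conflict_set = [
        (conditions, label) for conditions, label in rules
        if all(instance.get(a) == v for a, v in conditions.items())
    ]
    if not conflict_set:
        return None
    conditions, label = max(conflict_set, key=lambda rule: len(rule[0]))
    return label

rules = [({"gives_birth": "no", "can_fly": "no"}, "reptile"),    # R4
         ({"lives_in_water": "sometimes"}, "amphibian")]         # R5
turtle = {"gives_birth": "no", "can_fly": "no", "lives_in_water": "sometimes"}
print(classify_most_specific(turtle, rules))  # reptile (2 conditions beat 1)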
Rule–based systems
the concept of an expert system: the knowledge of an expert is encoded into the rule set
… but expert knowledge is often tacit knowledge,
i.e. locked in people’s heads
typical bottleneck in expert systems:
knowledge elicitation
knowledge elicitation = the process of
explicating domain-specific knowledge
underlying human performance
explicit = fully & clearly expressed or demonstrated, leaving nothing merely implied
Rule–based system
How to convert expert knowledge into rules?
Do such rules even exist at all?
rule-based systems are only feasible for problems
for which the knowledge in the problem area can
be written in the form of if–then rules
… and for which this problem area is not large!
problems: often very complex, rules can interact, inefficient, cannot learn (i.e. improve with experience), do not work well with subjective problems…
Supervised machine learning
Machine learning
machine learning – studies how to automatically
learn to make accurate predictions based on past observations
supervised machine learning:
a human expert has determined into what
classes an object may be categorised
(classification scheme)
a human expert has provided a training set:
a set of sample objects with known classes
(labels or annotations)
[figure: example training objects, each labelled 'apple' or 'not an apple']
Supervised machine learning
training phase: training set (labelled data) is used to automatically build a classification model
testing phase: test set (also labelled data) is used to evaluate the classification model
the classification model is used to predict class(es) for new (unseen) examples
Training data – example
name gender mask cape tie ears smokes class
Batman male yes yes no yes no Good
Robin male yes yes no no no Good
Alfred male no no yes no no Good
Penguin male no no yes no yes Bad
Catwoman female yes no no yes no Bad
Joker male no no no no no Bad
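Before a learning algorithm can use such a table, the records are typically encoded as an attribute set x and a class label y; one possible encoding in Python (a sketch, with yes/no and gender mapped to integers) is:

# Hypothetical encoding of the training table above:
# columns = (gender, mask, cape, tie, ears, smokes), label = class
records = [
    ("Batman",   "male",   "yes", "yes", "no",  "yes", "no",  "Good"),
    ("Robin",    "male",   "yes", "yes", "no",  "no",  "no",  "Good"),
    ("Alfred",   "male",   "no",  "no",  "yes", "no",  "no",  "Good"),
    ("Penguin",  "male",   "no",  "no",  "yes", "no",  "yes", "Bad"),
    ("Catwoman", "female", "yes", "no",  "no",  "yes", "no",  "Bad"),
    ("Joker",    "male",   "no",  "no",  "no",  "no",  "no",  "Bad"),
]

def encode(value):
    """Map the categorical attribute values to integers."""
    return {"yes": 1, "no": 0, "male": 0, "female": 1}[value]

X = [[encode(v) for v in row[1:-1]] for row in records]  # attribute sets x
y = [row[-1] for row in records]                          # class labels y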
Popular supervised approaches
naïve Bayes – a probabilistic classifier based on Bayes' theorem, which analyses the relationship between features & the labelled class to estimate a conditional probability that links feature values & the class
k-nearest neighbour (k-NN) – computes distances from a test instance to available training instances & assigns the majority class label among the k nearest neighbours
C4.5 decision tree – C4.5 builds a decision tree from training data by choosing features that most effectively split the data between observed classes
support vector machine (SVM) – constructs a hyperplane that optimally separates the training data into two classes
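If scikit-learn is available, all four approaches can be tried out in a few lines; this is only an illustrative sketch on a built-in toy dataset, and DecisionTreeClassifier implements a CART-style tree rather than C4.5, so it stands in for the decision tree learner here:

# Sketch: train each of the four classifiers on a toy dataset
# and report test-set accuracy (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

classifiers = {
    "naive Bayes": GaussianNB(),
    "k-NN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "decision tree (entropy)": DecisionTreeClassifier(criterion="entropy"),
    "SVM": SVC(kernel="linear"),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)             # training phase
    accuracy = clf.score(X_test, y_test)  # testing phase
    print(f"{name}: {accuracy:.2f}")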
k–nearest neighbour (k–NN)
Basic idea:
if it walks like a duck, quacks like
a duck, then it is probably a duck
choose k nearest
neighbours
k-nearest neighbour
requires three things:
the set of training records
distance metric to compute distance between records
the value of k, i.e. the number of nearest neighbours to retrieve
to classify an unknown record:
compute distance to other training records
identify k nearest neighbours
use class labels of nearest neighbours to determine
the class label of unknown record (e.g. by taking majority vote)
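The three steps above translate almost directly into code; a minimal sketch with Euclidean distance and a simple majority vote (the data points are invented for illustration):

from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_classify(record, training_set, k=3):
    """training_set is a list of (attribute_vector, class_label) pairs."""
    # 1. compute the distance from the record to every training record
    # 2. identify the k nearest neighbours
    neighbours = sorted(training_set, key=lambda item: dist(record, item[0]))[:k]
    # 3. take a majority vote among the neighbours' class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"),
                ((4.8, 5.2), "B"), ((5.1, 4.9), "B")]
print(knn_classify((1.1, 0.9), training_set, k=3))  # A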
k-nearest neighbour
k–NN classifiers are lazy learners
they do not build models explicitly
… unlike eager learners such as decision tree induction and rule–based systems
classifying unknown records is relatively expensive
Decision tree learning
Decision tree
a decision tree is a flowchart–like structure in which:
each internal node represents a test on an attribute
each branch represents the outcome of the test
each leaf node represents a class label (decision)
the paths from root to the leaves represent classification rules
Example: decision tree
the root node uses the attribute body temperature to separate warm–blooded from cold–blooded vertebrates
since all cold–blooded vertebrates are non–mammals, a leaf node labelled non–mammals is created
if the vertebrate is warm–blooded, a subsequent attribute, gives birth, is used to distinguish mammals from other warm–blooded creatures
Example: classification
to classify a test record start from the root node
apply the test condition and follow the appropriate branch based on the outcome of the test
this leads either to another internal node, for which a new test condition is applied, or to a leaf node
the class label associated with the leaf node is assigned to the record
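One possible way to represent and traverse such a tree in code (an illustrative sketch using the vertebrate example):

# An internal node tests an attribute; each branch maps an outcome to a
# subtree; a leaf is simply a class label (a plain string).
tree = {
    "attribute": "body_temperature",
    "branches": {
        "cold-blooded": "non-mammal",                      # leaf node
        "warm-blooded": {
            "attribute": "gives_birth",
            "branches": {"yes": "mammal", "no": "non-mammal"},
        },
    },
}

def classify(record, node):
    """Follow the test outcomes from the root until a leaf (class label)."""
    while isinstance(node, dict):                 # internal node
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]          # follow the matching branch
    return node                                   # leaf: the class label

print(classify({"body_temperature": "warm-blooded", "gives_birth": "yes"}, tree))
# mammal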
How to build a decision tree?
many different decision trees can be built for a given set of attributes
some decision trees are more accurate than others
finding the optimal tree is computationally infeasible because of the exponential size of the search space
efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree
they usually employ a greedy strategy that grows a decision tree by making a series of locally optimal decisions about which attribute to use to partition the data
Hunt’s algorithm
one such greedy algorithm is Hunt’s algorithm
the basis of many existing decision tree induction algorithms, including ID3, C4.5 and CART
a decision tree is grown recursively by partitioning the training records into successively smaller subsets
let Dt be the set of training records that are associated with node t
let C = {c1, c2, . . . , cn} be the class labels
Hunt’s algorithm
if ( all records in Dt belong to the same class ci )
then { t is a leaf node labelled as ci }
otherwise, an attribute test condition is selected to partition the records into smaller subsets
a child node is created for each outcome of the test condition
the records in Dt are distributed across the children based on the outcomes
the algorithm is then recursively applied to each child node
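A compact sketch of the recursion (the attribute-selection step, find_best_attribute, is only a placeholder here; a real learner would plug in an impurity-based measure such as those discussed later):

from collections import Counter

def find_best_attribute(records, attributes):
    # placeholder / assumption: a real implementation would choose the
    # attribute with the best impurity-based gain; here we naively take
    # the first remaining attribute
    return attributes[0]

def hunt(records, attributes):
    """records: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in records]
    # base case: all records in Dt belong to the same class ci -> leaf node
    if len(set(labels)) == 1:
        return labels[0]
    # base case: no attributes left to test -> majority class (cf. case 2 below)
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # otherwise: pick a test attribute, partition the records and recurse
    attr = find_best_attribute(records, attributes)
    node = {"attribute": attr, "branches": {}}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        node["branches"][value] = hunt(subset, [a for a in attributes if a != attr])
    return node

data = [({"can_fly": "yes", "gives_birth": "no"},  "bird"),
        ({"can_fly": "no",  "gives_birth": "no"},  "reptile"),
        ({"can_fly": "no",  "gives_birth": "yes"}, "mammal")]
print(hunt(data, ["can_fly", "gives_birth"]))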
deciding whether a loan applicant will default on loan payments
classes: defaulted = yes vs. defaulted = no
a training set of records of previous borrowers
each record contains personal information of a borrower along with a class label indicating whether the borrower has defaulted on loan payments
The initial tree for the classification problem contains a single node with class label Defaulted = No, which means that most of the borrowers successfully repaid their loans. The tree, however, needs to be refined since the root node contains records from both classes.
The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition. Hunt's algorithm is then applied recursively to each child of the root node. From the training set, notice that all borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf node labelled Defaulted = No. For the right child, we need to continue applying the recursive step of Hunt's algorithm until all the records belong to the same class.
Hunt’s algorithm – further modification
the algorithm will work if:
every combination of attribute values is present
in the training data, and
each combination has a unique class label
too stringent for use in most practical situations!
algorithm needs to be modified so that it can handle these cases:
some of the child nodes created are empty
all records have identical attribute values except
for the class label
Hunt’s algorithm – further modification
case 1: no records associated with a child node
this can happen if none of the training records have the combination of attribute values associated with such a node
if this is the case, then the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node
Hunt’s algorithm – further modification
case 2: all records associated with Dt have identical attribute values except for the class label
it is not possible to split these records any further
if this is the case, then the node is declared a leaf node with the same class label as the majority class of training records associated with this node
Inducing decision trees
a learning algorithm for inducing decision trees
must address the following two questions:
How should the training records be split?
How should the splitting procedure stop?
1. How should the training records be split?
each recursive step of the tree–growing process must select an attribute test condition to divide the records into smaller subsets
to implement this step, the algorithm must provide:
a method for specifying the test condition for different attribute types, and
an objective measure for evaluating the goodness
of each test condition
2. How should the splitting procedure stop?
a stopping condition is needed to terminate the
tree–growing process
a possible strategy is to continue expanding a node until:
all records belong to the same class, or
all records have identical attribute values
both conditions are sufficient to stop any decision tree induction algorithm
… but other criteria can be imposed to allow the tree–growing procedure to terminate early
the advantages of early termination will be discussed later
Attribute test conditions
decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types:
nominal (categorical)
continuous
Binary attributes
e.g. body temperature:
warm–blooded vs. cold–blooded
the test condition for a binary attribute generates
two potential outcomes
Nominal (categorical) attributes
e.g. marital status: single, married or divorced
test condition can be expressed in two ways:
multi–way split: the number of outcomes depends on the number of distinct values for the attribute
binary split: considers all 2^(k−1) − 1 ways of creating a binary partition of k attribute values
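For example, a 3-valued attribute such as marital status has 2^(3−1) − 1 = 3 possible binary groupings; a small sketch that enumerates them:

from itertools import combinations

values = ["single", "married", "divorced"]
seen = set()
for r in range(1, len(values)):
    for group in combinations(values, r):
        rest = tuple(v for v in values if v not in group)
        # each partition {group | rest} is counted only once
        key = frozenset([group, rest])
        if key not in seen:
            seen.add(key)
            print(set(group), "vs", set(rest))
print(len(seen), "binary splits")  # 3, i.e. 2**(3-1) - 1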
Ordinal attributes
e.g. shirt size: small < medium < large < extra large
ordinal attributes can also produce binary or multi–way splits
... but ordinal attribute values can be grouped only if the grouping does not violate the order property
Continuous attributes
e.g. annual income
the test condition can be expressed as:
a comparison test A < v (or A ≥ v) with binary outcomes, or
a range query with outcomes of the form
vi ≤ A < vi+1 for i = 1, ... , k
Continuous attributes
for the binary split, the decision tree algorithm must consider all possible split positions v and select the one that produces the best partition
for the multi–way split, the algorithm must consider all possible ranges of continuous values
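One way to search the binary split positions (an illustrative sketch with invented income data): sort the observed values, take the midpoint between each consecutive pair as a candidate v, and score each candidate by the weighted Gini impurity of the two children:

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """values: continuous attribute; labels: class labels; returns (v, score)."""
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        v = (a + b) / 2                       # candidate split position A < v
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (v, score)
    return best

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
defaulted = ["no", "no", "no", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_binary_split(incomes, defaulted))  # (97.5, 0.3)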
Selecting the best split
many measures can be used to determine the best split
these measures are defined in terms of the class distribution of the records before and after splitting
they are often based on the degree of impurity of the child nodes, e.g.
entropy(t) = − Σ_i p(i|t) log2 p(i|t)
Gini(t) = 1 − Σ_i [p(i|t)]^2
classification error(t) = 1 − max_i p(i|t)
where p(i|t) is the fraction of records belonging to class Ci at a given node t
Selecting the best split
to determine how well a test condition performs,
we need to compare
the degree of impurity of the parent node
(before splitting)
the degree of impurity of the child nodes
(after splitting)
the larger their difference, the better the test condition
Selecting the best split
the gain, Δ, is a criterion that can be used to determine the goodness of a split:
Δ = I(parent) − Σ_{j=1..k} [ N(vj) / N ] × I(vj)
where:
I is the impurity measure
N is the total number of records at the parent node
k is the number of attribute values
N(vj) is the number of records at the child node vj
the summation term is the weighted average of the children's impurity
Selecting the best split
decision tree induction algorithms often choose a test condition that maximises the gain Δ
since I(parent) is the same for all test conditions, maximising the gain is equivalent to minimising the weighted average impurity measures of the child nodes
finally, when entropy is used as the impurity measure in gain, it is known as the information gain, Δinfo
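Putting the pieces together, a sketch with entropy as the impurity measure I and the weighted-average gain Δ, i.e. the information gain, for one hypothetical split:

from math import log2

def entropy(labels):
    """Impurity I(t) = -sum_i p(i|t) log2 p(i|t)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent_labels, children_labels):
    """Gain = I(parent) - sum_j N(vj)/N * I(vj)."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

# hypothetical split of 10 records into two child nodes
parent = ["yes"] * 4 + ["no"] * 6
children = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no", "no", "no"]]
print(information_gain(parent, children))  # ~0.26 bits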
Decision tree induction
input: the training records E and the attribute set F
Overfitting
happens when the learning algorithm continues to develop hypotheses that reduce training set error
at the cost of increasing test set error
a hypothesis h overfits the training data if there is an alternative hypothesis h′ such that:
training set error(h) < training set error(h′)
test set error(h) > test set error(h′)
due to presence of noise, lack of representative instances, etc.
a good model must not only fit the training data well, but also accurately classify records it has never seen
i.e. a good model must have low training error and
low generalisation error
Tree pruning
if a decision tree is fully grown, it may lose some generalisation capability
large decision trees are prone to overfitting
tree–pruning step can be performed to reduce its size
pruning helps by trimming the branches of the initial tree in a way that improves its generalisation capability
pruning can be done by replacing a subtree with:
a new leaf node with a class label based on the majority class of the corresponding records, or
the most frequently used branch of the subtree
Naïve–Bayesian classification
Naïve–Bayesian classification
naïve Bayes classifier – a probabilistic learning method based on Bayes’ theorem
Bayes’ theorem – uses evidence e to estimate the probability of a hypothesis h
when multiple hypotheses are considered, Bayes’ theorem provides the probability of each hypothesis being true given the evidence
P(h | e) = P(e | h) × P(h) / P(e)
where:
P(h | e) is the posterior probability of h
P(h) is the prior probability of h
P(e) is the probability of observing e
P(e | h) is the probability of observing e given h
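A tiny numerical illustration (all numbers invented): estimating the probability that an e-mail is junk given that its subject contains 'viagra':

# hypothetical estimates from a labelled e-mail collection
p_junk = 0.2               # prior P(h): probability an e-mail is junk
p_word_given_junk = 0.25   # P(e|h): 'viagra' appears given the e-mail is junk
p_word = 0.06              # P(e): 'viagra' appears in any e-mail

# Bayes' theorem: posterior P(h|e) = P(e|h) * P(h) / P(e)
p_junk_given_word = p_word_given_junk * p_junk / p_word
print(p_junk_given_word)   # ~0.83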
Naïve–Bayesian classification
training step: training data used to estimate the parameters of a probability distribution
(right–hand side of the equation)
testing phase: these estimations are used to compute the posterior probability (left–hand side of the equation) given a test instance
each test instance is classified using the hypothesis with the largest posterior probability
Naïve–Bayesian classification
given a record with attribute set x = (x1, x2, …, xn), the naïve Bayes classifier assumes that the attributes are conditionally independent given the class, so it predicts the class y that maximises P(y) × P(x1 | y) × P(x2 | y) × … × P(xn | y)
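As a closing sketch (illustrative only, using three attributes from the superhero table above and add-one smoothing to avoid zero counts), the probabilities can be estimated by simple counting and a new record classified by the largest posterior:

from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """records: list of attribute dicts; labels: class labels."""
    priors = Counter(labels)                    # class counts -> prior P(y)
    counts = defaultdict(Counter)               # (class, attr) -> value counts
    for x, y in zip(records, labels):
        for attr, value in x.items():
            counts[(y, attr)][value] += 1
    return priors, counts

def predict(x, priors, counts, smoothing=1.0):
    total = sum(priors.values())
    scores = {}
    for y, n_y in priors.items():
        score = n_y / total                     # prior P(y)
        for attr, value in x.items():
            value_counts = counts[(y, attr)]
            # rough estimate of P(value | y) with add-one (Laplace) smoothing
            score *= (value_counts[value] + smoothing) / (
                n_y + smoothing * (len(value_counts) + 1))
        scores[y] = score
    return max(scores, key=scores.get)          # class with largest posterior

records = [{"mask": "yes", "cape": "yes", "smokes": "no"},   # Batman
           {"mask": "yes", "cape": "yes", "smokes": "no"},   # Robin
           {"mask": "no",  "cape": "no",  "smokes": "no"},   # Alfred
           {"mask": "no",  "cape": "no",  "smokes": "yes"},  # Penguin
           {"mask": "yes", "cape": "no",  "smokes": "no"},   # Catwoman
           {"mask": "no",  "cape": "no",  "smokes": "no"}]   # Joker
labels = ["Good", "Good", "Good", "Bad", "Bad", "Bad"]

priors, counts = train_naive_bayes(records, labels)
print(predict({"mask": "yes", "cape": "yes", "smokes": "no"}, priors, counts))
# Good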