COMP9318: Data Warehousing and Data Mining
— L7: Classification and Prediction —
n Problem definition and preliminaries
ML Map
Classification vs. Prediction
n Classification:
n predicts categorical class labels (discrete or nominal)
n classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
n Prediction (a.k.a. Regression):
n models continuous-valued functions, i.e., predicts unknown or missing values
n Typical Applications
n credit approval
n target marketing
n medical diagnosis
n treatment effectiveness analysis
Classification and Regression
• Given a new object o, map it to a feature vector x = (x1, x2, …, xd)^T
• Predict the output (class label) y
• Binary classification: y ∈ {−1, +1}; sometimes {0, 1}
• Multi-class classification: y ∈ {1, 2, …, K}
• Learn a classification function f(x; θ) that maps a feature vector x to a label y
• Regression: the output y is real-valued
Examples of Classification Problem
n Text categorization:
n Doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …"
n Topic: Politics or Sport?
n Input object: a document = a sequence of words
n Input features: x = word frequencies
n e.g., freq(democrats) = 2, freq(basketball) = 0, …
n x = [1, 2, 0, …]^T
n Class label: ‘Politics’ or ‘Sport’
Examples of Classification Problem
n Image Classification:
n Input object: an image = a matrix of RGB values
n Input features: x = color histogram
n e.g., pixel_count(red) = 1004, pixel_count(blue) = 23000, …
n x = [1004, 23000, …]^T
n Class label: ‘bird image’ or ‘non-bird image’
n Which images are birds, which are not?
How to find f()?
1. Input:
• In supervised learning, we are given a set of training examples {(xi, yi)}
• Independent and identically distributed (i.i.d.) assumption
• A critical assumption for machine learning theory
• e.g., log P(X | θ) = Σ_i log P(xi | θ)
How to find f()?
How to find f()?
2. Representation of f()
• Typically only consider a particular function family F
• consider parameterized functions f(x; θ) ∈ F
Examples:
n F: linear functions (for regression) ⇒ f = w^T x
n F: linear functions (for classification) ⇒ f = σ(w^T x)
n What about more general function families?
How to find f()?
3. Criterion for the best f()
Empirical vs. Structural Risk Minimization (ERM vs. SRM)
• Non-Bayesian approaches:
• Loss function: G({ℓ(ŷi, yi)}) = G({ℓ(f(xi; θ), yi)})   [ERM]
• Regularization: Ω(θ)   [loss + regularization ⇒ SRM]
Examples:
n Regression with L2 loss and L1 regularization
  J(θ) = Σ_{i=1}^{n} (ŷi − yi)^2 + ‖θ‖_1
n Classification with cross-entropy loss and L2 regularization
  J(θ) = −Σ_{i=1}^{n} Σ_{j=1}^{k} y_{i,j} log(ŷ_{i,j}) + ‖θ‖_2^2
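A minimal sketch (my own illustration, not from the slides) of the regularized objective above for a linear model, using NumPy; the data and the regularization weight `lambda_reg` are illustrative assumptions:

```python
import numpy as np

def regularized_sq_loss(theta, X, y, lambda_reg=0.1):
    """Structural risk: squared loss (the ERM term) plus an L1 penalty on theta."""
    y_hat = X @ theta                              # linear model f(x; theta) = theta^T x
    loss = np.sum((y_hat - y) ** 2)                # empirical risk: sum of squared errors
    penalty = lambda_reg * np.sum(np.abs(theta))   # regularizer ||theta||_1
    return loss + penalty

# toy usage with synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.0]) + 0.1 * rng.normal(size=20)
print(regularized_sq_loss(np.zeros(3), X, y))
```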
Machine Learning Terminologies
n Supervised learning has input labelled data
n #instances x #attributes matrix/table
n #attributes = #features + 1
n 1 attribute (usu. the last) is the class attribute
n Labelled data is split into 2 or 3 disjoint subsets
n Training data ⇒ build a model
n Training error = 1.0 − #correctly_classified / #training_instances
n Validation/development data ⇒ select/refine the model
n Testing data ⇒ evaluate the model
n Testing/generalization error = 1.0 − #correctly_classified / #testing_instances
n We mainly discuss binary classification here
n i.e., #labels = 2
n Overview of the whole process (not including cross-validation): see the split sketch below [figure omitted]
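A minimal sketch (my own illustration, not from the slides) of the labelled-data split described above, using scikit-learn on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# split the labelled data into 3 disjoint subsets: training / validation / testing
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)   # build the model
print("training error:  ", 1.0 - model.score(X_train, y_train))     # error on training data
print("validation error:", 1.0 - model.score(X_val, y_val))         # used to select/refine the model
print("testing error:   ", 1.0 - model.score(X_test, y_test))       # estimate of generalization error
```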
Classification—A Two-Step Process
n Model construction: describing a set of predetermined classes
n Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
n The set of tuples used for model construction is the training set
n The model is represented as classification rules, decision trees, or mathematical formulae
n Model usage: for classifying future or unknown objects
n Estimate accuracy of the model
n The known label of each test sample is compared with the classified result from the model
n Accuracy rate is the percentage of test set samples that are correctly classified by the model
n The test set is independent of the training set, otherwise over-fitting will occur
n If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
n In practice, much more than this simple view
Classification Process (1): Preprocessing & Feature Engineering

Raw Data:
EID | Name  | Title              | EMP_Date   | Track
101 | Mike  | Assistant Prof     | 2013-01-01 | 3-year contract
102 | Mary  | Assistant Prof     | 2009-04-15 | Continuing
103 | Bill  | Scientia Professor | 2014-02-03 | Continuing
110 | Jim   | Associate Prof     | 2009-03-14 | Continuing
121 | Dave  | Associate Prof     | 2009-12-02 | 1-year contract
234 | Anne  | Professor          | 2013-03-21 | Future Fellow
188 | Sarah | Student Officer    | 2008-01-17 | Continuing

Training Data (class label attribute: TENURED):
Gender | RANK           | YEARS | TENURED
M      | Assistant Prof | 3     | no
F      | Assistant Prof | 7     | yes
M      | Professor      | 2     | yes
M      | Associate Prof | 7     | yes
M      | Associate Prof | 6     | no
F      | Professor      | 3     | no
Classification Process (2): Training

Training Data (class label attribute: TENURED) + Classification Algorithm (parameters θ) ⇒ training ⇒ Classifier (Model)

Gender | RANK           | YEARS | TENURED | Predicted
M      | Assistant Prof | 3     | no      | no
F      | Assistant Prof | 7     | yes     | yes
M      | Professor      | 2     | yes     | yes
M      | Associate Prof | 7     | yes     | yes
M      | Associate Prof | 6     | no      | yes
F      | Professor      | 3     | no      | yes

Learned model:
IF rank = ‘professor’ OR years ≥ 6
THEN tenured = ‘yes’

training error = 33.3%
Classification Process (3): Evaluate the Model on Testing Data

Classifier (Model):
IF rank = ‘professor’ OR years ≥ 6
THEN tenured = ‘yes’

Testing Data:
(M, Jeff, Professor, 4, yes)
(F, Merlisa, Asso Prof, 5, yes)

testing error = 50%
Classification Process (4): Use the Model in Production

Classifier (Model):
IF rank = ‘professor’ OR years ≥ 6
THEN tenured = ‘yes’

Unseen Data:
(Pedro, Assi Prof, 8, ?)
How to judge a model?
n Based on training error or testing error?
n Testing error
n Otherwise, this is a kind of data snooping ⇒ overfitting
n What if there are multiple models to choose from?
n Further split a “development set” from the training set
n Can we trust the error values on the development set? 🚫
n Need a “large” dev set ⇒ less data for training
n k-fold cross-validation (see the sketch below)
n k = n: leave-one-out
n [Figure: 10-fold CV — all the labelled data is split into 10 folds; each fold in turn serves as the test set]
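A minimal sketch (my own illustration, not from the slides) of k-fold cross-validation with scikit-learn on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10-fold CV: each fold serves once as the held-out set, the other 9 folds are used for training
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=10)
print("mean CV accuracy:", scores.mean())
```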
Exercise: Problem definition and Feature Engineering
n How to formulate the following into ML problems? What kind of resources do you need? What are the features you think may be most relevant?
1. Predict the sale trend of a particular product in the next month
2. Design an algorithm to produce better top-10 ranking results for queries
Supervised vs. Unsupervised Learning
n Supervised learning (classification)
n Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
n New data is classified based on the training set
n Unsupervised learning (clustering)
n The class labels of training data are unknown
n Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
n Decision Tree classifier
n ID3
n Other variants
Training Dataset

age     | income | student | credit_rating | buys_computer
<=30    | high   | no      | fair          | no
<=30    | high   | no      | excellent     | no
31...40 | high   | no      | fair          | yes
>40     | medium | no      | fair          | yes
>40     | low    | yes     | fair          | yes
>40     | low    | yes     | excellent     | no
31...40 | low    | yes     | excellent     | yes
<=30    | medium | no      | fair          | no
<=30    | low    | yes     | fair          | yes
>40     | medium | yes     | fair          | yes
<=30    | medium | yes     | excellent     | yes
31...40 | medium | no      | excellent     | yes
31...40 | high   | yes     | fair          | yes
>40     | medium | no      | excellent     | no

This follows an example from Quinlan’s ID3
Output: A Decision Tree for “buys_computer”

age?
  <=30    → student?
              no  → buys_computer = no
              yes → buys_computer = yes
  31...40 → buys_computer = yes
  >40     → credit_rating?
              excellent → buys_computer = no
              fair      → buys_computer = yes

[Figure: the tree above, drawn with the training tuples partitioned by age (<=30, 31...40, >40) attached to the corresponding branches]
Extracting Classification Rules from Trees
n Represent the knowledge in the form of IF-THEN rules
n Rules are easier for humans to understand
n One rule is created for each path from the root to a leaf
n Each attribute-value pair along a path forms a conjunction
n The leaf node holds the class prediction
n Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
.........
[Figure: the decision tree from the previous slide]
Exercise: Write down the pseudo-code of the induction algorithm
Algorithm for Decision Tree Induction
n Basic algorithm (a greedy algorithm)
n Input: Attributes are categorical (if continuous-valued, they are discretized in advance)
n Overview: Tree is constructed in a top-down recursive divide-and-conquer manner
n At start, all the training examples are at the root
n Samples are partitioned recursively based on selected test-attributes
n Test-attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
n Three conditions for stopping partitioning (i.e., boundary conditions)
n There are no samples left, OR
n All samples for a given node belong to the same class, OR
n There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
Decision Tree Induction Algorithm
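The original slide shows the induction algorithm as a figure. Below is a minimal Python sketch (my own illustration) of the generic top-down procedure described above; `best_attribute` is a placeholder for any selection measure, e.g., information gain:

```python
from collections import Counter

def majority(labels):
    """Majority class label, used for leaves with no remaining attributes."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes, best_attribute):
    """rows: list of dicts {attribute: value}; labels: class labels; attributes: candidate attrs."""
    if not rows:                          # no samples left
        return None
    if len(set(labels)) == 1:             # all samples belong to the same class
        return labels[0]
    if not attributes:                    # no remaining attributes: majority voting
        return majority(labels)
    a = best_attribute(rows, labels, attributes)    # heuristic selection, e.g. information gain
    tree = {a: {}}
    for v in set(r[a] for r in rows):     # partition the samples by the values of a
        subset = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        sub_rows = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        remaining = [x for x in attributes if x != a]
        child = build_tree(sub_rows, sub_labels, remaining, best_attribute)
        tree[a][v] = child if child is not None else majority(labels)
    return tree
```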
Attribute Selection Measure:
Information Gain (ID3/C4.5)
n Select the attribute with the highest information gain
n S contains si tuples of class Ci for i = {1, ..., m}
n information measures the info required to classify any arbitrary tuple:
  I(s1, s2, ..., sm) = − Σ_{i=1}^{m} (si / s) log2(si / s)
n entropy of attribute A with values {a1, a2, ..., av}:
  E(A) = Σ_{j=1}^{v} ((s1j + ... + smj) / s) · I(s1j, ..., smj)
n information gained by branching on attribute A:
  Gain(A) = I(s1, s2, ..., sm) − E(A)
Attribute Selection by Information Gain Computation
n Class P: buys_computer = “yes”
n Class N: buys_computer = “no”
n I(p, n) = I(9, 5) = 0.940
n Compute the entropy for age:

  age     | pi | ni | I(pi, ni)
  <=30    | 2  | 3  | 0.971
  31...40 | 4  | 0  | 0
  >40     | 3  | 2  | 0.971

  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  (5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
  Gain(age) = I(p, n) − E(age) = 0.246

Similarly,
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

For income:
  income | pi | ni | I(pi, ni)
  high   | 2  | 2  | 1
  medium | 4  | 2  | 0.918
  low    | 3  | 1  | 0.811

Q: what’s the extreme/worst case?
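A minimal sketch (my own illustration) reproducing the numbers above in Python:

```python
import math

def info(*counts):
    """I(s1, ..., sm): information (entropy) for a class distribution given by counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(partitions, p, n):
    """Information gain of an attribute whose partitions are (yes, no) count pairs."""
    total = p + n
    e = sum((pi + ni) / total * info(pi, ni) for pi, ni in partitions)
    return info(p, n) - e

print(round(info(9, 5), 3))                            # I(9,5) ≈ 0.940
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 3))  # Gain(age) ≈ 0.247 (0.246 on the slide via rounded intermediates)
print(round(gain([(2, 2), (4, 2), (3, 1)], 9, 5), 3))  # Gain(income) ≈ 0.029
```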
Other Attribute Selection Measures and Splitting Choices
n Gini index (CART, IBM IntelligentMiner)
n All attributes are assumed continuous-valued
n Assume there exist several possible split values for each attribute
n May need other tools, such as clustering, to get the possible split values
n Can be modified for categorical attributes
n Induces binary splits ⇒ binary decision trees
Gini Index (IBM IntelligentMiner)
n If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 − Σ_{j=1}^{n} pj^2
  where pj is the relative frequency of class j in T.
n If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
n The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
Case I: Numerical Attributes

Data (sorted on Age):
  Age | Car | Class
  20  | …   | Y
  20  | …   | N
  20  | …   | N
  25  | …   | N
  25  | …   | Y
  30  | …   | Y
  30  | …   | Y
  30  | …   | Y
  40  | …   | Y
  40  | …   | Y

Candidate split values: 22.5, 27.5, …

Cut = 22.5:
       | Y | N
  <    | 1 | 2
  >=   | 6 | 1
  Gini_split(S) = 3/10 * Gini(1,2) + 7/10 * Gini(6,1) = 0.30

Cut = 27.5:
       | Y | N
  <    | 2 | 3
  >=   | 5 | 0
  Gini_split(S) = 5/10 * Gini(2,3) + 5/10 * Gini(5,0) = 0.24

Gini(S) = 1 − Σ_j pj^2
Gini_split(S) = (n1/n) Gini(S1) + (n2/n) Gini(S2)

Exercise: compute gini indexes for other splits
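A minimal sketch (my own illustration) of the Gini computation for the two numeric cuts above:

```python
def gini(*counts):
    """Gini(S) = 1 - sum_j p_j^2 for a node with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Size-weighted Gini index of a split; partitions is a list of class-count tuples."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(*p) for p in partitions)

print(round(gini_split([(1, 2), (6, 1)]), 2))   # cut = 22.5 -> 0.3
print(round(gini_split([(2, 3), (5, 0)]), 2))   # cut = 27.5 -> 0.24
```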
Case II: Categorical Attributes

Attribute list for Car:
  Age | Car | Class
  20  | M   | Y
  30  | M   | Y
  25  | T   | N
  30  | S   | Y
  40  | S   | Y
  20  | T   | N
  30  | M   | Y
  25  | M   | Y
  40  | M   | Y
  20  | S   | N

Count matrix:
  Car | Class=Y | Class=N
  M   | 5       | 0
  T   | 0       | 2
  S   | 2       | 1

Need to consider all possible splits!

Split {M, T} vs {S}:  {M,T}: Y=5, N=2;  {S}: Y=2, N=1
  Gini_split(S) = 7/10 * Gini(5,2) + 3/10 * Gini(2,1) = 0.42
Split {M, S} vs {T}:  {M,S}: Y=7, N=1;  {T}: Y=0, N=2
  Gini_split(S) = 8/10 * Gini(7,1) + 2/10 * Gini(0,2) = 0.18
Split {T, S} vs {M}:  {T,S}: Y=2, N=3;  {M}: Y=5, N=0
  Gini_split(S) = 5/10 * Gini(2,3) + 5/10 * Gini(5,0) = 0.24
ID3

[Figure: the ID3 result on the training dataset — the same decision tree shown earlier (root: age?; the <=30 branch tests student?, the 31...40 branch predicts yes, the >40 branch tests credit_rating?), drawn with the training tuples partitioned by age attached to the branches]

CART/SPRINT
Illustrative of the shape only

[Figure: a binary tree of the same general shape — age? splits into <=30 and >30, followed by student? and credit rating? (excellent/fair) tests with yes/no leaves]
Avoid Overfitting in Classification
n Overfitting: An induced tree may overfit the training data
n Too many branches, some may reflect anomalies due to noise or outliers
n Poor accuracy for unseen samples
n Two approaches to avoid overfitting
n Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
n Difficult to choose an appropriate threshold
n Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
n Use a set of data different from the training data to decide which is the “best pruned tree”
Overfitting Example /1

[Figure: an overfitted decision tree over the training data below, testing Body temperature (warm/cold), Hibernates, and Four-legged; the slide asks how it would classify a human or a dog (“Human? Dog?”)]

• Lack of representative samples
• Existence of noise

Training data:
  Name       | Body Temperature | Gives Birth | Four-legged | Hibernates | Is Mammal?
  Salamander | Cold-blooded     | No          | Yes         | Yes        | No
  Guppy      | Cold-blooded     | Yes         | No          | No         | No
  Eagle      | Warm-blooded     | No          | No          | No         | No
  Poorwill   | Warm-blooded     | No          | No          | Yes        | No
  Platypus   | Warm-blooded     | No          | Yes         | Yes        | Yes

Training error = 0%, but does not generalize!
Overfitting Example /2

[Figure: a simpler decision tree over the same training data, testing only Body temperature (warm/cold) and Gives Birth; again the slide asks how it would classify a human or a dog (“Human? Dog?”)]

• Lack of representative samples
• Existence of noise

(Training data: the same five examples as above.)

Training error = 20%, but generalizes!
Overfitting
n Overfitting: model too complex ⇒ training error keeps decreasing, but testing error increases
n Underfitting: model too simple ⇒ both training and testing errors are large

Overfitting
[figure]

Overfitting examples in Regression & Classification
[figure]
DT Pruning Methods
n Use a separate validation set
n Estimation of generalization/test errors
n Use all the data for training
n but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
n Use minimum description length (MDL) principle
n halting growth of the tree when the encoding is minimized
Pessimistic Post-pruning
n Observed on the training data
n e(t): #errors on a leaf node t of the tree T
n e(T) = Σ_{t in T} e(t)
n What are the generalization errors (i.e., errors on testing data) on T?
n Use pessimistic estimates
n e’(t) = e(t) + 0.5
n e’(T) = e(T) + 0.5N, where N is the number of leaf nodes in T
n What is the generalization error on root(T) only (i.e., if T is collapsed into a single leaf)?
n e’(root(T)) = e(root(T)) + 0.5
n Post-pruning from bottom-up
n If the generalization-error estimate reduces after pruning, replace the sub-tree by a leaf node
n Use majority voting to decide the class label
n (Better estimate of generalization error in C4.5: use the upper 75% confidence bound on the training error of a node, assuming a binomial distribution)
Example

Before splitting on A (the node kept as a single leaf):
  Class = Yes: 20, Class = No: 10
  Training error = 10/30
  Pessimistic error = (10 + 0.5)/30

After splitting on A into A1, A2, A3, A4:
  A1: Yes = 8, No = 4
  A2: Yes = 3, No = 4
  A3: Yes = 4, No = 1
  A4: Yes = 5, No = 1
  Training error = 9/30
  Pessimistic error = (9 + 4*0.5)/30 = 11/30

Since 11/30 > 10.5/30 ⇒ Prune the subtree at A
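A minimal sketch (my own illustration) of the pessimistic estimate used above:

```python
def pessimistic_error(errors, num_leaves, n):
    """(e(T) + 0.5 * #leaves) / n : the pessimistic training-error estimate."""
    return (errors + 0.5 * num_leaves) / n

before = pessimistic_error(errors=10, num_leaves=1, n=30)   # node kept as a single leaf
after = pessimistic_error(errors=9, num_leaves=4, n=30)     # node split on A into 4 leaves
print(before, after, "prune" if after >= before else "keep")  # 0.35 0.366... prune
```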
Classification in Large Databases
n Classification—a classical problem extensively studied by statisticians and machine learning researchers
n Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
n Why decision tree induction in data mining?
n relatively faster learning speed (than other classification
methods)
n convertible to simple and easy to understand classification rules
n can use SQL queries for accessing databases
n comparable classification accuracy with other methods
Enhancements to basic decision tree induction
n Allow for continuous-valued attributes
n Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals, one-hot encoding, or specialized DT learning algorithms
n Handle missing attribute values
n Assign the most common value of the attribute
n Assign a probability to each of the possible values
n Attribute construction
n Create new attributes based on existing ones that are sparsely represented
n This reduces fragmentation, repetition, and replication
n Bayesian Classifiers
Bayesian Classification: Why?
n Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems
n Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
n Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
n Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics
n Let X be a data sample whose class label is unknown
n Let h be a hypothesis that X belongs to class C
n For classification problems, determine P(h|X): the probability that the hypothesis holds given the observed data sample X
n P(h): prior probability of hypothesis h (i.e. the initial probability before we observe any data, reflects the background knowledge)
n P(X): probability that sample data is observed
n P(X|h) : probability of observing the sample X, given that
the hypothesis holds
Bayesian Theorem
n Given training data X, the posterior probability of a hypothesis h, P(h|X), follows the Bayes theorem:
  P(h|X) = P(X|h) P(h) / P(X)
n Informally, this can be written as
  posterior = likelihood x prior / evidence
n MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|X) = argmax_h P(X|h) P(h)
n Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
Training dataset

Hypotheses:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’
Data sample:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     | income | student | credit_rating | buys_computer
<=30    | high   | no      | fair          | no
<=30    | high   | no      | excellent     | no
31...40 | high   | no      | fair          | yes
>40     | medium | no      | fair          | yes
>40     | low    | yes     | fair          | yes
>40     | low    | yes     | excellent     | no
31...40 | low    | yes     | excellent     | yes
<=30    | medium | no      | fair          | no
<=30    | low    | yes     | fair          | yes
>40     | medium | yes     | fair          | yes
<=30    | medium | yes     | excellent     | yes
31...40 | medium | no      | excellent     | yes
31...40 | high   | yes     | fair          | yes
>40     | medium | no      | excellent     | no
Sparsity Problem
n Maximum likelihood estimate of P(h)
n Let p be the probability that the class is C1
n Consider a training example x and its label y
n Data likelihood: L(x) = p^y (1−p)^(1−y), L(X) = Π_{x in X} L(x)
n Log data likelihood: ℓ(X) = log(L(X)) = Σ_{x in X} [ y log(p) + (1−y) log(1−p) ]
n To maximize ℓ(X), let dℓ(X)/dp = 0 ⇒ p = (Σ y)/n
n P(C1) = 9/14, P(C2) = 5/14
n ML estimate of P(X|h) = ?
n Requires O(2^d) training examples, where d is the #features (curse of dimensionality)
Naïve Bayes Classifier
n Use a model
n Assumption: attributes are conditionally independent:
  P(X | Ci) = Π_{k=1}^{n} P(xk | Ci)
n The probability of observing, say, 2 elements x1 and x2 together, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P([x1, x2] | C) = P(x1 | C) * P(x2 | C)
n No dependence relation between attributes given the class
n Greatly reduces the computation cost; only count the class distribution ⇒ only need to estimate P(xk | Ci)
Naïve Bayesian Classifier: Example
n Compute P(X|Ci) for each class
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(age = “<30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

Likelihood P(X|Ci):
  P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
  P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
  P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007

X belongs to class “buys_computer = yes”
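A minimal sketch (my own illustration) reproducing the example above with plain Python:

```python
# conditional probabilities estimated from the 14 training tuples above
p_yes, p_no = 9 / 14, 5 / 14
cond_yes = {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9}
cond_no  = {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5}

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

def score(cond, prior):
    s = prior
    for feature in x:        # naive Bayes: multiply the per-attribute likelihoods
        s *= cond[feature]
    return s

print(score(cond_yes, p_yes))   # ~0.028
print(score(cond_no, p_no))     # ~0.007  -> predict buys_computer = yes
```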
The Need for Smoothing
n Pr[Xi = vj | Ck] could still be 0, if not observed in the training data
n this makes Pr[Ck | X] = 0, regardless of the other likelihood values Pr[Xt = vt | Ck]
n Add-1 Smoothing
n reserve a small amount of probability for unseen events
n (conditional) probabilities of observed events have to be adjusted so that the total probability equals 1.0
Add-1 Smoothing
n Unsmoothed: Pr[Xi = vj | Ck] = Count(Xi = vj, Ck) / Count(Ck)
n Smoothed: Pr[Xi = vj | Ck] = [Count(Xi = vj, Ck) + 1] / [Count(Ck) + B]
n What’s the right value for B?
n make Σ_{vj} Pr[Xi = vj | Ck] = 1
n Explanation of the above constraint:
n Pr[Xi | Ck]: Given Ck, the (conditional) probability of Xi taking a specific value
n Xi must take one of its values, hence summing over all possible values of Xi, the probabilities should sum up to 1.0
n B = |dom(Xi)|, i.e., the # of values Xi can take
n Many other kinds of smoothing exist (esp. in natural language modelling)
Smoothing Example

Class: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Note: in this training set there is no instance of age <= 30 in the “No” class.

[Table: the 14 training tuples, with B (the number of values each attribute can take) shown per attribute: age (B = 3), income (B = 3), student (B = 2), credit_rating (B = 2)]
Consider the “No” Class
n Compute P(X|Ci) for the “No” class
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(age = “<30” | buys_computer = “no”) = (0 + 1) / (5 + 3) = 0.125
P(income = “medium” | buys_computer = “no”) = (2 + 1) / (5 + 3) = 0.375
P(student = “yes” | buys_computer = “no”) = (1 + 1) / (5 + 2) = 0.286
P(credit_rating = “fair” | buys_computer = “no”) = (2 + 1) / (5 + 2) = 0.429

P(X|Ci): P(X | buys_computer = “no”) = 0.125 x 0.375 x 0.286 x 0.429 = 0.00575
P(X|Ci) * P(Ci): P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.00205

Probabilities     | Pr[<=30 | No] | Pr[30..40 | No] | Pr[>=40 | No]
Without Smoothing | 0 / 5         | 3 / 5           | 2 / 5
With Smoothing    | 1 / 8         | 4 / 8           | 3 / 8        (sum up to 1)
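A minimal sketch (my own illustration) of the add-1 estimates above:

```python
def add_one(count, class_count, num_values):
    """Add-1 (Laplace) smoothed estimate of Pr[Xi = v | Ck]."""
    return (count + 1) / (class_count + num_values)

# age has 3 values; within the "No" class (5 tuples) the counts are 0, 3, 2
smoothed = [add_one(c, 5, 3) for c in (0, 3, 2)]
print(smoothed, sum(smoothed))   # [0.125, 0.5, 0.375], sums to 1.0
```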
How to Handle Numeric Values
n Need to model the distribution of Pr[Xi | Ck]
n Method 1: assume it to be Gaussian ⇒ Gaussian Naïve Bayes
  Pr[Xi = vj | Ck] = (1 / sqrt(2π σ_ik^2)) exp( −(vj − μ_ik)^2 / (2 σ_ik^2) )
n Method 2: use binning to discretize the feature values
Text Classification
n NB has been widely used in text classification (aka., text categorization)
n Outline:
n Applications
n Language Model
n Two Classification Methods
Based on “Chap 13: Text classification & Naive Bayes” in Introduction to Information Retrieval
• http://nlp.stanford.edu/IR-book/
Document Classification

Test document: “planning language proof intelligence”

Classes (with sample training documents):
  (AI): ML — “learning intelligence algorithm reinforcement network…”; Planning — “planning temporal reasoning plan language…”
  (Programming): Semantics — “programming semantics language proof…”; Garb.Coll. — “garbage collection memory optimization region…”
  (HCI): Multimedia; GUI

(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)  [IIR 13.1]
More Text Classification Examples: Many search engine functionalities use classification

Assign labels to each document or web-page:
n Labels are most often topics such as Yahoo-categories
  e.g., “finance,” “sports,” “news>world>asia>business”
n Labels may be genres
  e.g., “editorials”, “movie-reviews”, “news”
n Labels may be opinions on a person/product
  e.g., “like”, “hate”, “neutral”
n Labels may be domain-specific
  e.g., “interesting-to-me” : “not-interesting-to-me”
  e.g., “contains adult language” : “doesn’t”
  e.g., language identification: English, French, Chinese, …
  e.g., search vertical: about Linux versus not
  e.g., “link spam” : “not link spam”
Challenge in Applying NB to Text
n Need to model P(text | class)
n e.g., P(“a b c d” | class)
n Extremely sparse ⇒ need a model to help
n 1st method:
n Based on a statistical language model
n Turns out to be the bag-of-words model if using unigrams
n ⇒ multinomial NB (sklearn.naive_bayes.MultinomialNB)
n 2nd method:
n View a text as a set of tokens ⇒ a Boolean vector in {0, 1}^|V|, where V is the vocabulary
n ⇒ Bernoulli NB (sklearn.naive_bayes.BernoulliNB)
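A minimal sketch (my own illustration) of the two sklearn classifiers named above on a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["planning temporal reasoning plan language",
        "learning intelligence algorithm reinforcement network",
        "programming semantics language proof",
        "garbage collection memory optimization region"]
labels = ["AI", "AI", "Programming", "Programming"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                                   # bag-of-words count vectors
X_test = vec.transform(["planning language proof intelligence"])

# multinomial NB works on word counts; Bernoulli NB binarizes them internally
print(MultinomialNB().fit(X, labels).predict(X_test))
print(BernoulliNB().fit(X, labels).predict(X_test))
```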
Unigram and higher-order Language models in Information Retrieval
n P(“a b c d”) = P(a) * P(b | a) * P(c | a b) * P(d | a b c)
n Unigram model: 0-th order Markov Model
n P(d | a b c) = P(d)
n Bigram Language Models: 1st order Markov Model
n P(d | a b c) = P(d | c)
n P(a b c d) = P(a) * P(b | a) * P(c | b) * P(d | c)
n The same with class-conditional probabilities, i.e., P(“a b c d” | C)
Easy. Effective!
Two Models /1
n Model 1: Multinomial = Class conditional unigram
n One feature Xi for each word position in the document
n the feature’s values are all words in the dictionary
n the value of Xi is the word in position i
n Naïve Bayes assumption:
n Given the document’s topic, the word in one position in the document tells us nothing about words in other positions
n Second assumption:
n Word appearance does not depend on position: P(Xi = w | c) = P(Xj = w | c) for all positions i, j, word w, and class c
n Just have one multinomial feature predicting all words
Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

For an example text of 4 tokens w1 w2 w3 w4 (e.g., drawn from “to be or not to be”):
  P(Cj | w1 w2 w3 w4) ∝ P(Cj) P(w1 w2 w3 w4 | Cj)
                      ∝ P(Cj) Π_{i=1}^{4} P(wi | Cj)              (unigram model)
                      ∝ P(Cj) Π_{i=1}^{|V|} P(xi | Cj)^{θi}       (bag of words; multinomial)

n Essentially, the classification is independent of the positions of the words
n Use the same parameters for each position
n Result is the bag-of-words model, i.e., text ≡ {xi : θi}_{i=1}^{|V|}
Naïve Bayes: Learning
n From the training corpus, extract V = Vocabulary
n Calculate the required P(cj) and P(xk | cj) terms
n For each cj in C do
n docs_j ← subset of documents for which the target class is cj
n P(cj) ← |docs_j| / |total # documents|
n Text_j ← single document containing all docs_j
n for each word xk in Vocabulary
n n_k ← number of occurrences of xk in Text_j
n P(xk | cj) ← (n_k + α) / (n + α |Vocabulary|), where n is the total number of word occurrences in Text_j
Naïve Bayes: Classifying
n positions ← all word positions in the current document which contain tokens found in Vocabulary
n Return c_NB, where
  c_NB = argmax_{cj ∈ C} P(cj) Π_{i ∈ positions} P(xi | cj)
Naive Bayes: Time Complexity
n Training Time: O(|D| Ld + |C||V|)
where Ld is the average length of a document in D.
n Assumes V and all Di , ni, and nij pre-computed in O(|D|Ld)
time during one pass through all of the data.
n Generally just O(|D|Ld) since usually |C||V| < |D|Ld
Why?
n Test Time: O(|C| Lt)
where Lt is the average length of a test document.
n Very efficient overall, linearly proportional to the time needed to just read in all the data.
Bernoulli Model
n V = {a, b, c, d, e} = {x1, x2, x3, x4, x5}
n Feature functions: fi(text) = 1 if text contains xi, and 0 otherwise
n The feature functions extract a vector in {0, 1}^|V| from any text
n Apply NB directly

  text      | a | b | c | d | e | C
  “a b c d” | 1 | 1 | 1 | 1 | 0 | +
  “d e b”   | 0 | 1 | 0 | 1 | 1 | −
Two Models /2
n Model 2: Multivariate Bernoulli
n One feature Xw for each word in the dictionary
n Xw = true in document d if w appears in d
n Naive Bayes assumption:
n Given the document’s topic, the appearance of one word in the document tells us nothing about the chances that another word appears
n This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data (Maron in IR was a very early user of NB)
Parameter estimation
n Multivariate Bernoulli model:
  P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears
n Multinomial model:
  P̂(Xi = w | cj) = fraction of times in which word w appears across all documents of topic cj
n Can create a mega-document for topic j by concatenating all documents on this topic
n Use the frequency of w in the mega-document
Classification
n Multinomial vs Multivariate Bernoulli?
n Multinomial model is almost always more effective in text applications!
n See results figures later
n See IIR sections 13.2 and 13.3 for worked examples with each model
NB Model Comparison: WebKB
Underflow Prevention: log space
n Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
n Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
n Class with highest final un-normalized log probability score is still the most probable.
  c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
n Note that model is now just max of sum of weights...
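A minimal sketch (my own illustration) of scoring in log space:

```python
import math

def log_score(prior, cond_probs):
    """Sum of log prior and log likelihoods, avoiding floating-point underflow."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# reusing the buys_computer example: the ordering of the classes is unchanged
print(log_score(9/14, [2/9, 4/9, 6/9, 6/9]))   # class "yes"
print(log_score(5/14, [3/5, 2/5, 1/5, 2/5]))   # class "no" (smaller score, so "yes" wins)
```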
Violation of NB Assumptions
n Conditional independence
n “Positional independence”
n Examples?
Observations
n Two identical sensors at the same location
n Classes: Raining (r) vs. Sunny (s)
n P(+,+,r) = 3/8, P(−,−,r) = 1/8, P(+,+,s) = 1/8, P(−,−,s) = 3/8
n Equivalent training dataset:

  S1 | S2 | Class
  +  | +  | r
  +  | +  | r
  +  | +  | r
  −  | −  | r
  +  | +  | s
  −  | −  | s
  −  | −  | s
  −  | −  | s

n Note: P(s1|C) = P(s2|C) no matter what
n Fill in: P(si = + | r) = ?, P(si = − | r) = ?, P(si = + | s) = ?, P(si = − | s) = ?, P(r) = ?, P(s) = ?
Prediction

Estimated parameters:
  P(si = + | r) = 3/4, P(si = − | r) = 1/4
  P(si = + | s) = 1/4, P(si = − | s) = 3/4
  P(r) = 1/2, P(s) = 1/2

Test instance: S1 = +, S2 = +, Class = ???
n P(r | ++) ∝ 3/4 * 3/4 * 1/2 = 9/32
n P(s | ++) ∝ 1/4 * 1/4 * 1/2 = 1/32
n Normalizing: P(r | ++) = 9/10, P(s | ++) = 1/10

Problem: Posterior probability estimation is not accurate
Reason: Correlation between features
Fix: Use a logistic regression classifier
Naïve Bayes Posterior Probabilities
n Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
n However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
n Output probabilities are commonly very close to 0 or 1.
n Correct estimation ⇒ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (we just need the right ordering of probabilities)
Naïve Bayesian Classifier: Comments
n Advantages:
n Easy to implement
n Good results obtained in most of the cases
n Disadvantages
n Assumption: class conditional independence, therefore loss of accuracy
n Practically, dependencies exist among variables
n E.g., hospital patients: Profile: age, family history, etc.; Symptoms: fever, cough, etc.; Disease: lung cancer, diabetes, etc.
n Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
n Better methods?
n Bayesian Belief Networks
n Logistic regression / maxent
n (Linear Regression and) Logistic Regression Classifier
n See LR slides
n Other Classification Methods
Instance-Based Methods
n Instance-based learning:
n Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
n Typical approaches
n k-nearest neighbor approach
n Instances represented as points in a Euclidean space.
The k-Nearest Neighbor Algorithm
n All instances correspond to points in the n-D space.
n The nearest neighbors are defined in terms of Euclidean distance.
n The target function could be discrete- or real-valued.
n For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
n Voronoi diagram: the decision boundary induced by 1-NN for a typical set of training examples.
[Figure: positive (+) and negative (−) training points around a query point xq, and the cells induced by 1-NN]
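A minimal sketch (my own illustration) of k-NN classification with scikit-learn on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# lazy learner: fit() just stores the training points; the work happens at predict() time
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```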
Bayesian Perspective
n An informal way to view the kNN classifier as a Bayesian model
n Set up:
n kNN ball around a test instance x: Bk(x), with volume V
n Class i has Ni training instances overall, and ki instances within Bk(x)
n Then
  P(Ci | x) = P(x | Ci) P(Ci) / P(x)
  P(Ci) = Ni / N
  P(x) · V ≈ k / N            (observed probability)
  P(x | Ci) · V ≈ ki / Ni     (estimated probability, assuming uniform density within Bk(x))
n Hence P(Ci | x) ≈ (ki/Ni) (Ni/N) / (k/N) = ki / k
[Figure: the kNN ball around the query point xq]
Effect of k
• k acts as a smoother
Discussion on the k-NN Algorithm
n The k-NN algorithm for continuous-valued target functions
n Calculate the mean value of the k nearest neighbors
n Distance-weighted nearest neighbor algorithm
n Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors, e.g., w ≡ 1 / d(xq, xi)^2
n Similarly, for real-valued target functions
n Robust to noisy data by averaging the k nearest neighbors
n Curse of dimensionality:
n kNN search becomes very expensive in high-dimensional space
n High-dimensional indexing methods, e.g., LSH
n distance between neighbors could be dominated by irrelevant attributes
n To overcome it, stretch the axes or eliminate the least relevant attributes
Remarks on Lazy vs. Eager Learning
n Instance-based learning: lazy evaluation
n Decision-tree and Bayesian classification: eager evaluation
n Key differences
n A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
n An eager method cannot, since it has already chosen its global approximation before seeing the query
n Efficiency: lazy methods spend less time training but more time predicting
n Accuracy
n Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
n Eager: must commit to a single hypothesis that covers the entire instance space