17-classifiers.pptx
Ling 131A
Introduction to NLP with Python
Classifiers
Marc Verhagen, Fall 2018
Classification
• The task of choosing the correct class label for a given input
– Is this spam?
– Is this a positive or a negative review?
– To what synset does this word belong?
– Is this a named entity?
– What POS is this word?
– What kind of named entity is this?
Classification
• A classifier can be rule-based
– Date → Month DayOfMonth Year
• But most classifiers in use are machine learning classifiers
• The ones that we look at are supervised classifiers
– That is, they use example data
NLTK Classifier Example
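The code for this slide is not preserved in the text export. A minimal sketch in the spirit of the NLTK book's names/gender classifier, which also matches the lastletter features used in the decision tree slides below, could look like this; the feature name and the 500-item test split are illustrative assumptions:

    import random
    import nltk
    from nltk.corpus import names   # may require nltk.download('names')

    # Labeled examples: (name, gender)
    labeled = ([(n, 'male') for n in names.words('male.txt')] +
               [(n, 'female') for n in names.words('female.txt')])
    random.shuffle(labeled)

    # Feature extractor: maps an input to a dictionary of features
    def gender_features(name):
        return {'lastletter': name[-1].lower()}

    featuresets = [(gender_features(n), g) for (n, g) in labeled]
    train_set, test_set = featuresets[500:], featuresets[:500]

    # Train a supervised classifier on the training portion and evaluate it
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    print(classifier.classify(gender_features('Neo')))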
ML Classifiers
Training versus testing
Cross-validation
• Evaluate a model by partitioning the original sample into a training set to train the model and a test set to evaluate it
• N-fold cross-validation
– partition into n equal-size subsamples
– a single subsample is retained as the test data, the remaining n-1 subsamples are used for training
– Repeat n times, with each subsample serving once as the test data
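A minimal sketch of n-fold cross-validation, assuming featuresets is a shuffled list of (features, label) pairs as in the NLTK example above:

    import nltk

    def cross_validate(featuresets, n=10):
        """Average accuracy over n folds; each fold serves as the test set once."""
        fold_size = len(featuresets) // n
        accuracies = []
        for i in range(n):
            test = featuresets[i * fold_size:(i + 1) * fold_size]
            train = featuresets[:i * fold_size] + featuresets[(i + 1) * fold_size:]
            classifier = nltk.NaiveBayesClassifier.train(train)
            accuracies.append(nltk.classify.accuracy(classifier, test))
        return sum(accuracies) / n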
N-fold cross-validation
[Figure: n-fold cross-validation diagram. Source: wikipedia.org]
N-fold cross-validation
• Matters less how the data are divided
• Test and training data need to be taken from the same data set
– That is, test data and training data have the same characteristics and the overall data are homogeneous
• Avoid human bias
– Splits have to be random
• Avoid dependencies between folds
Classifiers
• Decision tree
• Bayesian classifier
• Maximum entropy classifier
• Neural networks and deep learning
– Maybe towards the end of the course
Decision Trees
• A flowchart-like structure in the shape of a tree
• Internal nodes are tests
– Each test is a test for the value of a feature
– The number of outcomes determines the number of branches from that node
• Leaf nodes have the final classification
• Traditionally built manually
Building a decision tree
• Create decision stumps
– A mini tree with just one test and branches depending on how many outcomes the test has
– We could do one stump for each feature
• lastletter=vowel
• length>5
• count(e)=2
– If a decision stump implements a test for a binary feature, then it splits your training corpus into two partitions (that is, the tree stump has two daughters)
Note that the algorithm itself does not associate any semantics with these names
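For the binary case described above, a hypothetical helper that splits the training corpus into a stump's two partitions (daughters) might look like this, assuming boolean feature values such as lastletter=vowel:

    def split_on_feature(observations, feature_name):
        """Partition (features, label) pairs on the value of one binary feature."""
        true_branch = [(f, label) for (f, label) in observations if f.get(feature_name)]
        false_branch = [(f, label) for (f, label) in observations if not f.get(feature_name)]
        return true_branch, false_branch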
Building a decision tree
• Select the decision stump that has the highest capacity of predicting the final label
The lastletter=vowel feature performs better than the other feature, so we pick it and insert it into the decision tree.
Building a decision tree
• So we chose lastletter=vowel
• Now let's see if we can do something with that lower left distribution
• We now work with a domain of only five observations
Building a decision tree
• Again select the decision stump that has the highest capacity of predicting the final label
Hard to say which feature performs better, but let's say that the length>5 feature is better, so we pick it and insert it into the decision tree.
Building a decision tree
Partial tree after adding two stumps
You keep going till you can add no more rules or till adding rules makes no sense.
Danger of overfitting, that is, the tree gets to be too tailored to the training data. You could avoid this by not adding rules when your domain gets too small or by pruning the tree.
Building a decision tree
• Iterative process starting at the very top
• At any point you have a partial tree
– Select a ? node
– Decide whether there is a good stump that can be added
– Or replace it with a label
How to choose the stump?
• Use accuracy
• Use entropy and information gain
Accuracy in decision trees
• Calculate the accuracy of each partition and take the weighted average
Accuracy = 0.80 for one partition, Accuracy = 0.40 for the other
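For instance, with two equal-size partitions (five observations each, as in the running names example), the weighted average is (5/10) x 0.80 + (5/10) x 0.40 = 0.60.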
Entropy and information gain
• A more popular way of choosing a stump
• How much more organized do the values get when you partition them with a given feature?
• Entropy is a measure where a low value reflects highly organized input and a high value reflects chaos
– Or better: low probability versus high probability
• You gain information if entropy goes down:
– InfoGain(S1, S2) = Entropy(S1) – Entropy(S2)
Entropy
Zero is the lowest possible value and reflects that your output is completely homogeneous.
The highest value depends on how many possible labels there are.
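The formula itself is not in the text export; the standard definition, with base-2 logarithms as in the calculations below, is

  Entropy(S) = – sum over all labels of P(label) x log2(P(label))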
Example calculation
• The original set of observations had 5 white dots (female names) and 5 black dots (male names)
• Entropy:
= – (0.5 x log(0.5) + 0.5 x log(0.5))
= – (0.5 x -1.0 + 0.5 x -1.0)
= – (-0.5 + -0.5)
= – (-1)
= 1
More calculations
= – (0.2 x log(0.2) + 0.8 x log(0.8))
= – (0.2 x -2.322 + 0.8 x -0.322)
= – (-0.464 + -0.258)
= – (-0.722)
= 0.722
= – (0.4 x log(0.4) + 0.6 x log(0.6))
= – (0.4 x -1.322 + 0.6 x -0.737)
= – (-0.529 + -0.442)
= – (-0.971)
= 0.971
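A small sketch that reproduces these calculations:

    import math

    def entropy(probabilities):
        """Entropy in bits of a distribution over labels."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0
    print(entropy([0.2, 0.8]))   # ~0.722
    print(entropy([0.4, 0.6]))   # ~0.971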
Entropy in decision trees
• Calculate entropy of original set
• Calculate the weighted average entropy of the partitions
H = 1.00 for the original set
H = 0.72, averaged over the partitions of one stump: InfoGain = 1.00 – 0.72 = 0.28
H = 0.97, averaged over the partitions of the other stump: InfoGain = 1.00 – 0.97 = 0.03
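A sketch of the stump choice by information gain, reusing the hypothetical entropy and split_on_feature helpers sketched earlier:

    def label_distribution(observations):
        """Probability of each label among (features, label) pairs."""
        labels = [label for (_, label) in observations]
        return [labels.count(l) / len(labels) for l in set(labels)]

    def information_gain(parent, partitions):
        """Entropy of the parent set minus the weighted average entropy of its partitions."""
        total = sum(len(p) for p in partitions)
        weighted = sum(len(p) / total * entropy(label_distribution(p))
                       for p in partitions)
        return entropy(label_distribution(parent)) - weighted

    # Pick the feature whose stump yields the highest information gain, e.g.:
    # best = max(features, key=lambda f: information_gain(data, split_on_feature(data, f)))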
Decision tree pros and cons
• Advantages
– Easy to interpret
– Good match for hierarchical classifications
• Disadvantages
– Training sets for lower nodes may be too small
– Features are checked in a particular order
– Not enough space at the top for "good" features
– Features that are weak predictors will be ignored
Naïve Bayes Classifier
• Every feature counts
• Algorithm
– Calculate prior probability of a label
• check frequency in data set
– For each feature
• how does the feature change the probability of each label?
• this updates the estimated likelihood of a label given the features
– Pick the label with the highest likelihood
Naïve Bayes Classifier
Finding the topic of a document
Naïve Bayes Classifier
Each feature reduces the likelihood that a label is true, but for some labels the likelihood will be reduced more
Bayes Rule
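The formula on this slide is not preserved in the export; Bayes' rule for a label given a set of features F is

  P(label|F) = P(F|label) x P(label) / P(F)

and the naïve independence assumption (see below) replaces P(F|label) with the product of the individual P(feature|label) terms.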
Weather data example
outlook temperature humidity windy play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
from: Witten & Frank, Data Mining
Weather data example
What is the chance of play=yes/no with the following conditions?
outlook temperature humidity windy play
sunny cool high true ?
Tabulate the individual probabilities:
                  yes    no
outlook=sunny     2/9    3/5
temperature=cool  3/9    1/5
humidity=high     3/9    4/5
windy=true        3/9    3/5
play              9/14   5/14
For example, P(outlook=sunny|play=no) = 3/5: it is sunny 3/5 of the times when we do not play.
And P(play=no) = 5/14: we do not play 5/14 of the times.
Weather data example
                  yes    no
outlook=sunny     2/9    3/5
temperature=cool  3/9    1/5
humidity=high     3/9    4/5
windy=true        3/9    3/5
play              9/14   5/14
Calculate P(yes|F) and P(no|F)
P(yes|F) = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / P(F) = 0.0053 / P(F)
P(no|F) = (3/5 * 1/5 * 4/5 * 3/5 * 5/14) / P(F) = 0.0206 / P(F)
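The same arithmetic as a small sketch; normalizing the two scores makes the shared P(F) drop out:

    p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
    p_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # ≈ 0.0206

    # Normalize so the two scores sum to 1, which divides out P(F)
    print(p_no / (p_yes + p_no))   # ≈ 0.80, so predict play=no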
What is so naïve about this?
• It is naïve because features are considered independent from each other
– If features are dependent, then the results will get skewed
• Still, it performs reasonably well for many problems and it is often used as a baseline
What about scalar values?
• Say we have the temperature as a feature
• Take the potentially infinite range of values and put them in bins
– Do not use a bin for each value
– That would be overfitting
• Or use the scalar values directly and calculate the average and standard deviation of the values for each label
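A sketch of the second option, under the (assumed, not stated on the slide) normal-distribution model: estimate a mean and standard deviation from the training values of the feature for each label, then score new values with the density:

    import math

    def gaussian_likelihood(x, mean, std):
        """P(feature = x | label) under a normal distribution fit to the
        training values of this feature for that label."""
        return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
                / (math.sqrt(2 * math.pi) * std))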
What about zero values?
• Example
– Fn → outlook=overcast
– C → play=no
– P(Fn|C) = Count(Fn+C) / Count(C) = 0 / 5
• Adjust the formula
– P(Fn|C) = (Count(Fn+C) + 1) / Count(C) = 1 / 5
– similar to Laplace smoothing
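A sketch of the smoothed estimate; full Laplace smoothing also adds the number of possible feature values to the denominator so the conditional probabilities still sum to 1 (the slide's simpler version only adds 1 to the numerator):

    def smoothed_cond_prob(count_feature_and_class, count_class, n_feature_values):
        """Laplace (add-one) smoothed estimate of P(Fn|C)."""
        return (count_feature_and_class + 1) / (count_class + n_feature_values)

    # outlook=overcast never occurs with play=no: Count(Fn+C)=0, Count(C)=5,
    # and outlook has 3 possible values (sunny, overcast, rainy)
    print(smoothed_cond_prob(0, 5, 3))   # 0.125 instead of 0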