LECTURE 7
Text Classification, Evaluation and Error Analysis
Arkaitz Zubiaga, 29th January, 2018
2
What is text classification?
Examples of text classification.
Supervised Text Classification.
Evaluation.
Error Analysis.
LECTURE 7: CONTENTS
3
Given as input:
A text document d
A set of categories C = {c_1, …, c_m}
The text classification task outputs:
Predicted class c* that document d belongs to.
WHAT IS TEXT CLASSIFICATION?
4
Spam detection: classifying emails/web pages as spam (or not).
EXAMPLES OF TEXT CLASSIFICATION
SPAM
NOT SPAM
5
Classification by topic: what is the text about?
EXAMPLES OF TEXT CLASSIFICATION
6
Sentiment analysis: is a text positive, negative or neutral?
I really liked the food at the restaurant.
We were 8 friends who went there for the first time.
The service was terrible.
EXAMPLES OF TEXT CLASSIFICATION
7
Language identification: what language is a text written in?
Wieviel Uhr ist es? {German, English, Spanish, French}
EXAMPLES OF TEXT CLASSIFICATION
8
Classification of political orientation:
does a text support Labour or Conservative?
EXAMPLES OF TEXT CLASSIFICATION
Labour Conservative
9
A range of different problems, with a common goal:
Assigning a category/class to each document.
We know the set of categories beforehand.
WHAT IS TEXT CLASSIFICATION?
10
Rule-based classifiers, e.g. if email contains ‘viagra’ → spam
Significant manual effort involved.
Supervised classification:
Given: a hand-labelled set of document-class pairs
(d_1, c_1), (d_2, c_2), …, (d_m, c_m) → classified into C = {c_1, …, c_j}
The classifier learns a model that can classify new
documents into C.
TEXT CLASSIFICATION: APPROACHES
SUPERVISED TEXT
CLASSIFICATION
12
Assumption: We have a manually labelled dataset, e.g.:
d_1: ‘That’s really good, I love it’ → positive
d_2: ‘It was boring, don’t recommend it’ → negative
…
d_n: ‘I wouldn’t go again, awful’ → negative
If not, we need to find one or label one ourselves.
SUPERVISED CLASSIFICATION
13
Split the dataset into train/dev/test sets.
What features are we going to use to represent the documents?
What classifier are we going to use?
Choose settings, parameters, etc. for the classifier.
SUPERVISED CLASSIF.: DECISIONS TO MAKE
14
We can split the dataset into 3 parts:
Training set → the largest set, as we want proper training.
Development set.
Test set.
Tweak the classifier based on the development set, then test it on the test set.
Tweaking and testing on the test set may lead to overfitting (doing the
right things specifically for that test set, not necessarily generalisable).
SPLITTING THE DATASET
Training set | Development set | Test set
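A minimal sketch of this split in Python with scikit-learn; the docs/labels data and the 60/20/20 proportions are illustrative assumptions, not a prescription.

from sklearn.model_selection import train_test_split

# Hypothetical data: parallel lists of documents and class labels.
docs = ["great food", "terrible service", "would go again", "boring, not recommended"] * 5
labels = ["pos", "neg", "pos", "neg"] * 5

# First split off the test set, then carve a development set out of the rest.
# 60/20/20 is just one common choice; the training set stays the largest.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42)
docs_train, docs_dev, y_train, y_dev = train_test_split(
    docs_train, y_train, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%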
15
Cross-validation: train and test on different “folds”.
e.g. 10-fold cross-validation: split the data into 10 parts.
each time, 1 fold is used for testing, the other 9 for training.
after all 10 runs, compute the average performance.
SPLITTING THE DATASET: LARGER DATASETS
16
Cross-validation: example.
SPLITTING THE DATASET: LARGER DATASETS
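Since the example figure is not reproduced here, below is a minimal 10-fold cross-validation sketch with scikit-learn; the bag-of-words + Naive Bayes pipeline and the toy data are assumptions for illustration, not the slide's own example.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy labelled data, repeated so that 10 folds are possible.
docs = ["loved it", "hated it", "really good", "really bad", "awful", "amazing"] * 10
labels = ["pos", "neg", "pos", "neg", "neg", "pos"] * 10

model = make_pipeline(CountVectorizer(), MultinomialNB())

# 10-fold cross-validation: each run trains on 9 folds and tests on the held-out fold;
# the 10 scores are then averaged.
scores = cross_val_score(model, docs, labels, cv=10, scoring="f1_macro")
print(scores.mean())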
17
Usually start with some basic features:
Bag of words.
Or preferably word embeddings.
Keep adding new features:
Need to be creative.
Think of features that could characterise the problem at hand.
CHOOSING THE FEATURES
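A minimal sketch of a bag-of-words representation using scikit-learn's CountVectorizer; the documents are made up.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really liked the food",
        "the service was terrible",
        "terrible food"]

# Bag of words: each document becomes a vector of word counts over the vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # one row of counts per document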
18
Possible features:
Sentiment analysis → counts of positive/negative words.
Language identification → probabilities of characters (how
many k’s, b’s, v’s…), features from word suffixes (e.g. many
-ing words → English).
Spam detection → count words in a blacklist, domain of URLs
in the email (looking for malicious URLs).
THINKING OF FEATURES
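A sketch of how such hand-crafted features could be computed in Python; the word lists and feature names are illustrative assumptions, not fixed resources.

# Illustrative lexicons/blacklist; a real system would use proper resources.
POSITIVE = {"good", "great", "liked", "love"}
NEGATIVE = {"bad", "terrible", "boring", "awful"}
BLACKLIST = {"viagra", "lottery", "winner"}

def extract_features(text):
    words = text.lower().split()
    return {
        "n_positive_words": sum(w in POSITIVE for w in words),    # sentiment
        "n_negative_words": sum(w in NEGATIVE for w in words),    # sentiment
        "n_blacklisted": sum(w in BLACKLIST for w in words),      # spam
        "n_ing_suffixes": sum(w.endswith("ing") for w in words),  # language id
        "freq_k": text.lower().count("k") / max(len(text), 1),    # language id
    }

print(extract_features("I really liked the food at the restaurant"))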
19
How to assess which features are good?
Empirical evaluation:
Incremental testing:
keep adding features, and see if adding them improves performance.
Leave-one-out testing:
test all features, and combinations of all features except one.
when leaving feature i out performs better than using all features, remove feature i.
Error analysis: (later in this lecture)
look at the classifier’s errors: what features can we use to improve?
CHOOSING THE FEATURES
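A minimal sketch of leave-one-out feature testing; evaluate() and its scores are stand-ins (assumptions) for training and scoring a real classifier on the development set.

ALL_FEATURES = ["bag_of_words", "pos_word_count", "neg_word_count", "suffixes"]

# Stand-in scores for illustration; in practice evaluate() would train a classifier
# with the given features and return its score on the development set.
FAKE_SCORES = {
    frozenset(ALL_FEATURES): 0.80,
    frozenset(ALL_FEATURES) - {"suffixes"}: 0.82,  # better without 'suffixes'
}

def evaluate(feature_subset):
    return FAKE_SCORES.get(frozenset(feature_subset), 0.78)

def leave_one_out(features):
    baseline = evaluate(features)
    keep = []
    for f in features:
        # Score with every feature except f; drop f if that beats using all features.
        if evaluate([g for g in features if g != f]) > baseline:
            continue
        keep.append(f)
    return keep

print(leave_one_out(ALL_FEATURES))  # drops 'suffixes' in this toy setup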
20
Many different classifiers exist; well-known classifiers include:
Naive Bayes.
Logistic Regression (Maximum Entropy classifier).
Support Vector Machines (SVM).
Classifiers can be binary (k = 2) or multiclass (k > 2).
CHOOSING A CLASSIFIER
21
Have very little data?
Naive Bayes.
Semi-supervised classification (e.g. bootstrapping):
Incorporates the classifier’s predictions into the training data.
Have a good amount of data?
SVM.
Logistic regression.
CHOOSING A CLASSIFIER
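A sketch of trying the classifiers above with scikit-learn; the toy data, features and settings are assumptions, and this only illustrates the API rather than recommending one classifier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

docs = ["loved it", "hated it", "great stuff", "awful film", "so boring", "brilliant"] * 5
labels = ["pos", "neg", "pos", "neg", "neg", "pos"] * 5
docs_train, docs_dev, y_train, y_dev = train_test_split(
    docs, labels, test_size=0.3, random_state=0, stratify=labels)

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000)),
                  ("Linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(docs_train, y_train)
    # Compare on the development set; only touch the test set at the very end.
    print(name, model.score(docs_dev, y_dev))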
22
How many categories (k)?
[k=2] Binary → binary classifier.
[k>2] Multiclass:
One-vs-all classifiers.
Build k classifiers, each able to distinguish class i from the rest. Then combine
the output of all classifiers (e.g. based on their confidence scores).
Multinomial/multiclass classifiers.
CHOOSING A CLASSIFIER
23
BINARY VS MULTICLASS CLASSIFIERS
24
Multinomial:
is generally faster: a single classifier.
classes are mutually exclusive, no overlap.
One-vs-all:
Multilabel classification: a document can fall into 1+ categories:
e.g. classify by language:
I said “bonjour mon ami” to my friend → English & French
How? Out of the k classifiers, those with confidence > threshold.
WHEN TO USE MULTINOMIAL OR ONE-VS-ALL?
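A one-vs-all sketch with scikit-learn's OneVsRestClassifier; the toy language data and the 0.3 confidence threshold are assumptions for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

docs = ["hello my friend", "good morning friend", "bonjour mon ami",
        "bonjour tout le monde", "hola amigo", "hola buenos dias"] * 3
labels = ["en", "en", "fr", "fr", "es", "es"] * 3

vec = CountVectorizer()
X = vec.fit_transform(docs)

# One-vs-all: k binary classifiers, each separating one language from the rest.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ova.fit(X, labels)

# Multilabel-style decision: keep every class whose confidence exceeds a threshold.
probs = ova.predict_proba(vec.transform(["I said bonjour mon ami to my friend"]))[0]
print([c for c, p in zip(ova.classes_, probs) if p > 0.3])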
EVALUATION OF TEXT
CLASSIFICATION
26
Evaluation is different for binary and multiclass classification.
Binary: we generally have a positive and a negative class
(spam vs non-spam, medical test positive vs negative, exam
pass vs fail).
Classification errors can only go to the other class.
Multiclass: multiple categories, which may have different levels of
importance.
Classification errors can go to any other class.
EVALUATION OF TEXT CLASSIFICATION
27
2-by-2 contingency table:
EVALUATION OF BINARY CLASSIFICATION
                          Actually positive       Actually negative
Classified as positive    True Positive (TP)      False Positive (FP)
Classified as negative    False Negative (FN)     True Negative (TN)
28
2-by-2 contingency table:
Precision: ratio of items classified as positive that are correct, i.e. TP / (TP + FP).
Recall: ratio of actually positive items that are classified as positive, i.e. TP / (TP + FN).
EVALUATION OF BINARY CLASSIFICATION
                          Actually positive       Actually negative
Classified as positive    True Positive (TP)      False Positive (FP)
Classified as negative    False Negative (FN)     True Negative (TN)
31
We want to optimise for both precision and recall: the F score
(harmonic mean of precision and recall).
Equation as follows; however, generally β = 1:
F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall), so with β = 1,
F_1 = 2 · Precision · Recall / (Precision + Recall).
EVALUATION OF BINARY CLASSIFICATION
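A small Python sketch of these formulas, computing precision, recall and F_β from the contingency-table counts; the counts themselves are made up.

def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives the usual F1.
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta

# Made-up counts: 40 true positives, 10 false positives, 20 false negatives.
print(precision_recall_fbeta(tp=40, fp=10, fn=20))  # (0.8, ~0.67, ~0.73)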
32
Bigger confusion matrix:
EVALUATION OF MULTICLASS CLASSIFICATION
                          Actually  Actually  Actually  Actually  Actually  Actually
                          UK        World     Tech      Science   Politics  Business
Classified as UK          95        1         13        0         1         0
Classified as World       0         1         0         0         0         0
Classified as Tech        10        90        0         1         0         0
Classified as Science     0         0         0         34        3         7
Classified as Politics    0         1         2         13        26        5
Classified as Business    0         0         2         14        5         10
33
Overall accuracy: ratio of correct classifications.
EVALUATION OF MULTICLASS CLASSIFICATION
34
Overall accuracy: ratio of correct classifications.
Generally a bad evaluation approach, e.g.:
We classify 1,000 texts → 990 have positive sentiment.
We (naively) classify everything as positive.
990 classified correctly: 990 / 1,000 = 0.99 accuracy!
EVALUATION OF MULTICLASS CLASSIFICATION
35
Per-class precision and recall:
Precision: (# of items correctly classified as class i) / (# of items classified as class i)
Recall: (# of items correctly classified as class i) / (# of actual class i items)
With the harmonic mean, we can then get a per-class F1 score.
EVALUATION OF MULTICLASS CLASSIFICATION
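A sketch of these per-class scores computed from the news-topic confusion matrix shown earlier (rows = classified as, columns = actually); NumPy is assumed to be available.

import numpy as np

labels = ["UK", "World", "Tech", "Science", "Politics", "Business"]
# Rows: classified as; columns: actually (as in the earlier confusion matrix).
cm = np.array([[95,  1, 13,  0,  1,  0],
               [ 0,  1,  0,  0,  0,  0],
               [10, 90,  0,  1,  0,  0],
               [ 0,  0,  0, 34,  3,  7],
               [ 0,  1,  2, 13, 26,  5],
               [ 0,  0,  2, 14,  5, 10]])

for i, label in enumerate(labels):
    correct = cm[i, i]
    precision = correct / cm[i, :].sum()  # out of everything classified as i
    recall = correct / cm[:, i].sum()     # out of everything actually i
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))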
36
We have per-class precision, recall and F1 scores.
How do we combine them all to get a single score?
EVALUATION OF MULTICLASS CLASSIFICATION
37
Obtaining overall performance:
Macroaveraging:
Compute performance for each class, then average them.
All classes contribute the same to the final score (e.g. class with 990
and class with 10 instances).
Microaveraging:
Compute overall performance without computing per-class
performances.
Large classes contribute more to the final score.
EVALUATION OF MULTICLASS CLASSIFICATION
38
Macroaveraging:
Macro-precision = (Prec_1 + … + Prec_k) / k and macro-recall = (Rec_1 + … + Rec_k) / k,
i.e. the average of the per-class scores.
The macroaveraged F1 score is then the harmonic mean of those.
EVALUATION OF MULTICLASS CLASSIFICATION
39
Microaveraging:
Micro-precision = Σ_i TP_i / Σ_i (TP_i + FP_i) and micro-recall = Σ_i TP_i / Σ_i (TP_i + FN_i),
i.e. pooling the counts of all classes before dividing.
The microaveraged F1 score is then the harmonic mean of those.
EVALUATION OF MULTICLASS CLASSIFICATION
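A minimal sketch contrasting micro- and macro-averaged F1 with scikit-learn's f1_score; the gold labels and predictions are made up (this is not the example on the next slide).

from sklearn.metrics import f1_score

# Made-up gold labels and predictions for an imbalanced 3-class problem,
# with a classifier that always predicts the majority class "A".
y_true = ["A"] * 8 + ["B"] * 1 + ["C"] * 1
y_pred = ["A"] * 10

# Micro-averaging pools all decisions, so the large class dominates.
print(f1_score(y_true, y_pred, average="micro", zero_division=0))
# Macro-averaging averages per-class scores, so the small classes pull it down.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))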
40
Microaveraged F1 score: 0.665
Macroaveraged F1 score: 0.440
MICRO- VS MACRO-AVERAGING EXAMPLE
41
We can choose to prioritise certain categories:
Give higher weight to important categories:
0.3*Prec(c_1) + 0.3*Prec(c_2) + 0.4*Prec(c_3)
Select some categories for inclusion in the macro/microaverage:
e.g. in the SemEval task (Exercise 2 of the module), we only macroaverage over the positive
and negative sentiment classes. Performance over the neutral class is not included.
FURTHER WEIGHTING/SELECTION
ERROR ANALYSIS FOR TEXT
CLASSIFICATION
43
ERROR ANALYSIS
Error analysis: can help us find out where our classifier can do
better.
No magic formula for performing error analysis.
Look at where we are going wrong, and at which labels in particular.
Do our errors have some common characteristics? Can we
infer a new feature from that?
Could our classifier be favouring one of the classes (e.g. the
majority class)?
44
ERROR ANALYSIS
Error analysis: where are we going wrong? On which labels?
Look at frequent deviations in the confusion matrix.
                          Actually  Actually  Actually  Actually  Actually  Actually
                          UK        World     Tech      Science   Politics  Business
Classified as UK          95        1         13        0         1         0
Classified as World       0         1         0         0         0         0
Classified as Tech        10        90        0         1         0         0
Classified as Science     0         0         0         34        3         7
Classified as Politics    0         1         2         13        26        5
Classified as Business    0         0         2         14        5         10
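A sketch of how the frequent confusions and some of the misclassified documents could be printed; y_true, y_pred and docs are hypothetical development-set variables.

from collections import Counter

# Hypothetical development-set gold labels, predictions and documents.
y_true = ["Tech", "World", "UK", "Tech", "Science", "World"]
y_pred = ["UK",   "Tech",  "UK", "Tech", "Business", "Tech"]
docs   = ["new phone released", "summit held abroad", "rain in London",
          "chip maker update", "telescope launched", "elections overseas"]

# Most frequent (gold -> predicted) confusions.
confusions = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
for (gold, pred), count in confusions.most_common(3):
    print(f"{gold} misclassified as {pred}: {count} time(s)")

# Print some of the errors to look for common characteristics.
for doc, t, p in zip(docs, y_true, y_pred):
    if t != p:
        print(f"[gold={t} pred={p}] {doc}")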
45
ERROR ANALYSIS
Error analysis: do our errors have some common characteristics?
Print some of our errors.
47
ERROR ANALYSIS
Error analysis: do our errors have some common characteristics?
Print some of our errors.
New feature: suffix, last 2-3 characters
48
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
Owing to class imbalance, classifiers tend to predict popular
classes more often, e.g.:
class A (700), class B (100), class C (100), class D (100)
classifiers will tend to predict A, the over-represented class
as in our previous example:
49
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
1) Undersample popular class → A-100, B-100, C-100, D-100
randomly remove 600 instances of A
50
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
2) Oversample other classes → A-700, B-700, C-700, D-700
repeat instances of B, C, D to match the number of A’s
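A plain-Python sketch of random undersampling and oversampling; the dataset is synthetic, following the A-700, B-100, C-100, D-100 example.

import random
from collections import Counter

random.seed(0)

# Synthetic imbalanced dataset: (document, label) pairs.
data = ([("doc", "A")] * 700 + [("doc", "B")] * 100 +
        [("doc", "C")] * 100 + [("doc", "D")] * 100)

groups = {}
for item in data:
    groups.setdefault(item[1], []).append(item)

# 1) Undersample: cut every class down to the size of the smallest one.
smallest = min(len(items) for items in groups.values())
undersampled = [x for items in groups.values() for x in random.sample(items, smallest)]

# 2) Oversample: sample every class (with replacement) up to the size of the largest one.
largest = max(len(items) for items in groups.values())
oversampled = [x for items in groups.values() for x in random.choices(items, k=largest)]

print(Counter(label for _, label in undersampled))  # 100 of each class
print(Counter(label for _, label in oversampled))   # 700 of each class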
51
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
3) Create synthetic data → A-700, B-700, C-700, D-700
generate new B, C, D items → needs some understanding of the
contents of the classes to be able to produce sensible data items
52
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
4) Cost sensitive learning
e.g. weight uncommon classes more, inversely proportional to their frequency:
w(A)=1/700, w(B)=1/100, w(C)=1/100, w(D)=1/100
scikit-learn → class_weight="balanced"
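A minimal cost-sensitive sketch with scikit-learn, where class_weight="balanced" weights classes inversely to their frequency (current scikit-learn versions use "balanced" rather than the older "auto"); the data here is synthetic.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic imbalanced data: many more "A" documents than "B" documents.
docs = ["common topic text"] * 70 + ["rare niche text"] * 10
labels = ["A"] * 70 + ["B"] * 10

# class_weight="balanced" makes mistakes on the rare class cost more during training.
model = make_pipeline(CountVectorizer(),
                      LogisticRegression(class_weight="balanced", max_iter=1000))
model.fit(docs, labels)
print(model.predict(["rare niche text"]))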
53
ERROR ANALYSIS
Important for the error analysis:
Subset we analyse for errors (dev set) has to be different to
the one where we ultimately apply the classifier (test set).
If we tweak the classifier looking at the test set, we’ll end up
overfitting, developing a classifier that works very well for that
particular test set.
NB: for exercise 2, you’re given 3 test sets; we’ll test it on 2 more
held-out test sets. Not looking at test sets while developing will
improve generalisability.
54
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapters 6
and 7.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 6.