LECTURE 7
Text Classification, Evaluation and Error Analysis
Arkaitz Zubiaga, 29th January, 2018
2
What is text classification?
Examples of text classification.
Supervised Text Classification.
Evaluation.
Error Analysis.
LECTURE 7: CONTENTS
3
Having as input:
A text document d
A set of categories C = {c1, …, cm}
The text classification task outputs:
Predicted class c* that document d belongs to.
WHAT IS TEXT CLASSIFICATION?
4
Spam detection: classifying emails/web pages as spam (or not).
EXAMPLES OF TEXT CLASSIFICATION
SPAM
NOT SPAM
5
Classification by topic: what is the text about?
EXAMPLES OF TEXT CLASSIFICATION
6
Sentiment analysis: is a text positive, negative or neutral?
I really liked the food at the restaurant.
We were 8 friends who went there for the first time.
The service was terrible.
EXAMPLES OF TEXT CLASSIFICATION
7
Language identification: what language is a text written in?
Wieviel Uhr ist es? {German, English, Spanish, French}
EXAMPLES OF TEXT CLASSIFICATION
8
Classification of political orientation:
does a text support Labour or Conservative?
EXAMPLES OF TEXT CLASSIFICATION
Labour Conservative
9
A range of different problems, with a common goal:
Assigning a category/class to each document.
We know the set of categories beforehand.
WHAT IS TEXT CLASSIFICATION?
10
Rule-based classifiers, e.g. if email contains ‘viagra’ → spam
Significant manual effort involved.
Supervised classification:
Given: a hand-labeled set of document-class pairs
(d1, c1), (d2, c2), …, (dm, cm) → classified into C = {c1, …, cj}
The classifier learns a model that can classify new documents into C.
TEXT CLASSIFICATION: APPROACHES
SUPERVISED TEXT CLASSIFICATION
12
Assumption: We have a manually labelled dataset, e.g.:
d1: ‘That’s really good, I love it’ → positive
d2: ‘It was boring, don’t recommend it’ → negative
…
dn: ‘I wouldn’t go again, awful’ → negative
If not, we need to find one or label one ourselves.
SUPERVISED CLASSIFICATION
13
Split the dataset into train/dev/test sets.
What features are we going to use to represent the documents?
What classifier are we going to use?
Choose settings, parameters, etc. for the classifier.
SUPERVISED CLASSIF.: DECISIONS TO MAKE
14
We can split the dataset into 3 parts:
Training set → the largest set, as we want proper training.
Development set.
Test set.
Tweak the classifier based on the development set, then test it on the test set.
Tweaking and testing on the test set may lead to overfitting (doing the
right things specifically for that test set, not necessarily generalisable).
SPLITTING THE DATASET
Training set | Development set | Test set
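A minimal sketch (not from the slides) of this three-way split, assuming scikit-learn; the texts and labels below are made-up placeholders.

```python
# Sketch of a train/dev/test split with scikit-learn (illustrative data).
from sklearn.model_selection import train_test_split

texts = ["That's really good, I love it",
         "It was boring, don't recommend it",
         "I wouldn't go again, awful",
         "Great food and friendly staff"]
labels = ["positive", "negative", "negative", "positive"]

# Hold out 25% as the test set, then carve a development set out of the rest.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42)
train_texts, dev_texts, train_labels, dev_labels = train_test_split(
    train_texts, train_labels, test_size=0.25, random_state=42)
```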
15
Cross-validation: train and test on different “folds”
e.g. 10-fold cross-validation: split the data into 10 parts.
each time, 1 fold is used for testing, the other 9 for training.
after all 10 runs, compute the average performance.
SPLITTING THE DATASET: LARGER DATASETS
16
Cross-validation: example.
SPLITTING THE DATASET: LARGER DATASETS
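A sketch of 10-fold cross-validation with scikit-learn; the feature matrix and labels are synthetic stand-ins, not real document data.

```python
# 10-fold cross-validation: each fold is used once for testing, the other 9
# for training, and the scores are averaged over the 10 runs.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean())  # average performance over the 10 folds
```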
17
Usually start with some basic features:
Bag of words.
Or preferably word embeddings.
Keep adding new features:
Need to be creative.
Think of features that could characterise the problem at hand.
CHOOSING THE FEATURES
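A bag-of-words starting point, sketched with scikit-learn's CountVectorizer; the example documents are made up.

```python
# Bag-of-words features: one count column per word in the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really liked the food at the restaurant.",
        "The service was terrible."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())
```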
18
Possible features:
Sentiment analysis → counts of positive/negative words.
Language identification → probabilities of characters (how many k’s, y’s, v’s…), features from word suffixes (e.g. many -ing words → English).
Spam detection → count words in a blacklist, domain of URLs in the email (looking for malicious URLs).
THINKING OF FEATURES
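Toy versions of two of these feature ideas, as a sketch only: the word lists are illustrative, not real lexicons, and the function names are invented for this example.

```python
# Hand-crafted features: positive/negative word counts for sentiment,
# and simple character/suffix statistics for language identification.
POSITIVE = {"good", "great", "liked", "love"}
NEGATIVE = {"bad", "terrible", "boring", "awful"}

def sentiment_features(text):
    tokens = text.lower().split()
    return {"n_positive": sum(t in POSITIVE for t in tokens),
            "n_negative": sum(t in NEGATIVE for t in tokens)}

def language_features(text):
    tokens = text.lower().split()
    return {"n_ing_suffix": sum(t.endswith("ing") for t in tokens),
            "n_k": text.lower().count("k")}

print(sentiment_features("The service was terrible."))
print(language_features("We were walking and talking."))
```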
19
How to assess which features are good?
Empirical evaluation:
Incremental testing:
keep adding features, see if each addition improves performance.
Leave-one-out testing (see the sketch below):
test all features, and combinations of all features except one.
when leaving feature i out performs better than all features, remove feature i.
Error analysis: (later in this lecture)
look at the classifier’s errors: what features can we use to improve?
CHOOSING THE FEATURES
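A rough leave-one-out feature ablation sketch, assuming scikit-learn and synthetic features standing in for real ones.

```python
# Drop one feature (column) at a time and compare cross-validated
# performance against the full feature set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

baseline = cross_val_score(clf, X, y, cv=5).mean()
for i in range(X.shape[1]):
    score = cross_val_score(clf, np.delete(X, i, axis=1), y, cv=5).mean()
    if score > baseline:
        print(f"dropping feature {i} helps: {score:.3f} > {baseline:.3f}")
```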
20
Many different classifiers exist; well-known classifiers include:
Naive Bayes.
Logistic Regression (Maximum Entropy classifier).
Support Vector Machines (SVM).
Classifiers can be binary (k = 2) or multiclass (k > 2).
CHOOSING A CLASSIFIER
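The three classifiers above as used in scikit-learn, sketched on a tiny made-up spam dataset with bag-of-words features.

```python
# Naive Bayes, Logistic Regression and a linear SVM in the same pipeline shape.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["free viagra offer", "meeting at 10am",
               "cheap pills online", "project update attached"]
train_labels = ["spam", "not spam", "spam", "not spam"]

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(train_texts, train_labels)
    print(type(clf).__name__, model.predict(["viagra for cheap"]))
```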
21
Have very little data?
Naive Bayes.
Semi-supervised classification (e.g. bootstrapping)
Incorporates the classifier’s predictions into the training data.
Have a good amount of data?
SVM.
Logistic regression.
CHOOSING A CLASSIFIER
22
How many categories (k)?
[k=2] Binary → binary classifier.
[k>2] Multiclass:
One-vs-all classifiers.
Build k classifiers, each able to distinguish class i from the rest. Then combine
the output of all classifiers (e.g. based on their confidence scores).
Multinomial/multiclass classifiers.
CHOOSING A CLASSIFIER
23
BINARY VS MULTICLASS CLASSIFIERS
24
Multinomial:
is generally faster, a single classifier.
classes are mutually exclusive, no overlap.
One-vs-all:
Multilabel classification, a document can fall into 1+ categories:
e.g. classify by language:
I said “bonjour mon ami” to my friend → English & French
How? Out of the k classifiers, those with confidence > threshold (sketched below).
WHEN TO USE MULTINOMIAL OR ONE-VS-ALL?
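A one-vs-all multilabel sketch, assuming scikit-learn; the toy language-identification data and the 0.5 threshold are illustrative choices.

```python
# One binary classifier per class; every class whose confidence exceeds a
# threshold is returned, so a document can get more than one label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["good morning everyone",
         "bonjour mon ami",
         'I said "bonjour mon ami" to my friend']
labels = [["English"], ["French"], ["English", "French"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary indicator column per language

model = make_pipeline(CountVectorizer(),
                      OneVsRestClassifier(LogisticRegression(max_iter=1000)))
model.fit(texts, Y)

probs = model.predict_proba(["bonjour to everyone"])[0]
print([c for c, p in zip(mlb.classes_, probs) if p > 0.5])  # confidence > threshold
```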
EVALUATION OF TEXT CLASSIFICATION
26
Evaluation is different for binary and multiclass classification.
Binary: we generally have a positive and a negative class
(spam vs non-spam, medical test positive vs negative, exam pass vs fail).
Classification errors can only go to the other class.
Multiclass: multiple categories, which may have different levels of importance.
Classification errors can go to any other class.
EVALUATION OF TEXT CLASSIFICATION
27
2-by-2 contingency table:
EVALUATION OF BINARY CLASSIFICATION
                          Actually positive     Actually negative
Classified as positive    True Positive (TP)    False Positive (FP)
Classified as negative    False Negative (FN)   True Negative (TN)
28
2-by-2 contingency table:
Precision: ratio of items classified as positive that are correct: Precision = TP / (TP + FP)
Recall: ratio of actually positive items that are classified as positive: Recall = TP / (TP + FN)
EVALUATION OF BINARY CLASSIFICATION
                          Actually positive     Actually negative
Classified as positive    True Positive (TP)    False Positive (FP)
Classified as negative    False Negative (FN)   True Negative (TN)
31
We want to optimise for both precision and recall: the F-measure
(harmonic mean of precision and recall).
The general equation is as follows, however generally β = 1:
F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
With β = 1: F1 = 2 · Precision · Recall / (Precision + Recall)
EVALUATION OF BINARY CLASSIFICATION
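A small sketch computing precision, recall and F1 both by hand from the contingency-table counts and with scikit-learn; the gold/predicted labels are made up.

```python
from sklearn.metrics import precision_recall_fscore_support

gold = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
pred = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]

tp = sum(g == "spam" and p == "spam" for g, p in zip(gold, pred))
fp = sum(g == "not spam" and p == "spam" for g, p in zip(gold, pred))
fn = sum(g == "spam" and p == "not spam" for g, p in zip(gold, pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# the same figures via scikit-learn, treating "spam" as the positive class
print(precision_recall_fscore_support(gold, pred, pos_label="spam",
                                      average="binary"))
```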
32
Bigger confusion matrix:
EVALUATION OF MULTICLASS CLASSIFICATION
                          Actually  Actually  Actually  Actually  Actually  Actually
                          UK        World     Tech      Science   Politics  Business
Classified as UK              95         1        13         0         1         0
Classified as World            0         1         0         0         0         0
Classified as Tech            10        90         0         1         0         0
Classified as Science          0         0         0        34         3         7
Classified as Politics         0         1         2        13        26         5
Classified as Business         0         0         2        14         5        10
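A sketch of how such a matrix can be produced with scikit-learn; the tiny gold/predicted lists are made up.

```python
from sklearn.metrics import classification_report, confusion_matrix

classes = ["UK", "World", "Tech"]
gold = ["UK", "UK", "World", "Tech", "Tech", "UK"]
pred = ["UK", "Tech", "World", "Tech", "UK", "UK"]

# Note: scikit-learn puts the actual classes on the rows and the predictions
# on the columns, i.e. the transpose of the table above.
print(confusion_matrix(gold, pred, labels=classes))
print(classification_report(gold, pred, labels=classes))
```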
33
Overall Accuracy: ratio of correct classifications.
EVALUATION OF MULTICLASS CLASSIFICATION
34
Overall Accuracy: ratio of correct classifications.
Generally a bad evaluation approach, e.g.:
We classify 1,000 texts → 990 have positive sentiment.
We (naively) classify everything as positive.
990 classified correctly: 990 / 1000 = 0.99 accuracy!
EVALUATION OF MULTICLASS CLASSIFICATION
35
Per-class precision and recall:
Precision(i) = (# of items correctly classified as class i) / (# of items classified as class i)
Recall(i) = (# of items correctly classified as class i) / (# of actual items of class i)
With the harmonic mean, we can then get a per-class F1 score.
EVALUATION OF MULTICLASS CLASSIFICATION
36
We have per-class precision, recall and F1 scores.
How do we combine them all to get a single score?
EVALUATION OF MULTICLASS CLASSIFICATION
37
Obtaining overall performances:
Macroaveraging:
Compute performance for each class, then average them.
All classes contribute the same to the final score (e.g. a class with 990
and a class with 10 instances).
Microaveraging:
Compute overall performance without computing per-class performances.
Large classes contribute more to the final score.
EVALUATION OF MULTICLASS CLASSIFICATION
38
Macroaveraging:
Macro-Precision = (Precision(1) + … + Precision(k)) / k
Macro-Recall = (Recall(1) + … + Recall(k)) / k
The macroaveraged F1 score is then the harmonic mean of those two.
EVALUATION OF MULTICLASS CLASSIFICATION
39
Microaveraging:
Micro-Precision = (TP(1) + … + TP(k)) / ((TP(1) + FP(1)) + … + (TP(k) + FP(k)))
Micro-Recall = (TP(1) + … + TP(k)) / ((TP(1) + FN(1)) + … + (TP(k) + FN(k)))
The microaveraged F1 score is then the harmonic mean of those two.
EVALUATION OF MULTICLASS CLASSIFICATION
40
Microaveraged F1 score: 0.665
Macroaveraged F1 score: 0.440
MICRO- VS MACRO-AVERAGING EXAMPLE
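A different toy illustration of the same contrast, assuming scikit-learn: macro-averaging averages the per-class scores so every class counts equally, while micro-averaging pools all decisions so large classes dominate.

```python
from sklearn.metrics import f1_score

gold = ["pos"] * 8 + ["neg"] * 2
pred = ["pos"] * 10  # naive majority-class classifier

print(f1_score(gold, pred, average="macro"))  # punished by the missed "neg" class
print(f1_score(gold, pred, average="micro"))  # dominated by the large "pos" class
```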
41
We can choose to prioritise certain categories:
Give higher weight to important categories:
0.3 * Prec(c1) + 0.3 * Prec(c2) + 0.4 * Prec(c3)
Select some categories for inclusion in the macro/microaverage:
e.g. the SemEval task (Exercise 2 of the module): we only macroaverage over the positive
and negative sentiment classes. Performance over the neutral class is not included.
FURTHER WEIGHTING/SELECTION
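A sketch of both options, assuming scikit-learn; the labels, predictions and the 0.3/0.3/0.4 weights are illustrative only.

```python
from sklearn.metrics import f1_score, precision_score

gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "neutral", "neutral", "positive", "negative", "positive"]

# per-class precision, then a hand-weighted combination
prec = precision_score(gold, pred, labels=["positive", "negative", "neutral"],
                       average=None)
print(0.3 * prec[0] + 0.3 * prec[1] + 0.4 * prec[2])

# macro-average restricted to the positive and negative classes only
print(f1_score(gold, pred, labels=["positive", "negative"], average="macro"))
```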
ERROR ANALYSIS FOR TEXT CLASSIFICATION
43
ERROR ANALYSIS
Error analysis: can help us find out where our classifier can do
better.
No magic formula for performing error analysis.
Look at where we are going wrong, and at which labels in particular.
Do our errors have some common characteristics? Can we
infer a new feature from that?
Could our classifier be favouring one of the classes (e.g. the
majority class)?
44
ERROR ANALYSIS
Error analysis: where are we going wrong? With which labels?
Look at frequent deviations in the confusion matrix.
                          Actually  Actually  Actually  Actually  Actually  Actually
                          UK        World     Tech      Science   Politics  Business
Classified as UK              95         1        13         0         1         0
Classified as World            0         1         0         0         0         0
Classified as Tech            10        90         0         1         0         0
Classified as Science          0         0         0        34         3         7
Classified as Politics         0         1         2        13        26         5
Classified as Business         0         0         2        14         5        10
47
ERROR ANALYSIS
Error analysis: do our errors have some common characteristics?
Print some of our errors.
New feature: suffix, last 2-3 characters
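A small error-analysis loop of this kind, as a sketch: print the development-set documents the classifier got wrong, grouped by (gold, predicted) pair, and look for shared characteristics. All data here is made up.

```python
from collections import defaultdict

dev_texts = ["wir gehen morgen", "we go tomorrow", "nous allons demain"]
gold = ["German", "English", "French"]
pred = ["German", "German", "French"]

errors = defaultdict(list)
for text, g, p in zip(dev_texts, gold, pred):
    if g != p:
        errors[(g, p)].append(text)

# most frequent confusion first
for (g, p), examples in sorted(errors.items(), key=lambda kv: -len(kv[1])):
    print(f"gold={g}  predicted={p}  ({len(examples)} errors)")
    for ex in examples:
        print("   ", ex)
```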
48
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
Owing to class imbalance, classifiers tend to predict popular
classes more often, e.g.:
class A (700), class B (100), class C (100), class D (100)
classifiers will tend to predict A, which is over-represented
as in our previous example:
49
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
1) Undersample popular class → A-100, B-100, C-100, D-100
randomly remove 600 instances of A
50
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
2) Oversample other classes → A-700, B-700, C-700, D-700
repeat instances of B, C, D to match the number of A’s
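A sketch of options 1 and 2 for the A-700 / B-100 / C-100 / D-100 situation; the data items are placeholders, using only the Python standard library.

```python
# Random undersampling of the popular class and random oversampling (with
# replacement) of the rare classes.
import random
from collections import Counter

random.seed(0)
data = [("doc", "A")] * 700 + [("doc", "B")] * 100 + \
       [("doc", "C")] * 100 + [("doc", "D")] * 100

a_items = [x for x in data if x[1] == "A"]

# 1) undersample A down to 100 instances
undersampled = random.sample(a_items, 100) + [x for x in data if x[1] != "A"]

# 2) oversample B, C, D up to 700 instances each
oversampled = list(a_items)
for label in ("B", "C", "D"):
    items = [x for x in data if x[1] == label]
    oversampled += random.choices(items, k=700)

print(Counter(label for _, label in undersampled))
print(Counter(label for _, label in oversampled))
```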
51
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
3) Create synthetic data → A-700, B-700, C-700, D-700
generate new B, C, D items → needs some understanding of the
contents of the classes to be able to produce sensible data items
52
ERROR ANALYSIS
Error analysis: could our classifier be favouring one of the classes?
How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?
4) Cost-sensitive learning
e.g. give uncommon classes a higher weight, so errors on them cost more:
P(A)=1/700, P(B)=1/100, P(C)=1/100, P(D)=1/100
in scikit-learn: class_weight="balanced"
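A minimal sketch of cost-sensitive learning via class weights in scikit-learn, on synthetic imbalanced data; class_weight="balanced" weights each class inversely to its frequency.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# roughly 90% / 10% class imbalance
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                           random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)
```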
53
ERROR ANALYSIS
Important for the error analysis:
The subset we analyse for errors (the dev set) has to be different from
the one where we ultimately apply the classifier (the test set).
If we tweak the classifier looking at the test set, we’ll end up
overfitting, developing a classifier that works very well for that
particular test set.
NB: for exercise 2, you’re given 3 test sets, we’ll test it in 2 more
held-out test sets. Not looking at test sets while developing will
improve generalisability.
54
ASSOCIATED READING
Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Speech
Recognition, and Computational Linguistics. 3rd edition. Chapters 6
and 7.
Bird, Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 6.