
LECTURE 7

Text Classification, Evaluation and Error Analysis

Arkaitz Zubiaga, 29th January, 2018

2

 What is text classification?

 Examples of text classification.

 Supervised Text Classification.

 Evaluation.

 Error Analysis.

LECTURE 7: CONTENTS

3

 Having as input:

 A text document d

 A set of categories C = {c1, …, cm}

 The text classification task outputs:

 Predicted class c* that document d belongs to.

WHAT IS TEXT CLASSIFICATION?

4

 Spam detection: classifying emails/web pages as spam (or not).

EXAMPLES OF TEXT CLASSIFICATION

[Example emails/web pages labelled SPAM and NOT SPAM]

5

 Classification by topic: what is the text about?

EXAMPLES OF TEXT CLASSIFICATION

6

 Sentiment analysis: is a text positive, negative or neutral?

 I really liked the food at the restaurant.

 We were 8 friends who went there for the first time.

 The service was terrible.

EXAMPLES OF TEXT CLASSIFICATION

7

 Language identification: what language is a text written in?

Wieviel Uhr ist es? {German, English, Spanish, French}

EXAMPLES OF TEXT CLASSIFICATION

8

 Classification of political orientation:
does a text support Labour or Conservative?

EXAMPLES OF TEXT CLASSIFICATION

[Example texts labelled Labour / Conservative]

9

 A range of different problems, with a common goal:

 Assigning a category/class to each document.

 We know the set of categories beforehand.

WHAT IS TEXT CLASSIFICATION?

10

 Rule-based classifiers, e.g. if email contains ‘viagra’ → spam

 Significant manual effort involved.

 Supervised classification:

 Given: a hand-labelled set of document-class pairs
(d1, c1), (d2, c2), …, (dm, cm) → classified into C = {c1, …, cj}

 The classifier learns a model that can classify new
documents into C.

TEXT CLASSIFICATION: APPROACHES

SUPERVISED TEXT
CLASSIFICATION

12

 Assumption: we have a manually labelled dataset, e.g.:

 d1: ‘That’s really good, I love it’ → positive

 d2: ‘It was boring, don’t recommend it’ → negative

 …

 dn: ‘I wouldn’t go again, awful’ → negative

 If not, we need to find one or label one ourselves.

SUPERVISED CLASSIFICATION

13

 Split the dataset into train/dev/test sets.

 What features are we going to use to represent the documents?

 What classifier are we going to use?

 Choose settings, parameters, etc. for the classifier.

SUPERVISED CLASSIF.: DECISIONS TO MAKE

14

 We can split the dataset into 3 parts:

 Training set → the largest set, as we want proper training.

 Development set.

 Test set.

 Tweak the classifier based on the development set, then test it on the test set.

 Tweaking and testing on the test set may lead to overfitting (doing the
right things specifically for that test set, not necessarily generalisable).

SPLITTING THE DATASET

[Diagram: Training set | Development set | Test set]
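
A minimal sketch of such a split with scikit-learn (the toy texts and labels below are hypothetical placeholders for a real labelled dataset):

# Split a labelled dataset into train/dev/test (roughly 70/15/15).
from sklearn.model_selection import train_test_split

texts  = ["great food", "awful service", "loved it", "never again"]   # placeholder data
labels = ["pos", "neg", "pos", "neg"]

# First carve off the test set, then split the remainder into train and dev.
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.15, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42)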

15

 Cross-validation: train and test on different “folds”.

 e.g. 10-fold cross-validation: split the data into 10 parts.

 each time, 1 fold is used for testing, the other 9 for training.

 after all 10 runs, compute the average performance.

SPLITTING THE DATASET: LARGER DATASETS

16

 Cross-validation: example.

SPLITTING THE DATASET: LARGER DATASETS
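
The worked example figure is not reproduced here; as an illustrative sketch, 10-fold cross-validation can be run with scikit-learn as follows (the pipeline and the toy data are assumptions):

# 10-fold cross-validation: each run trains on 9 folds and tests on the
# remaining one; the final estimate is the average over the 10 runs.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["great food", "awful service"] * 10     # placeholder data
labels = ["pos", "neg"] * 10

model = make_pipeline(CountVectorizer(), MultinomialNB())
folds = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=folds, scoring="f1_macro")
print(scores.mean())    # average performance over the 10 folds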

17

 Usually start with some basic features:

 Bag of words.

 Or preferably word embeddings.

 Keep adding new features:

 Need to be creative.
Think of features that could characterise the problem at hand.

CHOOSING THE FEATURES
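
As a sketch, a bag-of-words representation can be obtained with scikit-learn's CountVectorizer (the two documents below are hypothetical; word embeddings would replace this step with dense vectors):

# Bag-of-words: each document becomes a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really liked the food", "the service was terrible"]
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)            # sparse document-term matrix

print(vectorizer.get_feature_names_out())     # vocabulary (the features)
print(X.toarray())                            # counts per document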

18

 Possible features:

 Sentiment analysis → counts of positive/negative words.

 Language identification → probabilities of characters (how
many k’s, y’s, v’s…), features from word suffixes (e.g. many
-ing words → English).

 Spam detection → count words in a blacklist, domain of URLs
in the email (looking for malicious URLs).

THINKING OF FEATURES
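
For instance, a sentiment word-count feature of the kind mentioned above could be sketched as follows (the positive/negative word lists are hypothetical stand-ins for a real lexicon):

# Hand-crafted sentiment features: counts of positive and negative words.
POSITIVE = {"good", "great", "liked", "love"}      # hypothetical lexicon
NEGATIVE = {"bad", "terrible", "boring", "awful"}  # hypothetical lexicon

def sentiment_features(text):
    tokens = text.lower().split()
    return {
        "n_positive": sum(t in POSITIVE for t in tokens),
        "n_negative": sum(t in NEGATIVE for t in tokens),
    }

print(sentiment_features("The service was terrible"))   # {'n_positive': 0, 'n_negative': 1}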

19

 How to assess which features are good?

 Empirical evaluation:

 Incremental testing:
keep adding features, see if adding them improves performance.

 Leave-one-out testing:
test all features, and combinations of all features except one.
When leaving feature i out performs better than all features, remove feature i.

 Error analysis: (later in this lecture)
look at the classifier’s errors; what features can we use to improve?

CHOOSING THE FEATURES

20

 Many different classifiers exist; well-known classifiers include:

 Naive Bayes.

 Logistic Regression (Maximum Entropy classifier).

 Support Vector Machines (SVM).

 Classifiers can be binary (k = 2) or multiclass (k > 2).

CHOOSING A CLASSIFIER
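
A minimal sketch of training each of these classifiers on bag-of-words features with scikit-learn (the training texts are placeholders):

# Train Naive Bayes, Logistic Regression and a linear SVM on the same data.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X_train = ["loved it", "hated it", "really good", "really bad"]   # placeholder data
y_train = ["pos", "neg", "pos", "neg"]

for clf in (MultinomialNB(), LogisticRegression(), LinearSVC()):
    model = make_pipeline(CountVectorizer(), clf)
    model.fit(X_train, y_train)
    print(type(clf).__name__, model.predict(["really loved it"]))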

21

 Have very little data?

 Naive Bayes.

 Semi-supervised classification (e.g. bootstrapping):
incorporates the classifier’s predictions into the training data.

 Have a good amount of data?

 SVM.

 Logistic regression.

CHOOSING A CLASSIFIER

22

 How many categories (k)?

 [k=2] Binary → binary classifier.

 [k>2] Multiclass:

 One-vs-all classifiers.
Build k classifiers, each able to distinguish class i from the rest. Then combine the
output of all classifiers (e.g. based on their confidence scores).

 Multinomial/multiclass classifiers.

CHOOSING A CLASSIFIER

23

BINARY VS MULTICLASS CLASSIFIERS

24

 Multinomial:

 is generally faster, a single classifier.

 classes are mutually exclusive, no overlap.

 One-vs-all:

 Multilabel classification, a document can fall into 1+ categories:
e.g. classify by language:

I said “bonjour mon ami” to my friend → English & French

How? Out of the k classifiers, those with confidence > threshold.

WHEN TO USE MULTINOMIAL OR ONE-VS-ALL?
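
A rough sketch of the one-vs-all idea with a confidence threshold, using scikit-learn's OneVsRestClassifier (the data, the threshold and the use of decision_function scores as "confidence" are all illustrative assumptions):

# One binary classifier per language; return every class whose score
# exceeds a threshold, so a document can receive more than one label.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_train = ["good morning my friend", "bonjour mon ami",
           "wieviel uhr ist es", "hello my friend"]          # placeholder data
y_train = ["english", "french", "german", "english"]

model = make_pipeline(CountVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
model.fit(X_train, y_train)

scores = model.decision_function(['I said "bonjour mon ami" to my friend'])[0]
threshold = 0.0                                               # illustrative threshold
print([c for c, s in zip(model.classes_, scores) if s > threshold])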

EVALUATION OF TEXT
CLASSIFICATION

26

 Evaluation is different for binary and multiclass classification.

 Binary: we generally have a positive and a negative class
(spam vs non-spam, medical test positive vs negative, exam
pass vs fail).
Classification errors can only go to the other class.

 Multiclass: multiple categories, which may have different levels of
importance.
Classification errors can go to any other class.

EVALUATION OF TEXT CLASSIFICATION

27

 2-by-2 contingency table:

EVALUATION OF BINARY CLASSIFICATION

                         Actually positive     Actually negative
 Classified as positive  True Positive (TP)    False Positive (FP)
 Classified as negative  False Negative (FN)   True Negative (TN)

28

 2-by-2 contingency table:

 Precision: ratio of items classified as positive that are correct.
   Precision = TP / (TP + FP)

 Recall: ratio of correct items that are classified as positive.
   Recall = TP / (TP + FN)

EVALUATION OF BINARY CLASSIFICATION

                         Actually positive     Actually negative
 Classified as positive  True Positive (TP)    False Positive (FP)
 Classified as negative  False Negative (FN)   True Negative (TN)


31

 We want to optimise for both precision and recall: the F score
(harmonic mean of precision and recall).

 Equation as follows, where generally β = 1:

   F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

EVALUATION OF BINARY CLASSIFICATION
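
As a small worked sketch, precision, recall and the F-beta score can be computed directly from the counts in the contingency table (the TP/FP/FN values below are made up):

# Precision, recall and F-beta from hypothetical TP/FP/FN counts.
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, fbeta

print(precision_recall_fbeta(tp=80, fp=20, fn=40))   # P=0.80, R≈0.67, F1≈0.73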

32

 Bigger confusion matrix:

EVALUATION OF MULTICLASS CLASSIFICATION

                          Actually  Actually  Actually  Actually  Actually  Actually
                          UK        World     Tech      Science   Politics  Business
 Classified as UK         95        1         13        0         1         0
 Classified as World      0         1         0         0         0         0
 Classified as Tech       10        90        0         1         0         0
 Classified as Science    0         0         0         34        3         7
 Classified as Politics   0         1         2         13        26        5
 Classified as Business   0         0         2         14        5         10

33

 Overall Accuracy: ratio of correct classifications.

EVALUATION OF MULTICLASS CLASSIFICATION

34

 Overall Accuracy: ratio of correct classifications.

 Generally a bad evaluation approach, e.g.:

 We classify 1,000 texts → 990 have positive sentiment.

 We (naively) classify everything as positive.

 990 classified correctly: 990 / 1000 = 0.99 accuracy!

EVALUATION OF MULTICLASS CLASSIFICATION

35

 Per-class precision and recall:

 Precision: # of items correctly classified as class i / # of items classified as class i

 Recall: # of items correctly classified as class i / # of actual class i items

 With the harmonic mean, we can then get the per-class F1 score.

EVALUATION OF MULTICLASS CLASSIFICATION

36

 We have per-class precision, recall and F1 scores.

 How do we combine them all to get a single score?

EVALUATION OF MULTICLASS CLASSIFICATION

37

 Obtaining overall performance:

 Macroaveraging:
Compute performance for each class, then average them.
All classes contribute the same to the final score (e.g. a class with 990
and a class with 10 instances).

 Microaveraging:
Compute overall performance without computing per-class
performances.
Large classes contribute more to the final score.

EVALUATION OF MULTICLASS CLASSIFICATION

38

 Macroaveraging:
the macroaveraged precision and recall are the means of the per-class
precisions and recalls.

 The macroaveraged F1 score is then the harmonic mean of those.

EVALUATION OF MULTICLASS CLASSIFICATION

39

 Microaveraging:
the microaveraged precision and recall are computed from the total TP, FP and FN
counts pooled over all classes.

 The microaveraged F1 score is then the harmonic mean of those.

EVALUATION OF MULTICLASS CLASSIFICATION

40

 Microaveraged F1 score: 0.665

 Macroaveraged F1 score: 0.440

MICRO- VS MACRO-AVERAGING EXAMPLE
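
The 0.665 and 0.440 figures come from the lecture's own worked example (not reproduced here); as a sketch, both averages can be obtained with scikit-learn from any set of gold and predicted labels:

# Micro- vs macro-averaged F1 on hypothetical gold/predicted labels.
from sklearn.metrics import f1_score

y_true = ["uk", "uk", "uk", "tech", "science", "business"]
y_pred = ["uk", "uk", "tech", "tech", "science", "science"]

print(f1_score(y_true, y_pred, average="micro"))   # dominated by large classes
print(f1_score(y_true, y_pred, average="macro"))   # every class counts equally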

41

 We can choose to prioritise certain categories:

 Give higher weight to important categories:
0.3*Prec(c1) + 0.3*Prec(c2) + 0.4*Prec(c3)

 Select some categories for inclusion in the macro/microaverage:

 e.g. SemEval task (Exercise 2 of the module): we only macroaverage over the positive
and negative sentiment classes. Performance over the neutral class is not included.

FURTHER WEIGHTING/SELECTION

ERROR ANALYSIS FOR TEXT
CLASSIFICATION

43

ERROR ANALYSIS

 Error analysis: can help us find out where our classifier can do
better.

 No magic formula for performing error analysis.
 Look at where we are going wrong, and at which labels in particular.
 Do our errors have some common characteristics? Can we
infer a new feature from that?
 Could our classifier be favouring one of the classes (e.g. the
majority class)?

44

ERROR ANALYSIS

 Error analysis: where are we going wrong? With what labels?

Look at frequent deviations in the confusion matrix.

                          Actually  Actually  Actually  Actually  Actually  Actually
                          UK        World     Tech      Science   Politics  Business
 Classified as UK         95        1         13        0         1         0
 Classified as World      0         1         0         0         0         0
 Classified as Tech       10        90        0         1         0         0
 Classified as Science    0         0         0         34        3         7
 Classified as Politics   0         1         2         13        26        5
 Classified as Business   0         0         2         14        5         10
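
A minimal sketch of producing and scanning such a confusion matrix with scikit-learn (the gold and predicted labels are hypothetical dev-set outputs):

# Build the confusion matrix (rows = actual class, columns = predicted class)
# and print the off-diagonal cells, i.e. the most frequent confusions.
from sklearn.metrics import confusion_matrix

classes = ["uk", "world", "tech", "science", "politics", "business"]
y_true = ["uk", "uk", "tech", "science", "politics", "business"]     # placeholder data
y_pred = ["uk", "tech", "tech", "science", "science", "business"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
for i, actual in enumerate(classes):
    for j, predicted in enumerate(classes):
        if i != j and cm[i, j] > 0:
            print(f"{cm[i, j]} item(s) of '{actual}' classified as '{predicted}'")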

45

ERROR ANALYSIS

 Error analysis: do our errors have some common characteristics?

Print some of our errors.


47

ERROR ANALYSIS

 Error analysis: do our errors have some common characteristics?

Print some of our errors.

New feature: suffix, last 2-3 characters
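
A sketch of that suffix feature (the function and its feature names are illustrative, not from the lecture):

# Suffix features: the last 2-3 characters of each word, e.g. many "-ing"
# suffixes hint at English in a language-identification setting.
from collections import Counter

def suffix_features(text, lengths=(2, 3)):
    feats = Counter()
    for token in text.lower().split():
        for n in lengths:
            if len(token) > n:
                feats[f"suffix{n}={token[-n:]}"] += 1
    return dict(feats)

print(suffix_features("running and jumping"))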

48

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
Owing to class imbalance, classifiers tend to predict popular
classes more often, e.g.:
class A (700), class B (100), class C (100), class D (100)
classifiers will tend to predict A, which is over-represented,
as in our previous example.

49

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 1) Undersample popular class → A-100, B-100, C-100, D-100

randomly remove 600 instances of A
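
A minimal sketch of random undersampling (the data layout, a list of (text, label) pairs, is an assumption):

# Randomly keep only 100 instances of the over-represented class A.
import random

def undersample(docs, majority_label="A", keep=100, seed=42):
    majority = [d for d in docs if d[1] == majority_label]
    rest = [d for d in docs if d[1] != majority_label]
    random.seed(seed)
    return rest + random.sample(majority, keep)

# Placeholder data: 700 A's and 100 each of B, C, D.
docs = [(f"text {i}", "A") for i in range(700)]
docs += [(f"text {i}", lab) for lab in ("B", "C", "D") for i in range(100)]
print(len(undersample(docs)))   # 400 = 100 + 100 + 100 + 100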

50

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 2) Oversample other classes → A-700, B-700, C-700, D-700

repeat instances of B, C, D to match the number of A’s

51

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 3) Create synthetic data → A-700, B-700, C-700, D-700

generate new B, C, D items → needs some understanding of the
contents of the classes to be able to produce sensible data items

52

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 4) Cost sensitive learning

e.g. higher probability to predict uncommon classes
P(A)=1/700, P(B)=1/100, P(C)=1/100, P(D)=1/100

scikit-learn → class_weight="balanced" (formerly "auto")
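
A sketch of cost-sensitive learning via class weights in scikit-learn (the pipeline and the explicit weights are illustrative):

# Cost-sensitive learning: weight classes inversely to their frequency.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight="balanced"),   # weight ∝ 1 / class frequency
)
# Explicit weights are also possible, e.g.:
# LogisticRegression(class_weight={"A": 1/700, "B": 1/100, "C": 1/100, "D": 1/100})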

53

ERROR ANALYSIS

 Important for the error analysis:
 The subset we analyse for errors (dev set) has to be different to
the one where we ultimately apply the classifier (test set).
 If we tweak the classifier looking at the test set, we’ll end up
overfitting, developing a classifier that works very well for that
particular test set.

 NB: for exercise 2, you’re given 3 test sets; we’ll test it on 2 more
held-out test sets. Not looking at the test sets while developing will
improve generalisability.

54

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edition. Chapters 6 and 7.

 Bird Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 6.