
LECTURE 7

Text Classification, Evaluation and Error Analysis

Arkaitz Zubiaga, 29th January, 2018

2

 What is text classification?

 Examples of text classification.

 Supervised Text Classification.

 Evaluation.

 Error Analysis.

LECTURE 7: CONTENTS

3

 Having as input:

 A text document d

 A set of categories C = {c₁, …, cₘ}

 The text classification task outputs:

 Predicted class c* that document d belongs to.

WHAT IS TEXT CLASSIFICATION?

4

 Spam detection: classifying emails/web pages as spam (or not).

EXAMPLES OF TEXT CLASSIFICATION

SPAM

NOT SPAM

5

 Classification by topic: what is the text about?

EXAMPLES OF TEXT CLASSIFICATION

6

 Sentiment analysis: is a text positive, negative or neutral?

 I really liked the food at the restaurant.

 We were 8 friends who went there for the first time.

 The service was terrible.

EXAMPLES OF TEXT CLASSIFICATION

7

 Language identification: what language is a text written in?

Wieviel Uhr ist es? {German, English, Spanish, French}

EXAMPLES OF TEXT CLASSIFICATION

8

 Classification of political orientation:
does a text support Labour or Conservative?

EXAMPLES OF TEXT CLASSIFICATION

Labour Conservative

9

 A range of different problems, with a common goal:

 Assigning a category/class to each document.

 We know the set of categories beforehand.

WHAT IS TEXT CLASSIFICATION?

10

 Rule-based classifiers, e.g. if an email contains ‘viagra’ → spam (sketch below).

 Significant manual effort involved.

 Supervised classification:

 Given: a hand-labeled set of document-class pairs
(d₁, c₁), (d₂, c₂), …, (dₘ, cₘ) → classified into C = {c₁, …, cⱼ}

 The classifier learns a model that can classify new
documents into C.

TEXT CLASSIFICATION: APPROACHES
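
A minimal sketch of the rule-based approach, with a hypothetical keyword blacklist; it illustrates why the manual effort is significant, since every rule has to be written and maintained by hand:

# Toy rule-based spam classifier: flag a text as spam if it contains
# any blacklisted keyword (the blacklist here is purely hypothetical).
BLACKLIST = {"viagra", "lottery", "winner"}

def rule_based_classify(text):
    tokens = set(text.lower().split())
    return "spam" if tokens & BLACKLIST else "not spam"

print(rule_based_classify("You are the lottery winner, claim now"))  # spam
print(rule_based_classify("Meeting moved to 3pm tomorrow"))          # not spam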

SUPERVISED TEXT
CLASSIFICATION

12

 Assumption: We have a manually labelled dataset, e.g.:

 d₁: ‘That’s really good, I love it’ → positive

 d₂: ‘It was boring, don’t recommend it’ → negative

 …

 dₙ: ‘I wouldn’t go again, awful’ → negative

 If not, we need to find one or label one ourselves.

SUPERVISED CLASSIFICATION

13

 Split the dataset into train/dev/test sets.

 What features are we going to use to represent the documents?

 What classifier are we going to use?

 Choose settings, parameters, etc. for the classifier.

SUPERVISED CLASSIF.: DECISIONS TO MAKE

14

 We can split the dataset into 3 parts:

 Training set → the largest set, as we want proper training.

 Development set.

 Test set.

 Tweak the classifier based on the development set, then test it on the test set.

 Tweaking and testing on the test set may lead to overfitting (doing the
right things specifically for that test set, not necessarily generalisable).

SPLITTING THE DATASET

Training set | Development set | Test set
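
One possible way of producing the three splits with scikit-learn's train_test_split, sketched on hypothetical docs/labels lists; the 80/10/10 proportions are just an example:

from sklearn.model_selection import train_test_split

# Hypothetical placeholders for a labelled dataset.
docs = [f"document {i}" for i in range(100)]
labels = ["pos" if i % 2 == 0 else "neg" for i in range(100)]

# First carve out the test set (10%), then split the rest into train/dev.
rest_docs, test_docs, rest_y, test_y = train_test_split(
    docs, labels, test_size=0.1, random_state=42)
train_docs, dev_docs, train_y, dev_y = train_test_split(
    rest_docs, rest_y, test_size=1/9, random_state=42)  # 1/9 of 90% ≈ 10% overall

print(len(train_docs), len(dev_docs), len(test_docs))  # 80 10 10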

15

 Cross-validation: train and test on different “folds”.

 e.g. 10-fold cross-validation: split the data into 10 parts.

 each time, 1 fold is used for testing, the other 9 for training.

 after all 10 runs, compute the average performance.

SPLITTING THE DATASET: LARGER DATASETS

16

 Cross-validation: example.

SPLITTING THE DATASET: LARGER DATASETS
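
Since the worked example on the original slide is a figure, here is a small hedged sketch of 10-fold cross-validation instead, using scikit-learn's cross_val_score over a hypothetical bag-of-words pipeline and made-up data:

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled data; in practice docs/labels come from your dataset.
docs = ["great food", "terrible service", "loved it", "boring and slow"] * 25
labels = ["pos", "neg", "pos", "neg"] * 25

model = make_pipeline(CountVectorizer(), MultinomialNB())

# 10-fold cross-validation: each fold is held out once for testing,
# the other 9 folds are used for training; the 10 scores are then averaged.
scores = cross_val_score(model, docs, labels, cv=10)
print(scores.mean(), scores.std())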

17

 Usually start with some basic features:

 Bag of words.

 Or preferably word embeddings.

 Keep adding new features:

 Need to be creative.
Think of features that could characterise the problem at hand.

CHOOSING THE FEATURES
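
A minimal sketch of the bag-of-words representation with scikit-learn's CountVectorizer (the documents are hypothetical); word embeddings would replace this step with dense vectors:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents.
docs = ["I really liked the food",
        "The service was terrible",
        "I liked the service"]

# Bag of words: each document becomes a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # count vectors, one row per document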

18

 Possible features:

 Sentiment analysis → counts of positive/negative words.

 Language identification → probabilities of characters (how
many k’s, b’s, v’s…), features from word suffixes (e.g. many
-ing words → English)

 Spam detection → count words in blacklist, domain of URLs
in email (looking for malicious URLs)

THINKING OF FEATURES
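
A hedged sketch of the first idea above: hand-crafted sentiment features counting positive and negative lexicon words. The two word lists are hypothetical; a real system would use a proper sentiment lexicon:

# Counts of positive/negative lexicon words as features for sentiment analysis.
POSITIVE = {"good", "great", "liked", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "boring", "awful", "hate"}

def sentiment_features(text):
    tokens = text.lower().split()
    return {
        "n_positive": sum(t in POSITIVE for t in tokens),
        "n_negative": sum(t in NEGATIVE for t in tokens),
    }

print(sentiment_features("The food was great but the service was terrible"))

Feature dictionaries like this can be turned into vectors with scikit-learn's DictVectorizer and combined with bag-of-words features.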

19

 How to assess which features are good?

 Empirical evaluation:

 Incremental testing:
keep adding features, and see if each addition improves performance.

 Leave-one-out testing:
test all features, and combinations of all features except one.
If leaving feature i out performs better than using all features, remove feature i.

 Error analysis: (later in this lecture)
look at the classifier’s errors; what features can we use to improve?

CHOOSING THE FEATURES
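
A rough sketch of leave-one-out feature testing, assuming hypothetical feature groups laid out as columns of a matrix and a random placeholder dataset:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix whose columns are grouped by the feature
# family that produced them (e.g. lexicon counts, suffix features, ...).
rng = np.random.RandomState(0)
X = rng.rand(200, 6)
y = rng.randint(0, 2, size=200)
feature_groups = {"lexicon": [0, 1], "suffixes": [2, 3], "length": [4, 5]}

clf = LogisticRegression(max_iter=1000)
baseline = cross_val_score(clf, X, y, cv=5).mean()
print("all features:", baseline)

# Leave-one-out testing: drop one feature group at a time; if performance
# improves without a group, that group is a candidate for removal.
for name, cols in feature_groups.items():
    keep = [c for c in range(X.shape[1]) if c not in cols]
    score = cross_val_score(clf, X[:, keep], y, cv=5).mean()
    print(f"without {name}: {score:.3f}")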

20

 Many different classifiers exist; well-known classifiers include:

 Naive Bayes.

 Logistic Regression (Maximum Entropy classifier).

 Support Vector Machines (SVM).

 Classifiers can be binary (k = 2) or multiclass (k > 2).

CHOOSING A CLASSIFIER

21

 Have very little data?

 Naive Bayes.

 Semi-supervised classification (e.g. bootstrapping):
incorporates the classifier’s predictions into the training data.

 Have a good amount of data?

 SVM.

 Logistic regression.

CHOOSING A CLASSIFIER
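
A hedged sketch of how the classifiers mentioned above could be instantiated in scikit-learn, each behind the same bag-of-words features; the training data is hypothetical:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

models = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
    "linear_svm": make_pipeline(CountVectorizer(), LinearSVC()),
}

# Hypothetical training data.
train_docs = ["loved it", "hated it", "really good", "really bad"]
train_y = ["pos", "neg", "pos", "neg"]

for name, model in models.items():
    model.fit(train_docs, train_y)
    print(name, model.predict(["it was good"]))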

22

 How many categories (k)?

 [k=2] Binary → binary classifier.

 [k>2] Multiclass:

 One-vs-all classifiers.
Build k classifiers, each able to distinguish class i from the rest. Then combine the
output of all classifiers (e.g. based on their confidence scores).

 Multinomial/multiclass classifiers.

CHOOSING A CLASSIFIER

23

BINARY VS MULTICLASS CLASSIFIERS

24

 Multinomial:

 is generally faster: a single classifier.

 classes are mutually exclusive, no overlap.

 One-vs-all:

 Multilabel classification, a document can fall into 1+ categories,
e.g. classify by language:

I said “bonjour mon ami” to my friend → English & French

How? Out of the k classifiers, those with confidence > threshold.

WHEN TO USE MULTINOMIAL OR ONE-VS-ALL?
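
A hedged sketch of one-vs-all classification for the multilabel case, using scikit-learn's OneVsRestClassifier and a hypothetical confidence threshold; the toy language-identification data is made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multilabel data: a document can be in 1+ language classes.
docs = ["hello my friend", "bonjour mon ami", "I said bonjour to my friend",
        "good morning", "mon ami est ici"]
labels = [["en"], ["fr"], ["en", "fr"], ["en"], ["fr"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)            # one binary column per class

# One classifier per class (one-vs-all), all sharing the same text features.
clf = OneVsRestClassifier(
    make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)))
clf.fit(docs, Y)

# Assign every class whose confidence exceeds a (hypothetical) threshold.
probs = clf.predict_proba(["I said bonjour mon ami to my friend"])[0]
threshold = 0.4
print([c for c, p in zip(mlb.classes_, probs) if p > threshold])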

EVALUATION OF TEXT
CLASSIFICATION

26

 Evaluation is different for binary and multiclass classification.

 Binary: we generally have a positive and a negative class
(spam vs non-spam, medical test positive vs negative, exam
pass vs fail).
Classification errors can only go to the other class.

 Multiclass: multiple categories, which may have different levels of
importance.
Classification errors can go to any other class.

EVALUATION OF TEXT CLASSIFICATION

27

 2-by-2 contingency table:

                         Actually positive     Actually negative
Classified as positive   True Positive (TP)    False Positive (FP)
Classified as negative   False Negative (FN)   True Negative (TN)

28

 2-by-2 contingency table:

 Precision: ratio of items classified as positive that are correct.
Precision = TP / (TP + FP)

 Recall: ratio of correct items that are classified as positive.
Recall = TP / (TP + FN)

                         Actually positive     Actually negative
Classified as positive   True Positive (TP)    False Positive (FP)
Classified as negative   False Negative (FN)   True Negative (TN)

31

 We want to optimise for both precision and recall:

F1 = 2 * Precision * Recall / (Precision + Recall)
(harmonic mean of precision and recall)

 The general equation is as follows, however generally β = 1:

Fβ = (1 + β²) * Precision * Recall / (β² * Precision + Recall)

EVALUATION OF BINARY CLASSIFICATION
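
A small hedged sketch computing these metrics with scikit-learn on hypothetical gold labels and predictions; fbeta_score covers the general Fβ case:

from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# Hypothetical gold labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

print("precision:", precision_score(y_true, y_pred))        # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))           # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))               # harmonic mean of P and R
print("F0.5:     ", fbeta_score(y_true, y_pred, beta=0.5))  # weights precision more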

32

 Bigger confusion matrix:

EVALUATION OF MULTICLASS CLASSIFICATION

                          Actually   Actually   Actually   Actually   Actually   Actually
                            UK        World      Tech       Science    Politics   Business
Classified as UK            95         1         13          0          1          0
Classified as World          0         1          0          0          0          0
Classified as Tech          10        90          0          1          0          0
Classified as Science        0         0          0         34          3          7
Classified as Politics       0         1          2         13         26          5
Classified as Business       0         0          2         14          5         10

33

 Overall Accuracy: ratio of correct classifications.
Accuracy = (# correct classifications) / (# total items)

EVALUATION OF MULTICLASS CLASSIFICATION

34

 Overall Accuracy: ratio of correct classifications.

 Generally a bad evaluation approach, e.g.:

 We classify 1,000 texts → 990 have positive sentiment.

 We (naively) classify everything as positive.

 990 classifed cirrectly: 990 / 1000 = 0.99 accuracy!

EVALUATION OF MULTICLASS CLASSIFICATION

35

 Per-class precision and recall:

Precision (class i) = (# of items correctly classified as class i) / (# of items classified as class i)

Recall (class i) = (# of items correctly classified as class i) / (# of actual class i items)

 With the harmonic mean, we can then get per-class F1 score.

EVALUATION OF MULTICLASS CLASSIFICATION

36

 We have per-class precision, recall and F1 scores.

 How do we combine them all to get a single score?

EVALUATION OF MULTICLASS CLASSIFICATION

37

 Obtaining overall performances:

 Macroaveraging:
Compute performance for each class, then average them.
All classes contribute the same to the final score (e.g. a class with 990
and a class with 10 instances).

 Microaveraging:
Compute overall performance without computing per-class
performances.
Large classes contribute more to the final score.

EVALUATION OF MULTICLASS CLASSIFICATION

38

 Macroaveraging:

Macro-Precision = (Prec(c₁) + … + Prec(cₖ)) / k
Macro-Recall = (Rec(c₁) + … + Rec(cₖ)) / k

 The macroaveraged F1 score is then the harmonic mean of those.

EVALUATION OF MULTICLASS CLASSIFICATION

39

 Microaveraging:

Micro-Precision = (TP₁ + … + TPₖ) / ((TP₁ + FP₁) + … + (TPₖ + FPₖ))
Micro-Recall = (TP₁ + … + TPₖ) / ((TP₁ + FN₁) + … + (TPₖ + FNₖ))

 The microaveraged F1 score is then the harmonic mean of those.

EVALUATION OF MULTICLASS CLASSIFICATION

40

 Microaveraged F1 score: 0.665

 Macroaveraged F1 score: 0.440

MICRO- VS MACRO-AVERAGING EXAMPLE
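
A hedged sketch of computing both averages with scikit-learn on made-up, heavily imbalanced predictions (not the data behind the figures above); note that scikit-learn's macro F1 averages the per-class F1 scores, a slight variant of the harmonic-mean definition on the previous slides:

from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions over three classes of very
# different sizes (90 / 5 / 5).
y_true = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
y_pred = ["a"] * 90 + ["a"] * 4 + ["b"] + ["a"] * 4 + ["c"]

# Micro-averaging pools all decisions, so the large class dominates;
# macro-averaging treats every class equally.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))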

41

 We can choose to prioritise certain categories:

 Give higher weight to important categories:
0.3*Prec(c₁) + 0.3*Prec(c₂) + 0.4*Prec(c₃)

 Select some categories for inclusion in the macro/microaverage:
 e.g. SemEval task (Exercise 2 of the module): we only macroaverage over the positive
and negative sentiment classes. Performance over the neutral class is not included.

FURTHER WEIGHTING/SELECTION

ERROR ANALYSIS FOR TEXT
CLASSIFICATION

43

ERROR ANALYSIS

 Error analysis: can help us find out where our classifier can do
better.

 No magic formula for performing error analysis.
 Look at where we are going wrong, and at which labels in particular.
 Do our errors have some common characteristics? Can we

infer a new feature from that?
 Could our classifier be favouring one of the classes (e.g. the

majority class)?

44

ERROR ANALYSIS

 Error analysis: where are we going wrong? Which labels?

Look at frequent deviations in the confusion matrix.

                          Actually   Actually   Actually   Actually   Actually   Actually
                            UK        World      Tech       Science    Politics   Business
Classified as UK            95         1         13          0          1          0
Classified as World          0         1          0          0          0          0
Classified as Tech          10        90          0          1          0          0
Classified as Science        0         0          0         34          3          7
Classified as Politics       0         1          2         13         26          5
Classified as Business       0         0          2         14          5         10
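
A hedged sketch of producing such a confusion matrix from hypothetical development-set predictions with scikit-learn:

from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical gold labels and predictions on the development set.
labels = ["UK", "World", "Tech", "Science", "Politics", "Business"]
y_true = ["UK", "UK", "Tech", "Science", "Politics", "Business", "World", "Tech"]
y_pred = ["UK", "Tech", "Tech", "Business", "Science", "Business", "Tech", "Tech"]

# Rows are the actual class and columns the predicted class (scikit-learn's
# convention, the transpose of the table above); large off-diagonal cells
# are the frequent deviations worth investigating.
print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))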

47

ERROR ANALYSIS

 Error analysis: do our errors have some common characteristics?

Print some of our errors.

New feature: suffix, last 2-3 characters
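
The printed errors themselves are figures in the original slides; the sketch below shows one way of dumping misclassified development-set documents for inspection, on hypothetical toy data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical train/dev data, for illustration only.
train_docs = ["loved the food", "great film", "boring plot", "awful service"]
train_y = ["pos", "pos", "neg", "neg"]
dev_docs = ["the food was great", "what a boring film", "service was not awful"]
dev_y = ["pos", "neg", "pos"]

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(train_docs, train_y)

# Print the misclassified dev documents: reading them often suggests new
# features (e.g. word suffixes for language identification).
for doc, gold, pred in zip(dev_docs, dev_y, model.predict(dev_docs)):
    if gold != pred:
        print(f"gold={gold}  predicted={pred}  text={doc!r}")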

48

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 Owing to class imbalance, classifiers tend to predict popular

classes more often, e.g.:
class A (700), class B (100), class C (100), class D (100)
classifiers will tend to predict A, over-represented
as in our previous example:

49

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 1) Undersample popular class → A-100, B-100, C-100, D-100

randomly remove 600 instances of A

50

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 2) Oversample other classes → A-700, B-700, C-700, D-700

repeat instances of B, C, D to match the number of A’s

51

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 3) Create synthetic data → A-700, B-700, C-700, D-700

generate new B, C, D items → needs some understanding of the
contents of the classes to be able to produce sensible data items

52

ERROR ANALYSIS

 Error analysis: could our classifier be favouring one of the classes?
 How to deal with imbalance (i.e. A-700, B-100, C-100, D-100)?

 4) Cost sensitive learning

e.g. higher probability to predict uncommon classes
P(A)=1/700, P(B)=1/100, P(C)=1/100, P(D)=1/100

scikit-learn → class_weight="balanced" (formerly "auto")
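
A hedged sketch of two of the options above, undersampling (option 1) and cost-sensitive learning via class weights (option 4), on a made-up imbalanced dataset; in current scikit-learn the weighting option is spelled class_weight="balanced":

import random
from collections import Counter
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced dataset: 700 A and 100 each of B, C, D.
X = [[random.random(), random.random()] for _ in range(1000)]
y = ["A"] * 700 + ["B"] * 100 + ["C"] * 100 + ["D"] * 100

# Option 1: undersample the popular class down to 100 instances.
a_idx = [i for i, lab in enumerate(y) if lab == "A"]
keep = set(random.sample(a_idx, 100)) | {i for i, lab in enumerate(y) if lab != "A"}
X_under = [X[i] for i in sorted(keep)]
y_under = [y[i] for i in sorted(keep)]
print(Counter(y_under))   # 100 instances of each class

# Option 4: cost-sensitive learning; "balanced" reweights classes inversely
# to their frequency, so errors on rare classes cost more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)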

53

ERROR ANALYSIS

 Important for the error analysis:
 Subset we analyse for errors (dev set) has to be different to

the one where we ultimately apply the classifier (test set).
 If we tweak the classifier looking at the test set, we’ll end up

overfitting, developing a classifier that works very well for that
particular test set.

 NB: for exercise 2, you’re given 3 test sets; we’ll test it on 2 more
held-out test sets. Not looking at the test sets while developing will
improve generalisability.

54

ASSOCIATED READING

 Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language
Processing: An Introduction to Natural Language Processing, Computational
Linguistics, and Speech Recognition. 3rd edition. Chapters 6 and 7.

 Bird Steven, Ewan Klein, and Edward Loper. Natural Language
Processing with Python. O’Reilly Media, Inc., 2009. Chapter 6.
