
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

COMP90042
Natural Language Processing

Lecture 4: Text Classification
Semester 1 2021, Week 2

Jey Han Lau


Outline

• Fundamentals of classification

• Text classification tasks

• Algorithms for classification

• Evaluation


Classification

• Input
‣ A document d
  • Often represented as a vector of features
‣ A fixed output set of classes C = {c1, c2, …, ck}
  • Categorical, not continuous (regression) or ordinal (ranking)

• Output
‣ A predicted class c ∈ C


Text Classification Tasks

• Some common examples
‣ Topic classification
‣ Sentiment analysis
‣ Native-language identification
‣ Natural language inference
‣ Automatic fact-checking
‣ Paraphrase detection

• Input may not be a long document
‣ E.g. sentence or tweet-level sentiment analysis


Topic Classification

Is the text about acquisitions or earnings?

LIEBERT CORP APPROVES MERGER
Liebert Corp said its shareholders approved the merger of a wholly-
owned subsidiary of Emerson Electric Co. Under the terms of the
merger, each Liebert shareholder will receive .3322 shares of
Emerson stock for each Liebert share.

ANSWER: ACQUISITIONS


Topic Classification

• Motivation: library science, information retrieval
• Classes: Topic categories, e.g. “jobs”, “international news”

• Features
‣ Unigram bag of words (BOW), with stop-words removed
‣ Longer n-grams (bigrams, trigrams) for phrases

• Examples of corpora
‣ Reuters news corpus (RCV1; NLTK)
‣ PubMed abstracts
‣ Tweets with hashtags
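A rough sketch of these features in code, assuming scikit-learn's CountVectorizer; the two documents and their labels are invented toy examples, not the Reuters corpus.

```python
# Minimal sketch: unigram + bigram bag-of-words features with stop-words removed.
# The documents and labels are toy examples for illustration only.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Liebert shareholders approved the merger with Emerson Electric",
    "The company reported quarterly earnings above analyst forecasts",
]
labels = ["acquisitions", "earnings"]

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(docs)              # sparse document-term count matrix

print(X.shape)                                  # (2 documents, |vocabulary| features)
print(vectorizer.get_feature_names_out()[:10])
```

Each row of X is the feature vector for one document and can be fed to any of the classifiers discussed later in this lecture.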


Sentiment Analysis

What is the sentiment of this tweet?

anyone having problems with Windows 10? may be coincidental but
since i downloaded, my WiFi keeps dropping out. Itunes had a
malfunction

ANSWER: NEGATIVE


Sentiment Analysis

• Motivation: opinion mining, business analytics

• Classes: Positive/Negative/(Neutral)

• Features
‣ N-grams
‣ Polarity lexicons

• Examples of corpora
‣ Movie review dataset (in NLTK)
‣ SEMEVAL Twitter polarity datasets



Native-Language Identification

What is the native language of the writer of this text?

Based on the feedback given, how students revised their writing will
be analyzed as well. However, since whether teachers tell their
student to revise or not can depend on teachers, it is unsure the
following analysis can be taken.

ANSWER: JAPANESE

PollEv.com/jeyhanlau569


Native-Language Identification

• Motivation: forensic linguistics, educational applications

• Classes: first language of author (e.g. Indonesian)

• Features
‣ Word N-grams
‣ Syntactic patterns (POS, parse trees)
‣ Phonological features

• Examples of corpora
‣ TOEFL/IELTS essay corpora


Natural Language Inference

What is the relationship between the first and
second sentence (entailment vs. contradiction)?

1: A man inspects the uniform of a figure in some East Asian country.

2: The man is sleeping

ANSWER: CONTRADICTION


Natural Language Inference

• AKA textual entailment

• Motivation: language understanding

• Classes: entailment, contradiction, neutral

• Features
‣ Word overlap
‣ Length difference between the sentences
‣ N-grams

• Examples of corpora
‣ SNLI, MNLI
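As an illustration of the kind of hand-crafted features listed above, here is a small sketch that computes word overlap and length difference for a sentence pair; the function name, the whitespace tokenisation and the exact feature choices are assumptions made for the example.

```python
# Sketch of simple pairwise features for natural language inference.
# Whitespace tokenisation and the feature set are illustrative choices.
def pair_features(premise: str, hypothesis: str) -> dict:
    p_tokens = set(premise.lower().split())
    h_tokens = set(hypothesis.lower().split())
    overlap = len(p_tokens & h_tokens)
    return {
        "word_overlap": overlap,                               # shared word types
        "overlap_ratio": overlap / max(len(h_tokens), 1),      # overlap relative to hypothesis
        "length_diff": len(premise.split()) - len(hypothesis.split()),
    }

print(pair_features(
    "A man inspects the uniform of a figure in some East Asian country.",
    "The man is sleeping",
))
```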


Building a Text Classifier

1. Identify a task of interest
2. Collect an appropriate corpus
3. Carry out annotation
4. Select features
5. Choose a machine learning algorithm
6. Train model and tune hyperparameters using held-out development data
7. Repeat earlier steps as needed
8. Train final model
9. Evaluate model on held-out test data
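A minimal end-to-end sketch of steps 4–9 using scikit-learn; the toy corpus, the single train/test split (a separate development split for tuning is omitted for brevity), and the choice of logistic regression over bag-of-words features are all assumptions made for illustration.

```python
# Toy end-to-end text classification pipeline (features -> model -> evaluation).
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

texts = ["stocks rose after the merger", "quarterly earnings beat forecasts",
         "the team won the final match", "injury forces star player to retire",
         "shareholders approved the acquisition", "profits fell in the last quarter",
         "the coach announced the squad", "revenue grew despite weak sales"]
labels = ["finance", "finance", "sport", "sport",
          "finance", "finance", "sport", "finance"]

# Hold out data for evaluation (a development split for tuning would be held out too).
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=0)

# Step 4 (features) and step 5 (learning algorithm) combined in one pipeline.
model = Pipeline([("bow", CountVectorizer()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)                                    # train the model
print(accuracy_score(y_test, model.predict(X_test)))           # evaluate on held-out data
```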


Algorithms for Classification


Choosing a Classification Algorithm

• Bias vs. Variance

‣ Bias: the assumptions we make in our model

‣ Variance: sensitivity to the training set

• Underlying assumptions, e.g., independence

• Complexity

• Speed


Naïve Bayes

• Finds the class with the highest probability under Bayes' law
‣ i.e. probability of the class times probability of the features given the class

• Naïvely assumes features are independent

P(C | F) ∝ P(F | C) P(C)

P(c_n | f_1 … f_m) ∝ P(c_n) ∏_{i=1}^{m} P(f_i | c_n)
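A tiny sketch of this decision rule; the class priors, the smoothed feature likelihoods and the helper name below are invented for the example.

```python
# Naive Bayes decision rule: choose the class c maximising P(c) * prod_i P(f_i | c).
# Working in log space turns the product into a sum and avoids underflow.
import math

priors = {"acquisitions": 0.5, "earnings": 0.5}              # P(c): toy values
likelihoods = {                                               # P(f_i | c): toy, pre-smoothed values
    "acquisitions": {"merger": 0.10, "shares": 0.05, "profit": 0.01},
    "earnings":     {"merger": 0.01, "shares": 0.03, "profit": 0.12},
}

def classify(features):
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(math.log(likelihoods[c][f]) for f in features)
    return max(scores, key=scores.get)

print(classify(["merger", "shares"]))    # -> "acquisitions"
```

In practice a library implementation such as scikit-learn's MultinomialNB would be used, which also applies the smoothing needed for unseen class/feature combinations.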


Naïve Bayes
• Pros:

‣ Fast to train and classify

‣ Robust, low-variance → good for low-data situations

‣ Optimal classifier if independence assumption is correct

‣ Extremely simple to implement

• Cons:

‣ Independence assumption rarely holds

‣ Low accuracy compared to similar methods in most situations

‣ Smoothing required for unseen class/feature combinations


Logistic Regression

• A classifier, despite its name

• A linear model, but uses softmax “squashing” to get a valid probability distribution

• Training maximises the probability of the training data, subject to regularisation which encourages low or sparse weights

P(c_n | f_1 … f_m) = (1/Z) · exp(∑_{i=0}^{m} w_i f_i)
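A small numeric sketch of this softmax form; the weight matrix and feature values are invented toy numbers.

```python
# Softmax "squashing": per-class scores w·f are exponentiated and divided by Z
# (the sum of exponentiated scores) to give a valid probability distribution.
import numpy as np

f = np.array([1.0, 2.0, 0.0])                 # feature values (f_0 can serve as a bias feature)
W = np.array([[ 0.5, -0.2, 0.1],              # one weight vector per class
              [-0.3,  0.4, 0.0]])

scores = W @ f                                # sum_i w_i * f_i for each class
probs = np.exp(scores) / np.exp(scores).sum() # divide by Z
print(probs, probs.sum())                     # probabilities summing to 1
```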


Logistic Regression

• Pros:

‣ Unlike Naïve Bayes, not confounded by diverse, correlated features → better performance

• Cons:

‣ Slow to train

‣ Feature scaling needed

‣ Requires a lot of data to work well in practice

‣ Choosing regularisation strategy is important since
overfitting is a big problem


Support Vector Machines
• Finds hyperplane which separates the training data with maximum margin

• Pros:

• Fast and accurate linear classifier

• Can do non-linearity with kernel trick

• Works well with huge feature sets

• Cons:

• Multiclass classification awkward

• Feature scaling needed

• Deals poorly with class imbalances

• Interpretability
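A minimal sketch of a linear SVM text classifier with scikit-learn; the toy documents, the TF-IDF features and the use of LinearSVC are illustrative assumptions.

```python
# Linear SVM over TF-IDF features; TF-IDF keeps feature values on a comparable
# scale, which matters because SVMs are sensitive to feature scaling.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the merger was approved", "profits beat forecasts",
        "shares issued for the acquisition", "quarterly revenue fell"]
labels = ["acquisitions", "earnings", "acquisitions", "earnings"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
model.fit(docs, labels)
print(model.predict(["shareholders approved the merger"]))
```

For multiclass problems LinearSVC falls back to a one-vs-rest scheme, which is one way the “awkward multiclass” issue shows up in practice.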


Prior to deep learning, why were SVMs so popular for NLP?

• Non-linear kernel trick works well for text
• Feature scaling is not an issue for NLP
• NLP datasets are usually large, which favours SVMs
• NLP problems often involve large feature sets

PollEv.com/jeyhanlau569


K-Nearest Neighbour

• Classify based on majority class of k-nearest training
examples in feature space

• Definition of nearest can vary
‣ Euclidean distance
‣ Cosine distance
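A small sketch of k-NN over bag-of-words vectors using cosine distance; the toy documents and the choice of k = 3 are assumptions for the example.

```python
# k-nearest-neighbour classification: predict the majority class of the
# k closest training documents, with "closest" defined by cosine distance.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["merger approved by shareholders", "profit and revenue grew",
        "acquisition of the subsidiary", "earnings fell this quarter",
        "shares exchanged in the merger"]
labels = ["acquisitions", "earnings", "acquisitions", "earnings", "acquisitions"]

X = CountVectorizer().fit_transform(docs)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, labels)                 # "training" just stores the examples
print(knn.predict(X[:1]))
```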


K-Nearest Neighbour
• Pros:

• Simple but surprisingly effective
• No training required
• Inherently multiclass
• Optimal classifier with infinite data

• Cons:
• Have to select k
• Issues with imbalanced classes
• Often slow (for finding the neighbours)
• Features must be selected carefully


Decision Tree

• Construct a tree where nodes correspond to tests on
individual features

• Leaves are final class decisions
• Based on greedy maximization of mutual information
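A small sketch with scikit-learn's DecisionTreeClassifier; the toy documents are invented, and criterion="entropy" is used so that splits are chosen by information gain, corresponding to the greedy mutual-information criterion described above.

```python
# Decision tree over bag-of-words counts: each internal node tests one feature,
# chosen greedily by information gain; leaves give the final class decision.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

docs = ["merger approved", "strong earnings", "merger of subsidiary", "earnings fell"]
labels = ["acquisitions", "earnings", "acquisitions", "earnings"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```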


Decision Tree

• Pros:
• Fast to build and test
• Feature scaling irrelevant
• Good for small feature sets
• Handles non-linearly-separable problems

• Cons:
• In practice, not that interpretable
• Highly redundant sub-trees
• Not competitive for large feature sets


Random Forests

• An ensemble classifier
• Consists of decision trees trained on different subsets of the training data and feature space
• Final class decision is majority vote of sub-classifiers
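A minimal sketch with scikit-learn's RandomForestClassifier; the toy data and the number of trees are assumptions.

```python
# Random forest: an ensemble of decision trees, each trained on a bootstrap
# sample of the data and random feature subsets; prediction is a majority vote.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

docs = ["merger approved", "strong earnings", "merger of subsidiary",
        "earnings fell", "acquisition completed", "record quarterly profit"]
labels = ["acquisitions", "earnings", "acquisitions",
          "earnings", "acquisitions", "earnings"]

X = CountVectorizer().fit_transform(docs)

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)  # trees trained in parallel
forest.fit(X, labels)
print(forest.predict(X[:2]))
```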


Random Forests

• Pros:
• Usually more accurate and more robust than decision trees
• Great classifier for medium feature sets
• Training easily parallelised

• Cons:
• Interpretability
• Slow with large feature sets


Neural Networks
• An interconnected set of nodes typically arranged in layers
• Input layer (features), output layer (class probabilities), and one or more hidden layers

• Each node performs a linear weighting of its inputs from the previous layer, and passes the result through an activation function to nodes in the next layer
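A bare-bones numeric sketch of a single forward pass through one hidden layer; the layer sizes, weights and feature vector are invented toy values.

```python
# Forward pass: each layer applies a linear weighting of its inputs,
# then a non-linear activation; the output layer is squashed into probabilities.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 2.0])                               # input layer: feature values
W1, b1 = rng.normal(scale=0.1, size=(4, 3)), np.zeros(4)    # hidden layer (4 nodes)
W2, b2 = rng.normal(scale=0.1, size=(2, 4)), np.zeros(2)    # output layer (2 classes)

h = np.tanh(W1 @ x + b1)                        # linear weighting + activation function
scores = W2 @ h + b2
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: class probabilities
print(probs)
```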


Neural Networks
• Pros:

• Extremely powerful, dominant method in NLP and vision
• Little feature engineering

• Cons:
• Not an off-the-shelf classifier
• Many hyper-parameters, difficult to optimise
• Slow to train
• Prone to overfitting


Hyper-parameter Tuning

• Dataset for tuning
‣ Development set
‣ Not the training set or the test set
‣ k-fold cross-validation

• Specific hyper-parameters are classifier-specific
‣ E.g. tree depth for decision trees

• But many hyper-parameters relate to regularisation
‣ Regularisation hyper-parameters penalise model complexity
‣ Used to prevent overfitting

• For multiple hyper-parameters, use grid search
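A minimal grid-search sketch with scikit-learn; the pipeline, the regularisation grid over C, the 3-fold cross-validation and the toy data are all assumptions for illustration.

```python
# Grid search over the inverse regularisation strength C, using k-fold
# cross-validation on the training data in place of a fixed development set.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

docs = ["merger approved", "strong earnings", "merger of subsidiary",
        "earnings fell", "acquisition completed", "record quarterly profit"]
labels = ["acquisitions", "earnings", "acquisitions",
          "earnings", "acquisitions", "earnings"]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Smaller C means a stronger penalty on model complexity (more regularisation).
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
grid.fit(docs, labels)
print(grid.best_params_)
```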


Evaluation


Evaluation: Accuracy

            Classified As
Class        A      B
  A         79     13
  B          8     10

Accuracy = correct classifications / total classifications
         = (79 + 10) / (79 + 13 + 8 + 10)
         = 0.81

0.81 looks good, but the most-common-class baseline accuracy is
(79 + 13) / (79 + 13 + 8 + 10) = 0.84
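The same numbers, reproduced in a short Python sketch; representing the confusion matrix as a dictionary is just a convenient assumption for the example.

```python
# Accuracy from the confusion matrix, and the majority-class baseline.
confusion = {("A", "A"): 79, ("A", "B"): 13,    # (true class, predicted class): count
             ("B", "A"): 8,  ("B", "B"): 10}

total = sum(confusion.values())
accuracy = (confusion[("A", "A")] + confusion[("B", "B")]) / total      # (79 + 10) / 110

# Always predicting the most common class (A) already gets all 79 + 13 A instances right.
baseline = (confusion[("A", "A")] + confusion[("A", "B")]) / total
print(round(accuracy, 2), round(baseline, 2))                           # 0.81 0.84
```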


Evaluation: Precision & Recall
            Classified As
Class        A      B
  A         79     13
  B          8     10

With B as the “positive class”: true positives (tp) = 10, false positives (fp) = 13, false negatives (fn) = 8

Precision = correct classifications of B (tp) / total classifications as B (tp + fp)
          = 10 / (10 + 13) = 0.43

Recall = correct classifications of B (tp) / total instances of B (tp + fn)
       = 10 / (10 + 8) = 0.56


Evaluation: F(1)-score

• Harmonic mean of precision and recall

• Like precision and recall, defined relative to a
specific positive class

• But can be used as a general multiclass metric
‣ Macroaverage: Average F-score across classes
‣ Microaverage: Calculate F-score using the sum of counts (= accuracy for multiclass problems)

F1 = (2 × precision × recall) / (precision + recall)
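A short sketch recomputing these metrics; the explicit tp/fp/fn values come from the confusion matrix above, and the scikit-learn call at the end illustrates macro versus micro averaging on label lists reconstructed from the same matrix.

```python
# Precision, recall and F1 for class B, plus macro/micro averaging.
from sklearn.metrics import precision_recall_fscore_support

tp, fp, fn = 10, 13, 8                          # B as the positive class
precision = tp / (tp + fp)                      # 0.43
recall = tp / (tp + fn)                         # 0.56
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))

# Rebuild the label lists implied by the confusion matrix.
y_true = ["A"] * 92 + ["B"] * 18
y_pred = ["A"] * 79 + ["B"] * 13 + ["A"] * 8 + ["B"] * 10

print(precision_recall_fscore_support(y_true, y_pred, average="macro"))  # average per-class scores
print(precision_recall_fscore_support(y_true, y_pred, average="micro"))  # pool the counts (= accuracy)
```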


A Final Word

• Lots of algorithms available to try out on your task
of interest (see scikit-learn)

• But if good results on a new task are your goal, then well-annotated, plentiful datasets and appropriate features are often more important than the specific algorithm used


Further Reading

‣ E18 Ch 4.1, 4.3-4.4.1
‣ E18 Chs 2 & 3: reviews linear and non-linear classification algorithms