Text Classification
COMP90042
Natural Language Processing
Lecture 4
Semester 1 2021 Week 2 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Outline
• Fundamentals of classification
• Text classification tasks
• Algorithms for classification
• Evaluation
Classification
• Input
‣ A document d
• Often represented as a vector of features
‣ A fixed output set of classes C = {c1, c2, …, ck}
• Categorical, not continuous (regression) or ordinal (ranking)
• Output
‣ A predicted class c ∈ C
Text Classification Tasks
• Some common examples
‣ Topic classification
‣ Sentiment analysis
‣ Native-language identification
‣ Natural language inference
‣ Automatic fact-checking
‣ Paraphrase
• Input may not be a long document
‣ E.g. sentence or tweet-level sentiment analysis
Topic Classification
Is the text about acquisitions or earnings?
LIEBERT CORP APPROVES MERGER
Liebert Corp said its shareholders approved the merger of a wholly-owned subsidiary of Emerson Electric Co. Under the terms of the merger, each Liebert shareholder will receive .3322 shares of Emerson stock for each Liebert share.
ANSWER: ACQUISITIONS
Topic Classification
• Motivation: library science, information retrieval
• Classes: Topic categories, e.g. “jobs”, “international news”
• Features
‣ Unigram bag of words (BOW), with stop-words removed
‣ Longer n-grams (bigrams, trigrams) for phrases
• Examples of corpora
‣ Reuters news corpus (RCV1; NLTK)
‣ Pubmed abstracts
‣ Tweets with hashtags
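The feature types listed for topic classification can be sketched in a few lines. This is a minimal illustration only: the tokeniser, the tiny `STOP_WORDS` set and the function names are assumptions made for the example, not part of the lecture.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "of", "a", "its", "said", "for", "each", "will", "under"}

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def ngrams(tokens, n):
    """Return all n-grams over tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow_features(text, max_n=2):
    """Count unigrams (stop-words removed) plus longer n-grams up to max_n."""
    tokens = tokenize(text)
    feats = Counter(tok for tok in tokens if tok not in STOP_WORDS)
    for n in range(2, max_n + 1):
        feats.update(ngrams(tokens, n))
    return feats

features = bow_features("Liebert Corp said its shareholders approved the merger")
```

Here stop-words are dropped only from the unigrams, so that phrases such as "the merger" survive as bigrams; filtering before building n-grams is an equally reasonable choice.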
Sentiment Analysis
What is the sentiment of this tweet?
anyone having problems with Windows 10? may be coincidental but since i downloaded, my WiFi keeps dropping out. Itunes had a malfunction
ANSWER: NEGATIVE
Sentiment Analysis
• Motivation: opinion mining, business analytics
• Classes: Positive/Negative/(Neutral)
• Features
‣ N-grams
‣ Polarity lexicons
• Examples of corpora
‣ Movie review dataset (in NLTK)
‣ SEMEVAL Twitter polarity datasets
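One simple way to turn a polarity lexicon into features is to count positive and negative entries in the text. The sketch below assumes a hypothetical toy lexicon (`POLARITY`) invented for illustration; real polarity lexicons are far larger.

```python
# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
POLARITY = {
    "good": 1, "great": 1, "love": 1,
    "problems": -1, "malfunction": -1, "dropping": -1,
}

def lexicon_features(tokens, lexicon=POLARITY):
    """Count lexicon hits of each polarity and their difference."""
    pos = sum(1 for t in tokens if lexicon.get(t, 0) > 0)
    neg = sum(1 for t in tokens if lexicon.get(t, 0) < 0)
    return {"n_positive": pos, "n_negative": neg, "net_polarity": pos - neg}

tweet_feats = lexicon_features("anyone having problems with windows 10".split())
```

These counts would normally be combined with n-gram features rather than used alone.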
Native-Language Identification
What is the native language of the writer of this text?
Based on the feedback given, how students revised their writing will be analyzed as well. However, since whether teachers tell their student to revise or not can depend on teachers, it is unsure the following analysis can be taken.
ANSWER: JAPANESE
Native-Language Identification
• Motivation: forensic linguistics, educational applications
• Classes: first language of author (e.g. Indonesian)
• Features
‣ Word N-grams
‣ Syntactic patterns (POS, parse trees)
‣ Phonological features
• Examples of corpora
‣ TOEFL/IELTS essay corpora
Natural Language Inference
What is the relationship between the first and second sentence (entailment vs. contradiction)?
1: A man inspects the uniform of a figure in some East Asian country.
2: The man is sleeping
ANSWER: CONTRADICTION
Natural Language Inference
• AKA textual entailment
• Motivation: language understanding
• Classes: entailment, contradiction, neutral
• Features
‣ Word overlap
‣ Length difference between the sentences
‣ N-grams
• Examples of corpora
‣ SNLI, MNLI
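Two of the features listed above, word overlap and length difference, are easy to compute for a sentence pair. In this sketch the overlap is measured as Jaccard similarity over lowercased whitespace tokens; that particular definition, like the function name, is an assumption chosen for illustration.

```python
def nli_features(premise, hypothesis):
    """Word overlap (Jaccard) and token-length difference for a sentence pair."""
    p = premise.lower().split()
    h = hypothesis.lower().split()
    p_set, h_set = set(p), set(h)
    union = p_set | h_set
    overlap = len(p_set & h_set) / len(union) if union else 0.0
    return {"word_overlap": overlap, "length_difference": len(p) - len(h)}

pair_feats = nli_features("a man sleeps", "a man runs")
```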
14
COMP90042
L4
Building a Text Classifier
1. Identify a task of interest
2. Collect an appropriate corpus
3. Carry out annotation
4. Select features
5. Choose a machine learning algorithm
6. Train model and tune hyperparameters using held-out development data
7. Repeat earlier steps as needed
8. Train final model
9. Evaluate model on held-out test data
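Steps 6 to 9 rely on keeping development and test data separate from the training data. A minimal sketch of such a split follows; the 80/10/10 proportions, the fixed seed and the function name are assumptions, not recommendations from the lecture.

```python
import random

def split_dataset(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off test and development portions."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

train_set, dev_set, test_set = split_dataset(range(100))
```

Hyperparameters are tuned against `dev_set` only; `test_set` is touched once, for the final evaluation.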
Algorithms for Classification
Choosing a Classification Algorithm
• Bias vs. Variance
‣ Bias: assumptions we made in our model
‣ Variance: sensitivity to training set
• Underlying assumptions, e.g., independence
• Complexity
• Speed
Naïve Bayes
• Finds the class with the highest likelihood under Bayes law
‣ i.e. probability of the class times probability of features given the class
P(C|F) ∝ P(F|C)P(C)
• Naïvely assumes features are independent
$p(c_n \mid f_1 \dots f_m) \propto p(c_n) \prod_{i=1}^{m} p(f_i \mid c_n)$
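A minimal sketch of a Naïve Bayes classifier over token counts follows. It works in log space to avoid underflow and uses add-one smoothing so that unseen class/feature combinations do not zero out a class. The smoothing choice, the handling of out-of-vocabulary tokens (skipped) and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Collect the counts needed for prediction."""
    class_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        feat_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, feat_counts, vocab

def predict_nb(model, tokens):
    """Return the class maximising log p(c) + sum_i log p(f_i | c)."""
    class_counts, feat_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior
        total = sum(feat_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            # add-one smoothed log likelihood
            score += math.log((feat_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_docs = [
    ("merger approved shareholders".split(), "acq"),
    ("shares merger stock".split(), "acq"),
    ("profit rose quarter".split(), "earn"),
    ("quarter net profit".split(), "earn"),
]
nb_model = train_nb(toy_docs)
```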
Naïve Bayes
• Pros:
‣ Fast to train and classify
‣ Robust, low-variance → good for low data situations
‣ Optimal classifier if independence assumption is correct
‣ Extremely simple to implement
• Cons:
‣ Independence assumption rarely holds
‣ Low accuracy compared to similar methods in most situations
‣ Smoothing required for unseen class/feature combinations
Logistic Regression
• A classifier, despite its name
• A linear model, but uses softmax “squashing” to get valid probability
$p(c_n \mid f_1 \dots f_m) = \frac{1}{Z} \exp\left(\sum_{i=0}^{m} w_i f_i\right)$
• Training maximizes probability of training data subject to regularization which encourages low or sparse weights
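The prediction step of logistic regression can be sketched as a linear score per class followed by softmax normalisation (the 1/Z term). The sketch assumes one weight vector per class and a feature vector whose first entry f_0 is fixed to 1 as a bias term; training is not shown.

```python
import math

def softmax(scores):
    """Exponentiate and normalise; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict_proba(weights, features):
    """weights: one weight vector per class; features: f_0..f_m with f_0 = 1."""
    scores = [sum(w * f for w, f in zip(w_c, features)) for w_c in weights]
    return softmax(scores)

class_probs = predict_proba([[0.0, 1.0], [0.0, -1.0]], [1.0, 2.0])
```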
Logistic Regression
• Pros:
‣ Unlike Naïve Bayes not confounded by diverse, correlated features → better performance
• Cons:
‣ Slow to train
‣ Feature scaling needed
‣ Requires a lot of data to work well in practice
‣ Choosing regularisation strategy is important since overfitting is a big problem
Support Vector Machines
• Finds hyperplane which separates the training data with maximum margin
• Pros:
• Fast and accurate linear classifier
• Can do non-linearity with kernel trick
• Works well with huge feature sets
• Cons:
• Multiclass classification awkward
• Feature scaling needed
• Deals poorly with class imbalances
• Interpretability
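To make "maximum margin" concrete, the sketch below computes the geometric margin of a candidate hyperplane w·x + b = 0: the smallest signed distance from any training point to the hyperplane. It does not train an SVM; it only compares given hyperplanes, and the toy data and function names are assumptions.

```python
import math

def functional_score(w, b, x):
    """Linear score w·x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, data):
    """Smallest signed distance to the hyperplane; labels y are -1 or +1.
    A negative value means at least one point is misclassified."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * functional_score(w, b, x) / norm for x, y in data)

toy_points = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((-2.0, 0.0), -1), ((-3.0, -1.0), -1)]
margin_vertical = geometric_margin((1.0, 0.0), 0.0, toy_points)
margin_diagonal = geometric_margin((1.0, 1.0), 0.0, toy_points)
```

Both hyperplanes separate the toy data, but the first leaves a wider margin, so it is the one a maximum-margin classifier would prefer of the two.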
Prior to deep learning, SVMs were very popular for NLP. Why?
• Non-linear kernel trick works well for text
• Feature scaling is not an issue for NLP
• NLP datasets are usually large, which favours SVM
• NLP problems often involve large feature sets
K-Nearest Neighbour
• Classify based on majority class of k-nearest training examples in feature space
• Definition of nearest can vary
‣ Euclidean distance
‣ Cosine distance
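A minimal sketch of k-nearest-neighbour classification with a pluggable distance function follows. The dense feature vectors, the toy training set and the default k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    """1 minus cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train, x, k=3, distance=cosine_distance):
    """train: list of (vector, label). Majority vote over the k closest examples."""
    neighbours = sorted(train, key=lambda item: distance(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

toy_train = [
    ((1.0, 0.0), "A"), ((0.9, 0.1), "A"),
    ((0.0, 1.0), "B"), ((0.1, 0.9), "B"), ((0.2, 1.0), "B"),
]
```

Sorting the whole training set for every query is what makes the naive version slow; it is kept here only for clarity.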
K-Nearest Neighbour
• Pros:
• Simple but surprisingly effective
• No training required
• Inherently multiclass
• Optimal classifier with infinite data
• Cons:
• Have to select k
• Issues with imbalanced classes
• Often slow (for finding the neighbours)
• Features must be selected carefully
Decision tree
• Construct a tree where nodes correspond to tests on individual features
• Leaves are final class decisions
• Based on greedy maximization of mutual information
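The greedy step can be sketched as picking, at each node, the feature test with the highest information gain, which equals the mutual information between the test outcome and the class label. The sketch below shows only this selection step, assuming features stored as dictionaries of boolean tests; it does not build a full tree.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, feature):
    """examples: list of (feature_dict, label). Gain = H(Y) - H(Y | feature)."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    for value in {feats.get(feature, False) for feats, _ in examples}:
        subset = [label for feats, label in examples if feats.get(feature, False) == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def best_feature(examples, features):
    """Greedy choice: the feature whose test is most informative about the class."""
    return max(features, key=lambda f: information_gain(examples, f))

toy_examples = [
    ({"merger": True, "the": True}, "acq"),
    ({"merger": True, "the": True}, "acq"),
    ({"merger": False, "the": True}, "earn"),
    ({"merger": False, "the": True}, "earn"),
]
```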
Decision tree
• Pros:
• Fast to build and test
• Feature scaling irrelevant
• Good for small feature sets
• Handles non-linearly-separable problems
• Cons:
• In practice, not that interpretable
• Highly redundant sub-trees
• Not competitive for large feature sets
Random Forests
• An ensemble classifier
• Consists of decision trees trained on different subsets of the training data and of the feature space
• Final class decision is majority vote of sub-classifiers
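The two ingredients that distinguish a random forest from a single tree can be sketched separately from tree learning itself: drawing a random subset of examples and features for each tree, and combining the trees' outputs by majority vote. Sampling examples with replacement (bootstrapping) is an assumption here, as are the function names; the tree learner is left out.

```python
import random
from collections import Counter

def bootstrap_sample(examples, n_features, rng):
    """examples: list of (feature_vector, label).
    Sample rows with replacement and choose a random subset of feature indices."""
    rows = [rng.choice(examples) for _ in examples]
    n_total = len(examples[0][0])
    feature_ids = sorted(rng.sample(range(n_total), n_features))
    return rows, feature_ids

def majority_vote(predictions):
    """Final class decision: the label predicted by the most sub-classifiers."""
    return Counter(predictions).most_common(1)[0][0]

toy_data = [((1, 0, 1), "A"), ((0, 1, 1), "B"), ((1, 1, 0), "A"), ((0, 0, 1), "B")]
sample_rows, sample_features = bootstrap_sample(toy_data, 2, random.Random(0))
```

Each tree would be trained on its own `sample_rows`, looking only at `sample_features`; because the trees are independent, this loop is easy to parallelise.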
Random Forests
• Pros:
• Usually more accurate and more robust than decision trees
• Great classifier for medium feature sets
• Training easily parallelised
• Cons:
• Interpretability
• Slow with large feature sets
Neural Networks
• An interconnected set of nodes typically arranged in layers
• Input layer (features), output layer (class probabilities), and one or more hidden layers
• Each node performs a linear weighting of its inputs from previous layer, passes result through activation function to nodes in next layer
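The forward pass described above can be sketched for a network with a single hidden layer. The ReLU activation, the softmax output and the hand-picked weights are assumptions for illustration; training (backpropagation) is not shown.

```python
import math

def dense(inputs, weights, biases):
    """Each output node takes a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (ReLU) -> output layer (class probabilities)."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))

nn_probs = forward(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0],  # hidden-layer weights and biases
    [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],   # output-layer weights and biases
)
```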
Neural Networks
• Pros:
• Extremely powerful, dominant method in NLP and vision
• Little feature engineering
• Cons:
• Not an off-the-shelf classifier
• Many hyper-parameters, difficult to optimise
• Slow to train
• Prone to overfitting
Hyper-parameter Tuning
• Dataset for tuning
‣ Development set
‣ Not the training set or the test set
‣ k-fold cross-validation
• Specific hyper-parameters are classifier specific
‣ E.g. tree depth for decision trees
• But many hyper-parameters relate to regularisation
‣ Regularisation hyper-parameters penalise model complexity
‣ Used to prevent overfitting
• For multiple hyper-parameters, use grid search
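Both k-fold cross-validation and grid search can be sketched without committing to a particular classifier. In the sketch below, `score_fn` is a hypothetical callback that trains a model with the given hyper-parameters and returns its development score; the fold assignment (every k-th example) is likewise an assumption.

```python
from itertools import product

def k_fold_indices(n_examples, k):
    """Yield (train_indices, dev_indices) for each of the k folds."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def grid_search(param_grid, score_fn):
    """Score every combination of values in param_grid and keep the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

all_folds = list(k_fold_indices(10, 5))
# Stand-in scoring function that peaks at depth=2, c=1.0 (purely illustrative).
best, best_value = grid_search(
    {"depth": [1, 2, 3], "c": [0.1, 1.0]},
    lambda p: -abs(p["depth"] - 2) - abs(p["c"] - 1.0),
)
```

In practice `score_fn` would average the score over the folds from `k_fold_indices`. The number of combinations grows multiplicatively with each added hyper-parameter.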
Evaluation
Evaluation: Accuracy
              Classified As
Class         A      B
A             79     13
B             8      10
Accuracy = correct classifications / total classifications
= (79 + 10)/(79 + 13 + 8 + 10) = 0.81
0.81 looks good, but most common class baseline accuracy is
= (79 + 13)/(79 + 13 + 8 + 10) = 0.84
Evaluation: Precision & Recall
              Classified As
Class         A      B
A             79     13
B             8      10
B as “positive class”
True Positives (tp) = 10, False Positives (fp) = 13, False Negatives (fn) = 8
Precision = correct classifications of B (tp) / total classifications as B (tp + fp)
= 10/(10 + 13) = 0.43
Recall = correct classifications of B (tp) / total instances of B (tp + fn)
= 10/(10 + 8) = 0.56
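The accuracy, precision and recall figures above can be reproduced directly from the confusion matrix. The nested-dictionary layout (`CONFUSION[true_class][predicted_class]`) and the function names are assumptions made for this sketch.

```python
# Confusion matrix from the slides: CONFUSION[true_class][predicted_class].
CONFUSION = {"A": {"A": 79, "B": 13}, "B": {"A": 8, "B": 10}}

def accuracy(cm):
    """Correct classifications (the diagonal) over all classifications."""
    correct = sum(cm[c][c] for c in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total

def precision(cm, positive):
    """tp / (tp + fp): of everything classified as positive, how much is right."""
    tp = cm[positive][positive]
    predicted_positive = sum(cm[c][positive] for c in cm)
    return tp / predicted_positive

def recall(cm, positive):
    """tp / (tp + fn): of all true positive instances, how many were found."""
    tp = cm[positive][positive]
    return tp / sum(cm[positive].values())

acc = accuracy(CONFUSION)
prec_b = precision(CONFUSION, "B")
rec_b = recall(CONFUSION, "B")
```

The sketch does not guard against a class that is never predicted, in which case the precision denominator would be zero.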
Evaluation: F(1)-score
• Harmonic mean of precision and recall
$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
• Like precision and recall, defined relative to a specific positive class
• But can be used as a general multiclass metric
‣ Macroaverage: Average F-score across classes
‣ Microaverage: Calculate F-score using sum of counts (= accuracy for multiclass problems)
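Macro- and micro-averaged F1 can be sketched from per-class counts, using the equivalent form F1 = 2tp / (2tp + fp + fn). The example reuses the two-class confusion matrix from the earlier evaluation slides; the data layout and function names are assumptions.

```python
# Confusion matrix from the slides: CONFUSION[true_class][predicted_class].
CONFUSION = {"A": {"A": 79, "B": 13}, "B": {"A": 8, "B": 10}}

def counts(cm, positive):
    """Return (tp, fp, fn) treating `positive` as the positive class."""
    tp = cm[positive][positive]
    fp = sum(cm[c][positive] for c in cm if c != positive)
    fn = sum(cm[positive][c] for c in cm if c != positive)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2tp / (2tp + fp + fn), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(cm):
    """Average the per-class F1 scores, weighting every class equally."""
    return sum(f1_from_counts(*counts(cm, c)) for c in cm) / len(cm)

def micro_f1(cm):
    """Sum tp, fp and fn over all classes first, then compute a single F1."""
    totals = [sum(x) for x in zip(*(counts(cm, c) for c in cm))]
    return f1_from_counts(*totals)

f1_b = f1_from_counts(*counts(CONFUSION, "B"))
```

On this matrix the micro-average comes out equal to the accuracy (89/110), as the slide notes, while the macro-average is pulled down by the weaker class B.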
A Final Word
• Lots of algorithms available to try out on your task of interest (see scikit-learn)
• But if good results on a new task are your goal, then well-annotated, plentiful datasets and appropriate features are often more important than the specific algorithm used
Further Reading
‣ E18 Ch 4.1, 4.3-4.4.1
‣ E18 Chs 2 & 3: reviews linear and non-linear classification algorithms