Workshop 3
COMP90051 Natural Language Processing (S1 2020)
Jun Wang
• Online lectures and tutorials
• Recording
• Questions
Materials
• Download files
• Workshop-03.pdf
• 03-classification.ipynb
• 04-ngram.ipynb
• From Canvas – Modules – Workshops – Worksheets / Notebooks
Learning Outcomes
• Text classification
• Definition, applications, challenges …
• Algorithms
• N-gram language model
• Different N
• Smoothed vs. unsmoothed models
• Back-off and Interpolation
Text Classification
1. What is text classification? Give some examples.
• Text classification is the task of classifying text documents into different labels.
• Input
  • a document d
  • a fixed set of labels C
• Output
  • a predicted class c ∈ C
Sentiment Analysis
• Document d: I like this movie.
• Labels: Positive, Negative
[Diagram: Doc d → Classifier → Positive]
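In code, a classifier is simply a function from a document d to a label c ∈ C. Here is a minimal sketch using an invented word-lexicon rule (a toy illustration, not a model from this workshop):

```python
# Toy illustration of the classification setup: a document d goes in,
# a class c from the fixed label set C comes out. The lexicons are
# made up for illustration.
C = {"Positive", "Negative"}

POSITIVE_WORDS = {"like", "love", "great", "good"}
NEGATIVE_WORDS = {"hate", "boring", "bad", "awful"}

def classify(d: str) -> str:
    """A trivial lexicon classifier: count sentiment words (ties -> Positive)."""
    tokens = d.lower().split()
    score = (sum(t in POSITIVE_WORDS for t in tokens)
             - sum(t in NEGATIVE_WORDS for t in tokens))
    return "Positive" if score >= 0 else "Negative"

print(classify("I like this movie."))  # -> Positive
```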
Text classification (cont.)
• Examples
• Topic classification
• Sentiment analysis
• Authorship attribution
• Native-language identification
• Automatic fact-checking
Text classification
(a) Why is text classification generally a difficult problem? What are some hurdles that need to be overcome?
• Problem
  • Document representation
    • BOW (illustrated below)
Bag-Of-Words
• Document A: I like natural language processing
• Document B: I am playing a game
• Document C: The aims for this subject is to develop an understanding of natural language processing
        i  like  natural  language  processing  am  playing  a  game  the  aims  for  …  of
Doc A   1  1     1        1         1           0   0        0  0     0    0     0    …  0
Doc B   1  0     0        0         0           1   1        1  1     0    0     0    …  0
Doc C   1  0     1        1         1           0   0        0  0     1    1     1    …  1
• What is the length of the vectors?
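As a sketch, the table above can be reproduced with scikit-learn's CountVectorizer (assuming binary presence indicators; note its default tokeniser drops one-character tokens like “I” and “a”, so the exact vocabulary will differ slightly from the table):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I like natural language processing",
    "I am playing a game",
    "The aims for this subject is to develop an understanding "
    "of natural language processing",
]

# binary=True records presence/absence (0/1) rather than raw counts,
# matching the table above.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary
print(X.shape)  # (3, |V|): one vector per document
print(X.toarray())
```

Each document vector has length |V|, the size of the vocabulary over the whole collection.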
Text classification (cont.)
• Hurdles to overcome
  • Document representation (BOW)
  • Feature selection
  • Sparse data problem
Classifier
• (b) Consider some (supervised) text classification problem, and discuss whether the following (supervised) machine learning models would be suitable:
i. k-Nearest Neighbour using Euclidean distance
ii. k-Nearest Neighbour using Cosine similarity
iii. Decision Trees using Information Gain
iv. Naive Bayes
v. Logistic Regression
vi. Support Vector Machines
Classifier
• It depends on
• Number of Features
• Number of classes
• Number of instances
• Underlying assumptions
• Complexity
• Speed
• …
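One way to weigh these factors is simply to run the candidates from (b) side by side. A schematic sketch on an invented four-document sentiment set (toy data, not from the workshop notebook):

```python
# Schematic comparison of the candidate classifiers. Illustrative only:
# four training documents are far too few to draw real conclusions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

train_docs = ["I like this movie", "great film, I love it",
              "boring and bad", "I hate this movie"]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

vec = CountVectorizer().fit(train_docs)
X_train = vec.transform(train_docs)
X_test = vec.transform(["I love this film"])

models = [
    KNeighborsClassifier(n_neighbors=1),                   # k-NN, Euclidean (default)
    KNeighborsClassifier(n_neighbors=1, metric="cosine"),  # k-NN, cosine
    DecisionTreeClassifier(criterion="entropy"),           # splits by information gain
    MultinomialNB(),
    LogisticRegression(),
    LinearSVC(),
]
for model in models:
    model.fit(X_train, train_labels)
    print(type(model).__name__, model.predict(X_test))
```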
KNN
• Classify based on majority class of k-nearest training examples in feature space
• Struggles with high-dimensional feature spaces such as BOW vectors
Euclidean distance vs Cosine similarity
• Doc A: 11111000000000000000
• Doc B: 10000111100000000000
• Doc C: 10111000011111111111
• Euclidean distance: d(q, p) = √(Σᵢ (qᵢ − pᵢ)²)
• Cosine similarity: cos(a, b) = (a · b) / (‖a‖ ‖b‖)
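Checking both measures on the vectors above makes the contrast concrete: Euclidean distance puts Doc A nearest to Doc B, while cosine similarity, which normalises away document length, puts Doc A nearest to the much longer Doc C. A small sketch:

```python
import numpy as np

# The 20-dimensional binary BOW vectors from the slide.
A = np.array([int(c) for c in "11111000000000000000"])
B = np.array([int(c) for c in "10000111100000000000"])
C = np.array([int(c) for c in "10111000011111111111"])

def euclidean(q, p):
    # d(q, p) = sqrt(sum_i (q_i - p_i)^2)
    return np.sqrt(np.sum((q - p) ** 2))

def cosine(a, b):
    # cos(a, b) = (a . b) / (||a|| ||b||)
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

for name, x, y in [("A,B", A, B), ("A,C", A, C), ("B,C", B, C)]:
    print(f"{name}  euclidean: {euclidean(x, y):.3f}  cosine: {cosine(x, y):.3f}")
```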
Decision Tree
• Construct a tree where nodes correspond to tests on individual features
• Feature selection: Information Gain
  • Information Gain tends to prefer rare features
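A minimal sketch of the computation, with an invented four-document example of why a feature that fires only once can still receive substantial gain:

```python
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum_y p(y) log2 p(y)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # IG(Y; X) = H(Y) - sum_x p(x) H(Y | X = x)
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for y, x in zip(labels, feature_values) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Invented example: "rare" fires on a single document, yet its one-document
# branch is perfectly pure, so it still earns nonzero gain.
labels = ["Pos", "Pos", "Neg", "Neg"]
common = [1, 1, 0, 0]   # splits the classes perfectly
rare   = [1, 0, 0, 0]   # fires on a single document
print(information_gain(labels, common))  # 1.0
print(information_gain(labels, rare))    # ~0.31, despite being rare
```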
Naive Bayes
• Finds the class with the highest posterior probability under Bayes' law
• Assumption of NB
  • Features are conditionally independent of one another, given the class
• Sensitive to a large feature set
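To make the decision rule and the independence assumption concrete, here is a hand-rolled multinomial NB sketch on an invented mini-corpus (with add-one smoothing so unseen words do not zero out the product):

```python
from collections import Counter

# c* = argmax_c P(c) * prod_w P(w | c): each word's probability is
# estimated independently of the others, given the class.
train = [("I like this movie", "Pos"), ("I love it", "Pos"),
         ("I hate this movie", "Neg"), ("boring film", "Neg")]

vocab = {w for d, _ in train for w in d.lower().split()}
class_tokens = {"Pos": [], "Neg": []}
for d, c in train:
    class_tokens[c].extend(d.lower().split())

def score(doc, c):
    counts = Counter(class_tokens[c])
    total = len(class_tokens[c])
    prior = sum(1 for _, y in train if y == c) / len(train)
    p = prior
    for w in doc.lower().split():
        p *= (counts[w] + 1) / (total + len(vocab))  # add-one smoothing
    return p

doc = "I like it"
print(max(("Pos", "Neg"), key=lambda c: score(doc, c)))  # -> Pos
```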
Logistic Regression
• A linear classifier
• Relaxes NB's conditional independence assumption
• Handles large numbers of mostly useless features
Support Vector Machines
• Finds hyperplane which separates the training data with maximum margin
• Inherently binary: multiple classes require strategies such as one-vs-rest
Language model
• What is a language model?
• Models that assign probabilities to sequences of words
• 2. For the following “corpus” of two documents:
1. how much wood would a wood chuck chuck if a wood chuck would chuck wood
2. a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood
• (a) Which of the following sentences, “a wood could chuck” or “wood would a chuck”, is more probable, according to:
i. An unsmoothed uni-gram language model?
ii. A uni-gram language model, with Laplacian (“add-one”) smoothing?
iii. An unsmoothed bi-gram language model?
iv. A bi-gram language model, with Laplacian smoothing?
v. An unsmoothed tri-gram language model?
vi. A tri-gram language model, with Laplacian smoothing?
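A sketch of how the counts behind these questions can be computed, simplified in two ways: no sentence-boundary markers, and n-grams are counted across the document boundary, which the lecture treatment may handle differently:

```python
from collections import Counter

docs = [
    "how much wood would a wood chuck chuck if a wood chuck would chuck wood",
    "a wood chuck would chuck the wood he could chuck if a wood chuck would chuck wood",
]
tokens = [w for d in docs for w in d.split()]
unigrams = Counter(tokens)
# Simplification: counting across the document boundary adds one
# spurious ("wood", "a") bigram.
bigrams = Counter(zip(tokens, tokens[1:]))
N, V = len(tokens), len(unigrams)

def unigram_prob(sentence, smooth=False):
    p = 1.0
    for w in sentence.split():
        p *= ((unigrams[w] + 1) / (N + V)) if smooth else (unigrams[w] / N)
    return p

def bigram_prob(sentence, smooth=False):
    p = 1.0
    words = sentence.split()
    for prev, w in zip(words, words[1:]):
        num, den = bigrams[(prev, w)], unigrams[prev]
        p *= ((num + 1) / (den + V)) if smooth else (num / den if den else 0.0)
    return p

for s in ("a wood could chuck", "wood would a chuck"):
    print(f"{s!r}: uni={unigram_prob(s):.2e} uni+1={unigram_prob(s, True):.2e} "
          f"bi={bigram_prob(s):.2e} bi+1={bigram_prob(s, True):.2e}")
```

The same counting pattern extends to the tri-gram cases by conditioning on the two preceding words.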