l23-review-v1
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
1
COMP90042
Natural Language Processing
Lecture 23
Semester 1 2021 Week 12
Jey Han Lau
Subject Review
COMP90042 L23
2
Preprocessing
• Sentence segmentation
• Tokenisation
‣ Subword tokenisation
• Word normalisation
‣ Derivational vs. inflectional morphology
‣ Lemmatisation vs. stemming
• Stop words
COMP90042 L23
3
N-gram Language Models
• Derivation
• Smoothing techniques
‣ Add-k
‣ Absolute discounting
‣ Katz Backoff
‣ Kneser-Ney smoothing
‣ Interpolation
COMP90042 L23
4
Text Classification
• Building a classification system
• Text classification tasks
‣ Topic classification
‣ Sentiment analysis
‣ Native language identification
• Algorithms
‣ Naive-Bayes, logistic regression, SVM
‣ kNN, neural networks
• Bias vs. variance
• Evaluation metrics: Precision, recall, F1
COMP90042 L23
5
Part-of-Speech Tagging
• English POS
‣ Open vs. closed POS classes
• Tagsets
‣ Penn Treebank tags
• Automatic taggers
‣ Rule-based
‣ Statistical
– Unigram, classifier-based, HMM
COMP90042 L23
6
Hidden Markov Models
• Probabilistic formulation
‣ Parameters: emission and transition
probabilities
• Training
• Viterbi algorithm
• Generative vs. discriminative models
COMP90042 L23
7
DL: Feed-forward Networks
• Formulation
• Designing FF networks for NLP tasks
‣ Topic classification
‣ Language model
‣ POS tagging
• Word embeddings
• Convolutional networks
COMP90042 L23
8
DL: Recurrent Networks
• Formulation
• RNN language models
• LSTM
‣ Functions of gates
‣ Variants
• Designing RNN for NLP tasks
‣ Text classification: sentiment analysis
‣ POS tagging
COMP90042 L23
9
Lexical Semantics
• Definition of word senses, glosses
• Lexical relationships
‣ Synonymy, antonymy, hypernymy, meronymy
• Structure of WordNet
• Word similarity
‣ Path length, depth information, information content
• Word sense disambiguation
‣ Supervised vs. unsupervised
COMP90042 L23
10
Distributional Semantics
• Matrices for distributional semantics
‣ VSM, TF-IDF, word-word co-occurrence
• Association measures: PMI, PPMI
• Count-based methods: SVD
• Neural methods: skip-gram, CBOW
• Evaluation
‣ Word similarity, analogy
COMP90042 L23
11
Contextual Representation
• Formulation with RNN
• ELMo
• BERT
‣ Objectives
‣ Fine-tuning for downstream tasks
• Transformers
‣ Multi-head attention
COMP90042 L23
12
Discourse
• Motivation for modelling beyond words
• Discourse segmentation
‣ Text Tiling
• Discourse parsing
‣ Rhetorical structure theory
• Anaphora resolution
‣ Centering
‣ Supervised models
COMP90042 L23
13
Formal Language Theory & FSA
• Formal language theory as a framework for defining
language
• Regular languages
‣ Closure properties
• Finite state acceptors
‣ Word morphology, weighted variant
• Finite state transducers
‣ Weighted variant, edit distance, morphological
analysis
COMP90042 L23
14
Context-Free Grammar
• Center embedding
• Basics of CFG
• Syntactic constituent and its properties
• CFG parsing
‣ Chomsky normal form
‣ CYK
• English sentence structure (Penn Treebank)
COMP90042 L23
15
Probabilistic Context-Free Grammar
• Ambiguity in grammars
• Basics of probabilistic CFGs
• Probability of a CFG tree
• Parsing
‣ Probabilistic CYK
• Improvements
‣ Parent conditioning
‣ Head lexicalisation
COMP90042 L23
16
Dependency Grammar
• Notion of dependency between words
• Universal dependency
• Properties of dependency trees
‣ Projectivity
• Parsing
‣ Transition-based
‣ Graph-based
COMP90042 L23
17
Machine Translation
• Statistical MT
‣ Language + translation model
‣ Alignments
• Neural MT
‣ Encoder-decoder
‣ Beam search decoding
‣ Attention mechanism
• Evaluation: BLEU
COMP90042 L23
18
Information Extraction
• Named entity recognition
‣ NER tags, IOB tagging, models
• Relation extraction
‣ Rule-based, supervised, semi-supervised,
distant supervision
‣ Unsupervised
• Temporal expression extraction
• Event extraction
COMP90042 L23
19
Question Answering
• IR-based QA
‣ Question processing, answer type prediction
‣ Passage retrieval, answer extraction
• Reading comprehension
‣ Models: LSTM-based, BERT
• Knowledge-based QA
• Hybrid QA: IBM Watson
COMP90042 L23
20
Topic Modelling
• Evolution of topic models
• LDA
‣ Sampling-based learning
‣ Hyper-parameters
• Evaluation:
‣ Word intrusion
‣ Topic coherence
COMP90042 L23
21
Summarisation
• Extractive summarisation
‣ Single-document
– Unsupervised content selection
‣ Multi-document
– Maximum marginal relevance
• Abstractive summarisation
‣ Encoder-decoder models: copy mechanism
• Evaluation: ROUGE
COMP90042 L23
22
Ethics
• Learning outcomes
• Arguments against ethical checks
• Core NLP ethics concepts:
‣ bias, dual use, privacy
• Discussion of applications/use cases
COMP90042 L23
23
Exam
COMP90042 L23
24
Exam Structure
• Open book
• 120 points (40% for the subject)
• Gradescope
• Time: 120 minutes writing time +15 minutes reading time
• 3 parts:
‣ A: short answer questions
‣ B: method questions
‣ C: algorithm questions
COMP90042 L23
25
Short Answer Questions
• Several short questions
‣ 1-2 sentence answers for each
‣ Definitional, e.g. what is X?
‣ Conceptual, e.g. relate X and Y, purpose of Z?
‣ May call for an example illustrating a technique/
problem
COMP90042 L23
26
Method Questions
• Longer answer
• Focus on analysis and understanding
‣ Contrast different methods
‣ Analyse an algorithm/application
‣ Motivate a modelling technique
‣ Explain or derive mathematical equation
COMP90042 L23
27
Algorithmic Questions
• Perform algorithmic computations
‣ Numerical computations for algorithm on some
given example data
‣ Present an outline of an algorithm on your own
example
• Not required to simplify maths (e.g. leaving
fractions as log(5/4) is fine)
COMP90042 L23
28
What to Expect
• Even coverage of topic from the semester
• Be prepared for concepts that have not yet been
assessed by homework / project
• Prescribed reading is fair game for topics
mentioned in the lectures and workshops
• Mock exam
COMP90042 L23
29
Questions?
• Final survey: https://forms.gle/
CYBfYuEh46mjGqG86
https://forms.gle/CYBfYuEh46mjGqG86
https://forms.gle/CYBfYuEh46mjGqG86
https://forms.gle/CYBfYuEh46mjGqG86
https://forms.gle/CYBfYuEh46mjGqG86
https://forms.gle/CYBfYuEh46mjGqG86
https://forms.gle/CYBfYuEh46mjGqG86