Topic Modelling
COMP90042
Natural Language Processing Lecture 20
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE
Making Sense of Text
• English Wikipedia: 6M articles
• Twitter: 500M tweets per day
• New York Times: 15M articles
• arXiv: 1M articles
What can we do if we want to learn something about these document collections?
Questions
• What are the less popular topics on Wikipedia?
• What are the big trends on Twitter in the past month?
• How do the themes/topics evolve over time in the New York Times from the 1900s to the 2000s?
• What are some influential research areas?
Topic Models To The Rescue
• Topic models learn common, overlapping themes in a document collection
• Unsupervised model
  ‣ No labels; input is just the documents!
What Is A Topic?
• A set of words
• Collectively describes a concept or subject
• Words of a topic typically appear in the same set of documents in the corpus
Li, Chua, 2017 (“Document Visualization using Topic Clouds”)
Wikipedia Topics
https://appliedmachinelearning.blog/2017/09/28/topic-modelling-part-2-discovering-topics-from-articles-with-latent-dirichlet-allocation/
Twitter Topics
Ramage, Dumais, Liebling, 2010
New York Times Topics
A Brief History of Topic Models
What Is A Topic Model?
• Latent Semantic Analysis (L10): SVD + truncate

  A = UΣVᵀ

  where A is the |V| × |D| word-document matrix, U is |V| × m, Σ is an m × m diagonal matrix of singular values, and Vᵀ is m × |D|.
LSA: Truncate

  A = UΣVᵀ

• Keep only the first k dimensions:
  ‣ Uₖ (|V| × k): each row is a k-dimensional word vector
  ‣ Vᵀₖ (k × |D|): each column is a document's topic distribution over k topics
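The truncation above can be sketched with NumPy. The word-document counts below are made up purely for illustration:

```python
import numpy as np

# Toy word-document count matrix A (|V| x |D|): 4 word types, 3 documents.
A = np.array([
    [2.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
    [0.0, 4.0, 2.0],
    [0.0, 2.0, 3.0],
])

# Full SVD: A = U @ diag(S) @ VT
U, S, VT = np.linalg.svd(A, full_matrices=False)

# Truncate to k latent dimensions ("topics").
k = 2
U_k = U[:, :k]      # |V| x k : one k-dimensional vector per word
VT_k = VT[:k, :]    # k x |D| : one k-dimensional topic column per document

# Keeping only the top-k singular values gives a low-rank approximation of A.
A_k = U_k @ np.diag(S[:k]) @ VT_k
print(U_k.shape, VT_k.shape)   # (4, 2) (2, 3)
```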
Issues
• Positive and negative values in U and Vᵀ
• Difficult to interpret
Probabilistic LSA
• Based on a probabilistic model:

  P(w, d) = P(w|d) P(d)
          = P(d) ∑ₜ P(w|t) P(t|d)

  where the sum ranges over the T topics; P(w|t) is the word distribution for a topic, P(t|d) is the topic distribution for a document, and P(w, d) is the joint probability of a word and a document.

(Figure: example word distribution for Topic 1 over film, movie, star, baseball, plant; and example topic distribution for Document 1 over topic1–topic4.)
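The PLSA decomposition can be computed directly once the component distributions are known. All the distributions below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Hypothetical PLSA parameters (values are for illustration only).
p_w_given_t = {            # P(w|t): word distribution per topic
    "t1": {"film": 0.4, "movie": 0.3, "star": 0.2, "baseball": 0.1},
    "t2": {"film": 0.1, "movie": 0.1, "star": 0.2, "baseball": 0.6},
}
p_t_given_d = {"d1": {"t1": 0.8, "t2": 0.2}}   # P(t|d)
p_d = {"d1": 1.0}                              # P(d)

def joint(w, d):
    """P(w,d) = P(d) * sum_t P(w|t) P(t|d)."""
    return p_d[d] * sum(p_w_given_t[t][w] * p_t_given_d[d][t]
                        for t in p_t_given_d[d])

print(joint("film", "d1"))   # 1.0 * (0.4*0.8 + 0.1*0.2) = 0.34
```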
Issues
• No more negative values!
• PLSA can learn topics and topic distributions for documents in the training corpus
• But it is unable to infer the topic distribution of new documents
• PLSA needs to be re-trained for new documents
Latent Dirichlet Allocation
• Introduces a prior to the document-topic and topic-word distributions
• Fully generative: a trained LDA model can infer topics on unseen documents!
• LDA is a Bayesian version of PLSA
Latent Dirichlet Allocation
LDA
• Core idea: assume each document contains a mix of topics
• But the topic structure is hidden (latent)
• LDA infers the topic structure given the observed words and documents
• LDA produces soft clusters of documents (based on topic overlap), rather than hard clusters
LDA
(Figure: example topics, and the topics present in documents.)
Input
• A collection of documents
• Bag-of-words
• Good preprocessing practice:
  ‣ Remove stopwords
  ‣ Remove low- and high-frequency word types
  ‣ Lemmatisation
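A minimal sketch of the preprocessing steps above, using a tiny hand-picked stopword list and toy frequency thresholds (lemmatisation is omitted here, since it needs a morphological resource):

```python
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "in"}   # small illustrative list

docs = [
    ["the", "cat", "chases", "the", "mouse"],
    ["a", "mouse", "clicks", "the", "scroll", "wheel"],
    ["the", "game", "is", "exciting"],
]

# Document frequency of each word type.
df = Counter(w for doc in docs for w in set(doc))
n_docs = len(docs)

def keep(w):
    # Drop stopwords, and word types that are too rare or too common.
    return (w not in STOPWORDS
            and df[w] >= 1               # min document frequency (toy threshold)
            and df[w] / n_docs <= 0.9)   # max document proportion (toy threshold)

# Bag-of-words representation of each document after filtering.
bow = [Counter(w for w in doc if keep(w)) for doc in docs]
print(bow[0])
```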
Output
• Topics: multinomial distribution over words in each topic
(Figure: example word distributions for Topic 1 and Topic 2 over film, movie, show, baseball, league.)
• Topics in documents: multinomial distribution over topics in each document
(Figure: example topic distributions for Document 1 and Document 2 over topic1–topic4.)
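Both outputs can be obtained from an off-the-shelf implementation; a sketch using scikit-learn's LDA (the toy documents and the choice of two topics are illustrative, not from the lecture):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "film movie show film movie",
    "baseball league baseball show",
    "movie film star",
    "league baseball game",
]

X = CountVectorizer().fit_transform(docs)           # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                    # |D| x T, one topic mix per document

# Topic-word distributions: normalise the learned pseudo-counts.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(doc_topic.shape, topic_word.shape)            # (4, 2) (2, 7)
```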
Learning
• How do we learn the latent topics?
• Two main families of algorithms:
  ‣ Variational methods
  ‣ Sampling-based methods
Sampling Method (Gibbs)
1. Randomly assign topics to all tokens in documents
   doc1: mouse:t1, cat:t3, rat:t2, chase:t1, mouse:t3
   doc2: scroll:t1, mouse:t3, scroll:t3, scroll:t2, click:t2
   doc3: tonight:t2, baseball:t1, tv:t2, exciting:t1
2. Collect topic-word and document-topic co-occurrence statistics based on the assignments

   topic-word:       mouse  cat  scroll  tv  ...
               t1      1     0     1     0
               t2      0     0     1     1
               t3      2     1     1     0

   document-topic:    t1  t2  t3
                d1     2   1   2
                d2     1   2   2
                ...

3. Go through every word token in the corpus and sample a new topic:
   P(ti) ∝ P(ti|w) P(ti|d)
   ‣ Need to de-allocate the current topic assignment and update the co-occurrence matrices before sampling
4. Repeat until convergence
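Steps 1–4 can be sketched as a minimal Gibbs-style sampler. The corpus matches the slide's example; the prior values, iteration count, and the smoothed sampling weights are illustrative choices, not prescribed by the lecture:

```python
import random
from collections import defaultdict

random.seed(0)

docs = [["mouse", "cat", "rat", "chase", "mouse"],
        ["scroll", "mouse", "scroll", "scroll", "click"],
        ["tonight", "baseball", "tv", "exciting"]]
T = 3                       # number of topics
beta, alpha = 0.01, 0.1     # priors (illustrative values)
vocab = {w for d in docs for w in d}

# 1. Random initial topic assignments; collect co-occurrence counts.
z = [[random.randrange(T) for _ in d] for d in docs]
topic_word = defaultdict(int)    # (topic, word) counts
doc_topic = defaultdict(int)     # (doc, topic) counts
topic_total = [0] * T
for di, d in enumerate(docs):
    for wi, w in enumerate(d):
        t = z[di][wi]
        topic_word[t, w] += 1
        doc_topic[di, t] += 1
        topic_total[t] += 1

# 3./4. Resample every token's topic; a fixed number of sweeps stands in
# for "repeat until convergence".
for _ in range(50):
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t_old = z[di][wi]
            # De-allocate the current assignment before sampling.
            topic_word[t_old, w] -= 1
            doc_topic[di, t_old] -= 1
            topic_total[t_old] -= 1
            # P(t) proportional to P(w|t) * P(t|d), with pseudo-count smoothing.
            weights = [
                (topic_word[t, w] + beta) / (topic_total[t] + len(vocab) * beta)
                * (doc_topic[di, t] + alpha)
                for t in range(T)
            ]
            t_new = random.choices(range(T), weights=weights)[0]
            z[di][wi] = t_new
            topic_word[t_new, w] += 1
            doc_topic[di, t_new] += 1
            topic_total[t_new] += 1

print(z)   # final topic assignment for every token
```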
When Do We Stop?
• Train until convergence
• Convergence = model probability of training set becomes stable
• How to compute model probability?

  log P(w1, w2, …, wm) = log ∑ⱼ P(w1|tj) P(tj|d_w1) + … + log ∑ⱼ P(wm|tj) P(tj|d_wm)

  where each sum ranges over the T topics
  ‣ m = #word tokens
  ‣ P(w|tj) is based on the topic-word co-occurrence matrix; P(tj|d_w) is based on the document-topic co-occurrence matrix
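The model log probability is a direct translation of the formula above; the two distributions here are hypothetical values standing in for the (smoothed) co-occurrence matrices:

```python
import math

def log_prob(tokens, p_w_t, p_t_d, doc):
    """log P(w1..wm) = sum_i log sum_t P(wi|t) P(t|d_wi)."""
    return sum(math.log(sum(p_w_t[t][w] * p_t_d[doc][t] for t in p_t_d[doc]))
               for w in tokens)

# Hypothetical distributions for a two-topic model.
p_w_t = {"t1": {"film": 0.5, "ball": 0.5}, "t2": {"film": 0.2, "ball": 0.8}}
p_t_d = {"d1": {"t1": 0.7, "t2": 0.3}}

lp = log_prob(["film", "ball"], p_w_t, p_t_d, "d1")
print(lp)   # log(0.5*0.7 + 0.2*0.3) + log(0.5*0.7 + 0.8*0.3)
```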
Infer Topics For New Documents
1. Randomly assign topics to all tokens in new/test documents
   testdoc1: tiger:t2, cow:t1, cat:t3, tiger:t3
   testdoc2: football:t2, live:t2, men:t2, fun:t3, soccer:t1
   testdoc3: news:t1, cnn:t3, tonight:t1
2. Update the document-topic matrix based on the assignments, but use the trained topic-word matrix

   topic-word (fixed, from trained topic model):
                     mouse  cat  scroll  tv  ...
               t1      1     0     1     0
               t2      0     0     1     1
               t3      2     1     1     0

   document-topic (new matrix):
                      t1  t2  t3
               td1     1   1   2
               td2     1   3   1
               ...

3. Go through every word in the test documents and sample topics:
   P(ti) ∝ P(ti|w) P(ti|d)
4. Repeat
Hyper-Parameters
• T: number of topics
  ‣ Low T (<10): broad topics
  ‣ High T (100+): fine-grained, specific topics
Hyper-Parameters
• β: prior on the topic-word distribution
• α: prior on the document-topic distribution
• Analogous to k in add-k smoothing in N-gram LM
• Pseudo counts when computing:
  ‣ p(w|t): β
  ‣ p(t|d): α

  e.g. using the document-topic counts from the earlier example (d1 has 2 tokens assigned to t1 out of 5 tokens):

  p(t1|d1) = (2 + α) / (5 + Tα)
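The smoothed estimate above is one line of arithmetic; a sketch using the slide's d1 counts and an illustrative α:

```python
# Document-topic counts for d1 from the slide: t1=2, t2=1, t3=2 (5 tokens).
counts_d1 = {"t1": 2, "t2": 1, "t3": 2}
T = 3
alpha = 0.1   # illustrative prior value

def p_t_given_d(t, counts):
    """p(t|d) = (count(d,t) + alpha) / (len(d) + T*alpha)."""
    total = sum(counts.values())
    return (counts[t] + alpha) / (total + T * alpha)

print(p_t_given_d("t1", counts_d1))   # (2 + 0.1) / (5 + 3*0.1)
```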
Hyper-Parameters
• High prior values → flatter distribution
  ‣ a very very large value would lead to a uniform distribution
• Low prior values → peaky distribution
• β: generally small (< 0.01)
  ‣ Large vocabulary, but we want each topic to focus on specific themes
• α: generally larger (> 0.1)
  ‣ Multiple topics within a document
Evaluation
How To Evaluate Topic Models?
• Unsupervised learning → no labels
• Intrinsic evaluation:
  ‣ model logprob / perplexity on test documents

  ppl = exp(−log L / m)
  L = ∏_w ∑_t P(w|t) P(t|d_w)

  ‣ m = total number of word tokens in test documents
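Perplexity follows directly from the likelihood formula; the distributions below are hypothetical stand-ins for a trained model's test-time estimates:

```python
import math

def perplexity(tokens_with_docs, p_w_t, p_t_d):
    """ppl = exp(-log L / m), where L = prod_w sum_t P(w|t) P(t|d_w)."""
    log_l = sum(
        math.log(sum(p_w_t[t][w] * p_t_d[d][t] for t in p_t_d[d]))
        for w, d in tokens_with_docs
    )
    m = len(tokens_with_docs)   # total number of word tokens
    return math.exp(-log_l / m)

# Hypothetical test-set distributions for a two-topic model.
p_w_t = {"t1": {"film": 0.5, "ball": 0.5}, "t2": {"film": 0.2, "ball": 0.8}}
p_t_d = {"d1": {"t1": 0.5, "t2": 0.5}}
print(perplexity([("film", "d1"), ("ball", "d1")], p_w_t, p_t_d))
```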
Issues with Perplexity
• More topics = better (lower) perplexity
• Smaller vocabulary = better perplexity
  ‣ Perplexity not comparable for different corpora, or different tokenisation/preprocessing methods
• Does not correlate with human perception of topic quality
• Extrinsic evaluation the way to go:
  ‣ Evaluate topic models based on downstream task
Topic Coherence
• A better intrinsic evaluation method
• Measures how coherent the generated topics are
• A good topic model is one that generates more coherent topics
Word Intrusion
• Idea: inject one random word into a topic
  {farmers, farm, food, rice, agriculture}
  ↓
  {farmers, farm, food, rice, cat, agriculture}
• Ask users to guess which is the intruder word
• Correct guess → topic is coherent
• Try to guess the intruder word in:
  ‣ {choice, count, village, i.e., simply, unionist}
• Manual effort; does not scale
Estimate Coherence Automatically? PMI!
• Compute pairwise PMI of the top-N words in a topic:

  PMI(t) = ∑_{j=2}^{N} ∑_{i=1}^{j−1} log [ P(wi, wj) / (P(wi) P(wj)) ]

• Given topic: {farmers, farm, food, rice, agriculture}
• Sum log PMI for all word pairs in the topic:
  ‣ log PMI(farmers, farm) + log PMI(farmers, food) + … + log PMI(rice, agriculture)
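The pairwise-PMI score can be sketched with document co-occurrence probabilities; the tiny corpus below is invented so that every pair of the scored words co-occurs at least once (in practice, smoothing handles zero co-occurrence counts):

```python
import math
from itertools import combinations

# Toy corpus (each document as a set of word types) for estimating
# P(w) and P(wi, wj) by document co-occurrence.
corpus = [
    {"farmers", "farm", "food"},
    {"farm", "food", "rice"},
    {"farmers", "rice", "agriculture"},
    {"cat", "dog"},
]
n = len(corpus)

def p(*words):
    """Fraction of documents containing all the given words."""
    return sum(all(w in doc for w in words) for doc in corpus) / n

def pmi_coherence(topic_words):
    """Sum of log PMI over all pairs of the topic's top-N words."""
    return sum(math.log(p(wi, wj) / (p(wi) * p(wj)))
               for wi, wj in combinations(topic_words, 2))

print(pmi_coherence(["farm", "food", "rice"]))
```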
Variants
• Normalised PMI:

  NPMI(t) = ∑_{j=2}^{N} ∑_{i=1}^{j−1} [ log( P(wi, wj) / (P(wi) P(wj)) ) / (−log P(wi, wj)) ]

• Conditional probability:

  LCP(t) = ∑_{j=2}^{N} ∑_{i=1}^{j−1} log [ P(wi, wj) / P(wi) ]

• Good correlation with human perception of topic coherence
• Better correlation if we use a different corpus to estimate PMI
  ‣ i.e. don't use the corpus that we run the topic model on
PMI Examples

  Topic                                         PMI   NPMI
  cell hormone insulin muscle receptor          0.61  0.59
  electron laser magnetic voltage wavelength    0.54  0.52
  magnetic neutrino particle quantum universe   0.55  0.55
  album band music release song                 0.37  0.56
  college education school student university   0.38  0.57
  city county district population town          0.34  0.52
Improvements
Topic Model Variants
• Use phrases or n-grams instead of words
• Learn hierarchical topics
• Non-parametric models
  ‣ #topics automatically learned
• Supervised models
  ‣ Takes into account document labels
Topic Labelling
• Use Wikipedia article titles as labels
• Measure distance between a label and topic words based on document embeddings and word embeddings

  vmware server virtual oracle update → virtualisation
  church archway building window gothic → church architecture
  investigation fbi official department federal → criminal investigation
  rate population prevalence study incidence → mortality rate
A Final Word
• Topic model: an unsupervised model for learning latent concepts in a document collection
• LDA: a popular topic model
  ‣ Learning
  ‣ Hyper-parameters
• How to evaluate topic models?
  ‣ Topic coherence