COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 20
Semester 1 2021 Week 10
Jey Han Lau
Topic Modelling
Making Sense of Text
• English Wikipedia: 6M articles
• Twitter: 500M tweets per day
• New York Times: 15M articles
• arXiv: 1M articles
• What can we do if we want to learn something
about these document collections?
Questions
• What are the less popular topics on Wikipedia?
• What are the big trends on Twitter in the past
month?
• How do the themes/topics evolve over time in the New York Times from the 1900s to the 2000s?
• What are some influential research areas?
Topic Models To The Rescue
• Topic models learn common, overlapping themes in
a document collection
• Unsupervised model
‣ No labels; input is just the documents!
• What’s the output of a topic model?
‣ Topics: each topic associated with a list of words
‣ Topic assignments: each document associated
with a list of topics
What Do Topics Look Like?
• A list of words
• Collectively describes a
concept or subject
• Words of a topic typically
appear in the same set
of documents in the
corpus
Li, Chua, 2017 (“Document Visualization using Topic Clouds”)
Wikipedia Topics
Topic Modelling (Part 2): Discovering Topics from Articles with Latent Dirichlet Allocation
Twitter Topics
Ramage, Dumais, Liebling, 2010
New York Times Topics
Applications of topic models?
• Personalised advertising
• Search engine
• Discover senses of polysemous words
• Part-of-speech tagging
Outline
• A Brief History of Topic Models
• Latent Dirichlet Allocation
• Evaluation
A Brief History of
Topic Models
Latent Semantic Analysis (L10)

A = U Σ Vᵀ

• A: |V| × |D| word-document matrix
• U: |V| × m
• Σ: m × m diagonal matrix of singular values
• Vᵀ: m × |D|

We used U to get word embeddings in L10; we didn't do anything with Vᵀ.
LSA: Truncate

A ≈ Uk Σk Vᵀk

• Truncate U (|V| × m) to Uk (|V| × k), keeping only the first k dimensions
  ‣ Each row of Uk is a word vector
  ‣ Each column of Uk is a topic:
    topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
    topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}
• Truncate Vᵀ (m × |D|) to Vᵀk (k × |D|)
  ‣ Each column of Vᵀk is a topic assignment for a document:
    doc 1 = {t1: -0.2, t2: -4.1, t3: 2.6, …}
    doc 2 = {t1: 4.0, t2: 0.6, t3: 6.1, …}
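A minimal sketch of this truncation in code; the toy count matrix and vocabulary below are made up for illustration:

```python
# LSA-style topic extraction via truncated SVD (toy data for illustration).
import numpy as np

vocab = ["dog", "cat", "shadow", "film", "movie"]
# A: |V| x |D| word-document count matrix
A = np.array([
    [2, 0, 1, 0],
    [3, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 4, 2],
    [0, 0, 3, 3],
], dtype=float)

U, S, VT = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions ("topics") to keep
Uk, VTk = U[:, :k], VT[:k, :]

# Each column of Uk is a "topic": a weight for every word in the vocabulary
for t in range(k):
    weights = sorted(zip(vocab, Uk[:, t]), key=lambda x: -abs(x[1]))
    print(f"topic {t+1}:", [(w, round(v, 2)) for w, v in weights])

# Each column of VTk gives the topic weights for one document
for d in range(A.shape[1]):
    print(f"doc {d+1}:", np.round(VTk[:, d], 2))
```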
Issues

• U and Vᵀ contain a mix of positive and negative values
• Difficult to interpret:
  topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
  topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}
  ‣ e.g. what does it mean for "shadow" to have a negative weight in a topic?
Probabilistic LSA

• Based on a probabilistic model:

  P(w, d) = P(w|d) P(d) = P(d) Σ_{t=1..T} P(w|t) P(t|d)

  ‣ P(w, d): joint probability of a word and a document
  ‣ T: number of topics
  ‣ P(w|t): word distribution for a topic
  ‣ P(t|d): topic distribution for a document

[Figure: bar charts of a word distribution for Topic 1 (film, movie, star, baseball, plant) and a topic distribution for Document 1 (topic1-topic4)]
Issues
• No more negative values!
• PLSA can learn topics and topic assignments for documents in the training corpus
• But it cannot infer topic distributions for new documents
• PLSA needs to be re-trained for new documents
Latent Dirichlet Allocation
• Introduces priors on the document-topic and topic-word distributions
• Fully generative: a trained LDA model can infer topics for unseen documents!
• LDA is a Bayesian version of PLSA
Latent Dirichlet Allocation
LDA
• Core idea: assume each document contains a mix
of topics
• But the topic structure is hidden (latent)
• LDA infers the topic structure given the observed
words and documents
• LDA produces soft clusters of documents (based
on topic overlap), rather than hard clusters
• A trained LDA model can infer topics for new documents (not part of the training data)
[Figure: LDA takes the document collection as input and discovers the topics and the topics present in each document]
Input
• A collection of documents
• Bag-of-words
• Good preprocessing practice:
‣ Remove stopwords
‣ Remove low and high frequency word types
‣ Lemmatisation
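A minimal preprocessing sketch along these lines, assuming NLTK's stopword list and WordNet lemmatiser (and their data files) plus gensim are available; the frequency thresholds are illustrative:

```python
# Turn raw documents into the bag-of-words input a topic model expects.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

docs = [
    "The farmers grow rice on the farm.",
    "Rice and food prices affect farmers.",
    "The cat chased the mouse across the farm.",
]

stops = set(stopwords.words("english"))
lemmatiser = WordNetLemmatizer()

# Tokenise, remove stopwords, lemmatise
tokenised = [
    [lemmatiser.lemmatize(tok) for tok in simple_preprocess(doc)
     if tok not in stops]
    for doc in docs
]

vocab = Dictionary(tokenised)
# Drop very rare and very common word types (tune thresholds for a real corpus)
vocab.filter_extremes(no_below=2, no_above=0.9)

# Bag-of-words representation: list of (word_id, count) per document
bows = [vocab.doc2bow(doc) for doc in tokenised]
```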
Output
• Topics: distribution over words in each topic
• Topic assignment: distribution over topics in each
document
[Figure: bar charts of the word distributions for Topic 1 and Topic 2 (over film, movie, show, baseball, league) and the topic distributions for Document 1 and Document 2 (over topic1-topic4)]
Learning
• How do we learn the latent topics?
• Two main families of algorithms:
‣ Variational methods
‣ Sampling-based methods
Sampling Method (Gibbs)

1. Randomly assign topics to all tokens in documents
2. Collect topic-word and document-topic co-occurrence statistics
   based on the assignments
3. Go through every word token in the corpus and sample a new topic:
   ‣ P(ti|w, d) ∝ P(ti|w) P(ti|d)
4. Repeat until convergence

doc1: mouse:t1  cat:t3  rat:t2  chase:t1  mouse:t3
doc2: scroll:t1  mouse:t3  scroll:t3  scroll:t2  click:t2
doc3: tonight:t2  baseball:t1  tv:t2  exciting:t1

Topic-word co-occurrence matrix, initialised with the β (= 0.01) prior:
      mouse  cat   scroll  tv    …
t1    0.01   0.01  0.01    0.01
t2    0.01   0.01  0.01    0.01
t3    0.01   0.01  0.01    0.01

Document-topic co-occurrence matrix, initialised with the α (= 0.1) prior:
      t1    t2    t3
d1    0.1   0.1   0.1
d2    0.1   0.1   0.1
…
Sampling Method (Gibbs)

1. Randomly assign topics to all tokens in documents
2. Collect topic-word and document-topic co-occurrence statistics
   based on the assignments
3. Go through every word token in the corpus and sample a new topic:
   ‣ P(ti|w, d) ∝ P(ti|w) P(ti|d)
4. Go to step 2 and repeat until convergence

doc1: mouse:?  cat:t3  rat:t2  chase:t1  mouse:t3
doc2: scroll:t1  mouse:t3  scroll:t3  scroll:t2  click:t2
doc3: tonight:t2  baseball:t1  tv:t2  exciting:t1

Topic-word co-occurrence matrix (after collecting statistics):
      mouse  cat   scroll  tv    …
t1    1.01   0.01  1.01    0.01
t2    0.01   0.01  1.01    1.01
t3    2.01   1.01  1.01    0.01

Document-topic co-occurrence matrix (after collecting statistics):
      t1    t2    t3
d1    2.1   1.1   2.1
d2    1.1   2.1   2.1
…

To sample a new topic for the first "mouse" token in doc1 (marked "?", currently t1), we need to de-allocate its current topic assignment and update the co-occurrence matrices before sampling: subtract 1 from (t1, mouse) in the topic-word matrix and from (d1, t1) in the document-topic matrix. Then:

P(t1|w, d) = P(t1|mouse) × P(t1|d1)
           = 0.01 / (0.01 + 0.01 + 2.01) × 1.1 / (1.1 + 1.1 + 2.1)
Sampling Method (Gibbs)

Similarly, for the other topics, e.g. t3:

P(t3|w, d) = P(t3|mouse) × P(t3|d1)
           = 2.01 / (0.01 + 0.01 + 2.01) × 2.1 / (1.1 + 1.1 + 2.1)

Once P(ti|w, d) has been computed for every topic, normalise the values, sample a new topic assignment for the token, and update the co-occurrence matrices.
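A toy implementation of the sampler as formulated on these slides (the corpus, priors, and topic count are illustrative):

```python
# Gibbs-style sampler following the update rule on these slides,
# P(t_i|w, d) ∝ P(t_i|w) P(t_i|d), with beta/alpha pseudo-counts.
import numpy as np

rng = np.random.default_rng(0)

docs = [["mouse", "cat", "rat", "chase", "mouse"],
        ["scroll", "mouse", "scroll", "scroll", "click"],
        ["tonight", "baseball", "tv", "exciting"]]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

T, beta, alpha = 3, 0.01, 0.1

# Initialise co-occurrence matrices with the priors, then randomly
# assign a topic to every token and collect counts (steps 1-2)
topic_word = np.full((T, len(vocab)), beta)
doc_topic = np.full((len(docs), T), alpha)
z = [[rng.integers(T) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        topic_word[z[d][n], w2i[w]] += 1
        doc_topic[d, z[d][n]] += 1

for _ in range(1000):  # step 4: repeat until convergence
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t, wi = z[d][n], w2i[w]
            # De-allocate the current assignment before sampling
            topic_word[t, wi] -= 1
            doc_topic[d, t] -= 1
            # Step 3: P(t|w) * P(t|d), each normalised over topics
            p = (topic_word[:, wi] / topic_word[:, wi].sum()) \
                * (doc_topic[d] / doc_topic[d].sum())
            t_new = rng.choice(T, p=p / p.sum())
            z[d][n] = t_new
            topic_word[t_new, wi] += 1
            doc_topic[d, t_new] += 1

# Topics: top words per topic from the topic-word matrix
for t in range(T):
    top = np.argsort(-topic_word[t])[:3]
    print(f"t{t+1}:", [vocab[i] for i in top])
```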
When Do We Stop?

• Train until convergence
• Convergence = model probability of training set becomes stable
• How to compute model probability?
  ‣ m = #word tokens

  log P(w1, w2, …, wm) = log Σ_{j=1..T} P(w1|tj) P(tj|d_w1) + … + log Σ_{j=1..T} P(wm|tj) P(tj|d_wm)

  ‣ P(w|tj): based on the topic-word co-occurrence matrix
  ‣ P(tj|d_w): based on the document-topic co-occurrence matrix
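A sketch of this convergence check, computing the corpus log probability from the (normalised) co-occurrence matrices; the variable names are assumptions, not from the lecture:

```python
import numpy as np

def log_prob(docs, w2i, topic_word, doc_topic):
    # Normalise counts into P(w|t) (rows of topic_word) and P(t|d) (rows of doc_topic)
    p_w_given_t = topic_word / topic_word.sum(axis=1, keepdims=True)
    p_t_given_d = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    logp = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            # log sum_j P(w|t_j) P(t_j|d_w)
            logp += np.log(p_w_given_t[:, w2i[w]] @ p_t_given_d[d])
    return logp

# Train until log_prob(...) stabilises between sampling iterations.
```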
Infer Topics For New Documents

1. Randomly assign topics to all tokens in new/test documents
2. Update the document-topic matrix based on the assignments, but use
   the trained topic-word matrix (kept fixed)
3. Go through every word in the test documents and sample topics:
   ‣ P(ti|w, d) ∝ P(ti|w) P(ti|d)
4. Go to step 2 and repeat until convergence

testdoc1: tiger:t2  cow:t1  cat:t3  tiger:t3
testdoc2: football:t2  live:t2  men:t2  fun:t3  soccer:t1
testdoc3: news:t1  cnn:t3  tonight:t1

Topic-word co-occurrence matrix (from the trained topic model; kept fixed):
      mouse  cat   scroll  tv    …
t1    5.01   4.01  1.01    0.01
t2    0.01   2.01  1.01    8.01
t3    0.01   0.01  4.01    0.01

Document-topic co-occurrence matrix (new matrix for the test documents):
      t1    t2    t3
td1   1.1   1.1   2.1
td2   1.1   3.1   1.1
…
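In practice one would rarely hand-roll this; a sketch of training and inference with gensim's LdaModel (the hyper-parameter values here are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

train_docs = [["mouse", "cat", "rat", "chase", "mouse"],
              ["scroll", "mouse", "scroll", "scroll", "click"],
              ["tonight", "baseball", "tv", "exciting"]]
vocab = Dictionary(train_docs)
corpus = [vocab.doc2bow(d) for d in train_docs]

lda = LdaModel(corpus, num_topics=3, id2word=vocab,
               alpha=0.1, eta=0.01, passes=50)

# Infer a topic distribution for an unseen document: the topic-word
# distributions stay fixed; only the document's topic mixture is estimated
new_bow = vocab.doc2bow(["cat", "mouse", "chase"])
print(lda.get_document_topics(new_bow))
```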
Hyper-Parameters

• T: number of topics
  ‣ Low T (<10): broad topics
  ‣ High T (100+): fine-grained, specific topics

• β: prior on the topic-word distribution p(w|t)
• α: prior on the document-topic distribution p(t|d)
• Analogous to k in add-k smoothing in N-gram LM
• Pseudo counts to initialise the co-occurrence matrices:
  ‣ β for the topic-word matrix (e.g. 0.01)
  ‣ α for the document-topic matrix (e.g. 0.1)

• High prior values → flatter distribution
  ‣ a very, very large value would lead to a uniform distribution
• Low prior values → peaky distribution
• β: generally small (< 0.01)
  ‣ Large vocabulary, but we want each topic to focus on specific themes
• α: generally larger (> 0.1)
  ‣ Multiple topics within a document
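A quick illustration (not from the lecture) of how the prior value shapes distributions sampled from a Dirichlet:

```python
# Large Dirichlet concentrations give flat distributions; small ones give peaky ones.
import numpy as np

rng = np.random.default_rng(0)
T = 4  # e.g. a document-topic distribution over 4 topics

for a in (10.0, 1.0, 0.1):
    sample = rng.dirichlet([a] * T)
    print(f"prior={a:>4}: {np.round(sample, 2)}")
# prior=10 -> near-uniform mixtures; prior=0.1 -> mass piles onto few topics
```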
Evaluation
How To Evaluate Topic Models?

• Unsupervised learning → no labels
• Intrinsic evaluation:
  ‣ model log probability / perplexity on test documents

  log L = Σ_W log Σ_T P(w|t) P(t|d_w)

  ppl = exp(−log L / W)
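As a sketch, perplexity from a held-out log likelihood (log_L) and token count (W); the numbers below are made up:

```python
import numpy as np

def perplexity(log_L: float, W: int) -> float:
    # ppl = exp(-log L / W); lower = model is less "surprised" by the test docs
    return float(np.exp(-log_L / W))

print(perplexity(log_L=-4500.0, W=1000))  # illustrative values
```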
Issues with Perplexity
• More topics = better (lower) perplexity
• Smaller vocabulary = better perplexity
‣ Perplexity not comparable for different corpora, or
different tokenisation/preprocessing methods
• Does not correlate with human perception of topic
quality
• Extrinsic evaluation is the way to go:
‣ Evaluate topic models based on a downstream task
Topic Coherence
• A better intrinsic evaluation method
• Measures how coherent the generated topics are
• A good topic model is one that generates more
coherent topics
Word Intrusion
• Idea: inject one random word into a topic
{farmers, farm, food, rice, agriculture}
↓
{farmers, farm, food, rice, cat, agriculture}
• Ask users to guess which is the intruder word
• Correct guess → topic is coherent
• Try to guess the intruder word in:
‣ {choice, count, village, i.e., simply, unionist}
• Manual effort; does not scale
PMI ≈ Coherence?
• High PMI for a pair of words → words are correlated
‣ PMI(farm, rice) ↑
‣ PMI(choice, village) ↓
• If all word pairs in a topic have high PMI → topic is coherent
• If most topics have high PMI → good topic model
• Where to get word co-occurrence statistics for PMI?
‣ Can use same corpus for topic model
‣ A better way is to use an external corpus (e.g.
Wikipedia)
PMI

• Compute pairwise PMI of the top-N words in a topic

  PMI(t) = Σ_{j=2..N} Σ_{i=1..j−1} log [ P(wi, wj) / (P(wi) P(wj)) ]

• Given topic: {farmers, farm, food, rice, agriculture}
• Coherence = sum of PMI for all word pairs:
  ‣ PMI(farmers, farm) + PMI(farmers, food) + … + PMI(rice, agriculture)
Variants

• Normalised PMI:

  NPMI(t) = Σ_{j=2..N} Σ_{i=1..j−1} [ log ( P(wi, wj) / (P(wi) P(wj)) ) / −log P(wi, wj) ]

• Conditional probability:

  LCP(t) = Σ_{j=2..N} Σ_{i=1..j−1} log [ P(wi, wj) / P(wi) ]
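A sketch of these coherence measures using document-level co-occurrence counts; the toy corpus is illustrative, and in practice the statistics would come from a large external corpus such as Wikipedia:

```python
# Topic coherence via PMI / NPMI / LCP over document co-occurrence counts.
from collections import Counter
from itertools import combinations
import math

def coherence(topic_words, docs, variant="pmi", eps=1e-12):
    # Collect single-word and pairwise document co-occurrence probabilities
    n = len(docs)
    uni, pair = Counter(), Counter()
    for doc in docs:
        present = set(doc)
        uni.update(present)
        pair.update(frozenset(p) for p in combinations(sorted(present), 2))

    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        p_i, p_j = uni[wi] / n, uni[wj] / n
        p_ij = pair[frozenset((wi, wj))] / n + eps
        if variant == "pmi":
            score += math.log(p_ij / (p_i * p_j + eps))
        elif variant == "npmi":
            score += math.log(p_ij / (p_i * p_j + eps)) / -math.log(p_ij)
        else:  # "lcp": log conditional probability
            score += math.log(p_ij / (p_i + eps))
    return score

docs = [["farmers", "farm", "rice"], ["farm", "food", "agriculture"],
        ["rice", "agriculture", "farmers"], ["choice", "village", "unionist"]]
print(coherence(["farmers", "farm", "rice"], docs, variant="npmi"))
```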
PMI Examples
Topic PMI NPMI
cell hormone insulin muscle receptor 0.61 0.59
electron laser magnetic voltage wavelength 0.54 0.52
magnetic neutrino particle quantum universe 0.55 0.55
album band music release song 0.37 0.56
college education school student university 0.38 0.57
city county district population town 0.34 0.52
A Final Word
• Topic model: an unsupervised model for learning
latent concepts in a document collection
• LDA: a popular topic model
‣ Learning
‣ Hyper-parameters
• How to evaluate topic models?
‣ Topic coherence