Topic Modelling
COMP90042
Natural Language Processing
Lecture 20
Semester 1 2021, Week 10
Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Making Sense of Text
• English Wikipedia: 6M articles
• Twitter: 500M tweets per day
• New York Times: 15M articles
• arXiv: 1M articles
What can we do if we want to learn something about these document collections?
Questions
• What are the less popular topics on Wikipedia?
• What are the big trends on Twitter in the past month?
• How do the themes/topics evolve over time in the New York Times from the 1900s to the 2000s?
• What are some influential research areas?
Topic Models To The Rescue
• Topic models learn common, overlapping themes in a document collection
• Unsupervised model
  ‣ No labels; input is just the documents!
• What's the output of a topic model?
  ‣ Topics: each topic associated with a list of words
  ‣ Topic assignments: each document associated with a list of topics
What Do Topics Look Like?
• A list of words
• Collectively describes a concept or subject
• Words of a topic typically appear in the same set of documents in the corpus
Li, Chua, 2017 ("Document Visualization using Topic Clouds")
Wikipedia Topics
https://appliedmachinelearning.blog/2017/09/28/topic-modelling-part-2-discovering-topics-from-articles-with-latent-dirichlet-allocation/
Twitter Topics
Ramage, Dumais, Liebling, 2010
New York Times Topics
Applications of Topic Models?
• Personalised advertising
• Search engines
• Discovering senses of polysemous words
• Part-of-speech tagging
Outline
• A Brief History of Topic Models
• Latent Dirichlet Allocation
• Evaluation
A Brief History of Topic Models
Latent Semantic Analysis (L10)

A = U Σ V^T

• A: the |V| × |D| word-document matrix
• U: |V| × m — we used U to get word embeddings in L10
• Σ: m × m diagonal matrix of singular values
• V^T: m × |D| — we didn't do anything with V
LSA: Truncate

A = U Σ V^T

• Truncate U, Σ and V^T to the first k dimensions, giving k topics
• U_k (|V| × k): each row is a word vector; each column can be read as a topic, with a weight for every word
  ‣ topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
  ‣ topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}
• V^T_k (k × |D|): each column gives a document's topic assignment, with a weight for every topic
  ‣ doc 1 = {t1: -0.2, t2: -4.1, t3: 2.6, …}
  ‣ doc 2 = {t1: 4.0, t2: 0.6, t3: 6.1, …}
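Below is a minimal sketch of this truncation using numpy's SVD; the toy corpus, vocabulary and all numbers are invented for illustration, not the values on the slide.

```python
# A minimal sketch of LSA truncation with numpy (toy data).
import numpy as np

vocab = ["dog", "cat", "shadow", "scroll", "click"]
# Toy word-document count matrix A (|V| x |D|).
A = np.array([
    [3.0, 2.0, 0.0, 0.0],   # dog
    [2.0, 4.0, 0.0, 1.0],   # cat
    [0.0, 1.0, 2.0, 2.0],   # shadow
    [0.0, 0.0, 3.0, 2.0],   # scroll
    [0.0, 0.0, 2.0, 3.0],   # click
])

U, S, VT = np.linalg.svd(A, full_matrices=False)

k = 2
Uk, VTk = U[:, :k], VT[:k, :]   # truncate to k "topics"

# Each column of Uk is a topic: one (possibly negative) weight per word.
for t in range(k):
    print(f"topic {t+1} =",
          {w: round(float(Uk[i, t]), 2) for i, w in enumerate(vocab)})

# Each column of VTk is a document's topic assignment.
print("doc 1 =", {f"t{t+1}": round(float(VTk[t, 0]), 2) for t in range(k)})
```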
Issues
• Positive and negative values in U and V^T
• Difficult to interpret
  ‣ topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
  ‣ topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}
Probabilistic LSA
• Based on a probabilistic model:

  P(w, d) = P(w|d) P(d)
          = P(d) Σ_{t=1..T} P(w|t) P(t|d)

  ‣ P(w, d): joint probability of a word and a document
  ‣ P(w|t): word distribution for a topic
  ‣ P(t|d): topic distribution for a document
  ‣ T: number of topics

(Figure: example word distribution for "Topic 1" over film, movie, star, baseball, plant; example topic distribution for "Document 1" over topics 1–4.)
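To make the decomposition concrete, here is a toy numeric check of the mixture (all probabilities invented):

```python
# Toy check of P(w, d) = P(d) * sum_t P(w|t) P(t|d); all values invented.
import numpy as np

P_w_given_t = np.array([   # rows: topics; columns: words (film, movie, baseball)
    [0.5, 0.4, 0.1],       # topic 1 (movies)
    [0.1, 0.1, 0.8],       # topic 2 (sports)
])
P_t_given_d = np.array([0.7, 0.3])   # topic distribution for document 1
P_d = 0.01                           # P(d) for document 1

# For each word w: P(w, d) = P(d) * sum over topics of P(w|t) P(t|d).
P_w_d = P_d * (P_t_given_d @ P_w_given_t)
print(P_w_d)   # e.g. P(film, d) = 0.01 * (0.5*0.7 + 0.1*0.3) = 0.0038
```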
Issues
• No more negative values!
• PLSA can learn topics and topic assignments for documents in the training corpus
• But it is unable to infer the topic distribution of new documents
  ‣ PLSA needs to be re-trained for new documents
Latent Dirichlet Allocation
• LDA is a Bayesian version of PLSA
• Introduces priors on the document-topic and topic-word distributions
• Fully generative: a trained LDA model can infer topics on unseen documents!
Latent Dirichlet Allocation
LDA
• Core idea: assume each document contains a mix of topics
  ‣ But the topic structure is hidden (latent)
• LDA infers the topic structure given the observed words and documents
• LDA produces soft clusters of documents (based on topic overlap), rather than hard clusters
• Given a trained LDA model, it can infer topics on new documents (i.e. documents that are not part of the training data)
LDA
(Figure: example topics, and topic assignments highlighted within documents.)
Input
• A collection of documents
• Bag-of-words representation
• Good preprocessing practice (a sketch follows below):
  ‣ Remove stopwords
  ‣ Remove low- and high-frequency word types
  ‣ Lemmatisation
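A minimal sketch of the first two preprocessing steps in plain Python; the stopword list and document-frequency thresholds are toy choices, and lemmatisation (typically done with a tool such as NLTK or spaCy) is omitted here.

```python
# Sketch of topic-model preprocessing: stopword removal plus low/high
# document-frequency filtering. Stopword list and thresholds are toy choices.
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(docs, min_df=2, max_df_frac=0.9):
    tokenised = [[w for w in doc.lower().split() if w not in STOPWORDS]
                 for doc in docs]
    df = Counter(w for doc in tokenised for w in set(doc))  # document frequency
    keep = {w for w, c in df.items()
            if c >= min_df and c / len(docs) <= max_df_frac}
    return [[w for w in doc if w in keep] for doc in tokenised]

docs = ["the cat chased the mouse",
        "the mouse ate the cheese",
        "the dog chased the cat"]
print(preprocess(docs))
# -> [['cat', 'chased', 'mouse'], ['mouse'], ['chased', 'cat']]
```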
Output
• Topics: a distribution over words for each topic
• Topic assignment: a distribution over topics for each document
(Figure: example word distributions for Topic 1 and Topic 2 over film, movie, show, baseball, league; example topic distributions for Document 1 and Document 2 over topics 1–4.)
Learning
• How do we learn the latent topics?
• Two main families of algorithms:
  ‣ Variational methods
  ‣ Sampling-based methods
Sampling Method (Gibbs)
1. Randomly assign topics to all tokens in the documents:
   ‣ doc1: mouse:t1, cat:t3, rat:t2, chase:t1, mouse:t3
   ‣ doc2: scroll:t1, mouse:t3, scroll:t3, scroll:t2, click:t2
   ‣ doc3: tonight:t2, baseball:t1, tv:t2, exciting:t1
2. Collect topic-word and document-topic co-occurrence statistics based on the assignments. The matrices are initialised with the β (= 0.01) and α (= 0.1) priors:

   topic-word   | mouse | cat  | scroll | tv   | …
   t1           | 0.01  | 0.01 | 0.01   | 0.01 |
   t2           | 0.01  | 0.01 | 0.01   | 0.01 |
   t3           | 0.01  | 0.01 | 0.01   | 0.01 |

   doc-topic    | t1   | t2   | t3
   d1           | 0.1  | 0.1  | 0.1
   d2           | 0.1  | 0.1  | 0.1
   …

3. Go through every word token in the corpus and sample a new topic:
   ‣ P(ti | w, d) ∝ P(ti | w) P(ti | d)
4. Repeat until convergence
Sampling Method (Gibbs)
To sample a new topic for a token (here the first "mouse" in doc1, currently assigned t1), we need to de-allocate its current topic assignment and update the co-occurrence matrices before sampling:

   topic-word   | mouse    | cat  | scroll | tv   | …
   t1           | 1.01 − 1 | 0.01 | 1.01   | 0.01 |
   t2           | 0.01     | 0.01 | 1.01   | 1.01 |
   t3           | 2.01     | 1.01 | 1.01   | 0.01 |

   doc-topic    | t1      | t2   | t3
   d1           | 2.1 − 1 | 1.1  | 2.1
   d2           | 1.1     | 2.1  | 2.1

Then sample from P(ti | w, d) ∝ P(ti | w) P(ti | d), e.g.:

   P(t1 | mouse, d1) = P(t1 | mouse) × P(t1 | d1)
                     = 0.01 / (0.01 + 0.01 + 2.01) × 1.1 / (1.1 + 1.1 + 2.1)
Similarly for the other topics:

   P(t3 | mouse, d1) = P(t3 | mouse) × P(t3 | d1)
                     = 2.01 / (0.01 + 0.01 + 2.01) × 2.1 / (1.1 + 1.1 + 2.1)

Once every topic's (unnormalised) probability is computed, normalise and sample the token's new topic; then go to step 2 and repeat until convergence.
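The whole procedure fits in a short program. Below is a sketch of the sampler using the simplified conditional P(ti|w) P(ti|d) from these slides, computed from the two count matrices; production implementations (and the standard collapsed Gibbs derivation) differ in details and are far more optimised.

```python
# Sketch of Gibbs sampling for LDA, following steps 1-4 above.
# Documents are lists of word ids; V = vocabulary size, T = number of topics.
import numpy as np

def gibbs_lda(docs, V, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    topic_word = np.full((T, V), beta)          # initialised with beta priors
    doc_topic = np.full((len(docs), T), alpha)  # initialised with alpha priors
    # Step 1: randomly assign a topic to every token.
    z = [[rng.integers(T) for _ in doc] for doc in docs]
    # Step 2: collect co-occurrence statistics from the assignments.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            topic_word[z[d][i], w] += 1
            doc_topic[d, z[d][i]] += 1
    # Steps 3-4: resample every token's topic; repeat until convergence.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # De-allocate the current assignment before sampling.
                topic_word[z[d][i], w] -= 1
                doc_topic[d, z[d][i]] -= 1
                # P(t|w,d) ∝ P(t|w) P(t|d), as in the worked example above.
                p = (topic_word[:, w] / topic_word[:, w].sum()) \
                    * (doc_topic[d] / doc_topic[d].sum())
                t_new = rng.choice(T, p=p / p.sum())
                z[d][i] = t_new
                topic_word[t_new, w] += 1
                doc_topic[d, t_new] += 1
    return topic_word, doc_topic

# Toy corpus over a 4-word vocabulary.
docs = [[0, 1, 0, 2], [2, 3, 2, 3], [0, 1, 1, 0]]
tw, dt = gibbs_lda(docs, V=4, T=2)
print(np.round(tw / tw.sum(axis=1, keepdims=True), 2))  # P(w|t) per topic
```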
When Do We Stop?
• Train until convergence
• Convergence = when the model probability of the training set becomes stable
• How to compute the model probability?

  log P(w1, w2, …, wm) = log Σ_{j=1..T} P(w1|tj) P(tj|d_w1) + … + log Σ_{j=1..T} P(wm|tj) P(tj|d_wm)

  ‣ m = number of word tokens
  ‣ P(w|tj): based on the topic-word co-occurrence matrix
  ‣ P(tj|d_w): based on the document-topic co-occurrence matrix
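A sketch of this computation, assuming topic-word and document-topic count matrices shaped like those in the Gibbs sketch above:

```python
# Model log probability of the corpus from the two co-occurrence matrices.
import numpy as np

def log_prob(docs, topic_word, doc_topic):
    P_w_given_t = topic_word / topic_word.sum(axis=1, keepdims=True)  # P(w|t)
    P_t_given_d = doc_topic / doc_topic.sum(axis=1, keepdims=True)    # P(t|d)
    total = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            # log of: sum over topics of P(w|t) P(t|d_w)
            total += np.log(P_w_given_t[:, w] @ P_t_given_d[d])
    return total
```

Training stops once this value stops improving between iterations.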
Infer Topics For New Documents
1. Randomly assign topics to all tokens in the new/test documents:
   ‣ testdoc1: tiger:t2, cow:t1, cat:t3, tiger:t3
   ‣ testdoc2: football:t2, live:t2, men:t2, fun:t3, soccer:t1
   ‣ testdoc3: news:t1, cnn:t3, tonight:t1
2. Update the document-topic matrix based on the assignments, but use the trained topic-word matrix (kept fixed):

   topic-word (from the trained topic model; fixed)
                | mouse | cat  | scroll | tv   | …
   t1           | 5.01  | 4.01 | 1.01   | 0.01 |
   t2           | 0.01  | 2.01 | 1.01   | 8.01 |
   t3           | 0.01  | 0.01 | 4.01   | 0.01 |

   doc-topic (new matrix)
                | t1   | t2   | t3
   td1          | 1.1  | 1.1  | 2.1
   td2          | 1.1  | 3.1  | 1.1
   …

3. Go through every word token in the test documents and sample topics:
   ‣ P(ti | w, d) ∝ P(ti | w) P(ti | d)
4. Go to step 2 and repeat until convergence
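A sketch of this inference loop, reusing gibbs_lda-style matrices from the earlier sketch; the trained topic-word matrix is passed in and never updated.

```python
# Infer topic distributions for new documents; topic_word stays fixed.
import numpy as np

def infer_topics(new_docs, topic_word, alpha=0.1, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    T = topic_word.shape[0]
    doc_topic = np.full((len(new_docs), T), alpha)  # the only matrix we update
    z = [[rng.integers(T) for _ in doc] for doc in new_docs]
    for d, doc in enumerate(new_docs):
        for i, w in enumerate(doc):
            doc_topic[d, z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(new_docs):
            for i, w in enumerate(doc):
                doc_topic[d, z[d][i]] -= 1          # de-allocate
                p = (topic_word[:, w] / topic_word[:, w].sum()) \
                    * (doc_topic[d] / doc_topic[d].sum())
                z[d][i] = rng.choice(T, p=p / p.sum())
                doc_topic[d, z[d][i]] += 1
    return doc_topic / doc_topic.sum(axis=1, keepdims=True)  # P(t|d) per doc
```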
Hyper-Parameters
• T: number of topics
  ‣ Low T (<10): broad topics
  ‣ High T (100+): fine-grained, specific topics
Hyper-Parameters
• β: prior on the topic-word distribution
• α: prior on the document-topic distribution
• Analogous to k in add-k smoothing for N-gram LMs
• They act as pseudo counts to initialise the co-occurrence matrices:
  ‣ p(w|t): β
  ‣ p(t|d): α

   topic-word   | mouse | cat  | scroll | tv   | …
   t1           | 0.01  | 0.01 | 0.01   | 0.01 |
   t2           | 0.01  | 0.01 | 0.01   | 0.01 |
   t3           | 0.01  | 0.01 | 0.01   | 0.01 |

   doc-topic    | t1   | t2   | t3
   d1           | 0.1  | 0.1  | 0.1
   d2           | 0.1  | 0.1  | 0.1
   …
Hyper-Parameters
• High prior values → flatter distribution
  ‣ A very, very large value would lead to a uniform distribution
• Low prior values → peaky distribution
• β: generally small (< 0.01)
  ‣ Large vocabulary, but we want each topic to focus on specific themes
• α: generally larger (> 0.1)
  ‣ Multiple topics within a document
Evaluation
How To Evaluate Topic Models?
• Unsupervised learning → no labels
• Intrinsic evaluation:
  ‣ Model log probability / perplexity on test documents

  log L = Σ_W log Σ_T P(w|t) P(t|d_w)

  ppl = exp(−log L / W)

  where W is the number of word tokens
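In code, perplexity is a one-liner on top of a log likelihood such as the log_prob sketch earlier:

```python
import math

# ppl = exp(-log L / W), where W is the number of word tokens;
# lower perplexity = the model finds the test documents less surprising.
def perplexity(log_L, W):
    return math.exp(-log_L / W)
```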
Issues with Perplexity
• More topics = better (lower) perplexity
• Smaller vocabulary = better perplexity
  ‣ Perplexity is not comparable across different corpora, or different tokenisation/preprocessing methods
• Does not correlate with human perception of topic quality
• Extrinsic evaluation is the way to go:
  ‣ Evaluate topic models based on a downstream task
Topic Coherence
• A better intrinsic evaluation method
• Measures how coherent the generated topics are
• A good topic model is one that generates more coherent topics
Word Intrusion
• Idea: inject one random word into a topic:
  {farmers, farm, food, rice, agriculture}
  ↓
  {farmers, farm, food, rice, cat, agriculture}
• Ask users to guess which is the intruder word
• Correct guess → topic is coherent
• Try to guess the intruder word in:
  ‣ {choice, count, village, i.e., simply, unionist}
• Manual effort; does not scale
PMI ≈ Coherence?
• High PMI for a pair of words → the words are correlated
  ‣ PMI(farm, rice) ↑
  ‣ PMI(choice, village) ↓
• If all word pairs in a topic have high PMI → the topic is coherent
• If most topics have high PMI → good topic model
• Where to get word co-occurrence statistics for PMI?
  ‣ Can use the same corpus as the topic model
  ‣ A better way is to use an external corpus (e.g. Wikipedia)
PMI
• Compute the pairwise PMI of the top-N words in a topic:

  PMI(t) = Σ_{j=2..N} Σ_{i=1..j−1} log [ P(wi, wj) / (P(wi) P(wj)) ]

• Given the topic {farmers, farm, food, rice, agriculture}, coherence = the sum of PMI over all word pairs:
  ‣ PMI(farmers, farm) + PMI(farmers, food) + … + PMI(rice, agriculture)
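A sketch of this score, estimating the probabilities as document co-occurrence frequencies over a (toy) reference corpus:

```python
# PMI topic coherence: sum of log [P(wi,wj) / (P(wi) P(wj))] over all
# pairs of top-N topic words. Probabilities = document frequencies.
import math
from itertools import combinations

def pmi_coherence(topic_words, ref_docs):
    docsets = [set(d) for d in ref_docs]
    def p(*words):  # fraction of documents containing all the given words
        return sum(all(w in ds for w in words) for ds in docsets) / len(docsets)
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        p_ij = p(wi, wj)
        if p_ij > 0:  # skip pairs that never co-occur (PMI undefined)
            score += math.log(p_ij / (p(wi) * p(wj)))
    return score

ref_docs = [["farmers", "farm", "rice"], ["farm", "rice", "agriculture"],
            ["choice", "village", "count"]]
print(pmi_coherence(["farmers", "farm", "rice"], ref_docs))  # positive: coherent
```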
Variants
• Normalised PMI:

  NPMI(t) = Σ_{j=2..N} Σ_{i=1..j−1} [ log ( P(wi, wj) / (P(wi) P(wj)) ) / −log P(wi, wj) ]

• Log conditional probability:

  LCP(t) = Σ_{j=2..N} Σ_{i=1..j−1} log [ P(wi, wj) / P(wi) ]
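The variants only change the per-pair term; sketched here with p_i, p_j and p_ij standing for the same document-frequency estimates as in the PMI sketch above:

```python
import math

# Per-pair terms for the two variants.
def npmi_pair(p_i, p_j, p_ij):
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def lcp_pair(p_i, p_j, p_ij):
    return math.log(p_ij / p_i)
```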
PMI Examples

Topic                                          | PMI  | NPMI
cell hormone insulin muscle receptor           | 0.61 | 0.59
electron laser magnetic voltage wavelength     | 0.54 | 0.52
magnetic neutrino particle quantum universe    | 0.55 | 0.55
album band music release song                  | 0.37 | 0.56
college education school student university    | 0.38 | 0.57
city county district population town           | 0.34 | 0.52
A Final Word
• Topic model: an unsupervised model for learning latent concepts in a document collection
• LDA: a popular topic model
  ‣ Learning
  ‣ Hyper-parameters
• How to evaluate topic models?
  ‣ Topic coherence