
Topic Modelling
COMP90042
Natural Language Processing
Lecture 20
Semester 1 2021, Week 10. Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

Making Sense of Text

• English Wikipedia: 6M articles
• Twitter: 500M tweets per day
• New York Times: 15M articles
• arXiv: 1M articles

What can we do if we want to learn something about these document collections?

Questions
• What are the less popular topics on Wikipedia?
• What are the big trends on Twitter in the past month?
• How do the themes/topics evolve over time in the New York Times from the 1900s to the 2000s?
• What are some influential research areas?

Topic Models To The Rescue
• Topic models learn common, overlapping themes in a document collection
• Unsupervised model
‣ No labels; input is just the documents!
• What's the output of a topic model?
‣ Topics: each topic associated with a list of words
‣ Topic assignments: each document associated with a list of topics

What Do Topics Look Like?
• A list of words
• Collectively describes a concept or subject
• Words of a topic typically appear in the same set of documents in the corpus

Li, Chua, 2017 ("Document Visualization using Topic Clouds")

Wikipedia Topics
https://appliedmachinelearning.blog/2017/09/28/topic-modelling-part-2-discovering-topics-from-articles-with-latent-dirichlet-allocation/

Twitter Topics
Ramage, Dumais, Liebling, 2010

New York Times Topics

Topic Modeling and Digital Humanities



Applications of topic models?
• Personalised advertising
• Search engine
• Discover senses of polysemous words
• Part-of-speech tagging

Outline

• A Brief History of Topic Models
• Latent Dirichlet Allocation
• Evaluation

A Brief History of Topic Models

Latent Semantic Analysis (L10)
A = UΣVᵀ

[Figure: the |V| × |D| word-document matrix A factorised into U (|V| × m), the diagonal singular value matrix Σ (m × m), and Vᵀ (m × |D|)]

• We used U to get word embeddings in L10
• We didn't do anything with V

LSA: Truncate

A = UΣVᵀ; truncate U, Σ and Vᵀ to the first k dimensions.

[Figure: Uk (|V| × k): each row is a word vector, and each column can be read as a topic, e.g.
topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}
Vᵀk (k × |D|): each column is a topic assignment for a document, e.g.
doc 1 = {t1: -0.2, t2: -4.1, t3: 2.6, …}
doc 2 = {t1: 4.0, t2: 0.6, t3: 6.1, …}]
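To make the truncation concrete, here is a minimal LSA sketch with NumPy. The count matrix A is a randomly generated stand-in for a real word-document matrix, and k = 5 is an arbitrary choice; only the SVD-and-truncate recipe comes from the slide.

```python
# A minimal LSA sketch; the matrix A and k are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(50, 20)).astype(float)  # |V| = 50 words, |D| = 20 docs

# Full decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
U_k = U[:, :k]    # each row: a k-dimensional word vector; each column: a "topic"
Vt_k = Vt[:k, :]  # each column: topic scores ("topic assignment") for a document

print(U_k[0])     # word 0's vector
print(Vt_k[:, 0]) # document 0's topic scores (can be negative, hence hard to interpret)
```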

Issues

• Positive and negative values in U and Vᵀ
• Difficult to interpret

[Figure: Uk with mixed-sign entries, e.g.
topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}]


Probabilistic LSA

• Based on a probabilistic model:

$P(w, d) = P(w|d)\,P(d) = P(d) \sum_{t=1}^{T} P(w|t)\,P(t|d)$

where P(w, d) is the joint probability of a word and a document, P(w|t) is the word distribution for a topic, P(t|d) is the topic distribution for a document, and T is the number of topics.

[Figure: bar charts of a word distribution P(w|t) for Topic 1 (film, movie, star, baseball, plant) and a topic distribution P(t|d) for Document 1 (topic1 to topic4)]
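To see the decomposition in action, here is a toy numeric check of the identity above; the two topics, three words, and all probability values are made-up illustrations.

```python
# A toy check of P(w|d) = sum_t P(w|t) P(t|d), with made-up distributions.
import numpy as np

P_w_given_t = np.array([[0.7, 0.2, 0.1],   # topic 1 over (film, movie, baseball)
                        [0.1, 0.1, 0.8]])  # topic 2
P_t_given_d = np.array([0.9, 0.1])         # document 1 is mostly topic 1

P_w_given_d = P_t_given_d @ P_w_given_t    # marginalise out the topic
print(P_w_given_d)  # [0.64 0.19 0.17], e.g. P(film|d) = 0.9*0.7 + 0.1*0.1 = 0.64
```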

Issues

• No more negative values!
• PLSA can learn topics and topic assignments for documents in the training corpus
• But it is unable to infer the topic distribution of new documents
‣ PLSA needs to be re-trained for new documents

Latent Dirichlet Allocation

• Introduces a prior to the document-topic and topic-word distributions
• Fully generative: a trained LDA model can infer topics on unseen documents!
• LDA is a Bayesian version of PLSA

Latent Dirichlet Allocation

LDA

• Core idea: assume each document contains a mix of topics
• But the topic structure is hidden (latent)
• LDA infers the topic structure given the observed words and documents
• LDA produces soft clusters of documents (based on topic overlap), rather than hard clusters
• A trained LDA model can infer topics on new documents (not part of the training data)

LDA

[Figure: example topics and how they are mixed within documents]

Input

• A collection of documents
• Bag-of-words representation
• Good preprocessing practice (a minimal sketch follows below):
‣ Remove stopwords
‣ Remove low and high frequency word types
‣ Lemmatisation
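A minimal preprocessing sketch using gensim and NLTK; the libraries, the tiny corpus, and the filtering thresholds are all assumptions of convenience, not tools mandated by the slides. It assumes the NLTK stopwords and wordnet data have been downloaded.

```python
# A preprocessing sketch; libraries and thresholds are illustrative choices.
from gensim.corpora import Dictionary
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # requires nltk.download("wordnet")

docs = [["the", "cats", "chased", "the", "mice"],
        ["users", "scrolled", "and", "clicked", "the", "mouse"]]

stops = set(stopwords.words("english"))
lemmatiser = WordNetLemmatizer()
docs = [[lemmatiser.lemmatize(w) for w in doc if w not in stops] for doc in docs]

vocab = Dictionary(docs)
# On a real corpus you might use filter_extremes(no_below=5, no_above=0.5) to
# drop low and high frequency word types; relaxed here so the toy corpus survives.
vocab.filter_extremes(no_below=1, no_above=1.0)
bow_corpus = [vocab.doc2bow(doc) for doc in docs]  # bag-of-words input
# bow_corpus and vocab can then be passed to, e.g., gensim.models.LdaModel
```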


Output

• Topics: distribution over words in each topic

[Figure: bar charts of two topic-word distributions, e.g. Topic 1 peaks on film, movie, show; Topic 2 peaks on baseball, league]

• Topic assignment: distribution over topics in each document

[Figure: bar charts of two document-topic distributions over topic1 to topic4]

Learning

• How do we learn the latent topics?
• Two main families of algorithms:
‣ Variational methods
‣ Sampling-based methods

Sampling Method (Gibbs)

1. Randomly assign topics to all tokens in the documents, e.g.
   doc1: mouse:t1, cat:t3, rat:t2, chase:t1, mouse:t3
   doc2: scroll:t1, mouse:t3, scroll:t3, scroll:t2, click:t2
   doc3: tonight:t2, baseball:t1, tv:t2, exciting:t1
2. Collect topic-word and document-topic co-occurrence statistics based on the assignments. Both matrices are initialised with pseudo-count priors: β (= 0.01) for topic-word and α (= 0.1) for document-topic:

   topic-word matrix (priors only, before counting):
        mouse  cat   scroll  tv
   t1   0.01   0.01  0.01    0.01
   t2   0.01   0.01  0.01    0.01
   t3   0.01   0.01  0.01    0.01

   document-topic matrix (priors only, before counting):
        t1   t2   t3
   d1   0.1  0.1  0.1
   d2   0.1  0.1  0.1

3. Go through every word token in the corpus and sample a new topic:
   P(ti|w,d) ∝ P(ti|w) P(ti|d)
4. Repeat until convergence

Continuing the example, suppose we re-sample the topic of the first token in doc1 (mouse, currently assigned t1). We need to de-allocate the current topic assignment and update the co-occurrence matrices before sampling:

   topic-word matrix (counts collected; mouse:t1 de-allocated):
        mouse   cat   scroll  tv
   t1   1.01-1  0.01  1.01    0.01
   t2   0.01    0.01  1.01    1.01
   t3   2.01    1.01  1.01    0.01

   document-topic matrix (counts collected; d1:t1 de-allocated):
        t1     t2   t3
   d1   2.1-1  1.1  2.1
   d2   1.1    2.1  2.1

Then sample a new topic from P(ti|w,d) ∝ P(ti|w) P(ti|d), e.g.

P(t1|mouse, d1) = 0.01 / (0.01 + 0.01 + 2.01) × 1.1 / (1.1 + 1.1 + 2.1)
P(t3|mouse, d1) = 2.01 / (0.01 + 0.01 + 2.01) × 2.1 / (1.1 + 1.1 + 2.1)

Then go to step 2 and repeat until convergence.
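The whole procedure fits in a short NumPy sketch. The toy corpus, T = 3 and the α/β values mirror the example above; the variable names and iteration count are illustrative choices. The sampling line follows the slide's P(ti|w) P(ti|d) recipe: since the normalisers are constant across topics, the raw smoothed counts suffice for the proportionality.

```python
# A collapsed Gibbs sampler sketch for the toy corpus above (steps 1-4).
import numpy as np

rng = np.random.default_rng(0)
docs = [["mouse", "cat", "rat", "chase", "mouse"],
        ["scroll", "mouse", "scroll", "scroll", "click"],
        ["tonight", "baseball", "tv", "exciting"]]
vocab = sorted({w for doc in docs for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}
T, alpha, beta = 3, 0.1, 0.01

# Step 1: random topic assignments; step 2: co-occurrence counts plus priors.
z = [[rng.integers(T) for _ in doc] for doc in docs]
n_tw = np.full((T, len(vocab)), beta)  # topic-word counts, initialised with beta
n_dt = np.full((len(docs), T), alpha)  # document-topic counts, initialised with alpha
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n_tw[z[d][i], w2i[w]] += 1
        n_dt[d, z[d][i]] += 1

for _ in range(200):                   # step 4: repeat until convergence
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t, v = z[d][i], w2i[w]
            n_tw[t, v] -= 1            # de-allocate the current assignment
            n_dt[d, t] -= 1
            p = n_tw[:, v] * n_dt[d]   # step 3: P(t|w,d) ∝ P(t|w) P(t|d)
            t = rng.choice(T, p=p / p.sum())
            z[d][i] = t                # re-allocate the sampled topic
            n_tw[t, v] += 1
            n_dt[d, t] += 1

for t in range(T):                     # top words per learned topic
    print(f"t{t + 1}:", [vocab[i] for i in np.argsort(-n_tw[t])[:3]])
```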

When Do We Stop?

• Train until convergence
• Convergence = the model probability of the training set becomes stable
• How do we compute the model probability?

$\log P(w_1, w_2, \dots, w_m) = \log \sum_{j=1}^{T} P(w_1|t_j)\,P(t_j|d_{w_1}) + \dots + \log \sum_{j=1}^{T} P(w_m|t_j)\,P(t_j|d_{w_m})$

‣ m = #word tokens
‣ P(w_i|t_j) is based on the topic-word co-occurrence matrix
‣ P(t_j|d_{w_i}) is based on the document-topic co-occurrence matrix
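A sketch of this convergence check, reusing docs, w2i and the prior-smoothed count matrices n_tw and n_dt from the sampler sketch above; normalising the counts gives the P(w|t) and P(t|d) estimates in the formula.

```python
# Model log probability from the two co-occurrence matrices (a sketch).
import numpy as np

def log_prob(docs, w2i, n_tw, n_dt):
    P_w_t = n_tw / n_tw.sum(axis=1, keepdims=True)  # row t: P(w|t)
    P_t_d = n_dt / n_dt.sum(axis=1, keepdims=True)  # row d: P(t|d)
    lp = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            lp += np.log(P_t_d[d] @ P_w_t[:, w2i[w]])  # log sum_t P(w|t) P(t|d)
    return lp  # train until this value stabilises between iterations
```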

Infer Topics For New Documents

1. Randomly assign topics to all tokens in the new/test documents, e.g.
   testdoc1: tiger:t2, cow:t1, cat:t3, tiger:t3
   testdoc2: football:t2, live:t2, men:t2, fun:t3, soccer:t1
   testdoc3: news:t1, cnn:t3, tonight:t1
2. Update the document-topic matrix based on the assignments, but use the topic-word matrix from the trained topic model (kept fixed):

   topic-word matrix (from the trained topic model, fixed):
        mouse  cat   scroll  tv
   t1   5.01   4.01  1.01    0.01
   t2   0.01   2.01  1.01    8.01
   t3   0.01   0.01  4.01    0.01

   document-topic matrix (new matrix):
         t1   t2   t3
   td1   1.1  1.1  2.1
   td2   1.1  3.1  1.1

3. Go through every word in the test documents and sample topics:
   P(ti|w,d) ∝ P(ti|w) P(ti|d)
4. Go to step 2 and repeat until convergence
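A sketch of this inference loop, reusing n_tw and w2i from the training sketch above: the trained topic-word counts stay fixed and only the new document-topic matrix is re-estimated. Dropping out-of-vocabulary words, as done here, is one simple choice among several.

```python
# Inference on unseen documents: n_tw is fixed, n_dt is re-estimated (a sketch).
import numpy as np

def infer(test_docs, w2i, n_tw, T, alpha=0.1, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    docs = [[w2i[w] for w in doc if w in w2i] for doc in test_docs]  # drop OOV
    z = [[rng.integers(T) for _ in doc] for doc in docs]
    n_dt = np.full((len(docs), T), alpha)  # new document-topic matrix
    for d in range(len(docs)):
        for t in z[d]:
            n_dt[d, t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                n_dt[d, z[d][i]] -= 1             # de-allocate
                p = n_tw[:, v] * n_dt[d]          # P(t|w) fixed, P(t|d) updated
                z[d][i] = rng.choice(T, p=p / p.sum())
                n_dt[d, z[d][i]] += 1
    return n_dt / n_dt.sum(axis=1, keepdims=True) # topic distribution per doc
```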

Hyper-Parameters

• T: number of topics
‣ Low T (< 10): broad topics
‣ High T (100+): fine-grained, specific topics
• β: prior on the topic-word distribution
• α: prior on the document-topic distribution
• Analogous to k in add-k smoothing in N-gram LMs
• Pseudo counts to initialise the co-occurrence matrices:
‣ p(w|t): β
‣ p(t|d): α
• High prior values → flatter distribution
‣ a very, very large value would lead to a uniform distribution
• Low prior values → peaky distribution
• β: generally small (< 0.01)
‣ Large vocabulary, but we want each topic to focus on specific themes
• α: generally larger (> 0.1)
‣ Multiple topics within a document
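The flat-vs-peaky effect of the priors is easy to see by sampling from a Dirichlet at different concentrations; the values below are arbitrary illustrations.

```python
# High vs. low concentration: flat vs. peaky distributions (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
print(rng.dirichlet([100.0] * 5))  # high prior: close to uniform (~0.2 each)
print(rng.dirichlet([0.1] * 5))    # low prior: peaky, most mass on 1-2 topics
```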

Evaluation

How To Evaluate Topic Models?

• Unsupervised learning → no labels
• Intrinsic evaluation:
‣ model log probability / perplexity on test documents

$\log L = \sum_{W} \log \sum_{T} P(w|t)\,P(t|d_w)$

$\mathrm{ppl} = \exp\left(\frac{-\log L}{W}\right)$
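Given the log-likelihood routine sketched earlier, perplexity is one line; W is the total number of word tokens in the test documents.

```python
# Perplexity from the test-set log-likelihood (a sketch): lower is better.
import numpy as np

def perplexity(log_L, num_tokens):
    return np.exp(-log_L / num_tokens)
```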

Issues with Perplexity

• More topics = better (lower) perplexity
• Smaller vocabulary = better perplexity
‣ Perplexity is not comparable across different corpora, or across different tokenisation/preprocessing methods
• Does not correlate with human perception of topic quality
• Extrinsic evaluation is the way to go:
‣ Evaluate topic models based on a downstream task

Topic Coherence

• A better intrinsic evaluation method
• Measures how coherent the generated topics are
• A good topic model is one that generates more coherent topics

Word Intrusion

• Idea: inject one random word into a topic:
  {farmers, farm, food, rice, agriculture}
  ↓
  {farmers, farm, food, rice, cat, agriculture}
• Ask users to guess which is the intruder word
• Correct guess → topic is coherent
• Try to guess the intruder word in:
‣ {choice, count, village, i.e., simply, unionist}
• Manual effort; does not scale

PMI ≈ Coherence?

• High PMI for a pair of words → the words are correlated
‣ PMI(farm, rice) ↑
‣ PMI(choice, village) ↓
• If all word pairs in a topic have high PMI → the topic is coherent
• If most topics have high PMI → good topic model
• Where to get word co-occurrence statistics for PMI?
‣ Can use the same corpus as the topic model
‣ A better way is to use an external corpus (e.g. Wikipedia)


PMI

• Compute the pairwise PMI of the top-N words in a topic:

$\mathrm{PMI}(t) = \sum_{j=2}^{N} \sum_{i=1}^{j-1} \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}$

• Given the topic {farmers, farm, food, rice, agriculture}, coherence = sum of PMI over all word pairs:
‣ PMI(farmers, farm) + PMI(farmers, food) + … + PMI(rice, agriculture)
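A sketch of this coherence score; p_word and p_pair are assumed to be probability estimates from co-occurrence counts in an external corpus (e.g. Wikipedia), keyed by word and by unordered word pair respectively.

```python
# Topic coherence as summed pairwise PMI over the top-N words (a sketch).
import itertools
import math

def topic_pmi(top_words, p_word, p_pair):
    score = 0.0
    for wi, wj in itertools.combinations(top_words, 2):
        score += math.log(p_pair[frozenset((wi, wj))] / (p_word[wi] * p_word[wj]))
    return score
```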

Variants

• Normalised PMI:

$\mathrm{NPMI}(t) = \sum_{j=2}^{N} \sum_{i=1}^{j-1} \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}$

• Conditional probability:

$\mathrm{LCP}(t) = \sum_{j=2}^{N} \sum_{i=1}^{j-1} \log \frac{P(w_i, w_j)}{P(w_i)}$
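The normalised variant under the same assumed probability estimates as the PMI sketch above; dividing each pair's PMI by -log P(wi, wj) bounds the per-pair score to [-1, 1].

```python
# NPMI: each pair's PMI normalised by -log P(wi, wj) (a sketch).
import itertools
import math

def topic_npmi(top_words, p_word, p_pair):
    score = 0.0
    for wi, wj in itertools.combinations(top_words, 2):
        p_ij = p_pair[frozenset((wi, wj))]
        score += math.log(p_ij / (p_word[wi] * p_word[wj])) / -math.log(p_ij)
    return score
```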

PMI Examples
Topic                                         PMI    NPMI
cell hormone insulin muscle receptor          0.61   0.59
electron laser magnetic voltage wavelength    0.54   0.52
magnetic neutrino particle quantum universe   0.55   0.55
album band music release song                 0.37   0.56
college education school student university   0.38   0.57
city county district population town          0.34   0.52



A Final Word

• Topic model: an unsupervised model for learning latent concepts in a document collection
• LDA: a popular topic model
‣ Learning
‣ Hyper-parameters
• How to evaluate topic models?
‣ Topic coherence