COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
COMP90042
Natural Language Processing
Lecture 20
Semester 1 2021 Week 10
Jey Han Lau
Topic Modelling
Making Sense of Text
• English Wikipedia: 6M articles
• Twitter: 500M tweets per day
• New York Times: 15M articles
• arXiv: 1M articles
• What can we do if we want to learn something
about these document collections?
Questions
• What are the less popular topics on Wikipedia?
• What are the big trends on Twitter in the past
month?
• How do the themes/topics evolve over time in the New York Times from the 1900s to the 2000s?
• What are some influential research areas?
Topic Models To The Rescue
• Topic models learn common, overlapping themes in
a document collection
• Unsupervised model
‣ No labels; input is just the documents!
• What’s the output of a topic model?
‣ Topics: each topic associated with a list of words
‣ Topic assignments: each document associated
with a list of topics
What Do Topics Look Like?
• A list of words
• Collectively describes a
concept or subject
• Words of a topic typically
appear in the same set
of documents in the
corpus
Li, Chua, 2017 (“Document Visualization using Topic Clouds”)
Wikipedia Topics
Topic Modelling (Part 2): Discovering Topics from Articles with Latent Dirichlet Allocation
Twitter Topics
Ramage, Dumais, Liebling, 2010
New York Times Topics
Applications of topic models?
• Personalised advertising
• Search engine
• Discover senses of polysemous words
• Part-of-speech tagging
Outline
• A Brief History of Topic Models
• Latent Dirichlet Allocation
• Evaluation
A Brief History of
Topic Models
Latent Semantic Analysis (L10)

A = U Σ Vᵀ

• A: |V| × |D| word-document matrix
• U: |V| × m
• Σ: m × m diagonal matrix of singular values
• Vᵀ: m × |D|

We used U to get word embeddings in L10; we didn't do anything with Vᵀ.
LSA: Truncate

A ≈ Uk Σk Vᵀk

• Truncate U (|V| × m) to Uk (|V| × k), keeping only the first k dimensions
  ‣ Each row of Uk is a word vector
  ‣ Each column of Uk is a topic:
    topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
    topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}
• Truncate Vᵀ (m × |D|) to Vᵀk (k × |D|)
  ‣ Each column of Vᵀk is a topic assignment for a document:
    doc 1 = {t1: -0.2, t2: -4.1, t3: 2.6, …}
    doc 2 = {t1: 4.0, t2: 0.6, t3: 6.1, …}
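A minimal sketch of this truncation in code; the toy count matrix and vocabulary below are made up for illustration:

```python
# LSA-style topic extraction via truncated SVD (toy data for illustration).
import numpy as np

vocab = ["dog", "cat", "shadow", "film", "movie"]
# A: |V| x |D| word-document count matrix
A = np.array([
    [2, 0, 1, 0],
    [3, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 4, 2],
    [0, 0, 3, 3],
], dtype=float)

U, S, VT = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions ("topics") to keep
Uk, VTk = U[:, :k], VT[:k, :]

# Each column of Uk is a "topic": a weight for every word in the vocabulary
for t in range(k):
    weights = sorted(zip(vocab, Uk[:, t]), key=lambda x: -abs(x[1]))
    print(f"topic {t+1}:", [(w, round(v, 2)) for w, v in weights])

# Each column of VTk gives the topic weights for one document
for d in range(A.shape[1]):
    print(f"doc {d+1}:", np.round(VTk[:, d], 2))
```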
Issues

• U and Vᵀ contain a mix of positive and negative values
• Difficult to interpret:
  topic 1 = {dog: 2.2, cat: 5.5, shadow: -1.3, …}
  topic 2 = {dog: 0.3, cat: -2.8, shadow: 3.7, …}
  ‣ e.g. what does it mean for "shadow" to have a negative weight in a topic?
Probabilistic LSA

• Based on a probabilistic model:

  P(w, d) = P(w|d) P(d) = P(d) Σ_{t=1..T} P(w|t) P(t|d)

  ‣ P(w, d): joint probability of a word and a document
  ‣ T: number of topics
  ‣ P(w|t): word distribution for a topic
  ‣ P(t|d): topic distribution for a document

[Figure: bar charts of a word distribution for Topic 1 (film, movie, star, baseball, plant) and a topic distribution for Document 1 (topic1-topic4)]
Issues
• No more negative values!
• PLSA can learn topics and topic assignments for documents in the training corpus
• But it cannot infer topic distributions for new documents
• PLSA needs to be re-trained for new documents
Latent Dirichlet Allocation
• Introduces priors on the document-topic and topic-word distributions
• Fully generative: a trained LDA model can infer topics for unseen documents!
• LDA is a Bayesian version of PLSA
Latent Dirichlet Allocation
LDA
• Core idea: assume each document contains a mix
of topics
• But the topic structure is hidden (latent)
• LDA infers the topic structure given the observed
words and documents
• LDA produces soft clusters of documents (based
on topic overlap), rather than hard clusters
• A trained LDA model can infer topics for new documents (not part of the training data)
[Figure: LDA takes the document collection as input and discovers the topics and the topics present in each document]
Input
• A collection of documents
• Bag-of-words
• Good preprocessing practice:
‣ Remove stopwords
‣ Remove low and high frequency word types
‣ Lemmatisation
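A minimal preprocessing sketch along these lines, assuming NLTK's stopword list and WordNet lemmatiser (and their data files) plus gensim are available; the frequency thresholds are illustrative:

```python
# Turn raw documents into the bag-of-words input a topic model expects.
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

docs = [
    "The farmers grow rice on the farm.",
    "Rice and food prices affect farmers.",
    "The cat chased the mouse across the farm.",
]

stops = set(stopwords.words("english"))
lemmatiser = WordNetLemmatizer()

# Tokenise, remove stopwords, lemmatise
tokenised = [
    [lemmatiser.lemmatize(tok) for tok in simple_preprocess(doc)
     if tok not in stops]
    for doc in docs
]

vocab = Dictionary(tokenised)
# Drop very rare and very common word types (tune thresholds for a real corpus)
vocab.filter_extremes(no_below=2, no_above=0.9)

# Bag-of-words representation: list of (word_id, count) per document
bows = [vocab.doc2bow(doc) for doc in tokenised]
```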
Output
• Topics: distribution over words in each topic
• Topic assignment: distribution over topics in each
document
[Figure: bar charts of the word distributions for Topic 1 and Topic 2 (over film, movie, show, baseball, league) and the topic distributions for Document 1 and Document 2 (over topic1-topic4)]
Learning
• How do we learn the latent topics?
• Two main families of algorithms:
‣ Variational methods
‣ Sampling-based methods
Sampling Method (Gibbs)

1. Randomly assign topics to all tokens in documents
2. Collect topic-word and document-topic co-occurrence statistics
   based on the assignments
3. Go through every word token in the corpus and sample a new topic:
   ‣ P(ti|w, d) ∝ P(ti|w) P(ti|d)
4. Repeat until convergence

doc1: mouse:t1  cat:t3  rat:t2  chase:t1  mouse:t3
doc2: scroll:t1  mouse:t3  scroll:t3  scroll:t2  click:t2
doc3: tonight:t2  baseball:t1  tv:t2  exciting:t1

Topic-word co-occurrence matrix, initialised with the β (= 0.01) prior:
      mouse  cat   scroll  tv    …
t1    0.01   0.01  0.01    0.01
t2    0.01   0.01  0.01    0.01
t3    0.01   0.01  0.01    0.01

Document-topic co-occurrence matrix, initialised with the α (= 0.1) prior:
      t1    t2    t3
d1    0.1   0.1   0.1
d2    0.1   0.1   0.1
…
Sampling Method (Gibbs)

1. Randomly assign topics to all tokens in documents
2. Collect topic-word and document-topic co-occurrence statistics
   based on the assignments
3. Go through every word token in the corpus and sample a new topic:
   ‣ P(ti|w, d) ∝ P(ti|w) P(ti|d)
4. Go to step 2 and repeat until convergence

doc1: mouse:?  cat:t3  rat:t2  chase:t1  mouse:t3
doc2: scroll:t1  mouse:t3  scroll:t3  scroll:t2  click:t2
doc3: tonight:t2  baseball:t1  tv:t2  exciting:t1

Topic-word co-occurrence matrix (after collecting statistics):
      mouse  cat   scroll  tv    …
t1    1.01   0.01  1.01    0.01
t2    0.01   0.01  1.01    1.01
t3    2.01   1.01  1.01    0.01

Document-topic co-occurrence matrix (after collecting statistics):
      t1    t2    t3
d1    2.1   1.1   2.1
d2    1.1   2.1   2.1
…

To sample a new topic for the first "mouse" token in doc1 (marked "?", currently t1), we need to de-allocate its current topic assignment and update the co-occurrence matrices before sampling: subtract 1 from (t1, mouse) in the topic-word matrix and from (d1, t1) in the document-topic matrix. Then:

P(t1|w, d) = P(t1|mouse) × P(t1|d1)
           = 0.01 / (0.01 + 0.01 + 2.01) × 1.1 / (1.1 + 1.1 + 2.1)
Sampling Method (Gibbs)

Similarly, for the other topics, e.g. t3:

P(t3|w, d) = P(t3|mouse) × P(t3|d1)
           = 2.01 / (0.01 + 0.01 + 2.01) × 2.1 / (1.1 + 1.1 + 2.1)

Once P(ti|w, d) has been computed for every topic, normalise the values, sample a new topic assignment for the token, and update the co-occurrence matrices.
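A toy implementation of the sampler as formulated on these slides (the corpus, priors, and topic count are illustrative):

```python
# Gibbs-style sampler following the update rule on these slides,
# P(t_i|w, d) ∝ P(t_i|w) P(t_i|d), with beta/alpha pseudo-counts.
import numpy as np

rng = np.random.default_rng(0)

docs = [["mouse", "cat", "rat", "chase", "mouse"],
        ["scroll", "mouse", "scroll", "scroll", "click"],
        ["tonight", "baseball", "tv", "exciting"]]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}

T, beta, alpha = 3, 0.01, 0.1

# Initialise co-occurrence matrices with the priors, then randomly
# assign a topic to every token and collect counts (steps 1-2)
topic_word = np.full((T, len(vocab)), beta)
doc_topic = np.full((len(docs), T), alpha)
z = [[rng.integers(T) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        topic_word[z[d][n], w2i[w]] += 1
        doc_topic[d, z[d][n]] += 1

for _ in range(1000):  # step 4: repeat until convergence
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t, wi = z[d][n], w2i[w]
            # De-allocate the current assignment before sampling
            topic_word[t, wi] -= 1
            doc_topic[d, t] -= 1
            # Step 3: P(t|w) * P(t|d), each normalised over topics
            p = (topic_word[:, wi] / topic_word[:, wi].sum()) \
                * (doc_topic[d] / doc_topic[d].sum())
            t_new = rng.choice(T, p=p / p.sum())
            z[d][n] = t_new
            topic_word[t_new, wi] += 1
            doc_topic[d, t_new] += 1

# Topics: top words per topic from the topic-word matrix
for t in range(T):
    top = np.argsort(-topic_word[t])[:3]
    print(f"t{t+1}:", [vocab[i] for i in top])
```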
When Do We Stop?

• Train until convergence
• Convergence = model probability of training set becomes stable
• How to compute model probability?
  ‣ m = #word tokens

  log P(w1, w2, …, wm) = log Σ_{j=1..T} P(w1|tj) P(tj|d_w1) + … + log Σ_{j=1..T} P(wm|tj) P(tj|d_wm)

  ‣ P(w|tj): based on the topic-word co-occurrence matrix
  ‣ P(tj|d_w): based on the document-topic co-occurrence matrix
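A sketch of this convergence check, computing the corpus log probability from the (normalised) co-occurrence matrices; the variable names are assumptions, not from the lecture:

```python
import numpy as np

def log_prob(docs, w2i, topic_word, doc_topic):
    # Normalise counts into P(w|t) (rows of topic_word) and P(t|d) (rows of doc_topic)
    p_w_given_t = topic_word / topic_word.sum(axis=1, keepdims=True)
    p_t_given_d = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    logp = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            # log sum_j P(w|t_j) P(t_j|d_w)
            logp += np.log(p_w_given_t[:, w2i[w]] @ p_t_given_d[d])
    return logp

# Train until log_prob(...) stabilises between sampling iterations.
```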
Infer Topics For New Documents

1. Randomly assign topics to all tokens in new/test documents
2. Update the document-topic matrix based on the assignments, but use
   the trained topic-word matrix (kept fixed)
3. Go through every word in the test documents and sample topics:
   ‣ P(ti|w, d) ∝ P(ti|w) P(ti|d)
4. Go to step 2 and repeat until convergence

testdoc1: tiger:t2  cow:t1  cat:t3  tiger:t3
testdoc2: football:t2  live:t2  men:t2  fun:t3  soccer:t1
testdoc3: news:t1  cnn:t3  tonight:t1

Topic-word co-occurrence matrix (from the trained topic model; kept fixed):
      mouse  cat   scroll  tv    …
t1    5.01   4.01  1.01    0.01
t2    0.01   2.01  1.01    8.01
t3    0.01   0.01  4.01    0.01

Document-topic co-occurrence matrix (new matrix for the test documents):
      t1    t2    t3
td1   1.1   1.1   2.1
td2   1.1   3.1   1.1
…
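In practice one would rarely hand-roll this; a sketch of training and inference with gensim's LdaModel (the hyper-parameter values here are illustrative):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

train_docs = [["mouse", "cat", "rat", "chase", "mouse"],
              ["scroll", "mouse", "scroll", "scroll", "click"],
              ["tonight", "baseball", "tv", "exciting"]]
vocab = Dictionary(train_docs)
corpus = [vocab.doc2bow(d) for d in train_docs]

lda = LdaModel(corpus, num_topics=3, id2word=vocab,
               alpha=0.1, eta=0.01, passes=50)

# Infer a topic distribution for an unseen document: the topic-word
# distributions stay fixed; only the document's topic mixture is estimated
new_bow = vocab.doc2bow(["cat", "mouse", "chase"])
print(lda.get_document_topics(new_bow))
```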
Hyper-Parameters

• T: number of topics
  ‣ Low T (<10): broad topics
  ‣ High T (100+): fine-grained, specific topics

• β: prior on the topic-word distribution p(w|t)
• α: prior on the document-topic distribution p(t|d)
• Analogous to k in add-k smoothing in N-gram LM
• Pseudo counts to initialise the co-occurrence matrices:
  ‣ β for the topic-word matrix (e.g. 0.01)
  ‣ α for the document-topic matrix (e.g. 0.1)

• High prior values → flatter distribution
  ‣ a very, very large value would lead to a uniform distribution
• Low prior values → peaky distribution
• β: generally small (< 0.01)
  ‣ Large vocabulary, but we want each topic to focus on specific themes
• α: generally larger (> 0.1)
  ‣ Multiple topics within a document
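A quick illustration (not from the lecture) of how the prior value shapes distributions sampled from a Dirichlet:

```python
# Large Dirichlet concentrations give flat distributions; small ones give peaky ones.
import numpy as np

rng = np.random.default_rng(0)
T = 4  # e.g. a document-topic distribution over 4 topics

for a in (10.0, 1.0, 0.1):
    sample = rng.dirichlet([a] * T)
    print(f"prior={a:>4}: {np.round(sample, 2)}")
# prior=10 -> near-uniform mixtures; prior=0.1 -> mass piles onto few topics
```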
Evaluation
How To Evaluate Topic Models?

• Unsupervised learning → no labels
• Intrinsic evaluation:
  ‣ model log probability / perplexity on test documents

  log L = Σ_W log Σ_T P(w|t) P(t|d_w)

  ppl = exp(−log L / W)
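As a sketch, perplexity from a held-out log likelihood (log_L) and token count (W); the numbers below are made up:

```python
import numpy as np

def perplexity(log_L: float, W: int) -> float:
    # ppl = exp(-log L / W); lower = model is less "surprised" by the test docs
    return float(np.exp(-log_L / W))

print(perplexity(log_L=-4500.0, W=1000))  # illustrative values
```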
Issues with Perplexity
• More topics = better (lower) perplexity
• Smaller vocabulary = better perplexity
‣ Perplexity not comparable for different corpora, or
different tokenisation/preprocessing methods
• Does not correlate with human perception of topic
quality
• Extrinsic evaluation is the way to go:
‣ Evaluate topic models based on a downstream task
Topic Coherence
• A better intrinsic evaluation method
• Measures how coherent the generated topics are
• A good topic model is one that generates more
coherent topics
Word Intrusion
• Idea: inject one random word into a topic
{farmers, farm, food, rice, agriculture}
↓
{farmers, farm, food, rice, cat, agriculture}
• Ask users to guess which is the intruder word
• Correct guess → topic is coherent
• Try to guess the intruder word in:
‣ {choice, count, village, i.e., simply, unionist}
• Manual effort; does not scale
PMI ≈ Coherence?
• High PMI for a pair of words → words are correlated
‣ PMI(farm, rice) ↑
‣ PMI(choice, village) ↓
• If all word pairs in a topic have high PMI → topic is coherent
• If most topics have high PMI → good topic model
• Where to get word co-occurrence statistics for PMI?
‣ Can use same corpus for topic model
‣ A better way is to use an external corpus (e.g.
Wikipedia)
PMI

• Compute pairwise PMI of the top-N words in a topic

  PMI(t) = Σ_{j=2..N} Σ_{i=1..j−1} log [ P(wi, wj) / (P(wi) P(wj)) ]

• Given topic: {farmers, farm, food, rice, agriculture}
• Coherence = sum of PMI for all word pairs:
  ‣ PMI(farmers, farm) + PMI(farmers, food) + … + PMI(rice, agriculture)
Variants

• Normalised PMI:

  NPMI(t) = Σ_{j=2..N} Σ_{i=1..j−1} [ log ( P(wi, wj) / (P(wi) P(wj)) ) / −log P(wi, wj) ]

• Conditional probability:

  LCP(t) = Σ_{j=2..N} Σ_{i=1..j−1} log [ P(wi, wj) / P(wi) ]
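A sketch of these coherence measures using document-level co-occurrence counts; the toy corpus is illustrative, and in practice the statistics would come from a large external corpus such as Wikipedia:

```python
# Topic coherence via PMI / NPMI / LCP over document co-occurrence counts.
from collections import Counter
from itertools import combinations
import math

def coherence(topic_words, docs, variant="pmi", eps=1e-12):
    # Collect single-word and pairwise document co-occurrence probabilities
    n = len(docs)
    uni, pair = Counter(), Counter()
    for doc in docs:
        present = set(doc)
        uni.update(present)
        pair.update(frozenset(p) for p in combinations(sorted(present), 2))

    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        p_i, p_j = uni[wi] / n, uni[wj] / n
        p_ij = pair[frozenset((wi, wj))] / n + eps
        if variant == "pmi":
            score += math.log(p_ij / (p_i * p_j + eps))
        elif variant == "npmi":
            score += math.log(p_ij / (p_i * p_j + eps)) / -math.log(p_ij)
        else:  # "lcp": log conditional probability
            score += math.log(p_ij / (p_i + eps))
    return score

docs = [["farmers", "farm", "rice"], ["farm", "food", "agriculture"],
        ["rice", "agriculture", "farmers"], ["choice", "village", "unionist"]]
print(coherence(["farmers", "farm", "rice"], docs, variant="npmi"))
```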
PMI Examples
Topic PMI NPMI
cell hormone insulin muscle receptor 0.61 0.59
electron laser magnetic voltage wavelength 0.54 0.52
magnetic neutrino particle quantum universe 0.55 0.55
album band music release song 0.37 0.56
college education school student university 0.38 0.57
city county district population town 0.34 0.52
A Final Word
• Topic model: an unsupervised model for learning
latent concepts in a document collection
• LDA: a popular topic model
‣ Learning
‣ Hyper-parameters
• How to evaluate topic models?
‣ Topic coherence