
Topic Modelling
COMP90042
Natural Language Processing Lecture 20
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE






Making Sense of Text
• English Wikipedia: 6M articles
• Twitter: 500M tweets per day
• New York Times: 15M articles
• arXiv: 1M articles
What can we do if we want to learn something about these document collections?

Questions
• What are the less popular topics on Wikipedia?
• What are the big trends on Twitter in the past month?
• How do the themes/topics evolve over time in the New York Times from the 1900s to the 2000s?
• What are some influential research areas?

Topic Models To The Rescue
• Topic models learn common, overlapping themes in a document collection
• Unsupervised model
‣ No labels; input is just the documents!

What Is A Topic?
• A set of words
• Collectively describes a concept or subject
• Words of a topic typically appear in the same set of documents in the corpus
Li, Chua, 2017 (“Document Visualization using Topic Clouds”)

Wikipedia Topics
https://appliedmachinelearning.blog/2017/09/28/topic-modelling-part-2-discovering-topics-from-articles-with-latent-dirichlet-allocation/

Twitter Topics
Ramage, Dumais, Liebling, 2010

New York Times Topics
Topic Modeling and Digital Humanities

A Brief History of Topic Models


What Is A Topic Model?
• Latent Semantic Analysis (L10): SVD + truncate

A = U Σ Vᵀ

‣ A: |V| × |D| term-document matrix
‣ U: |V| × m matrix
‣ Σ: m × m diagonal matrix of singular values
‣ Vᵀ: m × |D| matrix
[Figure: the term-document matrix A decomposed into U, Σ and Vᵀ]

LSA: Truncate
A = U Σ Vᵀ
• Truncate U, Σ and Vᵀ to the first k dimensions (topics):
‣ U_k: |V| × k; each row is a word vector
‣ Vᵀ_k: k × |D|; each column is a document's topic distribution
[Figure: U truncated to U_k (word vectors) and Vᵀ truncated to Vᵀ_k (topic distributions)]
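The SVD-and-truncate step can be sketched with numpy; the toy 4×3 term-document matrix below is invented for illustration.

```python
import numpy as np

# Toy term-document matrix: |V| = 4 word types, |D| = 3 documents.
A = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 4.],
              [1., 0., 2.]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions ("topics") to keep
U_k = U[:, :k]        # |V| x k: each row is a word vector
S_k = S[:k]
Vt_k = Vt[:k, :]      # k x |D|: each column is a document's topic weights

# The rank-k product approximates the original matrix A.
A_k = U_k @ np.diag(S_k) @ Vt_k
print(np.round(A_k, 2))
```

With the full m dimensions the product reconstructs A exactly; truncating to k < m keeps only the strongest latent dimensions.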

Issues
• Positive and negative values in U and Vᵀ
• Difficult to interpret
[Figure: U_k with mixed-sign entries in each topic dimension]


Probabilistic LSA
• Based on a probabilistic model:

P(w, d) = P(w|d) P(d)
        = P(d) Σₜ P(w|t) P(t|d)

‣ P(w, d): joint probability of a word and a document
‣ P(w|t): word distribution for a topic
‣ P(t|d): topic distribution for a document
‣ T: number of topics
[Figure: bar charts of the word distribution for Topic 1 (film, movie, star, baseball, plant) and the topic distribution for Document 1 (topic1–topic4)]
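A minimal sketch of the PLSA decomposition P(w, d) = P(d) Σₜ P(w|t) P(t|d); the words, topics and probability values below are all made up for illustration.

```python
import numpy as np

words = ["film", "movie", "star", "baseball", "plant"]

# P(w|t): one word distribution per topic (each row sums to 1).
p_w_given_t = np.array([[0.40, 0.30, 0.20, 0.05, 0.05],   # topic 1: movies
                        [0.05, 0.05, 0.10, 0.60, 0.20]])  # topic 2: baseball

# P(t|d): one topic distribution per document (each row sums to 1).
p_t_given_d = np.array([[0.8, 0.2],   # document 1 leans towards topic 1
                        [0.3, 0.7]])  # document 2 leans towards topic 2

p_d = np.array([0.5, 0.5])  # P(d)

def p_joint(w, d):
    # P(w, d) = P(d) * sum_t P(w|t) P(t|d)
    wi = words.index(w)
    return p_d[d] * np.dot(p_w_given_t[:, wi], p_t_given_d[d])

print(p_joint("film", 0))  # high: document 1 is mostly the movie topic
```

Summing p_joint over every (word, document) pair gives 1, confirming it is a proper joint distribution.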

Issues
• No more negative values!
• PLSA can learn topics and topic distributions for documents in the training corpus
• But it is unable to infer the topic distribution of new documents
‣ PLSA needs to be re-trained for new documents




Latent Dirichlet Allocation
• LDA is a Bayesian version of PLSA
• Introduces a prior to the document-topic and topic-word distributions
• Fully generative: a trained LDA model can infer topics on unseen documents!

Latent Dirichlet Allocation

LDA
• Core idea: assume each document contains a mix of topics
• But the topic structure is hidden (latent)
• LDA infers the topic structure given the observed words and documents
• LDA produces soft clusters of documents (based on topic overlap), rather than hard clusters

LDA
[Figure: example topics, and those topics highlighted within documents]

Input
• A collection of documents
• Bag-of-words
• Good preprocessing practice:
‣ Remove stopwords
‣ Remove low- and high-frequency word types
‣ Lemmatisation


Output
• Topics: a multinomial distribution over words for each topic
• Topics in documents: a multinomial distribution over topics for each document
[Figure: bar charts of the word distributions for Topic 1 and Topic 2 (film, movie, show, baseball, league) and the topic distributions for Document 1 and Document 2 (topic1–topic4)]

Learning
• How do we learn the latent topics?
• Two main families of algorithms:
‣ Variational methods
‣ Sampling-based methods

Sampling Method (Gibbs)
1. Randomly assign topics to all tokens in documents

doc1: mouse: t1, cat: t3, rat: t2, chase: t1, mouse: t3
doc2: scroll: t1, mouse: t3, scroll: t3, scroll: t2, click: t2
doc3: tonight: t2, baseball: t1, tv: t2, exciting: t1

2. Collect topic-word and document-topic co-occurrence statistics based on the assignments

Topic-word:
     mouse  cat  scroll  tv  ...
t1   1      0    1       0
t2   0      0    1       1
t3   2      1    1       0

Document-topic:
     t1  t2  t3
d1   2   1   2
d2   1   2   2
...

3. Go through every word token in the corpus and sample a new topic:

P(tᵢ) ∝ P(tᵢ|w) P(tᵢ|d)

‣ Need to de-allocate the current topic assignment and update the co-occurrence matrices before sampling
4. Repeat until convergence
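The four steps above can be sketched as a toy collapsed Gibbs sampler. This is an illustrative sketch, not a full LDA derivation: the documents match the slide's example, but the topic count, iteration count and alpha/beta pseudo-count values are assumptions.

```python
import random
random.seed(0)

docs = [["mouse", "cat", "rat", "chase", "mouse"],
        ["scroll", "mouse", "scroll", "scroll", "click"],
        ["tonight", "baseball", "tv", "exciting"]]
T = 3
vocab = sorted({w for doc in docs for w in doc})
alpha, beta = 0.1, 0.01  # pseudo-counts (assumed values)

# Step 1: random initial topic assignments.
z = [[random.randrange(T) for _ in doc] for doc in docs]

# Step 2: topic-word and document-topic co-occurrence counts.
nwt = {(w, t): 0 for w in vocab for t in range(T)}
ndt = [[0] * T for _ in docs]
nt = [0] * T  # tokens per topic, for normalising P(t|w)
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        nwt[(w, t)] += 1; ndt[d][t] += 1; nt[t] += 1

# Steps 3-4: resample every token's topic, repeatedly.
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # De-allocate the current assignment before sampling.
            nwt[(w, t)] -= 1; ndt[d][t] -= 1; nt[t] -= 1
            # P(t) proportional to the topic-word side times the
            # document-topic side, smoothed with beta and alpha.
            weights = [(nwt[(w, t2)] + beta) / (nt[t2] + len(vocab) * beta)
                       * (ndt[d][t2] + alpha) for t2 in range(T)]
            t_new = random.choices(range(T), weights=weights)[0]
            z[d][i] = t_new
            nwt[(w, t_new)] += 1; ndt[d][t_new] += 1; nt[t_new] += 1
```

After sampling, the rows of the document-topic counts (normalised) estimate P(t|d), and the topic-word counts estimate P(w|t).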

When Do We Stop?
• Train until convergence
• Convergence = model probability of the training set becomes stable
• How to compute model probability?

log P(w₁, w₂, …, wₘ) = log Σⱼ₌₁ᵀ P(w₁|tⱼ)P(tⱼ|d_{w₁}) + … + log Σⱼ₌₁ᵀ P(wₘ|tⱼ)P(tⱼ|d_{wₘ})

‣ m = #word tokens
‣ P(w|t): based on the topic-word co-occurrence matrix
‣ P(t|d): based on the document-topic co-occurrence matrix

Infer Topics For New Documents
1. Randomly assign topics to all tokens in new/test documents

testdoc1: tiger: t2, cow: t1, cat: t3, tiger: t3
testdoc2: football: t2, live: t2, men: t2, fun: t3, soccer: t1
testdoc3: news: t1, cnn: t3, tonight: t1

2. Update the document-topic matrix based on the assignments, but use the trained topic-word matrix (unchanged)

Topic-word (from the trained topic model):
     mouse  cat  scroll  tv  ...
t1   1      0    1       0
t2   0      0    1       1
t3   2      1    1       0

Document-topic (new matrix):
      t1  t2  t3
td1   1   1   2
td2   1   3   1

3. Go through every word in the test documents and sample topics:

P(tᵢ) ∝ P(tᵢ|w) P(tᵢ|d)

4. Repeat


Hyper-Parameters
• T: number of topics
‣ Low T (< 10): broad topics
‣ High T (100+): fine-grained, specific topics

Hyper-Parameters
• β: prior on the topic-word distribution
• α: prior on the document-topic distribution
• Analogous to k in add-k smoothing in an N-gram LM
• Pseudo-counts when computing:
‣ P(w|t): β
‣ P(t|d): α

P(t₁|d₁) = (2 + α) / (5 + Tα)

[Tables: the topic-word (t1–t3 × mouse, cat, scroll, tv, …) and document-topic (d1, d2 × t1–t3) co-occurrence matrices from the Gibbs slide]

Hyper-Parameters
• High prior values → flatter distribution
‣ A very, very large value would lead to a uniform distribution
• Low prior values → peaky distribution
• β: generally small (< 0.01)
‣ Large vocabulary, but we want each topic to focus on specific themes
• α: generally larger (> 0.1)
‣ Multiple topics within a document
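The pseudo-count calculation P(t₁|d₁) = (2 + α) / (5 + Tα), spelled out for the document-topic counts on the slide (d1 has counts t1 = 2, t2 = 1, t3 = 2); the α value is an assumed example.

```python
ndt_d1 = [2, 1, 2]  # document-topic counts for d1 (from the slide)
T = 3
alpha = 0.1

def p_t_given_d(counts, t, alpha, T):
    # Add alpha to the count for topic t, and T * alpha to the
    # normaliser -- the same move as add-k smoothing in an n-gram LM.
    return (counts[t] + alpha) / (sum(counts) + T * alpha)

print(p_t_given_d(ndt_d1, 0, alpha, T))  # (2 + 0.1) / (5 + 3 * 0.1)
```

P(w|t) is smoothed the same way with β, except the normaliser adds |V|·β since it ranges over the vocabulary.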

Evaluation

How To Evaluate Topic Models?
• Unsupervised learning → no labels
• Intrinsic evaluation:
‣ model log probability / perplexity on test documents

L = Πᵥᵥ Σₜ P(w|t) P(t|d_w)
ppl = exp(−(log L) / m)

‣ m = total number of word tokens in the test documents
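A sketch of the perplexity computation above: multiply Σₜ P(w|t) P(t|d_w) over every test token to get L, then take exp(−log L / m). The per-token probabilities below are invented for illustration.

```python
import math

# sum_t P(w|t) P(t|d_w) for each of m = 4 test tokens (toy values).
token_probs = [0.05, 0.2, 0.01, 0.1]

log_L = sum(math.log(p) for p in token_probs)  # log of the product
m = len(token_probs)
ppl = math.exp(-log_L / m)
print(ppl)  # lower is better
```

Perplexity is the inverse of the geometric mean of the per-token probabilities, so assigning higher probability to the test tokens always lowers it.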

Issues with Perplexity
• More topics = better (lower) perplexity
• Smaller vocabulary = better perplexity
‣ Perplexity is not comparable across different corpora, or different tokenisation/preprocessing methods
• Does not correlate with human perception of topic quality
• Extrinsic evaluation is the way to go:
‣ Evaluate topic models based on a downstream task

Topic Coherence
• A better intrinsic evaluation method
• Measures how coherent the generated topics are
• A good topic model is one that generates more coherent topics

Word Intrusion
• Idea: inject one random word into a topic
{farmers, farm, food, rice, agriculture}
↓
{farmers, farm, food, rice, cat, agriculture}
• Ask users to guess which is the intruder word
• Correct guess → topic is coherent
• Try to guess the intruder word in:
‣ {choice, count, village, i.e., simply, unionist}
• Manual effort; does not scale

Estimate Coherence Automatically? PMI!
• Compute pairwise PMI of the top-N words in a topic:

PMI(t) = Σⱼ₌₂ᴺ Σᵢ₌₁ʲ⁻¹ log [P(wᵢ, wⱼ) / (P(wᵢ) P(wⱼ))]

• Given topic: {farmers, farm, food, rice, agriculture}
• Sum log PMI for all word pairs in the topic:
‣ log PMI(farmers, farm) + log PMI(farmers, food) + … + log PMI(rice, agriculture)
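A sketch of PMI-based coherence: estimate P(w) and P(wᵢ, wⱼ) from document co-occurrence in a reference corpus, then sum log PMI over all top-word pairs. The tiny "corpus" and the epsilon guard against zero counts are both assumptions for illustration.

```python
import math
from itertools import combinations

# Toy reference corpus: each document as a set of word types.
corpus = [{"farmers", "farm", "rice"},
          {"farm", "food", "agriculture"},
          {"farmers", "rice", "agriculture"},
          {"cat", "food"}]
N = len(corpus)

def p(*words):
    # Fraction of documents containing all the given words.
    return sum(1 for doc in corpus if all(w in doc for w in words)) / N

def pmi_coherence(topic, eps=1e-12):
    # Sum log PMI over every pair of the topic's top words;
    # eps avoids log(0) when a pair never co-occurs.
    return sum(math.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
               for wi, wj in combinations(topic, 2))

coherent = pmi_coherence(["farmers", "farm", "rice", "agriculture"])
random_mix = pmi_coherence(["farmers", "cat", "tv", "rice"])
print(coherent, random_mix)  # the coherent topic scores higher
```

Words that co-occur more often than chance push the score up; unrelated pairs pull it sharply down.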

Variants
• Normalised PMI:

NPMI(t) = Σⱼ₌₂ᴺ Σᵢ₌₁ʲ⁻¹ [log (P(wᵢ, wⱼ) / (P(wᵢ) P(wⱼ)))] / (−log P(wᵢ, wⱼ))

• Conditional probability:

LCP(t) = Σⱼ₌₂ᴺ Σᵢ₌₁ʲ⁻¹ log [P(wᵢ, wⱼ) / P(wᵢ)]

• Good correlation with human perception of topic coherence
• Better correlation if we use a different corpus to estimate PMI
‣ i.e. don't use the corpus that we run the topic model on
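The NPMI term for a single pair, showing why the normalisation by −log P(wᵢ, wⱼ) bounds each term in roughly [−1, 1]; the probability values are toy examples.

```python
import math

def npmi_pair(p_ij, p_i, p_j):
    # log [P(wi,wj) / (P(wi) P(wj))] normalised by -log P(wi,wj)
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

# Perfectly correlated pair: the words always co-occur.
print(npmi_pair(0.1, 0.1, 0.1))   # 1.0
# Independent pair: joint equals the product of the marginals.
print(npmi_pair(0.01, 0.1, 0.1))  # 0.0
```

NPMI(t) sums this term over all top-word pairs, so scores are comparable across pairs with very different frequencies.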

PMI Examples

Topic                                          PMI    NPMI
cell hormone insulin muscle receptor           0.61   0.59
electron laser magnetic voltage wavelength     0.54   0.52
magnetic neutrino particle quantum universe    0.55   0.55
album band music release song                  0.37   0.56
college education school student university    0.38   0.57
city county district population town           0.34   0.52

Improvements

Topic Model Variants
• Use phrases or n-grams instead of words
• Learn hierarchical topics
• Non-parametric models
‣ #topics automatically learned
• Supervised models
‣ Take into account document labels

Topic Labelling
• Use Wikipedia article titles as labels
• Measure the distance between a label and the topic words based on document embeddings and word embeddings

vmware server virtual oracle update            → virtualisation
church archway building window gothic          → church architecture
investigation fbi official department federal  → criminal investigation
rate population prevalence study incidence     → mortality rate



A Final Word
• Topic model: an unsupervised model for learning latent concepts in a document collection
• LDA: a popular topic model
‣ Learning
‣ Hyper-parameters
• How to evaluate topic models?
‣ Topic coherence