Probabilistic Topic Models

Surveying a suite of algorithms that offer a solution to managing large document archives.

By David M. Blei

Communications of the ACM 55, 4 (April 2012), DOI:10.1145/2133806.2133826

As our collective knowledge continues to be
digitized and stored—in the form of news, blogs, Web
pages, scientific articles, books, images, sound, video,
and social networks—it becomes more difficult to
find and discover what we are looking for. We need
new computational tools to help organize, search, and
understand these vast amounts of information.

Right now, we work with online information using
two main tools—search and links. We type keywords
into a search engine and find a set of documents
related to them. We look at the documents in that
set, possibly navigating to other linked documents.
This is a powerful way of interacting with our online
archive, but something is missing.

Imagine searching and exploring documents
based on the themes that run through them. We might
“zoom in” and “zoom out” to find specific or broader
themes; we might look at how those themes changed
through time or how they are connected to each other.
Rather than finding documents through keyword
search alone, we might first find the theme that we
are interested in, and then examine the documents
related to that theme.

For example, consider using themes
to explore the complete history of the
New York Times. At a broad level, some
of the themes might correspond to
the sections of the newspaper—for-
eign policy, national affairs, sports.
We could zoom in on a theme of in-
terest, such as foreign policy, to reveal
various aspects of it—Chinese foreign
policy, the conflict in the Middle East,
the U.S.’s relationship with Russia. We
could then navigate through time to
reveal how these specific themes have
changed, tracking, for example, the
changes in the conflict in the Middle
East over the last 50 years. And, in all of
this exploration, we would be pointed
to the original articles relevant to the
themes. The thematic structure would
be a new kind of window through which
to explore and digest the collection.

But we do not interact with elec-
tronic archives in this way. While more
and more texts are available online, we
simply do not have the human power
to read and study them to provide the
kind of browsing experience described
above. To this end, machine learning
researchers have developed probabilis-
tic topic modeling, a suite of algorithms
that aim to discover and annotate large
archives of documents with thematic
information. Topic modeling algo-
rithms are statistical methods that ana-
lyze the words of the original texts to
discover the themes that run through
them, how those themes are connected
to each other, and how they change over time.

Key insights

˲ Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Topic models can organize the collection according to the discovered themes.

˲ Topic modeling algorithms can be applied to massive collections of documents. Recent advances in this field allow us to analyze streaming collections, like you might find from a Web API.

˲ Topic modeling algorithms can be adapted to many kinds of data. Among other applications, they have been used to find patterns in genetic data, images, and social networks.


(See, for example, Figure 3 for
topics found by analyzing the Yale Law
Journal.) Topic modeling algorithms
do not require any prior annotations or
labeling of the documents—the topics
emerge from the analysis of the origi-
nal texts. Topic modeling enables us
to organize and summarize electronic
archives at a scale that would be impos-
sible by human annotation.

Latent Dirichlet Allocation
We first describe the basic ideas behind
latent Dirichlet allocation (LDA), which
is the simplest topic model.8 The intu-
ition behind LDA is that documents
exhibit multiple topics. For example,
consider the article in Figure 1. This
article, entitled “Seeking Life’s Bare
(Genetic) Necessities,” is about using
data analysis to determine the number
of genes an organism needs to survive
(in an evolutionary sense).

By hand, we have highlighted differ-
ent words that are used in the article.
Words about data analysis, such as
“computer” and “prediction,” are high-
lighted in blue; words about evolutionary
biology, such as “life” and “organism,”
are highlighted in pink; words about
genetics, such as “sequenced” and

“genes,” are highlighted in yellow. If we
took the time to highlight every word in
the article, you would see that this arti-
cle blends genetics, data analysis, and
evolutionary biology in different pro-
portions. (We exclude words such as
"and," "but," or "if," which contain little
topical content.) Furthermore, know-
ing that this article blends those topics
would help you situate it in a collection
of scientific articles.

LDA is a statistical model of docu-
ment collections that tries to capture
this intuition. It is most easily described
by its generative process, the imaginary
random process by which the model
assumes the documents arose. (The
interpretation of LDA as a probabilistic
model is fleshed out later.)

We formally define a topic to be a
distribution over a fixed vocabulary. For
example, the genetics topic has words
about genetics with high probability
and the evolutionary biology topic has
words about evolutionary biology with
high probability. We assume that these
topics are specified before any data
has been generated.a

a Technically, the model assumes that the top-
ics are generated first, before the documents.

Now, for each document in the collection, we generate the words in a two-stage process:

1. Randomly choose a distribution over topics.
2. For each word in the document:
   a. Randomly choose a topic from the distribution over topics in step #1.
   b. Randomly choose a word from the corresponding distribution over the vocabulary.
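
To make the two-stage process concrete, here is a minimal sketch of it in Python using NumPy. The toy vocabulary, the number of topics, and the Dirichlet parameter are assumptions made for the example, not values from the article.

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["gene", "dna", "life", "organism", "data", "computer"]  # toy vocabulary
    K, V, doc_length = 3, len(vocab), 8     # number of topics, vocabulary size, words per document

    # Topics: each is a distribution over the vocabulary, assumed fixed in advance.
    topics = rng.dirichlet(np.ones(V), size=K)   # K x V matrix; each row sums to one

    def generate_document(alpha=0.5):
        # Step 1: randomly choose this document's distribution over topics.
        theta = rng.dirichlet(alpha * np.ones(K))
        words = []
        for _ in range(doc_length):
            # Step 2a: choose a topic assignment from the per-document proportions.
            z = rng.choice(K, p=theta)
            # Step 2b: choose a word from that topic's distribution over the vocabulary.
            w = rng.choice(V, p=topics[z])
            words.append(vocab[w])
        return theta, words

    theta, words = generate_document()
    print(np.round(theta, 2), words)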

This statistical model reflects the
intuition that documents exhibit mul-
tiple topics. Each document exhib-
its the topics in different proportion
(step #1); each word in each docu-
ment is drawn from one of the topics
(step #2b), where the selected topic is
chosen from the per-document distri-
bution over topics (step #2a).b

In the example article, the distri-
bution over topics would place prob-
ability on genetics, data analysis, and evolutionary biology, and each word is drawn from one of those three topics.

b We should explain the mysterious name, “latent
Dirichlet allocation.” The distribution that is
used to draw the per-document topic distribu-
tions in step #1 (the cartoon histogram in Figure
1) is called a Dirichlet distribution. In the genera-
tive process for LDA, the result of the Dirichlet
is used to allocate the words of the document to
different topics. Why latent? Keep reading.

Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative—they are not fit from real data. See Figure 2 for topics fit from data.

[Figure 1 illustration: four example topics shown as word lists (gene, dna, genetic; life, evolve, organism; brain, neuron, nerve; data, number, computer), the highlighted example document, and its topic proportions and per-word topic assignments.]


Notice that the next article in
the collection might be about data
analysis and neuroscience; its distri-
bution over topics would place prob-
ability on those two topics. This is
the distinguishing characteristic of
latent Dirichlet allocation—all the
documents in the collection share
the same set of topics, but each docu-
ment exhibits those topics in differ-
ent proportion.

As we described in the introduc-
tion, the goal of topic modeling is
to automatically discover the topics
from a collection of documents. The
documents themselves are observed,
while the topic structure—the topics,
per-document topic distributions,
and the per-document per-word topic
assignments—is hidden structure. The
central computational problem for
topic modeling is to use the observed
documents to infer the hidden topic
structure. This can be thought of as
“reversing” the generative process—
what is the hidden structure that likely
generated the observed collection?

Figure 2 illustrates example infer-
ence using the same example docu-
ment from Figure 1. Here, we took
17,000 articles from Science magazine
and used a topic modeling algorithm to
infer the hidden topic structure. (The

algorithm assumed that there were 100
topics.) We then computed the inferred
topic distribution for the example
article (Figure 2, left), the distribution
over topics that best describes its par-
ticular collection of words. Notice that
this topic distribution, though it can
use any of the topics, has only “acti-
vated” a handful of them. Further, we
can examine the most probable terms
from each of the most probable topics
(Figure 2, right). On examination, we
see that these terms are recognizable
as terms about genetics, survival, and
data analysis, the topics that are com-
bined in the example article.

We emphasize that the algorithms
have no information about these sub-
jects and the articles are not labeled
with topics or keywords. The inter-
pretable topic distributions arise by
computing the hidden structure that
likely generated the observed col-
lection of documents.c For example,
Figure 3 illustrates topics discovered
from the Yale Law Journal. (Here the num-
ber of topics was set to be 20.) Topics

c Indeed calling these models “topic models”
is retrospective—the topics that emerge from
the inference algorithm are interpretable for
almost any collection that is analyzed. The fact
that these look like topics has to do with the
statistical structure of observed language and
how it interacts with the specific probabilistic
assumptions of LDA.

about subjects like genetics and data
analysis are replaced by topics about
discrimination and contract law.

The utility of topic models stems
from the property that the inferred hid-
den structure resembles the thematic
structure of the collection. This inter-
pretable hidden structure annotates
each document in the collection—a
task that is painstaking to perform
by hand—and these annotations can
be used to aid tasks like information
retrieval, classification, and corpus
exploration.d In this way, topic model-
ing provides an algorithmic solution to
managing, organizing, and annotating
large archives of texts.

LDA and probabilistic models. LDA
and other topic models are part of the
larger field of probabilistic modeling.
In generative probabilistic modeling,
we treat our data as arising from a
generative process that includes hid-
den variables. This generative process
defines a joint probability distribution
over both the observed and hidden
random variables. We perform data
analysis by using that joint distribu-
tion to compute the conditional distri-
bution of the hidden variables given the observed variables. This conditional distribution is also called the posterior distribution.

d See, for example, the browser of Wikipedia
built with a topic model at http://www.sccs.
swarthmore.edu/users/08/ajb/tmve/wiki100k/
browse/topic-list.html.

Figure 2. Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article.

[Figure 2 illustration: the top words of four topics inferred for the example article—labeled "Genetics," "Evolution," "Disease," and "Computers"—and a bar plot of the article's topic proportions over the 100 topics.]



LDA falls precisely into this frame-
work. The observed variables are the
words of the documents; the hidden
variables are the topic structure; and
the generative process is as described
here. The computational problem of
inferring the hidden topic structure
from the documents is the problem of
computing the posterior distribution,
the conditional distribution of the hid-
den variables given the documents.

We can describe LDA more formally
with the following notation. The topics
are β_{1:K}, where each β_k is a distribution over the vocabulary (the distributions over words at left in Figure 1). The topic proportions for the dth document are θ_d, where θ_{d,k} is the topic proportion for topic k in document d (the cartoon histogram in Figure 1). The topic assignments for the dth document are z_d, where z_{d,n} is the topic assignment for the nth word in document d (the colored coin in Figure 1). Finally, the observed words for document d are w_d, where w_{d,n} is the nth word in document d, which is an element from the fixed vocabulary.

With this notation, the generative
process for LDA corresponds to the fol-
lowing joint distribution of the hidden
and observed variables,

p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right).    (1)

Notice that this distribution specifies a
number of dependencies. For example,
the topic assignment z_{d,n} depends on the per-document topic proportions θ_d. As another example, the observed word w_{d,n} depends on the topic assignment z_{d,n} and all of the topics β_{1:K}. (Operationally, that term is defined by looking up which topic z_{d,n} refers to and looking up the probability of the word w_{d,n} within that topic.)
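
In symbols, this lookup is simply an entry of the chosen topic's distribution over the vocabulary; one way to write it in the notation above is

p(w_{d,n} \mid z_{d,n} = k, \beta_{1:K}) = \beta_{k, w_{d,n}}.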

These dependencies define LDA.
They are encoded in the statistical
assumptions behind the generative
process, in the particular mathemati-
cal form of the joint distribution, and—
in a third way—in the probabilistic
graphical model for LDA. Probabilistic
graphical models provide a graphical

language for describing families of
probability distributions.e The graphi-
cal model for LDA is in Figure 4. These
three representations are equivalent
ways of describing the probabilistic
assumptions behind LDA.

In the next section, we describe
the inference algorithms for LDA.
However, we first pause to describe the
short history of these ideas. LDA was
developed to fix an issue with a previ-
ously developed probabilistic model,
probabilistic latent semantic analysis
(pLSI).21 That model was itself a prob-
abilistic version of the seminal work
on latent semantic analysis,14 which
revealed the utility of the singular value
decomposition of the document-term
matrix. From this matrix factorization
perspective, LDA can also be seen as a
type of principal component analysis
for discrete data.11, 12

Posterior computation for LDA. We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned, this is called the posterior.)

e The field of graphical models is actually more
than a language for describing families of
distributions. It is a field that illuminates the
deep mathematical links between probabi-
listic independence, graph theory, and algo-
rithms for computing with probability distri-
butions.35

Figure 3. A topic model fit to the Yale Law Journal. Here, there are 20 topics (the top eight are plotted). Each topic is illustrated with its topmost frequent words. Each word's position along the x-axis denotes its specificity to the documents. For example, "estate" in the first topic is more specific than "tax."

[Figure 3 illustration: word lists for eight of the 20 topics (numbered 4, 6, 10, 15, 3, 1, 13, and 16), covering themes such as taxation, criminal trials, labor, free speech, family and discrimination, corporate finance, contracts, and constitutional law.]


Using our notation, the posterior is

p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}.    (2)

The numerator is the joint distribution
of all the random variables, which can
be easily computed for any setting of
the hidden variables. The denomina-
tor is the marginal probability of the
observations, which is the probability
of seeing the observed corpus under
any topic model. In theory, it can be
computed by summing the joint distri-
bution over every possible instantiation
of the hidden topic structure.

That number of possible topic
structures, however, is exponentially
large; this sum is intractable to com-
pute.f As for many modern probabilis-
tic models of interest—and for much
of modern Bayesian statistics—we
cannot compute the posterior because
of the denominator, which is known
as the evidence. A central research
goal of modern probabilistic model-
ing is to develop efficient methods
for approximating it. Topic modeling
algorithms—like the algorithms used
to create Figures 1 and 3—are often
adaptations of general-purpose meth-
ods for approximating the posterior
distribution.
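
To make the size of this computation concrete, the evidence can be written out as a sketch in the notation above, with the same priors p(β_i) and p(θ_d) as in Equation 1:

p(w_{1:D}) = \int \! \int \sum_{z_{1:D}} \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \; d\theta_{1:D} \, d\beta_{1:K}.

The sum ranges over every way of assigning each observed word in the collection to one of the K topics, which is the exponentially large number of topic structures referred to above.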

Topic modeling algorithms form
an approximation of Equation 2 by
adapting an alternative distribution
over the latent topic structure to be
close to the true posterior. Topic mod-
eling algorithms generally fall into
two categories—sampling-based algo-
rithms and variational algorithms.

Sampling-based algorithms
attempt to collect samples from the
posterior to approximate it with an
empirical distribution. The most
commonly used sampling algorithm
for topic modeling is Gibbs sampling,
where we construct a Markov chain—
a sequence of random variables, each
dependent on the previous—whose limiting distribution is the posterior.

f More technically, the sum is over all possible ways of assigning each observed word of the collection to one of the topics. Document collections usually contain observed words at least on the order of millions.

The Markov chain is defined on the
hidden topic variables for a particular
corpus, and the algorithm is to run the
chain for a long time, collect samples

from the limiting distribution, and
then approximate the distribution
with the collected samples. (Often, just
one sample is collected as an approximation of the topic structure with maximal probability.)

Figure 4. The graphical model for latent Dirichlet allocation. Each node is a random variable and is labeled according to its role in the generative process (see Figure 1). The hidden nodes—the topic proportions, assignments, and topics—are unshaded. The observed nodes—the words of the documents—are shaded. The rectangles are "plate" notation, which denotes replication. The N plate denotes the collection of words within documents; the D plate denotes the collection of documents within the collection.

[Figure 4 diagram: the topics β_k, per-document topic proportions θ_d, topic assignments z_{d,n}, and observed words w_{d,n}, together with their Dirichlet parameters, enclosed in the N, D, and K plates.]

Figure 5. Two topics from a dynamic topic model. This model was fit to Science from 1880 to 2002. We have illustrated the top words at each decade.

[Figure 5 illustration: for each of the two topics, the five most probable words at each decade from 1880 to 2000 (one topic moving from words like "molecules," "atoms," and "energy" toward "quantum," "electron," and "states"; the other from "france," "england," and "country" toward "nuclear," "soviet," and "united states"), along with plots of each topic's proportion of Science over time, annotated with representative articles such as "Alchemy" (1891), "Mass and Energy" (1907), "The Wave Properties of Electrons" (1930), "Nuclear Fission" (1940), "Structure of the Proton" (1974), "The Z Boson" (1990), "Quantum Criticality: Competing Ground States in Low Dimensions" (2000), "Speed of Railway Trains in Europe" (1889), "Farming and Food Supplies in Time of War" (1915), "The Atom and Humanity" (1945), "Science in the USSR" (1957), "The Costs of the Soviet Empire" (1985), and "Post-Cold War Nuclear Dangers" (1995).]


See Steyvers and
Griffiths33 for a good description of
Gibbs sampling for LDA, and see http://
CRAN.R-project.org/package=lda for a
fast open-source implementation.
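
To make the sampling idea concrete, below is a minimal sketch of a collapsed Gibbs sampler for LDA in Python. It is not the implementation cited above; the symmetric hyperparameters alpha and eta, the toy corpus, and the number of iterations are assumptions made for the example.

    import numpy as np

    def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
        """Minimal collapsed Gibbs sampler for LDA on a toy corpus.
        docs is a list of documents, each a list of word ids in [0, V)."""
        rng = np.random.default_rng(seed)
        z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments
        n_dk = np.zeros((len(docs), K))                        # document-topic counts
        n_kw = np.zeros((K, V))                                # topic-word counts
        n_k = np.zeros(K)                                      # total words per topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]
                    # Remove this token's current assignment from the counts.
                    n_dk[d, k] -= 1
                    n_kw[k, w] -= 1
                    n_k[k] -= 1
                    # Conditional distribution of this token's topic given all other
                    # assignments, proportional to
                    # (doc-topic count + alpha) * (topic-word count + eta) / (topic total + V * eta).
                    p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][i] = k
                    n_dk[d, k] += 1
                    n_kw[k, w] += 1
                    n_k[k] += 1
        return n_dk, n_kw

    # Toy usage: three short documents over a five-word vocabulary, two topics.
    docs = [[0, 1, 0, 2], [3, 4, 3, 3], [0, 2, 4, 3]]
    doc_topic_counts, topic_word_counts = gibbs_lda(docs, K=2, V=5)
    print(doc_topic_counts)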

Variational methods are a deter-
ministic alternative to sampling-based
algorithms.22,35 Rather than approxi-
mating the posterior with samples,
variational methods posit a param-
eterized family of distributions over
the hidden structure and then find the
member of that family that is closest
to the posterior.g Thus, the inference
problem is transformed to an opti-
mization problem. Variational meth-
ods open the door for innovations in
optimization to have practical impact
in probabilistic modeling. See Blei
et al.8 for a coordinate ascent varia-
tional inference algorithm for LDA;
see Hoffman et al.20 for a much faster
online algorithm (and open-source
software) that easily handles millions
of documents and can accommodate
streaming collections of text.
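
As one illustration of this style of algorithm (not one of the packages cited above), scikit-learn's LatentDirichletAllocation implements variational inference for LDA, including an online variant; the toy documents and parameter settings below are assumptions made for the example.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "gene dna genetic sequencing genome",
        "data computer prediction model analysis",
        "life organism evolve species survive",
        "genome gene data analysis computer",
    ]

    # Bag-of-words counts: one row per document, one column per vocabulary term.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    # Fit a small LDA model with online variational inference.
    lda = LatentDirichletAllocation(n_components=3, learning_method="online",
                                    random_state=0)
    doc_topics = lda.fit_transform(X)        # per-document topic proportions

    # Show each topic's most probable words.
    vocab = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top_words = [vocab[i] for i in topic.argsort()[::-1][:5]]
        print("topic", k, ":", top_words)
    print(doc_topics.round(2))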

Loosely speaking, both types of
algorithms perform a search over the
topic structure. A collection of docu-
ments (the observed random variables
in the model) are held fixed and serve
as a guide toward where to search.
Which approach is better depends
on the particular topic model being
used—we have so far focused on LDA,
but see below for other topic models—
and is a source of academic debate. For
a good discussion of the merits and
drawbacks of both, see Asuncion et al.1

Research in Topic Modeling
The simple LDA model provides a pow-
erful tool for discovering and exploit-
ing the hidden thematic structure in
large archives of text. However, one of
the main advantages of formulating
LDA as a probabilistic model is that it
can easily be used as a module in more
complicated models for more com-
plicated goals. Since its introduction,
LDA has been extended and adapted
in many ways.

Relaxing the assumptions of
LDA. LDA is defined by the statisti-
cal assumptions it makes about the

g Closeness is measured with Kullback–Leibler
divergence, an information theoretic measure-
ment of the distance between two probability
distributions.

corpus. One active area of topic model-
ing research is how to relax and extend
these assumptions to uncover more
sophisticated structure in the texts.

One assumption that LDA makes is
the “bag of words” assumption, that
the order of the words in the document
does not matter. (To see this, note that
the joint distribution of Equation 1
remains invariant to permutation of
the words of the documents.) While
this assumption is unrealistic, it is rea-
sonable if our only goal is to uncover
the coarse semantic structure of the
texts.h For more sophisticated goals—
such as language generation—it is
patently not appropriate. There have
been a number of extensions to LDA
that model words nonexchangeably.
For example, Wallach36 developed a
topic model that relaxes the bag of
words assumption by assuming that
the topics generate words conditional
on the previous word; Griffiths et al.18
developed a topic model that switches
between LDA and a standard HMM.
These models expand the parameter
space significantly but show improved
language modeling performance.

Another assumption is that the
order of documents does not matter.
Again, this can be seen by noticing
that Equation 1 remains invariant
to permutations of the ordering of
documents in the collection. This
assumption may be unrealistic when
analyzing long-running collections
that span years or centuries. In such
collections, we may want to assume
that the topics change over time.
One approach to this problem is the
dynamic topic model5—a model that
respects the ordering of the docu-
ments and gives a richer posterior
topical structure than LDA. Figure 5
shows a topic that results from analyz-
ing all of Science magazine under the
dynamic topic model. Rather than a
single distribution over words, a topic
is now a sequence of distributions
over words. We can find an underlying
theme of the collection and track how
it has changed over time.

A third assumption about LDA is
that the number of topics is assumed known and fixed.

h As a thought experiment, imagine shuffling
the words of the article in Figure 1. Even when
shuffled, you would be able to glean that the
article has something to do with genetics.


The Bayesian non-
parametric topic model34 provides an
elegant solution: the number of topics
is determined by the collection during
posterior inference, and furthermore,
new documents can exhibit previously
unseen topics. Bayesian nonparamet-
ric topic models have been extended to
hierarchies of topics, which find a tree
of topics, moving from more general to
more concrete, whose particular struc-
ture is inferred from the data.3

There are still other extensions of
LDA that relax various assumptions
made by the model. The correlated
topic model6 and pachinko alloca-
tion machine24 allow the occurrence
of topics to exhibit correlation (for
example, a document about geology
is more likely to also be about chem-
istry than it is to be about sports); the
spherical topic model28 allows words
to be unlikely in a topic (for example,
“wrench” will be particularly unlikely
in a topic about cats); sparse topic
models enforce further structure in
the topic distributions;37 and “bursty”
topic models provide a more realistic
model of word counts.15

Incorporating metadata. In many
text analysis settings, the documents
contain additional information—
such as author, title, geographic loca-
tion, links, and others—that we might
want to account for when fitting a
topic model. There has been a flurry of
research on adapting topic models to
include metadata.

The author-topic model29 is an early
success story for this kind of research.
The topic proportions are attached to
authors; papers with multiple authors
are assumed to attach each word to
an author, drawn from a topic drawn
from his or her topic proportions. The
author-topic model allows for infer-
ences about authors as well as docu-
ments. Rosen-Zvi et al. show examples
of author similarity based on their
topic proportions—such computa-
tions are not possible with LDA.

Many document collections are
linked—for example, scientific papers
are linked by citation or Web pages are
linked by hyperlink—and several topic
models have been developed to account
for those links when estimating the top-
ics. The relational topic model of Chang
and Blei13 assumes that each document
is modeled as in LDA and that the links

between documents depend on the dis-
tance between their topic proportions.
This is both a new topic model and a
new network model. Unlike traditional
statistical models of networks, the rela-
tional topic model takes into account
node attributes (here, the words of the
documents) in modeling the links.

Other work that incorporates meta-
data into topic models includes mod-
els of linguistic structure,10 models that
account for distances between corpora,38
and models of named entities.26 General-purpose methods for incorporating
metadata into topic models include
Dirichlet-multinomial regression mod-
els25 and supervised topic models.7

Other kinds of data. In LDA, the top-
ics are distributions over words and this
discrete distribution generates observa-
tions (words in documents). One advan-
tage of LDA is that these choices for the
topic parameter and data-generating
distribution can be adapted to other
kinds of observations with only small
changes to the corresponding inference
algorithms. As a class of models, LDA
can be thought of as a mixed-membership
model of grouped data—rather than
associating each group of observa-
tions (document) with one component
(topic), each group exhibits multiple
components in different proportions.
LDA-like models have been adapted
to many kinds of data, including sur-
vey data, user preferences, audio and
music, computer code, network logs,
and social networks. We describe two
areas where mixed-membership mod-
els have been particularly successful.

In population genetics, the same
probabilistic model was independently
invented to find ancestral populations
(for example, originating from Africa,
Europe, the Middle East, among others)
in the genetic ancestry of a sample of
individuals.27 The idea is that each indi-
vidual’s genotype descends from one or
more of the ancestral populations. Using
a model much like LDA, biologists can
both characterize the genetic patterns
in those populations (the “topics”) and
identify how each individual expresses
them (the “topic proportions”). This
model is powerful because the genetic
patterns in ancestral populations can be
hypothesized, even when “pure” sam-
ples from them are not available.

LDA has been widely used and
adapted in computer vision, where the

inference algorithms are applied to
natural images in the service of image
retrieval, classification, and organiza-
tion. Computer vision researchers have
made a direct analogy from images to
documents. In document analysis, we
assume that documents exhibit mul-
tiple topics and the collection of docu-
ments exhibits the same set of topics.
In image analysis, we assume that each
image exhibits a combination of visual
patterns and that the same visual pat-
terns recur throughout a collection of
images. (In a preprocessing step, the
images are analyzed to form collec-
tions of “visual words.”) Topic model-
ing for computer vision has been used
to classify images,16 connect images
and captions,4 build image hierar-
chies,2,23,31 and other applications.

Future Directions
Topic modeling is an emerging field in
machine learning, and there are many
exciting new directions for research.

Evaluation and model checking.
There is a disconnect between how
topic models are evaluated and why
we expect topic models to be useful.
Typically, topic models are evaluated in
the following way. First, hold out a sub-
set of your corpus as the test set. Then,
fit a variety of topic models to the rest of
the corpus and approximate a measure
of model fit (for example, probability)
for each trained model on the test set.
Finally, choose the model that achieves
the best held-out performance.
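
A sketch of that typical procedure, assuming scikit-learn and a toy stand-in for a real corpus (both assumptions made for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.model_selection import train_test_split

    corpus = [  # a toy stand-in for a real document collection
        "gene dna genetic genome sequencing",
        "data computer model prediction analysis",
        "life organism evolve species survive",
        "genome gene sequencing dna data",
        "computer data analysis software model",
        "evolve life species organism origin",
        "dna gene genome genetic data",
        "model computer prediction data software",
    ]

    # Hold out a test set, fit candidate models on the rest, and compare
    # an approximate held-out likelihood for each.
    X = CountVectorizer().fit_transform(corpus)
    X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

    best_k, best_score = None, None
    for k in [2, 5, 10]:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
        score = lda.score(X_test)   # approximate held-out log likelihood (higher is better)
        if best_score is None or score > best_score:
            best_k, best_score = k, score
    print("number of topics with best held-out likelihood:", best_k)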

But topic models are often used to
organize, summarize, and help users
explore large corpora, and there is no
technical reason to suppose that held-
out accuracy corresponds to better
organization or easier interpretation.
One open direction for topic modeling
is to develop evaluation methods that
match how the algorithms are used.
How can we compare topic models
based on how interpretable they are?

This is the model checking problem.
When confronted with a new corpus
and a new task, which topic model
should I use? How can I decide which
of the many modeling assumptions are
important for my goals? How should I
move between the many kinds of topic
models that have been developed?
These questions have been given some
attention by statisticians,9,30 but they
have been scrutinized less for the scale of problems that machine learning tackles. New computational answers to these questions would be a significant contribution to topic modeling.

Visualization and user interfaces. Another promising future direction for topic modeling is to develop new methods of interacting with and visualizing topics and corpora. Topic models provide new exploratory structure in large collections—how can we best exploit that structure to aid in discovery and exploration?

One problem is how to display the topics. Typically, we display topics by listing the most frequent words of each (see Figure 2), but new ways of labeling the topics—by either choosing different words or displaying the chosen words differently—may be more effective. A further problem is how to best display a document with a topic model. At the document level, topic models provide potentially useful information about the structure of the document. Combined with effective topic labels, this structure could help readers identify the most interesting parts of the document. Moreover, the hidden topic proportions implicitly connect each document to the other documents (by considering a distance measure between topic proportions). How can we best display these connections? What is an effective interface to the whole corpus and its inferred topic structure?

These are user interface questions, and they are essential to topic modeling. Topic modeling algorithms show much promise for uncovering meaningful thematic structure in large collections of documents. But making this structure useful requires careful attention to information visualization and the corresponding user interfaces.

Topic models for data discovery. Topic models have been developed with information engineering applications in mind. As a statistical model, however, topic models should be able to tell us something, or help us form a hypothesis, about the data. What can we learn about the language (and other data) based on the topic model posterior? Some work in this area has appeared in political science,19 bibliometrics,17 and psychology.32 This kind of research adapts topic models to measure an external variable of interest, a difficult task for unsupervised learning that must be carefully validated.

In general, this problem is best
addressed by teaming computer sci-
entists with other scholars to use topic
models to help explore, visualize, and
draw hypotheses from their data. In
addition to scientific applications, such
as genetics and neuroscience, one can
imagine topic models coming to the
service of history, sociology, linguistics,
political science, legal studies, compar-
ative literature, and other fields, where
texts are a primary object of study. By
working with scholars in diverse fields,
we can begin to develop a new interdis-
ciplinary computational methodology
for working with and drawing conclu-
sions from archives of texts.

Summary
We have surveyed probabilistic topic
models, a suite of algorithms that
provide a statistical solution to the
problem of managing large archives
of documents. With recent scientific
advances in support of unsupervised
machine learning—flexible compo-
nents for modeling, scalable algo-
rithms for posterior inference, and
increased access to massive datasets—
topic models promise to be an impor-
tant component for summarizing and
understanding our growing digitized
archive of information.

References
1. Asuncion, A., Welling, M., Smyth, P., Teh, Y. On smoothing and inference for topic models. In Uncertainty in Artificial Intelligence (2009).
2. Bart, E., Welling, M., Perona, P. Unsupervised organization of image collections: Taxonomies and beyond. Trans. Pattern Recognit. Mach. Intell. 33, 11 (2010), 2301–2315.
3. Blei, D., Griffiths, T., Jordan, M. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 2 (2010), 1–30.
4. Blei, D., Jordan, M. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2003), ACM Press, 127–134.
5. Blei, D., Lafferty, J. Dynamic topic models. In International Conference on Machine Learning (2006), ACM, New York, NY, USA, 113–120.
6. Blei, D., Lafferty, J. A correlated topic model of Science. Ann. Appl. Stat. 1, 1 (2007), 17–35.
7. Blei, D., McAuliffe, J. Supervised topic models. In Neural Information Processing Systems (2007).
8. Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (January 2003), 993–1022.
9. Box, G. Sampling and Bayes' inference in scientific modeling and robustness. J. Roy. Stat. Soc. 143, 4 (1980), 383–430.
10. Boyd-Graber, J., Blei, D. Syntactic topic models. In Neural Information Processing Systems (2009).
11. Buntine, W. Variational extensions to EM and multinomial PCA. In European Conference on Machine Learning (2002).
12. Buntine, W., Jakulin, A. Discrete component analysis. Subspace, Latent Structure and Feature Selection. C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor, eds. Springer, 2006.

13. Chang, J., Blei, D. Hierarchical relational models for document networks. Ann. Appl. Stat. 4, 1 (2010).
14. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41, 6 (1990), 391–407.
15. Doyle, G., Elkan, C. Accounting for burstiness in topic models. In International Conference on Machine Learning (2009), ACM, 281–288.
16. Fei-Fei, L., Perona, P. A Bayesian hierarchical model for learning natural scene categories. In IEEE Computer Vision and Pattern Recognition (2005), 524–531.
17. Gerrish, S., Blei, D. A language-based approach to measuring scholarly impact. In International Conference on Machine Learning (2010).
18. Griffiths, T., Steyvers, M., Blei, D., Tenenbaum, J. Integrating topics and syntax. Advances in Neural Information Processing Systems 17. L.K. Saul, Y. Weiss, and L. Bottou, eds. MIT Press, Cambridge, MA, 2005, 537–544.
19. Grimmer, J. A Bayesian hierarchical topic model for political texts: Measuring expressed agendas in Senate press releases. Polit. Anal. 18, 1 (2010), 1.
20. Hoffman, M., Blei, D., Bach, F. On-line learning for latent Dirichlet allocation. In Neural Information Processing Systems (2010).
21. Hofmann, T. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence (UAI) (1999).
22. Jordan, M., Ghahramani, Z., Jaakkola, T., Saul, L. Introduction to variational methods for graphical models. Mach. Learn. 37 (1999), 183–233.
23. Li, J., Wang, C., Lim, Y., Blei, D., Fei-Fei, L. Building and using a semantivisual image hierarchy. In Computer Vision and Pattern Recognition (2010).
24. Li, W., McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In International Conference on Machine Learning (2006), 577–584.
25. Mimno, D., McCallum, A. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Uncertainty in Artificial Intelligence (2008).
26. Newman, D., Chemudugunta, C., Smyth, P. Statistical entity-topic models. In Knowledge Discovery and Data Mining (2006).
27. Pritchard, J., Stephens, M., Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155 (June 2000), 945–959.
28. Reisinger, J., Waters, A., Silverthorn, B., Mooney, R. Spherical topic models. In International Conference on Machine Learning (2010).
29. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (2004), AUAI Press, 487–494.
30. Rubin, D. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Stat. 12, 4 (1984), 1151–1172.
31. Sivic, J., Russell, B., Zisserman, A., Freeman, W., Efros, A. Unsupervised discovery of visual object class hierarchies. In Conference on Computer Vision and Pattern Recognition (2008).
32. Socher, R., Gershman, S., Perotte, A., Sederberg, P., Blei, D., Norman, K. A Bayesian analysis of dynamics in free recall. In Advances in Neural Information Processing Systems 22. Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, and A. Culotta, eds., 2009.
33. Steyvers, M., Griffiths, T. Probabilistic topic models. Latent Semantic Analysis: A Road to Meaning. T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, eds. Lawrence Erlbaum, 2006.
34. Teh, Y., Jordan, M., Beal, M., Blei, D. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 476 (2006), 1566–1581.
35. Wainwright, M., Jordan, M. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1, 1–2 (2008), 1–305.
36. Wallach, H. Topic modeling: Beyond bag of words. In Proceedings of the 23rd International Conference on Machine Learning (2006).
37. Wang, C., Blei, D. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. Advances in Neural Information Processing Systems 22. Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, and A. Culotta, eds., 2009, 1982–1989.
38. Wang, C., Thiesson, B., Meek, C., Blei, D. Markov topic models. In Artificial Intelligence and Statistics (2009).

David M. Blei (blei@cs.princeton.edu) is an associate professor in the computer science department of Princeton University, Princeton, NJ.

© 2012 ACM 0001-0782/12/04 $10.00