10/18/21, 7:45 PM GitHub – CS410Assignments/MP3: Implementing the PLSA Algorithm
https://github.com/CS410Assignments/MP3 1/4
CS410Assignments /MP3 Public
Implementing the PLSA Algorithm
0 stars 0 forks
Code Issues Pull requests Actions Projects Wiki Security
View code
MP3: Implementing the PLSA Algorithm
DUE: OCT 24, 2021 at 11�59pm
In this MP, you will implement the PLSA algorithm discussed in lectures 9.7 and 9.8 of
the Text Mining Coursera course. You are not required to implement the PLSA algorithm
with a background model (we will run tests assuming the background model has not
been implemented). You are provided with two data sets in the data folder: test.txt
and dblp.txt . You can assume that each line is a separate document.
The test.txt data set contains 1000 lines. Each line is a document. The first 100
documents are labeled with the topic the document belongs to, either 0 (“Seattle”) or 1
(“Chicago”). Each of the first 100 document is generated by sampling completely from
the topic that is labelled (i.e., generated from one topic only). The rest 900 documents
are generated from a mixture model of the topic “Seattle” and “Chicago”. Run your code
to test if your EM algorithm returns reasonable results.
The DBLP.txt data set is made up of research paper abstracts from DBLP. Each line is a
document. Make sure to test your code on the simpler data set test.txt before
running it on DBLP.txt .
Star Notifications
main Go to file
machilusZ Update README.md … on Aug 29 5
README.md
https://github.com/CS410Assignments
https://github.com/CS410Assignments/MP3
https://github.com/CS410Assignments/MP3/stargazers
https://github.com/CS410Assignments/MP3/network/members
https://github.com/CS410Assignments/MP3
https://github.com/CS410Assignments/MP3/issues
https://github.com/CS410Assignments/MP3/pulls
https://github.com/CS410Assignments/MP3/actions
https://github.com/CS410Assignments/MP3/projects
https://github.com/CS410Assignments/MP3/wiki
https://github.com/CS410Assignments/MP3/security
https://www.coursera.org/learn/cs-410/lecture/HKe8K/9-7-probabilistic-latent-semantic-analysis-plsa-part-1
https://www.coursera.org/learn/cs-410/lecture/GJyGG/9-8-probabilistic-latent-semantic-analysis-plsa-part-2
https://github.com/login?return_to=%2FCS410Assignments%2FMP3
https://github.com/login?return_to=%2FCS410Assignments%2FMP3
https://github.com/CS410Assignments/MP3/find/main
https://github.com/CS410Assignments/MP3/commits?author=machilusZ
https://github.com/CS410Assignments/MP3/commit/3d852c59cceb4140f6f94e3e41bea6d7b10d4e55
https://github.com/CS410Assignments/MP3/commit/3d852c59cceb4140f6f94e3e41bea6d7b10d4e55
https://github.com/CS410Assignments/MP3/commits/main
https://github.com/machilusZ
10/18/21, 7:45 PM GitHub – CS410Assignments/MP3: Implementing the PLSA Algorithm
https://github.com/CS410Assignments/MP3 2/4
You are provided with a code skeleton plsa.py . Do not change anything in the def
__init__() function. But feel free to change anything in the main() function to test
your code.
Some suggested tips:
�. Using matrices is strongly encouraged (writing nested for-loops for calculation is
painful)
�. Python libraries numpy and scipy are useful in matrix based calculations
Writing the PLSA Algorithm:
The original PLSA model does not contain a background model. This MP also is based on
the original PLSA model, you do not have to worry about the background model.
However, lectures are all about PLSA with a background model, so you should not
attempt to map the formulas on lecture slides directly to the code. Instead, you would
need to adjust the formulas on slides by assuming that there is zero probability that the
background model would be chosen. That is, you should set λ to zero. If you do this,
you will see that the E-step would essentially only compute a distribution over k topics
for z=1, …, k, given each word, i.e., p(z=i|w). The M-step would also be simpler as
p(Z=B|w) is also zero for all words (due to the fact that λ =0). If you are still confused,
please take a look at Section 3 of Chase Geigle’s note on EM [2] and the note below.
The main data structures involved in the implementation of this EM algorithm are three
matrices:
�. T (topics by words): this is the set of parameters characterizing topic content that
we denoted by θ ‘s. Each element is the probability of a particular word in a
particular topic.
�. D (documents by topics): this is the set of parameters modeling the coverage of
topics in each document, which we denoted by p ‘s. Each element is the probability
of a particular topic is covered in a particular document.
B
B
i
ij
10/18/21, 7:45 PM GitHub – CS410Assignments/MP3: Implementing the PLSA Algorithm
https://github.com/CS410Assignments/MP3 3/4
�. Z (hidden variables): For every document, we need one Z which represents the
probability that each word in the document has been generated from a particular
topic, so for any document, this is a “word-by-topic” matrix, encoding p(Z|w) for a
particular document. Z is the matrix that we compute in the E-step (based on
matrices T and D, which represent our parameters). Note that we need to compute a
different Z for each document, so we need to allocate a matrix Z for every
document. If we do so, the M-step is simply to use all these Z matrices together with
word counts in each document to re-estimate all the parameters, i.e., updating
matrices T and D based on Z. Thus at a high level, this is what’s happening in the
algorithm:
T and D are initialized.
E-step computes all Z’s based on T and D.
M-step uses all Z’s to update T and D.
We iterate until the likelihood doesn’t change much when we would use T and D
as our output. Note that Zs are also very useful (can you imagine some
applications of Zs?).
Resources:
[1] Cheng s̓ note on the EM algorithm
[2] Chase Geigle s̓ note on the EM algorithm, which includes a derivation of the EM
algorithm (see section 4), and
[3] Qiaozhu Mei s̓ note on the EM algorithm for PLSA, which includes a different
derivation of the EM algorithm.
THIS MP IS CODING AND DEBUGGING HEAVY! PLEASE DON’T LEAVE IT TILL THE LAST
MINUTE!
Releases
No releases published
Packages
No packages published
Contributors 4
Bh Bh
http://sifaka.cs.uiuc.edu/czhai/pub/em-note.pdf
http://times.cs.uiuc.edu/course/598f16/notes/em-algorithm.pdf
http://times.cs.uiuc.edu/course/598f16/plsa-note.pdf
https://github.com/CS410Assignments/MP3/releases
https://github.com/orgs/CS410Assignments/packages?repo_name=MP3
https://github.com/CS410Assignments/MP3/graphs/contributors
https://github.com/Bhaavya
https://github.com/Bhaavya
10/18/21, 7:45 PM GitHub – CS410Assignments/MP3: Implementing the PLSA Algorithm
https://github.com/CS410Assignments/MP3 4/4
Bhaavya Bhavya
assmaboughoula
richameher Richa M
machilusZ Yunan Zhang
Languages
Python 100.0%
https://github.com/Bhaavya
https://github.com/Bhaavya
https://github.com/assmaboughoula
https://github.com/assmaboughoula
https://github.com/richameher
https://github.com/richameher
https://github.com/machilusZ
https://github.com/machilusZ
https://github.com/CS410Assignments/MP3/search?l=python