MP3: Implementing the PLSA Algorithm

DUE: OCT 24, 2021 at 11:59pm

In this MP, you will implement the PLSA algorithm discussed in lectures [9.7](https://www.coursera.org/learn/cs-410/lecture/HKe8K/9-7-probabilistic-latent-semantic-analysis-plsa-part-1) and [9.8](https://www.coursera.org/learn/cs-410/lecture/GJyGG/9-8-probabilistic-latent-semantic-analysis-plsa-part-2) of the Text Mining Coursera course. You are not required to implement the PLSA algorithm with a background model (we will run tests assuming the background model has not been implemented). You are provided with two data sets in the `data` folder: `test.txt` and `dblp.txt`. You can assume that each line is a separate document.

The `test.txt` data set contains 1000 lines, each of which is a document. The first 100 documents are labeled with the topic the document belongs to, either 0 ("Seattle") or 1 ("Chicago"). Each of the first 100 documents is generated by sampling entirely from its labeled topic (i.e., generated from one topic only). The remaining 900 documents are generated from a mixture model of the topics "Seattle" and "Chicago". Run your code on this data set to test whether your EM algorithm returns reasonable results.

The `dblp.txt` data set is made up of research paper abstracts from DBLP, one document per line. Make sure to test your code on the simpler data set `test.txt` before running it on `dblp.txt`.
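For illustration, loading such a line-per-document corpus might look like the sketch below; the whitespace tokenization and the `data/test.txt` path are assumptions for this example, not requirements of the skeleton.

```python
# Minimal corpus loader: one document per line, whitespace tokenization.
# The path and tokenization here are illustrative assumptions.
def load_corpus(path):
    with open(path) as f:
        return [line.strip().split() for line in f if line.strip()]

documents = load_corpus("data/test.txt")
print(len(documents))  # expect 1000 for test.txt
```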


You are provided with a code skeleton, `plsa.py`. Do not change anything in the `__init__()` function, but feel free to change anything in the `main()` function to test your code.

Some suggested tips:

1. Using matrices is strongly encouraged (writing nested for-loops for these calculations is painful).

2. The Python libraries numpy and scipy are useful for matrix-based calculations (a small example follows this list).
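For instance, the per-row normalizations that show up in both the E-step and the M-step can be written as one vectorized numpy expression instead of nested loops. The variable names below are purely illustrative:

```python
import numpy as np

# Normalize each row of a matrix so it sums to 1, i.e., turn each row
# into a probability distribution (needed for both T and D in PLSA).
M = np.random.rand(2, 5)                     # e.g., 2 topics x 5 words
M_normalized = M / M.sum(axis=1, keepdims=True)
print(M_normalized.sum(axis=1))              # -> [1. 1.]
```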

Writing the PLSA Algorithm:

The original PLSA model does not contain a background model. Since this MP is also based on the original PLSA model, you do not have to worry about the background model. However, the lectures are all about PLSA with a background model, so you should not attempt to map the formulas on the lecture slides directly to code. Instead, you need to adjust the formulas on the slides by assuming that there is zero probability that the background model is chosen; that is, you should set λ_B to zero. If you do this, you will see that the E-step essentially only computes a distribution over the k topics z = 1, …, k given each word, i.e., p(z = i | w). The M-step is also simpler, as p(z = B | w) is zero for all words (due to the fact that λ_B = 0). If you are still confused, please take a look at Section 3 of Chase Geigle's note on EM [2] and the note below.
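Concretely, writing p_{d,j} for the coverage of topic j in document d (the elements of matrix D described below), θ_j for the word distribution of topic j (the rows of matrix T), and c(w,d) for the count of word w in document d, setting λ_B = 0 reduces the lecture updates to those of the original PLSA model. This rendering of the formulas is ours, not copied from the skeleton:

```
E-step:   p(z_{d,w} = j) = \frac{p_{d,j} \, p(w \mid \theta_j)}
                                {\sum_{j'=1}^{k} p_{d,j'} \, p(w \mid \theta_{j'})}

M-step:   p_{d,j}            \propto \sum_{w} c(w,d) \, p(z_{d,w} = j)
          p(w \mid \theta_j) \propto \sum_{d} c(w,d) \, p(z_{d,w} = j)
```

Each ∝ is resolved by normalizing: p_{d,·} over the k topics, and p(· | θ_j) over the vocabulary.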

The main data structures involved in the implementation of this EM algorithm are three matrices:

1. T (topics by words): this is the set of parameters characterizing topic content, which we denoted by the θ_i's. Each element is the probability of a particular word in a particular topic.

2. D (documents by topics): this is the set of parameters modeling the coverage of topics in each document, which we denoted by the p_ij's. Each element is the probability that a particular topic is covered in a particular document.


3. Z (hidden variables): for every document, we need one Z matrix, which represents the probability that each word in the document was generated from a particular topic. For any document, this is therefore a "word-by-topic" matrix, encoding p(z|w) for that particular document. Z is the matrix that we compute in the E-step (based on matrices T and D, which represent our parameters). Note that we need to compute a different Z for each document, so we need to allocate a matrix Z for every document. If we do so, the M-step simply uses all of these Z matrices, together with the word counts in each document, to re-estimate all the parameters, i.e., to update matrices T and D based on Z. Thus, at a high level, this is what's happening in the algorithm (a code sketch follows this list):

- T and D are initialized.
- The E-step computes all of the Z's based on T and D.
- The M-step uses all of the Z's to update T and D.
- We iterate until the likelihood doesn't change much, at which point we use T and D as our output. Note that the Z's are also very useful (can you imagine some applications of the Z's?).
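To make this flow concrete, here is a minimal numpy sketch of the loop. The function name, random initialization, smoothing constant, and convergence threshold are all assumptions for this illustration; this is not the provided `plsa.py` skeleton.

```python
import numpy as np

def plsa_em(counts, k, n_iters=100, tol=1e-4, seed=0):
    """Illustrative PLSA EM loop. counts: (n_docs x n_words) word-count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Initialize T (topics x words) and D (docs x topics) as random distributions.
    T = rng.random((k, n_words)); T /= T.sum(axis=1, keepdims=True)
    D = rng.random((n_docs, k));  D /= D.sum(axis=1, keepdims=True)

    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: Z[d] is the (words x topics) matrix for document d, with
        # Z[d][w, j] = p(z = j | d, w), proportional to D[d, j] * T[j, w].
        Z = D[:, None, :] * T.T[None, :, :]           # (docs x words x topics)
        Z /= Z.sum(axis=2, keepdims=True) + 1e-12     # normalize over topics

        # M-step: re-estimate D and T from all Z's, weighted by word counts.
        weighted = counts[:, :, None] * Z             # c(w,d) * p(z = j | d, w)
        D = weighted.sum(axis=1)                      # sum over words -> (docs x topics)
        D /= D.sum(axis=1, keepdims=True)
        T = weighted.sum(axis=0).T                    # sum over docs  -> (topics x words)
        T /= T.sum(axis=1, keepdims=True)

        # Log-likelihood: sum over d, w of c(w,d) * log p(w | d).
        ll = np.sum(counts * np.log(D @ T + 1e-12))
        if abs(ll - prev_ll) < tol:                   # stop when it barely changes
            break
        prev_ll = ll
    return T, D
```

On `test.txt`, a quick sanity check is that the rows of D for the first 100 labeled documents should each concentrate on a single topic (up to which learned topic ends up matching "Seattle" or "Chicago").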

Resources:

[1] [Cheng's note on the EM algorithm](http://sifaka.cs.uiuc.edu/czhai/pub/em-note.pdf)

[2] [Chase Geigle's note on the EM algorithm](http://times.cs.uiuc.edu/course/598f16/notes/em-algorithm.pdf), which includes a derivation of the EM algorithm (see Section 4)

[3] [Qiaozhu Mei's note on the EM algorithm for PLSA](http://times.cs.uiuc.edu/course/598f16/plsa-note.pdf), which includes a different derivation of the EM algorithm

THIS MP IS CODING AND DEBUGGING HEAVY! PLEASE DON'T LEAVE IT TILL THE LAST MINUTE!
