2. (35 points) Featured activity: Analysis of Latin documents for word co-occurrence (Days 3-4: 3-4 hours each day)

Write a word co-occurrence program in Spark and test it for n=2 (2-grams) on just one or two documents. Then scale it up to as many documents as possible.
I have provided two classical texts and a lemmatization file that converts words from their inflected forms to a standard (normal) form. You will make several passes through the documents; the files needed for this process are attached.
Pass 1: Lemmatization using the lemmas.csv file (a loading sketch follows this list)

Pass 2: Identify the co-occurring words and their locations in the texts for two documents.
Pass 3: Repeat this for multiple documents.
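
As a rough illustration of Pass 1, the PySpark sketch below loads lemmas.csv into a dictionary and broadcasts it so executors can look words up locally. It assumes each line of lemmas.csv has the form "inflected form,lemma"; check the actual column layout of the provided file and adjust the parsing accordingly. The application name and variable names are only illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latin-cooccurrence").getOrCreate()
sc = spark.sparkContext

def load_lemmas(path):
    """Build a {word form -> [lemmas]} dictionary from lemmas.csv."""
    lemma_map = {}
    for line in sc.textFile(path).collect():   # the lemma file is small enough to collect
        parts = line.strip().split(",")
        if len(parts) >= 2:
            form, lemma = parts[0].lower(), parts[1].lower()
            # Depending on how the file is spelled, the same j->i / v->u
            # normalization used in the algorithm below may also need to be applied here.
            lemma_map.setdefault(form, []).append(lemma)
    return lemma_map

# Broadcast the map so every executor can look words up locally.
lemmas_bc = sc.broadcast(load_lemmas("lemmas.csv"))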
Here is a rough algorithm (non-MR version):
for each word in the text
    normalize the word spelling by replacing j with i and v with u throughout
    check the lemmatizer for the normalized spelling of the word
    if the word appears in the lemmatizer
        obtain the list of lemmas for this word
        for each lemma, create a key/value pair from the lemma and the location where the word was found
    else
        create a key/value pair from the normalized spelling and the location where the word was found
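
A minimal PySpark rendering of this algorithm, under the assumption that each input row is a (document, chapter, line number, text) tuple and that the broadcast lemma map from the Pass 1 sketch is available, might look as follows; the helper names and the location format are illustrative, not a required interface.

import re

def normalize(word):
    """Lower-case and apply the spelling rules j -> i and v -> u."""
    return word.lower().replace("j", "i").replace("v", "u")

def emit_pairs(doc, chap, line_no, text, lemma_map):
    """Yield (key, location) pairs for every word in one line of text."""
    for pos, raw in enumerate(re.findall(r"[A-Za-z]+", text)):
        word = normalize(raw)
        location = f"{doc}.{chap}.{line_no}.{pos}"
        if word in lemma_map:
            # The word appears in the lemmatizer: emit one pair per lemma.
            for lemma in lemma_map[word]:
                yield (lemma, location)
        else:
            # Unknown word: fall back to the normalized spelling itself.
            yield (word, location)

# Wiring it into an RDD of (doc, chap, line_no, text) rows:
# pairs = lines_rdd.flatMap(
#     lambda r: emit_pairs(r[0], r[1], r[2], r[3], lemmas_bc.value))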
Starting from word co-occurrence with 2-grams (two words co-occurring), increase the co-occurrence to n=3 (three words co-occurring). Create a table of n-grams and their locations, as in the examples below (a Spark sketch follows the tables). Discuss the results and plot the performance and scalability.
n-gram (n=2)              Location
{wordx, wordy}            Document.chap#.line#.position#
Etc.

n-gram (n=3)              Location
{wordx, wordy, wordz}     Document.chap#.line#.position#
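
One possible Spark sketch for producing such tables, for both n=2 and n=3, is given below. It assumes the earlier passes have already produced rows of the form (location prefix, ordered list of lemmas in that line); the variable names and the tiny sample input are made up for illustration.

def line_ngrams(location_prefix, lemmas, n):
    """Yield (n-gram, location) pairs for one lemmatized line of text."""
    for i in range(len(lemmas) - n + 1):
        ngram = tuple(lemmas[i:i + n])            # e.g. (wordx, wordy)
        yield (ngram, f"{location_prefix}.{i}")   # position of the first word

def build_table(lemmatized_lines, n):
    """Return an RDD of (n-gram, [locations]) rows, i.e. the table above."""
    return (lemmatized_lines
            .flatMap(lambda row: line_ngrams(row[0], row[1], n))
            .groupByKey()
            .mapValues(list))

# Tiny made-up input, just to show the expected shape of the rows.
lemmatized_lines = sc.parallelize([
    ("caesar.1.1", ["gallia", "sum", "omnis", "diuido"]),
    ("caesar.1.2", ["pars", "tres", "qui", "unus"]),
])
bigrams = build_table(lemmatized_lines, 2)    # n = 2 table
trigrams = build_table(lemmatized_lines, 3)   # n = 3 table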


In this activity you are required to “scale up” the word co-occurrence by increasing the number of documents processed from 2 to n. Record the performance of the Apache Spark infrastructure and plot it; a table like the ones above will have thousands of entries. Add the documents incrementally rather than all at once.
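
One simple way to record these timings is to re-run the job on a growing prefix of the document list and force evaluation each time, roughly as sketched below; run_cooccurrence and the file names are placeholders for your own pipeline and texts.

import time

def run_cooccurrence(paths, n):
    """Placeholder driver: replace the body with your full pipeline
    (lemmatization pass followed by build_table)."""
    return sc.textFile(",".join(paths))   # Spark accepts comma-separated paths

documents = ["text1.txt", "text2.txt"]    # extend this list as you add texts
timings = []

for k in range(1, len(documents) + 1):
    start = time.time()
    run_cooccurrence(documents[:k], n=2).count()   # count() forces evaluation
    timings.append((k, time.time() - start))

for k, seconds in timings:
    print(f"{k} document(s): {seconds:.1f} s")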
