
School of Information Technology and Electrical Engineering

INFS3208 – Cloud Computing

Individual Coding Assignment III (10 Marks)

Task Description:

In this assignment, you are asked to write a piece of Spark code to count the occurrences of verbs in the collection of Shakespeare's works. The returned result should be the top 10 verbs that are most frequently used in Shakespeare's collection. This assignment tests your ability to use transformation and action operations in Spark RDD programming. You will be given the collection file (shakespeare.txt), a verb list (all_verbs.txt), a verb dictionary file (verb_dict.txt), and the programming environment (a docker-compose file). You can choose either Scala or Python to program in the Jupyter Notebook. Your code submission must meet the following technical requirements:

1. You should use an appropriate method to load the files into RDDs. You are NOT allowed to make changes to the collection file (shakespeare.txt), the verb list file (all_verbs.txt), or the verb dictionary file (verb_dict.txt).

2. To accurately count the verbs in the collection, you should use the learned RDD operations to pre-process the text in the collection file:
   a. Remove empty lines;
   b. Remove punctuation marks that could attach to the verbs;
      E.g., “work,” and “work” will be counted differently if you DO NOT remove the punctuation mark.
   c. Change the capitalization or case of the text;
      E.g., “WORK”, “Work” and “work” will be counted as three different verbs if you DO NOT convert them all to lower case.

3. You should use the learned RDD operations to find the verbs used in the collection (shakespeare.txt) by matching them against the given verb list (all_verbs.txt).

4. A verb can have different forms, such as the third-person singular, past tense, past participle, and present participle:
   a. E.g., regular verb: “work” – “works”, “worked”, and “working”.
   b. E.g., irregular verb: “begin” – “begins”, “began”, and “begun”.
   c. E.g., the linking verb “be” and its various forms, including “is”, “am”, “are”, “was”, “were”, “being” and “been”.
   You should use the learned RDD operations to calculate the occurrences of all the verbs (listed in the given verb dictionary file) and merge the verbs that have different forms by looking them up in the verb dictionary file (verb_dict.txt).
   d. E.g., (work, 100), (works, 50), (working, 150) → (work, 300).
5. In the final result, you should return the top 10 verbs (in the base form, e.g., work) that are most frequently used in the collection of Shakespeare.

Preparation:

In this individual coding assignment, you will apply your knowledge of Spark and Spark RDD programming (Lectures 10 & 11). Firstly, you should read the Task Description to understand what the task is and what the technical requirements include. Secondly, you should review all the transformation and action operations covered in Lectures 10 & 11; the Appendix lists some transformation and action operations you could use in this assignment. It would also be very helpful to practise the RDD programming activity in Prac 9 (Week 10). Lastly, you need to write the code (Scala or Python) in the Jupyter Notebook. All technical requirements need to be fully met to achieve full marks.

You can practise either on a GCP VM or on your local machine with Oracle VirtualBox if you are unable to access GCP. Please read the Example of writing Spark code below for more details.

Assignment Submission:

▪ You need to compress the Jupyter Notebook file.
▪ The compressed file should be named “FirstName_LastName_StudentNo.zip”.
▪ You must make an online submission to Blackboard before 1:00 PM 22/10/2021 (Week 12).
▪ Only one extension application can be approved, and only for medical reasons.

Example of writing Spark code:

Step 1:

Log in to your VM and change to your home directory.

Step 2:

Download docker-compose.yml and the data files you require (shakespeare.txt, all_verbs.txt, verb_dict.txt).

git clone https://github.com/csenw/cca3.git && cd cca3

sudo chmod -R 777 nbs/

Step 3:
Run all the containers: docker-compose up -d

Step 4:
Open the Jupyter Notebook (EXTERNAL_IP:8888) and write the Spark code in it.

Step 5:
Use the learned method to load the external files (shakespeare.txt, all_verbs.txt, verb_dict.txt) into RDDs.

Output samples:

shakespeare.txt


all_verbs.txt

verb_dict.txt
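For this step, a minimal Python sketch is shown below. It assumes the notebook already exposes a SparkContext as sc and that the three data files sit in the notebook's working directory; the variable names and file formats noted in the comments are only illustrative assumptions.

text_rdd = sc.textFile("shakespeare.txt")    # one element per line of the collection
verbs_rdd = sc.textFile("all_verbs.txt")     # assumed format: one verb form per line
dict_rdd = sc.textFile("verb_dict.txt")      # assumed format: one comma-separated verb group per line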

Step 6:

Use the learned RDD operations to pre-process the RDD that stores the text:

1. Remove empty lines;
2. Remove punctuation marks that could attach to the verbs;
3. Change the capitalization or case of the text.

Output sample:
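As a rough illustration of this step (not the required solution), the pre-processing could look like the Python sketch below, continuing from the text_rdd variable assumed in Step 5; the regular expression is just one way to strip punctuation.

import re

words_rdd = (text_rdd
    .filter(lambda line: line.strip() != "")        # 1. remove empty lines
    .flatMap(lambda line: line.split())             # split each line into words
    .map(lambda w: re.sub(r"[^a-zA-Z]", "", w))     # 2. strip punctuation attached to words
    .map(lambda w: w.lower())                       # 3. convert everything to lower case
    .filter(lambda w: w != ""))                     # drop tokens that were pure punctuation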

Step 7:

Use the learned RDD operations to keep only the verbs that appear in all_verbs.txt.

Output sample:
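One possible way to implement this step, assuming the words_rdd and verbs_rdd variables from the earlier sketches: broadcast the (small) verb list and filter against it. A pair-RDD join between (word, 1) pairs and the verb list would work equally well.

verb_set = sc.broadcast(set(verbs_rdd.collect()))                 # the verb list is small enough to collect
kept_verbs_rdd = words_rdd.filter(lambda w: w in verb_set.value)  # keep only tokens that are known verbs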

Step 8:

Use the learned RDD operations to count the occurrences of the kept verbs:

Output sample:
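This step is the classic word-count pattern; a sketch assuming kept_verbs_rdd from the previous step:

verb_counts_rdd = (kept_verbs_rdd
    .map(lambda v: (v, 1))               # (verb form, 1) pairs
    .reduceByKey(lambda a, b: a + b))    # (verb form, total count) pairs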

Step 9:

Use the learned RDD operations to merge the counts of forms that belong to the same verb, e.g., (work, 100), (works, 50), (working, 150) → (work, 300).

Output sample:
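A sketch of the merging step is given below. It assumes (check the actual file, as this is only an assumption) that each line of verb_dict.txt lists one verb group as comma-separated forms with the base form first, e.g. work,works,worked,working.

# build (form, base form) pairs from the dictionary file
form_to_base = (dict_rdd
    .map(lambda line: line.split(","))
    .flatMap(lambda forms: [(f.strip(), forms[0].strip()) for f in forms]))

# join the per-form counts with the dictionary and re-aggregate on the base form
merged_counts_rdd = (verb_counts_rdd
    .join(form_to_base)                      # (form, (count, base))
    .map(lambda kv: (kv[1][1], kv[1][0]))    # (base, count)
    .reduceByKey(lambda a, b: a + b))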

Step 10:

Use the learned RDD operations to return the top 10 verbs that are most frequently used in the collection of Shakespeare.

Output sample:
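To finish, one way to extract the top 10, assuming merged_counts_rdd from Step 9: swap each pair so the count becomes the key, which lets sortByKey (listed in the Appendix) do the ordering.

top10 = (merged_counts_rdd
    .map(lambda kv: (kv[1], kv[0]))      # (count, base verb)
    .sortByKey(ascending=False)          # highest counts first
    .map(lambda kv: (kv[1], kv[0]))      # back to (base verb, count)
    .take(10))                           # action: return the first 10 pairs to the driver
print(top10)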

Note that Steps 5-10 are only a recommended procedure; feel free to use your own processing steps.

However, your result should reasonably reflect the top 10 verbs that are most frequently used in the

collection of Shakespeare.

Appendix:

Transformations:

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numPartitions]): Return a new dataset that contains the distinct elements of the source dataset.

groupByKey([numPartitions]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.

reduceByKey(func, [numPartitions]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numPartitions]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numPartitions]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
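For illustration only, a tiny Python example (assuming a SparkContext sc) that combines a few of the transformations above in the spirit of this assignment:

words = sc.parallelize(["Work", "begin", "work,", ""])
cleaned = words.map(lambda w: w.strip(",").lower()).filter(lambda w: w != "")
pairs = cleaned.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
# pairs now holds ("work", 2) and ("begin", 1), in some order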

Actions:

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Return the number of elements in the dataset.

first(): Return the first element of the dataset (similar to take(1)).

take(n): Return an array with the first n elements of the dataset.

countByKey(): Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
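Continuing the small example above, a few of these actions in use:

pairs.collect()    # all (verb, count) pairs as a Python list at the driver
pairs.count()      # number of distinct verbs, i.e. 2 in this example
pairs.take(1)      # a one-element list containing the first pair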

https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
https://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures-a-nameclosureslinka