COMP5349: Cloud Computing
School of Computer Science Dr.
Sem. 1/2022
24.03.2022

Objectives
The objectives of this lab are:
• Understand basic operations in the PySpark RDD API
• Understand the chain of transformations in the PySpark RDD API
Like MapReduce, Spark can run locally or on a cluster. This lab focuses on local mode, and the exercises run on Google Colab. If you prefer to work offline, you can install Spark on your own computer. There are two local installation options: follow the guide on the official Spark website to install Spark and configure it to work with your Python installation, or install an image.
Using PySpark in Google Colab
To use PySpark on Google Colab, you will need a Google account and some free space on your Google Drive. As PySpark is not a standard component, each time you start the Colab notebook, you will need to install PySpark on the virtual machine running the notebook. You also need to re-mount your Google Drive on the virtual machine. This may take a couple of minutes.
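The per-session setup described above might look roughly like the following. This is a minimal sketch: the helper name `mount_drive` is hypothetical, and the mount point `/content/drive` is an assumed default rather than something this handout specifies. The sample notebook's own instructions take precedence.

```python
# In a Colab cell, PySpark is usually installed first with the notebook
# shell command:  !pip install pyspark
# (a notebook command, not plain Python, so it is shown here as a comment).

def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return True on success."""
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)  # prompts for authorisation in Colab
    return True

mounted = mount_drive()
print("Drive mounted" if mounted else "Not running in Colab; Drive not mounted")
```

Outside Colab the helper simply returns `False`, so the same cell can be kept in a notebook that is also run against a local Spark installation.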
A sample notebook with the week 5 lecture code has been prepared, with specific instructions for the Colab environment. The sample notebook can be accessed from this address: COMP5349 Week 5 Lecture code. Alternatively, download the code from the Git python-resources repository. Either way, you need to save a copy to your own Google Drive.

Preparing the Data File
The sample notebook uses the following three text files as input for different applications: 1984 processed.txt, movies.csv and ratings.csv. The data files can be downloaded from the same repository. Upload them to your Google Drive under the folder comp5349. The first file is the input of the word count sample application, and the two csv files are the inputs of the movie average rating sample application.
Coding Exercise
Write your own Spark code using the RDD API to find from the two csv files:
• all movies without a genre listed, printing the IDs of those movies. The data set uses the special name ‘(no genres listed)’ for such cases. You may hard-code the genre name in your functions.
• the top 5 movies in the ‘Documentary’ genre based on the number of ratings that a movie has received. You may hard-code the genre name in your functions.
• the top 5 genre pairs that co-occur most frequently in the data set. Many movies can be classified into more than one genre. The genre names are concatenated as a single string using the “|” character. For example, Jumanji has three genres: “Adventure|Children|Fantasy”. Its row in movies.csv has the following content:
2,Jumanji (1995),Adventure|Children|Fantasy
This row contributes one occurrence to each of three co-occurring genre pairs: (Adventure, Children), (Adventure, Fantasy), and (Children, Fantasy). It is expected that some genre pairs co-occur more often than others.
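The pair extraction for a single row can be sketched in plain Python before it is wired into an RDD operator. The function name `genre_pairs` is hypothetical, and sorting the genres so that each pair has one canonical order is a design assumption, not something the lab prescribes.

```python
import csv
from itertools import combinations

def genre_pairs(line):
    """Return every unordered genre pair found in one movies.csv row."""
    # csv.reader copes with titles that contain commas inside quotes.
    fields = next(csv.reader([line]))
    genres = fields[-1].split("|")
    # Sort first so that (Children, Adventure) and (Adventure, Children)
    # are counted as the same pair.
    return list(combinations(sorted(genres), 2))

row = "2,Jumanji (1995),Adventure|Children|Fantasy"
pairs = genre_pairs(row)
print(pairs)
```

A row with a single genre yields an empty list, so it contributes no pairs. In a Spark pipeline, a function of this shape would typically be passed to `flatMap`, followed by a `map` to `(pair, 1)` and a `reduceByKey` to count each pair.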
For each exercise question, please start by designing the draft computation graph / pipeline and the data structure of the RDDs involved. Then, determine the functions to be used for various operators in the graph. You can use the same notebook to implement the exercise.
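As one possible illustration of this design process, the first exercise could be drafted as the pipeline "text lines → (movie ID, genre list) → filter → movie IDs". The sketch below is not the reference solution. The names `parse_movie`, `has_no_genre` and `no_genre_movie_ids` are hypothetical, and `sc` is assumed to be an existing `SparkContext` created in the notebook.

```python
import csv

NO_GENRE = "(no genres listed)"  # hard-coded, as the exercise permits

def parse_movie(line):
    """Split one movies.csv row into (movie_id, genre_list)."""
    fields = next(csv.reader([line]))
    return fields[0], fields[-1].split("|")

def has_no_genre(record):
    """True when the genre list holds only the placeholder name."""
    _, genres = record
    return genres == [NO_GENRE]

def no_genre_movie_ids(sc, path):
    """Run the pipeline: lines -> (id, genres) -> filter -> ids."""
    return (sc.textFile(path)          # RDD of raw text lines
              .map(parse_movie)        # RDD of (movie_id, genre_list)
              .filter(has_no_genre)    # keep movies with no genre
              .map(lambda record: record[0])
              .collect())              # bring the IDs back to the driver
```

Keeping the per-record functions separate from the pipeline means they can be checked on a single row, such as the Jumanji example, before any Spark job is run.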