
School of Computer Science Dr. Ying Zhou
COMP5349: Cloud Computing Sem. 1/2020
Week 5: Spark Tutorial
Objectives
Similar to MapReduce, Spark can run locally or on a cluster. This lab focuses on local mode, with most exercises run in an Ed workspace configured with the PySpark Docker image. If you prefer working offline, you can install Spark on your own computer by following the guide on the official Spark site.
The objectives of this lab are:
• Understand the basic PySpark RDD API
• Understand the chain of transformations in the PySpark RDD API (a minimal sketch follows this list)
• Practice an Ed-style code challenge
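As a taste of what a chain of transformations looks like, the sketch below builds a small RDD in memory and chains map, reduceByKey, and filter before a single action triggers execution. The application name and the sample data are invented purely for illustration.

from pyspark import SparkContext

sc = SparkContext("local", "chain-demo")     # hypothetical app name

words = sc.parallelize(["spark", "hadoop", "spark", "rdd", "spark"])
counts = (words
          .map(lambda w: (w, 1))              # transformation: pair each word with 1
          .reduceByKey(lambda a, b: a + b)    # transformation: sum counts per word
          .filter(lambda kv: kv[1] > 1))      # transformation: keep repeated words

print(counts.collect())                       # action: triggers the whole chain

sc.stop()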
Data Set and Sample Code
The data set and sample code can be found in folder week 5 of the python-resources repository. Assuming you have already cloned the repository, you only need to run git pull to get the latest update.
The data set contains two csv files: movies.csv and ratings.csv downloaded from http://grouplens.org/datasets/movielens/.
The movies.csv file contains movie information. Each row represents one movie, and has the following fields:
movieId,title,genres
The ratings.csv file contains rating information. Each row represents one rating of
one movie by one user, and has the following fields:
userId,movieId,rating,timestamp
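As a hedged sketch of how these two schemas might be parsed with the RDD API, the snippet below splits each row on commas, drops the header line, and joins ratings with movie titles to compute an average rating per title. The file paths, the header handling, and the naive comma split (which would break on titles containing commas) are assumptions for illustration, not the required solution.

from pyspark import SparkContext

sc = SparkContext("local", "movielens-sketch")       # hypothetical app name

def parse_rating(line):
    # userId,movieId,rating,timestamp
    fields = line.split(",")
    return (fields[1], float(fields[2]))              # (movieId, rating)

def parse_movie(line):
    # movieId,title,genres -- naive split; real titles may contain commas
    fields = line.split(",")
    return (fields[0], fields[1])                     # (movieId, title)

ratings_raw = sc.textFile("file:///path/to/ratings.csv")    # placeholder path
ratings_header = ratings_raw.first()
ratings = (ratings_raw
           .filter(lambda line: line != ratings_header)     # drop header row, if any
           .map(parse_rating))

movies_raw = sc.textFile("file:///path/to/movies.csv")      # placeholder path
movies_header = movies_raw.first()
movies = (movies_raw
          .filter(lambda line: line != movies_header)
          .map(parse_movie))

# Average rating per movie title: join on movieId, then aggregate.
avg_per_title = (ratings
                 .join(movies)                               # (movieId, (rating, title))
                 .map(lambda kv: (kv[1][1], (kv[1][0], 1)))
                 .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                 .mapValues(lambda s: s[0] / s[1]))

print(avg_per_title.take(5))
sc.stop()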
Spark is able to talk to any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, Amazon S3, etc. It can load input files from and write output to those sources. Similar to MapReduce, the input/output location should be specified as a complete URI with a protocol prefix unless the default storage is used. For instance, hdfs:// refers to an HDFS location and file:// refers to the local file system. All sample applications in this week’s lab use a data set located on the local file system.
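The snippet below sketches what these URIs look like in practice; the paths are placeholders and the HDFS namenode address is invented purely for illustration.

from pyspark import SparkContext

sc = SparkContext("local", "uri-demo")       # hypothetical app name

# Local file system, as used in this week's lab:
local_rdd = sc.textFile("file:///path/to/week5/ratings.csv")

# An HDFS location would use the hdfs:// prefix instead, e.g.:
# hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/ratings.csv")

# Output locations follow the same convention:
local_rdd.saveAsTextFile("file:///path/to/output/ratings-copy")

sc.stop()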

Using PySpark in Ed
You can access the COMP5349 Ed site from Canvas by clicking the Ed menu item in the left panel. There are two ways of using PySpark in Ed: workspaces and lessons. The default entry page of the COMP5349 Ed site is the discussion board. You can access workspaces and lessons through the top menu bar; their icons are right next to the discussion board icon (see Figure 1).
Figure 1: Ed top banner menus
Ed workspaces allow you to create your own workspace to run customized code. You can also fork a public workspace as an entry point. You will find a read-only public workspace week5-lecture containing the code used in the lecture. To fork it, open the workspace and click the fork icon in the top right corner (see Figure 2).
Figure 2: Ed workspace top right menus
Ed lessons contain a series of exercises preset by the course staff. Each exercise usually includes a description and template code, and is equipped with an auto-marking mechanism. This week’s exercises are set up as an Ed lesson. When you click the Lessons icon, you will see the lesson Week 5: Spark RDD API Exercises. The lesson has two questions:
• Question 1 contains the simple application covered in the lecture, with two extra cells representing further analysis you are asked to implement.
• Question 2 is a code challenge practice. We have set the deadline to Saturday 28/03/20, 11:00 pm. You can start working on it during the lab and finish it at home. You may need to consult the PySpark API documentation to learn the usage of some operations.