
School of Computer Science
COMP5349: Cloud Computing Sem. 1/2022
Week 8: DataFrame API and Performance Observation
Objectives


The objectives of this lab are:
• Get familiar with common operations in DataFrame API
• Understand how DataFrame operations are executed by the Spark framework
Question 1: SparkSQL DataFrame Basic Operations
The Colab version of the lecture sample code can be accessed from MovieData Summary.ipynb. Copy this notebook to your Google Drive and modify it to inspect the intermediate DataFrames.
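For example, a minimal inspection sketch; it assumes the notebook defines a ratings DataFrame as in the lecture sample, so adjust the name to whatever cell you are inspecting:

# Minimal sketch: the `ratings` name mirrors the lecture sample.
ratings.printSchema()            # check column names and types
ratings.show(5, truncate=False)  # peek at the first five rows
print(ratings.count())           # force evaluation and report the row count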
Question 2: Execution Observation
a) Start an EMR multi-node cluster
Start a three-node EMR cluster: one master node and two core nodes of c4.xlarge type. Note that in last week's lab you were asked to update the security rules for the master node to open a few additional ports. These turned out to be unnecessary with dynamic port forwarding, provided the proxy and SSH tunnel work correctly. Before creating a new cluster, reset the master security group to its original state by removing those additional rules. The security group can be accessed from the EC2 dashboard under the "Network & Security" menu item. Once you have updated the security group, you can request a cluster by cloning last week's cluster. Remember to change the number of core nodes to 2.
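If you prefer the command line, a roughly equivalent cluster can be requested with the AWS CLI. This is a sketch only: the release label and key pair name are assumptions, and your course setup may require additional options (e.g. a specific subnet or logging bucket), so treat the console cloning route as authoritative.

# Sketch only: 1 master + 2 core nodes of c4.xlarge (values assumed).
aws emr create-cluster \
    --name "comp5349-week8" \
    --release-label emr-6.5.0 \
    --applications Name=Spark \
    --instance-type c4.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=<your-key-pair>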
b) Cluster vs. Client mode
Log in to the master node over SSH once it is provisioned and running; this usually happens before the whole cluster is ready. Install git and clone the Python resources repository:

sudo yum install -y git
git clone \
https://github.sydney.edu.au/COMP5349-Cloud-Computing-2022\
/python-resources.git
You will find the script version (MovieData Summary.py) of the same application under week8. The input data for the script is loaded from S3, and the path is hard-coded in the script. The week8 folder also contains two submission scripts: one uses "cluster" mode and the other uses "client" mode. Both scripts take a single argument: the directory in which to store the application output.
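The submission scripts are most likely thin wrappers around spark-submit that differ only in the deploy mode; the following sketch shows roughly what each one does (the application file name and bucket paths are assumptions):

# Rough equivalent of the "client" mode submission script.
spark-submit --master yarn --deploy-mode client \
    MovieData_Summary.py s3://<your-bucket>/client-output/

# Rough equivalent of the "cluster" mode submission script.
spark-submit --master yarn --deploy-mode cluster \
    MovieData_Summary.py s3://<your-bucket>/cluster-output/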
Run both scripts, remembering to supply a different output directory for each.
You will notice that there is no console output in cluster mode, because the driver program runs on a core node alongside the Application Master (AM).
Spark provides an alternative history server that is not hosted on the master node. It can be accessed by clicking the link provided on the cluster summary page (Figure 1). This server has some latency in collecting running statistics and may label completed applications as "incomplete", so you may need to look for your application's execution statistics under the "incomplete applications" list.
Figure 1: Alternative History Server
From the history server, find the respective applications started by the two scripts. You can easily tell them apart by the last four digits of the application ID. Check the "Executors" tab to compare the location of the driver program and the number of executors used in "client" and "cluster" modes. Figure 2 and Figure 3 show example output for the two modes.
In both cases, the application is submitted from the master node. In "client" mode, the driver runs on the master node; in "cluster" mode, it runs on a core node. You can find the private DNS address of the respective nodes in the hardware tab of the cluster. The number of executors is not specified in either submission script; the decision is made by the framework based on configuration and available resources. You can inspect the relevant properties on the history server's "Environment" tab.
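If you want to control these values yourself rather than rely on the framework's defaults, spark-submit accepts explicit resource flags; the numbers below are illustrative, not tuned:

# Pin the executor count and size explicitly (illustrative values).
spark-submit --master yarn --deploy-mode client \
    --num-executors 2 \
    --executor-cores 2 \
    --executor-memory 2g \
    MovieData_Summary.py s3://<your-bucket>/output/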
Figure 2: Client Mode Execution
Figure 3: Cluster Mode Execution

c) Effect of Caching
In the sample application, the original ratings DataFrame is used in several queries. Each query may start one or more jobs, each with its own data flow that begins by reading the input CSV file and creating the ratings DataFrame. This is not I/O efficient. We can use caching to keep the created DataFrame in memory for later use. Update the sample code by adding .cache() at the end of the CSV read statement, changing
ratings = spark.read.csv(rating_data, header=False, schema=rating_schema)
to
ratings = spark.read.csv(rating_data,
                         header=False, schema=rating_schema).cache()
Resubmit the application using either script and compare the execution statistics of the queries after the first show() statement with those of the previous run. You may notice that the execution plan for Query 1 now starts from InMemoryTableScan (see Figure 4), in contrast to reading the CSV file (as seen in the lecture slides). The overall input/output size will differ as well. Query 1 corresponds to the statement:
ratings.filter("mid<=5").groupBy('mid').avg('rate').show()

Figure 4: Cache Effect

d) Effect of Lazy Execution
A query may be skipped entirely if it does not return any result to the console and its result is not used by a subsequent query that does. To test this, remove .show() from the following statement and resubmit the application using either script:

ratings.filter("mid<=5").groupBy('mid').avg('rate').show()

You will find that this query is not executed at all.

Question 3: Write your own code
Use the DataFrame API to re-implement the average rating per genre application described in the week 5 lecture and lab.
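One possible shape of the solution is sketched below. It assumes MovieLens-style inputs where the movie file carries a pipe-separated genres column; the schemas, column names, and input paths are assumptions, not a required interface.

# Sketch only: average rating per genre with the DataFrame API.
# Schemas and paths are assumptions based on the MovieLens-style data
# used in earlier weeks; adapt them to the actual input files.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, explode, split

spark = SparkSession.builder.appName("GenreAvgRating").getOrCreate()

movies = spark.read.csv("movies.csv", header=False,
                        schema="mid INT, title STRING, genres STRING")
ratings = spark.read.csv("ratings.csv", header=False,
                         schema="uid INT, mid INT, rate DOUBLE, ts LONG")

# Join each rating with its movie, split the genre list into one row
# per genre, then aggregate.
genre_avg = (ratings.join(movies, "mid")
             .withColumn("genre", explode(split("genres", r"\|")))
             .groupBy("genre")
             .agg(avg("rate").alias("avg_rating")))

genre_avg.show()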