School of Computer Science Dr. Ying Zhou
COMP5349: Cloud Computing Sem. 1/2020
Week 10: Machine Learning with Spark
07.05.2020

1 Objectives
In this lab, you will practice two options for doing machine learning with Spark:
• By using the Spark Machine Learning API
• By combining Spark with other ML packages, for instance models trained using Tensorflow and Tensorflow-hub

2 Code Samples
The code samples are released as two notebooks:
• Spark Machine Learning Samples.ipynb: this notebook shows how to use Spark's machine learning API, in particular how to read and convert the input data source to the format the Spark ML library accepts. It also demonstrates result visualization on the driver program.
• Spark Encoder.ipynb: this notebook demonstrates how to use Spark to accelerate deep learning models when they are used to generate encodings of input data.
3 Data Set
All data sets used in this lab are stored on S3.
The first notebook uses a subset of the MNIST handwritten digit recognition data set. Each data record represents a gray-scale picture of 28×28 dimensions and is stored as the pixels' gray-scale values. Only 1,000 records and their corresponding labels are used in the lab exercise. The data and labels are stored as separate files.
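For reference, the sketch below (not taken from the notebook; the S3 paths and file layout are assumptions for illustration) shows one way to pair the image and label files by position and convert the pixel values into the DataFrame-of-Vectors format that Spark ML expects:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mnist-sketch").getOrCreate()

# Hypothetical S3 locations; images and labels are stored as separate files.
images = spark.sparkContext.textFile("s3://your-bucket/mnist/images.csv")
labels = spark.sparkContext.textFile("s3://your-bucket/mnist/labels.csv")

# Key each record by its position so images can be joined with their labels.
pairs = images.zipWithIndex().map(lambda x: (x[1], x[0])) \
              .join(labels.zipWithIndex().map(lambda x: (x[1], x[0])))

def to_labeled_vector(kv):
    _, (pixels, label) = kv
    # Each image line is assumed to hold 784 comma-separated gray-scale values.
    return (int(label), Vectors.dense([float(p) for p in pixels.split(",")]))

df = pairs.map(to_labeled_vector).toDF(["label", "features"])
df.show(3)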
The second notebook uses review data released by Amazon. The data is stored as a compressed TSV file containing review text and various metadata fields. Again, to limit resource usage, only 10,000 records are included in the analysis.
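Loading such a file is straightforward, since Spark reads gzip-compressed text transparently. The sketch below reuses the spark session from above and assumes a hypothetical path and the standard Amazon review column name review_body:

# Hypothetical path; sep="\t" parses the tab-separated review file.
reviews = spark.read.csv("s3://your-bucket/amazon_reviews.tsv.gz",
                         sep="\t", header=True)
reviews.select("review_body").show(3, truncate=60)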
4 Running Environment
This lab is designed to run entirely on EMR; both notebooks are provided as EMR notebooks. You may adapt them to run in a local environment, which mainly involves setting up the required packages locally and updating the file paths. Also note that EMR notebooks use a different Jupyter magic command for visualization; you need to change it to the local magic command to display matplotlib plots properly.
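As an illustration (assuming the Sparkmagic PySpark kernel that EMR notebooks use), a plotting cell ends with the %matplot magic on EMR, whereas a local notebook would use %matplotlib inline at the top of the cell instead:

import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
%matplot plt    # EMR notebook magic; use %matplotlib inline in a local notebook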
We will use a single-node EMR cluster for this lab. After going through the sample code, you may start a multi-node cluster to process a larger data set.
5 Software Requirements
This lab uses a few extra packages on the EMR cluster, which we need to install before running the notebooks. This section provides a summary of the software used and the ways to install it; the next section gives detailed installation steps.
• The Matplotlib package is needed to support the visualization cells in the Spark Machine Learning Samples notebook. Matplotlib only needs to be installed on the master node.
• Tensorflow and Tensorflow-hub are needed for Spark Encoder.ipynb. Both require cluster-wide installation and configuration, as they will be used by worker nodes to run tasks. Tensorflow is part of the EMR software release and can be installed by selecting it during cluster software setup. Tensorflow-hub needs to be installed via a bootstrap action to make it available cluster wide.

6 Customized Cluster Configuration
To simplify the installation process, we put all installation commands in a bootstrap script and load the script at cluster start-up. This installs all packages cluster wide; there is no need to log in to the master node to run any extra installation.
The bootstrap script (wk10_bootstrap.sh) can be found in the course repo under the week10 folder. It contains the following commands:
sudo python3 -m nltk.downloader -d /usr/local/share/nltk_data all
sudo pip-3.6 install --quiet tensorflow-hub
sudo pip-3.6 install --quiet matplotlib
The first line installs all nltk data in case you want to use nltk for tokenization. The second and third lines install tensorflow-hub and matplotlib respectively. The bootstrap script needs to be uploaded to S3 so that it can be used at cluster start time.
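If you do use nltk, a quick way to confirm the downloaded data is visible (a sketch; the sample sentence is arbitrary) is:

from nltk.tokenize import word_tokenize

# word_tokenize relies on the punkt data installed by the bootstrap script.
word_tokenize("Spark makes distributed machine learning easier.")
# ['Spark', 'makes', 'distributed', 'machine', 'learning', 'easier', '.']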
6.1 Upload bootstrap action script to S3
Log in to the AWS console and proceed to the S3 dashboard. Assuming you created a bucket called “comp5349-unikey” in week 9, open this bucket and create a folder in it to store week 10 content; you may call it “week10”. Then upload wk10_bootstrap.sh from the week10 folder of the python-resources repo to the S3 folder you just created.
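If you prefer to script the upload instead of using the console, a minimal sketch using boto3 (assuming your AWS credentials are configured locally and the bucket and folder names above) would be:

import boto3

s3 = boto3.client("s3")
# Upload the local script to s3://comp5349-unikey/week10/wk10_bootstrap.sh
s3.upload_file("wk10_bootstrap.sh", "comp5349-unikey", "week10/wk10_bootstrap.sh")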
6.2 Start single node EMR cluster
Start a single-node EMR cluster in the same way as you did in the week 9 lab. An easy option is to clone the cluster you created in week 9. You can review and change settings at each step before actually launching it.
• Step 1: Software and steps configuration: select EMR release 5.29 from the drop-down list, then tick Hadoop 2.8.5, Spark 2.4.4, Livy 0.6.0 and Tensorflow 1.14.0 to include the four components. Keep the configuration setting the same as last week, that is, add the property to enable maximize resource allocation:
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
This property can be saved as a JSON file and uploaded to S3 to be used with the ‘Load JSON from S3’ option.
• Step 2: Hardware configuration: same as in week 9.
• Step 3: General cluster settings: change the cluster name as usual, then expand the Bootstrap Actions options at the bottom of the page. In the Add bootstrap action drop-down list, select “Custom action”, then click Configure and Add. This will bring up a window allowing you to select the script wk10_bootstrap.sh you uploaded to S3.
• Step 4: Security: select the key you created in earlier labs in the EC2 key pair selection box.
Click “Launch cluster” after all settings are updated. The additional bootstrap action adds considerable latency to the cluster start-up time; you may need to wait ten more minutes before the cluster is fully ready to use.
7 Run EMR notebooks
Spark Machine Learning Samples.ipynb is a template for you to create your own EMR notebook. Follow the instructions from the week 9 lab to create an EMR notebook with the given template. Inspect and run all cells to observe the output, and do the suggested exercises in the notebook.
Spark Encoder.ipynb is another template for you to create your own EMR notebook. A cluster can run as many notebooks as the master node's resources can support, so you can add this notebook to the running cluster. If you encounter memory issues, try stopping the previous notebook, as this notebook is expected to use a lot of master node resources. Inspect and run all cells to observe the output, paying special attention to the various data format conversions.
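To illustrate the pattern the notebook relies on (a sketch only, not the notebook's actual code: the module URL, column name, and TF 1.x session-style API are assumptions matching the Tensorflow 1.14 install), a Tensorflow-hub text encoder can be applied with mapPartitions so the model is loaded once per partition rather than once per record:

import tensorflow as tf
import tensorflow_hub as hub

MODULE_URL = "https://tfhub.dev/google/universal-sentence-encoder/2"  # assumed module

def encode_partition(rows):
    texts = [row.review_body for row in rows]  # assumed column name
    if not texts:
        return
    # Build the graph and load the hub module once for the whole partition.
    with tf.Graph().as_default():
        embed = hub.Module(MODULE_URL)
        text_input = tf.placeholder(dtype=tf.string, shape=[None])
        embeddings = embed(text_input)
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
            vectors = sess.run(embeddings, feed_dict={text_input: texts})
    for text, vec in zip(texts, vectors):
        yield (text, vec.tolist())

# "reviews" is the review DataFrame loaded earlier.
encoded = reviews.select("review_body").rdd.mapPartitions(encode_partition)
encoded.take(1)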
As an additional exercise, you can create a cluster with three or more nodes and run the notebook on the entire review data set.