School of Computer Science Dr. Ying Zhou
COMP5349: Cloud Computing Sem. 1/2020
Week 4: MapReduce Tutorial
Objectives
Hadoop has grown into a large ecosystem since its launch in 2006. The core of Hadoop consists of a distributed file system, HDFS, and a parallel programming engine, MapReduce. Since Hadoop 2, YARN has been included as the default resource scheduler.
Hadoop can be configured in three modes: standalone, pseudo-distributed and fully distributed. Standalone mode needs only the MapReduce component; the pseudo-distributed and fully distributed modes also require HDFS and YARN.
In this lab, we will practice the pseudo-distributed mode using AWS EMR (Elastic MapReduce). The objectives of this week’s lab are:
• Learn to request an AWS EMR (Elastic MapReduce) cluster and configure access to it, using a single-node cluster (pseudo-distributed mode)
• Practice running a simple program on EMR using HDFS as input/output
• Learn to use basic HDFS commands and to read HDFS information from the web portal
Lab Exercise
Question 1: Start Single Node EMR
This exercise is done on AWS! You will access AWS using a browser and terminal windows on your own desktop or laptop.
a) Request EMR cluster
Log in to the AWS Management Console and find the EMR service under the Analytics heading. Click EMR to enter the EMR dashboard. It has a similar look to the EC2 dashboard, with fewer items on the left panel. Click the Create cluster button to open the cluster specification form. We will use the default settings for most options, except the following:
• Cluster Name. You may change the default “Mycluster” to something more indicative, for instance “COMP5349 wk4”.
• Instance type. The default is m5.xlarge, which might not be available to AWS Educate Starter and Classroom accounts. The instance type m4.xlarge has been tested to work for all accounts.
• Number of instances. The default is 3; change this to 1 to have a single-node cluster.
• EC2 key pair. Choose the one you have created previously.
Once you have updated all necessary options, click the Create cluster button to start creating the cluster. The creation of the cluster may take a few minutes; it involves instance provisioning as well as software installation and configuration. Once the instance is provisioned, you will see a screen similar to figure 1 with the Master Public DNS information.
Figure 1: Master Public DNS
The Master Public DNS is the DNS name of the master node. This is the node you will use to access various web UIs and to interact with the cluster through the command line. This means we need to manage the security settings of this node by allowing incoming traffic to at least port 22. Scroll down to the bottom right corner of the summary page to view the Security and Setting section (see figure 2). Click Security Group for Master to inspect and possibly edit its rules.
Since we are using the same node as both master and slave, you may find that the security group is called Security Group for Slave. Many rules have already been defined in this group to allow internal communication among cluster nodes. You will need to add one rule to allow external SSH connections, which means opening port 22 for incoming TCP traffic. Please note that in the lab, for simplicity, we advise you to open this port to incoming traffic from any node. This is not the preferred option in a real-world situation, where you should only allow SSH connections from trusted computers (see the relevant EMR Guide for more details).
b) Connect to the master node
The cluster consists of a single EC2 node with a few Hadoop components configured and started. We can SSH to it in the same way we connect to any other EC2 instance used in previous labs. A sample connection command can be found on the summary page of the cluster by clicking the SSH link on the Master public DNS row (see figure 1, bottom right). The SSH command looks like:
ssh -i ~/Mykeypair.pem hadoop@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
Make sure to update the private key file’s path to the correct one in the command. You are logged in as user hadoop.
Figure 2: Security Group
Question 2: Run Sample Program
The general syntax for running a MapReduce program packaged in a jar file is:
hadoop jar PathToJarFile [mainClass] [ProgramArgLists]
In this exercise you will run a sample word count program using the data file place.txt as input. You can find the data file in our organization’s repo python-resources, under the week4/data directory.
Note that Git is not installed by default on the master node; you can install it with sudo yum install git. Once installed, run the following command to clone the repository on the EC2 node.
git clone \
https://github.sydney.edu.au/COMP5349-Cloud-Computing-2020/python-resources.git
Assuming you have cloned the repository into the home directory of the hadoop user, the following command executes a sample wordcount program on place.txt:
hadoop-mapreduce-examples wordcount \
file:///home/hadoop/python-resources/week4/data/place.txt \
count_out
hadoop-mapreduce-examples is a script for conveniently invoking the sample programs that come with Hadoop. The Hadoop installation includes several sample programs; the argument immediately after the script name indicates the program to run. In our example, it is the wordcount program. The rest are program-specific arguments. For the wordcount program, we need an input path and an output path. They are specified as:
file:///home/hadoop/python-resources/week4/data/place.txt
count_out
You may have noticed that the input has a protocol prefix file://. This instructs Hadoop to read the input from the local file system. Hadoop can read/write data from a number of file systems. The default file system in cluster mode (including pseudo-cluster mode) is HDFS, and a protocol prefix is needed for any other file system.
Notice that the output path does not have a protocol prefix, which indicates the default HDFS file system; the output will be written to the count_out folder under the current user’s home directory on HDFS. We will see how to inspect files on HDFS in the next exercise.
After the job is submitted, progress information will be printed on the console as the job starts to execute.
Hadoop always creates a new output directory each time you run a program; it does not overwrite any existing directory! If you specify an existing directory as the output directory, the program will exit with an error. When you need to run the same program multiple times, make sure you either remove the output directory before submitting the job, or use a different output directory name in the new run.
Question 3: HDFS basic commands
HDFS provides a list of commands for common file system operations such as listing directory contents, creating directories, copying files and so on. All HDFS file system commands have the prefix hdfs dfs. The actual commands have similar names and formats to the Linux file system commands. Unlike Linux file system commands, the HDFS command line has a fixed present working directory that always points to the current user’s home directory in HDFS; any other location needs to be given as an absolute path starting from the root path /. There is no HDFS command like cd to change the present working directory. Figure 3 shows the results of a few ls command variations:
The first one is an ls command without any parameter. It displays the content of the current user’s home directory in HDFS, which was initially empty but now contains the output directory count_out created by running the sample program in the previous step.
The second one shows the content of the root directory of HDFS /. It contains a few directories like app, tmp, user and var.
The third one shows the content of the directory /user. This is the parent directory of all users’ home directories, including that of user hadoop.
Question 4: Inspect content of files in HDFS
There are two ways to inspect the content of an HDFS file. You may inspect it using commands like cat or tail. Below is a sample cat command to print out the content of the output file.
Figure 3: HDFS Commands and Outputs
hdfs dfs -cat count_out/part-r-00000
A more convenient way is to download the file to the local file system and then view it using either command-line or graphical tools. The following sample command downloads the output file part-r-00000 and saves it as countout.txt in the present working directory:
hdfs dfs -get count_out/part-r-00000 countout.txt
Question 5: Run another Sample Program with input/output both on HDFS
Next, upload place.txt to HDFS and run another sample program with the following commands:
hdfs dfs -put place.txt place.txt
hadoop-mapreduce-examples grep \
place.txt \
placeout \
"/Australia[\d\w\s\+/]+"
The first command uploads the file place.txt from the present working directory (make sure you run this command in the directory containing place.txt, or update the path accordingly) to the home directory of the hadoop user on HDFS, as place.txt.
The second command runs the grep sample program to find all records containing the word “Australia”. The grep program requires three parameters: an input location, an output location and a regular expression to search for. In the above command, place.txt specifies the input location, placeout specifies the output directory, and the last argument is the regular expression. Both paths use the default file system, HDFS.
Question 6: Using Hadoop web UI(s)
HDFS has a web UI displaying the file system information, including the directories and files in HDFS. In EMR this service runs on the master node, listening on port 50070. The usual address to access it is http://<master-public-DNS>:50070.
Remember that any EC2 instance has two network addresses: a public DNS with a format like ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com and an internal DNS with a format like ip-xxx-xxx-xxx-xxx.ec2.internal. A typical HDFS configuration uses the internal DNS for this web UI, so it cannot be reached directly from outside the cluster without some form of SSH tunnelling.
In this lab, we will use the relatively simple local port forwarding option. Local port forwarding requires one SSH tunnel for each service we want to access externally. Detailed instructions can be found in the EMR user guide.
Unfortunately, local port forwarding is not supported on Windows machines. If you are using a Windows machine, you have to use dynamic port forwarding. Details can be found in Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding.
Below is a sample command for local port forwarding. You will need to update it with your own key path and the correct master node’s public DNS name.
ssh -i myKey.pem -N -L \
8157:ec2-3-83-225-88.compute-1.amazonaws.com:50070 \
hadoop@ec2-3-83-225-88.compute-1.amazonaws.com
The basic command ssh -i myKey.pem hadoop@ec2-3-83-225-88.compute-1.amazonaws.com opens an SSH connection to the master node. The option -N means do not execute a remote command. An ssh command typically executes a shell on the remote server to give you a shell prompt. The option -N is the reason why, after you issue the above command, the terminal remains open but does not return a prompt.
The option -L 8157:ec2-3-83-225-88.compute-1.amazonaws.com:50070 specifies the local port forwarding. The parameter order is localPort:remoteDNS:remotePort. This particular command means that any traffic sent to the local port 8157 is forwarded to port 50070 of the remote server ec2-3-83-225-88.compute-1.amazonaws.com through the SSH connection. To access the HDFS web UI, we only need to type http://localhost:8157 in the browser’s address bar; this will load the HDFS Web UI (see figure 4).
Figure 4: HDFS Web Portal Overview
EMR has a few other Web UIs running on the master node. For instance, the YARN web UI listens on port 8088. To access it we need to set up another port forwarding tunnel as follows:
ssh -i myKey.pem -N -L \
8158:ec2-3-83-225-88.compute-1.amazonaws.com:8088 \
hadoop@ec2-3-83-225-88.compute-1.amazonaws.com
This command opens another SSH connection to the same master node; this connection is responsible for forwarding traffic sent to local port 8158 to port 8088 on the master node ec2-3-83-225-88.compute-1.amazonaws.com. From your browser window, a request sent to localhost:8158 will load the YARN web UI as shown in figure 5.
The YARN web UI shows the cluster’s overall resource usage and the applications’ execution statistics, some of which are similar to those printed on the console. The YARN web UI is quite intuitive and you may explore it yourself in this lab. We will inspect those statistics more closely in later labs with Hadoop in a distributed setting.
Figure 5: YARN Web UI
Question 7: A Customized Sample Program
In this exercise, you will see the complete source code of a MapReduce application in Python. The input of the program is a CSV file; each line represents a photo record in the following format:
photo-id owner tags date-taken place-id accuracy
The tags are represented as a list of words separated by white space; each word is called a tag. You can find a sample input file called partial.txt in the week4 directory of the lab commons repository.
The MapReduce application builds a tag-owner index for each tag appearing in the data set. The index key is the tag itself; the entry records the number of times each owner has used this tag in the data set. For example, if a data set contains only these three records:
509657344 7556490@N05 protest 2007-02-21 02:20:03 xbxI9VGYA5oZH8tLJA 14
520345662 8902342@N05 winter protest 2009-02-21 02:20:03 cvx093MoY15ow 14
520345799 8902342@N05 winter protest 2009-02-21 02:35:03 cvx093MoY15ow 14
The program would generate an index with two keys: “protest” and “winter”.
The index of “protest” would look like
protest 7556490@N05=1, 8902342@N05=2,
The index of “winter” would look like
winter 8902342@N05=2,
a) Python Source Code
The Python source code contains two versions: a naive implementation and a version with a combiner function. They are stored under different directories. Each version consists of two Python scripts, implementing the map and the reduce function respectively. In the naive implementation, the map function is designed to process a line of the input file. The key is the byte offset of the line, which we do not need to use; the value is the actual line content as a string. It outputs a list of (tag, owner) pairs.
The reduce function is designed to process all information related to a particular tag. The input key is the tag; the input value is a list of owners that have used this tag. If an owner has used this tag three times, the same owner will appear three times in the value list. The output key is still the tag; the output value is a summary of the owner data with respect to that tag.
Below is a summary of the map and reduce functions highlighting the theoretical input/output key-value pairs:
Map:
(fileOffset, line) => {(tag, owner1), (tag,owner2),…}
Reduce:
(tag,{owner1, owner1, owner2, …}) => (tag,{owner1=x, owner2=y, …})
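To make the naive version more concrete, the following is a minimal sketch of what such a streaming map script could look like. It is only an illustration, not the script shipped in the repository: it assumes tab-separated fields in the order described above, and it relies on the fact that Hadoop Streaming passes only the line content to the mapper on standard input (the byte-offset key is not forwarded).
#!/usr/bin/env python3
# Illustrative naive mapper sketch (not the repository script).
# Assumes tab-separated fields: photo-id, owner, tags, date-taken, place-id, accuracy.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) < 3:
        continue                      # skip malformed records
    owner, tags = fields[1], fields[2]
    for tag in tags.split():
        # emit one (tag, owner) pair per tag; key and value separated by a tab
        print(f"{tag}\t{owner}")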
Please note that the input to Python’s reduce function is only sorted but not grouped. The actual input to the reduce function looks like:
tag1, owner1
tag1, owner1
tag1, owner2
…
tag2, owner3
…
Developers need to implement their own program logic to identify the key change.
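For example, a minimal reducer sketch that detects key changes could look like the following. Again, this is an illustration rather than the repository’s exact script; it only relies on the reducer input arriving sorted by tag, with one tab-separated (tag, owner) pair per line.
#!/usr/bin/env python3
# Illustrative naive reducer sketch (not the repository script).
# Input: sorted "tag<TAB>owner" lines; output: "tag<TAB>owner1=x, owner2=y, ".
import sys

current_tag = None
owner_counts = {}

def emit(tag, counts):
    summary = "".join(f"{owner}={n}, " for owner, n in counts.items())
    print(f"{tag}\t{summary}")

for line in sys.stdin:
    tag, owner = line.rstrip("\n").split("\t", 1)
    if tag != current_tag:
        if current_tag is not None:
            emit(current_tag, owner_counts)   # key changed: flush the previous tag
        current_tag, owner_counts = tag, {}
    owner_counts[owner] = owner_counts.get(owner, 0) + 1

if current_tag is not None:
    emit(current_tag, owner_counts)           # flush the last tag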
In the combiner implementation, a combiner function is applied to do local aggregation after each map task. The combiner function does exactly the same thing as the reduce function. To make this work, the input/output key-value pairs change slightly to become:
Map:
(fileOffset, line) => {(tag, owner1=1),(tag,owner2=1),…}
Combine:
(tag,{owner1=a, owner2=b, …}) => (tag,{owner1=x, owner2=y, …})
Reduce:
(tag,{owner1=a, owner2=b, …}) => (tag,{owner1=x, owner2=y, …})
Again, the Python reduce function sees a slightly different input from the theoretical one. Below are examples of the grouped and ungrouped reduce input, using the combiner version.
Two records in grouped format, the theoretical input of the reduce function as it would be seen if coded in Java:
protest:{7556490@N05=1, 8902342@N05=2}
winter:{8902342@N05=2}
Sorted but not grouped format, the typical input of the reduce function in Python code:
protest:7556490@N05=1
protest:8902342@N05=2
winter:8902342@N05=2
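Since the combiner and the reducer perform the same aggregation, a single script of the following shape could serve both roles. This is a sketch under the assumption that every input line is a sorted "tag<TAB>owner=count" pair; for simplicity it re-emits one such pair per owner, so its output can be consumed by another copy of itself, whereas the final reducer in the repository may instead join all owners of a tag into one summary line as shown above.
#!/usr/bin/env python3
# Illustrative combiner/reducer sketch for the combiner version (not the repository script).
# Input: sorted "tag<TAB>owner=count" lines; output: the same shape, with counts merged.
import sys
from collections import defaultdict

current_tag = None
counts = defaultdict(int)

def flush(tag, tag_counts):
    for owner, n in tag_counts.items():
        print(f"{tag}\t{owner}={n}")

for line in sys.stdin:
    tag, value = line.rstrip("\n").split("\t", 1)
    if tag != current_tag:
        if current_tag is not None:
            flush(current_tag, counts)        # key changed: output the previous tag
        current_tag, counts = tag, defaultdict(int)
    owner, n = value.split("=", 1)
    counts[owner] += int(n)

if current_tag is not None:
    flush(current_tag, counts)                # output the last tag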
The Python scripts are invoked by a streaming jar that comes with the Hadoop installation. We need to provide a number of arguments to indicate the scripts, the input and other properties. A shell script is provided for each version: tag_driver.sh for the naive version and tag_driver_combiner.sh for the combiner version. The shell scripts demonstrate all the arguments needed to run the Python scripts with the streaming jar. Note that you may need to make the shell scripts executable by giving them the appropriate permissions.
b) Test Your Understanding
Using the three-row sample data:
509657344 7556490@N05 protest 2007-02-21 02:20:03 xbxI9VGYA5oZH8tLJA 14
520345662 8902342@N05 winter protest 2009-02-21 02:20:03 cvx093MoY15ow 14
520345799 8902342@N05 winter protest 2009-02-21 02:35:03 cvx093MoY15ow 14
Assuming there is only one reducer, answer the following questions:
• What is the output of the map phase in the naive version?
• What is the input of the reduce phase in the naive version?
• What is the output of the map functions in the combiner version?
• What is the input of the combiner functions in the combiner version?
• What is the output of the combiner functions in the combiner version?
• What is the input of the reduce function in the combiner version?
Question 8: Homework: write your own code
You saw a MapReduce-like program in the week 1 homework exercise. You are asked to convert it into a proper Hadoop MapReduce application. The week 1 exercise asks you to compute the average movie rating for those movies whose ID is in a given range. This would involve passing command line arguments to the MapReduce program, which has not been covered in the lecture or in the lab. In this exercise, you are instead asked to compute the average movie rating for all movies in the data set.
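As a starting point, the key-value design could follow the sketch below: the mapper emits (movieId, rating) pairs and the reducer averages the ratings per movie. The comma-separated input format, the column positions and the script name avg_rating.py are assumptions for illustration only; adapt them to the actual week 1 data set, and split the two functions into separate mapper and reducer scripts when passing them to the streaming jar.
#!/usr/bin/env python3
# avg_rating.py -- illustrative sketch only; adjust field indices to the real data.
import sys

def mapper():
    # emit "movieId<TAB>rating" for each record, assuming "userId,movieId,rating,..." lines
    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) < 3:
            continue
        movie_id, rating = fields[1], fields[2]
        if not rating.replace(".", "", 1).isdigit():
            continue                          # skip a possible header line
        print(f"{movie_id}\t{rating}")

def reducer():
    # input arrives sorted by movieId; average the ratings per movie
    current, total, count = None, 0.0, 0
    for line in sys.stdin:
        movie_id, rating = line.rstrip("\n").split("\t", 1)
        if movie_id != current:
            if current is not None:
                print(f"{current}\t{total / count:.2f}")
            current, total, count = movie_id, 0.0, 0
        total += float(rating)
        count += 1
    if current is not None:
        print(f"{current}\t{total / count:.2f}")

if __name__ == "__main__":
    # run as "python3 avg_rating.py map" for the mapper, any other argument for the reducer
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()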
Notice
Warning:
• Always terminate (or, in some rare cases, stop) all your AWS resources after you finish with them. Otherwise, MONEY WILL BE LOST!
• Always attend the lab at your allocated location and time, unless you have special considerations.
References
• Hadoop: Setting up a Single Node Cluster. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
• HDFS Commands Guide. https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
• AWS EMR: Supported Instance Types
• AWS EMR: Build Binaries
• View Web Interfaces Hosted on Amazon EMR Clusters