A4
Tidy Data
In this problem, you will wrangle the output of the mtr command, which is in a different format than the one used in A3, into a Tidy Dataset useful for analysis.
We will be working with another mtr output, this one located at s3://guanly502/A4/mtr.www.cnn.com.txt.
The output of mtr to www.cnn.com looks like this. The output is formatted differently because it was produced in text mode, not in CSV mode. Whoops! This sort of thing happens all the time in the real world. The text output is designed for humans and is not machine-readable. Your job is to turn this human-formatted output into something your existing scripts can use.
Start: Wed Dec 28 23:36:02 2016
HOST: pidora.local Loss Snt Last Avg Best Wrst StDev
1. 192.168.10.1 0.0 1 1.6 1.6 1.6 1.6 0.0
2. 96.120.104.177 0.0 1 10.2 10.2 10.2 10.2 0.0
3. 68.87.130.233 0.0 1 9.8 9.8 9.8 9.8 0.0
4. ae530ar01.capitolhghts 0.0 1 10.8 10.8 10.8 10.8 0.0
5. be33657cr02.ashburn.va. 0.0 1 12.9 12.9 12.9 12.9 0.0
6. hu01001pe07.ashburn. 0.0 1 11.2 11.2 11.2 11.2 0.0
7. ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
8. ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
9. 151.101.32.73 0.0 1 12.6 12.6 12.6 12.6 0.0
Start: Wed Dec 28 23:37:02 2016
HOST: pidora.local Loss Snt Last Avg Best Wrst StDev
1. 192.168.10.1 0.0 1 1.3 1.3 1.3 1.3 0.0
2. 96.120.104.177 0.0 1 8.7 8.7 8.7 8.7 0.0
3. 68.87.130.233 0.0 1 9.6 9.6 9.6 9.6 0.0
4. ae530ar01.capitolhghts 0.0 1 10.5 10.5 10.5 10.5 0.0
5. be33657cr02.ashburn.va. 0.0 1 14.0 14.0 14.0 14.0 0.0
6. hu01001pe07.ashburn. 0.0 1 13.7 13.7 13.7 13.7 0.0
7. ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
8. ??? 100.0 1 0.0 0.0 0.0 0.0 0.0
9. 151.101.32.73 0.0 1 12.6 12.6 12.6 12.6 0.0
Let's look at the structure of the file more closely. We first see a timestamp on the line starting with Start:, then the originating hostname on the line starting with HOST:, and finally the series of hops with their associated data. This pattern repeats itself; the next instance of mtr starts again with the Start: line.
The numeric columns mean the following:
Loss: the percentage of packets lost at that hop
Snt: the number of packets sent (only one was sent per mtr run)
Last, Avg, Best, Wrst, StDev: statistics about the packet's travel time. Since only one packet was sent, all values are equal.
While it is possible to generate a more structured output, like the one we had for A3, the goal of this problem is to practice wrangling skills and create a dataset that is useful for analysis. Another goal is to learn how to deal with, and live with, faulty or incomplete data: if you look at the hostnames, you see either an IP address or a truncated hostname. This may be all that you have; it is all the information available to us from this output. Another difference between this output format and last week's is that the durations are given in milliseconds, rounded to a tenth, whereas last week's output gave whole microseconds.
1. Using ingestmtr.py from A3 as a reference, write a Python program that processes this raw dataset and creates a tidy, clean dataset. The expected tidy dataset should have the following structure:
timestamp, hopnumber, ipaddr, hostname, pctloss, time
where:
timestamp is the time when mtr was run, in ISO 8601 format
hopnumber is the hop number from the printout
ipaddr is the IP address of the hop. You have this for some of the hosts in the CNN file. For others, you may be able to find the IP address in the file s3://guanly502/A3/mtr.www.comcast.com.2016.txt.
hostname is the complete hostname. As you can see from the sample data above, the hostname values are truncated. Once again, you should see if you can find the complete hostname for each host in the A3 dataset. You will then need to make a lookup table that transforms the incomplete hostname into the complete hostname.
pctloss is the numeric portion of the Loss column. Because the mtr command sent only a single packet, the loss is either 0 or 100.
time is the time the packet took to travel, in milliseconds. This is the Last column. Of course, since only a single packet was sent, it is also the Avg column, the Best column, and the Wrst column.
Your Python program will need to use some of the tools we discussed in class. Specifically, you will need to use a regular expression to parse each line, and you will need to use an in-memory join to transform the partial hostnames into complete hostnames. Where do you find the complete hostnames? In our original dataset!
We have provided you with a skeletal version of the program, which we have called mtrfix.py. Your job is to finish it. We have also provided you with an initial test program, mtrfixtest.py, which you are free to expand.
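To give a feel for the approach, here is a minimal sketch of the parsing loop. The skeleton we provide may be organized differently; the regular expressions, the function name, and the hostname_lookup table below are illustrative assumptions, not the skeleton's actual structure.

import re
import datetime

# Illustrative patterns for the two kinds of lines we care about.
START_RE = re.compile(r"^Start:\s+(.*)$")
HOP_RE   = re.compile(r"^\s*(\d+)\.\S*\s+(\S+)\s+([\d.]+)%?\s+(\d+)\s+([\d.]+)")

def parse_mtr_text(lines, hostname_lookup):
    """Yield (timestamp, hop_number, ip_addr, hostname, pct_loss, time) tuples."""
    timestamp = None
    for line in lines:
        m = START_RE.match(line)
        if m:
            # "Wed Dec 28 23:36:02 2016" -> "2016-12-28T23:36:02" (ISO 8601)
            dt = datetime.datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
            timestamp = dt.isoformat()
            continue
        m = HOP_RE.match(line)
        if m and timestamp is not None:
            hop_number = int(m.group(1))
            host_field = m.group(2)            # an IP address or a truncated hostname
            pct_loss   = float(m.group(3))
            time_ms    = float(m.group(5))     # the Last column
            # In-memory join: map what mtr printed to (ip_addr, full hostname).
            ip_addr, hostname = hostname_lookup.get(host_field, ("", host_field))
            yield (timestamp, hop_number, ip_addr, hostname, pct_loss, time_ms)

Here hostname_lookup stands in for the lookup table you build from the A3 dataset; constructing it, deciding what to do with the "???" hops, and writing the rows out are left to your program.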
Getting to know Spark
In this assignment we will get our first taste of Apache Spark.
Spark provides for large-scale, complex, and interactive manipulation of massive datasets. It builds upon many of the technologies that we've already learned (a tiny PySpark fragment illustrating these ideas follows the lists below). Specifically:
Large datasets can be stored in HDFS or S3.
Map operations can be applied to every element in a dataset.
Datasets can be stored as (key, value) pairs.
Reduce operations can combine multiple elements together.
A single Master node controls the cluster. You will log into the Master node to run your programs.
However, we will see some improvements as well:
You can type interactive commands in Python and see the results immediately.
You can use ipython or Jupyter Notebook to show your results.
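To make those ideas concrete, here is a tiny illustrative PySpark fragment (not part of the assignment); it assumes sc is the SparkContext that pyspark creates for you:

# Map each word to a (key, value) pair, then reduce to combine the counts per key.
words  = sc.parallelize(["spark", "hdfs", "spark", "emr"])
pairs  = words.map(lambda w: (w, 1))              # map step: one (word, 1) pair per element
counts = pairs.reduceByKey(lambda a, b: a + b)    # reduce step: add up the 1s for each word
print(counts.collect())                           # e.g. [('spark', 2), ('hdfs', 1), ('emr', 1)] (order may vary)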
Create your cluster!
In this section you will create a cluster and try out both ipython and Jupyter notebook.
Create an EMR cluster with Advanced Options and the following configuration:
emr-5.3.1
Hadoop 2.7.3
Spark 2.1.0
Master Nodes: m3.xlarge count: 1
Core Nodes: none
Task Nodes: None
Cluster Name: Spark 1
Options: Logging, Debugging, no Termination Protection
Bootstrap Actions: s3://guanly502/bootstrap-spark.sh
Note: this is a different bootstrap!
Log into the cluster.
iPython
First we want you to look at iPython, an interactive version of Python that offers completion and history. iPython is similar to the Jupyter notebook, except it runs from the command line.
iPython has been installed on our clusters. To try it, log into one of your existing clusters and type ipython3:
ipython3
Python 3.4.3 (default, Sep  1 2016, 23:33:38)
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:
For a test, set the variable course to be "ANLY502":
In [1]: course = "ANLY502"

In [2]:
iPython gives you tab completion. To see all of the different methods that the variable course implements, type course. and then hit the Tab key:
In [2]: course.
course.capitalize    course.endswith      course.index         course.isidentifier
course.casefold      course.expandtabs    course.isalnum       course.islower
course.center        course.find          course.isalpha       course.isnumeric
course.count         course.format        course.isdecimal     course.isprintable
course.encode        course.format_map    course.isdigit       course.isspace
iPython also provides parenthesis matching, syntax checking, and other features.
Finally, ipython gives you an easy way to refer to the output of any command. The array In holds every string that was provided as input, while the array Out holds every output. For example, we can capitalize course and then concatenate the result with itself:
In [1]: course = "ANLY502"

In [2]: course.capitalize()
Out[2]: 'Anly502'

In [3]: Out[2] + Out[2]
Out[3]: 'Anly502Anly502'

In [4]: In[1]
Out[4]: 'course = "ANLY502"'

In [5]:
iPython is a complete shell. You can execute commands such as ls, pwd, cd, and more. It also has powerful search, logging, and command alias facilities. You can get a list of its built-in commands by typing:
In [14]: %quickref
IPython -- An enhanced Interactive Python - Quick Reference Card
================================================================

obj?, obj??      : Get help, or more help for object (also works as
                   ?obj, ??obj).
?foo.*abc*       : List names in foo containing 'abc' in them.
%magic           : Information about IPython's 'magic' % functions.

Magic functions are prefixed by % or %%, and typically take their arguments
...
Please try all of the examples in this section to make sure that you understand the basics of iPython.
You can learn more about iPython by reading the documentation at http://ipython.readthedocs.io/en/stable/index.html.
Jupyter Notebook on Amazon
Amazon has created a bootstrap action that installs Jupyter notebook on EMR. The custom bootstrap action is at s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh.
Create a Spark cluster with EMR using the s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh bootstrap action. FOR NOW, DO NOT SPECIFY ANY OPTIONAL ARGUMENTS FOR THE BOOTSTRAP ACTION.
Amazon's install-jupyter-emr5.sh bootstrap script causes an EMR 5 server to install all of the necessary programs for Jupyter and to start a Jupyter notebook running on port 8888 of the local computer. You connect to the notebook with a web browser. No authentication is required, which means that anyone who can connect to port 8888 on your EMR server can run commands in the Jupyter notebook. By default your server blocks incoming connections to port 8888. DO NOT OPEN THIS PORT! Instead, we will use the ssh command to forward port 8888 of your local computer to port 8888 of the EMR server.
Log into the cluster with ssh from your laptop. When you log in, add the -L 8888:localhost:8888 option to your ssh command line. This option causes the ssh command on your local computer to listen on port 8888, accept connections on that port, and forward them to localhost:8888 on the remote server.
For example, if your server is running at ec2-52-87-152-232.compute-1.amazonaws.com, you would execute the following commands on your laptop:
$ ssh-add
$ ssh -L8888:localhost:8888 hadoop@ec2-52-87-152-232.compute-1.amazonaws.com
...
[hadoop@ip-172-31-46-229 ~]$
Now, open a web browser and go to http://localhost:8888. You should see the Jupyter notebook's default file browser:
<img src="images/page1.png">
Click New and select Python3:
<img src="images/page2.png">
Let's make a simple sin(x) plot with matplotlib. First, in In [1]: enter the following code and then execute the cell (option-return on a Mac):
import pylab
import numpy as np
You'll see something that looks like this:
<img src="images/page3.png">
Now we will specify that we want the plot to be inline (that is, shown in the Jupyter notebook), and then plot y = sin(x):
%pylab inline
x = np.linspace(0, 20, 1000)   # 1000 evenly spaced values from 0 to 20
y = np.sin(x)
pylab.plot(x, y)
Which produces:
<img src="images/page4.png">
Please try all of the examples in this section to make sure that you can run Jupyter notebook on AWS and log into it.
Running Spark
Now it's time to explore Spark. If you look online, you'll see that most of the Spark examples still use Python version 2. That's unfortunate, because the world is moving to Python 3. Recall that we are using Python 3 exclusively in this course.
Start up a copy of Python3 connected to Spark. There are three ways to do this, and we will try them all below.
Method 1: pyspark
The pyspark command runs a copy of the Python interpreter that's connected to the Spark runtime. By default pyspark uses Python 2. To run it with Python version 3, use the command PYSPARK_PYTHON=python3 pyspark, as below:
[hadoop@ip-172-31-46-229 ~]$ PYSPARK_PYTHON=python3 pyspark
Python 3.4.3 (default, Sep  1 2016, 23:33:38)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/19 16:59:20 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/19 16:59:37 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.
To verify the version of Spark you are using, examine the variable sc.version:
>>> sc.version
'2.1.0'
In this course you normally won't run pyspark interactively. But you will use the pyspark command to run a Spark program as a batch script and store the results in a file.
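For reference, a batch run might look something like the sketch below. The file names here are made up for illustration, and the launcher shown is spark-submit (installed alongside pyspark on EMR); your assignments will spell out the exact invocation we expect.

# save as count_lines.py and run with, for example:
#   PYSPARK_PYTHON=python3 spark-submit count_lines.py > results.txt
from pyspark import SparkContext

sc = SparkContext(appName="count-lines")      # batch scripts create their own context
rdd = sc.textFile("some-input.txt")           # hypothetical input path in HDFS or S3
print(rdd.count())                            # whatever you print ends up in results.txt
sc.stop()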
Method 2: iPython
You can use iPython with Spark directly, without using the Jupyter notebook. To do this, you simply run pyspark, specifying that the driver program should be ipython3:
[hadoop@ip-172-31-46-229 ~]$ PYSPARK_PYTHON=python3.4 PYSPARK_DRIVER_PYTHON=ipython3 pyspark
Python 3.4.3 (default, Sep  1 2016, 23:33:38)
Type "copyright", "credits" or "license" for more information.

IPython 5.2.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/19 17:15:10 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/19 17:15:25 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

In [1]:
You can demonstrate that Spark is running by displaying the appName of the Spark context:
In [1]: sc.appName
Out[1]: 'PySparkShell'

In [2]:
Working with pyspark in Jupyter Notebook
To use Spark and Jupyter Notebook together, we need to arrange for the Jupyter kernel to talk to a SparkContext. The modern way to do this is with Apache Toree. Toree uses the iPython protocol to form a connection between Spark and Jupyter.
Fortunately, Amazon's bootstrap script includes support for Toree: all you need to do is add the --toree option as a bootstrap argument.
Terminate your cluster, clone it, and add the --toree bootstrap option, as indicated below:
<img src="images/page5.png">
Start up the cluster.
SSH into the cluster with port forwarding enabled.
Open your web browser and connect to http://localhost:8888.
Click on new and select Apache Toree PySpark as below:
<img src="images/page6.png">
You can verify that you are connected to Spark by evaluating sc.appName:
<img src="images/page7.png">
Please try all of the examples in this section to make sure that you can run Spark with pyspark, iPython, and Jupyter Notebook.
NOTE: Currently the Amazon bootstrap script only has Apache Toree PySpark working with Python 2, not with Python 3.
Working with Alexa
Amazon maintains a list of the top 1 million Internet sites by traffic at the URL http://s3.amazonaws.com/alexa-static/top-1m.csv.zip.
In this section you will:
Download the file
Make an RDD where each record is a (rank, site) tuple.
Determine the representation of top-level domains (TLDs) in the top 10,000 websites. Example TLDs are .com, .edu, and .cn; the first two are also called generic top-level domains (gTLDs).
Build a function that takes a domain and fetches its home page.
Determine the prevalence of Google Analytics JavaScript on the top 1,000 domains.
Scale up the cluster, and determine the prevalence of Google Analytics on the top 100,000 domains.
Determine which of the websites are using the same Google Analytics account (which implies that they are run by the same organization).
Part 1: Download the list of hosts and Demo
Start up an EMR server running Spark and iPython, and download the file http://s3.amazonaws.com/alexa-static/top-1m.csv.zip using wget.
Unzip the file.
Put the file top-1m.csv into HDFS with the command:
hdfs dfs -put top-1m.csv top-1m.csv
Verify that the file is there with hdfs dfs -ls.
Run iPython. Make an RDD called top1m that contains the contents of the file:
In [1]: top1m = sc.textFile("top-1m.csv")
There is one element in the RDD for each line in the file. The .count() method will compute how many lines are in the file. Run it.
In the file q1.py place the Python expression that you evaluated to determine the number of lines in the file.
In the file q1.txt place the answer.
ANSWER
Your file q1.py should contain a single line:
top1m.count()
Your file q1.txt should also contain a single line:
1000000
Part 2: Count the .com domains!
How many of the websites in this RDD are in the .com domain?
Place the Python expression in q2.py and the answer in q2.txt.
Part 3: Histogram the TLDs
What is the distribution of TLDs in the top 1 million websites? We can quickly compute this using the RDD method countByValue().
Write a function in Python called tld that takes a domain name string and outputs the top-level domain (a minimal illustrative sketch appears at the end of this part). Save this program in a file called tld.py. We have provided you with a py.test test program called tldtest.py that will test it.
Map the top1m RDD using tld into a new RDD called tlds.
Evaluate top1m.first() and tlds.first() to see if the first line of top1m, transformed by tld, is properly represented as the first line of tlds.
Look at the first 50 elements of top1m by evaluating top1m.take(50). Try the same thing with tlds to make sure that the first 50 lines were properly transformed.
At this point, tlds.countByValue() would give us a list of each TLD and the number of times that it appears in the top1m file. Note that this function returns the results as a defaultdict on the master node, not as an RDD. But we want it inverse-sorted by popularity! To do this, we can set a variable called tlds_and_counts equal to tlds.countByValue() and then reverse the order, sort, and take the top 50, like this:
tlds_and_counts = tlds.countByValue()
counts_and_tlds = [(count, domain) for (domain, count) in tlds_and_counts.items()]
counts_and_tlds.sort(reverse=True)
counts_and_tlds[0:50]
Store the results of counts_and_tlds in a file called q3counts.txt. You can do it with this Python:
open("q3counts.txt", "w").write(str(counts_and_tlds[0:50]))
Question: top1m.collect()[0:50] and top1m.take(50) produce the same result. Which one is more efficient, and why? Put your answer in the file q3.txt.
Note: you will need to install py.test on your server using:
sudo pip-2.7 install pytest
sudo pip-3.4 install pytest
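As promised above, here is a minimal illustrative sketch of tld(); it is one reasonable reading of "top-level domain", and the provided tldtest.py tests are the real definition of the expected behavior:

def tld(domain):
    """Return the top-level domain of a domain name: tld('google.com') returns 'com'."""
    return domain.rsplit(".", 1)[-1]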
Part 4: Get a web page
Here is a simple function, which works in both Python 2 and Python 3, to get a web page. It gets the page using the program curl running as a subprocess.
def get_url(url):
    from subprocess import Popen, PIPE
    return Popen(['curl', '-s', url], stdout=PIPE).communicate()[0].decode('utf-8', 'ignore')
This assumes that the web page is in ASCII or UTF-8. If it is not, errors are ignored.
Now write a new function called google_analytics that returns False if a page does not use Google Analytics and True if it does. How do you know if a page uses Google Analytics? According to Google's web page on the topic, you will see a reference to either analytics.js or ga.js.
Store this function in the file analytics.py.
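A minimal sketch, assuming the get_url() helper above; note that plain substring matching is a simplification and could over-count pages that merely mention these file names:

def google_analytics(url):
    """Return True if the page at url appears to load Google Analytics."""
    page = get_url(url)
    return ("analytics.js" in page) or ("ga.js" in page)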
NOTE: DO NOT FETCH ALL 1M web pages!
Instead, create a new RDD called top1k that includes the top 1000 pages:
top1k = top1m.take(1000)
Now, devise an expression that will report which of these top1k web pages obviously use Google Analytics. Store your expression in top1kanalytics.py and the list in top1kanalytics.txt.
What to turn in
The files to turn in are indicated in the Makefile, and listed below:
File Contents
mtrfix.py Program that turns text-formatted MTR records into tidy data
q1.py Python expression to evaluate on Spark
q1.txt Python expression results, when evaluated on Spark
q2.py Python expression to evaluate on Spark
q2.txt Python expression results, when evaluated on Spark
q3.txt Which is more efficient, and why?
q3counts.txt The top 50 domains and their counts
tld.py Python function
analytics.py Python function
top1kanalytics.py Python expression
top1kanalytics.txt Python expression results
For more Information
If you want more information, you can read and work through the AWS Big Data Blog entry "Run Jupyter Notebook and JupyterHub on Amazon EMR". But remember, this blog entry is not a tutorial, so it may take some work to get the notebook and examples operational.