ANLY 502 Assignment 3 Version 1.0
Note: Read this assignment online at https:bitbucket.orgANLY502anly5022017springsrcHEADA3?atmaster
In this problem set, you will analyze a network flow dataset that was created for this course.
This dataset is based on the mtr command, a popular open source traceroute command that has an interactive, character-based display. The program performs multiple traceroute operations over time and displays the results. One of the common uses of the mtr command is to diagnose network problems. Take a moment now and review the Wikipedia page for the mtr command.
The mtr command has an option to produce output as a text file. Here is an example of the output:
MTR.0.85;1482973562;OK;www.comcast.com;1;192.168.10.1;1669
MTR.0.85;1482973562;OK;www.comcast.com;2;96.120.104.177;16404
MTR.0.85;1482973562;OK;www.comcast.com;3;68.87.130.233;11504
MTR.0.85;1482973562;OK;www.comcast.com;4;ae530ar01.capitolhghts.md.bad.comcast.net 68.86.204.217;14874
MTR.0.85;1482973562;OK;www.comcast.com;5;be33657cr02.ashburn.va.ibone.comcast.net 68.86.90.57;13030
MTR.0.85;1482973562;OK;www.comcast.com;6;be10102cr01.newark.nj.ibone.comcast.net 68.86.85.162;19641
MTR.0.85;1482973562;OK;www.comcast.com;7;be10203cr02.newyork.ny.ibone.comcast.net 68.86.85.186;18920
MTR.0.85;1482973562;OK;www.comcast.com;8;be10305cr02.350ecermak.il.ibone.comcast.net 68.86.85.202;40477
MTR.0.85;1482973562;OK;www.comcast.com;9;be7922ar02d.northlake.il.ndcchgo.comcast.net 68.86.87.70;40555
MTR.0.85;1482973562;OK;www.comcast.com;10;69.139.178.142;59072
MTR.0.85;1482973562;OK;www.comcast.com;11;???;0
In this example, mtr has been used from a home to the host www.comcast.com. The fields are separated by semicolons. The fields are:
Version number, in this case MTR.0.85
The Unix timestamp of when the mtr command was started.
An OK indicating that the command terminated successfully.
The destination host, in this case www.comcast.com
The number of hops out that the ICMP PING packet traveled before it returned.
If the mtr command can resolve the hostname of the remote system, the line contains the host name of the remote system that the PING packet reached, and its IP address in parentheses. Otherwise, mtr outputs only the IP address. If the packet is not returned, mtr outputs three question marks, i.e. ???.
The number of microseconds that the packet required for the round trip. If the ECHO packet was not received, this number is zero.
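As a minimal sketch of the field layout described above, the following Python splits one line of mtr output into named fields. The function name parse_mtr_line and the dictionary keys are illustrative and are not part of mtr or of the course code. The sample lines carry the destination host as a fourth field and show the hostname and IP address separated by a space, so the sketch accepts the IP address with or without parentheses.

```python
import re

# A hostname followed by an IPv4 address; parentheses around the address are optional.
HOST_RE = re.compile(r"^(?P<name>\S+)\s+\(?(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\)?$")


def parse_mtr_line(line):
    """Split one semicolon-separated mtr output line into named fields."""
    version, timestamp, status, target, hop, host, usec = line.strip().split(";")
    match = HOST_RE.match(host)
    if match:                 # "hostname ip" or "hostname (ip)"
        name, ip = match.group("name"), match.group("ip")
    elif host == "???":       # the packet was not returned
        name, ip = None, None
    else:                     # a bare IP address
        name, ip = None, host
    return {
        "version": version,
        "timestamp": int(timestamp),   # Unix time, in seconds
        "status": status,
        "target": target,
        "hop": int(hop),
        "hostname": name,
        "ip": ip,
        "usec": int(usec),             # 0 when no reply was received
    }


print(parse_mtr_line("MTR.0.85;1482973562;OK;www.comcast.com;1;192.168.10.1;1669"))
```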
From July 1, 2016 through Dec 28, 2016, we ran mtr every minute, generating traceroute results from a Raspberry PI computer to the host www.comcast.com. The Raspberry PI is connected by Ethernet to an Apple Airport Extreme router, which is connected to a Comcast cable modem. Thus, all of the variability between subsequent runs of the mtr command is the result of congestion and other network issues within the Comcast network.
In this assignment, you will develop software to help analyze this dataset. A truism of data science is that it is always easier to collect data than to analyze it. For example, if you scour the Internet you will find many resources for generating traceroute data, but you will find practically no resources for analyzing traceroute data.
We set up the $35 Raspberry PI to collect this data because we observed that the Comcast performance was inconsistent. But why was it inconsistent? Network congestion? Is it worse on different days, or at different times of the day? Is the problem congestion, or is it reliability: are there links that are going down? In this problem set, you'll perform an analysis that can help answer these questions.
Understanding the dataset
The Raspberry PI was configured to run a traceroute every minute from the home network to www.comcast.com. The rationale for choosing www.comcast.com is that the host was likely to be inside the Comcast network, so we weren't looking at problems that might originate elsewhere. We have uploaded the output of the mtr command to s3:guanly502A3mtr.www.comcast.com.2016.txt. This 206MB output file has 2,397,086 lines resulting from a total of 260,499 invocations of the mtr command.
This output is not very large, and you could easily analyze it on your laptop. However, in this class, we ask that you analyze it on Amazon using MRJOB and Hadoop.
Here are some questions that we will try to answer:
How consistent is the Comcast network?
How often are there changes within the Comcast network?
How much redundancy is there within the Comcast network?
Here are some questions we will not be answering:
Are the changes in the Comcast network a response to network problems? You could tell this if the changes happen after a network problem, and then the problem goes away.
Can you distinguish network outages from outages at the end-user location where the Raspberry PI was running? Yes, you can: a power failure will result in a gap between subsequent mtr invocations. We aren't looking for them in this problem set, but you can!
Are we getting the service that we are paying for?
We won't be able to answer all of these questions in A3, but you might think about answering them in a final project. Also, note that analyzing this dataset is a passive analysis project, but for your final project you could carry out your own measurements.
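The remark about power failures showing up as gaps between invocations can be sketched in a few lines. This is only an illustration: the function name find_gaps is hypothetical, the 60-second interval comes from the once-a-minute schedule described earlier, and the tolerance factor of 2 is an arbitrary assumption rather than anything the course specifies.

```python
def find_gaps(timestamps, expected=60, tolerance=2.0):
    """Return (previous, next, seconds) for every gap longer than expected * tolerance.

    timestamps: Unix start times (in seconds) of the mtr invocations.
    """
    ordered = sorted(set(timestamps))          # one entry per invocation
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > expected * tolerance:  # more than two missed minutes by default
            gaps.append((prev, cur, cur - prev))
    return gaps


# Invocations at 0, 60 and 120 seconds, then nothing until 600 seconds.
print(find_gaps([0, 60, 120, 600, 660]))
```

A gap found this way says only that no measurement was recorded; it does not by itself distinguish a power failure from, say, the collection script being stopped.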
Getting Started
As collected, the dataset is hard to analyze, because each time we invoke the mtr command the result may be one or more lines. So the first thing to do is to turn the multi-line format into a new format where each invocation results in a single line. This single line will be easier to analyze.
To help you get started, you will find in the repository a file called A3mtr.www.comcast.com.2016.subset.txt that contains a subset of the dataset, and a program called ingestmtr.py that reads any mtr-formatted text file and outputs complete records. The record consists of a timestamp and one or more quads, all separated by commas, where each quad consists of:
Step number 1..N
Hostname of the remote system or null
IP address of the remote system
The number of microseconds for the response
The splitting of the name returned by mtr into a hostname and IP address is done with a regular expression. You should examine the source code of the ingestmtr.py program to see how this is done.
The first record looks like this:
20160701T00:01:01,1,,192.168.10.1,1798,2,,96.120.104.177,9739,3,,68.87.130.233,11766,4,ae530ar01.capitolhghts.md.bad.comcast.net,68.86.204.217,11203,5,be33657cr02.ashburn.va.ibone.comcast.net,68.86.90.57,14575,6,he0200ar01d.westchester.pa.bo.comcast.net,68.86.94.226,17923,7,bu101ur21d.westchester.pa.bo.comcast.net,68.85.137.213,16070,8,,68.87.29.59,16761
Even with our preprocessing script, this is a challenging assignment because every input line has a potentially different length.
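Because each record is a timestamp followed by a variable number of quads, one way to handle the varying length is to walk the comma-separated fields four at a time. The sketch below assumes exactly the layout shown in the first record; the names Hop and parse_record are illustrative and do not come from ingestmtr.py.

```python
from collections import namedtuple

Hop = namedtuple("Hop", "step hostname ip usec")


def parse_record(line):
    """Return (timestamp, [Hop, ...]) for one comma-separated record."""
    fields = line.strip().split(",")
    timestamp, rest = fields[0], fields[1:]
    if len(rest) % 4 != 0:
        raise ValueError("expected a whole number of quads: %r" % line)
    hops = []
    for i in range(0, len(rest), 4):
        step, hostname, ip, usec = rest[i:i + 4]
        # An empty hostname field means the name could not be resolved.
        hops.append(Hop(int(step), hostname or None, ip, int(usec)))
    return timestamp, hops


timestamp, hops = parse_record(
    "20160701T00:01:01,1,,192.168.10.1,1798,2,,96.120.104.177,9739"
)
print(timestamp, hops)
```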
The dataset represents a series of paths through the Internet at different times. The first four hops of the line above are shown again below:
1,,192.168.10.1,1798,2,,96.120.104.177,9739,3,,68.87.130.233,11766,4,ae530ar01.capitolhghts.md.bad.comcast.net,68.86.204.217,11203,…
This extract represents three hops:
192.168.10.1 -> 96.120.104.177 (9739 usec from 192.168.10.1)
96.120.104.177 -> 68.87.130.233 (11766 usec from 192.168.10.1)
68.87.130.233 -> 68.86.204.217 (11203 usec from 192.168.10.1)
Schematically, we can call this:
ABCD
Where A is 192.168.10.1, B is 96.120.104.177, C is 68.87.130.233 and D is 68.86.204.217.
How is it possible that it took longer to reach C (11766 usec) than D (11203 usec), given that C is closer than D? The answer is that these were separate ICMP ECHO ping packets that were sent out, and the amount of time that it takes for the response to come back is not always the same; it depends on network congestion.
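The step from a list of hops to a list of hop-to-hop pairs can be sketched by pairing each hop with the one that follows it. The function name hop_pairs is illustrative. As in the extract above, the time attached to each pair is the round-trip time reported for its far end, measured from the first host.

```python
def hop_pairs(hops):
    """Yield (from_ip, to_ip, usec) for each pair of consecutive hops.

    hops: list of (step, ip, usec) tuples in step order.
    """
    for (_step1, ip1, _usec1), (_step2, ip2, usec2) in zip(hops, hops[1:]):
        yield ip1, ip2, usec2


hops = [
    (1, "192.168.10.1", 1798),
    (2, "96.120.104.177", 9739),
    (3, "68.87.130.233", 11766),
    (4, "68.86.204.217", 11203),
]
for ip1, ip2, usec in hop_pairs(hops):
    print(ip1, ip2, usec)
```

Four hops produce three pairs, which matches the extract.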
Sometimes you might want to look for a link that goes down and comes back up. Of course, if a link goes down, the traceroute done by mtr will stop. So you might end up seeing data that looks like this:
1: ABCDE
2: ABCDE
3: ABCDE
4: ABCDE
5: ABC;
6: ABC;
7: ABC;
8: ABCDE
9: ABCDE
In this example, the link between C and D went down at time 5 and came back up at time 8. The link between D and E may have been up at times 5, 6 and 7, but it may not have been. We have no way of telling for sure.
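One hedged way to look for this pattern is to remember the most recent complete path and flag any later path that is a strict prefix of it. The function name find_outages and the tuple layout are assumptions made for illustration. The sketch cannot tell a failed link from a genuine route change that happens to shorten the path.

```python
def find_outages(observations):
    """Return (first_down, last_down, last_reached, first_missing) for each outage.

    observations: list of (time, path) in time order, where path is a sequence of hops.
    """
    outages = []
    full = None      # most recent path that was not a truncation
    current = None   # outage in progress, stored as a mutable list
    for time, path in observations:
        path = tuple(path)
        truncated = (
            full is not None
            and len(path) < len(full)
            and full[:len(path)] == path
        )
        if truncated:
            if current is None:
                last_reached = path[-1] if path else None
                current = [time, time, last_reached, full[len(path)]]
            else:
                current[1] = time        # the outage is still going on
        else:
            if current is not None:
                outages.append(tuple(current))
                current = None
            full = path
    if current is not None:              # the data ended during an outage
        outages.append(tuple(current))
    return outages


observations = (
    [(t, "ABCDE") for t in (1, 2, 3, 4)]
    + [(t, "ABC") for t in (5, 6, 7)]
    + [(t, "ABCDE") for t in (8, 9)]
)
print(find_outages(observations))
```

For the example above, the sketch reports a single outage that is first seen at time 5 and last seen at time 7, with C as the last hop reached and D as the first hop missing.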
Note: there are some IP addresses in this dataset that have more than one hostname!
Question 1: Rough Characterization
Start by creating an EMR cluster that has a single Master node and no Core or Task nodes (you won't need them). Do not forget to use the bootstrap code! Log into the Master node and clone the course git repository. We will be working in the A3 directory.
In Question 1 we are asking that you characterize the data. You should answer all of these questions by using ingestmtr.py to process the dataset into single records, storing those records in a file, and then storing the results either in S3 or HDFS.
Start by copying the MTR data in S3 to your local computer and use the ingestmtr.py program to transform the data into a local file. You will use this file as the input to mrjob and will store the results in output files.
We have given you an MRJOB prototype called q1ipaddresses.py. You need to modify this program so that it will compute how many different IP addresses are in the dataset. Store the output of your program in a file called q1ipaddresses.txt. Each line of the file should have the format:
ipaddress\tcount
Where \t is the TAB character. The file does not need to be sorted.
Create a mapper and reducer with MRJOB called q1hostnames.py that computes how many different hostnames are in the dataset. Store the output of your program in a file called q1hostnames.txt. Each line of the file should have the format:
hostname\tcount
The file does not need to be sorted.
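The counting logic for Question 1 can be sketched in plain Python with a mapper and a reducer shaped like the methods of an MRJOB job. This is not the provided q1ipaddresses.py prototype. The run_local helper is a hypothetical stand-in for the shuffle step so that the sketch runs without a cluster, and the record layout is assumed to be the one shown in Getting Started.

```python
from collections import defaultdict


def mapper(_, line):
    """Emit (ip, 1) for every quad in one record line."""
    fields = line.strip().split(",")[1:]       # drop the timestamp
    for i in range(0, len(fields) - 3, 4):
        ip = fields[i + 2]                     # use fields[i + 1] to count hostnames
        if ip:
            yield ip, 1


def reducer(ip, counts):
    """Emit (ip, total number of times the address was seen)."""
    yield ip, sum(counts)


def run_local(lines):
    """Group mapper output by key, then reduce each group (stand-in for Hadoop)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


records = [
    "20160701T00:01:01,1,,192.168.10.1,1798,2,,96.120.104.177,9739",
    "20160701T00:02:01,1,,192.168.10.1,1700",
]
for ip, count in run_local(records).items():
    print("%s\t%d" % (ip, count))
```

The number of different IP addresses is then the number of output lines.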
Question 2: Multiple Names per IP address
Find the IP addresses that have more than one hostname.
Create a program that does this called q2multiple.py and store the results in an output called q2multiple.txt. Your output file should have this format:
ipaddress\thostname1,hostname2
If you feel particularly motivated, you can determine the time for each hostname, and see if the two names overlap in time or if the name changes from one name to the other.
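A possible shape for the Question 2 logic is to key every resolved hostname by its IP address and keep only the addresses that collect more than one distinct name. As before, this is a plain-Python sketch rather than a finished MRJOB job. The run_local helper is hypothetical, and the hostnames and addresses in the sample records are made up for illustration.

```python
from collections import defaultdict


def mapper(_, line):
    """Emit (ip, hostname) for every quad that has a resolved hostname."""
    fields = line.strip().split(",")[1:]       # drop the timestamp
    for i in range(0, len(fields) - 3, 4):
        hostname, ip = fields[i + 1], fields[i + 2]
        if hostname and ip:
            yield ip, hostname


def reducer(ip, hostnames):
    """Emit (ip, 'name1,name2,...') only when the address has several names."""
    names = sorted(set(hostnames))
    if len(names) > 1:
        yield ip, ",".join(names)


def run_local(lines):
    """Group mapper output by key, then reduce each group (stand-in for Hadoop)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


# Made-up records: 192.0.2.7 appears under two different names.
records = [
    "20160701T00:01:01,1,,192.0.2.1,1798,2,a.example.net,192.0.2.7,9739",
    "20160701T00:02:01,1,,192.0.2.1,1700,2,b.example.net,192.0.2.7,9800",
    "20160701T00:03:01,1,,192.0.2.1,1650,2,a.example.net,192.0.2.7,9900",
]
for ip, names in run_local(records).items():
    print("%s\t%s" % (ip, names))
```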
Question 3: Link Analysis
For this question, we define a link as the connection from one host to another host. You can compute the time associated with the first host from the time associated with the second host. So if the highest step number on a line is 5, that line describes 4 links.
Create a link analysis program with MRJOB named q3links.py that will analyze the dataset for every link in the dataset. Run this program and pipe the output into a file q3links.txt that contains, for each link, a line in the following format:
IPADDRESS1IPADDRESS2\tCOUNT
There will be many lines in this file, one for each different link taken by a packet.
Modify q3links.py so that it calculates the standard deviation for the time taken by each hop. Run this program and pipe the output into a file called q3linksdev.txt that has this format:
IPADDRESS1IPADDRESS2\tCOUNT\tSTDDEV
The file does not need to be sorted.
Turn in q3links.py and q3linksdev.txt.
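The per-link count and standard deviation can be sketched as follows, under several assumptions that the assignment text does not settle. The time sample for a link is taken to be the round-trip time reported for its far end. The population standard deviation (statistics.pstdev) is used rather than the sample standard deviation. The run_local helper is a hypothetical stand-in for running under MRJOB.

```python
import statistics
from collections import defaultdict


def mapper(_, line):
    """Emit ((ip1, ip2), usec) for each pair of consecutive quads in a record."""
    fields = line.strip().split(",")[1:]       # drop the timestamp
    quads = [fields[i:i + 4] for i in range(0, len(fields) - 3, 4)]
    for (_s1, _h1, ip1, _u1), (_s2, _h2, ip2, usec) in zip(quads, quads[1:]):
        yield (ip1, ip2), int(usec)


def reducer(link, times):
    """Emit (link, (count, population standard deviation of the times))."""
    times = list(times)
    yield link, (len(times), statistics.pstdev(times))


def run_local(lines):
    """Group mapper output by key, then reduce each group (stand-in for Hadoop)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


# Made-up records with one link observed twice.
records = [
    "20160701T00:01:01,1,,10.0.0.1,100,2,,10.0.0.2,9000",
    "20160701T00:02:01,1,,10.0.0.1,120,2,,10.0.0.2,11000",
]
print(run_local(records))
```

Collecting every time in a list is fine for a sketch. For a large dataset, a reducer that accumulates the count, the sum and the sum of squares would avoid holding all values in memory.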
Question 4: Time Analysis
Finally, let's look at how Comcast's residential distribution network might be impacted by people watching videos in the evenings. That is, we're going to look for the Netflix effect. To do this, we will restrict our analysis to the standard deviation of the link from the 2nd hop to the 3rd hop (Comcast's residential distribution network), and see how it changes for different times of the day.
Create a new program using MRJOB called q4hop23.py that computes the standard deviation between the 2nd and 3rd hop of the dataset by time of day. Generate an output file q4hop23.txt that has this form:
HOUR\tSTDDEV
For example, if the standard deviation was 2.0 between 3pm and 3:59pm and 3.0 between 4pm and 4:59pm, your output file would look like this:
15 2.0
16 3.0
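A hedged sketch of the hour-of-day grouping follows. It assumes that the hour can be read from the two characters after the T in the record timestamp, that the time for the 2nd-to-3rd-hop link is the round-trip time in the quad for step 3, and that a time of zero means no reply and should be skipped. The population standard deviation is an assumption as well, and run_local is a hypothetical stand-in for running under MRJOB.

```python
import statistics
from collections import defaultdict


def mapper(_, line):
    """Emit (hour, usec) for the quad describing the 3rd hop of a record."""
    fields = line.strip().split(",")
    timestamp, rest = fields[0], fields[1:]
    hour = int(timestamp.split("T")[1][:2])    # "…T15:01:01" gives 15
    for i in range(0, len(rest) - 3, 4):
        usec = int(rest[i + 3])
        if rest[i] == "3" and usec > 0:        # step 3 with a reply received
            yield hour, usec


def reducer(hour, times):
    """Emit (hour, population standard deviation of the times in that hour)."""
    yield hour, statistics.pstdev(list(times))


def run_local(lines):
    """Group mapper output by key, then reduce each group (stand-in for Hadoop)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(None, line):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results


# Made-up records: two measurements in the 15:00 hour, one in the 16:00 hour.
records = [
    "20160701T15:01:01,1,,10.0.0.1,10,2,,10.0.0.2,20,3,,10.0.0.3,100",
    "20160701T15:02:01,1,,10.0.0.1,10,2,,10.0.0.2,20,3,,10.0.0.3,104",
    "20160701T16:01:01,1,,10.0.0.1,10,2,,10.0.0.2,20,3,,10.0.0.3,90",
]
for hour, dev in sorted(run_local(records).items()):
    print("%d\t%s" % (hour, dev))
```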
Extra Credit
For extra credit, create a program called q4grapher.py that reads the file q4hop23.txt, and makes a graph of this dataset with matplotlib, saving the result in a file called q4graph.png.
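A minimal sketch of such a grapher is shown below, assuming that each line of q4hop23.txt holds an hour and a standard deviation separated by whitespace. The function names parse_hourly and plot_hourly are illustrative, and the bar chart is only one reasonable choice of plot.

```python
def parse_hourly(lines):
    """Return a sorted list of (hour, stddev) from 'HOUR<TAB>STDDEV' lines."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:                    # skip blank or malformed lines
            rows.append((int(parts[0]), float(parts[1])))
    return sorted(rows)


def plot_hourly(rows, out_path="q4graph.png"):
    """Draw one bar per hour of the day and save the figure as a PNG file."""
    import matplotlib
    matplotlib.use("Agg")                      # render to a file, no display needed
    import matplotlib.pyplot as plt

    hours = [hour for hour, _ in rows]
    devs = [dev for _, dev in rows]
    fig, ax = plt.subplots()
    ax.bar(hours, devs)
    ax.set_xlabel("Hour of day")
    ax.set_ylabel("Standard deviation (usec)")
    ax.set_xticks(range(0, 24))
    fig.savefig(out_path)
    plt.close(fig)


print(parse_hourly(["15\t2.0\n", "16\t3.0\n"]))
```

To produce the graph, the two functions could be combined as `with open("q4hop23.txt") as f: plot_hourly(parse_hourly(f))`.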