Create an EMR cluster with Advanced Options and the following configuration (a rough AWS CLI equivalent is sketched after these steps):
• Select emr-5.23.0 from the drop-down
• Check the boxes for these applications only: Hadoop 2.8.5
• Click Next
• Edit the instance types and set 1 Master node and 4 Core nodes, all m4.large
• Click Next
• Give the cluster a name; you may uncheck logging, debugging, and termination protection
• Click Next
• Select your key-pair
• Click “Create Cluster”
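If you prefer the command line, a roughly equivalent cluster can be created with the AWS CLI. This is only a hedged sketch, not a required step; the cluster name and key-pair name are placeholders, and with --instance-count 5 EMR allocates 1 master and 4 core nodes:
aws emr create-cluster \
  --name "[[your-cluster-name]]" \
  --release-label emr-5.23.0 \
  --applications Name=Hadoop \
  --instance-type m4.large \
  --instance-count 5 \
  --use-default-roles \
  --ec2-attributes KeyName=[[your-key-pair]]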
Once the cluster is in the “Waiting” state (this should only take a few minutes), ssh into the master node with agent forwarding:
ssh-add
ssh -A hadoop@…
Run the following commands, making sure to replace [[your-name]], [[your-email]] and [[your-assignment-repository]] with the appropriate values.
sudo yum install -y git
git config --global user.name [[your-name]]
git config --global user.email [[your-email]]
git clone [[your-assignment-repository]]
cd [[repository-directory]]
GitHub and AWS accounts will be provided.
Problem 1 – The quazyilx scientific instrument (3 points)
For this problem, you will be working with data from the quazyilx instrument. The files you will use contain hypothetical measurements from a scientific instrument called a quazyilx that has been created specially for this class. Every few seconds the quazyilx makes four measurements: fnard, fnok, cark and gnuck. The output looks like this:
YYYY-MM-DDTHH:MM:SSZ fnard:10 fnok:4 cark:2 gnuck:9
(This time format is called ISO-8601, and it has the advantage that it is both unambiguous and sorts properly. The Z stands for Greenwich Mean Time, or GMT, and is sometimes called Zulu Time because the NATO Phonetic Alphabet word for Z is Zulu.)
When one of the measurements is not present, its value is recorded as -1.
The quazyilx has been malfunctioning, and occasionally generates output with a -1 for all four measurements, like this:
2015-12-10T08:40:10Z fnard:-1 fnok:-1 cark:-1 gnuck:-1
There are four different versions of the quazyilx file, each of a different size. As you can see in the output below, the file sizes are 50MB (1,000,000 rows), 4.8GB (100,000,000 rows), 18GB (369,865,098 rows) and 36.7GB (752,981,134 rows). The only difference is the number of records; the file structure is the same.
[hadoop@ip-172-31-1-240 ~]$ hadoop fs -ls s3://bigdatateaching/quazyilx/
Found 4 items
-rw-rw-rw- 1 hadoop hadoop 52443735 2018-01-25 15:37 s3://bigdatateaching/quazyilx/quazyilx0.txt
-rw-rw-rw- 1 hadoop hadoop 5244417004 2018-01-25 15:37 s3://bigdatateaching/quazyilx/quazyilx1.txt
-rw-rw-rw- 1 hadoop hadoop 19397230888 2018-01-25 15:38 s3://bigdatateaching/quazyilx/quazyilx2.txt
-rw-rw-rw- 1 hadoop hadoop 39489364082 2018-01-25 15:41 s3://bigdatateaching/quazyilx/quazyilx3.txt
Your job is to find all of the times when the four instruments malfunctioned together, using grep with Hadoop Streaming.
You will run a Hadoop Streaming job using the 18GB file as input.
Here are the requirements for this Hadoop Streaming job:
• The mapper is the grep command.
• It is a map-only job and must be run as such. (Think about why this is a map-only job.)
You need to issue the command to submit the job with the appropriate parameters. The reference for Hadoop Streaming commands is here.
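As a hedged starting point (not necessarily the exact command you should submit), a map-only grep job could look something like the sketch below. The streaming jar path and the output directory name are assumptions to verify on your cluster; the grep pattern uses . to match the space between fields so the mapper command needs no extra quoting, and stream.non.zero.exit.is.failure=false keeps the job from failing on input splits where grep finds no matches:
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -D mapreduce.job.reduces=0 \
  -D stream.non.zero.exit.is.failure=false \
  -input s3://bigdatateaching/quazyilx/quazyilx2.txt \
  -output quazyilx-output \
  -mapper "grep fnard:-1.fnok:-1.cark:-1.gnuck:-1"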
Paste the command you issued into a text file called hadoop-streaming-command.txt.
Once the Hadoop Streaming job finishes, create a text file called quazyilx-failures.txt with the results, which must be sorted by date and time.
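If your job wrote to the hypothetical quazyilx-output directory used in the sketch above, one way to collect and sort the results (ISO-8601 timestamps sort chronologically as plain text) is:
hadoop fs -cat quazyilx-output/part-* | sort > quazyilx-failures.txt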
The files to be committed to the repository for this problem are hadoop-streaming-command.txt and quazyilx-failures.txt.
Problem 2 – Log file analysis (7 points)
The file s3://bigdatateaching/forensicswiki/2012_logs.txt is a year’s worth of Apache logs for the forensicswiki website. Each line of the log file corresponds to a single HTTP GET request sent to the web server. The log file is in the Combined Log Format.
Your goal in this problem is to report the number of hits for each month. Your final job output should look like this:
2012-01,xxxxxx
2012-02,yyyyyy
…
Where xxxxxx and yyyyyy are replaced by the actual number of hits in each month.
You need to write a Python mapper.py and reducer.py with the following requirements (a rough sketch follows the list):
• You must use regular expressions to parse the logs and extract the date, and cannot hard code any date logic
• Your mapper should read each line of the input file and output a key/value pair in tab-separated format
• Your reducer should tally up the number of hits for each key and output the results in comma-separated format
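Here is a minimal, hedged sketch of what mapper.py and reducer.py could look like. The regular expression and the month handling are assumptions about the Combined Log Format timestamp (e.g. [10/Oct/2012:13:55:36 -0700]) and may need adjusting:

# mapper.py -- reads log lines from stdin and emits "YYYY-MM<TAB>1" for each hit
import re
import sys
import time

# capture the month name and year from the bracketed timestamp field
LOG_DATE = re.compile(r'\[\d{2}/(\w{3})/(\d{4}):')

for line in sys.stdin:
    match = LOG_DATE.search(line)
    if match:
        month_name, year = match.groups()
        month = time.strptime(month_name, '%b').tm_mon
        print('%s-%02d\t%d' % (year, month, 1))

# reducer.py -- tallies the count for each key in the sorted mapper output
# and emits "YYYY-MM,count"
import sys

current_key = None
count = 0
for line in sys.stdin:
    key, _, value = line.rstrip('\n').partition('\t')
    if key != current_key:
        if current_key is not None:
            print('%s,%d' % (current_key, count))
        current_key = key
        count = 0
    count += int(value)
if current_key is not None:
    print('%s,%d' % (current_key, count))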
You need to run the Hadoop Streaming job with the appropriate parameters.
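A hedged sketch of that submission (again, the streaming jar path and output directory name are assumptions):
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
  -files mapper.py,reducer.py \
  -input s3://bigdatateaching/forensicswiki/2012_logs.txt \
  -output forensicswiki-output \
  -mapper "python mapper.py" \
  -reducer "python reducer.py"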
Once the Hadoop Streaming job finishes, create a text file called logfile-counts.csv with the results, which must be sorted by date.
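As in Problem 1, you could collect and sort the final output with something like:
hadoop fs -cat forensicswiki-output/part-* | sort > logfile-counts.csv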
The files to be committed to the repository for this problem are mapper.py, reducer.py and logfile-counts.csv.