Ex. 1 — Using MapReduce to Process Semi-structured Data (3 Points)
Web server logs play a significant role in helping web companies study customer characteristics and behavior. An example line in an Apache log could look like this (it is a single line in the log file, but is broken into two in this description to fit the document).
169.122.23.15 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
Among other things, such a line reveals the geographical location of the customer (derived from the IP address), the type of web browser, and the operating system that they use. Extracting such details from the logs is therefore a valuable activity for many organizations.
You can assume that the following high-level library functions are available to help you parse individual lines of the log file and extract specific information.
extract_IP(linetext) // gives 169.122.23.15
extract_url(linetext) // gives /apache_pb.gif
extract_browser(linetext) // gives Mozilla/4.08
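To make the behavior of these helpers concrete, here is a minimal Python sketch of how they might be written. The assignment does not say how the library implements them, so the regular expression and the helper `_field` are assumptions made for illustration only; the sample line is the Apache log line shown above.

```python
import re

# Hypothetical stand-ins for the library functions named in the handout.
# The pattern assumes the Apache "combined" log layout of the sample line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '          # IP, identity, user, timestamp
    r'"\S+ (?P<url>\S+) [^"]*" '                 # request: method, URL, protocol
    r'\d{3} \S+ "[^"]*" '                        # status, size, referrer
    r'"(?P<browser>[^\s"]+)'                     # first token of the user agent
)

def _field(linetext, name):
    """Return the named field of a log line, or None if the line is malformed."""
    match = LOG_PATTERN.match(linetext)
    return match.group(name) if match else None

def extract_IP(linetext):
    return _field(linetext, "ip")

def extract_url(linetext):
    return _field(linetext, "url")

def extract_browser(linetext):
    return _field(linetext, "browser")

SAMPLE = (
    '169.122.23.15 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

print(extract_IP(SAMPLE), extract_url(SAMPLE), extract_browser(SAMPLE))
```

Note that the browser value includes the version number, which matches how the guidelines define the term "browser".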
(a) Write a MapReduce workflow that produces a report of the number of users for each browser. Assume that the input to the mapper is the line number (key) and the text of that line (value). For simplicity, assume that an IP address uniquely identifies a user. Include only those browsers with more than 100 users.
(b) Once you have developed your workflow, give an example of what the input and output of each mapper and reducer in your workflow would look like. Writing this down as you develop your design may help you notice and fix any logical mistakes.
(c) If you were allowed to use the combine functionality in your solution, could it reduce the amount of I/O? If a combiner is useful, what would its logic be? Would it be identical to the reducer? Justify your position for each decision. (No points will be awarded for this without an explanation.)
Turn in:- Your solution (typed) in assignment3.pdf under a section Q1.
Ex. 2 — Complex uses of MapReduce (5 Points)
Continuing from the previous question,
(a) Write a MapReduce workflow that computes the total number of users who use more than one browser. Your input is the line number in the log file (key) and the text of that line (value).
(b) How many reducer processes are involved in the last reducer step of your MapReduce workflow?
Make sure to include comments and example inputs and outputs for each of the mappers and reducers.
Turn in:- Your solution (typed) in assignment3.pdf under a section Q2.
Ex. 3 — PigLatin – Warmup (0 Points)
The goal of this exercise is to help you verify that you are able to access the MapReduce cluster and execute Pig Latin scripts.
The data set used for all the Pig Latin questions is based on a modified version of the Covid vaccination data set¹ and the population data set² (Fall 2021 snapshot) provided by WHO³, which is already loaded into the HDFS.
• pop.csv is the population data set, which consists of the following fields:
  – country → Name of the country, unique in this data set.
  – population → Population of the country in thousands.
• vaccination-data.csv is the vaccination information as of Fall 2021, which consists of the following fields:
  – country → Name of the country, unique in this data set.
  – iso3 → Country code, unique in this data set.
  – who region → Indication of the WHO region to which the country belongs.
  – persons fully vaccinated → The number of people in the country that are fully vaccinated.
• vaccination-metadata.csv contains some metadata about different vaccines used in various countries and consists of the following fields:
  – iso3 → Country code.
  – vaccine name → Combination of product and company name (i.e., the next two fields).
  – product name → The brand name of the vaccine (e.g., Comirnaty).
  – company name → Name of the company developing the vaccine (e.g., Pfizer BioNTech).
You are provided with an example.pig script that contains the necessary LOAD instructions to load the data from HDFS into the schema described above. You should be able to reuse this for the remaining exercises.
It is important that you read through the supporting PigLatinInstructions-vvv.pdf file before starting to write and execute the Pig Latin scripts.
You can either run the script as it is by passing the script as an argument,
$ pig example.pig
or start pig by itself first, without a script argument, and then copy and paste the statements one at a time (for interactive programming).
The example script lists the countries with a population above 100 million along with their population (in thousands), ordered by the name of the country.
The script starts by selecting only those records from the data set that have a population above 100 million. It then sorts this data by the name of the country.
In many ways these individual steps are similar to those performed during query evaluation, except that the onus is on the programmer to figure out the order of execution of the steps, instead of the data management system choosing an optimized order.
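The two steps described above (filter first, then sort) can be mirrored in a few lines of plain Python. This is only an in-memory analogy to clarify the order of operations, not the contents of example.pig; the rows are taken from the sample outputs printed in this handout, and the function name `large_countries` is made up for illustration.

```python
# (country, population in thousands); values copied from the sample outputs
# shown in this handout, used here purely as illustrative input.
POP = [
    ("Mexico", 127540),
    ("France", 64721),
    ("Japan", 127749),
    ("Italy", 59430),
]

# 100 million people, expressed in thousands to match the population field.
THRESHOLD_THOUSANDS = 100_000

def large_countries(rows, threshold=THRESHOLD_THOUSANDS):
    """Step 1: keep rows above the threshold. Step 2: order by country name."""
    selected = [row for row in rows if row[1] > threshold]
    return sorted(selected, key=lambda row: row[0])

for row in large_countries(POP):
    print(row)
```

Filtering before sorting means the sort only has to handle the rows that survive the selection, which is the same reasoning a query optimizer would apply on your behalf.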
The script will take a couple of minutes to run and produce a lot of messages. At the end, you should be able to see an output like this (truncated for brevity).
(Japan,127749)
(Mexico,127540)
Turn in:- Nothing.
¹ https://covid19.who.int/info/
² https://www.who.int/data/gho/data/indicators/indicator-details/GHO/population-(in-thousands)
³ https://www.who.int/
Ex. 4 — PigLatin – Vaccination Across Various Regions (3 Points)
Write a PigLatin script that, for each WHO region, outputs the region, the number of countries, and the total number of vaccinated people in those countries. Order the output by region. The name of the script you turn in should be Q4.pig.
Once you have completed the script and are satisfied with its output, execute it in the following way (for submission purposes).
$ pig Q4.pig > Q4.log 2>&1
The result part of your script’s output should follow the example format below as-is (truncated for brevity).
(EMRO,34,72340623)
Make sure that the log file also contains the various messages that pig produces and not just the final results. If this log information from pig is missing, points will be deducted.
Turn in:- Q4.pig and Q4.log
Ex. 5 — PigLatin – Vaccine Suppliers (5 Points)
Write a PigLatin script that lists the companies whose vaccines are used in the largest number of countries. Output the company name and the number of countries it supplies. Order the output so that the companies supplying the most countries are on top, and restrict it to the top 10 records. The name of the script you turn in should be Q5.pig. As in the previous exercise, also produce Q5.log. The result part of your script’s output should follow the example format below as-is (truncated for brevity).
(Moderna,65)
Turn in:- Q5.pig and Q5.log.
Ex. 6 — PigLatin – Vaccination Rate and Vaccine Usage (8 Points)
Write a PigLatin script (Q6.pig) that does the following:
For countries with a population above 100 million, list each country, its population, the percentage of fully vaccinated people, and the number of vaccine brands used in that country. Order the output by decreasing population. As in the previous exercise, also produce Q6.log.
What does the schema look like immediately after you perform the GROUP operation step? Include this under a section Q6 in your assignment3.pdf.
The result part of your script’s output should follow the example format below as-is (truncated for brevity).
(France,64721,65.91931521456,5)
(Italy,59430,68.6257445734,5)
Turn in:- Q6.pig and Q6.log (and the contents in assignment3.pdf).
Guidelines
NO Handwritten / scanned submissions are accepted for this assignment.
The following points pertain to Exercises Ex. 1 and Ex. 2.
• We define the term “browser” to also include its software version, i.e., Mozilla/4.08 is considered a different browser from Mozilla/4.2 although the two differ only in version number.
• Your solutions should have only Mapper and Reducer functions. DO NOT use the Combine functionality.
• When implementing a solution, remember that in some cases you may need more than one MapReduce job to accomplish a task (the output of one job’s Reducer forms the input of the next job’s Mapper).
• Each MapReduce step goes through the disk in order to pass data to the next step in the process. Therefore, come up with a solution that will reduce the number of MapReduce jobs required as well as reduce the amount of data that will have to flow from one part of the MapReduce process to the next one (think of some of the simple concepts we applied for query evaluation).
• You can follow the pseudo-code syntax as was shown in class. Our primary interest is to see if you know how to design the workflows to pick the right key/value for the input/output of mappers and reducers and have an understanding of the internal logic that you should put inside these functions. Please do not write Java code, etc.
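As an illustration of the chaining and data-reduction points above, here is a small in-memory Python simulation of two MapReduce jobs run back to back. It is a sketch, not the pseudo-code syntax shown in class and not a solution to either exercise: the task (counting how many URLs were requested at least twice), the runner `run_job`, and the toy input records are all invented for this example.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one MapReduce job in memory: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):  # sorted only to make the output stable
        output.extend(reducer(out_key, groups[out_key]))
    return output

# Job 1: count requests per URL.
def map_url(line_no, url):
    yield (url, 1)

def reduce_sum(key, values):
    yield (key, sum(values))

# Job 2: count URLs requested at least twice. Filtering in the mapper keeps
# unneeded records from ever being shuffled to the reducer.
def map_popular(url, count):
    if count >= 2:
        yield ("popular_urls", 1)

# Toy input: (line number, URL). Real input would be full log lines.
LINES = [(1, "/a.gif"), (2, "/b.html"), (3, "/a.gif"), (4, "/c.css"), (5, "/b.html")]

per_url = run_job(LINES, map_url, reduce_sum)         # output of job 1 ...
popular = run_job(per_url, map_popular, reduce_sum)   # ... is the input of job 2

print(per_url)
print(popular)
```

Because job 2 emits a single key, all of its values reach one reducer; thinking about which keys each mapper emits is a quick way to see how many reducer processes a step can use.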