TEMASEK POLYTECHNIC
SCHOOL OF INFORMATICS & IT
DIPLOMA IN BIG DATA MANAGEMENT & GOVERNANCE
AY 2018/2019 Oct Semester
Programming for Big Data CIG2C02
Assignment (70%)
Table of Contents
1. Instructions
2. Submission
3. Guidelines on Late Submissions
4. Assignment Instruction
4.1 Part1 – Group work (40 marks)
4.2 Part2 – Individual work (100 marks)
5. Instructions for submission
5.1 Assignment Part 1 Submission
5.2 Assignment Part 2 Submission
Annex A – Cover Page Sample for Group Submission
Annex B – Declaration of Work of Originality for Group submission
Annex C – Cover Page Sample for Individual Submission
Annex D – Declaration of Work of Originality for Individual submission
Programming for Big Data Assignment Specifications
1. Instructions
This assignment comprises TWO (2) parts. The marks for both parts are added to form the final grade of your assignment. The assignment carries a total of 140 marks and contributes 70% of the assessment in the subject.
Part 1 is group work. There are TWO (2) questions. Students will be grouped in teams of 4 or 5 members each. Each group is expected to submit the source code and document the process of the group work.
Part 2 is individual work. There are THREE (3) questions.
2. Submission
Tasks | Submissions | Mode of submissions | Mark allocation (140 marks) | Due Date |
Part 1 – Group work | Refer to the section, Instructions for Submission, for more details | Submit softcopy online through LMS (one per group) | 40 | Week 12, 6th Jan 2019, before 11:59 pm |
Part 2 – Individual work | Refer to the section, Instructions for Submission, for more details | Submit softcopy online through LMS (individually) | 100 | Week 16, 3rd Feb 2019, before 11:59 pm |
3. Guidelines on Late Submissions
- Late and < 1 day: 10% deduction from the absolute mark given for the submitted components, e.g. 75 marks (100 marks max) becomes 65 marks (deduct 10% of 100 marks)
- Late >=1 and <2 days: 20% deduction from the absolute mark
- Late >=2 days: 0 marks awarded
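The late-submission rules above can be sketched as a small function. This is only an illustration: the function name and parameters are hypothetical, lateness is assumed to be measured in fractional days, the 20% tier is assumed to use the same basis as the 10% example (a percentage of the maximum marks), and the result is assumed never to drop below zero.

```python
def apply_late_penalty(raw_mark, max_mark, days_late):
    """Return the mark after the late-submission deduction.

    Assumptions (not stated in the guidelines): days_late is a
    fractional number of days, both tiers deduct a percentage of
    max_mark, and the final mark is floored at zero.
    """
    if days_late <= 0:
        return float(raw_mark)          # submitted on time
    if days_late < 1:
        deduction = max_mark * 10 / 100  # late and < 1 day
    elif days_late < 2:
        deduction = max_mark * 20 / 100  # late >= 1 and < 2 days
    else:
        return 0.0                       # late >= 2 days
    return max(0.0, raw_mark - deduction)


# The worked example from the guidelines: 75 out of 100, less than a day late.
print(apply_late_penalty(75, 100, 0.5))
```

The first call reproduces the example in the guidelines, where 75 marks out of 100 becomes 65.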
4. Assignment Instruction
4.1 Part1 – Group work (40 marks)
Students will be grouped in teams of 4 or 5 members each to work on Part 1. This part consists of 2 questions.
Question 1 (20 marks)
You are expected to write Python code to solve this question.
Please use the BeautifulSoup and requests libraries to extract the required information from the following website.
- Group 1 – extract the weather information in Saint Louis: https://forecast.weather.gov/MapClick.php?lon=-90.42388558387756&lat=39.30029918615028#.W5YTiegzY2w
- Group 2 – extract the weather information in Philadelphia: https://forecast.weather.gov/MapClick.php?lon=-76.09084725379945&lat=38.75408257856023#.W5YT3OgzY2w
- Group 3 – extract the weather information in Peachtree City: https://forecast.weather.gov/MapClick.php?lon=-83.93897965550421&lat=33.67559251939777#.W5YUDugzY2w
- Group 4 – extract the weather information in Wilmington: https://forecast.weather.gov/MapClick.php?lon=-79.71173280850051&lat=34.21028792635836#.W5YUW-gzY2w
- Group 5 – extract the weather information in San Francisco Bay Area: https://forecast.weather.gov/MapClick.php?lon=-121.85111006005164&lat=37.353387236461344#.W5YUhExuJfE
You are requested to
- Download the web page containing the weather forecast information
- Create a BeautifulSoup object to parse the page
- Find the div with id seven-day-forecast, and assign it to seven_day
- Inside seven_day, find each individual forecast item with class tombstone-container
- Extract and print at least the first weather forecast item
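The steps above could be approached roughly as follows. This is a minimal sketch, not a model answer: it assumes the forecast page still exposes a div with id seven-day-forecast whose items carry the class tombstone-container (the spelling used in the sample execution). The names download_page, forecast_items and SAMPLE_HTML are illustrative, and SAMPLE_HTML is a shortened stand-in modelled on the sample execution, not real downloaded data.

```python
from bs4 import BeautifulSoup


def download_page(url):
    """Download the web page containing the forecast and return its HTML."""
    import requests  # imported here so the parsing helper also works offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def forecast_items(html):
    """Return every forecast item inside the seven-day-forecast div."""
    # Create a BeautifulSoup object to parse the page
    soup = BeautifulSoup(html, "html.parser")
    # Find the div with id seven-day-forecast
    seven_day = soup.find(id="seven-day-forecast")
    if seven_day is None:
        return []
    # Inside seven_day, find each individual forecast item
    return seven_day.find_all(class_="tombstone-container")


# Hypothetical, shortened markup used in place of a live download.
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Chance Showers</p>
    <p class="temp temp-low">Low: 67 °F</p>
  </div>
</div>
"""

items = forecast_items(SAMPLE_HTML)
first = items[0]
# Extract and print the first weather forecast item
print(first.prettify())
```

To run it against a live page, pass the result of download_page(url) to forecast_items instead of SAMPLE_HTML, where url is the address assigned to your group.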
The following are expected in your program:
- Python file must be saved as ‘GroupNumber_City.py’.
o For example, for Group 1: ‘Group1_Saint_Louis.py’.
- The code must contain proper and meaningful comments. Example:
o Download the web page containing the forecast
o Create a BeautifulSoup object to parse the page
Deliverables
- Source code (.py script)
- Word document with:
o Screenshots of the output
Sample execution
• Here is a sample execution of a solution, the weather forecast information of Washington DC:
>>>
Please type weather website for this city: http://forecast.weather.gov/MapClick.php?lat=38.8951&lon=-77.0364#.WQKqCfl96po
<div class="tombstone-container">
<p class="period-name">
Tonight
<br>
<br/>
</br>
</p>
<p>
<img alt="Tonight: A chance of showers before midnight, then a chance of showers and thunderstorms between midnight and 2am, then a chance of showers after 2am. Mostly cloudy, with a low around 67. South wind 5 to 11 mph. Chance of precipitation is 50%." class="forecast-icon" src="newimages/medium/nshra50.png" title="Tonight: A chance of showers before midnight, then a chance of showers and thunderstorms between midnight and 2am, then a chance of showers after 2am. Mostly cloudy, with a low around 67. South wind 5 to 11 mph. Chance of precipitation is 50%."/>
</p>
<p class="short-desc">
Chance
<br>
Showers
</br>
</p>
<p class="temp temp-low">
Low: 67 °F
</p>
</div>
>>>
Question 2 (20 marks)
You are expected to write R code to solve this question. Please use the following data sets provided in LMS.
- Group 1 – resale-flat-prices_2011.csv
- Group 2 – resale-flat-prices_2012.csv
- Group 3 – resale-flat-prices_2013.csv
- Group 4 – resale-flat-prices_2014.csv
- Group 5 – resale-flat-prices_2015.csv
Source: https://data.gov.sg/dataset/resale-flat-prices
Do the following with this data:
- Count the number of rows and columns of the dataset
o The output should be displayed as:
▪ ‘Number of rows:’
▪ ‘Number of columns:’
- Subset the first 10 rows from the data set
- Calculate the mean of the ‘resale_price’ and display it as: ‘The mean of the resale_price of houses in 2011 (depends on the year provided to each group):’
- Calculate the median value of the ‘resale_price’ and display it as: ‘The median resale price of houses in 2011 (depends on the year provided to each group):’
- Display the unique values in the column ‘Town’ found in the dataset.
- Filter data for any one town found above and write to a text file named ‘GroupNumber_townname_year.txt’. Example: ‘Group1_BEDOK_2011.txt’
- All the code must be saved in ‘GroupNumber_year.R’ file. Example: ‘Group1_2011.R’
Deliverables
- Source code (R script, .r)
- Output file (.txt)
- Word document with command and screenshot of output of each command
4.2 Part2 – Individual work (100 marks)
You are required to solve Part 2 individually and independently. There are THREE (3) questions.
Question 3 (15 marks)
Students are required to implement a MapReduce job to read the text file provided and compute the average length of all words that start with each character. Follow the steps to execute the job:
- Download input text file from LMS in Cloudera VM
- Copy the input file to local file system: /home/cloudera/
- Create a HDFS directory: /user/cloudera/my_assign
- Copy the input text file from local file system to HDFS: /user/cloudera/my_assign
- Execute the MapReduce job to compute the average length of all words that start with each character in the input file
- View the contents of the output in HDFS: /user/cloudera/my_avglength.
- Give any TEN words and the length of the words from the output file.
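The logic the job is expected to compute can be illustrated with a local, in-memory simulation of the map, shuffle and reduce phases. This is only a sketch and not a Hadoop job: the function names are hypothetical, words are assumed to be runs of word characters, and the first character is assumed to be case-sensitive (so "N" and "n" are separate keys). The question does not specify either choice.

```python
import re
from collections import defaultdict


def map_line(line):
    """Map step: emit (first character, word length) for every word."""
    for word in re.findall(r"\w+", line):
        yield word[0], len(word)


def reduce_lengths(key, lengths):
    """Reduce step: average the word lengths collected for one key."""
    lengths = list(lengths)
    return key, sum(lengths) / len(lengths)


def average_word_length(lines):
    """Run map, shuffle (group by key) and reduce locally, in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for key, length in map_line(line):
            grouped[key].append(length)
    return dict(reduce_lengths(k, v) for k, v in sorted(grouped.items()))


# Two made-up input lines, used only to exercise the functions.
sample = ["No now is definitely not the time", "Now is the time"]
result = average_word_length(sample)
for character, average in result.items():
    print(character, average)
```

On a cluster, the grouping done here by the dictionary is what the framework's shuffle phase provides between the mapper and the reducer.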
Deliverables
• Word document with each command of the MapReduce Job and
o Screenshots of:
▪ Execution of each command of the MapReduce job
▪ Output of each command of the MapReduce job
• Screenshot of the output file in the directory
• Screenshot of any TEN words and the length of the TEN words from the output file
Question 4 (25 marks)
You are required to import tables from a relational database into HDFS using Sqoop for this question.
- Download world.sql from LMS in Cloudera Virtual Machine
- Put world.sql in a local directory, for example: /home/cloudera/Downloads
- Log in to MySQL as root (password is cloudera), create a database and name it YourAdminNumber_world (e.g. 1709999z_world) in MySQL.
- Load all the tables from world.sql into MySQL.
- Import all the tables into HDFS from YourAdminNumber_world database you had created in MySQL with Sqoop
- List all the tables in HDFS
- List the contents of all 3 tables in YourAdminNumber_world database in HDFS
Deliverables
• Word document with :
o Screenshots of the tables in MySQL
o Screenshots of command used to import the database as well as listing of tables
o Screenshots of the commands and output in HDFS for 3 tables
o Screenshot of contents of the 3 tables in HDFS
Question 5 (50 marks)
Students are required to solve the question using Pig.
A file is provided. The description of the text file is as follows:
The sample insurance file contains 36,634 records in Florida for 2012 from a sample company that implemented an aggressive growth plan in 2012. There are total insured value (TIV) columns containing TIV from 2011 and 2012, so this dataset is great for testing out the comparison feature. This file has address information that you can choose to geocode, or you can use the existing latitude/longitude in the file.
• Create a Pig script named YourAdminNumber_FL.pig (for example, 1709999Z_FL.pig) and run it to execute the following steps one by one:
o Read the data from the file
o Design the table with the correct data types using Pig statement
• The script should include FOUR (4) statements. You may use some of the keywords listed below (this does not include the LOAD statement):
- Filter
- Foreach….generate
- Taking in ‘command-line’ parameters
- ORDER
- GROUP
- And others
- Other functions and operations (mentioned in lecture notes)
o Avoid using the same syntax for all of your queries.
o Each of the statements generated must be in the document with:
- Pig statement
- Explanation of WHAT the statement does
- Result of the statement
o The statements will be graded based on the complexity (number of features used, etc.) and usefulness. Statements with an empty result set will not be awarded any marks.
• Show the screenshot of the script in .pig file
Deliverables
- Word document with:
o Command used in each step, from the creation of the .pig file to execution of the file
o Each statement used in Pig and its explanation
o Screenshots of the following:
- Each statement in .pig file
- Execution of the statement
- Output of the statement
- The softcopy of .pig file.
Assignment demo (10 marks)
The assignment demo is for the lab tutor to check your understanding of Questions 3, 4 and 5. You will present your work to the tutor. Your tutor may ask you to execute commands or make modifications to the statements during the demo. You are required to demonstrate the following using Cloudera VM:
– MapReduce
– Sqoop
– Pig
5. Instructions for submission
5.1 Assignment Part 1 Submission
Soft Copy
• Each group shall submit the following for Part 1 in LMS:
o Cover page (refer to Annex A)
o Declaration of work of originality (refer to Annex B)
o Source code for Question 1 – Python, with explanation of the logic used
▪ Screenshots of output
o Source code for Question 2 – R, with explanation of the logic used
▪ Screenshots of output
o Python script of Question 1 (.py)
o R script of Question 2 (.R)
5.2 Assignment Part 2 Submission
Soft Copy
• Each student must submit a soft copy of the following report in LMS, individually.
o Cover page (refer to Annex C)
o Declaration of work of originality (refer to Annex D)
o MapReduce deliverables (see Question 3)
o Sqoop deliverables (See Question 4)
o Pig deliverables(See Question 5)
***** END OF ASSIGNMENT *****
Annex A – Cover Page Sample for Group Submission
Diploma in Big Data Management & Governance
Programming for Big Data (CIG2C02)
AY2018/2019 Oct Semester
Submitted by
Class: _____________________________________
Tutor: _____________________________________
Group: _____________________________________
Student Number
Student Name
Date of Submission __________________________
Annex B – Declaration of Work of Originality for Group submission
Diploma in Big Data Management & Governance
Programming for Big Data (CIG2C02)
AY2018/2019 Oct Semester
Assignment
Practical Class: PXX
Group: GYY
Submitted by:
<Matric Number> <Full Name of member #1>
<Matric Number> <Full Name of member #2>
<Matric Number> <Full Name of member #3>
<Matric Number> <Full Name of member #4>
<Matric Number> <Full Name of member #5>
Date: dd/mm/yyyy
“By submitting this work, I am / we are declaring that I am / we are the originator(s) of this work and that all other original sources used in this work have been appropriately acknowledged.
I / We understand that plagiarism is the act of taking and using the whole or any part of another person’s work and presenting it as my/ our own without proper acknowledgement.
I / We also understand that plagiarism is an academic offence and that disciplinary action will be taken for plagiarism.”
NAME AND SIGNATURE OF STUDENT: ……………………………………
NAME AND SIGNATURE OF STUDENT: ……………………………………
NAME AND SIGNATURE OF STUDENT: ……………………………………
NAME AND SIGNATURE OF STUDENT: ……………………………………
NAME AND SIGNATURE OF STUDENT: ……………………………………
*Where PXX is the practical class number and GYY is the project team number which your lab instructor will assign to each team
Annex C – Cover Page Sample for Individual Submission
Diploma in Big Data Management & Governance
Programming for Big Data (CIG2C02)
AY2018/2019 Oct Semester
Submitted by
________________________ Student No
________________________ Student Name
Date of Submission
Annex D – Declaration of Work of Originality for Individual submission
Diploma in Big Data Management & Governance
Programming for Big Data (CIG2C02)
AY2018/2019 Oct Semester
Assignment
Submitted by: <Matric number of member> <Full name of member>
Date: <signing date in dd/mm/yyyy format>
“By submitting this work, I am declaring that I am the originator of this work and that all other original sources used in this work have been appropriately acknowledged.
I understand that plagiarism is the act of taking and using the whole or any part of another person’s work and presenting it as my own without proper acknowledgement.
I also understand that plagiarism is an academic offence and that disciplinary action will be taken for plagiarism.”
Name and Signature of student: ……………………………………