Big Data Analytics:
Assignment 1 – Hurricane Sandy and Flickr
Suzy Moat and Tobias Preis
Data Science Lab, Behavioural Science, Warwick Business School, The University of Warwick
http://www.wbs.ac.uk/about/person/suzy-moat/
http://www.wbs.ac.uk/about/person/tobias-preis/
This coursework is due in on Thursday 25th February 2021 by 20:00 UK time, along with Assignment 2. You can gain up to 10% of your final marks for this course on this assignment.
This assignment will help you improve your programming skills and make the most of the learning support we are offering on this course. If all of your functions behave correctly, you will score full marks on this assignment.
However, to give you some reassurance if you encounter any difficulties with Assignment 1, your overall mark for this pair of assignments will be the higher of the two following calculations:
• Your mark for Assignment 1 (10%) + your mark for Assignment 2 (10%)
• Your mark for Assignment 2 (10%) + your mark for Assignment 2 (10%)
In other words, your overall mark will not be lower than your mark for Assignment 2 multiplied by 2.
The exercise is designed to walk you through key parts of the technical work you will do for your final project, so you can make sure you understand the programming concepts you need now, rather than during your final project. The tasks build on what you have learnt in the R seminars, and you can use and edit the code you wrote there to help you solve this assignment.
Please note that later parts of this assignment build on earlier parts – so make sure you get started straight away.
Along with this description of the assignment, we have made two data files available for you – the mysteriously named “3922327258060dat.txt” and “3922327258060doc.txt”. See Part 3 to find out what these are.
In addition, we have provided an R file called “Assignment1Answers.R”. To complete this assignment, you should edit this file, and submit it back to us for assessment. You should not change the name of the file, or submit any other files for assessment. We have marked an area at the top of the file where you can write your student number.
Each task involves you editing a function that is already in the R file. Only edit between the marked lines – do not change the name of the function or the arguments to the function. We need to be able to find functions with these exact names and arguments when we mark your code to give you credit for your work.
At the bottom of the R file, there is also some code that you should not edit (highlighted as such). Please do not change the code at the bottom of the file to solve the assignment.
These points are important. If you do make changes where we have indicated that you should not, you will not get credit for your efforts and your code may not work as you expect when we test it.
You also need to make sure that there are no errors when you type the following into the R console:
source("Assignment1Answers.R")
If this command produces errors, we will not be able to mark your answers for this assignment at all.
With all of those warnings out of the way – we hope this helps you develop and test your skills. Have fun!
Getting started
To start this exercise, load the assignment answers file like this:
source("Assignment1Answers.R")
Goal of your investigation
Humans around the world are uploading increasing amounts of information to social media services such as Twitter, Flickr and Instagram. To what extent can we exploit this information during catastrophic events such as natural disasters, to gather data about changes to our world at a time when good decisions must be reached quickly and effectively?
The subject of your current investigation is Hurricane Sandy, a hurricane that devastated portions of the Caribbean and the Mid-Atlantic and Northeastern United States during late October 2012. As a hurricane approaches, air pressure drops sharply. Your goal is to begin an investigation into whether a relationship might exist between the progression of Hurricane Sandy, as measured by air pressure, and user behaviour on the photo-sharing site Flickr.
If there were a simple relationship between changes in air pressure, and changes in photos taken and then uploaded to Flickr, then perhaps further investigation of these social media data would give insight into problems resulting from a hurricane that are harder to measure using environmental sensors alone. This might include the existence of burst pipes, fires, collapsed trees or damaged property. Such information could be of interest both to policy makers charged with emergency crisis management, and insurance companies too.
Part 1: Acquiring the Flickr data (3%)
Hurricane Sandy, classified as the eighteenth named storm and tenth hurricane of the 2012 Atlantic hurricane season, made landfall near Atlantic City, New Jersey at 00:00 Coordinated Universal Time (UTC) on 30 October 2012.
You have decided to have a look at how Flickr users behaved around this date, from the beginning of 29 October 2012, to the end of 1 November 2012. In particular, you are going to look at data on photos uploaded to Flickr with the text “hurricane sandy”. When were photos with this text taken?
In this task, you want to write code which would let you download hourly counts from Flickr of the number of photos taken and uploaded to Flickr with labels which include the text “hurricane sandy”, for the period 29 October 2012 00:00 to 01 November 2012 23:59.
To solve the first problem of downloading the hourly photo counts, let’s break it down into a few sub-problems:
A) Working out what the URL would be for the first JSON page for one hour
B) Using this URL to retrieve data from Flickr on how many photos were taken in a given hour with the specified text
C) Writing some code to download data for all of the hours you are interested in, using the code from step B.
1A. Building a URL to get one hour’s data on Flickr (1%)
First, we want to work out what URL we need to use to download one JSON page of data on Flickr photos with the text “hurricane sandy” which were taken in a given hour.
To do this, edit the function buildFlickrURLOneHour. This function has two arguments:
– startOfHour: a POSIXct date and time object, to specify the beginning of the hour you would like data for, and
– text: the text which should be attached to the photograph.
You can see an example of the kind of date and time that would be passed to startOfHour by typing
testStartShort
into the R console. This should print
[1] "2012-10-29 01:00:00 UTC"
This is an example date and time which we have defined at the bottom of the file – see if you can find it. It is a POSIXct date and time object, so unlike a Date object, it can store the time as well as the date. Don’t change the code at the bottom of the file! You can, however, create your own test dates too.
You need to change the function to take those two arguments and create the URL which you would use to download one JSON page of data on Flickr photos in an hour, starting at the time specified by startOfHour, with the text attached as specified by text.
You can test this function by typing in:
buildFlickrURLOneHour(startOfHour=testStartShort, text="hurricane sandy")
Importantly however, note that the function should work for other times and text values too (indeed, we will test this).
Hint 1: You wrote a function that was very similar to this in Week 3’s R seminar and extension exercises. You do need to make some changes to that code however.
Hint 2: Make sure you are not using a temporary API key, as this will not work when we mark your code. If your API key is correct (and not temporary), you should be able to see it at this URL:
https://www.flickr.com/services/apps/by/me
If no keys show up at this URL, go back to page 2 of Week 3’s R seminar exercise and follow the instructions to sign up for an API key.
Hint 3: In downloading data, you only need to be concerned about min_taken_date and max_taken_date – you can ignore min_upload_date and max_upload_date.
Hint 4: The “time taken” on a photo is in the photographer’s local time. For the purposes of this exercise, don’t worry about time zones – just use the times that Flickr specifies.
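To make the shape of the solution concrete, here is a minimal sketch of the kind of function you might write. The API key placeholder and the choice to pass the hour boundaries as Unix timestamps (which flickr.photos.search accepts for min_taken_date and max_taken_date) are our assumptions – your seminar code may well structure this differently.

buildFlickrURLOneHour <- function(startOfHour, text) {
  apiKey <- "YOUR_API_KEY_HERE"              # replace with your own, non-temporary key
  endOfHour <- startOfHour + (60 * 60) - 1   # last second of the hour
  paste0("https://api.flickr.com/services/rest/",
         "?method=flickr.photos.search",
         "&api_key=", apiKey,
         "&text=", URLencode(text, reserved = TRUE),   # encode the space in "hurricane sandy"
         "&min_taken_date=", as.numeric(startOfHour),  # Unix timestamps
         "&max_taken_date=", as.numeric(endOfHour),
         "&format=json&nojsoncallback=1")
}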
1B. Downloading one hour of data from Flickr and counting photos (1%)
Excellent – now you have a URL for one hour’s data. Now you need to work out how to get that data into R and extract the photo count.
To do this, edit the function downloadFlickrDataOneHour. Again, this function has two arguments:
– startOfHour: a POSIXct date and time object, to specify the beginning of the hour you would like data for, and
– text: the text which should be attached to the photograph.
You need to change the function to take those two arguments and download one JSON page of data on Flickr photos in an hour, starting at the time specified by startOfHour, with the text attached as specified by text. You then need to work out the total number of photos that were taken in that hour,
from the one page which has been returned. You can use your buildFlickrURLOneHour function to create the URL you need.
To gain credit for your work on this function, you need to return a data frame with one row and two columns. The first column should be called Date, and the second column should be called PhotoCount.
For the one row of data you create:
– the Date column should contain a POSIXct date and time object which specifies the beginning of the hour you have downloaded data for, and
– the PhotoCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text specified by text. Note that this count will be a number, and so data in this column should be numeric (not text).
You can test this function by typing in:
downloadFlickrDataOneHour(startOfHour=testStartShort,
text="hurricane sandy")
but again, note that it should work for other times and text values too.
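As a sketch of the overall shape – assuming, as in the seminar exercises, that the jsonlite package is used for parsing, and that the count sits in photos$total of the returned JSON (check this against a real response) – the function might look like this:

library(jsonlite)  # install.packages("jsonlite") if you don't have it

downloadFlickrDataOneHour <- function(startOfHour, text) {
  url <- buildFlickrURLOneHour(startOfHour = startOfHour, text = text)
  result <- fromJSON(url)                        # download the page and parse the JSON
  photoCount <- as.numeric(result$photos$total)  # Flickr may report the total as text
  data.frame(Date = startOfHour, PhotoCount = photoCount)
}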
1C. Working out photo counts for multiple hours of data from Flickr (1%)
Brilliant – you know how to work out the number of photos taken in one hour. Now you want to write some code that can work out the number of photos taken in each of a sequence of hours, and put it all together in a data frame for you.
To do this, edit the function downloadFlickrDataMultipleHours. This function has three arguments:
– minHourStart: a POSIXct date and time object, to specify the beginning of the first hour you would like to download data for,
– maxHourEnd: a POSIXct date and time object, to specify the end of the last hour you would like to download data for,
– text: the text which should be attached to the photographs.
You need to change the function to take those three arguments, and get counts of the number of Flickr photos with the text specified by text that were taken in each of the hours between minHourStart and maxHourEnd. You can use your downloadFlickrDataOneHour to get the count of photos for each hour you specify.
To gain credit for your work on this function, you need to return a data frame with a row for each hour and two columns. The first column should be called Date, and the second column should be called PhotoCount. For each row,
– the Date column should contain a POSIXct date and time object which specifies the beginning of the hour you have downloaded data for, and
– the PhotoCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text specified by text. Note that these counts will be numbers, and so data in this column should be numeric (not text).
The data frame should be ordered such that the first row has the earliest time, and the last row has the latest time.
To save time in testing, let’s test this function on a short period of time to start with – for example, 01:00:00 UTC to 04:59:59 UTC on 29 October 2012.
We already have the first time saved in testStartShort. At the bottom of the file, we have saved the second time in testEndShort for you. Don’t edit the code at the bottom of the file!
You can test this function by typing in:
downloadFlickrDataMultipleHours(minHourStart=testStartShort,
maxHourEnd=testEndShort,
text="hurricane sandy")
but again, note that it should work for other times and text values too.
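One possible shape for this function is sketched below: generate the start of each hour in the period with seq(), then collect one row per hour with rbind(). Treat this as a sketch, not the only correct solution.

downloadFlickrDataMultipleHours <- function(minHourStart, maxHourEnd, text) {
  # The start of every hour between the two times, earliest first
  hourStarts <- seq(from = minHourStart, to = maxHourEnd, by = "hour")
  allHours <- NULL
  for (i in seq_along(hourStarts)) {
    oneHour <- downloadFlickrDataOneHour(startOfHour = hourStarts[i], text = text)
    allHours <- rbind(allHours, oneHour)  # append this hour's row
  }
  allHours
}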
Part 2: Processing the Flickr data (2%)
The hurricane might not be the only influence on the number of photos people take. Perhaps people take more photos at the weekend or at certain times of day, for example.
We should account for this by finding out how many photos were taken in total during each hour, and using this data to normalise our hourly counts of “hurricane sandy” photographs.
To solve this problem, let’s split this task up into two sub-problems:
A) Downloading counts for both “hurricane sandy” photos and all photos taken for a given sequence of hours
B) Using this data to normalise the “hurricane sandy” counts
2A. Downloading counts for “hurricane sandy” photos and all photos taken (1%)
You’ve already written some code to download counts of photos taken in a given hour with specified text attached to them.
Go back to the Flickr API Explorer for flickr.photos.search:
https://www.flickr.com/services/api/explore/flickr.photos.search
What value of “text” do you need to specify to get data on all photos taken? (Important hint: you don’t need to remove this parameter completely from the URL.)
Use this information to write code to download data on both how many “hurricane sandy” Flickr photos were taken in one hour, and how many photos were taken in that hour in total.
To do this, edit the function downloadAllFlickrData. This function has two arguments:
– minHourStart: a POSIXct date and time object, to specify the beginning of the first hour you would like to download data for,
– maxHourEnd: a POSIXct date and time object, to specify the end of the last hour you would like to download data for.
You need to change the function to take those two arguments, and create a data frame with three columns, and a row for each of the hours between minHourStart and maxHourEnd.
The first column should be called Date, the second column should be called SandyPhotoCount, and the third column should be called AllPhotosCount. For each row,
– the Date column should contain a POSIXct date and time object which specifies the beginning of the hour you have downloaded data for,
– the SandyPhotoCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text “hurricane sandy” attached. Note that this count will be a number, and so data in this column should be numeric (not text).
– the AllPhotosCount column should contain a count of the total number of photos which were taken in that hour and uploaded to Flickr. Note that this count will be a number, and so data in this column should be numeric (not text).
You can use your downloadFlickrDataMultipleHours function to download both the “hurricane sandy” counts and the total counts, if you specify the right values for text. You then need to combine
these counts to make the data frame described above. Data in the photo count columns should again be numeric.
You can test this function by typing in:
downloadAllFlickrData(minHourStart=testStartShort,
maxHourEnd=testEndShort)
but again, note that it should work for other times too.
Once you’ve got this working, save the data for this short period in shortFlickrData as follows, so that you can use it to solve the next task.
shortFlickrData <- downloadAllFlickrData(minHourStart=testStartShort,
maxHourEnd=testEndShort)
Hint 1: The command merge will help you combine the datasets.
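For example – treating the exact text value that matches all photos as something you must confirm in the API Explorer yourself – a sketch along these lines would download both sets of counts, rename the count columns, and merge them by date:

downloadAllFlickrData <- function(minHourStart, maxHourEnd) {
  sandyData <- downloadFlickrDataMultipleHours(minHourStart = minHourStart,
                                               maxHourEnd = maxHourEnd,
                                               text = "hurricane sandy")
  # Assumption: an empty text value matches all photos - verify this yourself!
  allPhotos <- downloadFlickrDataMultipleHours(minHourStart = minHourStart,
                                               maxHourEnd = maxHourEnd,
                                               text = "")
  names(sandyData)[2] <- "SandyPhotoCount"
  names(allPhotos)[2] <- "AllPhotosCount"
  merge(sandyData, allPhotos, by = "Date")
}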
2B. Normalising the "hurricane sandy" photo counts (1%)
You now want to use this data to calculate normalised counts of “hurricane sandy” photographs.
To do this, edit the function normaliseFlickrCounts. This function has one argument:
- allFlickrData: a data frame created by downloadAllFlickrData which has hourly counts of both how many "hurricane sandy" Flickr photos were taken in a given hour, and how many photos were taken in that hour in total.
You can use the shortFlickrData data frame you created above to test this function.
You need to change the function to take this argument, and add an extra column to this data frame called SandyNormalised. For each row, this column should contain the result of dividing the value in SandyPhotoCount by the value in AllPhotosCount.
The function should then return the whole data frame, which should now have four columns, called Date, SandyPhotoCount, AllPhotosCount, and SandyNormalised, and the same number of rows as it had before.
Again, data in the columns SandyPhotoCount, AllPhotosCount, and SandyNormalised will be numbers, and so data in these columns should be numeric (not text).
You can test this function by typing in:
normaliseFlickrCounts(allFlickrData=shortFlickrData)
but again, note that it should work for other data frames created by downloadAllFlickrData too.
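This function is short – a minimal sketch might look like this:

normaliseFlickrCounts <- function(allFlickrData) {
  # Normalised count: "hurricane sandy" photos as a fraction of all photos
  allFlickrData$SandyNormalised <-
    allFlickrData$SandyPhotoCount / allFlickrData$AllPhotosCount
  allFlickrData
}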
If you have got all of this working, then now is the time to try downloading all of the Flickr data you need for the assignment, for the period 29 October 2012 00:00 to 01 November 2012 23:59.
At the bottom of the file, we have saved the time at which this period begins in testStartFull, and the time at which this period ends in testEndFull. Don't edit the code at the bottom of the file!
You can save the full data set by typing this into the console:
fullFlickrData <- downloadAllFlickrData(minHourStart=testStartFull,
maxHourEnd=testEndFull)
If everything is working correctly, then this should take under 5 minutes on a good broadband connection.
You can now add the normalised counts by typing this into the console:
fullFlickrData <- normaliseFlickrCounts(allFlickrData=fullFlickrData)
Well done – you have successfully acquired and processed the Flickr data. Now for the environmental data!
Part 3: Processing the environmental data (1%)
As a hurricane approaches an area, atmospheric pressure falls. We can therefore use data on atmospheric pressure as a measure of the hurricane’s progress.
3A. Accessing data on atmospheric pressure
Hurricane Sandy made landfall very close to Atlantic City in New Jersey.
Data on atmospheric pressure is made available by the National Oceanic and Atmospheric Administration in the US. Here is their website:
https://www.ncdc.noaa.gov/
We have retrieved hourly atmospheric pressure observations for all of you from NOAA’s Atlantic City weather station, from the beginning of 29 October 2012 to the final hour of 1 November 2012 – the same as the Flickr data.
The results of this order are attached to this assignment in exactly the form that NOAA provided them, in the file “3922327258060dat.txt”.
Open this file and have a look at the data.
Hint: What format is this data in?
Save this file in your R working directory.
Hint: Look at previous R seminar sheets or ask Google if you can't remember what your R working directory is.
To use this data in the next tasks, find the variable noaaFilename in your R answers file. Change the value of this variable to the name of the data file.
3B. Reading in the atmospheric pressure data
Now you've got the atmospheric pressure data, you need to read it in. Have another look at the file. The information you require is the date on which each reading was taken, the time at which it was taken, and the atmospheric pressure measurement.
The column headings might be a little tricky to understand. NOAA has provided some documentation to try and help with this. This is in the file “3922327258060doc.txt”. Open this documentation file up and take a look at the guidance NOAA has given you.
Use the documentation file to identify which columns contain the data you require in the data file. You want to read in the data file and return a data frame containing these three columns.
To do this, edit the function readNOAAData. This function has one argument:
- filename: the name of the NOAA data file.
You need to change this function to read in the file at the location given by filename, extract the three columns you are interested in, and rename them, so that you can return a data frame with a row for each atmospheric pressure reading and the following three columns:
- Date: the date at which the reading was taken. By default, R will read this in as a number. You should leave it as a number for now. Each of the dates will have the following format: 20121029
- Time: the time at which the reading was taken. By default, R will read this in as a number. You should leave it as a number for now. The times will look a little strange, due to them being numbers: for example, you will see times such as 0 (midnight), 100 (1am) and 200 (2am).
- AtmosPressure: the atmospheric pressure reading for the given date and time.
You can test this function by typing in:
readNOAAData(filename=noaaFilename)
but again, note that the function should work for any filename. (Specifically, the function needs to work for us if we give it a path for a data file stored on our computers, potentially with a different name.)
If you've got this working, save the data in noaaData as follows, so that you can use it to solve the next task.
noaaData <- readNOAAData(filename=noaaFilename)
If your code is correct, when you type in the following command
str(noaaData)
you should see the following output:
'data.frame': 93 obs. of 3 variables:
$ Date : int 20121029 20121029 20121029 20121029 20121029 20121029 20121029 20121029 20121029 20121029 ...
$ Time : int 0 100 200 300 400 500 600 700 800 900 ...
$ AtmosPressure: num 1000 999 999 998 998 ...
Hint 1: Remember to rename the columns once you’ve loaded the data!
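As a sketch – with DATE_COL, TIME_COL and PRESSURE_COL as placeholders for the real column names, which you must identify from the documentation file – the function might look like this:

readNOAAData <- function(filename) {
  # Assumption: the NOAA file is whitespace-delimited with a header row
  noaaRaw <- read.table(filename, header = TRUE, stringsAsFactors = FALSE)
  # Placeholders: replace these with the real column names from the documentation
  noaaData <- noaaRaw[, c("DATE_COL", "TIME_COL", "PRESSURE_COL")]
  names(noaaData) <- c("Date", "Time", "AtmosPressure")
  noaaData
}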
Part 4: Combining the Flickr and environmental data (2%)
Now you have the Flickr data and the environmental data. To start to investigate how these data sets relate, you need to combine these datasets into one table.
For each hour from the beginning of 29 October 2012 to the end of 1 November 2012, you have both a normalised count of the number of Hurricane Sandy Flickr photos taken, and a measurement of atmospheric pressure in Atlantic City. However, the atmospheric pressure data uses a different format than the Flickr data for specifying the date and time.
You need to work out how to change the format of the atmospheric pressure date and time, so that it matches the format used in the Flickr data, and then use the dates to combine the datasets.
To solve this problem, let's split this task up into two sub-problems:
A) Changing the format of the atmospheric pressure date and time
B) Combining the Flickr and environmental datasets
4A. Changing the format of the atmospheric pressure date and time (1%)
You just had a look at the structure of the noaaData data frame by using the following command:
str(noaaData)
Have a look at the structure of the fullFlickrData data frame too:
str(fullFlickrData)
You can see that dates in the Flickr dataset are in POSIXct format, such that the date and time are both stored in the Date column.
You need to take the date and time information in the noaaData data frame and combine it to create a new column with POSIXct date-time objects in the noaaData data frame.
To do this, edit the function changeNOAADateFormat. This function has one argument:
- noaaData: the atmospheric pressure data you read in with readNOAAData
You need to change this function to process the date and time information in noaaData, and for each reading, create a POSIXct object that represents the time and date of the reading. These new timestamps should go in a new column called DateTime. You should then remove the old Date and Time columns.
The function should return a data frame with a row for each reading in noaaData, and the following two columns:
- AtmosPressure: the atmospheric pressure readings from noaaData
- DateTime: the time and date of each reading as a POSIXct object.
You can test this function by typing in:
changeNOAADateFormat(noaaData)
If you’ve got this working, save the processed data in noaaData as follows, so that you can use it to solve the next task.
noaaData <- changeNOAADateFormat(noaaData=noaaData)
If your code is correct, when you type in the following command
str(noaaData)
you should see the following output:
'data.frame': 93 obs. of 2 variables:
$ AtmosPressure: num 1000 999 999 998 998 ...
$ DateTime : POSIXct, format: "2012-10-29 00:00:00" "2012-10-29 01:00:00" "2012-10-29 02:00:00" ...
Hint 1: The formatC function looks a bit complicated, but is a simple way of converting a number into a string with a certain number of characters (e.g., 4). Might this be useful to process the times?
Hint 2: Remember the as.POSIXct() function – it will come in handy here.
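Putting those two hints together, one possible sketch is below. The tz = "UTC" argument is our assumption, chosen to match the Flickr timestamps.

changeNOAADateFormat <- function(noaaData) {
  # Pad the numeric times to four characters: 0 -> "0000", 100 -> "0100"
  timeString <- formatC(noaaData$Time, width = 4, flag = "0")
  # Combine date and time (e.g. "20121029 0100") and parse as POSIXct
  noaaData$DateTime <- as.POSIXct(paste(noaaData$Date, timeString),
                                  format = "%Y%m%d %H%M", tz = "UTC")
  # Remove the old Date and Time columns
  noaaData$Date <- NULL
  noaaData$Time <- NULL
  noaaData
}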
4B. Combining the Flickr and environmental datasets (1%)

Great – you’ve got data on Hurricane Sandy related photos from Flickr, and data on atmospheric pressure from NOAA. If you wanted to carry out an analysis of how these two datasets related, you would want to put them into the same data frame.
To do this, edit the function mergeFlickrAndNOAAData. This function has two arguments:
- allFlickrData: a data frame containing counts of Flickr photos which you created using normaliseFlickrCounts
- noaaData: a data frame containing atmospheric pressure data which you created using changeNOAADateFormat
You want to change this function to merge the datasets specified by allFlickrData and noaaData to create one data frame, where each row represents one hour and contains both a measurement of the atmospheric pressure and the normalised count of Hurricane Sandy Flickr photos. This can be done in one line of code (in addition to the return statement).
The function should return a data frame with a row for every atmospheric pressure reading you downloaded, and the following five columns:
- Date: the time and date of the hour for which you have Flickr data and atmospheric pressure data
- SandyPhotoCount: a count of the total number of photos which were taken in that hour and uploaded to Flickr with the text "hurricane sandy" attached, as in allFlickrData
- AllPhotosCount: a count of the total number of photos which were taken in that hour and uploaded to Flickr, as in allFlickrData
- SandyNormalised: the normalised count of “hurricane sandy” Flickr photos for that hour, as in allFlickrData
- AtmosPressure: the atmospheric pressure reading for that hour, as in noaaData
You can test this function by typing in:
mergeFlickrAndNOAAData(allFlickrData=fullFlickrData, noaaData=noaaData)
If you’ve got this working, save the processed data in allData as follows, so that you can use it to solve the next task.
allData <- mergeFlickrAndNOAAData(allFlickrData=fullFlickrData,
noaaData=noaaData)
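For reference, the one-line merge might look like this – assuming, as in the sketches above, that the Flickr dates live in Date and the NOAA dates in DateTime:

mergeFlickrAndNOAAData <- function(allFlickrData, noaaData) {
  # Keep the rows where the Flickr hour matches the hour of a NOAA reading
  return(merge(allFlickrData, noaaData, by.x = "Date", by.y = "DateTime"))
}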
Part 5: Visualising the data (2%)
Excellent – you have all the data!
The first thing you would want to do when working out whether there is a relationship between the two data sets is to plot the data. For now, let’s make one graph of the Flickr data and one graph of the atmospheric pressure data, and make sure they are clear and easy to read.
5A. Visualising the Flickr data (1%)
It would be helpful to see a line graph of the normalised counts of “hurricane sandy” Flickr photographs across time.
To do this, edit the function createFlickrPlot. This function has one argument:
- data: the merged Flickr and atmospheric pressure dataset you created with mergeFlickrAndNOAAData
You will see that there is already some ggplot code there, and that the output of the code is saved in a variable called p and returned. The plot is saved in a variable and returned so that it can be printed by the code which calls this function.
Try typing this into the console:
createFlickrPlot(data=allData)
At the moment, this will give you a warning message and no plot.
You want to add code to this function to make the createFlickrPlot(data=allData) command print a plot with the following specifications:
- Your plot should be a line graph, with the date and time on the x-axis, and the normalised number of “hurricane sandy” Flickr photos on the y-axis.
- The colour of the line should be “blue”
- The line width should be 1mm (Hint: the default unit for line widths is mm, so you can just consider this to be a width of 1)
- The title of the x-axis should be “Time [hours]”, where “[hours]” indicates the units for your data.
- The title of the y-axis should be “Normalised number of photos”.
- The axis titles should have a font size of 20pt.
- The tick mark labels should have a font size of 16pt.
To gain credit for your work on this function, make sure you follow these instructions exactly. In particular, make sure the axis titles are exactly as stated above.
To test your code, type:
createFlickrPlot(data=allData)
Hint 1: Cookbook for R has plenty of information on how to customise plots made with ggplot2
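A sketch of ggplot2 code meeting these specifications is below, assuming the column names from Part 4. Note that in ggplot2 versions before 3.4.0, the line width argument of geom_line is called size rather than linewidth.

library(ggplot2)

createFlickrPlot <- function(data) {
  p <- ggplot(data, aes(x = Date, y = SandyNormalised)) +
    geom_line(colour = "blue", linewidth = 1) +   # 1mm line width
    xlab("Time [hours]") +
    ylab("Normalised number of photos") +
    theme(axis.title = element_text(size = 20),   # axis titles at 20pt
          axis.text = element_text(size = 16))    # tick mark labels at 16pt
  return(p)
}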
5B. Visualising the atmospheric pressure data (1%)
For comparison, it would be useful to make a plot of the atmospheric pressure data across time.
To do this, edit the function createNOAAPlot. This function has one argument:
- data: the merged Flickr and atmospheric pressure dataset you created with mergeFlickrAndNOAAData
You want to add code to this function to make it print a plot with the following specifications:
- Your plot should be a line graph, with the date and time on the x-axis, and the atmospheric pressure on the y-axis.
- The colour of the line should be “red”
- The line width should be 1mm (Hint: the default unit for line widths is mm, so you can just consider this to be a width of 1)
- The title of the x-axis should be “Time [hours]”, where “[hours]” indicates the units for your data
- The title of the y-axis should be “Atmospheric pressure [mbar]”, where “[mbar]” indicates the units for your data
- The axis titles should have a font size of 20pt
- The tick mark labels should have a font size of 16pt
To gain credit for your work on this function, make sure you follow these instructions exactly. In particular, make sure the axis titles are exactly as stated above.
To test your code, type:
createNOAAPlot(data=allData)
Well done – you’ve reached the end!
Fantastic – you’ve downloaded data from Flickr, processed it and combined it with data on atmospheric pressure, and written functions to create two plots! For your projects, you would also want to carry out a statistical analysis – but we’ll leave that part out of this assignment.
Before you submit your answer file, do a final check to make sure R can read it without generating any errors:
source("Assignment1Answers.R")
This is very important – we won’t be able to mark your work if there are errors on running this command.
Please do also check again that you have not edited the code outside of the marked areas. In particular, you should make sure that:
• you have NOT changed the function names or argument names;
• you have NOT added or removed any arguments or functions;
• you have NOT added any code BETWEEN the functions; and
• you have NOT edited the code at the bottom of the file, underneath the “DO NOT EDIT ANY CODE UNDERNEATH THIS LINE” marker.
Again, if you do make changes where we have indicated that you should not, you will not get credit for your efforts and your code may not work as you expect when we test it.
Once you’ve checked this all looks OK, if you would like to see how all your functions are brought together automatically, you can look at the function we have written at the bottom of the file called doEverything. You can see that this uses all the functions you have just written.
You can try this out if you like – but bear in mind that it will download all the Flickr data again, so it could take a few minutes on a good broadband connection. (This is also not a compulsory part of the assignment.)
If you want to try it, type the following into your console:
doEverything()
You’ll see R use all of your functions to download all the Flickr data, read in the atmospheric pressure data, combine them and create your two plots.
Please note that you should not rely on the doEverything() function to test your code. Instead, you should check each function works as described in the Task specifications above. Importantly, you should make sure that your functions respond correctly to changes in the values of the arguments.
Well done. You’ve completed the assignment – and learnt something along the way, we hope!
WBS Plagiarism Policy
Please ensure that any work submitted by you for assessment has been correctly referenced. WBS expects all students to demonstrate the highest standards of academic integrity at all times, and treats all cases of poor academic practice and suspected plagiarism very seriously. You can find information on these matters on my.wbs, in your student handbook and on the University’s library web pages:
https://warwick.ac.uk/services/library/students/referencing
The University’s Regulation 11 (see link below) clarifies that “...’cheating’ means an attempt to benefit oneself or another by deceit or fraud. This includes reproducing one’s own work...” It is important to note that it is not permissible to reuse work which has already been submitted by you for credit either at WBS or at another institution (unless you have been explicitly told that you can do so). This is considered self-plagiarism and could result in significant mark reductions.
Upon submission of assignments, students will be asked to agree to one of the following declarations:
Individual work submissions:
"I declare that this work is entirely my own in accordance with the University's Regulation 11 and the WBS guidelines on plagiarism and collusion. All external references and sources are clearly acknowledged and identified within the contents. No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done it may result in me being reported for self-plagiarism and an appropriate reduction in marks may be made when marking this piece of work.”
Group work submissions:
"I declare that this work is being submitted on behalf of my group, in accordance with the University's Regulation 11 and the WBS guidelines on plagiarism and collusion. All external references and sources are clearly acknowledged and identified within the contents. No substantial part(s) of the work submitted here has also been submitted in other assessments for accredited courses of study and if this has been done it may result in us being reported for self- plagiarism and an appropriate reduction in marks may be made when marking this piece of work."
By agreeing to these declarations you are acknowledging that you have understood the rules about plagiarism and self-plagiarism and have taken all possible steps to ensure that your work complies with the requirements of WBS and the University.
You should only indicate your agreement with the relevant statement, once you have satisfied yourself that you have fully understood its implications. If you are in any doubt, you must consult with the NIE of the relevant module, because once you have indicated your agreement it will not be possible to later claim that you were unaware of these requirements in the event that your work is subsequently found to be problematic in respect to suspected plagiarism or self-plagiarism.
Regulation 11: http://www2.warwick.ac.uk/services/gov/calendar/section2/regulations/cheating