Statistics 4004: MidTerm
Due Thursday March 21, start of class
2019-03-04

Problem 1 [40 points]
Over the last several years, a research team has been collecting survey data on candy preferences. The data currently live in several files and suffer from all the usual problems: inconsistencies between years, coding inconsistencies, etc. We (the team, well, ok, you) need to analyze the data.
Specifically, we need to:
Part A [10 points / 40]
1. get and munge the data
• hint 1: use the readxl package instead of the xlsx package
• hint 2: you will need to fix the column names (use bulk regex fixes; don't worry about single-year question differences)
• reformat the data as necessary
• combine all the datasets
Note, this problem was inspired in part by the author's own work: http://www.scq.ubc.ca/so-much-candy-data-seriously/.
Here is the data: http://www.scq.ubc.ca/wp-content/uploads/2017/10/candyhierarchy2017.xlsx
https://www.scq.ubc.ca/wp-content/uploads/2016/10/BOING-BOING-CANDY-HIERARCHY-2016-SURVEY-Responses.xlsx
https://www.scq.ubc.ca/wp-content/uploads/2015/10/CANDY-HIERARCHY-2015-SURVEY-Responses.xlsx
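The get-and-munge steps above could be sketched roughly as follows. This is a sketch, not a full solution: the file names assume the three spreadsheets were downloaded to the working directory, and the regex clean-up rules are illustrative; the real fixes depend on what the 2015–2017 headers actually contain.

```r
library(readxl)
library(dplyr)

# Assumes the three survey spreadsheets sit in the working directory
files <- c("candyhierarchy2017.xlsx",
           "BOING-BOING-CANDY-HIERARCHY-2016-SURVEY-Responses.xlsx",
           "CANDY-HIERARCHY-2015-SURVEY-Responses.xlsx")
years <- c(2017, 2016, 2015)

# Illustrative bulk regex fixes; adjust the patterns to the real headers
clean_names <- function(x){
  x <- gsub("^Q[0-9]+[|:] *", "", x)  # drop leading question codes like "Q6 |"
  x <- gsub("[^A-Za-z0-9 ]", "", x)   # strip stray punctuation / encoding junk
  tolower(trimws(x))
}

candy_list <- lapply(seq_along(files), function(i){
  d <- read_xlsx(files[i])
  names(d) <- clean_names(names(d))
  d <- mutate(d, across(everything(), as.character))  # reconcile column types
  d$year <- years[i]
  d
})

# bind_rows() fills columns that are missing in a given year with NA
candy <- bind_rows(candy_list)
```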
Part B [20 points / 40]
1. comment on any issues you faced and any you see remaining in the data
2. how many total candy questions are there?
3. which candy (or candies) received the most reviews?
4. which candy (or candies) received the least reviews?
5. which candies get the best and worst ratings (top 10; joy = +1, despair = -1)?
6. which candies get the best and worst ratings (top 10) by year?
7. create summary plots of the top 10 highest and lowest
8. create summary plots of the top 10 highest and lowest rated candies by year normalized by total responses for that candy in that year
9. report your findings
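The counting and scoring questions above could be sketched as below, assuming the combined `candy` table from Part A has been trimmed to just the candy columns plus `year`, with cells coded roughly as "JOY" / "DESPAIR" / "MEH" (those codings are an assumption; check the raw data).

```r
library(dplyr)
library(tidyr)

scores <- candy %>%
  pivot_longer(-year, names_to = "candy", values_to = "response") %>%
  filter(!is.na(response)) %>%
  mutate(score = case_when(
    grepl("JOY", response, ignore.case = TRUE)     ~  1,
    grepl("DESPAIR", response, ignore.case = TRUE) ~ -1,
    TRUE                                           ~  0)) %>%
  group_by(candy) %>%
  summarise(n_reviews = n(), total = sum(score))

scores %>% slice_max(n_reviews, n = 1)   # most-reviewed candy(ies)
scores %>% slice_min(n_reviews, n = 1)   # least-reviewed candy(ies)
scores %>% slice_max(total, n = 10)      # top 10 best rated
scores %>% slice_min(total, n = 10)      # top 10 worst rated
```

Grouping by `year` as well as `candy` gives the per-year versions, and the resulting summary tables feed directly into the requested plots.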
Part C [10 points / 40]
What is of interest is changes in preference. Focusing on 2016 and 2017, convert the preferences to a fraction having joy, i.e. joy/(total responses for that candy). Which candies saw the most gain or loss in preference year to year?
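The preference-change calculation could be sketched as follows, assuming a long-format table `resp` with columns `candy`, `year`, and `response` (the pivoted form of the combined data; these column names are assumptions):

```r
library(dplyr)
library(tidyr)

joy_frac <- resp %>%
  filter(year %in% c(2016, 2017)) %>%
  group_by(candy, year) %>%
  # fraction of responses for this candy that were joy
  summarise(frac_joy = mean(grepl("JOY", response, ignore.case = TRUE)),
            .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = frac_joy,
              names_prefix = "y") %>%
  mutate(change = y2017 - y2016) %>%
  arrange(desc(abs(change)))

head(joy_frac, 10)   # candies with the largest year-to-year gain or loss
```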
Problem 2 [10 points]
In problem 1, we did some data munging and found there may be some changes in preference. Here, we will get data on one of the companies seeing an overall increase in preference for their brands. Specifically, we want market data for Mars, Nestle (NSGRY), and Hershey (HSY) from June 2016 – June 2018 to look at how those stocks did through 2017. Mars is private and Nestle is foreign, so we will work on HSY. Please get and plot the stock open/close data for HSY for the time period indicated using the quantmod package. See the examples section for how to get data and plot. For this, you can use src="yahoo" and chartSeries to get a nice plot.
https://www.quantmod.com/examples/
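Following the quantmod examples page, the fetch-and-plot step might look like this; `getSymbols()` and `chartSeries()` are quantmod functions, and the date range matches the problem statement:

```r
library(quantmod)

# getSymbols() fetches OHLC data into an xts object named after the ticker
getSymbols("HSY", src = "yahoo",
           from = "2016-06-01", to = "2018-06-30")

# chartSeries() draws a standard financial chart of that object
chartSeries(HSY, name = "Hershey (HSY), Jun 2016 - Jun 2018",
            theme = chartTheme("white"))
```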
Problem 3 [20 points]
Looking at the HSY chart, there is an interesting tick around July 2016 and another around May-June 2017. What we would like to do is search through the Twitter archives to see if we see any activity on Twitter. Unfortunately, free Twitter searches only show the last 10-ish days. If this were our business, we would likely either a) have a Twitter premium account (expensive) or b) be collecting live streams. Short of that, let's see if the idea has merit. Note, there are companies out there doing this (albeit, probably pulling from multiple sources and spending lots of time worrying about scoring algorithms, for instance: https://www.stockfluence.com and this paper https://www.semanticscholar.org/paper/Analyzing-Stock-Market-Movements-Using-Twitter-tushar/a475a1b6b826ed8d3ade47341b4442aced6c5dd3).
In this problem, we want to look at DJIA index quotes and sentiment in Twitter.
1. gather tweets using the search term "djia"
2. question: do you keep or delete retweets, justify your answer
3. limit the tweets to the 5 business days prior to your analysis
4. pull out the tweet text and "created_at" fields
5. find sentiment terms for each tweet
6. calculate sentiment score for each tweet
7. create 2 plots showing
• DJIA price trend for the 5 days you have tweets (try src="FRED" if you get a 404 error)
• boxplot of sentiment score for each of those 5 days
8. comment on what you see
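Steps 1–7 could be sketched as follows. `search_tweets()` requires an authorized Twitter token, and the 7-calendar-day filter below is a rough stand-in for "5 business days"; trim the dates to the actual business days in your window.

```r
library(rtweet)
library(syuzhet)
library(dplyr)

# search_tweets() returns recent tweets matching the query;
# include_rts = FALSE drops retweets (justify your own choice here)
tweets <- search_tweets("djia", n = 2000, include_rts = FALSE)

recent <- tweets %>%
  select(created_at, text) %>%
  mutate(day = as.Date(created_at)) %>%
  filter(day >= Sys.Date() - 7)   # rough stand-in for 5 business days

# get_sentiment() scores each text; "syuzhet" is the package's default method
recent$score <- get_sentiment(recent$text, method = "syuzhet")

boxplot(score ~ day, data = recent,
        main = "Tweet sentiment by day (search: djia)")
```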
Problem 4
Please knit this document to PDF (the file name should be MidTerm_pid) and push it to BitBucket. In the R Terminal, type:
1. git pull
2. git add MidTerm_pid.[pR]* (NOTE: this should add two files)
3. git commit -m "final MidTerm submission"
4. git push
Grading Rubric:
• 10 points: successfully submitted to BitBucket
• 20 points: neat, well-written document
– 10 points style
– 5 points Reproducible Research
– 5 points Good Programming Practices
• 70 points: correct answers to problems as given
Functions and packages I used when I took the exam
rm(list = ls())
knitr::opts_chunk$set(echo = TRUE, include = TRUE, eval = TRUE,
                      tidy = TRUE, tidy.opts = list(width.cutoff = 55))
options("getSymbols.warning4.0" = FALSE)

# more compact way to load a bunch of packages in one go
necessary_packages <- c("plyr", "tidyverse", "data.table", "lubridate",
                        "rvest", "readxl", "knitr", "quantmod", "rtweet",
                        "syuzhet")

# quick function to check if a library is installed; if not, install, then load
load_libraries <- function(lib_name){
  # lib_name is a character vector of packages I want to load
  if(!(lib_name %in% installed.packages())){
    install.packages(lib_name)
  }
  # require() will return TRUE if the library is loaded
  return(require(lib_name, character.only = TRUE))
}

loaded_packages <- sapply(necessary_packages, function(x) load_libraries(x))

# should add a line to capture library load fails and alert user
if(sum(loaded_packages) != length(necessary_packages)){
  cat("stuff didn't work", "\n", sep = " ")
}

# A function for captioning and referencing images.
# Figure captions are a pain IMO; I don't remember where I got this from,
# but it may be referenced here: https://rpubs.com/ajlyons/autonumfigs
fig <- local({
  i <- 0
  ref <- list()
  list(
    cap = function(refName, text){
      i <<- i + 1
      ref[[refName]] <<- i
      text
    },
    ref = function(refName){
      ref[[refName]]
    }
  )
})

# cat("\n\n\\pagebreak\n")
## to use it, put this in your code chunk header:
## fig.cap=fig$cap("plot1","figure legend text goes here.")