Problem 3
In this problem, we will explore two yearly data sets from http://www.nationalchickencouncil.org/about-the-industry/statistics/. The second link is Per Capita Consumption of Poultry & Livestock and the third one is Wholesale and Retail Prices for Chicken, Beef, and Pork. You can simply click on the two titles to see the data.
• Scrape the Per Capita Consumption of Poultry & Livestock data from the web and store it as a tibble. Manipulate your tibble so that
• The rows only contain the observed values, i.e., exclude the estimate/forecast values. (Hint: consider the slice() function from tidyverse.)
• The columns only contain the variables Year, Beef, Pork, and Broilers. (Hint: consider the select() function from tidyverse.) At this point, you should have a wide data.
• Then convert the data into a long format. That is, the long data should have three columns Year (carried over from the wide data), meat (which has three levels: Beef, Pork, and Broilers) and sales.
• Convert the variable Year from character to numeric value, i.e., dbl in tibble. Note, here we don’t convert Year to a date object as it is not complete (why?)
After the 4 steps, your data should be clean enough…print the first and last 5 rows of your data. (5 pts)
• Make a time series plot of your data. Your plot should have 3 lines (you can distinguish them by using different colors or types) and scale x-axis by 5 years. Make your plot as beautiful and clear as possible! (5 pts)
• Scrape the Wholesale and Retail Prices for Chicken, Beef, and Pork data from the web and store it as a tibble. This one is a little bit harder than the previous one as, you’ll see, the table header is not that regular. So, please follow the steps below.
• After you get the table from the web, you can convert it to a data.frame so that you can extract the wanted rows.
• Extract the rows that only contain the observed values, again you can use the slice() function.
• Now, you need to rename the columns. You can use the rename() function. I want to name the first column as Year, and the last three as Beef,Pork, andBroilers`. You can name the others on your own.
• Select the columns of Year, Beef, Pork, and Broilers. And change their class to numeric. You can consider the modify_at() function.
• Then convert the data into a long format. That is, the long data should have three columns Year (carried over from the wide data), meat (which has three levels: Beef, Pork, and Broilers) and price.
• Create a new variable called price_dollar which converts the unit of price from cent to dollar.
Finally, print the first and last 5 rows of your data. (7 pts)
• Make a time series plot of your data. Your plot should have 3 lines (you can distinguish them by using different colors or types) and scale x-axis by 5 years. Make your plot as beautiful and clear as possible! (3 pts)
• Draw a concise conclusion by considering the two plots together. (2 pts)
Problem 4
The Civilian Complaint Review Board (CCRB) is an independent municipal agency that investigates complaints of NYPD misconduct. A “complaint” is defined as any incident within the Agency’s jurisdiction that falls into one or more of the following categories of misconduct specified by the New York City Charter: Force, Abuse of Authority, Discourtesy, and Offensive Language, collectively known as “FADO”. Upon receiving a complaint, CCRB investigators gather evidence and interview witnesses to prepare reports on the allegations of misconduct. Once an investigation is concluded, a closing report is prepared detailing the evidence and a legal analysis, and the case is turned over to the CCRB Board for review and vote.
Now go here at http://www1.nyc.gov/site/ccrb/policy/data-transparency-initiative-complaints.page to see more details about the data they maintained. Then scroll down to the end and download the latest data by clicking on Download the record-level dataset. In what follows, you will reproduce the first 3 plots on this page.
• Import and manipulate the data. You can follow the steps below: (5 pts)
• The data you download is a xlsx file, so you can import it in to R by using the read_excel function from the readxl package as you have seen from class.
• After you read in the data, you should see that the column names are not ready to use. So, you can convert it into a data.frame by using data.frame() function.
• Then you can use as_tibble() to convert it back to a tibble.
Now, print out the first 10 rows of your tidy data.
• In this question we will reproduce the plot “How many complaints has the CCRB received over time?” (5 pts)
• First, you need to obtain the number of complaints for each year. To do so, you should group_by() the data by Received.Year.
• Then summarize the data by counting the number of distinct complaints in each year. To do so, you can apply the summarize() function and with in it use the argument n_distinct(UniqueComplaintId) to obtain the count.
• Make the plot from 2005 to 2016 and use the same Title and Subtitle from the original graph.
• In this question we will reproduce the plot “How many CCRB complaints contain at least one allegation of excessive force, abuse of authority, discourtesy or offensive language (FADO)?” (5 pts)
• You need to obtain the number of complaints for each year by FADO type. To do so, you should group_by() the data by Received.Year and Allegation.FADO.Type.
• Then summarize the data by counting the number of distinct complaints in each year and FADO type. To do so, you can apply the summarize() function and with in it use the argument n_distinct(UniqueComplaintId) to obtain the count.
• Make the plot from 2005 to 2016 and use the same Title and Subtitle from the original graph. Also, you need to delete the NA level from the type.
• In this question we will reproduce the plot “How many complaints does the CCRB receive by borough of incident?” (5 pts)
• You need to obtain the number of complaints for each year by Borough. To do so, you should group_by() the data by Received.Year and Borough.of.Occurrence.
• Then summarize the data by counting the number of distinct complaints in each year and borough. To do so, you can apply the summarize() function and with in it use the argument n_distinct(UniqueComplaintId) to obtain the count.
• Make the plot from 2005 to 2016 and use the same Title and Subtitle from the original graph. Also, you only need to include the 5 boroughs as shown on the origianl graph.