STT 301 Homework Assignment 2
Shawn Santo
September 25, 2017
Homework Assignment 2 is due Wednesday, October 4 at 11:00pm EST.
Instructions and Rubric
You must complete this individual homework assignment using R Markdown. You may modify this file to include your solutions. However, please be sure it only contains your solutions (delete the instructions, rubric, and any other extraneous details). Turn in only the R Markdown file (.Rmd). This file should be uploaded into the dropbox in D2L.
You are not required to add labels, a title, or a legend, or to modify the scale, for any plots you create. If you are interested in adding some basic options to your plots, consider looking at http://www.statmethods.net/advgraphs/parameters.html.
Some of the questions are open-ended, and there is not a correct or incorrect answer. I am interested in you providing good commentary/explanations of what you are doing in the R Markdown file. You only need to provide commentary where it is stated. Remember, part of data science is communicating your results.
Total: 10 points
Correctness: Point values for each question and its respective part(s) are listed. Deductions will be made at the discretion of the grader.
Late Submission: -1.0 point
Knitting: -0.5 points if the Rmd file does not knit
Style: Use a third-level header to offset each question in your solutions, as is done below. For questions with multiple parts, use fourth-level headers to offset the parts in your solutions, as is done below. Coding style is very important. You will receive a deduction of up to 1.0 point if you do not adhere to good coding style.
No deduction if
appropriate variable use and naming
appropriate function use
good code commenting
consistent style
-0.5 points if two of the above are not satisfied
-1.0 point if three or more of the above are not satisfied
Required Data: nypd_2016.csv, variables_nypd_2016
Required Packages: rgdal, RColorBrewer, classInt
The stop-question-and-frisk (SQF) program, or stop-and-frisk, in New York City, is a New York City Police Department practice of temporarily detaining, questioning, and at times searching civilians on the street for weapons and other contraband. The NYPD SQF database is publicly available at http://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page and contains a plethora of information about every individual stop-question-and-frisk. We will use the 2016 data set only. The CSV file nypd_2016 and Microsoft Excel file variables_nypd_2016 are available on D2L and should be downloaded for use in this assignment. The file named nypd_2016 is the data you will read into R using the read.csv function below (code is given). The file named variables_nypd_2016 contains descriptions of the variables and values for the various labels. You will want to reference this document repeatedly throughout the assignment (you do not need to read it into R). Additionally, you will need to install and load the required packages listed above. Remember, the function install.packages() will install a package and the function library() will load the package once installed.
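The install-then-load sequence described above can be sketched as follows. This is only one possible setup, not a required part of the assignment; the helper names `required`, `is_installed`, and `missing_pkgs` are made up for illustration, and the `install.packages()` call is only printed as a suggestion so that knitting the file does not trigger a download.

```r
# Packages named in the "Required Packages" line of this assignment.
required <- c("rgdal", "RColorBrewer", "classInt")

# requireNamespace() tries to load a package's namespace and returns TRUE on
# success, without attaching the package to the search path.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs install.packages(); nothing is installed here.
missing_pkgs <- required[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Run once in the console: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
}

# Attach only the packages that are actually available.
for (pkg in required[is_installed]) {
  library(pkg, character.only = TRUE)
}
```

Installing is a one-time step per machine, while `library()` must be called in every new R session (and therefore inside the .Rmd file) before the package's functions are used.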
```r
nypd <- read.csv(file = "nypd_2016.csv", header = TRUE, stringsAsFactors = FALSE)
```

### Question 1 (3 points)

Data cleaning and exploration

Here you will clean, filter, and investigate the nypd_2016 data set.

#### Part a (0.25 points)

Create a new data frame called nypd.filter that contains the following variables from the original data frame nypd: ‘pct’, ‘crimsusp’, ‘arstmade’, ‘sex’, ‘race’, ‘age’, ‘ht_feet’, ‘ht_inch’, ‘weight’, ‘build’, ‘xcoord’, ‘ycoord’. Display the structure of the nypd.filter data frame.

#### Part b (0.5 points)

Change the variable age to numeric in the data frame nypd.filter. Display the structure of the nypd.filter data frame to verify it is now numeric.

#### Part c (0.5 points)

Remove any rows from the data frame nypd.filter that contain at least one NA value. You should overwrite the nypd.filter data frame so now nypd.filter will have no NA values. Display the dimensions of your now NA free data frame. Display the last six rows of the data frame nypd.filter.

#### Part d (0.25 points)

Use the table() function in R to

- create a table for the variable arstmade from the nypd.filter data frame.
- create a table for the variable sex from the nypd.filter data frame.
- create a table for the variable race from the nypd.filter data frame.

#### Part e (0.5 points)

To plot a qualitative (categorical) variable you can do so using the plot() function, but you will need to ensure your variable is a factor - do this locally inside the plot() function. Use the plot() function in R to

- create a bar plot of the variable race from the nypd.filter data frame.
- create a bar plot of the variable sex from the nypd.filter data frame.
- create a bar plot of the variable pct from the nypd.filter data frame.

Comment on what you observe from the plots below the corresponding R chunk.

#### Part f (0.5 points)

In part d you used the table() function in R to create tables for various variables.
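As a refresher on table(), here is a minimal sketch on a made-up data frame. The `pets` data and its column names are hypothetical and unrelated to the assignment data; the sketch shows the one-way form from part d next to the two-column data frame form used in this part.

```r
# Made-up data, unrelated to nypd_2016.csv.
pets <- data.frame(
  species = c("cat", "dog", "dog", "cat", "dog"),
  adopted = c("Y", "N", "Y", "Y", "N"),
  stringsAsFactors = FALSE
)

# One-way table: counts of each distinct value in a single column.
species_counts <- table(pets$species)
species_counts

# Two-way table: a two-column data frame is cross-tabulated, with the first
# column as rows and the second column as columns.
species_by_adopted <- table(pets[, c("species", "adopted")])
species_by_adopted
```

Each cell of the two-way table counts the rows that share that combination of values, so the cell totals add up to the number of rows in the data frame.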
You can also use the function to create a two-way contingency table by passing a data frame with two columns into the function table(). Use the table() function in R to

- create a two-way contingency table for the variables sex and race from the nypd.filter data frame.
- create a two-way contingency table for the variables arstmade and race from the nypd.filter data frame.

Comment on what you observe from the tables below the corresponding R chunk.

#### Part g (0.5 points)

Using the nypd.filter data frame

- compute and display the mean age of all the individuals who were SQF.
- compute and display the median age of all the individuals who were SQF.
- compute and display the standard deviation for the age of all the individuals who were SQF.
- create a histogram for the age of all the individuals who were SQF.

Comment on what you observe from the histogram below the corresponding R chunk.

### Question 2 (3 points)

Data subsetting

There is no single correct way to subset the data when answering the questions below. You may use any technique you desire to arrive at the solution.

#### Part a (1 point)

Use the nypd.filter data frame to answer the following questions.

- Given an arrest was made, what is the empirical distribution (compute percentages/proportions) for the variable race?
- Given an arrest was made, what is the empirical distribution (compute percentages/proportions) for the variable sex?

#### Part b (1 point)

Use the nypd.filter data frame to perform the following.

- Compute and display the mean age of an adult (18 or older) that was SQF.
- Compute and display the mean age of an individual SQF for each gender.
- Compute and display the mean weight of an individual SQF for each gender.
- Compute and display the mean age of an individual SQF by an officer from precinct 106.
- Compute and display the mean weight of an individual SQF from precinct 106 or precinct 49.

#### Part c (1 point)

Use the nypd.filter data frame to answer the following questions.
Pay careful attention to the case/character structure.

- How many suspected crimes were DWI or D.W.I.?
- How many suspected crimes were MURDER?
- How many arrests were made where the suspected crime was MURDER?

### Question 3 (3 points)

Writing a function

#### Part a (1.5 points)

Create a function that converts the height of an individual in feet and inches to inches. Your function should have the following:

- a descriptive name
- two arguments (inputs): one to input the feet and another to input the inches
- a check to ensure no values are negative
- a check to ensure no NA values exist
- return the total height in inches

#### Part b (0.5 points)

Use your newly created function from above on the variables ht_feet and ht_inch from the nypd.filter data frame. Attach the result to the nypd.filter data frame under the new variable name ht_inch_total. Your data frame should now have 13 columns.

#### Part c (0.5 points)

Compute and display the correlation between the variables ht_inch_total and weight from the nypd.filter data frame.

#### Part d (0.5 points)

Create a scatter plot using the plot() function for the variables ht_inch_total and weight from the nypd.filter data frame. You may choose which variable corresponds to x in plot(). Comment on what you observe from the scatter plot below the corresponding R chunk.

### Question 4 (1 point)

#### Part a (0.5 points)

Create a data frame that contains two columns. Column 1 should be the precinct number. Column 2 should be the number of individuals SQF by that respective precinct. As before, you should be working with the nypd.filter data frame. Save this new data frame as pct.df.

#### Part b (0.25 points)

Run the chunk below to bring the function nyc.precinct.plot into the working environment. List at least two problems with the coding style of the function below.
```r
nyc.precinct.plot <- function(df){

  # load required packages
  library(rgdal)
  library(RColorBrewer)
  library(classInt)

  # download precinct map
  download.file("http://www.rob-barry.com/assets/data/mapping/nypp_15b.zip",destfile = "nypp_15b.zip")
  unzip(zipfile = "nypp_15b.zip")
  nypp <- readOGR("nypp_15b", "nypp")
  {colnames(df) <- c("pct", "stops")}

  # create a sub function for merging data frames
  merge.shpdf.df = function(shpdf, sub.df, by.shpdf, by.df) {
    shpdf@data <- data.frame(shpdf@data, sub.df[match(shpdf@data[, by.shpdf], sub.df[, by.df]), ])
    return(shpdf)
  }

  # merge data frames using sub function
  nypp.merge <- merge.shpdf.df(nypp, df, "Precinct", "pct")

  # create the plot
  pal = brewer.pal(5, "YlOrRd")
  fill.clr <- findColours(classIntervals(nypp.merge@data$stops, style = "pretty", n = 5), pal)
  plot(nypp, col = fill.clr, main="Stop-Question-Frisk Incidents by Precinct")
  legend( "topleft", fill=attr(fill.clr, "palette"), legend=names(attr(fill.clr, "table")), bty = "n" )}
```

#### Part c (0.25 points)

Pass your newly created pct.df data frame into the function nyc.precinct.plot. Comment on what you observe from the plot below the corresponding R chunk. Additionally, comment on how the plot could be improved.