Data Cleaning Lab Key
Data Cleaning Lab Key
Copyright By PowCoder代写 加微信 powcoder
Bike Lanes Dataset: BikeBaltimore is the Department of Transportation’s bike program. The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
You can Download as a CSV in your current working directory.
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ dplyr 1.0.7
## ✓ tibble 3.1.5 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(lubridate)
## Attaching package: ‘lubridate’
## The following objects are masked from ‘package:base’:
## date, intersect, setdiff, union
library(tidyverse)
library(broom)
bike = read_csv(
“data/Bike_Lanes.csv”)
## Rows: 1631 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: “,”
## chr (6): subType, name, block, type, project, route
## dbl (3): numLanes, length, dateInstalled
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Count the number of rows of the bike data and count the number of complete cases of the bike data. Use sum and complete.cases.
nrow(bike)
## [1] 1631
sum(complete.cases(bike))
## [1] 257
Create a data set called namat which is equal to is.na(bike).
What is the class of namat? Run rowSums and colSums on namat. These represent the number of missing values in the rows and columns of bike. Don’t print rowSums, but do a table of the rowSums
namat = is.na(bike)
colSums(namat)
## subType name block type numLanes
## 4 12 215 9 0
## project route length dateInstalled
## 74 1269 0 0
table(rowSums(namat))
## 0 1 2 3 4
## 257 1181 182 6 5
Filter rows of bike that are NOT missing the route variable, assign this to the object have_route. Do a table of the subType using table, including the missing subTypes Get the same frequency distribution using group_by(subType) and tally()
have_route = bike %>% filter(!is.na(route))
dim(have_route)
## [1] 362 9
table(have_route$subType, useNA = “always”)
## STRALY STRPRD
## 3 358 1
have_route %>% group_by(subType) %>% tally()
## # A tibble: 3 × 2
## subType n
##
## 1 STRALY 3
## 2 STRPRD 358
## 3
have_route %>% count(subType)
## # A tibble: 3 × 2
## subType n
##
## 1 STRALY 3
## 2 STRPRD 358
## 3
have_route %>% group_by(subType) %>%
summarize(n_obs = n())
## # A tibble: 3 × 2
## subType n_obs
##
## 1 STRALY 3
## 2 STRPRD 358
## 3
tally( group_by(have_route, subType) )
## # A tibble: 3 × 2
## subType n
##
## 1 STRALY 3
## 2 STRPRD 358
## 3
have_route = group_by(have_route, subType)
tally(have_route)
## # A tibble: 3 × 2
## subType n
##
## 1 STRALY 3
## 2 STRPRD 358
## 3
Filter rows of bike that have the type SIDEPATH or BIKE LANE using %in%. Call it side_bike. Confirm this gives you the same number of results using the | and ==.
side_bike = bike %>% filter(type %in% c(“SIDEPATH”, “BIKE LANE”))
side_bike2 = bike %>% filter(type == “SIDEPATH” | type == “BIKE LANE”)
identical(side_bike, side_bike2)
## [1] TRUE
nrow(side_bike)
## [1] 628
nrow(side_bike2)
## [1] 628
side_bike = filter(bike,type %in% c(“SIDEPATH”, “BIKE LANE”))
side_bike2 = filter(bike, type == “SIDEPATH” | type == “BIKE LANE”)
identical(side_bike, side_bike2)
## [1] TRUE
Do a cross tabulation of the bike type and the number of lanes. Call it tab. Do a prop.table on the rows and columns margins. Try as.data.frame(tab) or broom::tidy(tab)
tab = table(type=bike$type, numLanes=bike$numLanes)
prop.table(tab, 1)
## numLanes
## type 0 1 2
## BIKE BOULEVARD 0.00000000 0.02040816 0.97959184
## BIKE LANE 0.03220612 0.66183575 0.30595813
## CONTRAFLOW 0.00000000 0.53846154 0.46153846
## SHARED BUS BIKE 0.00000000 1.00000000 0.00000000
## SHARROW 0.00000000 0.36842105 0.63157895
## SIDEPATH 0.00000000 0.85714286 0.14285714
## SIGNED ROUTE 0.00000000 0.69407895 0.30592105
prop.table(tab, 2)
## numLanes
## type 0 1 2
## BIKE BOULEVARD 0.000000000 0.001121076 0.067605634
## BIKE LANE 1.000000000 0.460762332 0.267605634
## CONTRAFLOW 0.000000000 0.007847534 0.008450704
## SHARED BUS BIKE 0.000000000 0.043721973 0.000000000
## SHARROW 0.000000000 0.243273543 0.523943662
## SIDEPATH 0.000000000 0.006726457 0.001408451
## SIGNED ROUTE 0.000000000 0.236547085 0.130985915
as.data.frame(tab)
## type numLanes Freq
## 1 BIKE BOULEVARD 0 0
## 2 BIKE LANE 0 20
## 3 CONTRAFLOW 0 0
## 4 SHARED BUS BIKE 0 0
## 5 SHARROW 0 0
## 6 SIDEPATH 0 0
## 7 SIGNED ROUTE 0 0
## 8 BIKE BOULEVARD 1 1
## 9 BIKE LANE 1 411
## 10 CONTRAFLOW 1 7
## 11 SHARED BUS BIKE 1 39
## 12 SHARROW 1 217
## 13 SIDEPATH 1 6
## 14 SIGNED ROUTE 1 211
## 15 BIKE BOULEVARD 2 48
## 16 BIKE LANE 2 190
## 17 CONTRAFLOW 2 6
## 18 SHARED BUS BIKE 2 0
## 19 SHARROW 2 372
## 20 SIDEPATH 2 1
## 21 SIGNED ROUTE 2 93
## Warning: ‘tidy.table’ is deprecated.
## Use ‘tibble::as_tibble()’ instead.
## See help(“Deprecated”)
## # A tibble: 21 × 3
## type numLanes n
##
## 1 BIKE BOULEVARD 0 0
## 2 BIKE LANE 0 20
## 3 CONTRAFLOW 0 0
## 4 SHARED BUS BIKE 0 0
## 5 SHARROW 0 0
## 6 SIDEPATH 0 0
## 7 SIGNED ROUTE 0 0
## 8 BIKE BOULEVARD 1 1
## 9 BIKE LANE 1 411
## 10 CONTRAFLOW 1 7
## # … with 11 more rows
Read the Property Tax data into R and call it the variable tax
tax = read_csv( “data/Real_Property_Taxes.csv.gz”)
## Rows: 237963 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: “,”
## chr (15): PropertyID, Block, Lot, Ward, Sect, PropertyAddress, LotSize, City…
## dbl (1): CouncilDistrict
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
How many addresses pay property taxes?
## [1] 237963 16
## [1] 237963
length(tax$PropertyID)
## [1] 237963
sum(is.na(tax$CityTax))
## [1] 18696
sum(!is.na(tax$CityTax))
## [1] 219267
summary(tax$CityTax)
## Length Class Mode
## 237963 character character
cat(‘Mean of Annual Salary: ‘, mean(!is.na(tax$CityTax)), ‘\n’)
## Mean of Annual Salary: 0.9214332
subset_pay <-cbind(tax$CityTax, tax$PropertyID)
What is the total city and state tax paid?
You need to remove the $ from the CityTax variable then you need to make it numeric. Try str_replace, but remember $ is “special” and you need fixed() around it.
head(tax$CityTax)
## [1] NA NA "$1463.45" "$2861.70" "$884.21" "$1011.60"
tax = tax %>%
CityTax = str_replace(CityTax,
fixed(“$”), “”) ,
CityTax = as.numeric(CityTax)
# no piping
tax$CityTax = str_replace(tax$CityTax, fixed(“$”), “”)
tax$CityTax = as.numeric(tax$CityTax)
sum(tax$CityTax,na.rm=TRUE)
## [1] 831974818
## useing parse_numbers()
options(digits=12) # so no rounding
tax = tax %>% mutate(StateTax = parse_number(StateTax))
sum(tax$CityTax)
sum(tax$CityTax, na.rm = TRUE)
## [1] 831974817.74
sum(tax$CityTax, na.rm = TRUE)/1e6
## [1] 831.97481774
sum(tax$StateTax, na.rm = TRUE)
## [1] 42085409.8
sum(tax$StateTax, na.rm = TRUE)/1e6
## [1] 42.0854098
Using table() or group_by and summarize(n()) or tally()
how many observations/properties are in each ward?
table(tax$Ward)
## 01 02 03 04 05 06 07 08 09 10 11 12 13
## 6613 3478 2490 1746 466 4360 4914 11445 11831 1704 3587 8874 9439
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3131 18663 11285 1454 2314 3834 11927 3182 1720 2577 6113 14865 25510
## 27 28 50
## 50215 10219 7
ward_table = tax %>%
group_by(Ward) %>%
summarize(group_by(tax,Ward),number_of_obs = n())
## # A tibble: 29 × 2
## Ward number_of_obs
##
## 1 01 6613
## 2 02 3478
## 3 03 2490
## 4 04 1746
## 5 05 466
## 6 06 4360
## 7 07 4914
## 8 08 11445
## 9 09 11831
## 10 10 1704
## # … with 19 more rows
ward_table = tax %>%
group_by(Ward) %>%
summarize(number_of_obs = n())
what is the mean state tax per ward? use group_by and summarize
tax %>% group_by(Ward) %>%
summarize(mean_state = mean(StateTax, na.rm = TRUE))
## # A tibble: 29 × 2
## Ward mean_state
##
## 1 01 291.
## 2 02 344.
## 3 03 726.
## 4 04 2327.
## 5 05 351.
## 6 06 178.
## 7 07 90.3
## 8 08 55.3
## 9 09 89.4
## 10 10 82.2
## # … with 19 more rows
what is the maximum amount still due in each ward? different summarization (max)
tax$AmountDue = tax$AmountDue %>%
str_replace(fixed(“$”), “”) %>%
as.numeric
tax = tax %>% mutate(
AmountDue = as.numeric(str_replace(AmountDue, fixed(“$”), “”))
tax %>% group_by(Ward) %>%
summarize(maxDue = max(AmountDue, na.rm = TRUE))
## Warning in max(AmountDue, na.rm = TRUE): no non-missing arguments to max;
## returning -Inf
## # A tibble: 29 × 2
## Ward maxDue
##
## 1 01 610032.
## 2 02 1459942.
## 3 03 1023666.
## 4 04 3746356.
## 5 05 948503.
## 6 06 407589.
## 7 07 1014574.
## 8 08 249133.
## 9 09 150837.
## 10 10 156216.
## # … with 19 more rows
What is the 75th percentile of city and state tax paid by Ward? (quantile)
tax %>% group_by(Ward) %>%
summarize(Percentile = quantile(StateTax, prob = 0.75,na.rm = TRUE))
## # A tibble: 29 × 2
## Ward Percentile
##
## 1 01 311.
## 2 02 321.
## 3 03 305.
## 4 04 569.
## 5 05 131.
## 6 06 195.
## 7 07 40.3
## 8 08 84.1
## 9 09 135.
## 10 10 101.
## # … with 19 more rows
ward_table = tax %>%
group_by(Ward) %>%
summarize(
number_of_obs = n(),
mean_state_tax = mean(StateTax, na.rm = TRUE),
max_amount_due = max(AmountDue, na.rm = TRUE),
q75_city = quantile(CityTax, probs = 0.75, na.rm = TRUE),
q75_state = quantile(StateTax, probs = 0.75, na.rm = TRUE)
## Warning in max(AmountDue, na.rm = TRUE): no non-missing arguments to max;
## returning -Inf
Make boxplots using showing cityTax (y -variable) by whether the property is a principal residence (x) or not.
tax = tax %>%
mutate(ResCode = str_trim(ResCode))
qplot(y = CityTax, x = ResCode, data = tax, geom = “boxplot”)
## Warning: Removed 18696 rows containing non-finite values (stat_boxplot).
qplot(y = log10(CityTax+1), x = ResCode, data = tax, geom = “boxplot”)
## Warning: Removed 18696 rows containing non-finite values (stat_boxplot).
boxplot(log10(CityTax+1) ~ ResCode, data = tax)
boxplot(CityTax ~ ResCode, data = tax)
tax %>% filter(CityTax == max(CityTax, na.rm = TRUE))
## # A tibble: 1 × 16
## PropertyID Block Lot Ward Sect PropertyAddress LotSize CityTax StateTax
##
## 1 0841 001 0841 001 21 090 1525 RUSSELL ST 3.627 ACR… 6.28e6 312992.
## # … with 7 more variables: ResCode
## # Neighborhood
## # Location
Subset the data to only retain those houses that are principal residences. which command subsets rows? Filter or select?
How many such houses are there?
pres = tax %>% filter( ResCode %in% “PRINCIPAL RESIDENCE”)
pres = tax %>% filter( ResCode == “PRINCIPAL RESIDENCE”)
## [1] 114934 16
Describe the distribution of property taxes on these residences. Use hist with certain breaks or plot(density(variable))
qplot(x = log10(CityTax+1),data = pres, geom = “histogram”)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 499 rows containing non-finite values (stat_bin).
qplot(x = CityTax, data = pres, geom = “density”)
## Warning: Removed 499 rows containing non-finite values (stat_density).
hist(log2(pres$CityTax+1))
hist(pres$CityTax)
head(pres$CityTax)
## [1] 1463.45 2861.70 1011.60 472.08 1109.01 1011.60
plot(density(pres$CityTax, na.rm = TRUE))
Read in the Salary FY2015 dataset
sal = read_csv(“data/Baltimore_City_Employee_Salaries_FY2015.csv”)
## Rows: 14017 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: “,”
## chr (7): name, JobTitle, AgencyID, Agency, HireDate, AnnualSalary, GrossPay
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Make an object called health.sal using the salaries data set, with only agencies of those with “fire” (or any forms), if any, in the name remember fixed( ignore_case = TRUE) will ignore cases
health.sal = sal %>%
filter(str_detect(JobTitle,
fixed(“fire”, ignore_case = TRUE)))
Make a data set called trans which contains only agencies that contain “TRANS”.
trans = sal %>%
filter(str_detect(JobTitle, “TRANS”))
What is/are the profession(s) of people who have “abra” in their name for Baltimore’s Salaries? Case should be ignored
filter(str_detect(name, fixed(“abra”, ignore_case = TRUE)))
## # A tibble: 12 × 7
## name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay
##
## 1 Abraham,Donta D LABORER (… B70357 DPW-So… 10/16/2… $30721.00 $32793.…
## 2 Abraham,Santhosh ACCOUNTAN… A50100 DPW-Wa… 12/31/2… $49573.00 $49104.…
## 3 Abraham, HOUSING I… A06043 Housin… 12/15/1… $50365.00 $50804.…
## 4 Abrahams, POLICE OF… A99416 Police… 01/20/2… $46199.00 $18387.…
## 5 Abrams, FFICE SE… A29008 States… 09/19/2… $35691.00 $34970.…
## 6 Abrams,Maxine COLLECTIO… A14003 FIN-Co… 08/26/2… $37359.00 $37768.…
## 7 Abrams, ECREATIO… P04001 R&P-Re… 06/19/2… $20800.00 $1690.00
## 8 Bey, ROCUREME… A17001 FIN-Pu… 04/23/2… $69600.00 $69763.…
## 9 Elgamil, AUDITOR S… A24001 COMP-A… 01/31/1… $94400.00 $95791.…
## 10 Gatto, POLICE OF… A99200 Police… 12/12/2… $66086.00 $73651.…
## 11 Schwartz, GENERAL C… A54005 FPR Ad… 07/26/1… $110400.00 $111408…
## 12 Velez, POLICE OF… A99036 Police… 10/24/2… $70282.00 $112957…
What is the distribution of annual salaries look like? (use hist, 20 breaks) What is the IQR? Hint: first convert to numeric. Try str_replace, but remember$ is “special” and you need fixed() around it.
sal = sal %>% mutate(AnnualSalary = str_replace(AnnualSalary, fixed(“$”), “”))
sal = sal %>% mutate(AnnualSalary = as.numeric(AnnualSalary))
qplot(x = AnnualSalary, data = sal, geom = “histogram”, bins = 20)
hist(sal$AnnualSalary, breaks = 20)
quantile(sal$AnnualSalary)
## 0% 25% 50% 75% 100%
## 900 33354 48126 68112 238772
Convert HireDate to the Date class – plot Annual Salary vs Hire Date. Use AnnualSalary ~ HireDate with a data = sal argument in plot or use x, y notation in scatter.smooth Use lubridate package. Is it mdy(date) or dmy(date) for this data – look at HireDate
sal = sal %>% mutate(HireDate = lubridate::mdy(HireDate))
q = qplot(y = AnnualSalary, x = HireDate,
data = sal, geom = “point”)
q + geom_smooth(colour = “red”, se = FALSE)
## `geom_smooth()` using method = ‘gam’ and formula ‘y ~ s(x, bs = “cs”)’
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Warning: Removed 10 rows containing missing values (geom_point).
q + geom_smooth(colour = “red”, se = FALSE, method = “loess”)
## `geom_smooth()` using formula ‘y ~ x’
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Removed 10 rows containing missing values (geom_point).
q + geom_smooth(colour = “red”, se = FALSE, method = “loess”, span = 2/3)
## `geom_smooth()` using formula ‘y ~ x’
## Warning: Removed 10 rows containing non-finite values (stat_smooth).
## Removed 10 rows containing missing values (geom_point).
plot(AnnualSalary ~ HireDate, data = sal)
scatter.smooth(sal$AnnualSalary, x = sal$HireDate, col = “red”)
Create a smaller dataset that only includes the Police Department, Fire Department and Sheriff’s Office. Use the Agency variable with string matching. Call this emer. How many employees are in this new dataset?
emer = sal %>% filter(
str_detect(Agency, “Sheriff’s Office|Police Department|Fire Department”)
emer = sal %>% filter(
str_detect(Agency, “Sheriff’s Office”) |
str_detect(Agency, “Police Department”) |
str_detect(Agency, “Fire Department”)
Create a variable called dept in the emer data set. dept = str_extract(Agency, “.*(ment|ice)”). E.g. want to extract all characters up until ment or ice (we can group in regex using parentheses) and then discard the rest. Replot annual salary versus hire date, color by dept (not yet – using ggplot)
emer = emer %>%
dept = str_extract(Agency, “.*(ment|ice)”)
# Replot annual salary versus hire date, color by dept (not yet – using ggplot)
ggplot(aes(x = HireDate, y = AnnualSalary,
colour = dept), data = emer) +
geom_point() + theme(legend.position = c(0.5, 0.8))
## Warning: Removed 10 rows containing missing values (geom_point).
19 (Bonus). Convert the ‘LotSize’ variable to a numeric square feet variable in the tax data set. Some tips: a) 1 acre = 43560 square feet b) The hyphens represent inches (not decimals)
tax$LotSize = str_trim(tax$LotSize) # trim to be safe
lot = tax$LotSize # for checking later
First lets take care of acres
aIndex= which(str_detect(tax$LotSize, “AC.*”) |
str_detect(tax$LotSize, fixed(” %”)))
head(aIndex)
## [1] 1 2 9 10 11 15
head(lot[aIndex])
## [1] “0.020 ACRES” “0.020 ACRES” “0.020 ACRES” “0.020 ACRES” “0.020 ACRES”
## [6] “0.020 ACRES”
acre = tax$LotSize[aIndex] # temporary variable
## find and replace character strings
acre = str_replace_all(acre, ” AC.*”,””)
acre = str_replace_all(acre, ” %”,””)
table(!is.na(as.numeric(acre)))
## Warning in table(!is.na(as.numeric(acre))): NAs introduced by coercion
## FALSE TRUE
## 237 18328
head(acre[is.na(as.numeric(acre))],50)
## Warning in head(acre[is.na(as.numeric(acre))], 50): NAs introduced by coercion
## [1] “2-158”
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com