Homework 2 Solutions
Gabriel Young (gjy2107)
February 8, 2018
Part 1
i. Since NYChousing is a .csv file I use read.csv() to import the data into R.
setwd("~/Desktop/Data")
housing <- read.csv("NYChousing.csv", as.is = TRUE)
ii. The function dim() provides the dimension of its input object.
orig_dim <- dim(housing)
orig_dim
## [1] 2506 22
iii.
apply(is.na(housing), 2, sum)
## UID PropertyName
## 0 0
## Lon Lat
## 15 15
## AgencyID Name
## 0 0
## Value Address
## 52 0
## Violations2010 REACNumber
## 0 1873
## Borough CD
## 0 0
## CityCouncilDistrict CensusTract
## 10 0
## BuildingCount UnitCount
## 0 0
## YearBuilt Owner
## 0 0
## Rental.Coop OwnerProfitStatus
## 0 0
## AffordabilityRestrictions StartAffordabilityRestrictions
## 0 5
The command is.na(housing) creates a logical matrix with the same dimensions as housing, where each element is
TRUE or FALSE depending on whether the corresponding element of housing is an NA value. The
full call apply(is.na(housing), 2, sum) then counts the number of NA values in each column of housing.
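The column-wise count can be checked on a small example. The sketch below uses a made-up data frame named toy (its values are hypothetical and do not come from NYChousing.csv) and also shows colSums(), which should give the same counts.

```r
# Toy data frame with hypothetical values (not from NYChousing.csv)
toy <- data.frame(
  Value = c(100, NA, 300, NA),
  Units = c(10, 20, NA, 40),
  Name  = c("a", "b", "c", "d")
)

# Logical matrix: TRUE marks a missing entry
na_mat <- is.na(toy)

# MARGIN = 2 applies sum() to each column; TRUE counts as 1
na_by_apply <- apply(na_mat, 2, sum)

# colSums() performs the same column-wise sum directly
na_by_colsums <- colSums(na_mat)

print(na_by_apply)
print(na_by_colsums)
```

Both calls return a vector named by column, so the counts can be matched to variables by name.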
iv.
housing <- housing[!is.na(housing$Value), ]
The call is.na(housing$Value) returns a logical vector that is TRUE where housing$Value is NA. I therefore
filter with !is.na(housing$Value) to keep only the rows where Value is not NA, and reassign the filtered
dataframe to housing.
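The same row filter can be sketched on a small, made-up data frame (toy and its values are hypothetical). The comma after the logical vector matters: it tells R to select rows and keep all columns.

```r
# Toy data frame with hypothetical values (not from NYChousing.csv)
toy <- data.frame(
  Value = c(100, NA, 300, NA),
  Units = c(10, 20, 30, 40)
)

# TRUE for rows to keep, i.e. rows where Value is not missing
keep <- !is.na(toy$Value)

# Logical row indexing: rows where keep is TRUE, all columns
toy_clean <- toy[keep, ]

# The number of rows dropped should equal the number of NA values in Value
rows_removed <- nrow(toy) - nrow(toy_clean)
print(rows_removed)
```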
v.
new_dim <- dim(housing)
orig_dim[1] - new_dim[1]
## [1] 52
I removed 52 rows from my dataframe, which is what I expect, since my output in (iii) showed 52
missing values in Value.
vi.
housing$logValue <- log(housing$Value)
summary(housing$logValue)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.41 12.49 13.75 13.68 14.80 20.47
vii.
housing$logUnits <- log(housing$UnitCount)
viii.
housing$after1950 <- housing$YearBuilt >= 1950
Part 2: EDA
i.
plot(housing$logUnits, housing$logValue, xlab = "log(Units)", ylab = "Log(Value)")
[Figure: scatterplot of Log(Value) (y-axis) against log(Units) (x-axis).]
I draw the scatterplot with plot() and set the axis labels with the xlab and ylab arguments.
ii.
plot(housing$logUnits, housing$logValue, col = factor(housing$after1950), xlab = "log(Units)", ylab = "Log(Value)")
# Legend colors follow the factor's level order, matching the integer codes used by col
legend("bottomright", legend = levels(factor(housing$after1950)), fill = seq_along(levels(factor(housing$after1950))))
[Figure: scatterplot of Log(Value) against log(Units), colored by after1950, with legend entries FALSE and TRUE.]
There appears to be a strong positive linear relationship between logValue and logUnits. When the points are colored
according to the after1950 variable, it is clear that newer buildings (those built in 1950 or later) tend to be more
expensive and have more units than older buildings.
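The coloring works because base graphics treat a factor passed to col as its integer codes, which index the current color palette. A minimal sketch with a made-up logical vector (the values are hypothetical) shows the mapping and why legend colors should follow the level order:

```r
# Hypothetical logical vector standing in for housing$after1950
after1950 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

f <- factor(after1950)

# Levels are sorted, so "FALSE" is level 1 and "TRUE" is level 2
print(levels(f))

# Integer codes used as color indices, one per observation
codes <- as.integer(f)
print(codes)

# Colors for a legend, in the same order as levels(f)
legend_cols <- seq_along(levels(f))

# The palette entries each point would receive
# (the actual color names depend on the R version and palette settings)
print(palette()[codes])
```

Because the codes follow the sorted levels rather than the order in which values first appear, pairing levels(f) with seq_along(levels(f)) keeps the legend consistent with the plotted colors.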
iii.
cor(housing$logValue, housing$logUnits)
## [1] 0.8727348
cor(housing$logValue[housing$Borough == "Manhattan"], housing$logUnits[housing$Borough == "Manhattan"])
## [1] 0.8830348
cor(housing$logValue[housing$Borough == "Brooklyn"], housing$logUnits[housing$Borough == "Brooklyn"])
## [1] 0.9102601
cor(housing$logValue[housing$after1950], housing$logUnits[housing$after1950])
## [1] 0.721735
cor(housing$logValue[!housing$after1950], housing$logUnits[!housing$after1950])
## [1] 0.8643297
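Each subgroup correlation above follows one pattern: subset both vectors with the same logical condition, then call cor(). The sketch below repeats that pattern on made-up data (toy, x, y and g are hypothetical names) and shows one possible way to compute the correlation for every group at once with split() and sapply().

```r
# Toy data with hypothetical values: y rises with x in group A, falls in group B
toy <- data.frame(
  x = c(1, 2, 3, 4, 1, 2, 3, 4),
  y = c(2, 4, 6, 8, 4, 3, 2, 1),
  g = rep(c("A", "B"), each = 4)
)

# Same logical condition applied to both vectors, as in the calls above
in_A <- toy$g == "A"
cor_A <- cor(toy$y[in_A], toy$x[in_A])

# Alternative: split the rows by group, then compute cor() within each piece
cor_by_group <- sapply(split(toy, toy$g), function(d) cor(d$y, d$x))

print(cor_A)
print(cor_by_group)
```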
iv.
plot(housing$logUnits[housing$Borough == "Manhattan"], housing$logValue[housing$Borough == "Manhattan"], xlab = "log(Units)", ylab = "Log(Value)")
points(housing$logUnits[housing$Borough == "Brooklyn"], housing$logValue[housing$Borough == "Brooklyn"], col = "red", pch = "+")
legend("bottomright", legend = c("Manhattan", "Brooklyn"), fill = c("black", "red"))
[Figure: scatterplot of Log(Value) against log(Units) for Manhattan (black circles) and Brooklyn (red "+" symbols), with a legend for the two boroughs.]
v.
median(housing$Value[housing$Borough == "Manhattan"])
## [1] 1172362
The code calculates the median property value for all properties in Manhattan.
vi.
boxplot(housing$logValue ~ housing$Borough)
[Figure: side-by-side boxplots of logValue for Bronx, Brooklyn, Manhattan, Queens, and Staten Island.]
vii.
tapply(housing$Value, housing$Borough, median)
## Bronx Brooklyn Manhattan Queens Staten Island
## 1192950 417610 1172362 3611700 2654100
We use tapply(), which splits the property values into groups by Borough and then calculates the
median within each group.
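The split-then-summarize behavior of tapply() can be seen on a small, made-up example (the vectors value and borough below are hypothetical). The second computation spells out the same steps with split() and sapply().

```r
# Hypothetical values and group labels (not from NYChousing.csv)
value   <- c(100, 300, 200, 50, 70, 90, 400)
borough <- c("Bronx", "Bronx", "Bronx", "Queens", "Queens", "Queens", "Queens")

# tapply(): group value by borough, then take the median of each group
med <- tapply(value, borough, median)

# The same two steps written out: split into a list, then summarize each piece
med_split <- sapply(split(value, borough), median)

print(med)
print(med_split)
```

Here the Bronx group has three values, so its median is the middle one; the Queens group has four, so its median is the average of the two middle values.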