Data Management and Exploratory Analysis in R
Davide Benedetti
Basic Functionalities
Let’s import a dataset and perform some analysis on it. We can use the read.table family of functions to load a variety of standard data formats, including txt and csv.
?read.table # to get more info on function read.table
mydata <- read.csv("data/eaef21.csv") # load the dataset eaef21.csv into R
class(mydata) # class will show us the type of object 'mydata'
## [1] "data.frame"
str(mydata) # str() gives us the structure,
## 'data.frame': 540 obs. of 51 variables:
## $ ID : int 5531 2658 5365 4468 3142 2170 2344 4583 2517 3563 ...
## $ FEMALE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MALE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ ETHBLACK: int 0 0 0 0 0 1 0 0 0 0 ...
## $ ETHHISP : int 0 1 0 0 0 0 0 0 0 0 ...
## $ ETHWHITE: int 1 0 1 1 1 0 1 1 1 1 ...
## $ AGE : int 45 40 38 43 38 39 40 37 39 42 ...
## $ S : int 12 12 15 13 18 16 14 12 12 12 ...
## $ EDUCPROF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ EDUCPHD : int 0 0 0 0 0 0 0 0 0 0 ...
## $ EDUCMAST: int 0 0 0 0 1 0 0 0 0 0 ...
## $ EDUCBA : int 0 0 0 0 0 0 0 0 0 0 ...
## $ EDUCAA : int 0 0 1 0 0 0 1 0 0 0 ...
## $ EDUCHSD : int 1 0 0 1 0 0 0 1 1 1 ...
## $ EDUCDO : int 0 1 0 0 0 0 0 0 0 0 ...
## $ SINGLE : int 0 0 0 0 0 0 1 1 0 0 ...
## $ MARRIED : int 1 0 0 1 1 1 0 0 1 1 ...
## $ DIVORCED: int 0 1 1 0 0 0 0 0 0 0 ...
## $ FAITHN : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FAITHP : int 1 0 0 1 1 1 1 1 1 1 ...
## $ FAITHC : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FAITHJ : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FAITHO : int 0 0 1 0 0 0 0 0 0 0 ...
## $ ASVAB01 : int 58 32 42 66 64 62 50 56 60 46 ...
## $ ASVAB02 : int 64 39 40 55 65 53 32 62 59 57 ...
## $ ASVAB03 : int 52 29 37 59 61 51 44 52 55 28 ...
## $ ASVAB04 : int 56 29 38 53 62 59 38 44 62 32 ...
## $ ASVAB05 : int 54 27 42 51 47 58 33 41 58 47 ...
## $ ASVAB06 : int 56 22 45 52 34 45 29 47 46 46 ...
## $ ASVABC : num 60.9 33.6 38.8 57.1 65.5 ...
## $ HEIGHT : int 67 67 69 72 76 69 65 69 72 70 ...
## $ WEIGHT85: int 160 185 135 200 185 180 150 125 195 180 ...
## $ WEIGHT02: int 200 205 185 250 220 215 145 140 240 210 ...
## $ SM : int 8 5 11 12 16 12 12 8 12 10 ...
## $ SF : int 8 5 12 16 20 12 12 13 12 8 ...
## $ SIBLINGS: int 11 3 3 2 1 2 1 10 3 9 ...
## $ LIBRARY : int 0 0 1 1 1 0 1 1 1 1 ...
## $ POV78 : int 0 1 0 0 0 0 0 0 NA 0 ...
## $ EXP : num 22.4 8.9 13.2 18.2 13.8 ...
## $ EARNINGS: num 53.4 8 24 29.5 32 ...
## $ HOURS : int 45 40 40 40 54 40 48 40 40 40 ...
## $ TENURE : num 2.75 2.385 5.75 6.135 0.827 ...
## $ COLLBARG: int 0 0 1 1 0 0 0 0 1 1 ...
## $ CATGOV : int 0 0 0 0 1 0 0 0 1 0 ...
## $ CATPRI : int 1 1 1 1 0 1 1 1 0 1 ...
## $ CATSE : int 0 0 0 0 0 0 0 0 0 0 ...
## $ URBAN : int 0 0 1 1 1 1 1 1 1 1 ...
## $ REGNE : int 0 0 0 0 1 0 0 0 0 0 ...
## $ REGNC : int 0 0 1 0 0 0 0 0 1 1 ...
## $ REGW : int 0 0 0 1 0 0 0 1 0 0 ...
## $ REGS : int 1 1 0 0 0 1 1 0 0 0 ...
# i.e. the elements within 'mydata' and how to access them
# (e.g., with $ or @)
head(mydata) # shows the first 6 rows of your dataset
## ID FEMALE MALE ETHBLACK ETHHISP ETHWHITE AGE S EDUCPROF EDUCPHD
## 1 5531 0 1 0 0 1 45 12 0 0
## 2 2658 0 1 0 1 0 40 12 0 0
## 3 5365 0 1 0 0 1 38 15 0 0
## 4 4468 0 1 0 0 1 43 13 0 0
## 5 3142 0 1 0 0 1 38 18 0 0
## 6 2170 0 1 1 0 0 39 16 0 0
## EDUCMAST EDUCBA EDUCAA EDUCHSD EDUCDO SINGLE MARRIED DIVORCED FAITHN
## 1 0 0 0 1 0 0 1 0 0
## 2 0 0 0 0 1 0 0 1 1
## 3 0 0 1 0 0 0 0 1 0
## 4 0 0 0 1 0 0 1 0 0
## 5 1 0 0 0 0 0 1 0 0
## 6 0 0 0 0 0 0 1 0 0
## FAITHP FAITHC FAITHJ FAITHO ASVAB01 ASVAB02 ASVAB03 ASVAB04 ASVAB05
## 1 1 0 0 0 58 64 52 56 54
## 2 0 0 0 0 32 39 29 29 27
## 3 0 0 0 1 42 40 37 38 42
## 4 1 0 0 0 66 55 59 53 51
## 5 1 0 0 0 64 65 61 62 47
## 6 1 0 0 0 62 53 51 59 58
## ASVAB06 ASVABC HEIGHT WEIGHT85 WEIGHT02 SM SF SIBLINGS LIBRARY POV78
## 1 56 60.89985 67 160 200 8 8 11 0 0
## 2 22 33.63790 67 185 205 5 5 3 0 1
## 3 45 38.81767 69 135 185 11 12 3 1 0
## 4 52 57.08318 72 200 250 12 16 2 1 0
## 5 34 65.53439 76 185 220 16 20 1 1 0
## 6 45 55.44746 69 180 215 12 12 2 0 0
## EXP EARNINGS HOURS TENURE COLLBARG CATGOV CATPRI CATSE URBAN
## 1 22.384615 53.41 45 2.7500000 0 0 1 0 0
## 2 8.903846 8.00 40 2.3846154 0 0 1 0 0
## 3 13.250000 24.00 40 5.7500000 1 0 1 0 1
## 4 18.250000 29.50 40 6.1346154 1 0 1 0 1
## 5 13.769231 32.05 54 0.8269231 0 1 0 0 1
## 6 11.692307 14.73 40 4.2884617 0 0 1 0 1
## REGNE REGNC REGW REGS
## 1 0 0 0 1
## 2 0 0 0 1
## 3 0 1 0 0
## 4 0 0 1 0
## 5 1 0 0 0
## 6 0 0 0 1
tail(mydata) # shows the last 6 rows of your dataset
## ID FEMALE MALE ETHBLACK ETHHISP ETHWHITE AGE S EDUCPROF EDUCPHD
## 535 5614 1 0 1 0 0 37 12 0 0
## 536 682 1 0 1 0 0 39 12 0 0
## 537 1564 1 0 0 0 1 42 14 0 0
## 538 4373 1 0 0 0 1 38 12 0 0
## 539 1444 1 0 0 1 0 40 13 0 0
## 540 1056 1 0 0 0 1 42 12 0 0
## EDUCMAST EDUCBA EDUCAA EDUCHSD EDUCDO SINGLE MARRIED DIVORCED FAITHN
## 535 0 0 0 1 0 0 1 0 0
## 536 0 0 0 1 0 1 0 0 0
## 537 0 1 0 0 0 0 1 0 0
## 538 0 0 0 1 0 0 1 0 0
## 539 0 0 0 1 0 1 0 0 0
## 540 0 0 0 1 0 0 0 1 0
## FAITHP FAITHC FAITHJ FAITHO ASVAB01 ASVAB02 ASVAB03 ASVAB04 ASVAB05
## 535 1 0 0 0 40 34 41 41 32
## 536 1 0 0 0 44 38 46 50 51
## 537 0 0 0 1 58 58 60 62 52
## 538 1 0 0 0 36 40 44 53 43
## 539 0 1 0 0 48 46 43 62 53
## 540 0 0 0 1 48 34 38 50 44
## ASVAB06 ASVABC HEIGHT WEIGHT85 WEIGHT02 SM SF SIBLINGS LIBRARY POV78
## 535 37 37.45457 60 125 175 12 12 7 0 1
## 536 50 43.45220 68 140 190 9 8 12 1 1
## 537 56 61.44509 67 125 155 13 14 2 1 0
## 538 50 44.81530 62 124 125 12 20 2 0 0
## 539 58 50.26769 62 130 190 10 11 2 1 0
## 540 54 39.09029 62 97 95 11 12 2 1 0
## EXP EARNINGS HOURS TENURE COLLBARG CATGOV CATPRI CATSE URBAN
## 535 14.942307 9.75 40 0.8076923 0 0 1 0 1
## 536 7.307693 9.50 40 7.7307692 0 1 0 0 1
## 537 19.288462 17.30 40 5.3269229 0 0 1 0 1
## 538 21.230770 27.52 51 10.4615383 0 0 1 0 1
## 539 18.076923 3.88 45 1.1346154 0 0 1 0 1
## 540 18.807692 11.13 40 6.3076925 1 0 1 0 1
## REGNE REGNC REGW REGS
## 535 0 0 0 1
## 536 0 1 0 0
## 537 0 0 1 0
## 538 0 0 1 0
## 539 0 0 0 1
## 540 0 1 0 0
summary(mydata) # gives you a simple summary of your dataset (not very nice looking)
## ID FEMALE MALE ETHBLACK
## Min. : 18 Min. :0.0 Min. :0.0 Min. :0.0000
## 1st Qu.: 1525 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.0000
## Median : 2788 Median :0.5 Median :0.5 Median :0.0000
## Mean : 3330 Mean :0.5 Mean :0.5 Mean :0.1056
## 3rd Qu.: 4399 3rd Qu.:1.0 3rd Qu.:1.0 3rd Qu.:0.0000
## Max. :12110 Max. :1.0 Max. :1.0 Max. :1.0000
##
## ETHHISP ETHWHITE AGE S
## Min. :0.00000 Min. :0.0000 Min. :37.00 Min. : 7.00
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:39.00 1st Qu.:12.00
## Median :0.00000 Median :1.0000 Median :41.00 Median :13.00
## Mean :0.05185 Mean :0.8426 Mean :40.92 Mean :13.67
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:43.00 3rd Qu.:16.00
## Max. :1.00000 Max. :1.0000 Max. :45.00 Max. :20.00
##
## EDUCPROF EDUCPHD EDUCMAST EDUCBA
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.0000
## Mean :0.007407 Mean :0.001852 Mean :0.05185 Mean :0.1944
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.0000
##
## EDUCAA EDUCHSD EDUCDO SINGLE
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000 Median :0.00000 Median :0.0000
## Mean :0.08889 Mean :0.5481 Mean :0.07778 Mean :0.1481
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## MARRIED DIVORCED FAITHN FAITHP
## Min. :0.0000 Min. :0.0000 Min. :-3.0000 Min. :-3.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0000 Median :0.0000 Median : 0.0000 Median : 1.0000
## Mean :0.6574 Mean :0.1944 Mean : 0.0463 Mean : 0.4963
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
## Max. :1.0000 Max. :1.0000 Max. : 1.0000 Max. : 1.0000
##
## FAITHC FAITHJ FAITHO ASVAB01
## Min. :-3.0000 Min. :-3.000000 Min. :-3.0000 Min. :22
## 1st Qu.: 0.0000 1st Qu.: 0.000000 1st Qu.: 0.0000 1st Qu.:44
## Median : 0.0000 Median : 0.000000 Median : 0.0000 Median :50
## Mean : 0.3204 Mean :-0.003704 Mean : 0.1111 Mean :50
## 3rd Qu.: 1.0000 3rd Qu.: 0.000000 3rd Qu.: 0.0000 3rd Qu.:56
## Max. : 1.0000 Max. : 1.000000 Max. : 1.0000 Max. :68
##
## ASVAB02 ASVAB03 ASVAB04 ASVAB05
## Min. :31.00 Min. :24.00 Min. :20.00 Min. :20.00
## 1st Qu.:42.00 1st Qu.:44.00 1st Qu.:47.00 1st Qu.:46.00
## Median :50.00 Median :52.00 Median :53.00 Median :52.00
## Mean :50.48 Mean :49.71 Mean :50.36 Mean :50.78
## 3rd Qu.:58.00 3rd Qu.:57.00 3rd Qu.:56.00 3rd Qu.:59.00
## Max. :66.00 Max. :61.00 Max. :62.00 Max. :62.00
##
## ASVAB06 ASVABC HEIGHT WEIGHT85
## Min. :22.00 Min. :25.46 Min. :48.00 Min. : 95.0
## 1st Qu.:45.00 1st Qu.:44.54 1st Qu.:64.00 1st Qu.:129.0
## Median :51.00 Median :52.72 Median :67.00 Median :155.0
## Mean :50.22 Mean :51.36 Mean :67.67 Mean :156.7
## 3rd Qu.:57.00 3rd Qu.:58.72 3rd Qu.:71.00 3rd Qu.:178.0
## Max. :72.00 Max. :66.08 Max. :80.00 Max. :322.0
##
## WEIGHT02 SM SF SIBLINGS
## Min. : 95 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.:150 1st Qu.:11.00 1st Qu.:10.00 1st Qu.: 2.000
## Median :180 Median :12.00 Median :12.00 Median : 3.000
## Mean :184 Mean :11.58 Mean :11.84 Mean : 3.274
## 3rd Qu.:210 3rd Qu.:12.00 3rd Qu.:14.00 3rd Qu.: 4.000
## Max. :400 Max. :20.00 Max. :20.00 Max. :13.000
##
## LIBRARY POV78 EXP EARNINGS
## Min. :0.0000 Min. :0.0000 Min. : 1.154 Min. : 2.13
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:14.596 1st Qu.: 10.76
## Median :1.0000 Median :0.0000 Median :17.510 Median : 16.00
## Mean :0.7796 Mean :0.1306 Mean :16.900 Mean : 19.64
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:20.197 3rd Qu.: 23.16
## Max. :1.0000 Max. :1.0000 Max. :23.558 Max. :120.19
## NA's :27
## HOURS TENURE COLLBARG CATGOV
## Min. :10.00 Min. : 0.01923 Min. :0.000 Min. :0.0000
## 1st Qu.:40.00 1st Qu.: 1.93750 1st Qu.:0.000 1st Qu.:0.0000
## Median :40.00 Median : 4.69231 Median :0.000 Median :0.0000
## Mean :40.53 Mean : 7.03397 Mean :0.187 Mean :0.2278
## 3rd Qu.:45.00 3rd Qu.:10.98077 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :60.00 Max. :24.94231 Max. :1.000 Max. :1.0000
##
## CATPRI CATSE URBAN REGNE
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.00000 Median :1.0000 Median :0.0000
## Mean :0.7537 Mean :0.01852 Mean :0.7241 Mean :0.1426
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :2.0000 Max. :1.0000
##
## REGNC REGW REGS
## Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.000 Median :0.0000 Median :0.000
## Mean :0.337 Mean :0.1574 Mean :0.363
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :1.000 Max. :1.0000 Max. :1.000
##
nrow(mydata) # gives the number of rows
## [1] 540
ncol(mydata) # gives the number of columns
## [1] 51
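Both dimensions can also be retrieved at once with dim(), which returns rows and columns together.

```r
dim(mydata) # returns the number of rows and columns as a vector
## [1] 540  51
```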
The object mydata is a data.frame, i.e. a dataset with named columns. We can perform operations by rows or by columns with the apply function.
# this will apply function class to each column of the dataset
apply(mydata, # the dataset to work on
MARGIN = 2, # 1 to work by rows and 2 by columns
FUN = class) # the function to apply
## ID FEMALE MALE ETHBLACK ETHHISP ETHWHITE AGE
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## S EDUCPROF EDUCPHD EDUCMAST EDUCBA EDUCAA EDUCHSD
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## EDUCDO SINGLE MARRIED DIVORCED FAITHN FAITHP FAITHC
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## FAITHJ FAITHO ASVAB01 ASVAB02 ASVAB03 ASVAB04 ASVAB05
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## ASVAB06 ASVABC HEIGHT WEIGHT85 WEIGHT02 SM SF
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## SIBLINGS LIBRARY POV78 EXP EARNINGS HOURS TENURE
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## COLLBARG CATGOV CATPRI CATSE URBAN REGNE REGNC
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## REGW REGS
## "numeric" "numeric"
R can also load data from other sources. For example, the package foreign can read Stata datasets.
## install packages foreign
if(!require(foreign)){install.packages("foreign")}
library(foreign)
# We can now read STATA file using the function read.dta
?read.dta
# Let's load the dataset
mydata_new <- read.dta("data/eaef21.dta")
# head and summary functions give an overview of what's inside your dataset
head(mydata_new)
## ID FEMALE MALE ETHBLACK ETHHISP ETHWHITE AGE S EDUCPROF EDUCPHD
## 1 5531 0 1 0 0 1 45 12 0 0
## 2 2658 0 1 0 1 0 40 12 0 0
## 3 5365 0 1 0 0 1 38 15 0 0
## 4 4468 0 1 0 0 1 43 13 0 0
## 5 3142 0 1 0 0 1 38 18 0 0
## 6 2170 0 1 1 0 0 39 16 0 0
## EDUCMAST EDUCBA EDUCAA EDUCHSD EDUCDO SINGLE MARRIED DIVORCED FAITHN
## 1 0 0 0 1 0 0 1 0 0
## 2 0 0 0 0 1 0 0 1 1
## 3 0 0 1 0 0 0 0 1 0
## 4 0 0 0 1 0 0 1 0 0
## 5 1 0 0 0 0 0 1 0 0
## 6 0 0 0 0 0 0 1 0 0
## FAITHP FAITHC FAITHJ FAITHO ASVAB01 ASVAB02 ASVAB03 ASVAB04 ASVAB05
## 1 1 0 0 0 58 64 52 56 54
## 2 0 0 0 0 32 39 29 29 27
## 3 0 0 0 1 42 40 37 38 42
## 4 1 0 0 0 66 55 59 53 51
## 5 1 0 0 0 64 65 61 62 47
## 6 1 0 0 0 62 53 51 59 58
## ASVAB06 ASVABC HEIGHT WEIGHT85 WEIGHT02 SM SF SIBLINGS LIBRARY POV78
## 1 56 60.89985 67 160 200 8 8 11 0 0
## 2 22 33.63790 67 185 205 5 5 3 0 1
## 3 45 38.81767 69 135 185 11 12 3 1 0
## 4 52 57.08318 72 200 250 12 16 2 1 0
## 5 34 65.53439 76 185 220 16 20 1 1 0
## 6 45 55.44746 69 180 215 12 12 2 0 0
## EXP EARNINGS HOURS TENURE COLLBARG CATGOV CATPRI CATSE URBAN
## 1 22.384615 53.41 45 2.7500000 0 0 1 0 0
## 2 8.903846 8.00 40 2.3846154 0 0 1 0 0
## 3 13.250000 24.00 40 5.7500000 1 0 1 0 1
## 4 18.250000 29.50 40 6.1346154 1 0 1 0 1
## 5 13.769231 32.05 54 0.8269231 0 1 0 0 1
## 6 11.692307 14.73 40 4.2884617 0 0 1 0 1
## REGNE REGNC REGW REGS
## 1 0 0 0 1
## 2 0 0 0 1
## 3 0 1 0 0
## 4 0 0 1 0
## 5 1 0 0 0
## 6 0 0 0 1
summary(mydata_new)
## ID FEMALE MALE ETHBLACK
## Min. : 18 Min. :0.0 Min. :0.0 Min. :0.0000
## 1st Qu.: 1525 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.0000
## Median : 2788 Median :0.5 Median :0.5 Median :0.0000
## Mean : 3330 Mean :0.5 Mean :0.5 Mean :0.1056
## 3rd Qu.: 4399 3rd Qu.:1.0 3rd Qu.:1.0 3rd Qu.:0.0000
## Max. :12110 Max. :1.0 Max. :1.0 Max. :1.0000
##
## ETHHISP ETHWHITE AGE S
## Min. :0.00000 Min. :0.0000 Min. :37.00 Min. : 7.00
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:39.00 1st Qu.:12.00
## Median :0.00000 Median :1.0000 Median :41.00 Median :13.00
## Mean :0.05185 Mean :0.8426 Mean :40.92 Mean :13.67
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:43.00 3rd Qu.:16.00
## Max. :1.00000 Max. :1.0000 Max. :45.00 Max. :20.00
##
## EDUCPROF EDUCPHD EDUCMAST EDUCBA
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.0000
## Mean :0.007407 Mean :0.001852 Mean :0.05185 Mean :0.1944
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.0000
##
## EDUCAA EDUCHSD EDUCDO SINGLE
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000 Median :0.00000 Median :0.0000
## Mean :0.08889 Mean :0.5481 Mean :0.07778 Mean :0.1481
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## MARRIED DIVORCED FAITHN FAITHP
## Min. :0.0000 Min. :0.0000 Min. :-3.0000 Min. :-3.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0000 Median :0.0000 Median : 0.0000 Median : 1.0000
## Mean :0.6574 Mean :0.1944 Mean : 0.0463 Mean : 0.4963
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
## Max. :1.0000 Max. :1.0000 Max. : 1.0000 Max. : 1.0000
##
## FAITHC FAITHJ FAITHO ASVAB01
## Min. :-3.0000 Min. :-3.000000 Min. :-3.0000 Min. :22
## 1st Qu.: 0.0000 1st Qu.: 0.000000 1st Qu.: 0.0000 1st Qu.:44
## Median : 0.0000 Median : 0.000000 Median : 0.0000 Median :50
## Mean : 0.3204 Mean :-0.003704 Mean : 0.1111 Mean :50
## 3rd Qu.: 1.0000 3rd Qu.: 0.000000 3rd Qu.: 0.0000 3rd Qu.:56
## Max. : 1.0000 Max. : 1.000000 Max. : 1.0000 Max. :68
##
## ASVAB02 ASVAB03 ASVAB04 ASVAB05
## Min. :31.00 Min. :24.00 Min. :20.00 Min. :20.00
## 1st Qu.:42.00 1st Qu.:44.00 1st Qu.:47.00 1st Qu.:46.00
## Median :50.00 Median :52.00 Median :53.00 Median :52.00
## Mean :50.48 Mean :49.71 Mean :50.36 Mean :50.78
## 3rd Qu.:58.00 3rd Qu.:57.00 3rd Qu.:56.00 3rd Qu.:59.00
## Max. :66.00 Max. :61.00 Max. :62.00 Max. :62.00
##
## ASVAB06 ASVABC HEIGHT WEIGHT85
## Min. :22.00 Min. :25.46 Min. :48.00 Min. : 95.0
## 1st Qu.:45.00 1st Qu.:44.54 1st Qu.:64.00 1st Qu.:129.0
## Median :51.00 Median :52.72 Median :67.00 Median :155.0
## Mean :50.22 Mean :51.36 Mean :67.67 Mean :156.7
## 3rd Qu.:57.00 3rd Qu.:58.72 3rd Qu.:71.00 3rd Qu.:178.0
## Max. :72.00 Max. :66.08 Max. :80.00 Max. :322.0
##
## WEIGHT02 SM SF SIBLINGS
## Min. : 95 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.:150 1st Qu.:11.00 1st Qu.:10.00 1st Qu.: 2.000
## Median :180 Median :12.00 Median :12.00 Median : 3.000
## Mean :184 Mean :11.58 Mean :11.84 Mean : 3.274
## 3rd Qu.:210 3rd Qu.:12.00 3rd Qu.:14.00 3rd Qu.: 4.000
## Max. :400 Max. :20.00 Max. :20.00 Max. :13.000
##
## LIBRARY POV78 EXP EARNINGS
## Min. :0.0000 Min. :0.0000 Min. : 1.154 Min. : 2.13
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:14.596 1st Qu.: 10.76
## Median :1.0000 Median :0.0000 Median :17.510 Median : 16.00
## Mean :0.7796 Mean :0.1306 Mean :16.900 Mean : 19.64
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:20.197 3rd Qu.: 23.16
## Max. :1.0000 Max. :1.0000 Max. :23.558 Max. :120.19
## NA's :27
## HOURS TENURE COLLBARG CATGOV
## Min. :10.00 Min. : 0.01923 Min. :0.000 Min. :0.0000
## 1st Qu.:40.00 1st Qu.: 1.93750 1st Qu.:0.000 1st Qu.:0.0000
## Median :40.00 Median : 4.69231 Median :0.000 Median :0.0000
## Mean :40.53 Mean : 7.03397 Mean :0.187 Mean :0.2278
## 3rd Qu.:45.00 3rd Qu.:10.98077 3rd Qu.:0.000 3rd Qu.:0.0000
## Max. :60.00 Max. :24.94231 Max. :1.000 Max. :1.0000
##
## CATPRI CATSE URBAN REGNE
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.00000 Median :1.0000 Median :0.0000
## Mean :0.7537 Mean :0.01852 Mean :0.7241 Mean :0.1426
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :2.0000 Max. :1.0000
##
## REGNC REGW REGS
## Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.000 Median :0.0000 Median :0.000
## Mean :0.337 Mean :0.1574 Mean :0.363
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :1.000 Max. :1.0000 Max. :1.000
##
# remove R object from environment
rm(mydata_new)
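One caveat: foreign::read.dta only supports Stata file formats up to version 12. If a .dta file was saved by a newer Stata release, the haven package (part of the tidyverse) can read it instead; the sketch below assumes the same file path used above.

```r
# alternative reader for newer Stata file versions
if(!require(haven)){install.packages("haven")}
library(haven)
mydata_new <- read_dta("data/eaef21.dta") # returns a tibble
```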
We can show some more summary statistics with the package stargazer.
## install packages stargazer
if(!require(stargazer)){install.packages("stargazer")}
library(stargazer)
stargazer(mydata, type = "text", iqr = T, median = T)
##
## ==========================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
## --------------------------------------------------------------------------
## ID 540 3,330.191 2,684.020 18 1,525 2,787.5 4,398.8 12,110
## FEMALE 540 0.500 0.500 0 0 0.5 1 1
## MALE 540 0.500 0.500 0 0 0.5 1 1
## ETHBLACK 540 0.106 0.308 0 0 0 0 1
## ETHHISP 540 0.052 0.222 0 0 0 0 1
## ETHWHITE 540 0.843 0.365 0 1 1 1 1
## AGE 540 40.919 2.323 37 39 41 43 45
## S 540 13.672 2.438 7 12 13 16 20
## EDUCPROF 540 0.007 0.086 0 0 0 0 1
## EDUCPHD 540 0.002 0.043 0 0 0 0 1
## EDUCMAST 540 0.052 0.222 0 0 0 0 1
## EDUCBA 540 0.194 0.396 0 0 0 0 1
## EDUCAA 540 0.089 0.285 0 0 0 0 1
## EDUCHSD 540 0.548 0.498 0 0 1 1 1
## EDUCDO 540 0.078 0.268 0 0 0 0 1
## SINGLE 540 0.148 0.356 0 0 0 0 1
## MARRIED 540 0.657 0.475 0 0 1 1 1
## DIVORCED 540 0.194 0.396 0 0 0 0 1
## FAITHN 540 0.046 0.258 -3 0 0 0 1
## FAITHP 540 0.496 0.522 -3 0 1 1 1
## FAITHC 540 0.320 0.490 -3 0 0 1 1
## FAITHJ 540 -0.004 0.136 -3 0 0 0 1
## FAITHO 540 0.111 0.348 -3 0 0 0 1
## ASVAB01 540 49.996 9.540 22 44 50 56 68
## ASVAB02 540 50.476 9.808 31 42 50 58 66
## ASVAB03 540 49.709 9.376 24 44 52 57 61
## ASVAB04 540 50.356 9.677 20 47 53 56 62
## ASVAB05 540 50.780 9.470 20 46 52 59 62
## ASVAB06 540 50.222 9.530 22 45 51 57 72
## ASVABC 540 51.363 9.568 25.459 44.543 52.721 58.719 66.080
## HEIGHT 540 67.674 4.238 48 64 67 71 80
## WEIGHT85 540 156.748 34.880 95 129 155 178 322
## WEIGHT02 540 184.013 44.828 95 150 180 210 400
## SM 540 11.580 2.816 0 11 12 12 20
## SF 540 11.837 3.537 0 10 12 14 20
## SIBLINGS 540 3.274 2.188 0 2 3 4 13
## LIBRARY 540 0.780 0.415 0 1 1 1 1
## POV78 513 0.131 0.337 0 0 0 0 1
## EXP 540 16.900 4.433 1.154 14.596 17.510 20.197 23.558
## EARNINGS 540 19.636 14.416 2.130 10.758 16.000 23.155 120.190
## HOURS 540 40.531 9.186 10 40 40 45 60
## TENURE 540 7.034 6.427 0.019 1.937 4.692 10.981 24.942
## COLLBARG 540 0.187 0.390 0 0 0 0 1
## CATGOV 540 0.228 0.420 0 0 0 0 1
## CATPRI 540 0.754 0.431 0 1 1 1 1
## CATSE 540 0.019 0.135 0 0 0 0 1
## URBAN 540 0.724 0.483 0 0 1 1 2
## REGNE 540 0.143 0.350 0 0 0 0 1
## REGNC 540 0.337 0.473 0 0 0 1 1
## REGW 540 0.157 0.365 0 0 0 0 1
## REGS 540 0.363 0.481 0 0 0 1 1
## --------------------------------------------------------------------------
Tidyverse
The tidyverse is an R meta-package that bundles several packages for data management and plotting, including dplyr, tidyr, and ggplot2. It allows us to manipulate data easily. We can summarize the data with summarize.
if(!require(tidyverse)){install.packages("tidyverse")}
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.2
## v tibble 2.1.1 v dplyr 0.8.1
## v tidyr 0.8.1 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyverse) # Package for data manipulation
# the following code gives the same result as
# mean(mydata$EARNINGS)
mydata %>%
summarize(mean(EARNINGS))
## mean(EARNINGS)
## 1 19.63622
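summarize() can also compute several named statistics in one call; the column names below are our own choice:

```r
mydata %>%
  summarize(mean_earn = mean(EARNINGS), # average hourly earnings
            sd_earn   = sd(EARNINGS),   # their standard deviation
            n         = n())            # number of observations
```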
Create new variables with mutate.
# create and add a new variable
mydata %>%
mutate(EDUCUNI = EDUCBA + EDUCMAST) %>%
head()
## ID FEMALE MALE ETHBLACK ETHHISP ETHWHITE AGE S EDUCPROF EDUCPHD
## 1 5531 0 1 0 0 1 45 12 0 0
## 2 2658 0 1 0 1 0 40 12 0 0
## 3 5365 0 1 0 0 1 38 15 0 0
## 4 4468 0 1 0 0 1 43 13 0 0
## 5 3142 0 1 0 0 1 38 18 0 0
## 6 2170 0 1 1 0 0 39 16 0 0
## EDUCMAST EDUCBA EDUCAA EDUCHSD EDUCDO SINGLE MARRIED DIVORCED FAITHN
## 1 0 0 0 1 0 0 1 0 0
## 2 0 0 0 0 1 0 0 1 1
## 3 0 0 1 0 0 0 0 1 0
## 4 0 0 0 1 0 0 1 0 0
## 5 1 0 0 0 0 0 1 0 0
## 6 0 0 0 0 0 0 1 0 0
## FAITHP FAITHC FAITHJ FAITHO ASVAB01 ASVAB02 ASVAB03 ASVAB04 ASVAB05
## 1 1 0 0 0 58 64 52 56 54
## 2 0 0 0 0 32 39 29 29 27
## 3 0 0 0 1 42 40 37 38 42
## 4 1 0 0 0 66 55 59 53 51
## 5 1 0 0 0 64 65 61 62 47
## 6 1 0 0 0 62 53 51 59 58
## ASVAB06 ASVABC HEIGHT WEIGHT85 WEIGHT02 SM SF SIBLINGS LIBRARY POV78
## 1 56 60.89985 67 160 200 8 8 11 0 0
## 2 22 33.63790 67 185 205 5 5 3 0 1
## 3 45 38.81767 69 135 185 11 12 3 1 0
## 4 52 57.08318 72 200 250 12 16 2 1 0
## 5 34 65.53439 76 185 220 16 20 1 1 0
## 6 45 55.44746 69 180 215 12 12 2 0 0
## EXP EARNINGS HOURS TENURE COLLBARG CATGOV CATPRI CATSE URBAN
## 1 22.384615 53.41 45 2.7500000 0 0 1 0 0
## 2 8.903846 8.00 40 2.3846154 0 0 1 0 0
## 3 13.250000 24.00 40 5.7500000 1 0 1 0 1
## 4 18.250000 29.50 40 6.1346154 1 0 1 0 1
## 5 13.769231 32.05 54 0.8269231 0 1 0 0 1
## 6 11.692307 14.73 40 4.2884617 0 0 1 0 1
## REGNE REGNC REGW REGS EDUCUNI
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 0 1 0 0 0
## 4 0 0 1 0 0
## 5 1 0 0 0 1
## 6 0 0 0 1 0
Combine it with other functions, e.g. stargazer.
# stargazer with dplyr
mydata %>%
mutate(EDUCUNI = EDUCBA + EDUCMAST) %>%
stargazer(type = "text")
##
## ================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------
## ID 540 3,330.191 2,684.020 18 12,110
## FEMALE 540 0.500 0.500 0 1
## MALE 540 0.500 0.500 0 1
## ETHBLACK 540 0.106 0.308 0 1
## ETHHISP 540 0.052 0.222 0 1
## ETHWHITE 540 0.843 0.365 0 1
## AGE 540 40.919 2.323 37 45
## S 540 13.672 2.438 7 20
## EDUCPROF 540 0.007 0.086 0 1
## EDUCPHD 540 0.002 0.043 0 1
## EDUCMAST 540 0.052 0.222 0 1
## EDUCBA 540 0.194 0.396 0 1
## EDUCAA 540 0.089 0.285 0 1
## EDUCHSD 540 0.548 0.498 0 1
## EDUCDO 540 0.078 0.268 0 1
## SINGLE 540 0.148 0.356 0 1
## MARRIED 540 0.657 0.475 0 1
## DIVORCED 540 0.194 0.396 0 1
## FAITHN 540 0.046 0.258 -3 1
## FAITHP 540 0.496 0.522 -3 1
## FAITHC 540 0.320 0.490 -3 1
## FAITHJ 540 -0.004 0.136 -3 1
## FAITHO 540 0.111 0.348 -3 1
## ASVAB01 540 49.996 9.540 22 68
## ASVAB02 540 50.476 9.808 31 66
## ASVAB03 540 49.709 9.376 24 61
## ASVAB04 540 50.356 9.677 20 62
## ASVAB05 540 50.780 9.470 20 62
## ASVAB06 540 50.222 9.530 22 72
## ASVABC 540 51.363 9.568 25.459 66.080
## HEIGHT 540 67.674 4.238 48 80
## WEIGHT85 540 156.748 34.880 95 322
## WEIGHT02 540 184.013 44.828 95 400
## SM 540 11.580 2.816 0 20
## SF 540 11.837 3.537 0 20
## SIBLINGS 540 3.274 2.188 0 13
## LIBRARY 540 0.780 0.415 0 1
## POV78 513 0.131 0.337 0 1
## EXP 540 16.900 4.433 1.154 23.558
## EARNINGS 540 19.636 14.416 2.130 120.190
## HOURS 540 40.531 9.186 10 60
## TENURE 540 7.034 6.427 0.019 24.942
## COLLBARG 540 0.187 0.390 0 1
## CATGOV 540 0.228 0.420 0 1
## CATPRI 540 0.754 0.431 0 1
## CATSE 540 0.019 0.135 0 1
## URBAN 540 0.724 0.483 0 2
## REGNE 540 0.143 0.350 0 1
## REGNC 540 0.337 0.473 0 1
## REGW 540 0.157 0.365 0 1
## REGS 540 0.363 0.481 0 1
## EDUCUNI 540 0.246 0.431 0 1
## ------------------------------------------------
Perform operations by group.
# mutate/summarize by groups
mydata %>%
group_by(FEMALE) %>%
summarize(mean(EXP))
## # A tibble: 2 x 2
## FEMALE `mean(EXP)`
##    <int>       <dbl>
## 1 0 17.9
## 2 1 15.9
mydata %>%
group_by(MARRIED) %>%
mutate(DHOURS = HOURS - mean(HOURS)) %>%
head()
## # A tibble: 6 x 52
## # Groups: MARRIED [2]
## ID FEMALE MALE ETHBLACK ETHHISP ETHWHITE AGE S EDUCPROF EDUCPHD
##    <int>  <int> <int>    <int>   <int>    <int> <int> <int>    <int>   <int>
## 1 5531 0 1 0 0 1 45 12 0 0
## 2 2658 0 1 0 1 0 40 12 0 0
## 3 5365 0 1 0 0 1 38 15 0 0
## 4 4468 0 1 0 0 1 43 13 0 0
## 5 3142 0 1 0 0 1 38 18 0 0
## 6 2170 0 1 1 0 0 39 16 0 0
## # … with 42 more variables: EDUCMAST <int>, EDUCHSD <int>, DIVORCED <int>,
## #   FAITHJ <int>, ASVAB03 <int>, ASVABC <dbl>, SF <int>, EARNINGS <dbl>,
## #   CATGOV <int>, REGNC <int>, …
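After a grouped mutate, the result keeps its grouping attribute, which silently affects later operations. It is good practice to drop the grouping with ungroup() once the group-wise computation is done:

```r
mydata %>%
  group_by(MARRIED) %>%
  mutate(DHOURS = HOURS - mean(HOURS)) %>% # deviation from the group mean
  ungroup()                                # remove the grouping attribute
```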
Perform several operations at once.
# combine mutate, group_by, and summarise
mydata %>%
mutate(AGE_D = ifelse(AGE <= 40, "below 40", "above 40")) %>%
group_by(AGE_D) %>%
summarise(mean(EARNINGS), mean(HOURS))
## # A tibble: 2 x 3
## AGE_D `mean(EARNINGS)` `mean(HOURS)`
##   <chr>            <dbl>         <dbl>
## 1 above 40 19.9 40.3
## 2 below 40 19.3 40.8
Pick only the columns and observations we are interested in.
# you can also 'filter' rows and 'select' columns
mydata %>%
filter(AGE > 40) %>%
select(EARNINGS, EXP, FEMALE) %>%
head()
## EARNINGS EXP FEMALE
## 1 53.41 22.38461 0
## 2 29.50 18.25000 0
## 3 11.78 22.03846 0
## 4 27.30 16.38461 0
## 5 55.05 17.30769 0
## 6 12.30 22.40385 0
Now let’s compare the tidyverse approach with the classic one. The tidyverse version reads top to bottom and is easier to follow, particularly when the operations become more complex.
# the two following code blocks perform the same operation
# tidyverse approach
mydata %>%
filter(AGE > 40 & EDUCBA == 1) %>%
group_by(URBAN) %>%
summarise(mean(EARNINGS))
# classic R
summarise(
group_by(
filter(
mydata, AGE > 40 & EDUCBA == 1),
URBAN),
mean(EARNINGS)
)
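To convince ourselves the two forms really are equivalent, we can store both results and compare them (the object names below are our own):

```r
res_tidy <- mydata %>%
  filter(AGE > 40 & EDUCBA == 1) %>%
  group_by(URBAN) %>%
  summarise(mean(EARNINGS))
res_classic <- summarise(group_by(filter(mydata, AGE > 40 & EDUCBA == 1),
                                  URBAN),
                         mean(EARNINGS))
all.equal(res_tidy, res_classic) # should return TRUE
```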
Now let’s try some more complex operations. The FAITH variables, for example, contain strange values (-3) when they should be binary, so we remove those values. We can also convert all binary variables from numeric to qualitative (i.e., characters or factors). To achieve this we can use more advanced functions like mutate_if.
# We need a function which tells us whether a variable is binary.
# the function below takes a vector x and returns TRUE if all elements in x
# are 0 or 1 (i.e. if x is binary), and FALSE otherwise.
is.binary <- function(x){
all(x %in% 0:1)
}
# modify the dataset and save it into new object mydata2
mydata2 <- mydata %>%
mutate(FAITHC = ifelse(FAITHC < 0, 0, FAITHC)) %>% # replace all values below 0 with 0
mutate(FAITHJ = ifelse(FAITHJ < 0, 0, FAITHJ)) %>%
mutate(FAITHN = ifelse(FAITHN < 0, 0, FAITHN)) %>%
mutate(FAITHO = ifelse(FAITHO < 0, 0, FAITHO)) %>%
mutate(FAITHP = ifelse(FAITHP < 0, 0, FAITHP)) %>%
mutate(URBAN = factor(URBAN)) %>%
mutate_if(is.binary, # mutate all variables which satisfy this condition
factor) # apply function factor() to all columns satisfying the above condition
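The five FAITH mutates can be written more compactly. In recent dplyr versions (>= 1.0), across() is the recommended replacement for mutate_if; a sketch assuming dplyr >= 1.0:

```r
mydata2 <- mydata %>%
  mutate(across(starts_with("FAITH"), ~ ifelse(.x < 0, 0, .x))) %>% # clean FAITH*
  mutate(URBAN = factor(URBAN)) %>%
  mutate(across(where(is.binary), factor)) # factor all binary columns
```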
mydata2 %>% stargazer(type = "text")
##
## ================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------
## ID 540 3,330.191 2,684.020 18 12,110
## AGE 540 40.919 2.323 37 45
## S 540 13.672 2.438 7 20
## ASVAB01 540 49.996 9.540 22 68
## ASVAB02 540 50.476 9.808 31 66
## ASVAB03 540 49.709 9.376 24 61
## ASVAB04 540 50.356 9.677 20 62
## ASVAB05 540 50.780 9.470 20 62
## ASVAB06 540 50.222 9.530 22 72
## ASVABC 540 51.363 9.568 25.459 66.080
## HEIGHT 540 67.674 4.238 48 80
## WEIGHT85 540 156.748 34.880 95 322
## WEIGHT02 540 184.013 44.828 95 400
## SM 540 11.580 2.816 0 20
## SF 540 11.837 3.537 0 20
## SIBLINGS 540 3.274 2.188 0 13
## POV78 513 0.131 0.337 0 1
## EXP 540 16.900 4.433 1.154 23.558
## EARNINGS 540 19.636 14.416 2.130 120.190
## HOURS 540 40.531 9.186 10 60
## TENURE 540 7.034 6.427 0.019 24.942
## ------------------------------------------------
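Notice that POV78 still appears in the numeric table (with N = 513): its 27 missing values make `x %in% 0:1` FALSE for those entries, so is.binary() returns FALSE and the column is never converted. If we want NA-tolerant detection, we can allow NA explicitly:

```r
# NA-tolerant version: a column is binary if its non-missing values are 0/1
is.binary <- function(x){
  all(x %in% c(0, 1, NA))
}
```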
Univariate analysis
We can also easily plot our data. We can use histograms for numerical variables.
mydata2 %>%
  ggplot(aes(EARNINGS)) +
  geom_histogram(bins = 100,
                 color = "grey30",
                 fill = "white") +
  ylab("") +
  xlab("") +
  ggtitle("Histogram of Earnings")
We can add a density line, but then the histogram must be drawn on the density scale rather than as counts.
mydata2 %>%
  ggplot(aes(EARNINGS)) +
  geom_histogram(aes(y = ..density..),
                 bins = 100,
                 color = "grey30",
                 fill = "white") +
  geom_density(alpha = .2,
               fill = "antiquewhite3") +
  ylab("") +
  xlab("") +
  ggtitle("Histogram of Earnings")
We can also draw many plots at once, facetting by variable, with the help of the package reshape2.
if(!require(reshape2)){install.packages("reshape2")}
library(reshape2) # package for data reshaping
# Prepare data for histogram plots of numerical variables
mydata2 %>%
  select_if(negate(is.factor)) %>%  # remove categorical variables
  mutate(ID = factor(ID)) %>%       # ID to a categorical variable
  melt(id.vars = c("ID")) %>%       # reshape dataset for multi-plots
  ggplot(aes(value)) +              # histograms of numerical variables
  geom_histogram() +
  facet_wrap(~variable, scales = "free") +
  ylab("") +
  xlab("") +
  ggtitle("Histograms of numerical variables")
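The same reshape can be done without reshape2: tidyr, already loaded with the tidyverse, provides pivot_longer() (available in tidyr >= 1.0). A sketch:

```r
mydata2 %>%
  select_if(negate(is.factor)) %>%
  mutate(ID = factor(ID)) %>%
  pivot_longer(-ID, names_to = "variable") %>% # long format: one row per (ID, variable)
  ggplot(aes(value)) +
  geom_histogram() +
  facet_wrap(~variable, scales = "free")
```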
For qualitative variables you can try geom_bar() and geom_count() for yourself.
Bivariate analysis
Let’s start by checking correlations. This can help identify possible multicollinearity issues.
if(!require(corrplot)){install.packages("corrplot")}
library(corrplot) # package for correlation plotting
# Prepare data for correlation plot of numerical variables
mydata2 %>%
  na.omit() %>%
  select_if(negate(is.factor)) %>% # remove categorical variables
  select(-ID) %>%                  # remove ID
  cor() %>%
  corrplot(order = "hclust",       # correlation plot
           type = "upper",
           tl.col = "black",
           tl.srt = 45)
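na.omit() drops the 27 rows with a missing POV78 entirely. Alternatively, cor() can use pairwise complete observations, keeping all available data for each pair of variables:

```r
mydata2 %>%
  select_if(negate(is.factor)) %>%
  select(-ID) %>%
  cor(use = "pairwise.complete.obs") %>% # handle NAs pair by pair
  corrplot(order = "hclust", type = "upper", tl.col = "black", tl.srt = 45)
```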
We can then perform boxplots/histograms/bars by groups.
mydata2 %>%
  na.omit() %>%
  select(ID, FEMALE, EARNINGS) %>%
  mutate(ID = factor(ID)) %>%           # ID to a categorical variable
  mutate(FEMALE = factor(FEMALE)) %>%   # FEMALE to a categorical variable
  melt(id.vars = c("ID", "FEMALE")) %>% # reshape dataset
  ggplot() +
  geom_boxplot(aes(x = FEMALE,
                   y = value,
                   fill = FEMALE)) +
  facet_wrap(~variable, scales = "free") +
  xlab("") +
  ylab("") +
  ggtitle("Boxplots of Earnings by Gender")
mydata2 %>%
  ggplot(aes(x = EXP,
             y = log(EARNINGS),
             color = URBAN)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("") +
  ylab("") +
  ggtitle("Earnings on Experience by Urban")
Let's run a univariate regression.
mod0 <- mydata2 %>%
lm(log(EARNINGS) ~ EXP, data = .)
mod0 %>% summary()
##
## Call:
## lm(formula = log(EARNINGS) ~ EXP, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.00515 -0.41689 -0.03241 0.35866 2.02727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.449410 0.098823 24.786 < 2e-16 ***
## EXP 0.020271 0.005656 3.584 0.00037 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5822 on 538 degrees of freedom
## Multiple R-squared: 0.02332, Adjusted R-squared: 0.0215
## F-statistic: 12.84 on 1 and 538 DF, p-value: 0.0003695
Multivariate analysis
We can now move on to a multiple linear regression model.
mod1 <- mydata2 %>%
lm(log(EARNINGS) ~ AGE + EXP +
EDUCPROF + EDUCPHD +
EDUCMAST + EDUCBA +
EDUCAA + EDUCHSD +
EDUCDO, data = .)
mod1 %>% summary()
##
## Call:
## lm(formula = log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD +
## EDUCMAST + EDUCBA + EDUCAA + EDUCHSD + EDUCDO, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77593 -0.32854 -0.03132 0.30212 2.01786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.671919 0.411639 8.920 < 2e-16 ***
## AGE -0.030217 0.009872 -3.061 0.002319 **
## EXP 0.036813 0.005303 6.942 1.13e-11 ***
## EDUCPROF1 1.322317 0.281213 4.702 3.29e-06 ***
## EDUCPHD1 0.135952 0.518411 0.262 0.793232
## EDUCMAST1 0.244102 0.158043 1.545 0.123057
## EDUCBA1 0.123178 0.134919 0.913 0.361672
## EDUCAA1 -0.247465 0.145376 -1.702 0.089296 .
## EDUCHSD1 -0.437111 0.129628 -3.372 0.000801 ***
## EDUCDO1 -0.651896 0.148039 -4.404 1.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5024 on 530 degrees of freedom
## Multiple R-squared: 0.2835, Adjusted R-squared: 0.2714
## F-statistic: 23.31 on 9 and 530 DF, p-value: < 2.2e-16
mod1 %>% stargazer(type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## log(EARNINGS)
## -----------------------------------------------
## AGE -0.030***
## (0.010)
##
## EXP 0.037***
## (0.005)
##
## EDUCPROF1 1.322***
## (0.281)
##
## EDUCPHD1 0.136
## (0.518)
##
## EDUCMAST1 0.244
## (0.158)
##
## EDUCBA1 0.123
## (0.135)
##
## EDUCAA1 -0.247*
## (0.145)
##
## EDUCHSD1 -0.437***
## (0.130)
##
## EDUCDO1 -0.652***
## (0.148)
##
## Constant 3.672***
## (0.412)
##
## -----------------------------------------------
## Observations 540
## R2 0.284
## Adjusted R2 0.271
## Residual Std. Error 0.502 (df = 530)
## F Statistic 23.306*** (df = 9; 530)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We can compare the two models side by side.
stargazer(mod0, mod1,
type = "text")
##
## ===================================================================
## Dependent variable:
## -----------------------------------------------
## log(EARNINGS)
## (1) (2)
## -------------------------------------------------------------------
## AGE -0.030***
## (0.010)
##
## EXP 0.020*** 0.037***
## (0.006) (0.005)
##
## EDUCPROF1 1.322***
## (0.281)
##
## EDUCPHD1 0.136
## (0.518)
##
## EDUCMAST1 0.244
## (0.158)
##
## EDUCBA1 0.123
## (0.135)
##
## EDUCAA1 -0.247*
## (0.145)
##
## EDUCHSD1 -0.437***
## (0.130)
##
## EDUCDO1 -0.652***
## (0.148)
##
## Constant 2.449*** 3.672***
## (0.099) (0.412)
##
## -------------------------------------------------------------------
## Observations 540 540
## R2 0.023 0.284
## Adjusted R2 0.021 0.271
## Residual Std. Error 0.582 (df = 538) 0.502 (df = 530)
## F Statistic 12.843*** (df = 1; 538) 23.306*** (df = 9; 530)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Testing Hypotheses
Once we are confident that the model is well specified, we can use it to test hypotheses.
library(lmtest)
library(sandwich)
library(car)
# we can test that the coeff of AGE=0
linearHypothesis(mod1, "AGE = 0")
## Linear hypothesis test
##
## Hypothesis:
## AGE = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 531 136.13
## 2 530 133.77 1 2.3646 9.3688 0.002319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# test coef of EXP=0.5
linearHypothesis(mod1, "EXP = 0.5")
## Linear hypothesis test
##
## Hypothesis:
## EXP = 0.5
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 531 2059.49
## 2 530 133.77 1 1925.7 7629.9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# test coef of AGE equal - coef of EXP
linearHypothesis(mod1, "AGE = -EXP")
## Linear hypothesis test
##
## Hypothesis:
## AGE + EXP = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 531 133.88
## 2 530 133.77 1 0.11699 0.4635 0.4963
# test same as before
linearHypothesis(mod1, "AGE + EXP = 0")
## Linear hypothesis test
##
## Hypothesis:
## AGE + EXP = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 531 133.88
## 2 530 133.77 1 0.11699 0.4635 0.4963
# test several hypothesis simultaneously
linearHypothesis(mod1, c("EDUCPHD1 = 0",
                         "EDUCMAST1 = 0",
                         "EDUCBA1 = 0",
                         "EDUCAA1 = 0"))
## Linear hypothesis test
##
## Hypothesis:
## EDUCPHD1 = 0
## EDUCMAST1 = 0
## EDUCBA1 = 0
## EDUCAA1 = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 534 139.62
## 2 530 133.77 4 5.848 5.7925 0.000144 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Predictions
We can also use the model to forecast unseen values with predict().
# let's estimate a model of earnings on age and experience only,
# and then forecast the salary of two individuals
# one has age = 40 and exp = 18
# the other has age = 42 and exp = 20
mydata2 %>%
lm(log(EARNINGS) ~ AGE + EXP, data = .) %>%
predict(newdata = data.frame(
AGE = c(40, 42),
EXP = c(18, 20)
))
## 1 2
## 2.834429 2.844080
The forecasted salaries above are in logs, so they must be transformed back (e.g. with exp()) to obtain earnings in levels.
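A minimal sketch of the back-transformation (note that exp() of a log prediction gives the conditional median rather than the conditional mean of earnings):

```r
# Refit the simple model and back-transform the log predictions
mod_fc <- lm(log(EARNINGS) ~ AGE + EXP, data = mydata2)
log_fc <- predict(mod_fc, newdata = data.frame(AGE = c(40, 42),
                                               EXP = c(18, 20)))
exp(log_fc) # forecasted earnings in levels
```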
Diagnostics (Advanced)
There are several things that need to be checked in a linear model:
- Heteroskedasticity
- (Quasi-)Multicollinearity
- Non-normality
- Misspecification
- Autocorrelation and non-stationarity (for time series)
Heteroskedasticity occurs when the variance of the residuals is correlated with the covariates. It does not bias the estimated coefficients, but it biases their standard errors, so your tests might erroneously conclude that some coefficients are significant when, in reality, they are not, and vice versa. In the presence of heteroskedasticity we must use robust standard errors to test our hypotheses, or we can use a log transformation of our variables. We can check graphically whether the squared residuals are correlated with the model covariates.
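For example, a sketch of such a graphical check for mod1 (defined above), plotting the squared residuals against one covariate:

```r
# Squared residuals of mod1 against experience
data.frame(EXP  = mydata2$EXP,
           res2 = residuals(mod1)^2) %>%
  ggplot(aes(x = EXP, y = res2)) +
  geom_point(alpha = .4) +
  geom_smooth(se = FALSE) +
  ylab("Squared residuals") +
  xlab("EXP") +
  ggtitle("Squared residuals vs. Experience")
```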
We can also run the Breusch-Pagan test, whose null hypothesis is homoskedasticity. The test regresses the squared residuals on the covariates and then tests N*R2 ~ Chi2(k), where k is the number of parameters.
mod1 %>% bptest()
##
## studentized Breusch-Pagan test
##
## data: .
## BP = 8.9716, df = 9, p-value = 0.4399
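To make the mechanics explicit, here is a sketch of the statistic computed by hand (the studentized/Koenker form used by bptest() above); it assumes no rows were dropped when fitting mod1:

```r
# Manual Breusch-Pagan: regress squared residuals on the covariates
u2  <- residuals(mod1)^2
aux <- lm(u2 ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST +
               EDUCBA + EDUCAA + EDUCHSD + EDUCDO,
          data = mydata2)
bp  <- length(u2) * summary(aux)$r.squared # N * R^2
pchisq(bp, df = 9, lower.tail = FALSE)     # p-value, compare with bptest()
```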
If we have heteroskedasticity we need robust standard errors.
standard_se <- function(x) {
sqrt(diag(vcov(x)))
}
robust_se <- function(x) {
sqrt(diag(vcovHAC(x)))
}
# first column uses classic s.e. while second one uses robust s.e.
stargazer(mod1, mod1,
type = "text",
se = list(standard_se(mod1),robust_se(mod1)))
##
## ===========================================================
## Dependent variable:
## ----------------------------
## log(EARNINGS)
## (1) (2)
## -----------------------------------------------------------
## AGE -0.030*** -0.030***
## (0.010) (0.009)
##
## EXP 0.037*** 0.037***
## (0.005) (0.006)
##
## EDUCPROF1 1.322*** 1.322***
## (0.281) (0.227)
##
## EDUCPHD1 0.136 0.136
## (0.518) (0.143)
##
## EDUCMAST1 0.244 0.244
## (0.158) (0.168)
##
## EDUCBA1 0.123 0.123
## (0.135) (0.151)
##
## EDUCAA1 -0.247* -0.247
## (0.145) (0.159)
##
## EDUCHSD1 -0.437*** -0.437***
## (0.130) (0.140)
##
## EDUCDO1 -0.652*** -0.652***
## (0.148) (0.157)
##
## Constant 3.672*** 3.672***
## (0.412) (0.389)
##
## -----------------------------------------------------------
## Observations 540 540
## R2 0.284 0.284
## Adjusted R2 0.271 0.271
## Residual Std. Error (df = 530) 0.502 0.502
## F Statistic (df = 9; 530) 23.306*** 23.306***
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We can also test linear hypotheses with robust standard errors.
# test with robust standard errors
linearHypothesis(mod1, "AGE = 0",
vcov. = vcovHAC)
## Linear hypothesis test
##
## Hypothesis:
## AGE = 0
##
## Model 1: restricted model
## Model 2: log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST + EDUCBA +
## EDUCAA + EDUCHSD + EDUCDO
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 531
## 2 530 1 10.578 0.001217 **
## —
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
We can test multicollinearity with the variance inflation factor (VIF). The VIF is a transformation of the R2 obtained from regressing covariate x_i on all the other covariates, so there is one VIF for every covariate. Values above 10 can indicate that the variable is collinear with one or more of the others; some researchers suggest using lower thresholds such as 2.50. Check ?vif for more info.
mod1 %>% vif()
## AGE EXP EDUCPROF EDUCPHD EDUCMAST EDUCBA EDUCAA EDUCHSD
## 1.123057 1.180256 1.244010 1.062837 2.627282 6.100367 3.662003 8.904531
## EDUCDO
## 3.363237
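As a sketch of where these numbers come from, the VIF of a single covariate, e.g. AGE, can be reproduced by hand from its auxiliary regression:

```r
# VIF of AGE: regress AGE on all the other covariates of mod1
aux_age <- lm(AGE ~ EXP + EDUCPROF + EDUCPHD + EDUCMAST +
                    EDUCBA + EDUCAA + EDUCHSD + EDUCDO,
              data = mydata2)
1 / (1 - summary(aux_age)$r.squared) # compare with vif(mod1)["AGE"] above
```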
Non-normality occurs when the residuals are not normally distributed, often due to the presence of outliers, which we can account for with dummy variables. We can use a Q-Q plot to check graphically whether the residuals 'behave normally'. The plot compares the quantiles of the model residuals with those of a normal distribution: if the residuals are normally distributed, they should lie on the line y = x.
mod1 %>% plot(2)
Misspecification happens when the model is wrongly specified. It can be due to several reasons, e.g. a wrong functional form (using a linear model when the phenomenon is non-linear) or endogeneity (omission of important covariates, or simultaneous effects between the dependent variable and one or more covariates). If the model is misspecified, all estimated coefficients will be biased and we need to modify our model. We can test for misspecification (regarding the linearity of the model) with Ramsey's RESET test, whose null hypothesis is that the model is well specified.
mod1 %>% reset(power = 1:3)
##
## RESET test
##
## data: .
## RESET = 0.43615, df1 = 3, df2 = 527, p-value = 0.7272
The RESET test works as follows: we regress the dependent variable on all the covariates plus polynomials of the fitted values of our model, in this case squares and cubes, and then test whether the coefficients on those polynomial terms are jointly zero.
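A sketch of the same idea done by hand (here with squares and cubes of the fitted values only, so the degrees of freedom differ slightly from reset(power = 1:3) above):

```r
# Manual RESET: augment mod1 with powers of its fitted values
yhat <- fitted(mod1)
aux  <- lm(log(EARNINGS) ~ AGE + EXP + EDUCPROF + EDUCPHD + EDUCMAST +
                           EDUCBA + EDUCAA + EDUCHSD + EDUCDO +
                           I(yhat^2) + I(yhat^3),
           data = mydata2)
anova(mod1, aux) # F-test that the added polynomial terms are jointly zero
```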