Data wrangling
STT 180 Module 2 Lecture 5
Dola Pathak
Michigan State University
(Michigan State University) Introduction to Data Science 1 / 18
Learning objectives
1 Learning to organize the data using the dplyr package.
2 How to manipulate data frames using the functions of the dplyr
package.
3 Effectively use the pipe operator to get the desired information.
4 Learning to summarize the data using the functions available in the
dplyr package.
(Michigan State University) Introduction to Data Science 2 / 18
Package dplyr
To follow along, load the tidyverse package with
library(tidyverse)
Package dplyr will automatically be loaded. We will use mtcars for
easy-to-follow examples.
(Michigan State University) Introduction to Data Science 3 / 18
Pipes
The pipe operator is implemented in the package magrittr. Instead of
stop(run(hit_ball(swing_bat(identify(what = “baseball”))), to = “first”))
we can use the pipe operator
identify(what = “baseball”) %>%
swing_bat() %>%
hit_ball() %>%
run(to = “first”) %>%
stop()
Writing it out using pipes give it a more natural (and easier to read)
structure. The shortcut Ctrl/Cmd + Shift + M implements %>%.
(Michigan State University) Introduction to Data Science 4 / 18
dplyr functions
Package dplyr is based on the concept of functions as verbs that
manipulate data frames.
• filter(): choose rows matching a set of criteria
• slice(): choose rows using indices
• select(): choose columns by name
• pull(): grab a column as a vector
• rename(): rename specific columns
• arrange(): reorder rows
• mutate(): add new variables to the data frame
• transmute(): create a new data frame with new variables
• distinct(): filter for unique rows
• sample_n / sample_frac(): randomly sample rows
• summarise(): reduce variables to values
(Michigan State University) Introduction to Data Science 5 / 18
dplyr function rules
1 First argument is always a data frame
2 Subsequent arguments say what to do with that data frame
3 Result is almost always a data frame
4 Does not modify original data frame
(Michigan State University) Introduction to Data Science 6 / 18
Selecting variables
# using subsetting techniques
mtcars[c(“mpg”, “hp”, “wt”)] %>%
head()
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
# using dplyr::select
mtcars %>%
select(mpg, hp, wt) %>%
head()
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
Use variable names as opposed to variable position. Strive to make your
code readable.
(Michigan State University) Introduction to Data Science 7 / 18
Selecting variables
# using subsetting techniques
mtcars[c(“mpg”, “hp”, “wt”)] %>%
head()
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
# using dplyr::select
mtcars %>%
select(mpg, hp, wt) %>%
head()
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Hornet Sportabout 18.7 175 3.440
Valiant 18.1 105 3.460
Use variable names as opposed to variable position. Strive to make your
code readable.
(Michigan State University) Introduction to Data Science 7 / 18
Selecting a single variable as a vector
# using subsetting to grab a vector
mtcars[, 1]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
# using dplyr::pull
mtcars %>%
pull(var = 1)
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
(Michigan State University) Introduction to Data Science 8 / 18
Selecting a single variable as a vector
# using subsetting to grab a vector
mtcars[, 1]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
# using dplyr::pull
mtcars %>%
pull(var = 1)
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
(Michigan State University) Introduction to Data Science 8 / 18
Filter based on conditions
# using subsetting techniques
mtcars[(mtcars$mpg < 20) & (mtcars$am == 0), ] %>%
head()
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
# using dplyr::filter
mtcars %>%
filter((mpg < 20) & (am == 0)) %>%
head()
mpg cyl disp hp drat wt qsec vs am gear carb
1 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
2 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
3 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
4 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
5 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
6 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
(Michigan State University) Introduction to Data Science 9 / 18
Filter based on conditions
# using subsetting techniques
mtcars[(mtcars$mpg < 20) & (mtcars$am == 0), ] %>%
head()
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
# using dplyr::filter
mtcars %>%
filter((mpg < 20) & (am == 0)) %>%
head()
mpg cyl disp hp drat wt qsec vs am gear carb
1 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
2 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
3 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
4 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
5 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4
6 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
(Michigan State University) Introduction to Data Science 9 / 18
Add a variable to a data frame
# old technique
mtcars$new_var <- mtcars$hp / 10 head(mtcars[8:12]) vs am gear carb new_var Mazda RX4 0 1 4 4 11.0 Mazda RX4 Wag 0 1 4 4 11.0 Datsun 710 1 1 4 1 9.3 Hornet 4 Drive 1 0 3 1 11.0 Hornet Sportabout 0 0 3 2 17.5 Valiant 1 0 3 1 10.5 # using dplyr::mutate mtcars %>%
mutate(new_var = hp / 10) %>%
select(vs:new_var) %>%
head()
vs am gear carb new_var
1 0 1 4 4 11.0
2 0 1 4 4 11.0
3 1 1 4 1 9.3
4 1 0 3 1 11.0
5 0 0 3 2 17.5
6 1 0 3 1 10.5
(Michigan State University) Introduction to Data Science 10 / 18
Add a variable to a data frame
# old technique
mtcars$new_var <- mtcars$hp / 10 head(mtcars[8:12]) vs am gear carb new_var Mazda RX4 0 1 4 4 11.0 Mazda RX4 Wag 0 1 4 4 11.0 Datsun 710 1 1 4 1 9.3 Hornet 4 Drive 1 0 3 1 11.0 Hornet Sportabout 0 0 3 2 17.5 Valiant 1 0 3 1 10.5 # using dplyr::mutate mtcars %>%
mutate(new_var = hp / 10) %>%
select(vs:new_var) %>%
head()
vs am gear carb new_var
1 0 1 4 4 11.0
2 0 1 4 4 11.0
3 1 1 4 1 9.3
4 1 0 3 1 11.0
5 0 0 3 2 17.5
6 1 0 3 1 10.5
(Michigan State University) Introduction to Data Science 10 / 18
Mutate or transmute?
# using dplyr::mutate
mtcars %>%
mutate(new_var = hp / 10) %>%
head()
mpg cyl disp hp drat wt qsec vs am gear carb new_var
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 11.0
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 11.0
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.3
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 11.0
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.5
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 10.5
# using dplyr::transmute
mtcars %>%
transmute(new_var = hp / 10) %>%
head()
new_var
1 11.0
2 11.0
3 9.3
4 11.0
5 17.5
6 10.5
(Michigan State University) Introduction to Data Science 11 / 18
Mutate or transmute?
# using dplyr::mutate
mtcars %>%
mutate(new_var = hp / 10) %>%
head()
mpg cyl disp hp drat wt qsec vs am gear carb new_var
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 11.0
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 11.0
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.3
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 11.0
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.5
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 10.5
# using dplyr::transmute
mtcars %>%
transmute(new_var = hp / 10) %>%
head()
new_var
1 11.0
2 11.0
3 9.3
4 11.0
5 17.5
6 10.5
(Michigan State University) Introduction to Data Science 11 / 18
Summaries and groupings
mtcars %>%
summarise(mean_mpg = mean(mpg),
mean_hp = mean(hp),
mean_wt = mean(wt))
mean_mpg mean_hp mean_wt
1 20.09062 146.6875 3.21725
mtcars %>%
group_by(cyl) %>%
summarise(mean_mpg = mean(mpg),
mean_hp = mean(hp),
mean_wt = mean(wt))
# A tibble: 3 x 4
cyl mean_mpg mean_hp mean_wt
1 4 26.7 82.6 2.29
2 6 19.7 122. 3.12
3 8 15.1 209. 4.00
(Michigan State University) Introduction to Data Science 12 / 18
Summaries and groupings
mtcars %>%
group_by(cyl, am) %>%
summarise(mean_mpg = mean(mpg),
mean_hp = mean(hp),
mean_wt = mean(wt))
# A tibble: 6 x 5
# Groups: cyl [?]
cyl am mean_mpg mean_hp mean_wt
1 4 0 22.9 84.7 2.94
2 4 1 28.1 81.9 2.04
3 6 0 19.1 115. 3.39
4 6 1 20.6 132. 2.76
5 8 0 15.0 194. 4.10
6 8 1 15.4 300. 3.37
mtcars %>%
group_by(am, cyl) %>%
summarise(count = n())
# A tibble: 6 x 3
# Groups: am [?]
am cyl count
1 0 4 3
2 0 6 4
3 0 8 12
4 1 4 8
5 1 6 3
6 1 8 2
(Michigan State University) Introduction to Data Science 13 / 18
Summaries and groupings
mtcars %>%
group_by(cyl, am) %>%
summarise(mean_mpg = mean(mpg),
mean_hp = mean(hp),
mean_wt = mean(wt))
# A tibble: 6 x 5
# Groups: cyl [?]
cyl am mean_mpg mean_hp mean_wt
1 4 0 22.9 84.7 2.94
2 4 1 28.1 81.9 2.04
3 6 0 19.1 115. 3.39
4 6 1 20.6 132. 2.76
5 8 0 15.0 194. 4.10
6 8 1 15.4 300. 3.37
mtcars %>%
group_by(am, cyl) %>%
summarise(count = n())
# A tibble: 6 x 3
# Groups: am [?]
am cyl count
1 0 4 3
2 0 6 4
3 0 8 12
4 1 4 8
5 1 6 3
6 1 8 2
(Michigan State University) Introduction to Data Science 13 / 18
Useful functions with summarise
• Center: mean(), median()
• Spread: sd(), IQR(), mad()
• Range: min(), max(), quantile()
• Position: first(), last(), nth()
• Count: n(), n_distinct()
• Logical: any(), all()
(Michigan State University) Introduction to Data Science 14 / 18
Pipes and plots
mtcars %>%
group_by(cyl, am) %>%
summarise(mean_mpg = mean(mpg)) %>%
ggplot(mapping = aes(x = cyl, y = factor(am))) +
geom_raster(mapping = aes(fill = mean_mpg)) +
labs(x = “Number of cylinders”, y = “Transmission type”,
fill = “Mean miles per gallon”)
0
1
4 6 8
Number of cylinders
Tr
a
n
sm
is
si
o
n
t
yp
e
17.5
20.0
22.5
25.0
27.5
Mean miles per gallon
(Michigan State University) Introduction to Data Science 15 / 18
Pipes and plots
mtcars %>%
group_by(cyl, am) %>%
summarise(count = n()) %>%
arrange((cyl)) %>%
ggplot(mapping = aes(x = factor(cyl), y = count)) +
geom_bar(stat = “identity”, aes(fill = factor(am)), position = “dodge”) +
labs(x = “Number of cylinders”, y = “Count”,
fill = “Transmission type”)
0.0
2.5
5.0
7.5
10.0
12.5
4 6 8
Number of cylinders
C
o
u
n
t Transmission type
0
1
(Michigan State University) Introduction to Data Science 16 / 18
Today’s R essentials
• Ctrl/Cmd + Shift + M implements %>%
• In a series of piped dplyr function, %>% always ends a line
• One dplyr function per line for readability
(Michigan State University) Introduction to Data Science 17 / 18
Today’s R Markdown essentials
• Cascading Style Sheets is a style sheet language used for describing
the presentation of a document written in a markup language such as
HTML.
• Since R Markdown generates and HTML file you can customize CSS
styles.
(Michigan State University) Introduction to Data Science 18 / 18