CS计算机代考程序代写 data science Data wrangling

Data wrangling
STT 180 Module 2 Lecture 5

Dola Pathak

Michigan State University

(Michigan State University) Introduction to Data Science 1 / 18

Learning objectives

1 Learning to organize the data using the dplyr package.

2 How to manipulate data frames using the functions of the dplyr
package.

3 Effectively use the pipe operator to get the desired information.

4 Learning to summarize the data using the functions available in the
dplyr package.

(Michigan State University) Introduction to Data Science 2 / 18

Package dplyr

To follow along, load the tidyverse package with

library(tidyverse)

Package dplyr will automatically be loaded. We will use mtcars for
easy-to-follow examples.

(Michigan State University) Introduction to Data Science 3 / 18

Pipes

The pipe operator is implemented in the package magrittr. Instead of

stop(run(hit_ball(swing_bat(identify(what = “baseball”))), to = “first”))

we can use the pipe operator

identify(what = “baseball”) %>%

swing_bat() %>%

hit_ball() %>%

run(to = “first”) %>%

stop()

Writing it out using pipes give it a more natural (and easier to read)
structure. The shortcut Ctrl/Cmd + Shift + M implements %>%.

(Michigan State University) Introduction to Data Science 4 / 18

dplyr functions

Package dplyr is based on the concept of functions as verbs that
manipulate data frames.

• filter(): choose rows matching a set of criteria
• slice(): choose rows using indices
• select(): choose columns by name
• pull(): grab a column as a vector
• rename(): rename specific columns
• arrange(): reorder rows
• mutate(): add new variables to the data frame
• transmute(): create a new data frame with new variables
• distinct(): filter for unique rows
• sample_n / sample_frac(): randomly sample rows
• summarise(): reduce variables to values
(Michigan State University) Introduction to Data Science 5 / 18

dplyr function rules

1 First argument is always a data frame

2 Subsequent arguments say what to do with that data frame

3 Result is almost always a data frame

4 Does not modify original data frame

(Michigan State University) Introduction to Data Science 6 / 18

Selecting variables

# using subsetting techniques

mtcars[c(“mpg”, “hp”, “wt”)] %>%

head()

mpg hp wt

Mazda RX4 21.0 110 2.620

Mazda RX4 Wag 21.0 110 2.875

Datsun 710 22.8 93 2.320

Hornet 4 Drive 21.4 110 3.215

Hornet Sportabout 18.7 175 3.440

Valiant 18.1 105 3.460

# using dplyr::select

mtcars %>%

select(mpg, hp, wt) %>%

head()

mpg hp wt

Mazda RX4 21.0 110 2.620

Mazda RX4 Wag 21.0 110 2.875

Datsun 710 22.8 93 2.320

Hornet 4 Drive 21.4 110 3.215

Hornet Sportabout 18.7 175 3.440

Valiant 18.1 105 3.460

Use variable names as opposed to variable position. Strive to make your
code readable.

(Michigan State University) Introduction to Data Science 7 / 18

Selecting variables

# using subsetting techniques

mtcars[c(“mpg”, “hp”, “wt”)] %>%

head()

mpg hp wt

Mazda RX4 21.0 110 2.620

Mazda RX4 Wag 21.0 110 2.875

Datsun 710 22.8 93 2.320

Hornet 4 Drive 21.4 110 3.215

Hornet Sportabout 18.7 175 3.440

Valiant 18.1 105 3.460

# using dplyr::select

mtcars %>%

select(mpg, hp, wt) %>%

head()

mpg hp wt

Mazda RX4 21.0 110 2.620

Mazda RX4 Wag 21.0 110 2.875

Datsun 710 22.8 93 2.320

Hornet 4 Drive 21.4 110 3.215

Hornet Sportabout 18.7 175 3.440

Valiant 18.1 105 3.460

Use variable names as opposed to variable position. Strive to make your
code readable.

(Michigan State University) Introduction to Data Science 7 / 18

Selecting a single variable as a vector

# using subsetting to grab a vector

mtcars[, 1]

[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2

[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4

[29] 15.8 19.7 15.0 21.4

# using dplyr::pull

mtcars %>%

pull(var = 1)

[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2

[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4

[29] 15.8 19.7 15.0 21.4

(Michigan State University) Introduction to Data Science 8 / 18

Selecting a single variable as a vector

# using subsetting to grab a vector

mtcars[, 1]

[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2

[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4

[29] 15.8 19.7 15.0 21.4

# using dplyr::pull

mtcars %>%

pull(var = 1)

[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2

[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4

[29] 15.8 19.7 15.0 21.4

(Michigan State University) Introduction to Data Science 8 / 18

Filter based on conditions

# using subsetting techniques

mtcars[(mtcars$mpg < 20) & (mtcars$am == 0), ] %>%

head()

mpg cyl disp hp drat wt qsec vs am gear carb

Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2

Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1

Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4

Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4

Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4

Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3

# using dplyr::filter

mtcars %>%

filter((mpg < 20) & (am == 0)) %>%

head()

mpg cyl disp hp drat wt qsec vs am gear carb

1 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2

2 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1

3 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4

4 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4

5 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4

6 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3

(Michigan State University) Introduction to Data Science 9 / 18

Filter based on conditions

# using subsetting techniques

mtcars[(mtcars$mpg < 20) & (mtcars$am == 0), ] %>%

head()

mpg cyl disp hp drat wt qsec vs am gear carb

Hornet Sportabout 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2

Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1

Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4

Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4

Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4

Merc 450SE 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3

# using dplyr::filter

mtcars %>%

filter((mpg < 20) & (am == 0)) %>%

head()

mpg cyl disp hp drat wt qsec vs am gear carb

1 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2

2 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1

3 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4

4 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4

5 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4

6 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3

(Michigan State University) Introduction to Data Science 9 / 18

Add a variable to a data frame

# old technique

mtcars$new_var <- mtcars$hp / 10 head(mtcars[8:12]) vs am gear carb new_var Mazda RX4 0 1 4 4 11.0 Mazda RX4 Wag 0 1 4 4 11.0 Datsun 710 1 1 4 1 9.3 Hornet 4 Drive 1 0 3 1 11.0 Hornet Sportabout 0 0 3 2 17.5 Valiant 1 0 3 1 10.5 # using dplyr::mutate mtcars %>%

mutate(new_var = hp / 10) %>%

select(vs:new_var) %>%

head()

vs am gear carb new_var

1 0 1 4 4 11.0

2 0 1 4 4 11.0

3 1 1 4 1 9.3

4 1 0 3 1 11.0

5 0 0 3 2 17.5

6 1 0 3 1 10.5

(Michigan State University) Introduction to Data Science 10 / 18

Add a variable to a data frame

# old technique

mutate(new_var = hp / 10) %>%

select(vs:new_var) %>%

head()

vs am gear carb new_var

1 0 1 4 4 11.0

2 0 1 4 4 11.0

3 1 1 4 1 9.3

4 1 0 3 1 11.0

5 0 0 3 2 17.5

6 1 0 3 1 10.5

(Michigan State University) Introduction to Data Science 10 / 18

Mutate or transmute?

# using dplyr::mutate

mtcars %>%

mutate(new_var = hp / 10) %>%

head()

mpg cyl disp hp drat wt qsec vs am gear carb new_var

1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 11.0

2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 11.0

3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.3

4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 11.0

5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.5

6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 10.5

# using dplyr::transmute

mtcars %>%

transmute(new_var = hp / 10) %>%

head()

new_var

1 11.0

2 11.0

3 9.3

4 11.0

5 17.5

6 10.5

(Michigan State University) Introduction to Data Science 11 / 18

Mutate or transmute?

# using dplyr::mutate

mtcars %>%

mutate(new_var = hp / 10) %>%

head()

mpg cyl disp hp drat wt qsec vs am gear carb new_var

1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 11.0

2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 11.0

3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 9.3

4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 11.0

5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.5

6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 10.5

# using dplyr::transmute

mtcars %>%

transmute(new_var = hp / 10) %>%

head()

new_var

1 11.0

2 11.0

3 9.3

4 11.0

5 17.5

6 10.5

(Michigan State University) Introduction to Data Science 11 / 18

Summaries and groupings

mtcars %>%

summarise(mean_mpg = mean(mpg),

mean_hp = mean(hp),

mean_wt = mean(wt))

mean_mpg mean_hp mean_wt

1 20.09062 146.6875 3.21725

mtcars %>%

group_by(cyl) %>%

summarise(mean_mpg = mean(mpg),

mean_hp = mean(hp),

mean_wt = mean(wt))

# A tibble: 3 x 4

cyl mean_mpg mean_hp mean_wt

1 4 26.7 82.6 2.29

2 6 19.7 122. 3.12

3 8 15.1 209. 4.00

(Michigan State University) Introduction to Data Science 12 / 18

Summaries and groupings

mtcars %>%

group_by(cyl, am) %>%

summarise(mean_mpg = mean(mpg),

mean_hp = mean(hp),

mean_wt = mean(wt))

# A tibble: 6 x 5

# Groups: cyl [?]

cyl am mean_mpg mean_hp mean_wt

1 4 0 22.9 84.7 2.94

2 4 1 28.1 81.9 2.04

3 6 0 19.1 115. 3.39

4 6 1 20.6 132. 2.76

5 8 0 15.0 194. 4.10

6 8 1 15.4 300. 3.37

mtcars %>%

group_by(am, cyl) %>%

summarise(count = n())

# A tibble: 6 x 3

# Groups: am [?]

am cyl count

1 0 4 3

2 0 6 4

3 0 8 12

4 1 4 8

5 1 6 3

6 1 8 2

(Michigan State University) Introduction to Data Science 13 / 18

Summaries and groupings

mtcars %>%

group_by(cyl, am) %>%

summarise(mean_mpg = mean(mpg),

mean_hp = mean(hp),

mean_wt = mean(wt))

# A tibble: 6 x 5

# Groups: cyl [?]

cyl am mean_mpg mean_hp mean_wt

1 4 0 22.9 84.7 2.94

2 4 1 28.1 81.9 2.04

3 6 0 19.1 115. 3.39

4 6 1 20.6 132. 2.76

5 8 0 15.0 194. 4.10

6 8 1 15.4 300. 3.37

mtcars %>%

group_by(am, cyl) %>%

summarise(count = n())

# A tibble: 6 x 3

# Groups: am [?]

am cyl count

1 0 4 3

2 0 6 4

3 0 8 12

4 1 4 8

5 1 6 3

6 1 8 2

(Michigan State University) Introduction to Data Science 13 / 18

Useful functions with summarise

• Center: mean(), median()
• Spread: sd(), IQR(), mad()
• Range: min(), max(), quantile()
• Position: first(), last(), nth()
• Count: n(), n_distinct()
• Logical: any(), all()

(Michigan State University) Introduction to Data Science 14 / 18

Pipes and plots

mtcars %>%

group_by(cyl, am) %>%

summarise(mean_mpg = mean(mpg)) %>%

ggplot(mapping = aes(x = cyl, y = factor(am))) +

geom_raster(mapping = aes(fill = mean_mpg)) +

labs(x = “Number of cylinders”, y = “Transmission type”,

fill = “Mean miles per gallon”)

4 6 8

Number of cylinders

Tr
a

n
sm

is
si

o
n

t
yp

17.5

20.0

22.5

25.0

27.5

Mean miles per gallon

(Michigan State University) Introduction to Data Science 15 / 18

Pipes and plots

mtcars %>%

group_by(cyl, am) %>%

summarise(count = n()) %>%

arrange((cyl)) %>%

ggplot(mapping = aes(x = factor(cyl), y = count)) +

geom_bar(stat = “identity”, aes(fill = factor(am)), position = “dodge”) +

labs(x = “Number of cylinders”, y = “Count”,

fill = “Transmission type”)

0.0

2.5

5.0

7.5

10.0

12.5

4 6 8

Number of cylinders

C
o

u
n

t Transmission type

(Michigan State University) Introduction to Data Science 16 / 18

Today’s R essentials

• Ctrl/Cmd + Shift + M implements %>%

• In a series of piped dplyr function, %>% always ends a line

• One dplyr function per line for readability

(Michigan State University) Introduction to Data Science 17 / 18

Today’s R Markdown essentials

• Cascading Style Sheets is a style sheet language used for describing
the presentation of a document written in a markup language such as
HTML.

• Since R Markdown generates and HTML file you can customize CSS
styles.

(Michigan State University) Introduction to Data Science 18 / 18

Related Posts