Data know-how and more
STT 180 Module 2 Lecture 3
Dola Pathak
Michigan State University
(Michigan State University) Introduction to Data Science 1 / 1
Learning objectives
• Load data into the R console/ source pane using built-in R functions.
• Understand the structure of the data frames/tables.
• Use base R functions such as apply(), within(), names(), etc.
Course standing
You should be comfortable with
• installing and loading R packages,
• formatting an R Markdown file,
• subsetting and manipulating vectors and data frames,
• applying functions to your data, i.e. mean(), is.na(), etc.
Getting data into R
Data may be
• available in base R or through an R package such as the diamonds
data set that is available through tidyverse,
• read into R from a file on your computer,
• read into R directly from a website,
• scraped from a website.
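As a minimal sketch of the first option, the snippet below looks at a data set that ships with base R and, only if the ggplot2 package happens to be installed, at the diamonds data set. Using mtcars here is an illustrative choice, and the requireNamespace() guard is an assumption about how you might want to handle a missing package.

```r
# mtcars ships with base R (package 'datasets'), so no file is needed.
data(mtcars)
str(mtcars)        # compact overview of the variables and their types
head(mtcars, 3)    # first three rows

# diamonds lives in ggplot2 (part of the tidyverse); only touch it if installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(dim(ggplot2::diamonds))
}
```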
Getting data into R
The built-in R functions read.table() and read.csv() will handle most
rectangular text data that you want to read into R. The function
read.csv() is the same as read.table() except for its defaults: the
separator is a comma and header = TRUE, whereas read.table() defaults to
whitespace as the separator and header = FALSE.
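The difference in defaults can be sketched with two tiny in-memory "files". The data values and object names below are made up for illustration, and the text argument is used instead of a file path so the example runs without any file on disk.

```r
# Hypothetical data: the same table written with commas and with spaces.
csv_text <- "name,score\nAda,90\nBen,85"
ws_text  <- "name score\nAda 90\nBen 85"

# read.csv(): comma separator and header = TRUE by default.
from_csv <- read.csv(text = csv_text)

# read.table(): whitespace separator by default; the header must be requested.
from_table <- read.table(text = ws_text, header = TRUE)

from_csv
from_table
all.equal(from_csv, from_table)   # compare the two resulting data frames
```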
We have also seen that the function load() is used to read in .Rdata files.
A closer look
read.table() reads a file in table format and creates a data frame from
it, with cases corresponding to lines and variables to fields in the file.
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
A closer look
The essentials
read.table(file, header = FALSE, sep = "", na.strings = "NA",
           stringsAsFactors = default.stringsAsFactors())
• file: path to the rectangular text file you want to read (in quotes)
• header: does the first line of the text file contain column names?
• sep: which character separates cells in the text file?
• na.strings: how are NA values represented in the text file?
• stringsAsFactors: should text columns be read as character or factor?
(In R 4.0.0 and later the default is FALSE, i.e. character.)
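To see the essential arguments working together, here is a small sketch. The semicolon-separated text, the use of -99 as a missing-value code, and the object names are all assumptions made for illustration; the text argument stands in for a real file.

```r
# Hypothetical raw data: ';' separates cells and -99 marks a missing value.
raw_text <- "id;group;value\n1;a;3.5\n2;b;-99\n3;a;4.1"

dat <- read.table(text = raw_text,
                  header = TRUE,               # first line holds column names
                  sep = ";",                   # cells are separated by ';'
                  na.strings = c("NA", "-99"), # both codes become NA
                  stringsAsFactors = TRUE)     # text columns become factors

str(dat)            # group is a factor, value is numeric
is.na(dat$value)    # flags the row where -99 was recorded
```

Setting stringsAsFactors explicitly, as done here, avoids depending on the default of whichever R version is installed.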
Helpful tips
• Examine the text file before you read it into R
• It may be easier to make quick changes in the text file before you
read it into R
• Ensure your working directory is set to where the text file is located
read.csv("no-file-here.csv")
Warning in file(file, "rt"): cannot open file 'no-file-here.csv':
No such file or directory
Error in file(file, "rt"): cannot open the connection
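One way to avoid this error is to check the working directory and the file before reading. The sketch below reuses the file name from the error message above; the branching logic is just one possible approach, not a required pattern.

```r
path <- "no-file-here.csv"   # file name taken from the example above

getwd()                      # where R is currently looking for files

if (file.exists(path)) {
  dat <- read.csv(path)
} else {
  # Report the problem instead of letting read.csv() fail.
  message("Cannot find '", path, "' in ", getwd())
}
```

If the file lives elsewhere, either pass its full path to read.csv() or change the working directory with setwd() first.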
Packages to handle other types of data
• data.table – large rectangular files
• RcppCNPy – Python npy files
• haven – SPSS, Stata, and SAS files
• readxl – Excel files (.xls and .xlsx)
• DBI – databases
• jsonlite – JSON
• xml2 – XML
• httr – Web APIs
• rvest – HTML (Web Scraping)
The name before each dash is the package that handles the corresponding data.
More on data frames
• Use $ to add new variables to a data frame
my.data.frame$new.var <- 1:10
• Function within() also allows you to add new variables to a data
frame
• Function names() gives all variable names and will let you set/change
the names
• Row names and column names can be extracted with rownames()
and colnames(), respectively.
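The functions listed above can be tried on a toy data frame. The data frame df and its column names are hypothetical and exist only for this sketch.

```r
# Hypothetical toy data frame.
df <- data.frame(x = 1:3, y = c(2.5, 4.0, 5.5))

df$z <- df$x * 2                   # add a variable with $
df <- within(df, ratio <- y / x)   # add a variable with within()

names(df)                                    # all variable names
names(df)[names(df) == "z"] <- "x.doubled"   # rename one variable

rownames(df) <- c("a", "b", "c")   # set row names
rownames(df)                       # extract row names
colnames(df)                       # extract column names
```

Inside within(), columns are referred to by name alone (y / x), so the data frame does not have to be repeated with $ each time.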
Some examples with mtcars
mtcars$mpg.adj <- mtcars$mpg / mtcars$wt
head(mtcars[-c(1:6)])
                   qsec vs am gear carb  mpg.adj
Mazda RX4         16.46  0  1    4    4 8.015267
Mazda RX4 Wag     17.02  0  1    4    4 7.304348
Datsun 710        18.61  1  1    4    1 9.827586
Hornet 4 Drive    19.44  1  0    3    1 6.656299
Hornet Sportabout 17.02  0  0    3    2 5.436047
Valiant           20.22  1  0    3    1 5.231214
Some examples with mtcars
mtcars <- within(data = mtcars, disp.new <- disp / 10)
head(mtcars[-c(1:6)])
                   qsec vs am gear carb  mpg.adj disp.new
Mazda RX4         16.46  0  1    4    4 8.015267     16.0
Mazda RX4 Wag     17.02  0  1    4    4 7.304348     16.0
Datsun 710        18.61  1  1    4    1 9.827586     10.8
Hornet 4 Drive    19.44  1  0    3    1 6.656299     25.8
Hornet Sportabout 17.02  0  0    3    2 5.436047     36.0
Valiant           20.22  1  0    3    1 5.231214     22.5
Apply function
If you want to apply the same function to every row or every column of a
data frame, apply() will allow you to do just that. It returns a vector,
array, or list of values obtained by applying a function to the margins of
an array or matrix; a data frame is first coerced to a matrix.
apply(X, MARGIN, FUN)
• X: data frame, matrix, or array
• MARGIN: 1 is by rows, 2 is by columns
• FUN: name of function to apply
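The effect of MARGIN is easiest to see on a small matrix. The matrix m and its dimension names below are made up for this sketch.

```r
# Hypothetical 2 x 3 matrix, filled column by column:
#    a b c
# r1 1 3 5
# r2 2 4 6
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

row_sums <- apply(X = m, MARGIN = 1, FUN = sum)   # one value per row
col_sums <- apply(X = m, MARGIN = 2, FUN = sum)   # one value per column

row_sums   # r1 = 9, r2 = 12
col_sums   # a = 3, b = 7, c = 11

# FUN can also be a function written on the spot.
col_range <- apply(m, 2, function(col) max(col) - min(col))
col_range  # 1 for every column
```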
Some examples using apply
Not efficient:
c(mean(mtcars$mpg), mean(mtcars$cyl), mean(mtcars$disp))
[1] 20.09062 6.18750 230.72188
Efficient:
apply(X = mtcars, MARGIN = 2, FUN = mean)
       mpg        cyl       disp         hp       drat         wt
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
      qsec         vs         am       gear       carb    mpg.adj
 17.848750   0.437500   0.406250   3.687500   2.812500   7.494852
  disp.new
 23.072188
Some examples using apply
apply(X = mtcars[c(1:3)], MARGIN = 2, FUN = sd)
       mpg        cyl       disp
  6.026948   1.785922 123.938694
Some examples using apply
apply(X = mtcars[c(1:3)], MARGIN = 2, FUN = sort)[1:5, ]
      mpg cyl disp
[1,] 10.4   4 71.1
[2,] 10.4   4 75.7
[3,] 13.3   4 78.7
[4,] 14.3   4 79.0
[5,] 14.7   4 95.1
apply(X = mtcars[c(1:4), ], MARGIN = 1, FUN = sum)
     Mazda RX4  Mazda RX4 Wag     Datsun 710 Hornet 4 Drive
      352.9953       353.0993       280.2076       458.5913