Data know hows and more
STT 180 Module 2 Lecture 3

Dola Pathak

Michigan State University

(Michigan State University) Introduction to Data Science 1 / 1

Learning objectives

• Load data into the R console/source pane using built-in R functions.

• Understand the structure of data frames/tables.

• Use base R functions such as apply(), within(), and names().

Course standing

You should be comfortable with

• installing and loading R packages,

• formatting an R Markdown file,

• subsetting and manipulating vectors and data frames,

• applying functions to your data, e.g. mean() and is.na().

Getting data into R

Data may be

• available in base R or through an R package, such as the diamonds
data set that is available through tidyverse (in the ggplot2 package),

• read in to R from a file on your computer,

• read in to R directly from a website,

• scraped from a website.

Getting data into R

Built-in R functions read.table() and read.csv() will handle most
rectangular text data that you want to read in to R. The function
read.csv() is a wrapper around read.table() with different defaults:
most notably, the separator is a comma and header = TRUE, whereas
read.table() splits on whitespace and uses header = FALSE by default.

We have also seen that the function load() is used to read in .Rdata files.
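
As a minimal sketch of how the two readers relate, the example below writes a small comma-separated file to a temporary location and reads it back both ways, then saves and reloads an object. The file contents and the object names (tmp_csv, scores_csv, scores_tbl, tmp_rdata) are made up for illustration.

```r
# Hypothetical data written to a temporary file so the example is self-contained.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,85"), tmp_csv)

# read.csv() assumes a comma separator and a header row by default.
scores_csv <- read.csv(tmp_csv)

# read.table() needs both stated explicitly to read the file the same way.
scores_tbl <- read.table(tmp_csv, header = TRUE, sep = ",")

# Compare the two results.
all.equal(scores_csv, scores_tbl)

# save() writes R objects to an .RData file; load() restores them by name.
tmp_rdata <- tempfile(fileext = ".RData")
save(scores_csv, file = tmp_rdata)
rm(scores_csv)
load(tmp_rdata)  # scores_csv exists again in the workspace
```

Note that load() restores objects under the names they were saved with, so its result is normally not assigned to a new variable.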

A closer look

read.table() reads a file in table format and creates a data frame from
it, with cases corresponding to lines and variables to fields in the file.

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

A closer look

The essentials

read.table(file, header = FALSE, sep = "", na.strings = "NA",
           stringsAsFactors = default.stringsAsFactors())

• file: rectangular text file you want to read (in quotes)

• header: does the text file have column names?

• sep: how are cells separated in the text file?

• na.strings: how are NA values represented in the text file?

• stringsAsFactors: should text columns be read as character vectors or as factors? (Since R 4.0.0 the default is FALSE, i.e. character.)
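
The arguments above can be sketched on a small made-up example. Here the text argument of read.table() stands in for a file so the code is self-contained; the data, the semicolon separator, the "." missing-value code, and the object names raw_text and dat are assumptions chosen for illustration.

```r
# Hypothetical semicolon-separated text in which "." marks a missing value.
raw_text <- c("id;group;value",
              "1;a;2.5",
              "2;b;.",
              "3;a;4.0")

dat <- read.table(text = raw_text,          # 'text' stands in for a file here
                  header = TRUE,            # first line holds the column names
                  sep = ";",                # cells are separated by semicolons
                  na.strings = ".",         # "." should be read as NA
                  stringsAsFactors = TRUE)  # read text columns as factors

# Inspect the column types that read.table() chose.
str(dat)
```

With a real file, the first argument would be the quoted file name instead of text = raw_text.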

Helpful tips

• Examine the text file before you read it in to R

• It may be easier to make quick changes in the text file before you
read it in to R

• Ensure your working directory is set to where the text file is located

read.csv("no-file-here.csv")

Warning in file(file, "rt"): cannot open file 'no-file-here.csv': No such file or directory

Error in file(file, "rt"): cannot open the connection
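
One way to avoid the error above is to check the working directory and the file's existence before reading. The following is a minimal sketch rather than a required workflow; the file name is the hypothetical one from the example, and data_file and my_data are illustrative names.

```r
# Hypothetical file name; replace it with the file you actually want to read.
data_file <- "no-file-here.csv"

# getwd() shows the directory in which R looks for relative file names.
getwd()

# Only attempt to read the file if it can be found.
if (file.exists(data_file)) {
  my_data <- read.csv(data_file)
} else {
  message("Cannot find '", data_file, "' in ", getwd())
}
```

If the file lives elsewhere, setwd() changes the working directory, or the full path can be given to read.csv() directly.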

Packages to handle other types of data

• data.table – large rectangular files
• RcppCNPy – Python npy files
• haven – SPSS, Stata, and SAS files
• readxl – Excel files (.xls and .xlsx)
• DBI – databases
• jsonlite – JSON
• xml2 – XML
• httr – Web APIs
• rvest – HTML (Web Scraping)

The name before each dash is the package that handles the corresponding data.

More on data frames

• Use $ to add new variables to a data frame

my.data.frame$new.var <- 1:10

• Function within() also allows you to add new variables to a data frame

• Function names() gives all variable names and will let you set/change the names

• Row names and column names can be extracted with rownames() and colnames(), respectively.

Some examples with mtcars

mtcars$mpg.adj <- mtcars$mpg / mtcars$wt
head(mtcars[-c(1:6)])

                   qsec vs am gear carb  mpg.adj
Mazda RX4         16.46  0  1    4    4 8.015267
Mazda RX4 Wag     17.02  0  1    4    4 7.304348
Datsun 710        18.61  1  1    4    1 9.827586
Hornet 4 Drive    19.44  1  0    3    1 6.656299
Hornet Sportabout 17.02  0  0    3    2 5.436047
Valiant           20.22  1  0    3    1 5.231214

Some examples with mtcars

mtcars <- within(data = mtcars, disp.new <- disp / 10)
head(mtcars[-c(1:6)])

                   qsec vs am gear carb  mpg.adj disp.new
Mazda RX4         16.46  0  1    4    4 8.015267     16.0
Mazda RX4 Wag     17.02  0  1    4    4 7.304348     16.0
Datsun 710        18.61  1  1    4    1 9.827586     10.8
Hornet 4 Drive    19.44  1  0    3    1 6.656299     25.8
Hornet Sportabout 17.02  0  0    3    2 5.436047     36.0
Valiant           20.22  1  0    3    1 5.231214     22.5
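
The data frame functions above can be tried together on a small copy of mtcars, as in the sketch below. The object name mt_small and the replacement column name disp.tenths are made up for illustration; working on a copy leaves the built-in data set unchanged.

```r
# Work on a copy of the first three rows and three columns of mtcars.
mt_small <- head(mtcars[c("mpg", "wt", "disp")], 3)

# Add a variable with $.
mt_small$mpg.adj <- mt_small$mpg / mt_small$wt

# Add a variable with within(); the expression is evaluated inside the data frame.
mt_small <- within(mt_small, disp.new <- disp / 10)

# names() reads the variable names and can also set them.
names(mt_small)
names(mt_small)[names(mt_small) == "disp.new"] <- "disp.tenths"

# rownames() and colnames() extract the row and column labels.
rownames(mt_small)
colnames(mt_small)
```

Renaming by matching on the old name, rather than by position, keeps the code correct if the column order changes later.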

Apply function

If you want to repeatedly use a function on rows or columns of a data frame, apply() will allow you to do just that. It returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.

apply(X, MARGIN, FUN)

• X: data frame, matrix, or array

• MARGIN: 1 is by rows, 2 is by columns

• FUN: name of function to apply

Some examples using apply

Not efficient:

c(mean(mtcars$mpg), mean(mtcars$cyl), mean(mtcars$disp))

[1]  20.09062   6.18750 230.72188

Efficient:

apply(X = mtcars, MARGIN = 2, FUN = mean)

       mpg        cyl       disp         hp       drat         wt
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
      qsec         vs         am       gear       carb    mpg.adj
 17.848750   0.437500   0.406250   3.687500   2.812500   7.494852
  disp.new
 23.072188

Some examples using apply

apply(X = mtcars[c(1:3)], MARGIN = 2, FUN = sd)

       mpg        cyl       disp
  6.026948   1.785922 123.938694

Some examples using apply

apply(X = mtcars[c(1:3)], MARGIN = 2, FUN = sort)[1:5, ]

      mpg cyl disp
[1,] 10.4   4 71.1
[2,] 10.4   4 75.7
[3,] 13.3   4 78.7
[4,] 14.3   4 79.0
[5,] 14.7   4 95.1

apply(X = mtcars[c(1:4), ], MARGIN = 1, FUN = sum)

     Mazda RX4  Mazda RX4 Wag     Datsun 710 Hornet 4 Drive
      352.9953       353.0993       280.2076       458.5913
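
The apply() calls above can be summarized in one short sketch that uses both margins and passes an extra argument through to FUN. The object names x, col_means, row_sums, and col_max are illustrative, and only the first three mtcars columns are used so the results do not depend on the variables added earlier.

```r
# Use three columns of the built-in mtcars data (all numeric).
x <- mtcars[c("mpg", "cyl", "disp")]

# MARGIN = 2 applies the function to each column.
col_means <- apply(X = x, MARGIN = 2, FUN = mean)

# MARGIN = 1 applies the function to each row.
row_sums <- apply(X = x, MARGIN = 1, FUN = sum)

# Arguments after FUN are passed on to the function;
# here na.rm = TRUE would drop missing values if there were any.
col_max <- apply(X = x, MARGIN = 2, FUN = max, na.rm = TRUE)

# For column means specifically, base R's colMeans() is an alternative.
all.equal(col_means, colMeans(x))
```

Because apply() converts a data frame to a matrix first, it is best suited to data frames whose columns are all of the same type, as here.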