Introduction to information system
Data I/O in R
Bowei Chen
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M
Data Science 2016 – 2017 Workshop
Today’s Objectives
• Study the following slides for Data I/O in R
• Do the exercises 1-8 (exercise 5-8 are important!!!)
• Do the additional exercises (which can help you understand the contents of
the last week’s lectures)
Sources of Data for R
Entering Data from Keyboard
Perhaps the simplest method of data
entry is from the keyboard. The edit()
function in R will invoke a text editor
that will allow you to enter your data
manually.
> mydata <- data.frame(age = numeric(0), + gender = character(0), + weight = numeric(0)) > mydata <- edit(mydata) Importing Data from Excel There are many R packages can allow you to import data from excel. For example: openxlsx XLConnect xlsx … Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later .xlsx format. > install.packages(“openxlsx”)
> library(“openxlsx”)
> df <-
+ read.xlsx(
+ "PublicHealthEnglandDataTableDistrict.xlsx",
+ sheet = 1,
+ startRow = 1,
+ colNames = TRUE
+ )
Importing Data from Delimited Text File (1/2)
You can import data from delimited text files using read.table() , a function that
reads a file in table format and saves it as a data frame.
where file is a delimited ASCII file , header is a logical value indicating whether
the first row contains variable names (TRUE or FALSE), sep specifies the
delimiter separating data values, and row.names is an optional parameter
specifying one or more variables to represent row identifiers.
> mydataframe <- read.table(file, header = logical_value,
+ sep = "delimiter",
+ row.names = "name")
Importing Data from Delimited Text File (2/2)
> df <- read.table(file, header = TRUE, sep = ",")
> head(df)
Date Open High Low Close Volume Adj.Close
1 2016-03-04 581.07 581.40 571.07 575.14 3405100 575.14
2 2016-03-03 577.96 579.87 573.11 577.49 2736700 577.49
3 2016-03-02 581.75 585.00 573.70 580.21 4576900 580.21
4 2016-03-01 556.29 579.25 556.00 579.04 5014400 579.04
5 2016-02-29 554.00 564.81 552.51 552.52 4013400 552.52
6 2016-02-26 560.12 562.50 553.17 555.23 4858200 555.23
Importing Data from XML
> # install and load the necessary package
> install.packages(“XML”)
> library(XML)
> xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml" > xmlfile <- xmlTreeParse(xml.url) > class(xmlfile)
[1] “XMLDocument” “XMLAbstractDocument”
> xmltop = xmlRoot(xmlfile)
> plantcat <- xmlSApply(xmltop, function(x) { xmlSApply(x, xmlValue) } ) > # Finally, get the data in a data-frame and have a look at the first rows and columns
> plantcat_df <- data.frame(t(plantcat),row.names = NULL) > plantcat_df[1:5,1:4]
Importing Data from R Package
> library(MASS)
> data()
> data(phones)
> phones
$year
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73
$calls
[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0 13.5 14.9 16.1
[14] 21.2 119.0 124.0 142.0 159.0 182.0 212.0 43.0 24.0 27.0 29.0
Importing Data from Other Sources
• Importing SPSS files into R
• Importing Stata files into R
• Importing SAS files into R
• Importing Minitab files into R
• Importing Matlab files into R
• …
Importing Data in RStudio (1/2)
Importing Data in RStudio (2/2)
Writing Data Frame into CSV or TXT Files
write.table(x, file = “”, append = FALSE, quote = TRUE, sep = ” “,
eol = “\n”, na = “NA”, dec = “.”, row.names = TRUE,
col.names = TRUE, qmethod = c(“escape”, “double”),
fileEncoding = “”)
write.csv(…)
write.csv2(…)
Useful Functions for Working with Data Objects (1/2)
Number of elements/components length(object)
Dimensions of an object dim(object)
Structure of an object str(object)
Class or type of an object class(object)
How an object is stored mode(object)
Names of components in an object names(object)
Combines objects into a vector c(object, object,…)
Combines objects as columns cbind(object, object, …)
Combines objects as rows rbind(object, object, …)
Prints the object object
Useful Functions for Working with Data Objects (2/2)
Lists the first part of the object head(object)
Lists the last part of the object tail(object)
Lists current objects ls()
Deletes one or more objects. rm(object, object, …)
Edits object and saves as new object newobject <- edit(object)
drop Argument
By default, subscripting operations
reduce the dimensions of an array
whenever possible. To avoid that, we
can use the drop=FALSE argument
> mat <- matrix(1:12, 3, 4, byrow = TRUE)
> mat
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> s1 <- mat[1,]; s1 [1] 1 2 3 4 > dim(s1)
NULL
> s2 <- mat[1,,drop=FALSE]; s2 [,1] [,2] [,3] [,4] [1,] 1 2 3 4 > dim(s2)
[1] 1 4
Combined Selection
Suppose we want to get all the columns
for which the element at the first row is
less than 3:
> mat <- matrix(1:12, 3, 4, byrow = TRUE) > mat
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
> mycols <- mat[1,] < 3; mycols
[1] TRUE TRUE FALSE FALSE
> mat[ , mycols, drop=FALSE]
[,1] [,2]
[1,] 1 2
[2,] 5 6
[3,] 9 10
Using SQL Statements in R to Manipulate Data Frames
# install the package
> install.packages(“sqldf”)
> library(sqldf)
> newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names=TRUE) > newdf
mpg cyl disp hp drat wt qsec vs am gear carb
Valiant 18.1 6 225.0 105 2.76 3.46 20.2 1 0 3 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.21 19.4 1 0 3 1
Toyota Corona 21.5 4 120.1 97 3.70 2.46 20.0 1 0 3 1
Datsun 710 22.8 4 108.0 93 3.85 2.32 18.6 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.94 18.9 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.20 19.5 1 1 4 1
Toyota Corolla 33.9 4 71.1 65 4.22 1.83 19.9 1 1 4 1
References
• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.
• P. Teetor (2011) R Cookbook. O’Reilly.
• J. Adler (2012) R in a Nutshell, 2nd Edition, O’Reilly
Exercise 1/8
1) Create a vector x represent numbers from 1 to 11
2) Save x into the x.RData file
3) Remove the object x from R workspace
4) Import the x.RData file into R and save it to x.
Please google it
Exercise 2/8
1) Create the following data frame dfA
ID Case Number
1 case1 10
2 case2 20
3 case3 30
2) Save dfA into the dfA.csv file
Exercise 3/8
1) Download PublicHealthEnglandDataTableDistrict.xlsx from Blackboard
2) Import the dataset into R using openxlsx package
3) Show the first and last 20 lines of the dataset, respectively
4) Obtain the column names of the dataset
5) Create a new data frame which has the same column names of the
dataset and has the first and last 20 lines of the dataset
Exercise 4/8
1) Download the file AMZN.csv from Blackboard
2) Import the dataset into R
3) Show the class of all columns/fields
4) Create a new data frame where Open <= 570 and Close >= 550
5) Sort the data frame by High (in decreasing order)
6) Create another data frame where Close >= Open
Exercise 5/8
1) Import data from url “http://www.w3schools.com/xml/plant_catalog.xml”
2) Use the xmlTreePares function to parse xml file directly from the web
3) Use the xmlRoot function to access the top node
Exercise 6/8
1) Install the MASS package
2) Find Cars93 dataset
3) Extract all the records for the Volkswagen from the field Manufacturer
4) Order the extracted records (ascend) by Price and save it to a data frame
5) Write the data frame to Cars93FilteredData.csv
Exercise 7/8
1) Use SQL statements to manipulate data frame as required in Exercise 4/6
2) Write the data frame to Cars93FilteredData.RData
Exercise 8/8
1) Download the files iris1.csv and iris2.csv from Blackboard
2) Import these two files into R
3) Combine these two datasets into one data frame
4) Calculate the mean value of every columns
5) What will you do with missing values?
Additional Exercises
Well done if you’ve completed the exercises. Once you complete these
additional exercises, you can leave the workshop sessions
Additional Exercise (1/2)
If a random variable 𝑋 follows a Poisson distribution, 𝑋~𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆), please
prove the following equations:
1) 𝑥=0
∞ 𝑒
−𝜆𝜆𝑥
𝑥!
= 1
2) 𝔼 X = 𝜆
3) 𝕍 𝑋 = 𝜆
Additional Exercise (2/2)
If a random variable 𝑋 follows a Binomial distribution, 𝑋~𝐵𝑖𝑛 𝑛, 𝑝 . Suppose
that 𝑛𝑝 = 𝜆, please prove the following equation
lim
𝑛→∞
𝑓 𝑥; 𝑛, 𝑝 = lim
𝑛→∞
𝑛
𝑥
𝑝𝑥 1 − 𝑝 𝑛−𝑥 =
𝜆𝑥
𝑥!
𝑒−𝜆 = 𝑓(𝑥; 𝜆)
Thank You!
bchen@lincoln.ac.uk
mailto:bchen@lincoln.ac.uk