Introduction to information system
Data Structures in R
Bowei Chen
School of Computer Science
University of Lincoln
CMP3036M/CMP9063M
Data Science 2016 – 2017 Workshop
Vectors (1/3)
Vectors are one-dimensional arrays
that can hold numeric data, character
data, or logical data. The combine
function c() is used to form the vector.
Note that all the data in a vector
must be of one data type (numeric,
character, or logical).
> a <- c(1, 2, 5, 3, 6, -2, 4)
> b <- c("one", "two", "three")
> d <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
# a is a numeric vector
# b is a character vector
# d is a logical vector
Vectors (2/3)
Scalars are one-element vectors.
> f <- 3
> x <- TRUE
> y <- 100.01
> K <- as.logical(0)
Vectors (3/3)
You can refer to elements of a vector using a numeric vector of positions within brackets.
> a <- c(1, 2, 5, 3, 6, -2, 4)
> a[3]
[1] 5
> a[c(1, 3, 5)]
[1] 1 5 6
> a[2:6]
[1] 2 5 3 6 -2
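A short sketch of two points from the vector slides: mixing types in c() coerces every element to a single type, and positions can also be given as negative numbers or as a logical vector. The name mixed is made up for illustration.

```r
# Mixing types in c() coerces all elements to one type (here: character)
mixed <- c(1, "two", TRUE)
class(mixed)        # "character"

a <- c(1, 2, 5, 3, 6, -2, 4)

# Negative positions drop elements instead of selecting them
a[-1]               # 2 5 3 6 -2 4

# A logical vector keeps the elements where the condition is TRUE
a[a > 3]            # 5 6 4
```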
Matrices (1/4)
A matrix is a two-dimensional array where each element has the same data type
(numeric, character, or logical). Matrices are created with the matrix() function.
mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames, char_vector_colnames))
Matrices (2/4)
# Create a matrix from a vector
> vector <- c(1,2,3,4)
> foo <- matrix(vector, nrow=2, ncol=2)
> foo
[,1] [,2]
[1,] 1 3
[2,] 2 4
# Create a 5×4 matrix
> y <- matrix(1:20, nrow=5, ncol=4)
> y
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
> z <- matrix(1:20, nrow=5)
Matrices (3/4)
# Create a 2x2 matrix with labels and fill the matrix by rows
> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = TRUE,
+ dimnames = list(rnames, cnames) )
> mymatrix
C1 C2
R1 1 26
R2 24 68
# Create a 2x2 matrix with labels and fill the matrix by columns
> mymatrix <- matrix(
+ cells, nrow = 2, ncol = 2, byrow = FALSE,
+ dimnames = list(rnames, cnames))
> mymatrix
C1 C2
R1 1 24
R2 26 68
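As a small check of what byrow changes, the sketch below builds the same cells both ways without labels. The names by_row and by_col are made up for illustration. For this 2x2 case, filling by rows gives the transpose of filling by columns.

```r
cells <- c(1, 26, 24, 68)

by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE)
by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE)

# For this 2x2 case, row-wise filling is the transpose of column-wise filling
identical(by_row, t(by_col))   # TRUE

# dim() reports the number of rows and columns
dim(by_row)                    # 2 2
```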
Matrices (4/4)
You can identify rows, columns, or
elements of a matrix, x, by using
subscripts and brackets.
• x[i,] refers to the ith row
• x[,j] refers to the jth column
• x[i,j] refers to the i,jth element
> x <- matrix(1:10, nrow=2)
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
> x[2,]
[1] 2 4 6 8 10
> x[,2]
[1] 3 4
> x[1,4]
[1] 7
> x[1, c(4,5)]
[1] 7 9
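One detail worth seeing alongside the subscripts above: selecting a single row or column returns a plain vector by default. The sketch below, with made-up names row2 and row2_mat, shows the drop = FALSE argument that keeps the matrix shape.

```r
x <- matrix(1:10, nrow = 2)

# Selecting a single row returns a plain vector by default
row2 <- x[2, ]
is.matrix(row2)                    # FALSE

# drop = FALSE keeps the result as a 1 x 5 matrix
row2_mat <- x[2, , drop = FALSE]
dim(row2_mat)                      # 1 5

# Several rows and columns at once give a sub-matrix
x[1:2, c(1, 5)]
```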
Arrays (1/2)
Matrices are two-dimensional and, like vectors, can contain only one data type.
When there are more than two dimensions, you’ll use arrays.
myarray <- array(vector, dimensions, dimnames)
Arrays (2/2)
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4),
+ dimnames=list(dim1, dim2, dim3))
> z
, , C1
B1 B2 B3
A1 1 3 5
A2 2 4 6
, , C2
B1 B2 B3
A1 7 9 11
A2 8 10 12
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
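Array elements are addressed like matrix elements, with one subscript per dimension. A minimal sketch using the same z as above:

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

dim(z)                 # 2 3 4

# One subscript per dimension, by position or by name
z[1, 2, 3]             # 15
z["A1", "B2", "C3"]    # 15

# Leaving a subscript empty keeps that whole dimension
z[, , "C2"]            # the 2 x 3 slice printed under ", , C2"
```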
Data Frame (1/4)
A data frame is more general than a matrix in that different columns can
contain different modes of data (numeric, character, etc.). A data frame is
created with the data.frame() function.
It’s similar to the datasets you’d typically see in Python (pandas), SAS, SPSS,
and Stata. Each column must have only one data type, but you can put
columns of different data types together to form the data frame. Because
data frames are close to what analysts typically think of as datasets, we
sometimes use the terms columns and variables interchangeably when
discussing data frames.
mydata <- data.frame(col1, col2, col3,…)
Data Frame (2/4)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
patientID age diabetes status
1 1 25 Type1 Poor
2 2 34 Type2 Improved
3 3 28 Type1 Excellent
4 4 52 Type1 Poor
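Data Frame (3/4)
Accessing data frame elements is
straightforward: columns can be
accessed by name. (The Levels lines in the
output below assume an R version before 4.0.0,
where data.frame() converted character columns
to factors by default. In later versions, pass
stringsAsFactors = TRUE to reproduce them.)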
> patientdata$patientID
[1] 1 2 3 4
> patientdata$diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> patientdata$status
[1] Poor Improved Excellent Poor
Levels: Excellent Improved Poor
> patientdata[, "age"]
[1] 25 34 28 52
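Beyond $ and [, "name"], rows and columns can be selected together. The sketch below rebuilds the patientdata example; the name older is made up for illustration.

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# [[ ]] with a column name returns the same vector as $
identical(patientdata[["age"]], patientdata$age)   # TRUE

# [rows, columns] selects a smaller data frame
patientdata[1:2, c("patientID", "age")]

# A logical condition on a column filters rows
older <- patientdata[patientdata$age > 30, ]
older$patientID                                    # 2 4
```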
Data Frame (4/4)
To cross-tabulate diabetes type by status, use the table() function.
> table(patientdata$diabetes, patientdata$status)
Excellent Improved Poor
Type1 1 0 2
Type2 0 1 0
Some Useful Functions for Data Frame (1/8)
The summary() function quickly
summarises the variables
in a data frame.
> summary(patientdata)
patientID age diabetes status
Min. :1.00 Min. :25.00 Type1:3 Excellent:1
1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1
Median :2.50 Median :31.00 Poor :2
Mean :2.50 Mean :34.75
3rd Qu.:3.25 3rd Qu.:38.50
Max. :4.00 Max. :52.00
Some Useful Functions for Data Frame (2/8)
The attach() function adds the data
frame to the R search path. When a
variable name is encountered, data
frames in the search path are checked
in order to locate the variable.
> summary(mtcars$mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mtcars$mpg, mtcars$disp)
> plot(mtcars$mpg, mtcars$wt)
> attach(mtcars)
> summary(mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mpg, disp)
> plot(mpg, wt)
> detach(mtcars)
Some Useful Functions for Data Frame (3/8)
The detach() function removes the
data frame from the search path.
Note that detach() does nothing to
the data frame itself. The statement is
optional but is good programming
practice and should be included
routinely.
> attach(mtcars)
> summary(mpg)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.42 19.20 20.09 22.80 33.90
> plot(mpg, disp)
> plot(mpg, wt)
> detach(mtcars)
Some Useful Functions for Data Frame (4/8)
The limitations with this approach are
evident when more than one object
can have the same name.
Here we already have an object
named mpg in our environment when
the mtcars data frame is attached. In
such cases, the original object takes
precedence, which isn’t what you
want. The plot statement fails
because mpg has 3 elements and
wt has 32 elements.
> mpg <- c(25, 36, 47)
> attach(mtcars)
The following object is masked _by_
.GlobalEnv:
mpg
> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel,
log) :
‘x’ and ‘y’ lengths differ
Some Useful Functions for Data Frame (5/8)
An alternative approach is the with()
function. The statements within
the {} braces are evaluated with
reference to the mtcars data
frame, so you don’t have to worry
about name conflicts. If
there’s only one statement (for
example, summary(mpg)), the {}
braces are optional.
> with(mtcars, {
+ print(summary(mpg))
+ plot(mpg, disp)
+ plot(mpg, wt)
+ })
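A minimal sketch of why with() avoids the masking problem from the attach() example: inside with(), names are looked up in the data frame first, even when a global object has the same name.

```r
mpg <- c(25, 36, 47)              # global object with a clashing name

# Inside with(), mpg refers to the mtcars column
with(mtcars, length(mpg))         # 32

# Outside with(), the global object is used
length(mpg)                       # 3

# With a single statement the braces can be left out
with(mtcars, mean(wt))
```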
Some Useful Functions for Data Frame (6/8)
The limitation of the with()
function is that assignments
exist only within the braces of
the with() call.
> with(mtcars, {
+ stats <- summary(mpg)
+ stats
+ })
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
> stats
Error: object ‘stats’ not found
Some Useful Functions for Data Frame (7/8)
If you need to create objects that
will exist outside of the with()
construct, use the special
assignment operator <<- instead of the standard one <-. It will save the object to the global environment outside of the with() call.
> with(mtcars, {
+ nokeepstats <- summary(mpg)
+ keepstats <<- summary(mpg)
+ })
> nokeepstats
Error: object ‘nokeepstats’ not found
> keepstats
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
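As an alternative to <<-, the value returned by with() can be assigned directly, since with() returns the value of its last expression. This is a sketch of that option rather than the approach shown on the slide; the names results, mean_mpg and max_wt are made up for illustration.

```r
# with() returns the value of its last expression, so it can be assigned
keepstats <- with(mtcars, summary(mpg))
keepstats

# Several results can be returned together in a list
results <- with(mtcars, list(mean_mpg = mean(mpg), max_wt = max(wt)))
results$mean_mpg
```

This keeps all assignments in the calling environment and avoids writing to the global environment from inside the braces.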
Some Useful Functions for Data Frame (8/8)
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
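head() returns the first six rows by default. A few related base R functions for a first look at a data frame are sketched below, using the built-in iris data.

```r
dim(iris)        # 150 5: rows and columns
names(iris)      # column names
head(iris, 3)    # first three rows
tail(iris, 3)    # last three rows
str(iris)        # compact overview of each column's type
```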
Factors (1/3)
Categorical (nominal) and ordered
categorical (ordinal) variables in R
are called factors.
The function factor() stores the
categorical values as a vector of
integers in the range [1… k] (where
k is the number of unique values in
the nominal variable), and an
internal vector of character strings
(the original values) mapped to
these integers.
> diabetes <- c("Type1", "Type2",
+ "Type1", "Type1")
> diabetes
[1] "Type1" "Type2" "Type1" "Type1"
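The integer codes and the character labels described above can be inspected directly. A minimal sketch, using the made-up name f for the factor:

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
f <- factor(diabetes)

levels(f)        # "Type1" "Type2": the stored character labels
as.integer(f)    # 1 2 1 1: the integer codes
nlevels(f)       # 2: k, the number of unique values
```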
Factors (2/3)
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, ordered=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame': 4 obs. of 4 variables:
$ patientID: num 1 2 3 4
$ age : num 25 34 28 52
$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
$ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
Factors (3/3)
> summary(patientdata)
patientID age diabetes status
Min. :1.00 Min. :25.00 Type1:3 Excellent:1
1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1
Median :2.50 Median :31.00 Poor :2
Mean :2.50 Mean :34.75
3rd Qu.:3.25 3rd Qu.:38.50
Max. :4.00 Max. :52.00
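By default, factor levels are sorted alphabetically, which is why the str() output above orders status as Excellent < Improved < Poor. The sketch below shows the levels argument of factor(), which sets the order explicitly. The names default_order and status_f are made up for illustration.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# Default: levels sorted alphabetically (Excellent < Improved < Poor)
default_order <- factor(status, ordered = TRUE)
levels(default_order)

# Explicit levels: Poor < Improved < Excellent
status_f <- factor(status, ordered = TRUE,
                   levels = c("Poor", "Improved", "Excellent"))
as.integer(status_f)          # 1 2 3 1
status_f[2] > status_f[1]     # TRUE: Improved ranks above Poor
```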
Lists (1/2)
Lists are the most complex of the R
data types. Basically, a list is an
ordered collection of objects
(components). A list allows you to
gather a variety of (possibly
unrelated) objects under one name.
mylist <- list(object1, object2, …)
mylist <- list(name1=object1,
name2=object2, …)
Lists (2/2)
> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"
$ages
[1] 25 26 18 39
[[3]]
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
[[4]]
[1] "one" "two" "three"
> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
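Double brackets and $ return the component itself, whereas single brackets return a list. A minimal sketch with a shortened version of mylist; the component name count is made up for illustration.

```r
g <- "My First List"
h <- c(25, 26, 18, 39)
mylist <- list(title = g, ages = h)

# [[ ]] and $ return the component itself
mylist$ages                 # 25 26 18 39
class(mylist[["ages"]])     # "numeric"

# Single brackets return a list containing the component
class(mylist["ages"])       # "list"

# Components can be added by name after creation
mylist$count <- length(h)
names(mylist)               # "title" "ages" "count"
```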
References
• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.
• P. Teetor (2011) R Cookbook. O’Reilly.
• J. Adler (2012) R in a Nutshell, 2nd Edition. O’Reilly.
Exercise 1/10
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE
# Check class of my_numeric
class(my_numeric)
# Check class of my_character
class(my_character)
# Check class of my_logical
class(my_logical)
Exercise 2/10
# Vector operations
a) Create a vector containing 1, 2, 3, ..., 10
b) Get the length of the above vector
c) Get the last three numbers from the vector
d) Sort the numbers in decreasing order
e) Remove the number 9 from the above vector
Exercise 3/10
# Vector operations
a) Create a vector from 1 to 3.1415 with a length of 100
b) Create a vector from -2 to 0.1 with a length of 100
c) Get the sum and inner product of the vectors from a) and b)
Exercise 4/10
# Vector operations
a) Create a vector x containing 2, 3, 4, 1
b) Create a vector y containing 1, 1, 3, 7
c) Combine x and y as column vectors
Exercise 5/10
# Vector operations
Use the rep() function to create the following vectors:
a) "0" "x" "0" "x" "0" "x"
b) 1 3 2 1 3 2 1 3 2 1 3 2
c) 1 1 1 2 2 2 3 3 3
Exercise 6/10
# Matrix operations
a) Create a matrix with 5 rows and 20 columns containing the values from 1 to 100
b) Print out the dimensions of the matrix
c) Find the sum of the 4th column
d) Find the sum of row 3 and column 17
e) Assign the following names to the rows:
"A", "B", "C", "D", "E"
Exercise 7/10
# Matrix operations
a) Use the matrix() function to create the following matrix:
          TypeA  TypeB  TypeC
Navarra     190      8     22
Zaragoza    191      4    1.7
Madrid      223     80    2.0
b) Add the following column to the matrix:
TypeD
2.00
3.50
2.75
c) Use the apply() function to calculate the mean of each column of the matrix
Exercise 8/10
# Array operations
Create the following array
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
, , 3
[,1] [,2] [,3]
[1,] 19 22 25
[2,] 20 23 26
[3,] 21 24 27
Exercise 9/10
# Data frame operations
Type df <- iris, then
a) Print out the dimensions of df
b) Find the sum of the "Sepal.Width" column
c) Rename the column "Species" to "label"
d) Find out how many records have "Petal.Length" larger than 1.41
Exercise 10/10
# List operations
Create the following list and save it to the variable x:
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1] TRUE FALSE TRUE FALSE FALSE
[[4]]
[1] 3
Additional Exercises
Well done if you’ve completed the exercises. Once you complete these
additional exercises, you can leave the workshop session.
Additional Exercise (1/3)
Create the following data frame:
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia no
Additional Exercise (2/3)
# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector
# Total winnings with poker
total_poker <- sum(poker_vector)
a) Calculate total winnings with roulette
b) Calculate winnings overall
Additional Exercise (3/3)
a) Create a data frame as follows:
df <- data.frame(Product=gl(3,10,labels=c("A","B", "C")),
Year=factor(rep(2002:2011,3)),
Sales=1:30)
b) Find the sum of all products’ sales by year
Thank You!
bchen@lincoln.ac.uk