Introduction to information system

Data Structures in R

Bowei Chen

School of Computer Science

University of Lincoln

CMP3036M/CMP9063M

Data Science 2016 – 2017 Workshop

Vectors (1/3)

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector. Note that all the data in a vector must be of a single data type (numeric, character, or logical).

> a <- c(1, 2, 5, 3, 6, -2, 4)
> b <- c("one", "two", "three")
> d <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# a is a numeric vector
# b is a character vector
# d is a logical vector

Vectors (2/3)

Scalars are one-element vectors.

> f <- 3
> x <- TRUE
> y <- 100.01
> K <- as.logical(0)

Vectors (3/3)

You can refer to elements of a vector using a numeric vector of positions within brackets.

> a <- c(1, 2, 5, 3, 6, -2, 4)
> a[3]

[1] 5

> a[c(1, 3, 5)]

[1] 1 5 6

> a[2:6]

[1] 2 5 3 6 -2
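
As a small sketch of the single-type rule and of indexing (the names mixed and num_lgl are illustrative and do not come from the slides): combining values of different types with c() does not raise an error. R coerces every element to one common type instead. Negative positions and logical conditions can also be used inside the brackets.

```r
# Mixing types in c() coerces every element to one common type.
mixed <- c(1, "two", TRUE)
class(mixed)          # "character"

# Numeric and logical values together become numeric (TRUE -> 1, FALSE -> 0).
num_lgl <- c(1.5, TRUE, FALSE)
class(num_lgl)        # "numeric"

# Indexing by position; negative positions exclude elements.
a <- c(1, 2, 5, 3, 6, -2, 4)
a[c(1, 3, 5)]         # 1 5 6
a[-1]                 # everything except the first element
a[a > 3]              # logical indexing: 5 6 4
```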

Matrices (1/4)

A matrix is a two-dimensional array where each element has the same data type

(numeric, character, or logical). Matrices are created with the matrix() function.

mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns,
                   byrow=logical_value,
                   dimnames=list(char_vector_rownames, char_vector_colnames))

Matrices (2/4)

# Create a matrix from a vector

> vector <- c(1,2,3,4)
> foo <- matrix(vector, nrow=2, ncol=2)
> foo
     [,1] [,2]
[1,]    1    3
[2,]    2    4

# Create a 5×4 matrix

> y <- matrix(1:20, nrow=5, ncol=4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

> z <- matrix(1:20, nrow=5)

Matrices (3/4)

Create a 2x2 matrix with labels and fill the matrix by rows

> cells <- c(1,26,24,68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(
+   cells, nrow = 2, ncol = 2, byrow = TRUE,
+   dimnames = list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 26
R2 24 68

Create a 2x2 matrix with labels and fill the matrix by columns

> mymatrix <- matrix(
+   cells, nrow = 2, ncol = 2, byrow = FALSE,
+   dimnames = list(rnames, cnames))
> mymatrix
   C1 C2
R1  1 24
R2 26 68
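
The relationship between the two fill orders can be checked directly. The following is a minimal sketch using the same cells, rnames and cnames as above; the names by_row and by_col are chosen here for illustration only.

```r
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")

by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                 dimnames = list(rnames, cnames))
by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
                 dimnames = list(rnames, cnames))

# Filling by row versus by column gives transposed cell values.
all(by_row == t(by_col))   # TRUE

# Dimension names can be used for indexing, just like positions.
by_row["R2", "C1"]         # 24
dim(by_row)                # 2 2
```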

Matrices (4/4)

You can identify rows, columns, or
elements of a matrix, x, by using
subscripts and brackets.

• x[i,] refers to the ith row
• x[,j] refers to the jth column
• x[i,j] refers to the element in row i and column j

> x <- matrix(1:10, nrow=2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[2,]
[1] 2 4 6 8 10
> x[,2]
[1] 3 4
> x[1,4]
[1] 7
> x[1, c(4,5)]
[1] 7 9
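
One detail worth knowing when subscripting: selecting a single row or column returns a plain vector by default. The sketch below, which assumes the same x as above (row2 and row2_mat are illustrative names), shows how the drop argument keeps the matrix shape.

```r
x <- matrix(1:10, nrow = 2)

# Selecting a single row returns a plain vector by default.
row2 <- x[2, ]
is.matrix(row2)                  # FALSE

# drop = FALSE keeps the result as a 1 x 5 matrix.
row2_mat <- x[2, , drop = FALSE]
dim(row2_mat)                    # 1 5

# Several rows and columns can be selected at once.
x[1:2, c(1, 5)]                  # 2 x 2 matrix holding 1, 2, 9, 10
```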

Arrays (1/2)

Matrices are two-dimensional and, like vectors, can contain only one data type.

When there are more than two dimensions, you’ll use arrays.

myarray <- array(vector, dimensions, dimnames)

Arrays (2/2)

> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
> z
, , C1

B1 B2 B3
A1 1 3 5
A2 2 4 6

, , C2

B1 B2 B3
A1 7 9 11
A2 8 10 12

, , C3

B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

B1 B2 B3
A1 19 21 23
A2 20 22 24
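
Array elements are selected in the same way as matrix elements, with one subscript per dimension. A short sketch, assuming the array z defined above:

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

dim(z)                  # 2 3 4
z[1, 2, 3]              # 15: row A1, column B2, slice C3
z["A2", "B3", "C4"]     # 24, the last element
z[, , "C1"]             # the first 2 x 3 slice, returned as a matrix
```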

Data Frame (1/4)

A data frame is more general than a matrix in that different columns can
contain different modes of data (numeric, character, etc.). A data frame is
created with the data.frame() function.

It’s similar to the datasets you’d typically see in Python (pandas), SAS, SPSS,
and Stata. Each column must have only one data type, but you can put
columns of different data types together to form the data frame. Because
data frames are close to what analysts typically think of as datasets, we
sometimes use the terms columns and variables interchangeably when
discussing data frames.

mydata <- data.frame(col1, col2, col3,…)

Data Frame (2/4)

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor

Data Frame (3/4)

Accessing data frame elements is straightforward. Columns can be
accessed by name, either with the $ notation or with brackets.

> patientdata$patientID
[1] 1 2 3 4

> patientdata$diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2

> patientdata$status
[1] Poor Improved Excellent Poor
Levels: Excellent Improved Poor

> patientdata[, 'age']
[1] 25 34 28 52
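
The Levels: lines in the output above appear because the diabetes and status columns are stored as factors. Older versions of R converted strings to factors by default when building a data frame; since R 4.0.0 the default is stringsAsFactors = FALSE, so the same code prints plain character vectors there. The sketch below passes the argument explicitly so that the result does not depend on the R version, and shows a few equivalent ways of selecting columns and rows.

```r
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("Poor", "Improved", "Excellent", "Poor")

# Request factor conversion explicitly (this was the default before R 4.0.0).
patientdata <- data.frame(patientID, age, diabetes, status,
                          stringsAsFactors = TRUE)

# Three equivalent ways to extract one column as a vector.
patientdata$age
patientdata[["age"]]
patientdata[, "age"]

# Rows and columns can be selected together.
patientdata[patientdata$age > 30, c("patientID", "age")]

class(patientdata$diabetes)   # "factor"
levels(patientdata$status)    # "Excellent" "Improved" "Poor"
```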

Data Frame (4/4)

To cross-tabulate diabetes type by status, use the table() function.

> table(patientdata$diabetes, patientdata$status)

        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0
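
The result of table() can be stored and processed further. A minimal sketch, working on the two character vectors directly rather than on the data frame (the name counts is illustrative); addmargins() and prop.table() are base R functions.

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status   <- c("Poor", "Improved", "Excellent", "Poor")

counts <- table(diabetes, status)
counts["Type1", "Poor"]     # 2

# Add row and column totals.
addmargins(counts)

# Convert the counts to proportions of the grand total.
prop.table(counts)
```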

Some Useful Functions for Data Frame (1/8)

The summary() function can quickly summarise the variables in a data frame.

> summary(patientdata)

   patientID         age         diabetes       status
 Min.   :1.00   Min.   :25.00   Type1:3   Excellent:1
 1st Qu.:1.75   1st Qu.:27.25   Type2:1   Improved :1
 Median :2.50   Median :31.00             Poor     :2
 Mean   :2.50   Mean   :34.75
 3rd Qu.:3.25   3rd Qu.:38.50
 Max.   :4.00   Max.   :52.00

Some Useful Functions for Data Frame (2/8)

The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked in order to locate the variable.

> summary(mtcars$mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mtcars$mpg, mtcars$disp)

> plot(mtcars$mpg, mtcars$wt)

> attach(mtcars)

> summary(mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mpg, disp)

> plot(mpg, wt)

> detach(mtcars)

Some Useful Functions for Data Frame (3/8)

The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. The statement is optional but is good programming practice and should be included routinely.

> attach(mtcars)

> summary(mpg)

Min. 1st Qu. Median Mean 3rd Qu. Max.

10.40 15.42 19.20 20.09 22.80 33.90

> plot(mpg, disp)

> plot(mpg, wt)

> detach(mtcars)

Some Useful Functions for Data Frame (4/8)

The limitations with this approach are
evident when more than one object
can have the same name.

Here we already have an object
named mpg in our environment when
the mtcars data frame is attached. In
such cases, the original object takes
precedence, which isn’t what you
want. The plot statement fails
because mpg has 3 elements and
disp has 32 elements.

> mpg <- c(25, 36, 47)
> attach(mtcars)
The following object is masked _by_ .GlobalEnv:

    mpg

> plot(mpg, wt)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  ‘x’ and ‘y’ lengths differ

Some Useful Functions for Data Frame (5/8)

An alternative approach is to use the with() function. In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don’t have to worry about name conflicts here. If there’s only one statement (for example, summary(mpg)), the {} brackets are optional.

> with(mtcars, {
+   print(summary(mpg))
+   plot(mpg, disp)
+   plot(mpg, wt)
+ })
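
The name lookup inside with() can be checked without plotting. The following sketch recreates the conflicting global mpg vector from the attach() example; n_inside and avg_mpg are illustrative names, and mtcars is the built-in dataset used on the slides.

```r
mpg <- c(25, 36, 47)   # global object with the same name as a column

# Inside with(), names are looked up in the data frame first.
n_inside <- with(mtcars, length(mpg))
n_inside               # 32, not 3

# A single expression needs no braces.
avg_mpg <- with(mtcars, mean(mpg))
round(avg_mpg, 2)      # 20.09
```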

Some Useful Functions for Data Frame (6/8)

The limitation of the with() function is that assignments will only exist within the function brackets.

> with(mtcars, {
+   stats <- summary(mpg)
+   stats
+ })
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
> stats
Error: object ‘stats’ not found

Some Useful Functions for Data Frame (7/8)

If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<- instead of the standard one <-. It will save the object to the global environment outside of the with() call.

> with(mtcars, {
+   nokeepstats <- summary(mpg)
+   keepstats <<- summary(mpg)
+ })
> nokeepstats
Error: object ‘nokeepstats’ not found
> keepstats
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.40   15.43   19.20   20.09   22.80   33.90
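
An alternative that is not shown on the slides, offered here as a suggestion: with() returns the value of its last expression, so the result can be assigned with the ordinary <- operator and no global assignment is needed. The names results, mean_mpg and max_wt are illustrative.

```r
# with() returns the value of the last expression, so it can be assigned directly.
keepstats <- with(mtcars, summary(mpg))
keepstats

# Several results can be returned together in a list.
results <- with(mtcars, list(mean_mpg = mean(mpg), max_wt = max(wt)))
results$max_wt     # 5.424
```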

Some Useful Functions for Data Frame (8/8)

> head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Factors (1/3)

Categorical (nominal) and ordered
categorical (ordinal) variables in R
are called factors.

The function factor() stores the
categorical values as a vector of
integers in the range [1… k] (where
k is the number of unique values in
the nominal variable), and an
internal vector of character strings
(the original values) mapped to
these integers.

> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> diabetes
[1] "Type1" "Type2" "Type1" "Type1"

Factors (2/3)

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, order=TRUE)
> patientdata <- data.frame(patientID, age, diabetes, status)
> str(patientdata)
'data.frame':   4 obs. of  4 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3

Factors (3/3)

> summary(patientdata)
   patientID         age         diabetes       status
 Min.   :1.00   Min.   :25.00   Type1:3   Excellent:1
 1st Qu.:1.75   1st Qu.:27.25   Type2:1   Improved :1
 Median :2.50   Median :31.00             Poor     :2
 Mean   :2.50   Mean   :34.75
 3rd Qu.:3.25   3rd Qu.:38.50
 Max.   :4.00   Max.   :52.00
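
By default, factor levels are sorted alphabetically, which is why the str() output above orders status as "Excellent" < "Improved" < "Poor". If the intended ordering is Poor < Improved < Excellent (an assumption about the data, not something stated on the slides), the levels argument of factor() can set it explicitly. The names status_default and status_ordered are illustrative.

```r
status <- c("Poor", "Improved", "Excellent", "Poor")

# Default: levels are sorted alphabetically.
status_default <- factor(status, ordered = TRUE)
levels(status_default)        # "Excellent" "Improved" "Poor"

# Explicit levels give an ordering that follows the meaning of the values.
status_ordered <- factor(status, ordered = TRUE,
                         levels = c("Poor", "Improved", "Excellent"))
as.integer(status_ordered)    # 1 2 3 1

# Ordered factors support comparisons between elements.
status_ordered[2] > status_ordered[1]   # TRUE: Improved ranks above Poor
```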

Lists (1/2)

Lists are the most complex of the R data types. Basically, a list is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.

mylist <- list(object1, object2, …)

mylist <- list(name1=object1, name2=object2, …)

Lists (2/2)

> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow=5)
> k <- c("one", "two", "three")
> mylist <- list(title=g, ages=h, j, k)
> mylist
$title
[1] "My First List"

$ages
[1] 25 26 18 39

[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

[[4]]
[1] "one"   "two"   "three"

> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
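
The difference between double brackets, single brackets and the $ notation is easy to miss. A short sketch, assuming the same mylist as above:

```r
g <- "My First List"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow = 5)
k <- c("one", "two", "three")
mylist <- list(title = g, ages = h, j, k)

# Double brackets and $ return the component itself.
mylist[["ages"]]         # numeric vector
mylist$ages              # same result
class(mylist[["ages"]])  # "numeric"

# Single brackets return a sub-list that contains the component.
class(mylist["ages"])    # "list"

# A component can be indexed further once it has been extracted.
mylist[[3]][2, 2]        # 7
names(mylist)            # "title" "ages" "" ""
```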

References

• W. Venables, D. Smith, and the R Core Team (2015) An Introduction to R.

• P. Teetor (2011) R Cookbook. O’Reilly.

• J. Adler (2012) R in a Nutshell, 2nd Edition. O’Reilly.

Exercise 1/10

# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE

# Check class of my_numeric
class(my_numeric)

# Check class of my_character
class(my_character)

# Check class of my_logical
class(my_logical)

Exercise 2/10

# Vector operations

a) Create a vector like 1, 2, 3, ..., 10
b) Get the length of the above vector
c) Get the last three numbers from the vector
d) Sort the numbers in decreasing order
e) Remove the number 9 from the above vector

Exercise 3/10

# Vector operations

a) Create a vector from 1 to 3.1415 with the length of 100
b) Create a vector from -2 to 0.1 with the length of 100
c) Get the sum and inner product of a and b

Exercise 4/10

# Vector operations

a) Create a vector x containing 2, 3, 4, 1
b) Create a vector y containing 1, 1, 3, 7
c) Combine the column vectors x and y

Exercise 5/10

# Vector operations

Use the rep() function to create the following vectors:

a) "0" "x" "0" "x" "0" "x"
b) 1 3 2 1 3 2 1 3 2 1 3 2
c) 1 1 1 2 2 2 3 3 3

Exercise 6/10

# Matrix operations

a) Create a matrix which contains values from 1 to 100 with 5 rows and 20 columns
b) Print out the dimensions of the matrix
c) Find out the 4th column’s sum
d) Find out the sum of row 3 and row 17
e) Assign the following names to the rows: "A", "B", "C", "D", "E"

Exercise 7/10

# Matrix operations

a) Use the matrix() function to create the following matrix:

         TypeA TypeB TypeC
Navarra    190     8    22
Zaragoza   191     4   1.7
Madrid     223    80   2.0

b) Add the following column into the matrix:

TypeD
 2.00
 3.50
 2.75

c) Use the apply() function to calculate the mean of each column of the matrix

Exercise 8/10

# Array operations

Create the following array:

, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

, , 3

     [,1] [,2] [,3]
[1,]   19   22   25
[2,]   20   23   26
[3,]   21   24   27

Exercise 9/10

# Data frame operations

Type df <- iris, then

a) Print out the dimensions of df
b) Find out the sum of the "Sepal.Width" column
c) Rename column "Species" as "label"
d) Find out how many records have "Petal.Length" larger than 1.41

Exercise 10/10

# List operations

Create the following list and save it to the variable x:

[[1]]
[1] 2 3 5

[[2]]
[1] "aa" "bb" "cc" "dd" "ee"

[[3]]
[1]  TRUE FALSE  TRUE FALSE FALSE

[[4]]
[1] 3

Additional Exercises

Well done if you’ve completed the exercises. Once you complete these additional exercises, you can leave the workshop sessions.

Additional Exercise (1/3)

Create the following data frame:

   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no

Additional Exercise (2/3)

# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Total winnings with poker
total_poker <- sum(poker_vector)

a) Calculate total winnings with roulette
b) Calculate winnings overall

Additional Exercise (3/3)

a) Create a data frame as follows:

df <- data.frame(Product=gl(3,10,labels=c("A","B", "C")),
                 Year=factor(rep(2002:2011,3)),
                 Sales=1:30)

b) Find the sum of all products’ sales by year

Thank You!

bchen@lincoln.ac.uk