Chapter 1
Introduction
1.1 Statistical Computing
Computational statistics and statistical computing are two areas within statistics that may be broadly described as computational, graphical, and numerical approaches to solving statistical problems. Statistical computing traditionally has more emphasis on numerical methods and algorithms, such as optimization and random number generation, while computational statistics may encompass such topics as exploratory data analysis, Monte Carlo methods, and data partitioning. However, most researchers who apply computationally intensive methods in statistics use both computational statistics and statistical computing methods; there is much overlap and the terms are used differently in different contexts and disciplines. Gentle [118] and Givens and Hoeting [129] use “computational statistics” to encompass all the relevant topics that should be covered in a modern introductory text, so that “statistical computing” is somewhat absorbed under this broader definition of computational statistics. On the other hand, journals and professional organizations seem to use both terms to cover similar areas.
This book encompasses parts of both of these subjects, because a first course in computational methods for statistics necessarily includes both. Some examples of topics covered are described below.
Monte Carlo methods refer to a diverse collection of methods in statistical inference and numerical analysis where simulation is used. Many statistical problems can be approached through some form of Monte Carlo integration. In parametric bootstrap, samples are generated from a given probability distribution to compute probabilities, gain information about sampling distributions of statistics such as bias and standard error, to assess the performance of procedures in statistical inference, and to compare the performance of competing methods for the same problem. Resampling methods such as the ordinary bootstrap and jackknife are nonparametric methods that can be applied when the distribution of the random variable or a method to simulate it directly is unavailable. The need for Monte Carlo analysis also arises because in many problems, an asymptotic approximation is unsatisfactory or intractable. The convergence to the limit distribution may be too slow, or we require results for finite samples; or the asymptotic distribution has unknown parameters.
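As a minimal sketch of the simulation idea described above, the following code estimates the standard error and bias of the sample median by repeatedly generating samples from a known distribution. The choice of a standard normal population, the sample size n = 20, and the number of replicates m = 5000 are assumptions made only for illustration.

```r
# Monte Carlo estimate of the standard error of the sample median.
# Assumed setup (not from the text): samples of size n = 20 from N(0, 1).
set.seed(2019)                 # fix the seed so the run is reproducible
m <- 5000                      # number of Monte Carlo replicates
n <- 20                        # sample size in each replicate

# Generate m samples and compute the median of each one
medians <- replicate(m, median(rnorm(n)))

se_hat <- sd(medians)          # Monte Carlo estimate of se(median)
bias_hat <- mean(medians) - 0  # estimated bias; the true median is 0

se_hat
bias_hat
```

The same pattern (generate, compute the statistic, summarize the replicates) applies to other statistics and other sampling distributions.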
Rizzo, Maria L.. Statistical Computing with R, Second Edition, CRC Press LLC, 2019. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/ualberta/detail.action?docID=5731927.
Created from ualberta on 2021-03-06 10:35:02.
Copyright © 2019. CRC Press LLC. All rights reserved.

Monte Carlo methods are covered in Chapters 6–11. The first tool needed in a simulation is a method for generating pseudo-random samples; these methods are covered in Chapters 3 and 4.
Markov Chain Monte Carlo (MCMC) methods are based on an algorithm to sample from a specified target probability distribution that is the stationary distribution of a Markov chain. These methods are widely applied for problems arising in Bayesian analysis, and in such diverse fields as computational physics and computational finance. Markov Chain Monte Carlo methods are covered in Chapter 11.
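To make the idea of a chain with a specified stationary distribution concrete, here is a minimal sketch of a random walk Metropolis sampler. It is not the book's implementation; the standard normal target, the N(x, 1) proposal, the chain length, and the burn-in length are all assumptions chosen for illustration.

```r
# Random walk Metropolis sampler: a minimal sketch under stated assumptions.
# Assumed target: standard normal density; assumed proposal: N(current, 1).
set.seed(11)
n_iter <- 20000
chain <- numeric(n_iter)       # chain[1] = 0 is the starting value

for (i in 2:n_iter) {
  current <- chain[i - 1]
  proposal <- current + rnorm(1)            # symmetric proposal
  # Accept with probability min(1, target(proposal) / target(current))
  ratio <- dnorm(proposal) / dnorm(current)
  chain[i] <- if (runif(1) < ratio) proposal else current
}

draws <- chain[-(1:1000)]      # discard a burn-in period
mean(draws)                    # should be near 0
sd(draws)                      # should be near 1
```

Because successive values are dependent, the retained draws carry less information than an independent sample of the same size.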
Several special topics also deserve an introduction in a survey of computationally intensive methods. Density estimation (Chapter 12) provides a nonparametric estimate of a density, which has many applications in addition to estimation, ranging from exploratory data analysis to cluster analysis. Computational methods are essential for the visualization of multivariate data and reduction of dimensionality. The increasing interest in massive and streaming data sets, and high dimensional data arising in applications of biology and engineering, for example, demand improved and new computational approaches for multivariate analysis and visualization. Chapter 5 is an introduction to methods for visualization of multivariate data. A review of selected topics in numerical methods such as root finding and numerical integration is presented in Chapter 13. An introduction to optimization using R is covered in Chapter 14.
A final chapter of optional material specific to R programming should be accessible to readers after covering Chapter 3. Programming topics such as benchmarking, efficiency and code profiling are covered in Chapter 15. Several years ago with the release of Rcpp [82, 83], writing R extensions in compiled libraries became much simpler so that most experienced R users with a modest amount of background in C++ can easily integrate compiled C++ functions with R code. Some simple examples are illustrated in the final chapter of the book for those users who are interested.
Many references can be recommended for further reading on these topics. Efron and Hastie [89] provide an up-to-date review of how modern statistics has evolved in the computer age. Gentle [118, 119] and the volume edited by Gentle, et al. [120] have thorough coverage of topics in computational statistics. A survey of methods in statistical computing is covered in Kundu and Basu [170]. Givens and Hoeting [129] is a recent graduate text on computational statistics and statistical computing. Hardle et al. [139] is an introductory text with examples in R. Martinez and Martinez [197] is an accessible introduction to computational statistics, with numerous examples in MATLAB. Books that primarily cover Monte Carlo methods or resampling methods include Davison and Hinkley [68], Efron and Tibshirani [91], Hjorth [149], Liu [186], Chernick [50] and Robert and Casella [240]. Statistical learning is a closely related topic that applies computational methods to solve a wide range of problems in modern statistics; see Hastie et al. [143] and James et al. [157]. On density estimation see Scott [264] and Silverman [268]. A good resource for applied linear models in R and other extensions such as nonparametric regression and smoothing is Faraway [95]. Albert [5] and McElreath [199] cover Bayesian computational methods with examples in R. For statistical applications of numerical analysis see Lange [176] or Monahan [210].
Although this book aims to be complete for novice R users to get started, it is not intended as a full-length text about using R for statistics or data science. Some R users may also be interested in supplementary resources designed for learning to use R. There is a long list of introductory books and materials of this type. R by Example [7] may appeal to users who enjoy learning from detailed, fully implemented examples. Verzani [295], Dalgaard [67] or Wickham and Grolemund [318] are on a similar level. For graphics in R, see Chang’s R Graphics Cookbook [47] and refer to both Chang [47] and Wickham [313] for ggplot2.
For technical reference on programming in R, several excellent references are available in addition to the R manuals [227, 229, 294]. For advanced programming topics see Eddelbuettel [82], Gillespie and Lovelace [127], and Wickham [312], and their respective websites.
There are now many excellent online resources available, in addition to the online R and RStudio documentation, such as galleries of code and graphics, online books, tutorials and blogs. See the references in the individual chapters for some of these. The R-bloggers website is worth visiting; it currently combines blog posts from some 750 bloggers at https://www.r-bloggers.com/.
1.2 The R Environment
The R environment is a suite of software and programming language based on S, for data analysis and visualization. “What is R” is one of the frequently asked questions included in the online documentation for R. Here is an excerpt from the R FAQ [226]:
R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
R is based on the S language. Some details about differences between R and S are given in the R FAQ [151]. Venables and Ripley [293] is a good resource for applied statistics with S, Splus, and R. Other references on the S language include [27, 42, 45, 292].
The home page of the R project is http://www.r-project.org/, and the current R distribution and documentation are available on the Comprehensive R Archive Network (CRAN) at http://cran.R-project.org/. The R distribution includes the base and recommended packages with documentation. A help system and several reference manuals are installed with the program.
Readers who have not already done so should proceed to download and install the most recent version of R corresponding to their operating system. Installation is easy; after downloading the setup file from http://cran.R-project.org/ and running it, most users will simply accept the default options as prompted by the installation wizard.
Most R users currently use an integrated development environment or IDE to interact with the R system, edit source files, and view output. Although a type of IDE (the R GUI) is included with the R distribution, it is no longer widely used. Perhaps the most popular IDE for R currently is RStudio. In this second edition of the book, RStudio is treated as the default IDE as it is widely used, free to download in its noncommercial version, and packed with convenient features. Users can install a free version of RStudio from https://www.rstudio.com/. RStudio has recently released RStudio Cloud, currently in alpha. For more information about the cloud option, consult the website https://rstudio.cloud/ and the community page at https://community.rstudio.com/c/rstudio-cloud. Other IDEs are available, of course, and any of them can easily be used with this book.
Programming is discussed as needed in the chapters that follow. In this text, new functions or programming methods are explained in remarks called “R notes” as they arise. Some “R notes” also address certain aspects of the R system or devices. Readers are always encouraged to consult the R help system and manuals [151, 226, 294]. For platform specific details about installation and interacting with the graphical user interface the best resources are the R manual [228] and current information at www.r-project.org.
Although RStudio provides many user-friendly features and powerful tools, R is a stand-alone program that could be run in batch mode if required for certain projects. R scripts can execute on a supercomputer and there are extensions to enable high performance computing. Other extensions like rstan provide an interface to a powerful scripting language and sampling engine Stan for Bayesian analysis. Refer to the CRAN Task Views “High Performance Computing” and “Bayesian” for more details. CRAN Task Views are a good resource to find what is available on CRAN for a wide range of statistical and machine learning applications. See https://cloud.r-project.org/web/views/.
In the remainder of this chapter, we cover some basic information aimed to help a new user get started with R. Topics include the recommended RStudio integrated development environment, basic syntax, using the online help, data, files, scripts, and packages. Vectors, matrices, lists and data frames are introduced with examples, and there is an overview of basic graphics functions.
1.3 Getting Started with R and RStudio
RStudio provides a convenient user interface that makes R much easier to use for beginners and advanced users. It is open source and available to download and install from the RStudio website at https://www.rstudio.com/. It combines a code editor with syntax highlighting, plot, environment, and console windows. An integrated help system and other important utilities are provided. RStudio has extensive support for generating reports using R Markdown with the knitr package [324], and package development without leaving the RStudio environment. A screen shot of an RStudio session is shown in Figure 1.1.
FIGURE 1.1: RStudio screen shot.
In the screen shot of RStudio, Figure 1.1, an R script is open in the code editor window (upper left), and it has been run interactively so that commands and results appear in the Console window (lower left) and Plot window (lower right). In the upper right the Environment window is visible, showing the names and values of user defined objects in the environment. To try a similar example, open the R code file for this chapter, use the mouse to select the first several lines, and click “Run” from the toolbar in the code editor window. Alternately click “Source” to run all code in the script.
Commands can also be typed at the prompt in the R Console window. For example, we can evaluate the standard normal density $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ at $x = 2$ by typing the formula or (more conveniently) the dnorm function:
> 1/sqrt(2*pi) * exp(-2)
[1] 0.05399097
> dnorm(2)
[1] 0.05399097
In the example above, the command prompt is >. The [1] indicates that the result displayed is the first element of a vector.
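The agreement between the typed formula and dnorm can also be checked at several points at once. The evaluation points below are arbitrary values chosen for illustration.

```r
# Compare the density formula with dnorm at a few arbitrary points
x <- c(-1, 0, 0.5, 2)
by_formula <- 1 / sqrt(2 * pi) * exp(-x^2 / 2)
by_dnorm <- dnorm(x)

# all.equal compares numeric values up to a small tolerance
all.equal(by_formula, by_dnorm)
```

Here the result has four elements, so the printed vectors would still begin with [1], the index of the first element on the line.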
A command can be continued on the next line. The prompt symbol changes whenever the command on the previous line is not complete. In the example below, the plot command is continued on the second line, as indicated by the prompt symbol changing to +.
> plot(cars, xlab="Speed", ylab="Distance to Stop",
+ main="Stopping Distance for Cars in 1920")
Whenever a statement or expression is not complete at the end of a line, the parser automatically continues it on the next line. No special symbol is needed to end a line. (A semicolon can be used to separate statements on a single line, although this tends to make code harder to read.) A group of statements can be gathered into a single (compound) expression by enclosing them in curly braces { }.
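The three rules in the paragraph above can be seen in a short sketch. The variable names are hypothetical and used only for illustration.

```r
# A statement that is incomplete at the end of a line continues on the next
total <- 1 + 2 +
  3

# Two statements on one line, separated by a semicolon
a <- 2; b <- 5

# A compound expression: its value is the value of the last statement
z <- {
  s <- a + b
  s^2
}

total
z
```

Note that the continuation works only because the first line ends with an operator; if the line ended after `2`, the parser would treat it as a complete statement.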
To cancel a command, a partial command, or a running script, use Ctrl-C, or in Windows press the escape key (Esc). If the RStudio console window has a red square button, an error has occurred and one can debug or click the red square to stop.
To exit the RStudio IDE, simply close the main window. The program usually prompts with the question “Save workspace data to /.Rdata?” Click “Yes” to save the workspace, which includes user defined objects and remembers any open files, or click “No” to exit without saving.
R Note 1.1 Why are some results not seen in the console?
If R code is submitted interactively in RStudio using “Run”, the R code and result are echoed to the console, but when a file is sourced using the “Source” button or source function, statements and results are not echoed to the console. For example, evaluating the expression dnorm(2) interactively by typing in the console window or using “Run” echoes the command and prints the result to the console. However, if that expression is part of an R script, when the script is sourced using “Source,” the result is not printed unless we explicitly print it, e.g., print(dnorm(2)). In RStudio, the “Source” button has a drop-down menu with an optional “Source with echo” in case this is an issue.
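The difference can be reproduced without the RStudio buttons by calling source directly. In this sketch the script is a temporary file created only for the demonstration, and capture.output is used so that the two behaviors can be compared side by side.

```r
# Write a two-line script to a temporary file (hypothetical example file)
script <- tempfile(fileext = ".R")
writeLines(c("dnorm(2)", "print(dnorm(2))"), script)

# Plain source(): only the explicit print() call produces output
out_quiet <- capture.output(source(script))

# source() with echo = TRUE: commands and all results are shown
out_echo <- capture.output(source(script, echo = TRUE))

out_quiet
out_echo
unlink(script)   # remove the temporary file
```

The first result contains a single printed value, while the echoed version also shows each command and the value of the bare expression dnorm(2).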
R Note 1.2
The RStudio Help menu includes RStudio Docs, which links to an online support page. A good article to look over on this page is ‘Using the RStudio IDE’ which covers many basic and less obvious features of RStudio. A very useful feature for editing within RStudio is that one or more Source panes can be detached and moved to the desktop. Simply click on the tab, drag and drop. Also see Keyboard Shortcuts Help for shortcuts to comment/uncomment lines, reformat code, and many other actions; a very handy shortcut Ctrl+Enter will run the selected line(s) of code or the line containing the cursor. To exit the shortcut help, press Esc.
1.4 Basic Syntax
The usual assignment operator is <-. For example, x <- sqrt(2 * pi) assigns the value of $\sqrt{2\pi}$ to the symbol x.
R Note 1.3 The assignment operator
In many situations the two assignment operators <- and = can be used interchangeably. It is a good practice to use <- for assignment because there is a technical difference between the two operators. The R documentation on assignment operators states that “The operators <- and = assign into the environment in which they are evaluated. The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions.” Throughout this book we will use <- for assignment and reserve = for arguments to functions.
Commands entered at the command prompt in the R console are automatically echoed to the console, but assignment operations are silent. Some objects have print methods so that the output displayed is not necessarily the entire object, but a summarized report. Compare the effect of these commands. The first command displays a sequence (0.0 0.5 1.0 1.5 2.0 2.5 3.0), but does not store it. The second command stores the sequence in x, but does not display it.
seq(0, 3, 0.5)
x <- seq(0, 3, 0.5)
Syntax
Below are some help topics on R operators and syntax. The ? invokes the help system for the indicated keyword.
?Syntax
?Arithmetic
?Comparison #relational operators
?Extract #operators on vectors and arrays
?Control #control flow
?Logic #logical operators
Symbols or labels for functions and variables are case-sensitive and can include letters, digits, and periods.
Symbols can also contain the underscore character, but cannot start with a digit or an underscore. Many symbols are already defined by the R base or recommended packages. To check if a symbol is already defined, type the symbol at the prompt. The symbols q, t, I, T, and F, for example, are used by R. Note that whenever a package is loaded, other symbols may now be defined by the package.
> T
[1] TRUE
> t
function (x) UseMethod("t")
> g
Error: Object "g" not found
Here we see that both T and t are already defined, but g is not yet defined by R or by the user. Nothing prevents a user from assigning a new value to predefined symbols such as t or T, but it is a bad programming practice in general and can lead to unexpected results and programming errors.
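A symbol can also be checked programmatically with exists, and the risk of reassigning a predefined symbol can be seen directly. This is a small sketch; the name sumdice_demo_xyz is a hypothetical symbol assumed not to be defined in the session.

```r
exists("t")                  # TRUE: t is the transpose function in base R
exists("sumdice_demo_xyz")   # hypothetical name, assumed to be undefined

T <- 0                       # masks the predefined T (bad practice)
masked <- isTRUE(T)          # FALSE, because T is now 0
rm(T)                        # remove the user-defined copy
restored <- isTRUE(T)        # TRUE again: the base definition is visible

masked
restored
```

Unlike T and F, the reserved words TRUE and FALSE cannot be reassigned, which is one reason to prefer them in code.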
Most new R users have some experience with other programming environments and languages such as C, MATLAB, or SAS. Some operations and features are common to all these languages. A brief list summarizing R syntax for some of these common elements is shown in Table 1.1. For more details see the help topic Syntax. Some of the functions common to most development environments are listed in Table 1.2.
Most arithmetic operations are vectorized. For example, x^2 will square each of the elements of the vector x, or each entry of the matrix x if x is a matrix. Similarly, x*y will multiply each of the elements of the vector x times the corresponding element of y (recycling the shorter vector if the lengths differ, and generating a warning if the longer length is not a multiple of the shorter). Operators for matrices are described in Table 1.3.
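The following sketch illustrates vectorized arithmetic, the recycling of a shorter vector, and the difference between the elementwise and matrix products. The vectors and the 2 by 2 matrix are arbitrary values chosen for illustration.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

x^2        # squares each element
x * y      # elementwise product

# Lengths 4 and 2: the shorter vector is recycled without a warning
x * c(1, 2)

# Lengths 4 and 3: recycling is incomplete, so R issues a warning;
# tryCatch is used here only to capture the warning message as text
msg <- tryCatch(x * c(1, 2, 3), warning = function(w) conditionMessage(w))
msg

A <- matrix(1:4, nrow = 2)
A * A      # elementwise product of the entries
A %*% A    # matrix product
```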
TABLE 1.1: R Syntax and Commonly Used Operators
Description                   R symbol   Example
Comment                       #          #this is a comment
Assignment                    <-         x <- log2(2)
Concatenation operator        c          c(3,2,2)
Elementwise multiplication    *          a*b
Exponentiation                ^          2^1.5
x mod y                       x %% y     25 %% 3
Integer division              %/%        25 %/% 3
Sequence from a to b by h     seq        seq(a,b,h)
Sequence operator             :          0:20

TABLE 1.2: Commonly Used Functions
Description                   R symbol
Square root                   sqrt
⌊x⌋, ⌈x⌉                      floor, ceiling
Natural logarithm             log
Exponential function e^x      exp
Factorial                     factorial
Random Uniform numbers        runif
Random Normal numbers         rnorm
Normal distribution           pnorm, dnorm, qnorm
Rank, sort                    rank, sort
Variance, covariance          var, cov
Std. dev., correlation        sd, cor
Frequency tables              table
Missing values                NA, is.na

1.5 Using the R Online Help System
RStudio includes a Help tab with a search box for searching by keyword. Help topics can also be searched from the command prompt. For documentation on a topic, type ?topic or help(topic) where “topic” is the name of the topic for which you need help. For example, ?seq will bring up documentation for the sequence function. In some cases, it may be necessary to surround the topic with quotation marks.
> ?seq #display help for sequence function
> ?%%
Error: syntax error, unexpected SPECIAL in " ?%%"
The second version (below) produces the help topic.
> ?"%%"
TABLE 1.3: R Syntax and Functions for Vectors and Matrices
Description                   R symbol        Example
Zero vector                   numeric(n)      x <- numeric(n)
                              integer(n)      x <- integer(n)
                              rep(0,n)        x <- rep(0,n)
Zero matrix                   matrix(0,n,m)   x <- matrix(0,n,m)
ith element of vector a       a[i]            a[i] <- 0
jth column of a matrix A      A[,j]           sum(A[,j])
ijth entry of matrix A        A[i,j]          x <- A[i,j]
Matrix multiplication         %*%             a%*%b
Elementwise multiplication    *               a*b
Matrix transpose              t               t(A)
Matrix inverse                solve           solve(A)
Diagonal                      diag            diag(A)

In RStudio, “R Help” in the Help menu displays Help in an integrated browser window, with hyperlinks. Alternately the function help.start() entered at the command prompt will display a summary of topics in HTML format with links.
Another way to search for help on a topic is help.search(). This and the search engine in HTML help may help locate several relevant topics. For example, if we are searching for a method to compute a permutation, help.search("permutation") produces two results: order and sample. We can then consult the help topics for order and sample. The help topic for sample shows that x is sampled without replacement (a permutation of the elements of vector x) by:
sample(x) #permutation of all elements of x
sample(x, size=k) #permutation of k elements of x
(If the goal was to count permutations, and evaluate $n!/(n-k)!$, we want ?Special, a list of special functions including factorial and gamma.)
Many help files end with executable examples. The examples can be copied and pasted at the command line. To run all the examples associated with topic, use example(topic). See for example the interesting set of examples for density. To run all the examples for density, type example(density). To see one example, open the help page, copy the lines and paste them at the command prompt.
help(density)
# copy and paste the lines below from the help page
# The Old Faithful geyser data
d <- density(faithful$eruptions, bw = "sj")
d
plot(d)
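Returning to the permutation search discussed earlier in this section, the sketch below generates random permutations with sample and counts ordered selections with factorial. The vector of letters and the values of n and k are assumptions chosen for illustration.

```r
set.seed(5)
x <- c("a", "b", "c", "d", "e")
n <- length(x)
k <- 3

p_all <- sample(x)            # random permutation of all n elements
p_k <- sample(x, size = k)    # k elements drawn without replacement

# Number of ordered selections of k items from n: n! / (n - k)!
n_perm <- factorial(n) / factorial(n - k)

p_all
p_k
n_perm
```

For large n the factorials overflow; working on the log scale with lgamma or lfactorial, both documented under ?Special, is one way around this.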
A list of available data sets in the base and loaded packages is displayed by data(), and documentation on a loaded data set is displayed by the associated help topic. For example, help("faithful") displays the Old Faithful geyser data help topic. If a package is installed but not yet loaded, specify the name of the package. For example, help("geyser", package = MASS) displays help for the dataset geyser without loading the MASS package [293]. Alternately, MASS::geyser will access the geyser data from the MASS package. For example, to get the summary of this data:
> summary(MASS::geyser)
    waiting          duration
 Min.   : 43.00   Min.   :0.8333
 1st Qu.: 59.00   1st Qu.:2.0000
 Median : 76.00   Median :4.0000
 Mean   : 72.31   Mean   :3.4608
 3rd Qu.: 83.00   3rd Qu.:4.3833
 Max.   :108.00   Max.   :5.4500
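The :: operator works the same way for any installed package. As a small sketch, the code below reads the faithful data from the datasets package, which is part of every R installation, and guards the MASS example with requireNamespace in case MASS is not installed on a particular system.

```r
# Access a data set with the :: operator, without attaching its package
eruptions <- datasets::faithful$eruptions
length(eruptions)             # number of recorded eruptions

# MASS is a recommended package that is usually, but not always, installed
if (requireNamespace("MASS", quietly = TRUE)) {
  print(dim(MASS::geyser))
}
```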
1.6 Distributions and Statistical Tests
There are dozens of probability distributions and statistical tests implemented in the R stats package, which is automatically available when using R. To use the integrated help system to search for a list of what is available, search for the keyword “Distributions.” This search should find a manual page that lists all of the available probability distributions included in stats when R is installed. Other distribution functions may be available in external packages.
To search for statistical tests implemented in R stats, it is easiest to use the wildcard type of search help.search("keyword", package="stats"). This restricts the search to the stats package so that we only see the results in that package. It is helpful to know that test functions in R are named in this pattern: “name.test”. For example, Pearson’s chi-squared test function is chisq.test. A wildcard search ending in “.test” should find all of the test functions.
For example, try the following searches.
help.search("distribution", package="stats")
help.search(".test", package="stats")
The above search for “distribution” displays a list of links to help pages for statistical distributions in stats, along with a few other distribution-related functions such as the empirical distribution function (ecdf). There is also a link to the “Distributions” manual page.
The search for “.test” displays a list of links to help pages for over 30 statistical tests implemented in R stats. The list includes t.test (Student’s t test), cor.test (correlation test), prop.test (tests for proportions), and many other commonly used tests. See Example 1.7 for an application of the Wilcoxon rank sum test.
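As an alternative to the help search, the names of the test functions can be listed at the prompt with apropos, which matches object names on the search path against a regular expression. This is a sketch of a different technique from help.search, and the simulated data passed to t.test are an assumption made only to show the calling pattern.

```r
# Names of attached functions that end in ".test"
tests <- apropos("\\.test$")
head(tests)

# Calling one of the tests on simulated data (illustrative values only)
set.seed(7)
x <- rnorm(30, mean = 0.5)
res <- t.test(x, mu = 0)
res$statistic
res$p.value
```

Test functions in stats return an object of class "htest", so components such as statistic, p.value, and conf.int can be extracted in the same way for most tests.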
1.7 Functions
The syntax for a function definition is
function( arglist ) expr
return(value)
Many examples of functions are documented in the chapter “Writing your own functions” of the manual [294].
Example 1.1. Here is a simple example of a user-defined R function that “rolls” n fair dice and returns the sum.
sumdice <- function(n) {
    k <- sample(1:6, size=n, replace=TRUE)
    return(sum(k))
}
The function definition can be entered by several methods.
1. Type the lines at the prompt, if the definition is short.
2. Copy from an editor and paste at the command prompt.
3. Save the function in a script file and source the file.
Note that the IDE provides an editor and toolbar for submitting code. Once the user-defined function is entered in the workspace, it can be used like other R functions.
#to print the result at the console
> sumdice(2)
[1] 9
#to store the result rather than print it
a <- sumdice(100)
#we expect the mean for 100 dice to be close to 3.5
> a / 100
[1] 3.59
The value returned by an R function is the argument of the return statement or the value of the last evaluated expression. The sumdice function could be written as
sumdice <- function(n)
    sum(sample(1:6, size=n, replace=TRUE))
Functions can have default argument values. For example, sumdice can be generalized to roll s-sided dice, but keep the default as 6-sided. The usage is shown below.
sumdice <- function(n, sides = 6) {
    if (sides < 1) return (0)
    k <- sample(1:sides, size=n, replace=TRUE)
    return(sum(k))
}
> sumdice(5) #default 6 sides
[1] 12
> sumdice(n=5, sides=4) #4 sides
[1] 14
The body of a function can be as short as one line, like the one-line version of sumdice above, or have many lines. The function body must be enclosed in braces when it has more than one line.
An easy way to display the list of arguments to a function is args. Try for example args(sample). If you have coded the function sumdice above, try args(sumdice). ⋄
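As a short follow-up sketch, the argument list of sumdice can also be inspected programmatically with formals, and the function can be called repeatedly with replicate to check it against the expected value. The seed and the number of replicates are arbitrary choices for illustration.

```r
# The generalized dice function from the example above
sumdice <- function(n, sides = 6) {
  if (sides < 1) return(0)
  k <- sample(1:sides, size = n, replace = TRUE)
  return(sum(k))
}

names(formals(sumdice))       # argument names
formals(sumdice)$sides        # default value of sides

# Roll two dice 10000 times; the expected sum is 2 * 3.5 = 7
set.seed(101)
sums <- replicate(10000, sumdice(2))
mean(sums)
```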
1.8 Arrays, Data Frames, and Lists
Arrays, data frames, and lists are some of the objects used to store data in R. A matrix is a two-dimensional array. A data frame is not a matrix, although it can be represented in a rectangular layout like a matrix. Unlike a matrix, the columns of a data frame may be different types of variables. Arrays contain a single type.
Data Frames
A data frame is a list of variables, each of the same length but not necessarily of the same type. In this section we will discuss how to extract values of variables from a data frame.
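The distinction between a data frame and a matrix can be seen in a small sketch. The data frame below is a made-up example with hypothetical column names; converting it to a matrix forces all entries to a single type.

```r
# A small data frame with columns of three different types
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(2.5, 3.0, 4.1),
  stringsAsFactors = FALSE
)

sapply(df, class)   # each column keeps its own type
is.list(df)         # a data frame is a list of equal-length variables

# A matrix holds a single type, so every entry becomes character
m <- as.matrix(df)
typeof(m)
```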
Example 1.2 (Iris data). The Fisher iris data set gives four measurements on observations from three species of iris. The first few cases in the iris data are shown below.
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
The iris data is an example of a data frame object. It has 150 cases in rows and 5 variables in columns. After loading the data, variables can be referenced by $name (the column name), by subscripts like a matrix, or by position using the [[ ]] operator. The list of variable names is returned by names. Some examples with output are shown below.
> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
[5] "Species"
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50
> w <- iris[[2]] #Sepal.Width
> mean(w)
[1] 3.057333
Alternately, the data frame can be attached and variables referenced directly by name. If a data frame is attached, it is a good practice to detach it when it is no longer needed, to avoid clashes with names of other variables.
> attach(iris)
> summary(Petal.Length[51:100]) #versicolor petal length
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 4.00 4.35 4.26 4.60 5.10
with and by
If we only need the iris data temporarily, we can use with. The syntax
in this example would be
with(iris, summary(Petal.Length[51:100]))
However, with does not make changes outside of its local scope. It is best used for displaying or printing results. We can, however, assign the value of the evaluated expression to an object to save it.
out <- with(iris, summary(Petal.Length[51:100]))
Suppose we wish to compute the means of all variables, by species. The first four columns of the data frame can be extracted with iris[,1:4]. Here the missing row index indicates that all rows should be included. The by function easily computes the means by species.
> by(iris[,1:4], Species, colMeans)
Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
--------------------------------------------------
Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
--------------------------------------------------
Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026
> detach(iris)
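The same group means can be computed without attaching the data frame. The following is a minimal sketch of two alternatives in base R, aggregate with a formula and split followed by colMeans; the object names means and means2 are chosen here only for illustration.

```r
# Formula interface: "." stands for all variables other than Species.
means <- aggregate(. ~ Species, data = iris, FUN = mean)
means                 # one row per species

# Split the first four columns by species, then apply colMeans to each block.
means2 <- sapply(split(iris[, 1:4], iris$Species), colMeans)
means2                # one column per species
means2["Petal.Length", "versicolor"]
```

Both versions pass the data frame explicitly, so there is nothing to detach afterward.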
R Note 1.4

Although iris$Sepal.Width, iris[[2]], and iris[ ,2] all produce the same result, the $ and [[ ]] operators can only select one element, while the [ ] operator can select several. See the help topic Extract.
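A short sketch of the distinction drawn in this note, using only the iris data; the object names a, b, c1, and sub are arbitrary.

```r
# $ and [[ ]] each extract a single column as a vector
a <- iris$Sepal.Width
b <- iris[["Sepal.Width"]]
identical(a, b)

# [ ] with matrix-style subscripts and one column also gives a vector
c1 <- iris[, 2]
identical(a, c1)

# but [ ] can select several columns, and the result is a data frame
sub <- iris[, c("Sepal.Width", "Petal.Width")]
class(sub)
dim(sub)

# list-style single-bracket indexing keeps the data frame structure
class(iris[2])
```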
Arrays and Matrices
An array is a multiply subscripted collection of a single type of data. An array has a dimension attribute, which is a vector containing the dimensions of the array.
Example 1.3 (Arrays). Different arrays are shown. The sequence of numbers from 1 to 24 is first a vector without a dimension attribute, then a one-dimensional array, then used to fill a 4 by 6 matrix, and finally a 3 by 4 by 2 array.
x <- 1:24                       # vector
dim(x) <- length(x)             # 1 dimensional array
matrix(1:24, nrow=4, ncol=6)    # 4 by 6 matrix
x <- array(1:24, c(3, 4, 2))    # 3 by 4 by 2 array
The 3 × 4 × 2 array defined by the last statement is displayed below.
, , 1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
, , 2
     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
The array x is displayed showing x[, , 1] (the first 3 × 4 elements) followed by x[, , 2] (the second 3 × 4 elements). ⋄
A matrix is a doubly subscripted array of a single type of data. If A is a matrix, then A[i, j] is the ij-th element of A, A[, j] is the j-th column of A, and A[i, ] is the i-th row of A. A range of rows or columns can be extracted using the : sequence operator. For example, A[2:3, 1:4] extracts the 2 × 4 matrix containing rows 2 and 3 and columns 1 through 4 of A.
Example 1.4 (Matrices). The statements
A <- matrix(0, nrow=2, ncol=2)
A <- matrix(c(0, 0, 0, 0), nrow=2, ncol=2)
A <- matrix(0, 2, 2)
all assign to A the 2 × 2 zero matrix. Matrices are filled in column major order by default; that is, the row index changes faster than the column index. Thus,
A <- matrix(1:8, nrow=2, ncol=4)
stores in A the matrix
[ 1 3 5 7 ]
[ 2 4 6 8 ]
If necessary, use the option byrow=TRUE in matrix to change the default. ⋄
Example 1.5 (Iris data: Example 1.2, cont.). We can convert the first four columns of the iris data to a matrix using as.matrix.
> x <- as.matrix(iris[,1:4]) #all rows of columns 1 to 4
> mean(x[,2]) #mean of sepal width, all species
[1] 3.057333
> mean(x[51:100,3]) #mean of petal length, versicolor
[1] 4.26
It is possible to convert the matrix to a three-dimensional array, but arrays (and matrices) are stored in “column major order” by default. For arrays, “column major” means that the indices to the left are changing faster than indices to the right. In this case it is easy to convert the matrix to a 50 × 3 × 4 array, with the species as the second dimension. This works because in the data matrix, by column major order, the iris species changes faster than the variable name (column).
> y <- array(x, dim=c(50, 3, 4))
> mean(y[,,2]) #mean of sepal width, all species
[1] 3.057333
> mean(y[,2,3]) #mean of petal length, versicolor
[1] 4.26
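To make the column major argument concrete, here is a minimal sketch that locates one element of the 50 × 3 × 4 array in the underlying data vector. The indices i, j, k are arbitrary values chosen for illustration.

```r
x <- as.matrix(iris[, 1:4])          # 150 x 4 data matrix, as in the text
y <- array(x, dim = c(50, 3, 4))     # observation x species x variable

# In column major order the leftmost index varies fastest, so element
# y[i, j, k] is stored at position i + 50*(j - 1) + 50*3*(k - 1).
i <- 7; j <- 2; k <- 3
pos <- i + 50 * (j - 1) + 50 * 3 * (k - 1)
pos

# The same value is found in the data vector and in the original matrix:
# row 50*(j - 1) + i is the 7th versicolor case, column k is petal length.
y[i, j, k] == as.vector(x)[pos]
y[i, j, k] == x[50 * (j - 1) + i, k]
```

Because rows 1 to 50, 51 to 100, and 101 to 150 of each column hold the three species in turn, the second array index picks out the species without any rearrangement of the data.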
It is somewhat more difficult to produce a 50 × 4 × 3 array of iris data, with species as the third dimension. Here is one approach. First the matrix is sliced into three blocks of 50 observations each, corresponding to the three species. Then the three blocks are concatenated into a vector of length 600, so that species is changing the slowest, and observation (row) is changing fastest. This vector then fills a 50 × 4 × 3 array.
> y <- array(c(x[1:50,], x[51:100,], x[101:150,]), dim=c(50,4,3))
> mean(y[,2,]) #mean of sepal width, all species
[1] 3.057333
> mean(y[,3,2]) #mean of petal length, versicolor
[1] 4.26
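The same 50 × 4 × 3 array can also be obtained by building the easier 50 × 3 × 4 array first and then permuting its dimensions. This is a sketch of an alternative, not the approach used in the text; it relies on the base R function aperm, and the names y1 and y2 are arbitrary.

```r
x <- as.matrix(iris[, 1:4])

# approach from the text: concatenate the three species blocks
y1 <- array(c(x[1:50, ], x[51:100, ], x[101:150, ]), dim = c(50, 4, 3))

# alternative: fill a 50 x 3 x 4 array, then swap dimensions 2 and 3
y2 <- aperm(array(x, dim = c(50, 3, 4)), perm = c(1, 3, 2))

dim(y2)
all.equal(y1, y2)     # the two constructions agree
mean(y2[, 3, 2])      # mean of petal length, versicolor
```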
This array is provided in R as the data set iris3. ⋄
Lists
A list is an ordered collection of objects. The members of a list (the components) can be different types. Lists are more general than data frames; in fact, a data frame is a list with class "data.frame". A list can be created by the list() function.
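A minimal sketch of creating a list and extracting its components follows. The list info and its component names are hypothetical, chosen only for illustration.

```r
# A list may mix components of different types and lengths.
info <- list(name = "iris",
             dims = dim(iris),
             numeric.cols = names(iris)[1:4])

names(info)        # component names
info$dims          # extract a component by name
info[[3]]          # extract a component by position

# A data frame is itself a list whose components are its columns.
is.list(iris)
length(iris)       # number of components (columns)
```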
Some functions return list objects. Two examples are shown below; the run length encoding function rle in Example 1.6 and the Wilcoxon test in Example 1.7.
Example 1.6 (Run length encoding). Consider a coin flipping experiment. A “run” is a sequence of heads or tails. It is known that the maximum run length in a sequence of n Bernoulli trials (p = 0.5) should be about log2(n). The R function rle computes run lengths for a sequence of Bernoulli trials. We can simulate 1000 independent flips of a fair coin using the R Binomial random generator function rbinom.
n <- 1000
x <- rbinom(n, size = 1, prob = .5)
> table(x)
x
  0   1
520 480
Here we can assign outcome 1 to heads and 0 to tails. We are interested in the pattern of runs of heads and tails; in particular, we are interested in the distribution of run lengths. The first part of the sequence can be shown with the head function.
> head(x, 30)
[1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0
The value returned by rle has two components: lengths and values.
r <- rle(x)
> str(r)
List of 2
 $ lengths: int [1:483] 2 2 1 4 3 1 1 1 1 2 ...
 $ values : int [1:483] 0 1 0 1 0 1 0 1 0 1 ...
 - attr(*, "class")= chr "rle"
The R structure function str is very helpful when we want information about a list object. To extract one of the components, we can use the dollar sign and name of the component or double bracket and position.
> head(r$lengths)
[1] 2 2 1 4 3 1
> head(r[[1]])
[1] 2 2 1 4 3 1
Is the maximum run length in this example approximately equal to log2(n)?
> max(r$lengths)
[1] 10
> log2(length(x))
[1] 9.965784
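Because the simulated flips differ from run to run, it may help to see rle on a short fixed sequence where the runs can be counted by eye. The vector z below is made-up data for illustration.

```r
z <- c(0, 0, 1, 1, 1, 0, 1, 1, 1, 1)
rz <- rle(z)
rz$lengths          # 2 3 1 4
rz$values           # 0 1 0 1

# longest run of heads (value 1)
max(rz$lengths[rz$values == 1])   # 4

# inverse.rle() reconstructs the original sequence from the encoding
identical(inverse.rle(rz), z)
```

The two components always have the same length, one entry per run, which is why the simulated example above reports 483 runs for 1000 flips.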

Lists are frequently used to return several results of a function in a single object. Several classical hypothesis tests that return class htest are a good example. See for example the help topic for t.test or chisq.test. Refer to the “Value” section of the documentation. The value returned is a list containing the test statistic, p-value, etc. The components of a list can be referenced by name using $ or by position using [[ ]].
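As a small sketch of this idea, the hypothetical function sample.summary below returns three results in one named list. It is not a function from R or from this book; the name and components are invented for illustration.

```r
# Hypothetical helper: summarize a numeric sample and return several
# results together as a named list.
sample.summary <- function(x) {
  list(n = length(x),
       mean = mean(x),
       se = sd(x) / sqrt(length(x)))
}

out <- sample.summary(iris$Sepal.Width)
names(out)
out$mean        # by name using $
out[["se"]]     # by name using [[ ]]
out[[1]]        # by position
```

Returning a list keeps related results together, and the caller can pick out only the components that are needed.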
Example 1.7 (Named list). The Wilcoxon rank sum test is implemented in the function wilcox.test. Here the test is applied to two normal samples with different means.
w <- wilcox.test(rnorm(10), rnorm(10, 2))
> w #print the summary
Wilcoxon rank sum test
data: rnorm(10) and rnorm(10, 2)
W = 2, p-value = 4.33e-05
alternative hypothesis: true location shift is not equal to 0
> w$statistic #stored in object w
W
2
> w$p.value
[1] 4.330035e-05
Try unlist(w) and unclass(w) to see more details. ⋄
Some examples of functions in this book that return a named list can be found in Examples 8.13, 12.12, and 14.7.
Example 1.8 (A list of names). Below we create a list to assign row and column names in a matrix. The first component for row names will be NULL in this case because we do not want to assign row names.
a <- matrix(runif(8), 4, 2) #a 4x2 matrix
dimnames(a) <- list(NULL, c("x", "y"))
Here is the 4 × 2 matrix with column names (type a to display it).
              x         y
[1,] 0.88009604 0.6583918
[2,] 0.32964955 0.1385332
[3,] 0.61625490 0.1378254
[4,] 0.08102034 0.1746324
# if we want row names
> dimnames(a) <- list(letters[1:4], c("x", "y"))
> a
           x         y
a 0.88009604 0.6583918
b 0.32964955 0.1385332
c 0.61625490 0.1378254
d 0.08102034 0.1746324
# another way to assign row names
> row.names(a) <- list("NE", "NW", "SW", "SE")
> a
            x         y
NE 0.88009604 0.6583918
NW 0.32964955 0.1385332
SW 0.61625490 0.1378254
SE 0.08102034 0.1746324
⋄
1.9 Formula Specification
Some functions in R take a formula object as an argument. Examples include the function to fit linear models (lm) and certain graphics functions like boxplot. For example, a formula for simple linear regression of response variable y on a single predictor x is specified by y ~ x, which represents the model y = β0 + β1x + ε. To specify the regression model y = β1x + ε, without an intercept term, the formula is y ~ 0 + x. For example, compare the following regression models for the rock data:
lm(rock$peri ~ rock$area)
lm(rock$peri ~ 0 + rock$area)
lm(rock$peri ~ 1 + rock$area)
The formula syntax can represent more complicated models with several terms, interactions, etc. The syntax is based on Wilkinson’s notation [319]. See Hastie [44, Section 2.2] for its implementation in S and R languages, or [198] for an online version of documentation for Wilkinson notation.
In Section 1.10, parallel boxplots are generated using the formula argument to boxplot. Several examples of formulas for linear models are in Section 8.5 and Chapter 9.
1.10 Graphics
The R graphics package contains most of the commonly used graphics functions. In this section, for reference, some of the graphics functions and options or parameters are listed. Examples of graphics and the R code used to produce them appear throughout the text. See Chang [47] and Murrell [213] for many more examples. Maindonald and Braun [189] and Venables and Ripley [293] also have many examples of graphics in R.
Table 1.4 lists some basic 2D graphics functions in R (graphics) and other packages. Several examples using the graphics functions in Table 1.4 are given throughout the text.
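As a brief illustration of the kind of base graphics calls collected in Table 1.4, the following sketch draws a histogram with a reference curve and a normal QQ plot with its reference line. The data z are simulated only for this illustration, and the seed is arbitrary.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
z <- rnorm(200)                  # simulated data, for illustration

hist(z, freq = FALSE, main = "Histogram with N(0,1) density")
curve(dnorm(x), add = TRUE, lty = 2)    # reference curve on the same plot

qqnorm(z)                        # normal QQ plot
qqline(z)                        # QQ normal reference line

# hist also returns an object describing the bins; plot = FALSE
# computes it without drawing anything
h <- hist(z, plot = FALSE)
class(h)
sum(h$counts)
```

Functions such as curve, abline, and qqline add to an existing plot, while hist, qqnorm, and plot start a new one.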
See Table 5.1 and the examples of Chapter 5 for more 2D graphics functions and some 3D visualization methods.
TABLE 1.4: Some Basic Graphics Functions in R (graphics) and Other Packages
Method                        in (graphics)   in (package)
Scatter plot                  plot
Add regression line to plot   abline
Add reference line to plot    abline
Reference curve               curve
Histogram                     hist            truehist (MASS)
Bar plot                      barplot
Plot empirical CDF            plot.ecdf
QQ Plot                       qqplot          qqmath (lattice)
Normal QQ plot                qqnorm
QQ normal ref. line           qqline
Box plot                      boxplot
Stem plot                     stem
R Note 1.5
What are the ggplot versions of the basic graphics listed in Table 1.4? Graphics in ggplot2 do not correspond to single purpose functions like hist or boxplot, so there is no single ggplot2 function that can be listed in Table 1.4 for individual plots. All of the ggplot2 graphics use the ggplot function to start a new plot. The type of plot and its appearance are determined by the plot aesthetics and elements called geoms, such as geom_point, geom_line, geom_boxplot, etc. See Section 1.11 for some simple examples using ggplot.
Example 1.9 (Parallel boxplots). The boxplot function can display a single boxplot or a group of parallel boxplots. Parallel boxplots are helpful for comparing the distribution of a continuous or quantitative variable by groups. The group variable should be a factor or a character vector.
Figure 1.2 displays parallel boxplots of the iris data sepal length measurements by the factor Species. The code to generate the plot uses the model formula argument corresponding to a one-way analysis of variance. The second line, which includes some optional arguments, corresponds to Figure 1.2.
boxplot(iris$Sepal.Length ~ iris$Species)
boxplot(iris$Sepal.Length ~ iris$Species, ylab = "Sepal Length", boxwex = .4)
See Example 1.13 for a similar plot using ggplot. ⋄
FIGURE 1.2: Parallel boxplots of iris sepal length.
Colors, plotting symbols, and line types
In most plotting functions, colors, symbols, and line types can be specified using col, pch, and lty. The size of a symbol is specified by cex. Available plotting characters are shown in the manual [294, Ch. 12], which includes this example for displaying plotting characters in a legend.
plot.new() #if a plot is not open
legend(locator(1), as.character(0:25), pch=0:25)
#then click to locate the legend
The example above can be used to display line types, by substituting lty for pch. The following produces a display of colors.
legend(locator(1), as.character(0:8), lwd=20, col=0:8)
Other colors and color palettes are available. For example,
plot.new()
palette(rainbow(15))
legend(locator(1), as.character(1:15), lwd=15, col=1:15)
puts a 15-color rainbow palette into effect and displays the colors. Use colors() to see the vector of named colors.
Most of the figures in this text have been drawn in black and white or grayscale. Where color palettes would normally be used, we have substituted a grayscale palette. In these cases, on screen it is better to substitute one of the pre-defined color palettes or a custom palette. To define a color palette, refer to ?palette, and to use a defined color palette, see the topic ?rainbow (the topics rainbow, heat.colors, topo.colors, and terrain.colors are documented on the same page).
Example 1.10 (Plotting characters and colors).
It is easy to display a table of plotting characters for reference.
plot(0:25, rep(1, 26), pch = 0:25)
text(0:25, 0.9, 0:25)
To display the symbols in color, insert col = 0:25. ⋄
A utility to display available colors in R is show.colors() in the DAAG package [189].
Setting the graphical parameter par(ask = TRUE) has the effect that the graphics device will wait for user input before displaying the next plot; e.g., the message “Waiting to confirm page change ...” appears, and in the IDE the user should click on the graphics window to display the next screen. To turn off this behavior, type par(ask = FALSE).
1.11 Introduction to ggplot
ggplot2 [313] is an R graphics package that is quite different from the standard graphics package that comes with the R distribution. The name refers to the “grammar of graphics” introduced by Leland Wilkinson. Spoken and written language has a grammar and syntax, and it is possible to view statistical graphics as having similar structure or grammar. What are the graphical counterparts of the building blocks of language (nouns, verbs, adjectives, etc.)? What is a graphic in this context? To learn ggplot2 requires a basic understanding of this grammar of graphics.
In general, a graphic is a mapping of data to some visual elements, which provides a visual summary of the data. One big difference between R graphics and ggplot2 is the way that the mapping is specified. R graphics define a dedicated function to create a particular mapping. For example, there is one function for a barplot, another for a boxplot and another for plotting curves.
The package ggplot2 takes a different approach by first identifying the data to plot, and a geometric object called a “geom” (what to draw). The aesthetics (aes) provide the mapping that connects the data to the visual objects. These ideas may be easier to understand with reference to an example. Install the package ggplot2 if it is not already installed.
The run length encoding data in Example 1.6 should be easy to summarize in a barplot. It is a good example to illustrate some fundamental differences in R graphics vs. ggplot.
Example 1.11 (Barplot for run lengths). For an R barplot, we only need to specify a count variable for the heights of the bars. With ggplot, we must have the variable in a data frame and for the graphical object geom_bar it must be a factor or a character, and it must be mapped using the aes aesthetics function.
barplot(table(r$lengths)) #R graphics version
## ggplot version
library(ggplot2)
df <- data.frame(lengths = factor(r$lengths))
ggplot(df, aes(lengths)) + geom_bar()
See Figures 1.3(a) (R graphics version) and 1.3(b) (ggplot version) for a comparison of the two basic barplots. ⋄
FIGURE 1.3: Barplots of count data using R graphics (a) and ggplot (b).
The following example will display several versions of a scatterplot of the iris sepal width and sepal length data.
Example 1.12. For a minimal example, enter the following line.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Here we have specified the data, a mapping for variables Sepal.Length and Sepal.Width to a visual object or geom. For a scatterplot we want geom_point.
However, all three species of iris appear with the same color and symbol in this version, so we cannot see the effect of species. Figure 1.4 (see color insert) was generated by the following:
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species, shape = Species)) + geom_point(size = 2)
In the second version, we added aesthetics to map the data to a color by the factor Species and different plotting symbols by Species. These aesthetics are arguments to aes. The size of the symbols was doubled with another aesthetic, size = 2 in geom_point. Notice that a legend was automatically added to the graph in this version. ⋄
FIGURE 1.4: Scatterplot of iris data using ggplot.
Properties of data are for example quantitative or qualitative (numeric, integer, factor, etc.). Properties of the visual objects that may appear on a graph are their type (points, lines, curves, polygons), appearance (color, size, symbol), and so on. There are far too many properties of visual objects to list them all, and it is always possible to invent new ones. These elements are part of the grammar of graphics.
The main function to create a new ggplot is ggplot (qplot is a shortcut that works for some simple plots but not in general). A ggplot is built up in layers, starting with a base layer. Once the base layer is defined, we can add layers and elements with a + operation. The main idea is easiest to understand with a few familiar examples.
Example 1.13 (ggplot: parallel boxplots and violin plots).
Example 1.9 uses the boxplot function to display parallel boxplots for the iris sepal length measurements by species. Parallel boxplots can easily be displayed using ggplot with the geom geom_boxplot. Parallel violin plots (geom_violin) are similar to parallel boxplots. A violin plot displays a density estimate reflected on both sides of an axis, giving it a sort of violin shape.
For vertical boxplots or violin plots:
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot()
ggplot(iris, aes(Species, Sepal.Length)) + geom_violin()
For horizontal plots, as shown in Figures 1.5(a) and 1.5(b), add coord_flip().
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + coord_flip()
ggplot(iris, aes(Species, Sepal.Length)) + geom_violin() + coord_flip()
⋄
FIGURE 1.5: Parallel boxplots and violin plots using ggplot.
ggplot Facets
Something that ggplot does very well is to construct arrays of plots. When our data set contains a qualitative variable of type factor, it is helpful to view relationships across levels of the factor. This is illustrated in Example 1.14.
Example 1.14 (MPG by engine displacement). This example uses the mpg data in ggplot2, which records fuel economy data from 1999 and 2008 for 38 models of cars. Use str(mpg) or read the help page to learn about the variables in mpg. Suppose that we want to plot highway mpg (hwy) as a function of engine size (displacement displ) for each class of vehicle, and display all of these plots in an array. In order to compare plots, the x and y axes of all plots should be identical. With facet_wrap, ggplot handles this detail automatically and also takes care of labeling the plots.
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~ class)
The symbol before class is a tilde (as used in other R formulas). The variable in facet_wrap should be a factor or character type. Here class is a character vector. The plot is shown in Figure 1.6. ⋄
FIGURE 1.6: Array of plots of highway mpg as a function of engine displacement, by class of vehicle.
More examples of plots using ggplot2 will be shown throughout the chapters of this book.
1.12 Workspace and Files
The workspace in R contains data and other objects. User-defined objects created in a session will persist until R is closed. If the workspace is saved before quitting R, the objects created during the session will be saved. It is not necessary to save the workspace for the examples and code here.
The ls command will display the names of objects in the current workspace. One or more objects can be removed from the workspace by the rm or remove command. For more information, consult the R documentation.
RStudio displays information about objects in the global environment in the ‘Environment’ tab of one of its four panes. This provides a more user-friendly interface to inspecting and possibly removing objects from the workspace. If the object has a magnifying glass icon or a spreadsheet-like icon at right, click on the icon to open a window with detailed information about the object.
In Grid view check boxes appear that make it easy to remove checked objects using the broom button. To remove all objects, click the broom while in List view or with no objects selected.
Note that saving objects in the workspace can lead to unexpected results and serious hidden programming errors. For example, in the following, suppose that the programmer intended to randomly generate the value of b, but accidentally omitted the code.
y <- runif(100, 0, b)
Now, if an object named b happens to be found in the workspace, and the value of b produces a valid expression in runif, no error will be reported. An error will occur, but the programmer will not realize that it has occurred.
It is recommended that the user occasionally check what is stored in the workspace, and remove unneeded objects. The entire list of objects returned by ls() can be removed (without warning!) by rm(list = ls()). However, whenever one is using RStudio to run R interactively, it is much easier to use the Environment tab described above.
In general, it is probably a bad practice to save functions in the workspace, because the user may forget that certain objects exist and these objects are either not documented at all or only through comments. It is a better idea to save functions in scripts and data in files. Collections of functions and data sets can also be organized and documented in packages. (See Sections 1.13 and 1.14 below.)
R Note 1.6
RStudio and the knitr package provide a great user interface for developing code that is seamlessly integrated within an R Markdown report.
Whenever knitr “knits” a document, it runs code within a clean environment so that even if objects exist in the global environment shown in RStudio, knitr is not affected by them. This is a good thing, but it sometimes causes confusion. Perhaps one has written R code which seems to work perfectly, but when it is part of a report it stops with an error. When code “works” at the command line (interactively) but refuses to “knit,” this is usually caused by objects in the global environment; it is a (hidden) programming error that knitr caught.
1.12.1 The Working Directory
Many scripts and data sets are provided, and many will be created by users. It is convenient to create a folder or directory with a short path name to store these files. In the examples, we assume that the files are located in /Rfiles, which will be created by the user. Any other name or path can be used.
Although it is not necessary to specify the working directory, sometimes it may be convenient to do so. A user can get or set the current working directory by the commands getwd and setwd. To set the working directory to “/Rfiles”, for example, the command is setwd("/Rfiles"). An easy way to change the working directory in RStudio is through the “Session” submenu “Set Working Directory.”
1.12.2 Reading Data from External Files
Often data to be analyzed is stored in external files. Typically, data is stored in plain text files, delimited by white space such as tabs or spaces, or by special characters such as commas.
Univariate data from an external file can be read into a vector by the scan command. If the file contains a data frame or a matrix, or is csv (comma separated values) format, use the read.table function. The read.table function has many options to support different file formats. The read.csv function has defaults convenient for reading csv format files.
Example 1.15 (Import data from a local text file). This is a simple example that applies to the data files in Hand et al. [134].
The data files can be downloaded from the publisher web page www.crcpress.com; search by title “Handbook of Small Data Sets” to locate “DOWNLOAD.zip”. Download and extract the files to your preferred location. The following line then reads the file “FOREARM.DAT” after it has been saved to your working directory.
forearm <- scan(file = "FOREARM.DAT") #a vector
If the file is not in your current working directory, or it is in a subdirectory, specify the path name. Suppose that your file is in the “DATASETS” subdirectory.
> forearm <- scan(file = "./DATASETS/FOREARM.DAT") #a vector
Read 140 items
> print(forearm)
[1] 17.3 18.4 20.9 16.8 18.7 20.5 17.9 20.4 18.3 …
Windows users should note the Unix-style forward slashes in the path name above. See the R for Windows FAQ [236].
For a short data file, we could print it to check the import. Most data files will be too long to read at the console, so the head function is an easy way to view the first few observations.
> head(forearm)
[1] 17.3 18.4 20.9 16.8 18.7 20.5
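For readers who do not have the data files at hand, the behavior of scan can be tried on a small temporary file. This is a sketch only: the file is created with tempfile, and its six values simply repeat the first few forearm observations printed above.

```r
# Write a few numbers to a temporary file (a stand-in for a real data file)
tmp <- tempfile(fileext = ".dat")
writeLines(c("17.3 18.4 20.9", "16.8 18.7 20.5"), tmp)

v <- scan(file = tmp)       # whitespace-delimited numbers into a vector
length(v)
head(v, 3)

unlink(tmp)                 # remove the temporary file
```

Line breaks in the file are treated like any other white space, so the two lines are read into a single vector.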

For text files of data with more than one variable, it is easier to use read.table to import the data. Read the help page for read.table to review possible format specifications, and set them to match the format of the data file (headings, separators, etc.). The file argument in read.table (or scan) could optionally be a connection, such as the URL of a web page.
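A minimal sketch of matching read.table arguments to a file's format, again using a temporary file. The file contents are made up for illustration: a header line, tab separators, and ? as the missing value code.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("mpg\tcyl\tname",
             "18\t8\tchevrolet chevelle malibu",
             "?\t4\ttoyota corona"), tmp)

d <- read.table(file = tmp, header = TRUE, sep = "\t",
                na.strings = "?", as.is = TRUE)
str(d)          # two numeric columns and one character column
is.na(d$mpg)    # the "?" entry was read as a missing value

unlink(tmp)     # remove the temporary file
```

Setting sep = "\t" matters here because the car names contain spaces; with the default white space separator each word would be treated as a separate field.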
Example 1.16 (Importing data from a web page). To read data directly from a web page, the URL can be specified as the file argument. View the data online, then match the read.table arguments to the format.
Here we will import the auto mpg data from the UCI Machine Learning Repository [216] at https://archive.ics.uci.edu/ml/index.php. The data is described at https://archive.ics.uci.edu/ml/datasets/auto+mpg including a link to the data folder. The file name is “auto-mpg.data”.
The file does not have a header (column names) or row names and it appears to be delimited by spaces or tabs. Missing values are coded ?, so we need to change the missing value symbol na.strings = "NA" to na.strings = "?". Also, we do not want the car name to be a factor, so we set as.is = TRUE to import it as a string.
fileloc <- "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
df <- read.table(file = fileloc, na.strings = "?", as.is = TRUE)
After importing data, it is a good practice to check that the result is as expected. Two helpful functions for this are str and head. The structure function str summarizes the data object, and head returns the first few observations:
> str(df)
'data.frame': 398 obs. of 9 variables:
 $ V1: num 18 15 18 16 17 15 14 14 14 15 ...
 $ V2: int 8 8 8 8 8 8 8 8 8 8 ...
 $ V3: num 307 350 318 304 302 429 454 440 455 390 ...
 $ V4: num 130 165 150 150 140 198 220 215 225 190 ...
 $ V5: num 3504 3693 3436 3433 3449 ...
 $ V6: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ V7: int 70 70 70 70 70 70 70 70 70 70 ...
 $ V8: int 1 1 1 1 1 1 1 1 1 1 ...
 $ V9: chr "chevrolet chevelle malibu" "buick skylark 320" ...
The structure function str tells us that this object is a data frame with 398 observations of 9 variables, the type of each variable, and shows the first few values for each variable.
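The same import pattern can be tried without a network connection. The following is a minimal sketch, assuming a small made-up file written to a temporary location; the rows and the names "model a", "model b", "model c" are hypothetical and only mimic the layout of the auto mpg file (whitespace-delimited, no header, ? for missing values, a quoted name in the last column).

```r
# Write a tiny whitespace-delimited file with no header (hypothetical rows).
tmp <- tempfile(fileext = ".data")
writeLines(c(
  '18.0 8 130.0 "model a"',
  '25.0 4 ?     "model b"',
  '21.0 6 100.0 "model c"'
), tmp)

# Same arguments as in the example: "?" marks a missing value,
# and as.is = TRUE keeps the quoted names as character strings.
toy <- read.table(file = tmp, na.strings = "?", as.is = TRUE)

str(toy)              # check the structure of the imported object
sum(is.na(toy$V3))    # count the missing values in the third column
```

Because the ? is declared as the missing value symbol, the third column is read as numeric with one NA rather than as character data.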
We assign variable names and print the summary:
names(df) <- c("mpg", "cyl", "displ", "hp", "wt", "accel", "year", "origin", "name")
summary(df)
      mpg             cyl            displ             hp
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   1st Qu.: 75.0
 Median :23.00   Median :4.000   Median :148.5   Median : 93.5
 Mean   :23.51   Mean   :5.455   Mean   :193.4   Mean   :104.5
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:126.0
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0
                                                 NA's   :6
 ...
The help topic for read.table also contains documentation for read.csv and read.delim, for reading comma-separated-values (.csv) files and text files with other delimiters. ⋄
R Note 1.7 In versions of R before 4.0.0, read.table converts character variables to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE. To prevent conversion of character data to factors, set as.is = TRUE (also see the colClasses argument of read.table).
One of the recommended R packages included with the distribution is the foreign package, which provides several utility functions for reading files in Minitab, S, SAS, SPSS, Stata, and other formats. For details type help(package = foreign).
1.12.3 Importing/Exporting .csv Files
Data is often supplied in comma-separated-values (.csv) format, which is a text file that separates data with special text characters called delimiters. Files in .csv format can be opened in most spreadsheet applications. Spreadsheet data should be saved in .csv format before importing into R. In a .csv file, the dates are likely to be given as strings, delimited by double quotation marks.
Example 1.17 (Importing/exporting .csv files). This example illustrates how to export the contents of a data frame to a .csv file, and how to import the data from a .csv file into an R data frame.
#create a data frame
dates <- c("3/27/1995", "4/3/1995", "4/10/1995", "4/18/1995")
prices <- c(11.1, 7.9, 1.9, 7.3)
d <- data.frame(dates=dates, prices=prices)
#create the .csv file
filename <- "temp.csv"
write.table(d, file = filename, sep = ",", row.names = FALSE)
The new file “temp.csv” can be opened in most spreadsheets. When displayed in a text editor (not a spreadsheet), the file “temp.csv” contains the following lines (without the leading spaces).
  "dates","prices"
  "3/27/1995",11.1
  "4/3/1995",7.9
  "4/10/1995",1.9
  "4/18/1995",7.3
Most .csv format files can be read using read.table. In addition there are functions read.csv and read.csv2 designed for .csv files.
#read the .csv file
read.table(file = filename, sep = ",", header = TRUE)
read.csv(file = filename) #same thing
      dates prices
1 3/27/1995   11.1
2  4/3/1995    7.9
3 4/10/1995    1.9
4 4/18/1995    7.3
See as.Date for converting the character representation of the dates to date objects. ⋄
1.13 Using Scripts
R scripts are plain text files containing R code. Once code is saved in a script, all of it can be submitted via the source command, or part of it can be executed by copy and paste (to the console). To save R commands in a file, prepare the file with a plain text editor and save with extension .R.
RStudio provides an integrated text editor. The File menu “New File” submenu opens a new script (or other types of files). If an R script is open in the editor, a “Run” menu and a “Source” button appear on the toolbar of the editor window.
The RStudio “Source” button, or the R source command, loads and executes the commands in the script.
It is not necessary to close the file, and in fact, it may be convenient to keep it open for editing. Save changes before sourcing the file. For example, if “/Rfiles/example.R” is a file containing R code, the command
source("/Rfiles/example.R")
will enter all lines of the file at the command prompt and execute the code. Windows users should use the unix-style forward slashes above or double backslashes like the command below.
source("\\Rfiles\\example.R")
The source command is useful when a script requires functions that are defined in another R script. Simply source that file before the functions are required.
Note that by default, evaluations of expressions are not printed at the console when a script is running. Use the print command within a script to display the value of an expression. Thus, in interactive mode, an expression and its value are both printed
> sqrt(pi)
[1] 1.772454
but from a script it is necessary to use print(sqrt(pi)).
Alternatively, set options in the source statement to control how much is printed. By setting echo=TRUE the statements and evaluation of expressions are echoed to the console. To see evaluation of expressions but not statements, leave echo=FALSE and set print.eval=TRUE. The examples are below.
source("/Rfiles/example.R", echo=TRUE)
source("/Rfiles/example.R", print.eval=TRUE)
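The effect of these options can be checked with a small experiment. The sketch below assumes a throwaway three-line script written to a temporary file (the script contents and the variable names are illustrative only) and uses capture.output to collect what each call to source prints.

```r
# A throwaway script: an assignment, a bare expression, and an explicit print.
script <- tempfile(fileext = ".R")
writeLines(c("x <- sqrt(pi)", "x", "print(x)"), script)

# Default: only the explicit print() call produces output.
out_default <- capture.output(source(script))

# print.eval = TRUE: the bare expression x is printed as well.
out_eval <- capture.output(source(script, print.eval = TRUE))

# echo = TRUE: the statements themselves are echoed along with the values.
out_echo <- capture.output(source(script, echo = TRUE))

length(out_default)   # number of lines printed by default
length(out_eval)      # number of lines printed with print.eval = TRUE
cat(out_echo, sep = "\n")
```

Comparing the three captured results shows why a value that appears at the console in interactive mode can silently disappear when the same line is run from a script.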
1.14 Using Packages
The R installation consists of the base and several recommended packages. Type library() to see a list of installed packages or click on the Packages tab in RStudio. A package must be installed and loaded to be available. Base packages are automatically loaded. Other packages can be installed and loaded as needed.
Several of the recommended packages are used in this text. Some contributed packages are also used. The R system provides an interface to install contributed packages from CRAN as needed (see install.packages). In RStudio, the Packages tab “Install” button provides a dialog to select the packages and install them. A frequent error is the ‘Object not found’ error, which can occur when a symbol is used from a package that is not available. If this error occurs, check spelling, then check that the package containing the object is loaded.
To load an installed package, use the library or require command. For example, to load the recommended package boot, type library(boot) at the command prompt. If the package is loaded, the help system for the package is also loaded. Typing the command help(package=boot) or clicking on the name of the package in RStudio’s Packages tab will bring up a window showing the contents of the package. Once the package is loaded, typing ?boot will bring up the help topic for the boot function in the boot package (if not loaded, use help(boot, package=boot)).
Another way to use an object from a package without loading it is by the double colon operator. For example, to use the truehist function in the MASS package an option is MASS::truehist.
A complete list of all available packages is provided on the CRAN website. This list is so large that it may be easier to search for packages using the CRAN Task Views, which organize packages according to broad topics or tasks.
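The difference between loading a package and reaching into it with the double colon operator can be seen in a short session. This is a minimal sketch, assuming only that MASS may or may not be installed; the variable names installed, f, and attached are illustrative.

```r
# Check whether MASS is installed, without attaching it to the search path.
installed <- requireNamespace("MASS", quietly = TRUE)

if (installed) {
  # The double colon operator retrieves one object from the package.
  f <- MASS::truehist
  print(is.function(f))

  # MASS appears on the search path only if library(MASS) was called earlier.
  attached <- "package:MASS" %in% search()
  print(attached)
} else {
  message("MASS is not installed; see install.packages")
}
```

Calling library(MASS) instead would attach the whole package, after which truehist could be used without the MASS:: prefix.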
1.15 Using R Markdown and knitr
R Markdown is a document format that can be used to dynamically generate reproducible reports that combine code, output and graphics with your report document in one step. The knitr package makes it very easy to work with R Markdown documents in the RStudio development environment. See the R Markdown website at https://rmarkdown.rstudio.com/. The Help menu in RStudio also contains a convenient cheatsheet and a reference guide for R Markdown. See also the code demos, tutorials and vignettes provided with the knitr package [324].
Try Exercise 1.8 and consider using R Markdown to create reports for other exercises in this book. A worked example is provided at the end of the book in 15.21.
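For readers who prefer to start from the console rather than the RStudio menus, the following sketch writes a very small R Markdown file and renders it to html. It is an illustration under stated assumptions, not the worked example from the end of the book: the report title and contents are made up, the file goes to a temporary location, and rendering is attempted only if the rmarkdown package and pandoc happen to be available.

```r
# Three backticks, built with strrep so the chunk delimiters can be written
# without nesting code fences inside this example.
fence <- strrep("`", 3)

# Write a minimal R Markdown document (hypothetical title and content).
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"A minimal report\"",
  "output: html_document",
  "---",
  "",
  "## Square root of pi",
  "",
  paste0(fence, "{r}"),
  "sqrt(pi)",
  fence,
  ""
), rmd)

# Render only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(rmd, quiet = TRUE)
  print(out)   # path of the generated .html file
}
```

In RStudio the same file could instead be opened in the editor and rendered with the Knit button.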
Exercises
1.1 Generate a random sample $x_1, \dots, x_{100}$ of data from the $t_4$ (df=4) distribution using the rt function. Use the MASS::truehist function to display a probability histogram of the sample.
1.2 Add the $t_4$ density curve (dt) to your histogram in Exercise 1.1 using the curve function with add=TRUE.
1.3 Add an estimated density curve to your histogram in Exercise 1.2 using density. For example,
lines(density(x), col=2)
will add the density estimate using the color red. Notice that the density estimate (density) is an approximation to the density of the sampled distribution (in this case the $t_4$ density). (Density estimation and the density function are covered in detail in Chapter 12.)
1.4 a. Write an R function f to implement the function
$$f(x) = \frac{x - a}{b}$$
that will transform an input vector x and return the result. The function should take three input arguments: x, a, b.
b. To transform x to the interval [0, 1] we subtract the minimum value and divide by the range:
y <- f(x, a = min(x), b = max(x) - min(x))
Generate a random sample of Normal(μ = 2, σ = 2) data using rnorm and use your function f to transform this sample to the interval [0, 1]. Print a summary of both the sample x and the transformed sample y to check the result.
1.5 Refer to Exercise 1.4. Suppose that we want to transform the x sample so that it has mean zero and standard deviation one (studentize the sample). That is, we want
$$z_i = \frac{x_i - \bar{x}}{s}, \quad i = 1, \dots, n,$$
where s is the standard deviation of the sample. Using your function f this is
z <- f(x, a = mean(x), b = sd(x))
Display a summary and histogram of the studentized sample z. It should be centered exactly at zero. Use sd(z) to check that the studentized sample has standard deviation exactly 1.0.
1.6 Using your function f of Exercise 1.4, center and scale your Normal(μ = 2, σ = 2) sample by subtracting the sample median and dividing by the sample interquartile range (IQR). Compare your results to Exercise 1.5.
1.7 (ggplot) Refer to Example 1.14 where we displayed an array of scatterplots using ggplot with facet_wrap. One of the variables in the mpg data is drv, a character vector indicating whether the vehicle is front-wheel drive, rear-wheel drive, or four-wheel drive. Add color = drv in aes:
aes(displ, hwy, color = drv)
and display the revised plot. Your scatterplots should now have the three levels of drv coded by color and the plot should have automatically generated a legend for drv color.
1.8 (RStudio and knitr) This exercise is intended to serve as an introduction to report writing with R Markdown. Install the knitr package if it is not installed. Create an html report using R Markdown and knitr in RStudio. The report should include the code and output of Examples 1.12 and 1.14 with appropriate headings and a brief explanation of each example.