An overview of the psych package
William Revelle
Department of Psychology
Northwestern University
January 7, 2017
Contents

0.1 Jump starting the psych package–a guide for the impatient

1 Overview of this and related documents

2 Getting started

3 Basic data analysis
3.1 Data input from the clipboard
3.2 Basic descriptive statistics
3.2.1 Outlier detection using outlier
3.2.2 Basic data cleaning using scrub
3.2.3 Recoding categorical variables into dummy coded variables
3.3 Simple descriptive graphics
3.3.1 Scatter Plot Matrices
3.3.2 Density or violin plots
3.3.3 Means and error bars
3.3.4 Error bars for tabular data
3.3.5 Two dimensional displays of means and errors
3.3.6 Back to back histograms
3.3.7 Correlational structure
3.3.8 Heatmap displays of correlational structure
3.4 Testing correlations
3.5 Polychoric, tetrachoric, polyserial, and biserial correlations
3.6 Multiple regression from data or correlation matrices
3.7 Mediation and Moderation analysis

4 Item and scale analysis
4.1 Dimension reduction through factor analysis and cluster analysis
4.1.1 Minimum Residual Factor Analysis
4.1.2 Principal Axis Factor Analysis
4.1.3 Weighted Least Squares Factor Analysis
4.1.4 Principal Components analysis (PCA)
4.1.5 Hierarchical and bi-factor solutions
4.1.6 Item Cluster Analysis: iclust
4.2 Confidence intervals using bootstrapping techniques
4.3 Comparing factor/component/cluster solutions
4.4 Determining the number of dimensions to extract
4.4.1 Very Simple Structure
4.4.2 Parallel Analysis
4.5 Factor extension
4.6 Exploratory Structural Equation Modeling (ESEM)

5 Classical Test Theory and Reliability
5.1 Reliability of a single scale
5.2 Using omega to find the reliability of a single scale
5.3 Estimating ωh using Confirmatory Factor Analysis
5.3.1 Other estimates of reliability
5.4 Reliability and correlations of multiple scales within an inventory
5.4.1 Scoring from raw data
5.4.2 Forming scales from a correlation matrix
5.5 Scoring Multiple Choice Items
5.6 Item analysis
5.6.1 Exploring the item structure of scales
5.6.2 Empirical scale construction

6 Item Response Theory analysis
6.1 Factor analysis and Item Response Theory
6.2 Speeding up analyses
6.3 IRT based scoring
6.3.1 1 versus 2 parameter IRT scoring

7 Multilevel modeling
7.1 Decomposing data into within and between level correlations using statsBy
7.2 Generating and displaying multilevel data
7.3 Factor analysis by groups

8 Set Correlation and Multiple Regression from the correlation matrix

9 Simulation functions

10 Graphical Displays

11 Converting output to APA style tables using LaTeX

12 Miscellaneous functions

13 Data sets

14 Development version and a users guide

15 Psychometric Theory

16 SessionInfo
0.1 Jump starting the psych package–a guide for the impatient
You have installed psych (section 2) and you want to use it without reading much more.
What should you do?
1. Activate the psych package:
library(psych)
2. Input your data (section 3.1). Go to your friendly text editor or data manipulation
program (e.g., Excel) and copy the data to the clipboard. Include a first line that has
the variable labels. Paste it into psych using the read.clipboard.tab command:
myData <- read.clipboard.tab()

3. Make sure that what you just read is right. Describe it (section 3.2) and perhaps
look at the first and last few lines:

describe(myData)
headTail(myData)

4. Look at the patterns in the data. If you have fewer than about 10 variables, look
at the SPLOM (Scatter Plot Matrix) of the data using pairs.panels (section 3.3.1).
Even better, use the outlier function to detect outliers.

pairs.panels(myData)
outlier(myData)

5. Note that you have some weird subjects, probably due to data entry errors. Either
edit the data by hand (use the edit command) or just scrub the data (section 3.2.2).

cleaned <- scrub(myData, max=9) #e.g., change anything greater than 9 to NA

6. Graph the data with error bars for each variable (section 3.3.3).

error.bars(myData)

7. Find the correlations of all of your data.

• Descriptively (just the values) (section 3.3.7)

r <- lowerCor(myData)

• Graphically (section 3.3.8)

corPlot(r)

• Inferentially (the values, the ns, and the p values) (section 3.4)

corr.test(myData)

8. Test for the number of factors in your data using parallel analysis (fa.parallel,
section 4.4.2) or Very Simple Structure (vss, section 4.4.1).

fa.parallel(myData)
vss(myData)

9. Factor analyze the data (see section 4.1) with a specified number of factors (the
default is 1); the default method is minimum residual, and the default rotation for more
than one factor is oblimin. There are many more possibilities (see sections 4.1.1-4.1.3).
Compare the solution to a hierarchical cluster analysis using the ICLUST algorithm
(Revelle, 1979) (see section 4.1.6). Also consider a hierarchical factor solution to find
coefficient ω (see section 4.1.5).

fa(myData)
iclust(myData)
omega(myData)

If you prefer to do a principal components analysis you may use the principal function.
The default is one component.

principal(myData)

10. Some people like to find coefficient α as an estimate of reliability. This may be done
for a single scale using the alpha function (see section 5.1). Perhaps more useful is the
ability to create several scales as unweighted averages of specified items using the
scoreItems function (see section 5.4) and to find various estimates of internal
consistency for these scales, find their intercorrelations, and find scores for all the
subjects.

alpha(myData) #score all of the items as part of one scale.
myKeys <- make.keys(nvar=20,list(first = c(1,-3,5,-7,8:10),second=c(2,4,-6,11:15,-16)))
my.scores <- scoreItems(myKeys,myData) #form several scales
my.scores #show the highlights of the results

At this point you have had a chance to see the highlights of the psych package and to
do some basic (and advanced) data analysis. You might find reading this entire vignette
helpful to get a broader understanding of what can be done in R using the psych package.
Remember that the help command (?) is available for every function. Try running the
examples for each help page.

1 Overview of this and related documents

The psych package (Revelle, 2015) has been developed at Northwestern University since
2005 to include functions most useful for personality, psychometric, and psychological
research. The package is also meant to supplement a text on psychometric theory (Revelle,
prep), a draft of which is available at http://personality-project.org/r/book/. Some of
the functions (e.g., read.clipboard, describe, pairs.panels, scatter.hist,
error.bars, multi.hist, bi.bars) are useful for basic data entry and descriptive
analyses.
Psychometric applications emphasize techniques for dimension reduction including factor
analysis, cluster analysis, and principal components analysis. The fa function includes
five methods of factor analysis (minimum residual, principal axis, weighted least squares,
generalized least squares, and maximum likelihood factor analysis). Principal Components
Analysis (PCA) is also available through the use of the principal function. Determining
the number of factors or components to extract may be done by using the Very Simple
Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation
(Velicer, 1976) (MAP), or parallel analysis (fa.parallel) criteria. These and several
other criteria are included in the nfactors function. Two parameter Item Response Theory
(IRT) models for dichotomous or polytomous items may be found by factoring tetrachoric
or polychoric correlation matrices and expressing the resulting parameters in terms of
location and discrimination using irt.fa. Bifactor and hierarchical factor structures may
be estimated by using Schmid Leiman transformations (Schmid and Leiman, 1957) (schmid)
to transform a hierarchical factor structure into a bifactor solution (Holzinger and
Swineford, 1937).

Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust)
function to determine the structure and to calculate reliability coefficients α
(Cronbach, 1951) (alpha, scoreItems, score.multiple.choice), β (Revelle, 1979; Revelle
and Zinbarg, 2009) (iclust), and McDonald's ωh and ωt (McDonald, 1999) (omega).
Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as
additional estimates (Revelle and Zinbarg, 2009), are in the guttman function. The six
measures of Intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss
(1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels,
correlation “heat maps” (corPlot), factor, cluster, and structural diagrams using
fa.diagram, iclust.diagram, structure.diagram and het.diagram, as well as item response
characteristics and item and test information characteristic curves using plot.irt and
plot.poly.

This vignette is meant to give an overview of the psych package. That is, it is meant to
give a summary of the main functions in the psych package with examples of how they are
used for data description, dimension reduction, and scale construction. The extended
user manual at psych_manual.pdf includes examples of graphic output and more extensive
demonstrations than are found in the help menus. (Also available at
http://personality-project.org/r/psych_manual.pdf). The vignette, psych for sem, at
psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of John
Fox (Fox et al., 2012). (The vignette is also available at
http://personality-project.org/r/book/psych_for_sem.pdf).

For a step by step tutorial in the use of the psych package and the base functions in R
for basic personality research, see the guide for using R for personality research at
http://personalitytheory.org/r/r.short.html. For an introduction to psychometric theory
with applications in R, see the draft chapters at
http://personality-project.org/r/book.

2 Getting started

Some of the functions described in this overview require other packages.
Particularly useful for rotating the results of factor analyses (from e.g., fa,
factor.minres, factor.pa, factor.wls, or principal) or hierarchical factor models (using
omega or schmid) is the GPArotation package. These and other useful packages may be
installed by first installing and then using the task views (ctv) package to install the
“Psychometrics” task view, but doing it this way is not necessary.

install.packages("ctv")
library(ctv)
install.views("Psychometrics")

The “Psychometrics” task view will install a large number of useful packages. To install
the bare minimum for the examples in this vignette, it is necessary to install just two
packages:

install.packages(c("GPArotation","mnormt"))

Because of the difficulty of installing the package Rgraphviz, alternative graphics have
been developed and are available as diagram functions. If Rgraphviz is available, some
functions will take advantage of it. An alternative is to use “dot” output of commands
for any external graphics package that uses the dot language.

3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive
statistics. Remember, to run any of the psych functions, it is necessary to make the
package active by using the library command:

> library(psych)
The other packages, once installed, will be called automatically by psych.
It is possible to automatically load psych and other functions by creating and then saving
a “.First” function: e.g.,
.First <- function(x) {library(psych)}

3.1 Data input from the clipboard

There are of course many ways to enter data into R. Reading from a local file using
read.table is perhaps the most preferred. However, many users will enter their data in a
text editor or spreadsheet program and then want to copy and paste into R. This may be
done by using read.table and specifying the input file as “clipboard” (PCs) or
“pipe(pbpaste)” (Macs). Alternatively, the read.clipboard set of functions are perhaps
more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited (e.g., copied directly from an
Excel file).

read.clipboard.lower for reading input of a lower triangular matrix with or without a
diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the
command

> my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some
values (e.g., NA or -999). If the data were entered in a spreadsheet and the missing
values were just empty cells, then the data should be read in as tab delimited or by
using the read.clipboard.tab function.

> my.data <- read.clipboard(sep="\t") #define the tab option, or
> my.tab.data <- read.clipboard.tab() #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format),
copy to the clipboard and then specify the width of each field (in the example below, the
first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and
the last 4 are 3 columns).

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

3.2 Basic descriptive statistics

Once the data are read in, then describe or describeBy will provide basic descriptive
statistics arranged in a data frame format. Consider the data set sat.act which includes
data from 700 web based participants on 3 demographic variables and 3 ability measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis
and standard errors for integer or real data. Non-numeric data, although the statistics
are meaningless, will be treated as if numeric (based upon the categorical coding of the
data), and will be flagged with an *.

describeBy reports descriptive statistics broken down by some categorizing variable
(e.g., gender, age, etc.).

> library(psych)
> data(sat.act)
> describe(sat.act) #basic descriptive statistics
vars n mean sd median trimmed mad min max range skew
gender 1 700 1.65 0.48 2 1.68 0.00 1 2 1 -0.61
education 2 700 3.16 1.43 3 3.31 1.48 0 5 5 -0.68
age 3 700 25.59 9.50 22 23.86 5.93 13 65 52 1.64
ACT 4 700 28.55 4.82 29 28.84 4.45 3 36 33 -0.66
SATV 5 700 612.23 112.90 620 619.45 118.61 200 800 600 -0.64
SATQ 6 687 610.22 115.64 620 617.25 118.61 200 800 600 -0.59
kurtosis se
gender -1.62 0.02
education -0.07 0.05
age 2.42 0.36
ACT 0.53 0.18
SATV 0.33 4.27
SATQ -0.02 4.41
These data may then be analyzed by groups defined in a logical statement or by some other
variable. E.g., break down the descriptive data for males or females. These descriptive
data can also be seen graphically using the error.bars.by function (Figure 5). By setting
skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.
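For example, a minimal sketch of grouping on a logical statement (the cutoff of age 25 is
arbitrary, chosen here only for illustration):

> describeBy(sat.act, sat.act$age > 25) #two groups: age at most 25 (FALSE) vs. over 25 (TRUE)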
> #basic descriptive statistics by a grouping variable.
> describeBy(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE)
$`1`
vars n mean sd se
gender 1 247 1.00 0.00 0.00
education 2 247 3.00 1.54 0.10
age 3 247 25.86 9.74 0.62
ACT 4 247 28.79 5.06 0.32
SATV 5 247 615.11 114.16 7.26
SATQ 6 245 635.87 116.02 7.41
$`2`
vars n mean sd se
gender 1 453 2.00 0.00 0.00
education 2 453 3.26 1.35 0.06
age 3 453 25.45 9.37 0.44
ACT 4 453 28.42 4.69 0.22
SATV 5 453 610.66 112.31 5.28
SATQ 6 442 596.00 113.07 5.38
attr(,"call")
by.data.frame(data = x, INDICES = group, FUN = describe, type = type,
skew = FALSE, ranges = FALSE)
The output from the describeBy function can be forced into a matrix form for easy analysis
by other programs. In addition, describeBy can group by several grouping variables at the
same time.
> sa.mat <- describeBy(sat.act,list(sat.act$gender,sat.act$education),
+ skew=FALSE,ranges=FALSE,mat=TRUE)
> headTail(sa.mat)
item group1 group2 vars n mean sd se
gender1 1 1 0 1 27 1 0 0
gender2 2 2 0 1 30 2 0 0
gender3 3 1 1 1 20 1 0 0
gender4 4 2 1 1 25 2 0 0
...
SATQ9 69 1 4 6 51 635.9 104.12 14.58
SATQ10 70 2 4 6 86 597.59 106.24 11.46
SATQ11 71 1 5 6 46 657.83 89.61 13.21
SATQ12 72 2 5 6 93 606.72 105.55 10.95
3.2.1 Outlier detection using outlier
One way to detect unusual data is to consider how far each data point is from the mul-
tivariate centroid of the data. That is, find the squared Mahalanobis distance for each
data point and then compare these to the expected values of χ2. This produces a Q-Q
(quantile-quantile) plot with the n most extreme data points labeled (Figure 1). The outlier
values are in the vector d2.
> png('outlier.png')
> d2 <- outlier(sat.act,cex=.8)
> dev.off()
null device
1
Figure 1: Using the outlier function to graphically show outliers. The y axis is the
Mahalanobis D2, the X axis is the distribution of χ2 for the same number of degrees of
freedom. The outliers detected here may be shown graphically using pairs.panels (see
Figure 2), and may be found by sorting d2.
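A minimal sketch of finding those extreme cases by sorting d2 (showing the five largest
distances is an arbitrary choice):

> worst <- order(d2, decreasing=TRUE)[1:5] #row indices of the 5 most extreme subjects
> sat.act[worst, ] #inspect their raw data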
3.2.2 Basic data cleaning using scrub
If, after describing the data it is apparent that there were data entry errors that need to
be globally replaced with NA, or only certain ranges of data will be analyzed, the data can
be “cleaned” using the scrub function.
Consider a data set of 12 rows of 10 columns with values from 1 – 120. All values of columns
3 – 5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of the three
columns will be replaced with NA. In addition, any value exactly equal to 45 will be set
to NA. (max and isvalue are set to one value here, but they could be a different value for
every column).
> x <- matrix(1:120,ncol=10,byrow=TRUE)
> colnames(x) <- paste('V',1:10,sep='')
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45,newvalue=NA)
> new.x
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
[1,] 1 2 NA NA NA 6 7 8 9 10
[2,] 11 12 NA NA NA 16 17 18 19 20
[3,] 21 22 NA NA NA 26 27 28 29 30
[4,] 31 32 33 NA NA 36 37 38 39 40
[5,] 41 42 43 44 NA 46 47 48 49 50
[6,] 51 52 53 54 55 56 57 58 59 60
[7,] 61 62 63 64 65 66 67 68 69 70
[8,] 71 72 NA NA NA 76 77 78 79 80
[9,] 81 82 NA NA NA 86 87 88 89 90
[10,] 91 92 NA NA NA 96 97 98 99 100
[11,] 101 102 NA NA NA 106 107 108 109 110
[12,] 111 112 NA NA NA 116 117 118 119 120
Note that the number of subjects for those columns has decreased, the minimums have gone
up, and the maximums have come down. Data cleaning and examination for outliers should be a
routine part of any data analysis.
3.2.3 Recoding categorical variables into dummy coded variables
Sometimes categorical variables (e.g., college major, occupation, ethnicity) are to be ana-
lyzed using correlation or regression. To do this, one can form “dummy codes” which are
merely binary variables for each category. This may be done using dummy.code. Subsequent
analyses using these dummy coded variables may use biserial or point biserial correlations
(regular Pearson r) to show effect sizes, and may be plotted in, e.g., spider plots.
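As a minimal sketch (the major vector here is hypothetical, invented for illustration),
dummy.code takes a categorical vector and returns a matrix with one 0/1 column per
category:

> major <- c("psych","bio","psych","chem","bio","psych") #hypothetical categorical variable
> dummy.code(major) #one binary column for each of the three majors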
3.3 Simple descriptive graphics
Graphic descriptions of data are very helpful both for understanding the data as well as
communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels
12
function are useful ways to look for strange effects involving outliers and non-linearities.
error.bars.by will show group means with 95% confidence boundaries. By default,
error.bars.by and error.bars will show “cats eyes” to graphically show the confidence
limits (Figure 5). This may be turned off by specifying eyes=FALSE. densityBy or
violinBy may be used to show the distribution of the data in “violin” plots (Figure 4).
(These are sometimes called “lava-lamp” plots.)
3.3.1 Scatter Plot Matrices
Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels
function, adapted from the help menu for the pairs function, produces xy scatter plots of
each pair of variables below the diagonal, shows the histogram of each variable on the
diagonal, and shows the lowess locally fit regression line as well. An ellipse around the
mean with the axis length reflecting one standard deviation of the x and y variables is also
drawn. The x axis in each scatter plot represents the column variable, the y axis the row
variable (Figure 2). When plotting many subjects, it is both faster and cleaner to set the
plot character (pch) to be ’.’. (See Figure 2 for an example.)
pairs.panels will show the pairwise scatter plots of all the variables as well as his-
tograms, locally smoothed regressions, and the Pearson correlation. When plotting
many data points (as in the case of the sat.act data), it is possible to specify that the
plot character is a period to get a somewhat cleaner graphic. However, in this figure,
to show the outliers, we use colors and a larger plot character.
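For example, a minimal sketch of the faster, cleaner form suggested above for a large
data set:

> pairs.panels(sat.act, pch='.') #plot each point as a period for a cleaner SPLOM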
Another example of pairs.panels is to show differences between experimental groups.
Consider the data in the affect data set. The scores reflect post test scores on positive
and negative affect and energetic and tense arousal. The colors show the results for four
movie conditions: a depressing movie, a frightening movie, a neutral film, and a comedy.
3.3.2 Density or violin plots
Graphical presentation of data may be shown using box plots to show the median and 25th
and 75th percentiles. A powerful alternative is to show the density distribution using the
violinBy function (Figure 4).
> png('pairspanels.png')
> sat.d2 <- data.frame(sat.act,d2) #combine the d2 statistics from before with the sat.act data.frame
> pairs.panels(sat.d2,bg=c("yellow","blue")[(d2 > 25)+1],pch=21)
> dev.off()
null device
1
Figure 2: Using the pairs.panels function to graphically show relationships. The x axis
in each scatter plot represents the column variable, the y axis the row variable. Note the
extreme outlier for the ACT. If the plot character were set to a period (pch='.') it would
make a cleaner graphic, but in order to show the outliers in color we use plot characters
21 and 22.
> png('affect.png')
> pairs.panels(affect[14:17],bg=c("red","black","white","blue")[affect$Film],pch=21,
+ main="Affect varies by movies")
> dev.off()
null device
1
Figure 3: Using the pairs.panels function to graphically show relationships. The x axis in
each scatter plot represents the column variable, the y axis the row variable. The colors
represent the four different movie conditions.
> data(sat.act)
> violinBy(sat.act[5:6],sat.act$gender,grp.name=c("M","F"),main="Density Plot by gender for SAT V and Q")
Figure 4: Using the violinBy function to show the distribution of SAT V and Q for males
and females. The plot shows the medians, and 25th and 75th percentiles, as well as the
entire range and the density distribution.
3.3.3 Means and error bars
Additional descriptive graphics include the ability to draw error bars on sets of data, as
well as to draw error bars in both the x and y directions for paired data. These are the
functions error.bars, error.bars.by, error.bars.tab, and error.crosses.
error.bars shows the 95% confidence intervals for each variable in a data frame or
matrix. These errors are based upon normal theory and the standard errors of the mean.
Alternative options include +/- one standard deviation or 1 standard error. If the
data are repeated measures, the error bars will reflect the between variable
correlations. By default, the confidence intervals are displayed using a “cats eyes” plot
which emphasizes the distribution of confidence within the confidence interval.
error.bars.by does the same, but grouping the data by some condition.
error.bars.tab draws bar graphs from tabular data with error bars based upon the
standard error of proportion (σp = √(pq/N)).

error.crosses draws the confidence intervals for an x set and a y set of the same size.
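A minimal sketch of error.crosses, assuming (following its help page) that x and y are
rows of describe output; the split of the sat.act variables into two sets of three is
arbitrary, chosen only for illustration:

> desc <- describe(sat.act) #descriptive statistics for all six variables
> error.crosses(desc[1:3,], desc[4:6,]) #x-y confidence crosses for paired sets of variables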
The use of the error.bars.by function allows for graphic comparisons of different groups
(see Figure 5). Five personality measures are shown as a function of high versus low scores
on a “lie” scale. People with higher lie scores tend to report being more agreeable,
conscientious, and less neurotic than people with lower lie scores. The error bars are based
upon normal theory and thus are symmetric rather than reflecting any skewing in the data.
Although not recommended, it is possible to use the error.bars function to draw bar
graphs with associated error bars. This kind of dynamite plot (Figure 6) can be very
misleading in that the scale is arbitrary; see a discussion of the problems in presenting
data this way at http://emdbolker.wikidot.com/blog:dynamite. In the example shown,
note that the graph starts at 0, although 0 is outside the range of the data. This is a
consequence of using bars, which are always assumed to start at zero. Consider other ways
of showing your data.
3.3.4 Error bars for tabular data
However, it is sometimes useful to show error bars for tabular data, either found by the
table function or just directly input. These may be found using the error.bars.tab
function.
> data(epi.bfi)
> error.bars.by(epi.bfi[,6:10],epi.bfi$epilie<4)

Figure 5: Using the error.bars.by function shows that self reported personality scales on
the Big Five Inventory vary as a function of the Lie scale on the EPI. The “cats eyes”
show the distribution of the confidence.

> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,
+ labels=c("Male","Female"),ylab="SAT score",xlab="")
Figure 6: A “Dynamite plot” of SAT scores as a function of gender is one way of misleading
the reader. By using a bar graph, the range of scores is ignored. Bar graphs start from 0.
> T <- with(sat.act,table(gender,education))
> rownames(T) <- c("M","F")
> error.bars.tab(T,way="both",ylab="Proportion of Education Level",xlab="Level of Education",
+ main="Proportion of sample by education level")
Figure 7: The proportion of each education level that is Male or Female. By using the
way="both" option, the percentages and errors are based upon the grand total.
Alternatively, way="columns" finds column wise percentages, way="rows" finds row wise
percentages. The data can be converted