Introduction to Genomic Selection in R using the rrBLUP Package
Amy Jacobson
jaco0795@umn.edu
University of Minnesota
Learning Objectives
Download the package and load the sample files
Impute missing markers using A.mat()
Define the training and validation populations
Run mixed.solve() and determine accuracy of
predictions
Overview of rrBLUP package
Download from CRAN (version 4)
Must use R version 2.14.1 or greater
Uses ridge regression BLUP for genomic
predictions
Predicts marker effects through mixed.solve()
A.mat() command can be used to impute missing
markers
mixed.solve() does not allow NA marker values
Define the training and validation populations
One Step vs. Two Step
One step
Uses a mixed model analysis for the plot data
Two step
Adjusted means are calculated across locations
Means are then used in ridge regression BLUP
This webinar uses a two step approach
Computationally more efficient and faster
Install the rrBLUP Package
Launch R->Packages->Install Package
Select CRAN Mirror nearest you
Install the rrBLUP Package
Select the rrBLUP package
Install the rrBLUP Package
Install the package from a zip file
http://cran.r-project.org/web/packages/rrBLUP/index.html
Packages->Install package from local zip files
Install the rrBLUP Package
Select the package from
saved location
Install the rrBLUP Package
Now that the package is installed, the library must
be loaded every time R is opened
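The menu steps above can also be done from the R console. This is a minimal sketch; the CRAN mirror URL is an assumption, and any mirror can be substituted.

```r
# Install rrBLUP from CRAN only if it is not already present
# (needs an internet connection the first time).
# Assumption: the cloud.r-project.org mirror; any CRAN mirror works.
if (!requireNamespace("rrBLUP", quietly = TRUE)) {
  install.packages("rrBLUP", repos = "https://cloud.r-project.org")
}

# Load the package; repeat this line in every new R session.
library(rrBLUP)
```

Only the library() line needs to be rerun in later sessions.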
Sample Files
Files downloaded from the Hordeum Toolbox
http://hordeumtoolbox.org/
University of Minnesota barley breeding program
preliminary yield trial, St. Paul location, 2009
Phenotypic traits: yield, plant height, and heading
date
1178 markers, 164 NA marker values
1 = homozygous for parent 1, 0 = heterozygous,
and -1 = homozygous for parent 2
Markers must be in the {-1,0,1} format for rrBLUP
Load the Sample Files
setwd(): set the working directory to the
location of the sample files
read.table() command used for .txt files
read.csv() command used for .csv files
header=F since sample marker file does not
have a header with marker names
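A minimal sketch of reading a marker file without a header row. The file name and the tiny stand-in file are assumptions made so the example runs on its own; with the real sample files, call setwd() first and use the actual file name.

```r
# Hypothetical file name; substitute the name of your own marker file.
# With real data, first do something like: setwd("path/to/sample/files")
marker_file <- file.path(tempdir(), "snp_example.txt")

# Write a tiny stand-in marker file (3 lines x 4 markers, no header row).
writeLines(c("1 0 -1 1",
             "-1 NA 1 0",
             "1 1 -1 -1"), marker_file)

# header = FALSE because the file has no row of marker names.
# as.matrix() converts the data frame to the matrix format rrBLUP expects.
Markers <- as.matrix(read.table(marker_file, header = FALSE))

Markers
```

For a .csv file, read.csv(marker_file, header = FALSE) plays the same role.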
Load the Sample Files
head() command used to see the first six lines of a file (the default)
Useful to see if data was loaded correctly
Load the Sample Files
Load the phenotype file and use the head() command to
see the first six lines
header=T since phenotype files have column names
Markers and phenotypes must be in matrix format
Load the Sample Files
Determine the size of the matrices
dim() command gives the number of rows and
columns
96 observations and 1178 markers, 3 traits
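A minimal sketch of loading a phenotype file with column names and checking its size. The file name, column names, and values are made up for illustration; the real sample file has 96 rows and 3 trait columns.

```r
# Hypothetical file name and made-up values, used only for illustration.
pheno_file <- file.path(tempdir(), "pheno_example.txt")
writeLines(c("yield height heading",
             "5200 78 172",
             "4900 81 174",
             "5100 76 171",
             "4700 83 176",
             "5300 79 173",
             "5000 80 175",
             "4800 77 172",
             "5150 82 174"), pheno_file)

# header = TRUE because the first row holds the trait names.
Pheno <- as.matrix(read.table(pheno_file, header = TRUE))

head(Pheno)  # first six rows by default
dim(Pheno)   # number of rows (lines) and columns (traits)
```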
Impute Missing Markers
rrBLUP mixed.solve() does not allow for missing
markers
Imputed value is the population mean for that
marker
Useful for SNP data since level of missing data is
low
In the sample files 164 marker values are missing out
of 113,088 (96 lines x 1178 markers), about 0.145%
A.mat also calculates the additive relationship
matrix
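To make the idea of mean imputation concrete, here is a small base R sketch on a made-up marker matrix. It illustrates the concept only and is not the internal code of A.mat().

```r
# Made-up 4 x 4 marker matrix coded {-1, 0, 1} with three missing values.
Markers <- matrix(c( 1, -1, NA,  1,
                     0, NA,  1, -1,
                    -1, -1,  1,  1,
                     1,  0, NA, -1), nrow = 4, byrow = TRUE)

# Proportion of missing data per marker (column) and overall.
frac_missing    <- colMeans(is.na(Markers))
overall_missing <- mean(is.na(Markers))

# Replace each missing value with the mean of the observed values
# for that marker.
imputed <- Markers
for (j in seq_len(ncol(imputed))) {
  gaps <- is.na(imputed[, j])
  imputed[gaps, j] <- mean(imputed[, j], na.rm = TRUE)
}

frac_missing
imputed
```

Imputed values are generally not whole numbers, because they are averages of the -1, 0, and 1 codes.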
Impute Missing Markers
max.missing: maximum proportion of missing data
If more than 50% of a marker's data are missing, that
marker is not imputed
impute.method: "mean" imputes the mean of the marker
return.imputed: returns the imputed marker matrix if set
to TRUE
Impute Missing Markers
> impute=A.mat(Markers,max.missing=0.5,impute.method="mean",return.imputed=T)
> Markers_impute=impute$imputed
Rename imputed marker matrix as
Markers_impute
impute$imputed-returns the imputed marker
matrix
impute$A-returns the additive relationship
matrix
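A runnable sketch of the same A.mat() call on simulated markers. The data are random and the sizes are arbitrary assumptions; only the call itself mirrors the slide.

```r
library(rrBLUP)

set.seed(1)

# Simulated marker matrix: 20 lines x 50 markers coded {-1, 0, 1}.
Markers <- matrix(sample(c(-1, 0, 1), 20 * 50, replace = TRUE), nrow = 20)

# Introduce some scattered missing calls, plus one marker (column 7)
# with 60% missing data, which exceeds max.missing = 0.5.
Markers[sample(length(Markers), 15)] <- NA
Markers[1:12, 7] <- NA

impute <- A.mat(Markers, max.missing = 0.5, impute.method = "mean",
                return.imputed = TRUE)

Markers_impute <- impute$imputed  # imputed marker matrix
A <- impute$A                     # additive relationship matrix

dim(A)               # one row and column per line
dim(Markers_impute)
```

How markers above the max.missing threshold appear in the imputed matrix may depend on the installed rrBLUP version, so it is worth inspecting dim(Markers_impute) and checking for remaining NA values before going further.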
Impute Missing Markers
> impute$imputed
[Screenshot of the imputed marker matrix. Callouts: imputed marker value; marker value left NA if the marker has more than 50% missing data]
Impute Missing Markers
Remove markers that had more than 50% missing
data
NA values are not allowed in mixed.solve()
Two markers in the SNP file must be removed
Columns 169 and 562
New dimensions show two fewer columns
Use Markers_impute2 as marker matrix for estimating
marker effects
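Rather than typing the column numbers by hand, the columns that still contain NA can be located programmatically. This is a sketch on a made-up matrix; with the sample files the same lines would be applied to the real Markers_impute.

```r
# Made-up imputed matrix in which columns 2 and 4 still contain NA.
Markers_impute <- cbind(c( 1,  0, -1),
                        c(NA, NA,  1),
                        c( 1,  1, -1),
                        c(NA,  0, NA),
                        c(-1, 0.5, 1))

# Columns that still hold at least one NA.
na_cols <- which(colSums(is.na(Markers_impute)) > 0)
na_cols

# Keep only the complete columns; drop = FALSE keeps the matrix format
# even if a single column remains.
keep <- colSums(is.na(Markers_impute)) == 0
Markers_impute2 <- Markers_impute[, keep, drop = FALSE]

dim(Markers_impute2)
```

Indexing with a logical vector avoids the pitfall of negative indexing with an empty set, which would drop every column when no marker contains NA.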
Training and Validation Populations
Training population: genotyped and phenotyped
Validation population: phenotype values estimated
based on marker effects calculated from training
population
The code assigns 60% of the total population to the
training population
and the remaining 40% to the validation population
Training and Validation Populations
58 (60% of total population of 96) random
numbers sampled to determine which individuals
are in the training population
Individuals are the row numbers for the
phenotypes and marker matrices
Sampled numbers will be different every time the
code is run and will affect the correlation accuracy
Training and Validation Populations
Validation population is 40% of the total
population
setdiff() command determines the numbers that
are not in the training population and will be part
of the validation population
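The sampling step can be sketched as follows. The fixed seed is an assumption added here so the split can be reproduced; without it the sampled numbers change on every run, as noted above.

```r
n_total <- 96                      # individuals in the sample files
n_train <- round(0.6 * n_total)    # 60% of 96 is 57.6, rounded to 58

# Assumption: a fixed seed, only so this example is reproducible.
set.seed(42)

# Row numbers of the individuals assigned to the training population.
train <- sample(seq_len(n_total), n_train)

# All remaining row numbers form the validation population.
valid <- setdiff(seq_len(n_total), train)

length(train)
length(valid)
```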
Training and Validation Populations
Pheno_train and m_train are the phenotype and
marker matrices for the values in the training
population
Pheno_valid and m_valid are the corresponding matrices
for the validation population
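A sketch of how the four matrices could be built by row indexing. The phenotype and marker matrices are simulated stand-ins (only 20 markers, made-up trait names); the object names follow the slides.

```r
set.seed(42)
n_total <- 96

# Simulated stand-ins for the real phenotype and marker matrices.
Pheno <- matrix(rnorm(n_total * 3), nrow = n_total,
                dimnames = list(NULL, c("yield", "height", "heading")))
Markers_impute2 <- matrix(sample(c(-1, 0, 1), n_total * 20, replace = TRUE),
                          nrow = n_total)

# Row numbers for each population.
train <- sample(seq_len(n_total), 58)
valid <- setdiff(seq_len(n_total), train)

# Subset rows; the same row numbers are used for phenotypes and markers
# so that each individual's records stay aligned.
Pheno_train <- Pheno[train, ]
m_train     <- Markers_impute2[train, ]
Pheno_valid <- Pheno[valid, ]
m_valid     <- Markers_impute2[valid, ]

dim(m_train)
dim(m_valid)
```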
Run mixed.solve
Y = μ + Xg + e
Y: N x 1 vector of phenotypic means (Pheno_train)
μ: overall mean of the training set (returned by mixed.solve as $beta)
X: N x Nm marker matrix (m_train)
g: Nm x 1 vector of marker effects (calculated in mixed.solve as $u)
e: N x 1 vector of residual effects
Run mixed.solve
Arguments passed to mixed.solve():
y: vector of observations (yield is the first column of the
Pheno_train matrix)
Z: design matrix of random effects (markers)
K: not supplied, so the K matrix is the identity matrix
SE=FALSE: standard errors are not calculated
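A sketch of the mixed.solve() call on simulated training data. The marker matrix, the "true" effects, and the yield values are all simulated assumptions; only the argument pattern follows the slide.

```r
library(rrBLUP)

set.seed(7)
n <- 58           # training individuals
n_markers <- 100  # assumption: fewer markers than the real data set

# Simulated training markers and yield values.
m_train <- matrix(sample(c(-1, 0, 1), n * n_markers, replace = TRUE),
                  nrow = n)
true_effects <- rnorm(n_markers, sd = 0.5)
yield <- as.vector(100 + m_train %*% true_effects + rnorm(n, sd = 2))

# y = observations, Z = marker matrix, K = NULL (identity matrix),
# SE = FALSE (no standard errors).
yield_answer <- mixed.solve(yield, Z = m_train, K = NULL, SE = FALSE,
                            return.Hinv = FALSE)

e    <- yield_answer$u     # estimated marker effects
beta <- yield_answer$beta  # estimated overall mean

head(e)
beta
```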
Run mixed.solve
Yield_answer$u is the output of the marker effects, stored here as the
R object e (not the residual term of the model)
head(e) shows the marker effects for the first six
markers
Run mixed.solve
m_valid %*% e = marker validation matrix times the
marker effects (matrix multiplication)
Pred_yield = predicted yield based on the marker
effects of the training population with the grand
mean added in
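The prediction step is a matrix product followed by adding the mean. The numbers below are made up so the arithmetic can be checked by hand.

```r
# Made-up validation markers: 2 individuals x 3 markers.
m_valid <- matrix(c( 1, 0, -1,
                    -1, 1,  1), nrow = 2, byrow = TRUE)

e    <- c(0.5, -0.2, 0.1)  # made-up marker effects
beta <- 100                # made-up overall mean

# %*% is matrix multiplication; a plain * would multiply element by element.
pred_yield_valid <- m_valid %*% e

# Add the overall mean to get predictions on the phenotype scale.
pred_yield <- pred_yield_valid[, 1] + beta
pred_yield
```

For the first individual the sum of marker effects is 0.5 + 0 - 0.1 = 0.4, and for the second it is -0.5 - 0.2 + 0.1 = -0.6, so the predictions are 100.4 and 99.4.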
Determine Correlation Accuracy
Correlation between the predicted yield values
and the observed yield values
Accuracy will change slightly each time due to
different individuals sampled for the training and
validation populations
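The accuracy is the Pearson correlation between predicted and observed values, computed with cor(). The five values per vector below are made up for illustration.

```r
# Made-up predicted and observed yields for five validation individuals.
pred_yield  <- c(100.4,  99.4, 101.2, 98.7, 100.9)
yield_valid <- c(101.0,  98.9, 100.5, 99.5, 101.3)

# use = "complete.obs" skips any pair in which either value is NA.
YLD_accuracy <- cor(pred_yield, yield_valid, use = "complete.obs")
YLD_accuracy
```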
Determine Correlation Accuracy
Plant Height
Determine Correlation Accuracy
Heading Date
Determine Correlation Accuracy
Correlation accuracy with 500 iterations
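Repeating the split, fit, and correlation in a loop gives an average accuracy that depends less on one random sample. This sketch uses simulated data and only 20 cycles to keep it quick; both are assumptions, and the slide uses 500 iterations on the real data.

```r
library(rrBLUP)

set.seed(123)
n_total   <- 96
n_markers <- 200   # assumption: fewer markers than the real data set

# Simulated markers and a simulated trait.
Markers <- matrix(sample(c(-1, 0, 1), n_total * n_markers, replace = TRUE),
                  nrow = n_total)
yield <- as.vector(Markers %*% rnorm(n_markers, sd = 0.3) + rnorm(n_total))

cycles <- 20       # the slide uses 500
accuracy <- numeric(cycles)

for (r in seq_len(cycles)) {
  # New random training and validation sets in every cycle.
  train <- sample(seq_len(n_total), 58)
  valid <- setdiff(seq_len(n_total), train)

  # Estimate marker effects from the training set only.
  fit <- mixed.solve(yield[train], Z = Markers[train, ], K = NULL, SE = FALSE)

  # Predict the validation set and correlate with the observed values.
  pred <- as.vector(Markers[valid, ] %*% fit$u) + as.vector(fit$beta)
  accuracy[r] <- cor(pred, yield[valid])
}

mean(accuracy)
```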
Determine Correlation Accuracy
Correlation accuracy is different for each trait
Values will be different every time it is run since
different lines will be included in the training or
validation sets
Accuracy is affected by training size, validation
size, number of markers and heritability
Determine Correlation Accuracy
Effects of training population size on accuracy
[Figure: correlation accuracy (y-axis, 0 to 0.35) plotted against percent of individuals in the training population (x-axis, 0 to 100) for yield (YLD)]
Common Errors
Headers incorrectly input
Common Errors
NA Markers
Common Errors
Incorrect matrix dimensions
Removed one individual from phenotype matrix
Common Errors
Read in values as characters instead of numeric
Quotes around values
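Several of these errors can be caught before calling mixed.solve(). The helper below is a hypothetical sketch, not part of rrBLUP; it checks for non-numeric values, remaining NA markers, and mismatched row counts.

```r
# Hypothetical helper (not an rrBLUP function): returns a character vector
# describing any problems found, or an empty vector if none are found.
check_inputs <- function(Pheno, Markers) {
  problems <- character(0)
  if (!is.numeric(Markers)) {
    problems <- c(problems, "marker values are not numeric")
  }
  if (!is.numeric(Pheno)) {
    problems <- c(problems, "phenotype values are not numeric")
  }
  if (anyNA(Markers)) {
    problems <- c(problems, "marker matrix contains NA values")
  }
  if (nrow(Pheno) != nrow(Markers)) {
    problems <- c(problems, "phenotype and marker row counts differ")
  }
  problems
}

# Clean inputs: 3 individuals, 2 markers, 1 trait.
good_markers <- matrix(c(1, 0, -1, 1, 1, -1), nrow = 3)
good_pheno   <- matrix(c(5.1, 4.8, 5.3), nrow = 3)
check_inputs(good_pheno, good_markers)

# Markers read in as quoted text, and one individual missing from them.
bad_markers <- matrix(c("1", "0", "-1", "1"), nrow = 2)
check_inputs(good_pheno, bad_markers)
```

Running such a check right after loading the files separates data problems from modelling problems.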
Resources
rrBLUP reference manual
http://cran.r-project.org/web/packages/rrBLUP/rrBLUP.pdf
rrBLUP vignettes
http://cran.r-project.org/web/packages/rrBLUP/vignettes/vignette.pdf
Endelman, J.B. 2011. Ridge regression and other
kernels for genomic selection with R package
rrBLUP. Plant Genome 4:250-255.
doi:10.3835/plantgenome2011.08.0024
Acknowledgements
Rex Bernardo
Emily Combs
Lian Lian
Chris Schaefer
Lisa-Marie Krchov
SolCAP: David Francis, Shawn Yarnes, John McQueen
rrBLUP: Jeff Endelman
Dataset: TCAP Hordeum's Toolbox
Funding: Monsanto Company, USDA SolCAP
Questions?