Introduction to Genomic Selection in R using the rrBLUP Package

Amy Jacobson

jaco0795@umn.edu

University of Minnesota

Learning Objectives

• Download the package and load the sample files
• Impute missing markers using A.mat()
• Define the training and validation populations
• Run mixed.solve() and determine accuracy of predictions

Overview of rrBLUP package

• Download from CRAN (version 4)
• Must use R version 2.14.1 or greater
• Uses ridge regression BLUP for genomic predictions
• Predicts marker effects through mixed.solve()
• A.mat() command can be used to impute missing markers
  ◦ mixed.solve() does not allow NA marker values
• Define the training and validation populations

One Step vs. Two Step

• One step
  ◦ Uses a mixed model analysis for the plot data
• Two step
  ◦ Adjusted means are calculated across locations
  ◦ Means are then used in ridge regression BLUP
• This webinar uses a two-step approach
  ◦ Computationally more efficient and faster

Install the rrBLUP Package

• Launch R -> Packages -> Install Package
• Select the CRAN mirror nearest you
• Select the rrBLUP package
• Alternatively, install the package from a zip file
  ◦ http://cran.r-project.org/web/packages/rrBLUP/index.html
  ◦ Packages -> Install package from local zip files
  ◦ Select the package from its saved location
• Once the package is installed, the library must be loaded every time R is opened
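
The menu steps above can also be done from the R console. This is a minimal sketch, assuming an internet connection and that the CRAN cloud mirror is reachable; the mirror URL is only one possible choice.

```r
# Install rrBLUP from CRAN once; skip the download if it is already installed.
if (!requireNamespace("rrBLUP", quietly = TRUE)) {
  install.packages("rrBLUP", repos = "https://cloud.r-project.org")
}

# Load the library; this line is needed in every new R session.
library(rrBLUP)
```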

Sample Files

• Files downloaded from the Hordeum Toolbox: http://hordeumtoolbox.org/
• University of Minnesota barley breeding program preliminary yield trial, St. Paul location, 2009
• Phenotypic traits: yield, plant height and heading date
• 1178 markers, 164 of them with missing (NA) values
• 1 = homozygous for parent 1, 0 = heterozygous, and -1 = homozygous for parent 2
  ◦ Markers must be in the {-1,0,1} format for rrBLUP
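
If a marker file is not yet in the {-1,0,1} format, it has to be recoded first. The sketch below assumes a hypothetical input coded as allele counts 0/1/2 (not the format of the sample files, which are already coded correctly); other codings would need a different mapping.

```r
# Hypothetical input: allele counts coded 0/1/2, with NA for missing calls.
counts <- matrix(c(0, 1, 2, NA,
                   2, 2, 0, 1),
                 nrow = 2, byrow = TRUE)

# Subtracting 1 maps 0 -> -1, 1 -> 0, 2 -> 1; NA stays NA.
markers_coded <- counts - 1

# Every non-missing value should now be -1, 0 or 1.
stopifnot(all(markers_coded[!is.na(markers_coded)] %in% c(-1, 0, 1)))
print(markers_coded)
```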

Load the Sample Files

• setwd(): set the working directory to the location of the sample files
• read.table() command used for .txt files
• read.csv() command used for .csv files
• header=F since the sample marker file does not have a header with marker names
• R is case sensitive, so the function and argument names must be typed in lower case as shown
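
As an illustration of reading a marker file without a header, the sketch below writes a tiny stand-in file to a temporary folder and reads it back. The file name and values are invented; with the real sample files, the path would point to the downloaded marker file instead.

```r
# Write a tiny stand-in marker file (3 lines x 4 markers, no header row).
marker_file <- file.path(tempdir(), "snp_example.txt")
writeLines(c("1 0 -1 NA",
             "-1 1 1 0",
             "0 -1 NA 1"), marker_file)

# header = FALSE: the first row is data, not marker names.
Markers <- as.matrix(read.table(marker_file, header = FALSE))

# head() prints the first rows so the import can be checked by eye.
head(Markers)
```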

Load the Sample Files

• head() command used to see the first rows of a file (6 by default)
• Useful to see if the data were loaded correctly

Load the Sample Files

• Load the phenotype file and use the head() command to see the first rows
• header=T since phenotype files have column names
• Markers and phenotypes must be in matrix format (convert with as.matrix())

Load the Sample Files

• Determine the size of the matrices
• dim() command gives the number of rows and columns
• Marker matrix: 96 observations by 1178 markers; phenotype matrix: 96 observations by 3 traits
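
The phenotype import and the dimension check can be sketched together. The file name, trait names and values below are invented stand-ins; the point is the header=TRUE argument, the conversion to a matrix, and the check that both matrices have one row per line.

```r
# Stand-in phenotype file with a header row.
pheno_file <- file.path(tempdir(), "traits_example.csv")
writeLines(c("yield,height,heading",
             "5200,78,172",
             "4800,81,175",
             "5050,75,170"), pheno_file)

# header = TRUE: the first row holds the trait names.
Pheno <- as.matrix(read.csv(pheno_file, header = TRUE))

# Stand-in marker matrix with one row per line in the phenotype file.
Markers <- matrix(c(1, 0, -1, -1, 1, 1, 0, -1, 1), nrow = 3, byrow = TRUE)

dim(Pheno)    # rows = lines, columns = traits
dim(Markers)  # rows = lines, columns = markers

# Both matrices must describe the same lines in the same order.
stopifnot(nrow(Pheno) == nrow(Markers))
```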

Impute Missing Markers

• rrBLUP mixed.solve() does not allow missing marker values
• Imputed value is the population mean for that marker
• Useful for SNP data since the level of missing data is low
• In the sample files, 164 of the 1178 markers have missing values (about 14%)
• A.mat() also calculates the additive relationship matrix
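
To make "the population mean for that marker" concrete, the sketch below imputes a toy matrix by hand in base R. It only illustrates the idea of mean imputation on invented values; it is not the code that A.mat() runs internally.

```r
# Toy marker matrix: 4 lines x 3 markers, two missing calls.
M <- matrix(c( 1,  0, -1,
              NA,  1, -1,
               1, NA,  1,
              -1,  1,  1), nrow = 4, byrow = TRUE)

# Share of missing values over all line-by-marker data points.
prop_missing <- sum(is.na(M)) / length(M)

# Replace each NA with the mean of the observed values in its column.
col_means <- colMeans(M, na.rm = TRUE)
M_imputed <- M
na_idx <- which(is.na(M), arr.ind = TRUE)
M_imputed[na_idx] <- col_means[na_idx[, "col"]]

M_imputed
```

Imputed values are usually fractions rather than -1, 0 or 1, because they are averages over the lines that do have a call for that marker.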

Impute Missing Markers

• max.missing: maximum proportion of missing data allowed for a marker
  ◦ With max.missing=0.5, a marker with more than 50% missing data is not imputed
• impute.method: "mean" imputes the mean of the marker
• return.imputed: returns the imputed marker matrix if set to TRUE

Impute Missing Markers

• > impute = A.mat(Markers, max.missing=0.5, impute.method="mean", return.imputed=T)
• > Markers_impute = impute$imputed
• The imputed marker matrix is saved as Markers_impute
• impute$imputed returns the imputed marker matrix
• impute$A returns the additive relationship matrix
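
The call above can be tried on simulated data before running it on the real files. This is a sketch under stated assumptions: the marker matrix is randomly generated (not the barley data), and rrBLUP must already be installed. How markers above max.missing appear in the returned matrix may differ between rrBLUP versions, so it is worth checking dim() and anyNA() on the result.

```r
library(rrBLUP)

set.seed(1)
# Simulated stand-in for the marker matrix: 20 lines x 50 markers in {-1, 0, 1}.
Markers <- matrix(sample(c(-1, 0, 1), 20 * 50, replace = TRUE), nrow = 20)

# Knock out a few calls; no marker comes close to the 50% threshold.
Markers[cbind(c(2, 5, 9), c(1, 10, 30))] <- NA

impute <- A.mat(Markers, max.missing = 0.5,
                impute.method = "mean", return.imputed = TRUE)

Markers_impute <- impute$imputed  # imputed marker matrix
A <- impute$A                     # additive relationship matrix (lines x lines)

dim(A)
anyNA(Markers_impute)
```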

Impute Missing Markers

• > impute$imputed
• [Screenshot of the output with two callouts: "Imputed marker value" and "Marker value left NA if more than 50% missing data"]

Impute Missing Markers

• Remove markers that had more than 50% missing data
  ◦ NA values are not allowed in mixed.solve()
• Two markers in the SNP file must be removed
  ◦ Columns 169 and 562
• New dimensions show 2 fewer columns
• Use Markers_impute2 as the marker matrix for estimating marker effects
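
Rather than typing column numbers by hand, the columns that still contain NA can be located programmatically. The sketch below uses an invented 3 x 3 matrix; with the sample data, the same lines would be expected to pick out the two columns named on the slide.

```r
# Toy imputed matrix in which the second marker is still entirely NA.
Markers_impute <- matrix(c( 1.0, NA, -1,
                            0.5, NA,  1,
                           -1.0, NA,  0), nrow = 3, byrow = TRUE)

# Find the columns that still contain NA, rather than hard-coding positions.
na_cols <- which(colSums(is.na(Markers_impute)) > 0)

# Drop them; drop = FALSE keeps the result a matrix even with one column left.
Markers_impute2 <- if (length(na_cols) > 0) {
  Markers_impute[, -na_cols, drop = FALSE]
} else {
  Markers_impute
}

dim(Markers_impute)
dim(Markers_impute2)
```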

Training and Validation Populations

• Training population: genotyped and phenotyped
• Validation population: phenotype values are estimated from marker effects calculated in the training population
• The code assigns 60% of the total population to the training population
  ◦ The remaining 40% is the validation population

Training and Validation Populations

• 58 (60% of the total population of 96) random numbers are sampled to determine which individuals are in the training population
• Individuals are the row numbers for the phenotype and marker matrices
• Sampled numbers will be different every time the code is run and will affect the correlation accuracy
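
The sampling step can be sketched in base R as follows. The set.seed() call is an addition for illustration: it makes the draw repeatable, which helps when comparing runs; the seed value itself is arbitrary.

```r
n_total <- 96                    # lines in the sample data
n_train <- round(0.6 * n_total)  # 60% of 96, rounded to 58

# Fixing the seed makes the same lines get drawn on every run;
# remove this line to get a new random split each time.
set.seed(123)
train <- sample(seq_len(n_total), n_train)  # row numbers, no repeats

length(train)
```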

Training and Validation Populations

• Validation population is 40% of the total population
• setdiff() command determines the row numbers that are not in the training population and will be part of the validation population

Training and Validation Populations

• Pheno_train and m_train are the phenotype and marker matrices for the individuals in the training population
• Pheno_valid and m_valid are the phenotype and marker matrices for the validation population
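
Putting the split together, the sketch below builds the four matrices from row indices. The data are simulated, and the index names train and test are illustrative choices; only Pheno_train, m_train, Pheno_valid and m_valid come from the slides.

```r
set.seed(42)
n <- 10
# Invented stand-ins for the imputed marker matrix and the phenotype matrix.
Markers_impute2 <- matrix(sample(c(-1, 0, 1), n * 5, replace = TRUE), nrow = n)
Pheno <- cbind(yield = rnorm(n, 5000, 300), height = rnorm(n, 80, 5))

train <- sample(seq_len(n), round(0.6 * n))  # 60% training rows
test  <- setdiff(seq_len(n), train)          # remaining 40%

# The same row numbers index both matrices, so lines stay aligned.
Pheno_train <- Pheno[train, , drop = FALSE]
m_train     <- Markers_impute2[train, , drop = FALSE]
Pheno_valid <- Pheno[test, , drop = FALSE]
m_valid     <- Markers_impute2[test, , drop = FALSE]
```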

Run mixed.solve

• Model: Y = μ + Xg + e
  ◦ Y: N x 1 vector of phenotypic means (Pheno_train)
  ◦ μ: overall mean of the training set (returned as $beta)
  ◦ X: N x Nm marker matrix (m_train)
  ◦ g: Nm x 1 vector of marker effects (calculated in mixed.solve() as $u)
  ◦ e: N x 1 vector of residual effects

Run mixed.solve

• Callouts on the mixed.solve() call:
  ◦ Yield is the first column of the Pheno_train matrix
  ◦ y: vector of observations
  ◦ Z: design matrix of random effects (markers)
  ◦ SE=FALSE: standard errors are not calculated
  ◦ K=NULL: the K matrix is the identity matrix
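
A call with these arguments can be sketched on simulated data. The marker matrix, the "true" effects and the trait values below are all randomly generated for illustration, and rrBLUP is assumed to be installed; with the sample files, the yield column of Pheno_train and m_train would take their place.

```r
library(rrBLUP)

set.seed(7)
n <- 40
n_markers <- 100
# Simulated training data: markers in {-1, 0, 1} and a yield-like trait.
m_train <- matrix(sample(c(-1, 0, 1), n * n_markers, replace = TRUE), nrow = n)
true_effects <- rnorm(n_markers, sd = 0.5)
yield <- as.vector(50 + m_train %*% true_effects + rnorm(n, sd = 2))

# y = observations, Z = marker matrix; K = NULL means an identity K matrix,
# SE = FALSE skips the standard errors.
Yield_answer <- mixed.solve(y = yield, Z = m_train, K = NULL, SE = FALSE)

e    <- Yield_answer$u     # one estimated effect per marker
beta <- Yield_answer$beta  # estimated overall mean
head(e)
```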

Run mixed.solve

• Yield_answer$u is the output of the marker effects
• head(e) shows the marker effects for the first six markers (here e is the R object holding the marker effects, not the residual term in the model)

Run mixed.solve

• m_valid %*% e = marker validation matrix times the marker effects (%*% is matrix multiplication in R)
• Pred_yield = predicted yield based on the marker effects of the training population, with the grand mean added in
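
A small worked example of the prediction step, using invented marker effects and an invented mean so the arithmetic can be followed by hand:

```r
# Invented numbers standing in for the output of the training step.
e    <- c(0.8, -0.3, 0.1)  # marker effects, one per marker
beta <- 50                 # overall mean of the training set

# Validation marker matrix: 2 lines x 3 markers.
m_valid <- matrix(c( 1, 0, -1,
                    -1, 1,  1), nrow = 2, byrow = TRUE)

# %*% is matrix multiplication; a plain * would multiply element by element.
pred_yield_valid <- m_valid %*% e

# Add the overall mean back to get predictions on the trait scale.
Pred_yield <- as.vector(pred_yield_valid) + beta
Pred_yield
```

For the first line the marker part is 0.8 * 1 + (-0.3) * 0 + 0.1 * (-1) = 0.7, so the prediction is 50.7; for the second line it is -0.8 - 0.3 + 0.1 = -1.0, giving 49.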

Determine Correlation Accuracy

• Correlation between the predicted yield values and the observed yield values
• Accuracy will change slightly each time because different individuals are sampled for the training and validation populations
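
The whole sequence, from split to correlation, can be sketched on simulated data as follows. Nothing here uses the barley files: the markers and trait are randomly generated, the variable names are illustrative, and rrBLUP is assumed to be installed.

```r
library(rrBLUP)

set.seed(11)
n <- 96
n_markers <- 200
# Simulated markers and trait; not the barley data.
M <- matrix(sample(c(-1, 0, 1), n * n_markers, replace = TRUE), nrow = n)
y <- as.vector(50 + M %*% rnorm(n_markers, sd = 0.5) + rnorm(n, sd = 3))

train <- sample(seq_len(n), round(0.6 * n))
test  <- setdiff(seq_len(n), train)

# Estimate marker effects in the training set, then predict the validation set.
fit  <- mixed.solve(y = y[train], Z = M[train, ], K = NULL, SE = FALSE)
pred <- as.vector(M[test, ] %*% fit$u) + as.vector(fit$beta)

# Accuracy = correlation between predicted and observed validation values.
accuracy <- cor(pred, y[test])
accuracy
```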

Determine Correlation Accuracy

• Plant height
• Heading date
• Correlation accuracy with 500 iterations

Determine Correlation Accuracy

• Correlation accuracy is different for each trait
• Values will be different every time the code is run, since different lines will be included in the training or validation sets
• Accuracy is affected by training size, validation size, number of markers and heritability
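
Because a single split gives a noisy estimate, the split can be repeated many times and the correlations averaged. The sketch below wraps one split in a hypothetical helper, one_split(), and repeats it; the helper name, its train_frac argument and the simulated data are assumptions for illustration, and the iteration count is kept small so the example runs quickly.

```r
library(rrBLUP)

# One random split; returns the validation correlation (hypothetical helper).
one_split <- function(M, y, train_frac = 0.6) {
  n <- nrow(M)
  train <- sample(seq_len(n), round(train_frac * n))
  test  <- setdiff(seq_len(n), train)
  fit   <- mixed.solve(y = y[train], Z = M[train, ], K = NULL, SE = FALSE)
  pred  <- as.vector(M[test, ] %*% fit$u) + as.vector(fit$beta)
  cor(pred, y[test])
}

set.seed(2024)
n <- 96
n_markers <- 200
# Simulated markers and trait; not the barley data.
M <- matrix(sample(c(-1, 0, 1), n * n_markers, replace = TRUE), nrow = n)
y <- as.vector(50 + M %*% rnorm(n_markers, sd = 0.5) + rnorm(n, sd = 3))

n_iter <- 50  # increase to 500 to match the number of iterations on the slides
acc <- replicate(n_iter, one_split(M, y))

mean(acc)  # average accuracy over the random splits
sd(acc)    # spread caused by which lines land in each set
```

Changing train_frac in the call to one_split() is one way to explore how training population size affects accuracy.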

Determine Correlation Accuracy

• Effects of training population size on accuracy
• [Figure: correlation accuracy (y-axis, 0 to 0.35) plotted against percent of individuals in the training population (x-axis, 0 to 100) for yield (YLD)]

Common Errors

• Headers incorrectly input
• NA markers
• Incorrect matrix dimensions
  ◦ Example: one individual removed from the phenotype matrix
• Values read in as characters instead of numeric
  ◦ Quotes around values
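
Most of these errors can be caught before calling mixed.solve() with a few quick checks. The sketch below uses invented matrices and a hypothetical helper, check_inputs(); it is one possible set of checks, not part of rrBLUP.

```r
# A marker matrix that was accidentally read in as text (note the quotes).
Markers_chr <- matrix(c("1", "0", "-1",
                        "-1", "1", "0"), nrow = 2, byrow = TRUE)
is.numeric(Markers_chr)  # FALSE: values are character strings

# Convert to numeric while keeping the matrix shape.
Markers_num <- matrix(as.numeric(Markers_chr), nrow = nrow(Markers_chr))

# Phenotypes for 3 lines, although the marker matrix only has 2 rows.
Pheno <- matrix(c(5200, 4800, 5050), ncol = 1)

# Hypothetical helper: TRUE means the check passed.
check_inputs <- function(markers, pheno) {
  c(numeric_markers = is.numeric(markers),
    no_na_markers   = !anyNA(markers),
    rows_match      = nrow(markers) == nrow(pheno))
}

check_inputs(Markers_num, Pheno)
```

Here rows_match comes back FALSE, which is the "incorrect matrix dimensions" case: the two matrices do not describe the same set of individuals.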

Resources

• rrBLUP reference manual
  ◦ http://cran.r-project.org/web/packages/rrBLUP/rrBLUP.pdf
• rrBLUP vignette
  ◦ http://cran.r-project.org/web/packages/rrBLUP/vignettes/vignette.pdf
• Endelman, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. doi: 10.3835/plantgenome2011.08.0024

Acknowledgements

• Rex Bernardo, Emily Combs, Lian Lian, Chris Schaefer, Lisa-Marie Krchov
• SolCAP: David Francis, Shawn Yarnes, John McQueen
• rrBLUP: Jeff Endelman
• Dataset: TCAP Hordeum Toolbox
• Funding: Monsanto Company, USDA SolCAP

Questions?