TP.prep: Selecting the best training population for genomic selection

Description

This function was designed to help users build the best training population for real genomic selection applications. The RICE dataset from Zhao et al. (2011) has been incorporated to the package to highlight this new feature of sommer implemented in v.1.5. The TP.prep function is just a function that uses genetic markers or any kind of numeric matrix to perform principal component analysis, identify the names of the validation population the user provide, and find the most suitable training population for performing genomic selection based on similarities or similarities-dissimilarities.

My recomendation would be to randomly choose the TP and predict the performance of the VP. Do this a 100 times and get the average GEBV for the VP and take decisions using those.

Usage

TP.prep(markers=NULL, vp.names=NULL, tp.size=NULL, method="random", 
        npop=300, nelite=1, mutprob=.5, niterations=100, lambda=NULL)

Arguments

markers

a marker matrix in numeric format but any numeric matrix can be used in this argument. No extra columns should be in this matrix. ROWNAMES SHOOULD BE THE NAMES OF THE INDIVIDUALSIN THE POPULATION.

vp.names

a character vector with the names of the validaton population. Make sure that the marker matrix has rownames with the individuals so the program can find them and identify the most suitable training population.

tp.size

a numeric vector indicating the population sizes desired for the training population. By default is NULL which will make the program select in steps of 20 until reach the population size (number of rows in the marker matrix).

method

a method among "similar", "sim-dissim", "random", "PEV", or "CDmean" indicating if the program should look for similar plants (based on the PCA of the marker matrix) or a mixture of similar and constrasting individuals, select plants at random, minimize the PEV, or calculate the CD mean. The default is "sim-dissim", meaning will make a mixture.

npop

genetic algorithm parameter, number of solutions at each iteration. Only for the CDmean algorithm.

nelite

genetic algorithm parameter, number of solutions selected as elite parents which will generate the next set of solutions. Only for the CDmean algorithm.

mutprob

genetic algorithm parameter, probability of mutation for each generated solution. Only for the CDmean algorithm.

niterations

genetic algorithm parameter, number of iterations.Only for the CDmean algorithm.

lambda

a value for the CDmean algorithm where lambda=Var(e)/Var(g).

Value

If all parameters are correctly indicated the program will return:

$tp.list: a list with the names of the TP populations. Each element of the list corresponds to each population size specified in the 'tp.size' argument.

References

Keyan Zhao, Chih-Wei Tung, Georgia C. Eizenga, Mark H. Wright, M. Liakat Ali, Adam H. Price, Gareth J. Norton, M. Rafiqul Islam, Andy Reynolds, Jason Mezey, Anna M. McClung, Carlos D. Bustamante & Susan R. McCouch (2011). Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat Comm 2:467 DOI: 10.1038/ncomms1467, Published Online 13 Sep 2011.

Covarrubias-Pazaran G (2016) Genome assisted prediction of quantitative traits using the R package sommer. PLoS ONE 11(6): doi:10.1371/journal.pone.0156744

Examples

Run this code

# NOT RUN {
####=========================================####
#### For CRAN time limitations most lines in the 
#### examples are silenced with one '#' mark, 
#### remove them and run the examples using
#### command + shift + C |OR| control + shift + C
####=========================================####
data(RICE)
W <- RICE$RiceGenoN; W[1:5,1:5]; dim(W)
#### lets use only 5000 random markers across the genome
#### for demonstration purposes
ww <- sample(1:dim(W)[2],5000)
W <- W[,ww]; dim(W)

Y <- RICE$RicePheno; Y[1:5,1:5]; dim(Y)
(vp.str <- RICE$vp.str[1:25]) # validation population with structure
bases <- seq(50,200,25) # TP sizes required

####=========================================####
####=========================================####
#### a) Random sample based
####=========================================####
####=========================================####
#### rest of the plants not in the validation population (VP)
rest <- (1:dim(W)[1])[-which(rownames(W) %in% vp.str)]
#### select the best TP for the VP, provide:
#### markers, names of VP, size of the TP desired, method
#tp.plant.strR <- TP.prep(markers=W, vp.names = vp.str, 
#                         tp.size = bases, method="random")

####=========================================####
#### b) Genetic simmilarity-dissimilarity based
####=========================================####
#tp.plant.strG1 <- TP.prep(markers=W, vp.names = vp.str, 
#                         tp.size = bases, method="sim-dissim")
                         
#tp.plant.strG2 <- TP.prep(markers=W, vp.names = vp.str, 
#                         tp.size = bases, method="sim")
                       
####=========================================####
####=========================================####
#### c) QTL based
#### we assume real situation that VP 
#### wouldn't be phenotyped for GWAS
####=========================================####
####=========================================####
#K1 <- A.mat(W)
#Z1 <- diag(dim(K1)[1])
#image(K1)
#ETA1 <- list(list(Z=Z1,K=K1))
#YY <- Y; YY[vp.str,] <- NA
####=========================================####
#### be patient, the GWAS is for >36,000 SNPs
####=========================================####
#ans.GWAS <- mmer(Y=YY$Protein.content, Z=ETA1, W=W)
#(h2 <- sqrt(ans.GWAS$var.comp[1,]/sum(ans.GWAS$var.comp)))
####=========================================####
#### get marker matrix for the most significant QTLs
#### and do the TP selection using only those markers
####=========================================####
#X <- hits(ans.GWAS, nmar = 25); head(X); dim(X)
#tp.plant.strG3 <- TP.prep(markers=X, vp.names = vp.str, 
#                      tp.size = bases, method="sim-dissim")

####=========================================####
####=========================================####
#### COMPARE METHODS with 50 plants
####=========================================####
####=========================================####
#metho <- list(a=tp.plant.strR[[1]], b=tp.plant.strG1[[1]],
#      c=tp.plant.strG2[[1]], d=tp.plant.strG3[[1]])

#resos <- numeric()
#for(i in 1:4){ # for each method
  
#  tpgi <- metho[[i]] # training pop genetic-based
#  Wpgi <- W[c(vp.str,tpgi),] # provisional markers based on genes
#  Ypgi <- Y[c(vp.str,tpgi),] # provisional phenos based on genes
#  Xpgi <- X[c(vp.str,tpgi),] # provisional bag-mat based on genes
  
#  Kpgi <- A.mat(Wpgi)
#  Zpgi <- diag(dim(Kpgi)[1])
#  Ypgi[vp.str,] <- NA
#  yt <- Ypgi$Protein.content
#  ETAg <- list(list(Z=Zpgi, K=Kpgi))
#  mixg <- mmer(Y=yt, X=Xpgi, Z=ETAg, method="EMMA") #X=Xpgi,
#  resos[i] <- cor(Y[vp.str,"Protein.content"], 
#                   fitted(mixg)[(1:length(vp.str)),], 
#                   use="complete")
#}
#resos
### the random method needs more iterations to have a real idea what
### would be the predictive ability of using plants at random,
### this is just to give an idea to users.
# }