stepwiseqtl: Stepwise selection for multiple QTL

Description

Performs forward/backward selection to identify a multiple QTL model, with model choice made via a penalized LOD score, with separate penalties on main effects and interactions.

Usage

stepwiseqtl(cross, chr, pheno.col=1, qtl, formula, max.qtl=10, covar=NULL,
            method=c("imp", "hk"), model=c("normal", "binary"),
            incl.markers=TRUE, refine.locations=TRUE,
            additive.only=FALSE, scan.pairs=FALSE, penalties,
            keeplodprofile=TRUE, keeptrace=FALSE, verbose=TRUE,
            tol=1e-4, maxit=1000, require.fullrank=FALSE)

Value

The output is a representation of the best model, as measured by the penalized LOD score (see Details), among all models visited. This is QTL object (of class "qtl", as produced by

makeqtl), with attributes "formula", indicating the model formula, and "pLOD" indicating the penalized LOD score.

If keeplodprofile=TRUE, LOD profiles from the last pass through the refinement algorithm are retained as an attribute,

"lodprofile", to the object. These may be plotted with

plotLodProfile.

If keeptrace=TRUE, the output will contain an attribute

"trace" containing information on the best model at each step of forward and backward elimination. This is a list of objects of class "compactqtl", which is similar to a QTL object (as produced by makeqtl) but containing just a vector of chromosome IDs and positions for the QTL. Each will also have attributes "formula" (containing the model formula) and

"pLOD" (containing the penalized LOD score.

Arguments

cross: An object of class cross. See read.cross for details.
chr: Optional vector indicating the chromosomes to consider in search for QTL. This should be a vector of character strings referring to chromosomes by name; numeric values are converted to strings. Refer to chromosomes with a preceding - to have all chromosomes but those considered. A logical (TRUE/FALSE) vector may also be used.
pheno.col: Column number in the phenotype matrix which should be used as the phenotype. One may also give character strings matching the phenotype names. Finally, one may give a numeric vector of phenotypes, in which case it must have the length equal to the number of individuals in the cross, and there must be either non-integers or values < 1 or > no. phenotypes; this last case may be useful for studying transformations.
qtl: Optional QTL object (of class "qtl", as created by makeqtl) to use as a starting point.
formula: Optional formula to define the QTL model to be used as a starting point.
max.qtl: Maximum number of QTL to which forward selection should proceed.
covar: Data frame of additive covariates.
method: Indicates whether to use multiple imputation or Haley-Knott regression.
model: The phenotype model: the usual model or a model for binary traits
incl.markers: If FALSE, do calculations only at points on an evenly spaced grid.
refine.locations: If TRUE, use refineqtl to refine the QTL locations after each step of forward and backward selection.
additive.only: If TRUE, allow only additive QTL models; if FALSE, consider also pairwise interactions among QTL.
scan.pairs: If TRUE, perform a two-dimensional, two-QTL scan at each step of forward selection.
penalties: Vector of three (or six) values indicating the penalty on the number of QTL terms. If three values, these are the penalties on main effects and heavy and light penalties on interactions. If six values, these include X-chr-specific penalties, and the values are: main effect for autosomes, main effect for X chr, heavy penalty on A:A interactions, light penalty on A:A interactions, penalty on A:X interactions, and penalty on X:X interactions. See the Details below. If missing, default values are used that are based on simulations of backcrosses and intercrosses with genomes modeled after that of the mouse.
keeplodprofile: If TRUE, keep the LOD profiles from the last iteration as attributes to the output.
keeptrace: If TRUE, keep information on the sequence of models visited through the course of forward and backward selection as an attribute to the output.
verbose: If TRUE, give feedback about progress. If verbose is an integer > 1, even more information is printed.
tol: Tolerance for convergence for the binary trait model.
maxit: Maximum number of iterations for fitting the binary trait model.
require.fullrank: If TRUE, give LOD=0 when covariate matrix in the linear regression is not of full rank.

Methods

imp: multiple imputation is used, as described by Sen and Churchill (2001).

hk: Haley-Knott regression is used (regression of the phenotypes on the multipoint QTL genotype probabilities), as described by Haley and Knott (1992).

Author

Karl W Broman, broman@wisc.edu

Details

We seek to identify the model with maximal penalized LOD score. The penalized LOD score, defined in Manichaikul et al. (2009), is the LOD score for the model (the \(\log_{10}\) likelihood ratio comparing the model to the null model with no QTL) with penalties on the number of QTL and QTL:QTL interactions.

We consider QTL models allowing pairwise interactions among QTL but with an enforced hierarchy in which inclusion of a pairwise interaction requires the inclusion of both of the corresponding main effects. Additive covariates may be included, but currently we do not explore QTL:covariate interactions. Also, the penalized LOD score criterion is currently defined only for autosomal loci, and results with the X chromosome should be considered with caution.

The penalized LOD score is of the form \(pLOD(\gamma) = LOD(\gamma) - T_m p_m - T_h p_h - T_l p_l\) where \(\gamma\) denotes a model, \(p_m\) is the number of QTL in the model ("main effects"), \(p_h\) is the number of pairwise interactions that will be given a heavy interaction penalty, \(p_l\) is the number of pairwise interactions that will be given a light interaction penalty, \(T_m\) is the penalty on main effects, \(T_h\) is the heavy interaction penalty, and \(T_l\) is the light interaction penalty. The penalties argument is the vector \((T_m, T_h, T_l)\). If \(T_l\) is missing (penalties has a vector of length 2), we assume \(T_l = T_h\), and so all pairwise interactions are assigned the same penalty.

The "heavy" and "light" interaction penalties can be a bit confusing. Consider the clusters of QTL that are connected via one or more pairwise interactions. To each such cluster, we assign at most one "light" interaction penalty, and give all other pairwise interactions the heavy interaction penalty. In other words, if \(p_i\) is the total number of pairwise interactions for a QTL model, we let \(p_l\) be the number of clusters of connected QTL with at least one pairwise interaction, and then let \(p_h - p_i - p_l\).

Let us give an explicit example. Consider a model with 6 QTL, and with interactions between QTL 2 and 3, QTL 4 and 5 and QTL 4 and 6 (so we have the model formula y ~ Q1 + Q2 + Q3 + Q4 + Q5 + Q6 + Q2:Q3 + Q4:Q5 + Q4:Q6). There are three clusters of connected QTL: (1), (2,3) and (4,5,6). We would assign 6 main effect penalties (\(T_m\)), 2 light interaction penalties (\(T_l\)), and 1 heavy interaction penalty (\(T_h\)).

Manichaikul et al. (2009) described a system for deriving the three penalties on the basis of permutation results from a two-dimensional, two-QTL genome scan (as calculated with scantwo). These may be calculated with the function calc.penalties.

A forward/backward search method is used, with the aim to optimize the penalized LOD score criterion. That is, we seek to identify the model with maximal the penalized LOD score. The search algorithm was based closely on an algorithm described by Zeng et al. (1999).

We use forward selection to a model of moderate size (say 10 QTL), followed by backward elimination all the way to the null model. The chosen model is that which optimizes the penalized LOD score criterion, among all models visited. The detailed algorithm is as follows. Note that if additive.only=TRUE, no pairwise interactions are considered.

Start at the null model, and perform a single-QTL genome scan, and choose the position giving the largest LOD score. If scan.pairs=TRUE, start with a two-dimensional, two-QTL genome scan instead. If an initial QTL model were defined through the arguments qtl and formula, start with this model and jump immediately to step 2.
With a fixed QTL model in hand:
1. Scan for an additional additive QTL.
2. For each QTL in the current model, scan for an additional interacting QTL.
3. If there are \(\ge\) 2 QTL in the current model, consider adding one of the possible pairwise interactions.
4. If scan.pairs=TRUE perform a two-dimensional, two-QTL scan, seeking to add a pair of novel QTL, either additive or interacting.
5. Step to the model that gives the largest value for the model comparison criterion, among those considered at the current step.
Refine the locations of the QTL in the current model (if refine.locations=TRUE).
Repeat steps 2 and 3 up to a model with some pre-determined number of loci.
Perform backward elimination, all the way back to the null model. At each step, consider dropping one of the current main effects or interactions; move to the model that maximizes the model comparison criterion, among those considered at this step. Follow this with a refinement of the locations of the QTL.
Finally, choose the model having the largest model comparison criterion, among all models visited.

In this forward/backward algorithm, it is likely best to build up to an overly large model and then prune it back. Note that there is no "stopping rule"; the chosen model is that which optimizes the model comparison criterion, among all models visited. The search can be time consuming, particularly if a two-dimensional scan is performed at each forward step. Such two-dimensional scans may be useful for identifying QTL linked in repulsion (having effects of opposite sign) or interacting QTL with limited marginal effects, but our limited experience suggests that they are not necessary; important linked or interacting QTL pairs can be picked up in the forward selection to a large model, and will be retained in the backward elimination phase.

References

Manichaikul, A., Moon, J. Y., Sen, Ś, Yandell, B. S. and Broman, K. W. (2009) A model selection approach for the identification of quantitative trait loci in experimental crosses, allowing epistasis. Genetics, 181, 1077--1086.

Broman, K. W. and Speed, T. P. (2002) A model selection approach for the identification of quantitative trait loci in experimental crosses (with discussion). J Roy Stat Soc B 64, 641--656, 731--775.

Haley, C. S. and Knott, S. A. (1992) A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69, 315--324.

Sen, Ś. and Churchill, G. A. (2001) A statistical framework for quantitative trait mapping. Genetics 159, 371--387.

Zeng, Z.-B., Kao, C.-H. and Basten, C. J. (1999) Estimating the genetic architecture of quantitative traits. Genetical Research, 74, 279--289.

Examples

Run this code

data(fake.bc)
fake.bc <- subset(fake.bc, chr=c(1,2,3,5))
if (FALSE) fake.bc <- calc.genoprob(fake.bc, step=2.5)
fake.bc <- calc.genoprob(fake.bc, step=0)
outsw <- stepwiseqtl(fake.bc, max.qtl=3, method="hk", keeptrace=TRUE)

# best model
outsw
plotModel(outsw)

# path through model space
thetrace <- attr(outsw, "trace")

# plot of these
par(mfrow=c(3,3))
for(i in seq(along=thetrace))
  plotModel(thetrace[[i]], main=paste("pLOD =",round(attr(thetrace[[i]],"pLOD"), 2)))

Run the code above in your browser using DataLab