haplo.glm: GLM Regression of Trait on Ambiguous Haplotypes

Description

Perform glm regression of a trait on haplotype effects, allowing for ambiguous haplotypes. This method performs an iterative two-step EM, with the posterior probabilities of pairs of haplotypes per subject used as weights to update the regression coefficients, and the regression coefficients used to update the posterior probabilities.

Usage

haplo.glm(formula=formula(data), family=gaussian, data=sys.parent(),
           weights, na.action="na.geno.keep", start=NULL, 
           locus.label=NA, control=haplo.glm.control(), 
           method="glm.fit", model=TRUE, x=FALSE, y=TRUE, 
           contrasts=NULL, ...)

Arguments

formula

a formula expression as for other regression models, of the form response ~ predictors. For details, see the documentation for lm and formula.

family

a family object. This is a list of expressions for defining the link, variance function, initialization values, and iterative weights for the generalized linear model. Supported families are: gaussian, binomial, poisson. Curr

data

a data frame in which to interpret the variables occurring in the formula. A CRITICAL element of the data frame is the matrix of genotypes, denoted here as "geno", although an informative name should be used in practice. This geno matrix is actuall

weights

the weights for observations (rows of the data frame). By default, all observations are weighted equally.

na.action

a function to filter missing data. This is applied to the model.frame. The default value of na.action=na.geno.keep will keep observations with some (but not all) missing alleles, but exclude observations missing any other data (e.g., response variab

start

a vector of initial values on the scale of the linear predictor.

locus.label

vector of labels for loci.

control

list of control parameters. The default is constructed by the function haplo.glm.control. The items in this list control the regression modeling of the haplotypes (e.g., additive, dominant, recessive effects of haplotypes; which haplotype is chosen

method

currently, glm.fit is the only method allowed.

model

logical, if model=TRUE, the model.frame is returned.

logical, if x=TRUE, the model.matrix is returned.

logical, if y=TRUE, the response variable is returned.

contrasts

currently ignored

...

other arguments that may be passed - currently ignored.

Value

An object of class "haplo.glm" is returned. The output object from haplo.glm has all the components of a glm object, with a few more. It is important to note that some of the returned components correpond to the "expanded" version of the data. This means that each observation is expanded into the number of terms in the observation's posterior distribution of haplotype pairs, given the marker data. For example, when fitting the response y on haplotype effects, the value of y[i], for the ith observation, is replicated m[i] times, where m[i] is the number of pairs of haplotypes consistent with the observed marker data. The returned components that are expanded are indicated below by [expanded] in the definition of the component. These expanded components may need to be collapsed, depending on the user's objectives. For example, when considering the influence of an observation, it may make sense to examine the expanded residuals for a single observation, perhaps plotted against the haplotypes for that observation. In contrast, it would not be sensible to plot all residuals against non-genetic covaraites, without first collapsing the expanded residuals for each observation. To collapse, one can use the average residual per observation, weighted according to the posterior probabilities. The appropriate weight can be computed as wt = fit$weight.expanded * fit$haplo.post.info$post. Then, the weighted average can be calculated as tapply(fit$residuals * wt, fit$haplo.post.info$indx, sum).
coefficientsthe coefficients of the linear.predictors, which multiply the columns of the model matrix. The names of the coefficients are the names of the columns of the model matrix. For haplotype coefficients, the names are the concatentation of name of the geno matrix with a haplotype number. The haplotype number corresponds to the index of the haplotype. The default print will show the coefficients with haplotype number, along with the alleles that define the haplotype, and the estimated haplotype frequency. If the model is over-determined there will be missing values in the coefficients corresponding to inestimable coefficients.
residuals[expanded] residuals from the final weighted least squares fit; also known as working residuals, these are typically not interpretable without rescaling by the weights (see glm.object and residuals.haplo.glm).
fitted.values[expanded] fitted mean values, obtained by transforming linear.predictors using the inverse link function (see glm.object).
effects[expaded] orthogonal, single-degree-of-freedom effects (see lm.object).
Rthe triangular factor of the decomposition (see lm.object).
rankthe computed rank (number of linearly independent columns in the model matrix), which is the model degrees of freedom - see lm.object.
assignthe list of assignments of coefficients (and effects) to the terms in the model (see lm.object).
df.residual[expanded] number of degrees of freedom for residuals, corresponding to the expanded data.
prior.weights[expanded] input weights after expanding according to the number of pairs of haplotypes consistent with an observation's marker genotype data.
familya 3 element character vector giving the name of the family, the link and the variance function; mainly for printing purposes.
linear.predictors[expanded] linear fit, given by the product of the model matrix and the coefficients. In a glm, eta.
devianceup to a constant, minus twice the maximized log-likelihood. Similar to the residual sum of squares.
null.deviancethe deviance corresponding to the model with no predictors.
callan image of the call that produced the object, but with the arguments all named and with the actual formula included as the formula argument.
iterthe number of IRLS iterations used to compute the estimates, for the last step of the EM fit of coefficients.
yexpanded response.
contrastsa list containing sufficient information to construct the contrasts used to fit any factors occurring in the model (see lm.object).
lnlikelog-likelihood of the fitted model.
lnlike.nulllog-likelihood of the null model (intercept-only).
lrtlikelihood ratio test statistic to test whether all coefficients (excepet intercept) are zero: 2*(lnlike - lnlike.null)
termsan object of mode expression and class term summarizing the formula, but not complete for the final model. Because this does not represent expansion of the design matrix for the haplotypes, it is typically not of direct relevance to users.
controllist of all control parameters
haplo.uniquethe data.frame of unique haplotypes
haplo.basethe index of the haplotype used as the base-line for the regression model. To see the actual haplotype definition, use the following: fit$haplo.unique[fit$haplo.base,], where fit is the saved haplo.glm object (e.g., fit <- haplo.glm(y ~ geno, ...) ).
haplo.freqthe final estimates of haplotype frequencies, after completing EM steps of updating haplotype frequencies and regression coefficients. The length of haplo.freq is the number of rows of haplo.unique, and the order of haplo.freq is the same as that for the rows of haplo.unique. So, the frequencies of the unique haplotypes can be viewed as cbind(fit$haplo.unique, fit$haplo.freq).
haplo.freq.initthe initial estimates of haplotype frequencies, based on the EM algorithm for estimating haplotype frequencies, ingnoring the trait. These can be compared with haplo.freq, to see the impact of using the regression model to update the haplotype frequencies.
converge.emT/F whether the EM-glm steps converged
haplo.commonthe indices of the haplotypes determined to be "common" enough to estimate their corresponding regression coefficients.
haplo.rarethe indices of all the haplotypes determined to be too rare to estimate their specific regression coefficients.
haplo.rare.termT/F whether the "rare" term is included in the haplotype regression model.
haplo.namesthe names of the coefficients that represent haplotype effects.
haplo.post.infoa data.frame of information regarding the posterior probabilites. The columns of this data.frame are: indx (the index of the input obsevation; if the ith observation is repeated m times, then indx will show m replicates of i; hence, indx will correspond to the "expanded" observations); hap1 and hap2 (the indices of the haplotypes; if hap1=j and hap2=k, then the two haplotypes in terms of alleles are fit$haplo.unique[j,] and fit$haplo.unique[k,]); post.init (the initial posterior probability, based on haplo.freq.init); post (the final posterior probability, based on haplo.freq).
xthe model matrix, with [expanded] rows, if x=T.
infothe observed information matrix, based on Louis' formula. The upper left submatrix is for the regression coefficient, the lower right submatrix for the haplotype frequencies, and the remaining is the information between regression coefficients and haplotype frequencies.
var.matthe variance-covariance matrix of regression coefficients and haplotype frequencies, based on the inverse of info. Upper left submatrix is for regression coefficients, lower right submatrix for haplotype frequencies.
haplo.elimthe indices of the haplotypes eliminated from the info and var.mat matrices because their frequencies are less than haplo.min.info (the minimum haplotype frequency required for computation of the information matrix - see haplo.glm.control)
missinga matrix of logical values, indicating whether rows of data were removed for missing values in either genotype matrix (genomiss) or any other variables (yxmiss), such as y, other covariates, or weights.
rank.inforank of information (info) matrix.

References

Lake S, Lyon H, Silverman E, Weiss S, Laird N, Schaid D (2002) Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Human Heredity 55:56-65.

Details

To properly prepare the data frame, the genotype matrix must be processed by setupGeno, and then included in the data frame with the response and other variables.

For binomial family, the initialization of values gives warnings if non-integer number of successes, which is a concern in these models because of the weights of posterior probability of each haplotype pair per subject. We supress the warnings by defining a haplo.binomial family, which we use if family=binomial is used.

Examples

Run this code

cat("FOR REGULAR USAGE, DO NOT DISCARD GENOTYPES WITH MISSING VALUES
")
cat("WE ONLY SUBSET BY keep HERE SO THE EXAMPLES RUN FASTER
")

 data(hla.demo)
 geno <- as.matrix(hla.demo[,c(17,18,21:24)])
 keep <- !apply(is.na(geno) | geno==0, 1, any) # SKIP THESE THREE LINES
 hla.demo <- hla.demo[keep,]                   # IN AN ANALYSIS
 geno <- geno[keep,]                           # 
 attach(hla.demo)
 label <-c("DQB","DRB","B")
 y <- hla.demo$resp
 y.bin <- 1*(hla.demo$resp.cat=="low")

# set up a genotype array as a model.matrix for inserting into data frame
# Note that hla.demo is a data.frame, and we need to subset to columns
# of interest. Also also need to convert to a matrix object, so that
# setupGeno can code alleles and convert geno to 'model.matrix' class.

 geno <- setupGeno(geno, miss.val=c(0,NA))

  # geno now has an attribute 'unique.alleles' which must be passed to
  # haplo.glm as allele.lev=attributes(geno)$unique.alleles, see below

 my.data <- data.frame(geno=geno, age=hla.demo$age, male=hla.demo$male,
                      y=y, y.bin=y.bin)

 fit.gaus <- haplo.glm(y ~ male + geno, family = gaussian,  na.action=
               "na.geno.keep",allele.lev=attributes(geno)$unique.alleles, 
               data=my.data, locus.label=label,
               control = haplo.glm.control(haplo.freq.min=0.02))
 fit.gaus

Run the code above in your browser using DataLab