gtReg: Fitting group testing regression models

Description

Fits the group testing regression model specified through a symbolic description of the linear predictor and descriptions of the group testing setting. This function allows for fitting regression models with simple pooling, halving, or array testing data.

Usage

gtReg(
  type = "sp",
  formula,
  data,
  groupn = NULL,
  subg = NULL,
  coln = NULL,
  rown = NULL,
  arrayn = NULL,
  retest = NULL,
  sens = 1,
  spec = 1,
  linkf = c("logit", "probit", "cloglog"),
  method = c("Vansteelandt", "Xie"),
  sens.ind = NULL,
  spec.ind = NULL,
  start = NULL,
  control = gtRegControl(...),
  ...
)

Arguments

type

"sp" for simple pooling, "halving" for halving protocol, or "array" for array testing. See 'Details' for descriptions of the group testing algorithms.

formula

an object of class "formula" (or one that can be coerced to that class); a symbolic description of the model to be fitted. The details of model specification are under 'Details'.

data

an optional data frame, list, or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which gtReg is called.

groupn

a vector, list, or data frame of the group numbers that designates individuals to groups (for use with simple pooling, type = "sp", or the halving protocol, type = "halving").

subg

a vector, list, or data frame of the group numbers that designates individuals to subgroups (for use with the halving protocol, type = "halving").

coln

a vector, list, or data frame that specifies the column group number for each sample (for use with array testing, type = "array").

rown

a vector, list, or data frame that specifies the row group number for each sample (for use with array testing, type = "array").

arrayn

a vector, list, or data frame that specifies the array number for each sample (for use with array testing, type = "array").

retest

a vector, list, or data frame of individual retest results. Default value is NULL for no retests. See 'Details' for details on how to specify retest.

sens

sensitivity of the test. Default value is set to 1.

spec

specificity of the test. Default value is set to 1.

linkf

a character string specifying one of the three link functions for a binomial model: "logit" (default), "probit", or "cloglog".

method

the method to fit the regression model. Options include "Vansteelandt" (default) or "Xie". The "Vansteelandt" option finds estimates by directly maximizing the likelihood function based on the group responses, while the "Xie" option uses the EM algorithm to maximize the likelihood function in terms of the unobserved individual responses.

sens.ind

sensitivity of the individual retests. If NULL, set to be equal to sens.

spec.ind

specificity of the individual retests. If NULL, set to be equal to spec.

start

starting values for the parameters in the linear predictor.

control

a list of parameters for controlling the fitting process in method "Xie". These parameters will be passed to the gtRegControl function for use.

...

arguments to be passed to gtRegControl by default. See argument control.

Value

An object of class "gtReg", a list which may include:

coefficients

a named vector of coefficients.

hessian

estimated Hessian matrix of the negative log-likelihood function. This serves as an estimate of the information matrix.

residuals

the response residuals. This is the difference of the observed group responses and the fitted group responses. Not included for array testing.

fitted.values

the fitted mean values of group responses. Not included for array testing.

deviance

the deviance between the fitted model and the saturated model. Not included for array testing.

aic

Akaike's Information Criterion. This is minus twice the maximized log-likelihood plus twice the number of coefficients. Not included for array testing.

null.deviance

the deviance for the null model, comparable with deviance. The null model will include only the intercept, if there is one in the model. Provided for simple pooling, type = "sp", only.

counts

the number of iterations in optim (Vansteelandt's method) or the number of iterations in the EM algorithm (Xie's method, halving, and array testing).

Gibbs.sample.size

the number of Gibbs samples generated in each E step. Provided for array testing, type = "array", only.

df.residual

the residual degrees of freedom. Provided for simple pooling, type = "sp", only.

df.null

the residual degrees of freedom for the null model. Provided for simple pooling, type = "sp", only.

the vector of group responses. Not included for array testing.

call

the matched call.

formula

the formula supplied.

terms

the terms object used.

method

the method ("Vansteelandt" or "Xie") used to fit the model. For the halving protocol, the "Xie" method is used. Not included for array testing.

link

the link function used in the model.

Details

With simple pooling and halving, a typical predictor has the form groupresp ~ covariates where groupresp is the (numeric) group response vector. With array testing, individual samples are placed in a matrix-like grid where samples are pooled within each row and within each column. This leads to two kinds of group responses: row and column group responses. Thus, a typical predictor has the form cbind(col.resp, row.resp) ~ covariates, where col.resp is the (numeric) column group response vector and row.resp is the (numeric) row group response vector. For all methods, covariates is a series of terms which specifies a linear predictor for individual responses. Note that it is actually the unobserved individual responses, not the observed group responses, which are modeled by the covariates. When denoting group responses (groupresp, col.resp, and row.resp), a 0 denotes a negative response and a 1 denotes a positive response, where the probability of an individual positive response is being modeled directly.

A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order, and so on; to avoid this, pass a terms object as the formula.

For simple pooling (type = "sp"), the functions gtreg.fit, EM, and EM.ret, where the first corresponds to Vansteelandt's method described in Vansteelandt et al. (2000) and the last two correspond to Xie's method described in Xie (2001), are called to carry out the model fitting. The gtreg.fit function uses the optim function with default method "Nelder-Mead" to maximize the likelihood function of the observed group responses. If this optimization method produces a Hessian matrix of all zero elements, the "SANN" method in optim is employed to find the coefficients and Hessian matrix. For the "SANN" method, the number of iterations in optim is set to be 10000. For the background on the use of optim, see help(optim).

The EM and EM.ret functions apply Xie's EM algorithm to the likelihood function written in terms of the unobserved individual responses; the functions use glm.fit to update the parameter estimates within each M step. The EM function is used when there are no retests and EM.ret is used when individual retests are available. Thus, within the retest argument, individual observations in observed positive groups are 0 (negative) or 1 (positive); the remaining individual observations are NAs, meaning that no retest is performed for them. Retests cannot be used with Vansteelandt's method; a warning message will be given in this case, and the individual retests will be ignored in the model fitting. There could be slight differences in the estimates between Vansteelandt's and Xie's methods (when retests are not available) due to different convergence criteria.

With simple pooling (i.e., Dorfman testing, two-stage hierarchical testing), each individual appears in exactly one pool. When only the group responses are observed, the null degrees of freedom are the number of groups minus 1 and the residual degrees of freedom are the number of groups minus the number of parameters. When individual retests are observed too, it is an open research question for what the degrees of freedom and the deviance for the null model should be; therefore, the degrees of freedom and null.deviance will not be displayed.

Under the halving protocol, the EM.halving function applies Xie's EM algorithm to the likelihood function written in terms of the unobserved individual responses; the functions use glm.fit to update the parameter estimates within each M step. In the halving protocol, if the initial group tests positive, it is split into two subgroups. The two subgroups are subsequently tested and if either subgroup tests positive, the third and final step is to test all individuals within the subgroup. Thus, within subg, subgroup responses in observed positive groups are 0 (negative) or 1 (positive); the remaining subgroup responses are NAs, meaning that no tests are performed for them. The individual retests are similarly coded.

With array testing (also known as matrix pooling), the EM.mp function applies Xie's EM algorithm to the likelihood function written in terms of the unobserved individual responses. In each E step, the Gibbs sampling technique is used to estimate the conditional probabilities. Because of the large number of Gibbs samples needed to achieve convergence, the model fitting process could be quite slow, especially when multiple positive rows and columns are observed. In this case, we can either increase the Gibbs sample size to help achieve convergence or loosen the convergence criteria by increasing tol at the expense of perhaps poorer estimates. If follow-up retests are performed, the retest results going into the model will help achieve convergence faster with the same Gibbs sample size and convergence criteria. In each M step, we use glm.fit to update the parameter estimates.

For simple pooling, retest provides individual retest results for Dorfman's retesting procedure. Under the halving protocol, retest provides individual retest results within a subgroup that tests positive. The retest argument provides individual retest results, where a 0 denotes negative and 1 denotes positive status. A NA denotes that no retest is performed for that individual. The default value is NULL for no retests.

For simple pooling, control provides parameters for controlling the fitting process in the "Xie" method only.

gtReg returns an object of class "gtReg". The function summary (i.e., summary.gtReg is used to obtain or print a summary of the results. The group testing function predict (i.e., predict.gtReg) is used to make predictions on "gtReg" objects.

References

Vansteelandt2000binGroup2

Xie2001binGroup2

Examples

Run this code

# NOT RUN {
# Estimated running time for all examples was calculated 
#   using a computer with 16 GB of RAM and one core of 
#   an Intel i7-6500U processor. Please take this into 
#   account when interpreting the run times given.

data(hivsurv)
fit1 <- gtReg(type = "sp", formula  =  groupres ~ AGE + EDUC., 
              data  =  hivsurv, groupn  =  gnum, sens  =  0.9, 
              spec  =  0.9, method  =  "Xie")
fit1

set.seed(46)
gt.data <- gtSim(type = "sp", par = c(-12, 0.2), 
                 size1 = 700, size2 = 5)
fit2 <- gtReg(type = "sp", formula = gres ~ x, data = gt.data, 
              groupn = groupn)
fit2

set.seed(21)
gt.data <- gtSim(type = "sp", par = c(-12, 0.2), 
                 size1 = 700, size2 = 6, sens = 0.95, spec = 0.95, 
                 sens.ind = 0.98, spec.ind = 0.98)
fit3 <- gtReg(type = "sp", formula = gres ~ x, data = gt.data, 
              groupn = groupn, retest = retest, method = "Xie", 
              sens = 0.95, spec = 0.95, sens.ind = 0.98, 
              spec.ind = 0.98, trace = TRUE)
summary(fit3)

set.seed(46)
gt.data <- gtSim(type = "halving", par = c(-6, 0.1), gshape = 17, 
                 gscale = 1.4, size1 = 5000, size2 = 5, 
                 sens = 0.95, spec = 0.95)
fit4 <- gtReg(type = "halving", formula = gres ~ x, 
              data = gt.data, groupn = groupn, subg = subgroup, 
              retest = retest, sens = 0.95, spec = 0.95, 
              start = c(-6, 0.1), trace = TRUE)
summary(fit4)

# This example takes approximately 15 seconds to run.
# 5x6 and 4x5 array
set.seed(9128)
sa1a <- gtSim(type = "array", par = c(-7, 0.1), size1 = c(5, 4), 
              size2 = c(6, 5), sens = 0.95, spec = 0.95)
sa1 <- sa1a$dframe
# }
# NOT RUN {
fit5 <- gtReg(type = "array", 
              formula = cbind(col.resp, row.resp) ~ x, 
              data = sa1, coln = coln, rown = rown, 
              arrayn = arrayn, sens = 0.95, spec = 0.95, 
              tol = 0.005, n.gibbs = 2000, trace = TRUE)
fit5
summary(fit5)
# }
# NOT RUN {
# The example below shows how long this fitting process 
#   may take. It takes approximately 1.5 minutes to achieve 
#   convergence.
set.seed(9012)
sa2a <- gtSim(type = "array", par = c(-7, 0.1), 
              size1 = rep(10, 4), size2 = rep(10, 4), 
              sens = 0.95, spec = 0.95)
sa2 <- sa2a$dframe
# }
# NOT RUN {
fit6 <- gtReg(type = "array", 
              formula = cbind(col.resp, row.resp) ~ x, 
              data = sa2, coln = coln, rown = rown, 
              arrayn = arrayn, retest = retest, 
              sens = 0.95, spec = 0.95, 
              start = c(-7, 0.1), tol = 0.005)
fit6
summary(fit6)
# }

Run the code above in your browser using DataLab