isee: Interaction stagewise estimating equations

Description

Perform model selection with clustered data while considering interaction terms using one of two stagewise methods. The first (ACTS) uses an active set approach in which interaction terms are only considered for a given update if the corresponding main effects have already been added to the model. The second approach (HiLa) approximates the regularized path for hierarchical lasso with Generalized Estimating Equations. In this second approach, the model hierarchy is guaranteed in each individual step, thus ensuring the desired hierarchy throughout the path.

Usage

isee(y, ...)
# S3 method for formula
isee(formula, data = list(), clusterID, waves = NULL,
  interactionID = NULL, contrasts = NULL, subset, method = "ACTS", ...)
# S3 method for default
isee(y, x, waves = NULL, interactionID, method = "ACTS",
  ...)
acts.fit(y, x, interactionID, family, clusterID, waves = NULL,
  corstr = "independence", alpha = NULL, intercept = TRUE, offset = 0,
  control = sgee.control(maxIt = 200, epsilon = 0.05, stoppingThreshold =
  min(length(y), ncol(x)) - intercept, undoThreshold = 0), standardize = TRUE,
  verbose = FALSE, ...)
hila.fit(y, x, interactionID, family, clusterID, waves = NULL,
  corstr = "independence", alpha = NULL, intercept = TRUE, offset = 0,
  control = sgee.control(maxIt = 200, epsilon = 0.05, stoppingThreshold =
  min(length(y), ncol(x)) - intercept, undoThreshold = 0.005),
  standardize = TRUE, verbose = FALSE, ...)

Arguments

Vector of response measures that corresponds with modeling family given in 'family' parameter. 'y' is assumed to be the same length as 'clusterID' and is assumed to be organized into clusters as dictated by 'clusterID'.

...

Not currently used

formula

Object of class 'formula'; a symbolic description of the model to be fitted

data

Optional data frame containing the variables in the model.

clusterID

Vector of integers that identifies the clusters of response measures in 'y'. Data and 'clusterID' are assumed to 1) be of equal lengths, 2) sorted so that observations of a cluster are in contiguous rows, and 3) organized so that 'clusterID' is a vector of consecutive integers.

waves

An integer vector which identifies components in clusters. The length of waves should be the same as the number of observations. waves is automatically generated if none is supplied, but when using subset parameter, the waves parameter must be provided by the user for proper calculation.

interactionID

A (p^2+p)/2 x 2 matrix of interaction IDs. Main effects have the same (unique) number in both columns for their corresponding row. Interaction effects have each of their corresponding main effects in the two columns. it is assumed that main effects are listed first. It is assumed that the main effect IDs used start at 1 and go up tp the number of main effects, p.

contrasts

An optional list provided when using a formula. similar to contrasts from glm. See the contrasts.arg of model.matrix.default.

subset

An optional vector specifying a subset of observations to be used in the fitting process.

method

A character string indicating desired method to be used to perform interaction selection. Value can either be "ACTS", where an active set approach is taken and interaction terms are considered for selection only after main effects are brought in, or "HiLa", where the hierarchical lasso penalty is used to ensure hierarchy is maintained in each step. Default Value is "ACTS".

Design matrix of dimension length(y) x nvars where each row is represents an obersvation of predictor variables. Assumed to be scaled.

family

Modeling family that describes the marginal distribution of the response. Assumed to be an object such as 'gaussian()' or 'poisson()'

corstr

A character string indicating the desired working correlation structure. The following are implemented : "independence" (default value), "exchangeable", and "ar1".

alpha

An intial guess for the correlation parameter value between -1 and 1 . If left NULL (the default), the initial estimate is 0.

intercept

Binary value indicating where an intercept term is to be included in the model for estimation. Default is to include an intercept.

offset

Vector of offset value(s) for the linear predictor. 'offset' is assumed to be either of length one, or of the same length as 'y'. Default is to have no offset.

control

A list of parameters used to contorl the path generation process; see sgee.control.

standardize

A logical parameter that indicates whether or not the covariates need to be standardized before fitting (but after generating interaction terms from main covariates). If standardized before fitting, the unstandardized path is returned as the default, with a standardizedPath and standardizedX included separately. Default value is TRUE.

verbose

Logical parameter indicating whether output should be produced while isee is running. Default value is FALSE.

Value

Object of class 'sgee' containing the path of coefficient estimates, the path of scale estimates, the path of correlation parameter estimates, and the iteration at which iSEE terminated, and initial regression values including x, y, codefamily, clusterID, interactionID, offset, epsilon, and numIt.

References

Vaughan, G., Aseltine, R., Chen, K., Yan, J., (2017). Efficient interaction selection for clustered data via stagewise generalized estimating equations. Department of Statistics, University of Connecticut. Technical Report.

Zhu, R., Zhao, H., and Ma, S. (2014). Identifying gene-environment and gene-gene interactions using a progressive penalization approach. Genetic Epidemiology 38, 353--368.

Bien, J., Taylor, J., and Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics 41, 1111--1141.

Examples

Run this code

# NOT RUN {
#####################
## Generate test data
#####################

## Initialize covariate values
p <- 5 
beta <- c(1, 0, 1.5, 0, .5, ## Main effects
          rep(0.5,4), ## Interaction terms
          0.5, 0, 0.5,
          0,1,
          0)


generatedData <- genData(numClusters = 50,
                         clusterSize = 4,
                         clusterRho = 0.6,
                         clusterCorstr = "exchangeable",
                         yVariance = 1,
                         xVariance = 1,
                         beta = beta,
                         numMainEffects = p,
                         family = gaussian(),
                         intercept = 1)

 
## Perform Fitting by providing formula and data
genDF <- data.frame(Y = generatedData$y, X = generatedData$xMainEff)

## Using "ACTS" method
coefMat1 <- isee(formula(paste0("Y~(",
                               paste0("X.", 1:p, collapse = "+"),
                                 ")^2")),
                  data = genDF,
                  family = gaussian(),
                  clusterID = generatedData$clusterID,
                  corstr = "exchangeable",
                  method = "ACTS",
                  control = sgee.control(maxIt = 50, epsilon = 0.5))

## Using "HiLa" method
coefMat2 <- isee(formula(paste0("Y~(",
                               paste0("X.", 1:p, collapse = "+"),
                                 ")^2")),
                  data = genDF,
                  family = gaussian(),
                  clusterID = generatedData$clusterID,
                  corstr = "exchangeable",
                  method = "HiLa",
                  control = sgee.control(maxIt = 50, epsilon = 0.5))

# }

Run the code above in your browser using DataLab