Learn R Programming

epiGWAS (version 1.0.2)

stabilityGLM: Computes the area under the stability path for all covariates

Description

To perform model selection, this function scores all covariates in X according to the area under their stability selection paths. Our model selection procedure starts by dynamically defining a grid for the elastic net penalization parameter \(\lambda\). To define the grid, we solve the full-dataset elastic net. This yields n_lambda log-scaled values between \(\lambda_{max}\) and \(\lambda_{min}\). \(\lambda_{max}\) is the maximum value for which the elastic net support is not empty. On the other hand, \(\lambda_{min}\) can be derived through lambda_min_ratio, which is the ratio of \(\lambda_{min}\) to \(\lambda_{max}\). The next step is identical to the original stability selection procedure. For each value of \(\lambda\), we solve n_subsample times the same elastic net, though for a different subsample. The subsample is a random selection of half of the samples of the original dataset. The empirical frequency of each covariate entering the support is then the number of times the covariate is selected in the support as a fraction of n_subsample. We obtain the stability path by associating each value of \(\lambda\) with the corresponding empirical frequency. The final scores are the areas under the stability path curves. This is a key difference with the original stability selection procedure where the final score is the maximum empirical frequency. On simulations, our scoring technique outperformed maximum empirical frequencies.

Usage

stabilityGLM(X, Y, weights = rep(1, nrow(X)), family = "gaussian",
  n_subsample = 20, n_lambda = 100, short = TRUE,
  lambda_min_ratio = 0.01, eps = 1e-05)

Arguments

X

input design matrix

Y

response vector

weights

nonnegative sample weights

family

response type. Either 'gaussian' or 'binomial'

n_subsample

number of subsamples for stability selection

n_lambda

total number of lambda values

short

whether to compute the aucs only on the first half of the stability path. We observed better performance for thresholded paths

lambda_min_ratio

ratio of \(\lambda_{min}\) to \(\lambda_{max}\) (see description for a thorough explanation)

eps

elastic net mixing parameter.

Value

a vector containing the areas under the stability path curves

Details

For a fixed \(\lambda\), the L2 penalization is \(\lambda \times eps\), while the L1 penalization is \(\lambda \times (1-eps)\). The goal of the L2 penalization is to ensure the uniqueness of the solution. For that reason, we recommend setting eps << 1.

References

Slim, L., Chatelain, C., Azencott, C.-A., & Vert, J.-P. (2018). Novel Methods for Epistasis Detection in Genome-Wide Association Studies. BioRxiv.

Meinshausen, N., & B<U+00FC>hlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417<U+2013>473.

Haury, A. C., Mordelet, F., Vera-Licona, P., & Vert, J. P. (2012). TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. BMC Systems Biology, 6.

See Also

glmnet-package

Other support estimation functions: stabilityBIG

Examples

Run this code
# NOT RUN {
# ---- Continuous data ----
n <- 50
p <- 20
X <- matrix(rnorm(n * p), ncol = p)
Y <- crossprod(t(X), rnorm(p))
aucs_cont <- stabilityGLM(X, Y,
  family = "gaussian", n_subsample = 1,
  short = FALSE
)

# ---- Binary data ----
X <- matrix(rnorm(n * p), ncol = p)
Y <- runif(n, min = 0, max = 1) < 1 / (1 + exp(-X[, c(1, 7, 15)] %*% rnorm(3)))
weights <- runif(n, min = 0.4, max = 0.8)
aucs_binary <- stabilityGLM(X, Y,
  weights = weights,
  n_lambda = 50, lambda_min_ratio = 0.05, n_subsample = 1
)
# }

Run the code above in your browser using DataLab