
vimp (version 2.1.0)

cv_vim: Nonparametric Variable Importance Estimates and Inference using Cross-fitting

Description

Compute estimates and confidence intervals for the nonparametric variable importance parameter of interest, using cross-fitting. This essentially involves splitting the data into V train/test splits; training the learners on the training data; evaluating importance on the test data; and averaging over these splits.

Usage

cv_vim(
  Y,
  X,
  f1,
  f2,
  indx = 1,
  V = length(unique(folds)),
  folds = NULL,
  stratified = FALSE,
  weights = rep(1, length(Y)),
  type = "r_squared",
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  scale = "identity",
  na.rm = FALSE,
  ...
)

Arguments

Y

the outcome.

X

the covariates.

f1

the predicted values on validation data from a flexible estimation technique regressing Y on X in the training data; a list of length V, where each object is a set of predictions on the validation data.

f2

the predicted values on validation data from a flexible estimation technique regressing the fitted values in f1 on X withholding the columns in indx; a list of length V, where each object is a set of predictions on the validation data.

indx

the indices of the covariate(s) to calculate variable importance for; defaults to 1.

V

the number of folds for cross-validation, defaults to 10.

folds

the folds to use, if f1 and f2 are supplied. A list of length two; the first element provides the outer folds (for hypothesis testing), while the second element is a list providing the inner folds (for cross-validation).
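
For example, with two outer folds and V inner folds within each outer fold, the folds argument could be built as in the following minimal sketch (the same construction appears in the by-hand example under Examples below; n denotes the sample size):

## outer folds split the data in half for hypothesis testing
outer_folds <- sample(rep(seq_len(2), length = n))
## inner folds are used for cross-validation within each half
inner_folds_1 <- sample(rep(seq_len(V), length = sum(outer_folds == 1)))
inner_folds_2 <- sample(rep(seq_len(V), length = sum(outer_folds == 2)))
folds <- list(outer_folds = outer_folds,
              inner_folds = list(inner_folds_1, inner_folds_2))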

stratified

if run_regression = TRUE, whether the generated folds should be stratified based on the outcome (this helps to ensure class balance across cross-validation folds)

weights

weights for the computed influence curve (e.g., inverse probability weights for coarsened-at-random settings)

type

the type of parameter (e.g., ANOVA-based is "anova").

run_regression

if outcome Y and covariates X are passed to cv_vim, and run_regression is TRUE, then Super Learner will be used; otherwise, variable importance will be computed using the inputted fitted values.

SL.library

a character vector of learners to pass to SuperLearner, used when Y and X (rather than fitted values f1 and f2) are supplied. Defaults to SL.glmnet, SL.xgboost, and SL.mean.

alpha

the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.

delta

the value of the \(\delta\)-null (i.e., testing if importance < \(\delta\)); defaults to 0.

scale

should CIs be computed on original ("identity") or logit ("logit") scale?

na.rm

should we remove NA's in the outcome and fitted values in computation? (defaults to FALSE)

...

other arguments to the estimation tool; see "See Also".

Value

An object of class vim. See Details for more information.

Details

We define the population variable importance measure (VIM) for the group of features (or single feature) \(s\) with respect to the predictiveness measure \(V\) by $$\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),$$ where \(f_0\) is the population predictiveness-maximizing function, \(f_{0,s}\) is the population predictiveness-maximizing function that is only allowed to access the features with index not in \(s\), and \(P_0\) is the true data-generating distribution. Cross-fitted VIM estimates are obtained by first splitting the data into \(K\) folds; then, using each fold in turn as a hold-out set, constructing estimators \(f_{n,k}\) and \(f_{n,k,s}\) of \(f_0\) and \(f_{0,s}\), respectively, on the training data and an estimator \(P_{n,k}\) of \(P_0\) using the test data; and finally, computing $$\psi_{n,s} := K^{-1}\sum_{k=1}^K \{V(f_{n,k}, P_{n,k}) - V(f_{n,k,s}, P_{n,k})\}.$$ See the paper by Williamson, Gilbert, Simon, and Carone for more details on the mathematics behind the cv_vim function and the validity of the confidence intervals.
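
To fix ideas, the display above is a simple average of fold-specific predictiveness differences. The following minimal sketch illustrates this for type = "r_squared"; here pred_full, pred_reduced, and test_y are hypothetical lists holding, for each of the K folds, the full-regression predictions, the reduced-regression predictions, and the outcomes on the held-out data. (This shows only the plug-in average; cv_vim additionally applies the influence curve-based updates returned in updates and the sample-splitting scheme used for hypothesis testing.)

## fold-specific R-squared: one minus MSE divided by the outcome variance
r2 <- function(pred, y) 1 - mean((y - pred)^2) / mean((y - mean(y))^2)
## per-fold difference in R-squared between the full and reduced fits
fold_diffs <- vapply(seq_len(K), function(k) {
  r2(pred_full[[k]], test_y[[k]]) - r2(pred_reduced[[k]], test_y[[k]])
}, numeric(1))
## cross-fitted point estimate: the average over the K folds
psi_n <- mean(fold_diffs)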

In the interest of transparency, we return most of the calculations within the vim object. This results in a list containing:

  • call - the call to cv_vim

  • s - the column(s) to calculate variable importance for

  • SL.library - the library of learners passed to SuperLearner

  • full_fit - the fitted values of the chosen method fit to the full data (a list, for train and test data)

  • red_fit - the fitted values of the chosen method fit to the reduced data (a list, for train and test data)

  • est - the estimated variable importance

  • naive - the naive estimator of variable importance

  • naives - the naive estimator on each fold

  • updates - the influence curve-based update for each fold

  • se - the standard error for the estimated variable importance

  • ci - the \((1-\alpha) \times 100\)% confidence interval for the variable importance estimate

  • full_mod - the object returned by the estimation procedure for the full data regression (if applicable)

  • red_mod - the object returned by the estimation procedure for the reduced data regression (if applicable)

  • alpha - the level, for confidence interval calculation

  • folds - the folds used for hypothesis testing and cross-validation

  • y - the outcome

  • weights - the weights

  • mat - a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value
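
For example, given an object est returned by cv_vim, the most commonly used components can be pulled out directly (names as listed above):

est$est  ## the point estimate of variable importance
est$se   ## its standard error
est$ci   ## the (1 - alpha) x 100% confidence interval
est$mat  ## estimate, SE, CI, test decision, and p-value in one tibble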

See Also

SuperLearner for specific usage of the SuperLearner function and package.

Examples

library(SuperLearner)
library(ranger)
n <- 100
p <- 2
## generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))

## apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2

## generate Y ~ Normal (smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))

## set up a library for SuperLearner
learners <- c("SL.mean", "SL.ranger")

## -----------------------------------------
## using Super Learner (with a small number of folds, for illustration only)
## -----------------------------------------
set.seed(4747)
est <- cv_vim(Y = y, X = x, indx = 2, V = 2,
              type = "r_squared", run_regression = TRUE,
              SL.library = learners, cvControl = list(V = 2), alpha = 0.05)

## ------------------------------------------
## doing things by hand, and plugging them in (with a small number of folds, for illustration only)
## ------------------------------------------
## set up the folds
indx <- 2
V <- 2
set.seed(4747)
outer_folds <- sample(rep(seq_len(2), length = n))
inner_folds_1 <- sample(rep(seq_len(V), length = sum(outer_folds == 1)))
inner_folds_2 <- sample(rep(seq_len(V), length = sum(outer_folds == 2)))
y_1 <- y[outer_folds == 1, , drop = FALSE]
x_1 <- x[outer_folds == 1, , drop = FALSE]
y_2 <- y[outer_folds == 2, , drop = FALSE]
x_2 <- x[outer_folds == 2, , drop = FALSE]
## get the fitted values by fitting the super learner on each pair
fhat_ful <- list()
fhat_red <- list()
for (v in 1:V) {
    ## fit super learner
    fit <- SuperLearner::SuperLearner(Y = y_1[inner_folds_1 != v, , drop = FALSE],
     X = x_1[inner_folds_1 != v, , drop = FALSE], 
     SL.library = learners, cvControl = list(V = V))
    ## in-sample fitted values from the full regression (not used further here)
    fitted_v <- SuperLearner::predict.SuperLearner(fit)$pred
    ## get predictions on the validation fold
    fhat_ful[[v]] <- SuperLearner::predict.SuperLearner(fit,
     newdata = x_1[inner_folds_1 == v, , drop = FALSE])$pred
    ## fit the super learner on the reduced covariates
    red <- SuperLearner::SuperLearner(Y = y_2[inner_folds_2 != v, , drop = FALSE],
     X = x_2[inner_folds_2 != v, -indx, drop = FALSE], 
     SL.library = learners, cvControl = list(V = V))
    ## get predictions on the validation fold
    fhat_red[[v]] <- SuperLearner::predict.SuperLearner(red,
     newdata = x_2[inner_folds_2 == v, -indx, drop = FALSE])$pred
}
est <- cv_vim(Y = y, f1 = fhat_ful, f2 = fhat_red, indx = 2,
              V = V, folds = list(outer_folds = outer_folds,
                                  inner_folds = list(inner_folds_1, inner_folds_2)),
              type = "r_squared", run_regression = FALSE, alpha = 0.05)

