
vimp (version 2.1.0)

vim: Nonparametric Variable Importance Estimates and Inference

Description

Compute estimates of and confidence intervals for nonparametric risk-based variable importance.

Usage

vim(
  Y,
  X,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  weights = rep(1, length(Y)),
  type = "r_squared",
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  scale = "identity",
  na.rm = FALSE,
  folds = NULL,
  stratified = FALSE,
  ...
)

Arguments

Y

the outcome.

X

the covariates.

f1

the fitted values from a flexible estimation technique regressing Y on X.

f2

the fitted values from a flexible estimation technique regressing Y on X withholding the columns in indx.

indx

the indices of the covariate(s) to calculate variable importance for; defaults to 1.

weights

weights for the computed influence curve (e.g., inverse probability weights for coarsened-at-random settings).

type

the type of importance to compute; defaults to r_squared, but other supported options are auc, accuracy, and anova.

run_regression

if outcome Y and covariates X are passed to vim, and run_regression is TRUE, then Super Learner will be used to estimate the regression functions; otherwise, variable importance will be computed using the supplied fitted values.

SL.library

a character vector of learners to pass to SuperLearner when run_regression = TRUE (i.e., when vim fits the regressions itself rather than using supplied f1 and f2). Defaults to SL.glmnet, SL.xgboost, and SL.mean.

alpha

the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.

delta

the value of the \(\delta\)-null (i.e., testing if importance < \(\delta\)); defaults to 0.

scale

should CIs be computed on original ("identity") or logit ("logit") scale?

na.rm

should NA values in the outcome and fitted values be removed before computation? Defaults to FALSE.

folds

the folds used for f1 and f2; assumed to be 1 for the observations used in f1 and 2 for the observations used in f2. If there is only a single fold passed in, then hypothesis testing is not done.

stratified

if run_regression = TRUE, should the generated folds be stratified based on the outcome? Stratification helps to ensure class balance across cross-validation folds.

...

other arguments to the estimation tool, see "See also".
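The class-stratified fold assignment described under stratified can be sketched as follows (an illustrative sketch; the package's internal code may differ):

```r
## Assign each observation to one of two folds, stratified by outcome class,
## so that both folds contain (nearly) equal numbers of each class.
set.seed(2)
y <- rbinom(50, size = 1, prob = 0.3)
folds <- vector("numeric", length(y))
folds[y == 1] <- sample(rep(seq_len(2), length.out = sum(y == 1)))
folds[y == 0] <- sample(rep(seq_len(2), length.out = sum(y == 0)))
table(y, folds)  # within each class, fold sizes differ by at most one
```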

Value

An object of class vim, with an additional class determined by the type of risk-based measure. See Details for more information.

Details

We define the population variable importance measure (VIM) for the group of features (or single feature) \(s\) with respect to the predictiveness measure \(V\) by $$\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),$$ where \(f_0\) is the population predictiveness maximizing function, \(f_{0,s}\) is the population predictiveness maximizing function that is only allowed to access the features with index not in \(s\), and \(P_0\) is the true data-generating distribution. VIM estimates are obtained by obtaining estimators \(f_n\) and \(f_{n,s}\) of \(f_0\) and \(f_{0,s}\), respectively; obtaining an estimator \(P_n\) of \(P_0\); and finally, setting \(\psi_{n,s} := V(f_n, P_n) - V(f_{n,s}, P_n)\).
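The plug-in construction above can be sketched with R-squared as the predictiveness measure \(V\). Here, r_squared() and the linear-model fits are illustrative stand-ins for \(f_n\) and \(f_{n,s}\) (vim itself uses flexible estimators such as Super Learner, plus an influence-curve correction):

```r
## Hypothetical plug-in VIM estimate with V = R-squared.
## f_n estimates f_0 (all features); f_ns estimates f_{0,s}, withholding x1.
r_squared <- function(fitted, y) {
  1 - mean((y - fitted)^2) / mean((y - mean(y))^2)
}
set.seed(1)
n <- 100
x1 <- runif(n); x2 <- runif(n)
y <- 1 + 2 * x1 + rnorm(n)
f_n  <- fitted(lm(y ~ x1 + x2))  # full regression
f_ns <- fitted(lm(y ~ x2))       # reduced regression (s = {1})
psi_n <- r_squared(f_n, y) - r_squared(f_ns, y)  # plug-in estimate of psi_{0,s}
```

For nested least-squares fits on the same data, the full model's training R-squared is never smaller than the reduced model's, so this plug-in estimate is nonnegative by construction.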

In the interest of transparency, we return most of the calculations within the vim object. This results in a list containing:

  • call - the call to vim

  • s - the column(s) to calculate variable importance for

  • SL.library - the library of learners passed to SuperLearner

  • type - the type of risk-based variable importance measured

  • full_fit - the fitted values of the chosen method fit to the full data

  • red_fit - the fitted values of the chosen method fit to the reduced data

  • est - the estimated variable importance

  • naive - the naive estimator of variable importance

  • update - the influence curve-based update

  • se - the standard error for the estimated variable importance

  • ci - the \((1-\alpha) \times 100\)% confidence interval for the variable importance estimate

  • test - a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test

  • pval - a conservative p-value based on the same conservative test as test

  • full_mod - the object returned by the estimation procedure for the full data regression (if applicable)

  • red_mod - the object returned by the estimation procedure for the reduced data regression (if applicable)

  • alpha - the level, for confidence interval calculation

  • folds - the folds used for hypothesis testing

  • y - the outcome

  • weights - the weights

  • mat - a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value

See Also

SuperLearner for specific usage of the SuperLearner function and package.

Examples

# NOT RUN {
library(SuperLearner)
library(ranger)
## generate the data
## generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -1, 1)))

## apply the function to the x's
f <- function(x) 0.5 + 0.3*x[1] + 0.2*x[2]
smooth <- apply(x, 1, function(z) f(z))

## generate Y ~ Bernoulli(smooth)
y <- matrix(rbinom(n, size = 1, prob = smooth))

## set up a library for SuperLearner
learners <- "SL.ranger"

## using Y and X; use class-balanced folds
folds_1 <- sample(rep(seq_len(2), length = sum(y == 1)))
folds_0 <- sample(rep(seq_len(2), length = sum(y == 0)))
folds <- vector("numeric", length(y))
folds[y == 1] <- folds_1
folds[y == 0] <- folds_0
est <- vim(y, x, indx = 2, type = "r_squared",
           alpha = 0.05, run_regression = TRUE,
           SL.library = learners, cvControl = list(V = 2),
           folds = folds)

## using pre-computed fitted values
full <- SuperLearner(Y = y[folds == 1], X = x[folds == 1, ],
                     SL.library = learners, cvControl = list(V = 2))
full.fit <- predict(full)$pred
reduced <- SuperLearner(Y = y[folds == 2], X = x[folds == 2, -2, drop = FALSE],
                        SL.library = learners, cvControl = list(V = 2))
red.fit <- predict(reduced)$pred

est <- vim(Y = y, f1 = full.fit, f2 = red.fit,
            indx = 2, run_regression = FALSE, alpha = 0.05, folds = folds,
            type = "accuracy")

# }
