vim: Nonparametric Intrinsic Variable Importance Estimates and Inference

Description

Compute estimates of, and confidence intervals for, nonparametric intrinsic variable importance, based on the population-level contrast between the oracle predictiveness with versus without the feature(s) of interest.

Usage

vim(
  Y = NULL,
  X = NULL,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  type = "r_squared",
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  scale = "identity",
  na.rm = FALSE,
  sample_splitting = TRUE,
  sample_splitting_folds = NULL,
  final_point_estimate = "split",
  stratified = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_scale = "identity",
  ipc_weights = rep(1, length(Y)),
  ipc_est_type = "aipw",
  scale_est = TRUE,
  nuisance_estimators_full = NULL,
  nuisance_estimators_reduced = NULL,
  exposure_name = NULL,
  bootstrap = FALSE,
  b = 1000,
  boot_interval_type = "perc",
  clustered = FALSE,
  cluster_id = rep(NA, length(Y)),
  ...
)

Value

An object of classes vim and the type of risk-based measure. See Details for more information.

Arguments

Y: the outcome.
X: the covariates. If type = "average_value", then the exposure variable should be part of X, with its name provided in exposure_name.
f1: the fitted values from a flexible estimation technique regressing Y on X. A vector of the same length as Y; if sample-splitting is desired, then the value of f1 at each position should be the result of predicting from a model trained without that observation.
f2: the fitted values from a flexible estimation technique regressing either (a) f1 or (b) Y on X withholding the columns in indx. A vector of the same length as Y; if sample-splitting is desired, then the value of f2 at each position should be the result of predicting from a model trained without that observation.
indx: the indices of the covariate(s) to calculate variable importance for; defaults to 1.
type: the type of importance to compute; defaults to r_squared, but other supported options are auc, accuracy, deviance, anova, and average_value.
run_regression: if outcome Y and covariates X are passed to vim, and run_regression is TRUE, then Super Learner will be used; otherwise, variable importance will be computed using the supplied fitted values (f1 and f2).
SL.library: a character vector of learners to pass to SuperLearner; only used if Y and X are supplied and run_regression = TRUE. Defaults to SL.glmnet, SL.xgboost, and SL.mean.
alpha: the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.
delta: the value of the $\delta$-null (i.e., testing if importance < $\delta$); defaults to 0.
scale: should CIs be computed on original ("identity") or another scale? (options are "log" and "logit")
na.rm: should we remove NAs in the outcome and fitted values in computation? (defaults to FALSE)
sample_splitting: should we use sample-splitting to estimate the full and reduced predictiveness? Defaults to TRUE, since inferences made using sample_splitting = FALSE will be invalid for variables with truly zero importance.
sample_splitting_folds: the folds used for sample-splitting; these identify the observations that should be used to evaluate predictiveness based on the full and reduced sets of covariates, respectively. Only used if run_regression = FALSE.
final_point_estimate: if sample splitting is used, should the final point estimates be based on only the sample-split folds used for inference ("split", the default), or should they instead be based on the full dataset ("full") or the average across the point estimates from each sample split ("average")? All three options result in valid point estimates -- sample-splitting is only required for valid inference.
stratified: if run_regression = TRUE, should the generated folds be stratified based on the outcome? (This helps to ensure class balance across cross-validation folds.) Defaults to FALSE.
C: the indicator of coarsening (1 denotes observed, 0 denotes unobserved).
Z: either (i) NULL (the default, in which case the argument C above must be all ones), or (ii) a character vector specifying the variable(s) among Y and X that are thought to play a role in the coarsening mechanism. To specify the outcome, use "Y"; to specify covariates, use a character number corresponding to the desired position in X (e.g., "1").
ipc_scale: what scale should the inverse probability weight correction be applied on (if any)? Defaults to "identity". (other options are "log" and "logit")
ipc_weights: weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).
ipc_est_type: the type of procedure used for coarsened-at-random settings; options are "ipw" (for inverse probability weighting) or "aipw" (for augmented inverse probability weighting). Only used if C is not all equal to 1.
scale_est: should the point estimate be scaled to be greater than or equal to 0? Defaults to TRUE.
nuisance_estimators_full: (only used if type = "average_value") a list of nuisance function estimators for the full regression (using all covariates), fit on the observed data (may be within a specified fold, for cross-fitted estimates). Specifically: an estimator of the optimal treatment rule; an estimator of the propensity score under the estimated optimal treatment rule; and an estimator of the outcome regression when treatment is assigned according to the estimated optimal rule.
nuisance_estimators_reduced: (only used if type = "average_value") a list of nuisance function estimators for the reduced regression (withholding the covariates in indx), fit on the observed data (may be within a specified fold, for cross-fitted estimates). The list has the same three components as nuisance_estimators_full: an estimator of the optimal treatment rule; an estimator of the propensity score under the estimated optimal treatment rule; and an estimator of the outcome regression when treatment is assigned according to the estimated optimal rule.
exposure_name: (only used if type = "average_value") the name of the exposure of interest; binary, with 1 indicating presence of the exposure and 0 indicating absence of the exposure.
bootstrap: should bootstrap-based standard error estimates be computed? Defaults to FALSE (and currently may only be used if sample_splitting = FALSE).
b: the number of bootstrap replicates (only used if bootstrap = TRUE and sample_splitting = FALSE); defaults to 1000.
boot_interval_type: the type of bootstrap interval (one of "norm", "basic", "stud", "perc", or "bca", as in boot::boot.ci) if requested. Defaults to "perc".
clustered: should the bootstrap resamples be performed on clusters rather than individual observations? Defaults to FALSE.
cluster_id: vector of the same length as Y giving the cluster IDs used for the clustered bootstrap, if clustered is TRUE.
...: other arguments to the estimation tool, see "See also".
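
The C and ipc_weights arguments describe a coarsened-at-random setting, where ipc_weights must already be inverted. As a minimal sketch, assuming a logistic regression is an adequate model for the probability of being observed (an illustrative choice, not something this page prescribes), such weights might be built in base R as follows. The simulated data and the names w and obs_prob are hypothetical.

```r
set.seed(20)
n <- 200

# hypothetical always-observed variable that drives the coarsening
w <- runif(n, -1, 1)

# C = 1 denotes observed, C = 0 denotes unobserved
C <- rbinom(n, size = 1, prob = plogis(1 + w))

# estimate P(C = 1 | w) with a logistic regression (an illustrative choice)
obs_prob <- fitted(glm(C ~ w, family = binomial()))

# ipc_weights are assumed to be already inverted
ipc_weights <- 1 / obs_prob
```

Vectors built this way have the same length as the outcome, which is the shape that the C and ipc_weights defaults in Usage suggest.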

Details

We define the population variable importance measure (VIM) for the group of features (or single feature) $s$ with respect to the predictiveness measure $V$ by $$\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),$$ where $f_0$ is the population predictiveness-maximizing function, $f_{0,s}$ is the population predictiveness-maximizing function that is only allowed to access the features with index not in $s$, and $P_0$ is the true data-generating distribution. VIM estimates are computed in three steps: construct estimators $f_n$ and $f_{n,s}$ of $f_0$ and $f_{0,s}$, respectively; construct an estimator $P_n$ of $P_0$; and finally, set $\psi_{n,s} := V(f_n, P_n) - V(f_{n,s}, P_n)$.
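
To make the plug-in definition concrete, here is a minimal base R sketch for the R-squared predictiveness measure. It assumes ordinary least squares stands in for the flexible estimators $f_n$ and $f_{n,s}$, uses the empirical distribution as $P_n$, and skips sample-splitting and cross-fitting entirely, so it illustrates only the point estimate, not the inference that vim performs. The helper r_squared and the simulated data are hypothetical.

```r
set.seed(1)
n <- 500
x <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
y <- 1 + 2 * x$x1 + 0.5 * x$x2 + rnorm(n)

# plug-in R-squared predictiveness: V(f, P_n) = 1 - MSE(f) / Var_n(Y)
r_squared <- function(y, fitted) {
  1 - mean((y - fitted)^2) / mean((y - mean(y))^2)
}

# feature of interest: s = 1 (the first column of x)
s <- 1
full_dat <- data.frame(y = y, x)
reduced_dat <- data.frame(y = y, x[, -s, drop = FALSE])

# f_n uses all features; f_{n,s} may not access the features in s
full_fitted <- fitted(lm(y ~ ., data = full_dat))
reduced_fitted <- fitted(lm(y ~ ., data = reduced_dat))

# psi_{n,s} = V(f_n, P_n) - V(f_{n,s}, P_n)
psi_n <- r_squared(y, full_fitted) - r_squared(y, reduced_fitted)
```

Because the two least-squares models are nested and evaluated in-sample, psi_n cannot be negative here; with flexible learners and held-out evaluation that is no longer guaranteed, which is the situation the scale_est argument addresses.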

In the interest of transparency, we return most of the calculations within the vim object. This results in a list including:

s: the column(s) to calculate variable importance for
SL.library: the library of learners passed to SuperLearner
type: the type of risk-based variable importance measured
full_fit: the fitted values of the chosen method fit to the full data
red_fit: the fitted values of the chosen method fit to the reduced data
est: the estimated variable importance
naive: the naive estimator of variable importance (only used if type = "anova")
eif: the estimated efficient influence function
eif_full: the estimated efficient influence function for the full regression
eif_reduced: the estimated efficient influence function for the reduced regression
se: the standard error for the estimated variable importance
ci: the $(1-\alpha) \times 100$% confidence interval for the variable importance estimate
test: a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
p_value: a p-value based on the same test as test
full_mod: the object returned by the estimation procedure for the full data regression (if applicable)
red_mod: the object returned by the estimation procedure for the reduced data regression (if applicable)
alpha: the level, for confidence interval calculation
sample_splitting_folds: the folds used for sample-splitting (used for hypothesis testing)
y: the outcome
ipc_weights: the weights
cluster_id: the cluster IDs
mat: a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value
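
The est, se, alpha, and ci components are related in the usual Wald fashion. The sketch below shows one standard way to turn a point estimate and standard error into an interval on the "identity", "log", or "logit" scale using the delta method. It is an assumption that this mirrors the package's internal computation; wald_ci is a hypothetical helper, not a function exported by the package, and the input values are made up.

```r
# hypothetical helper, not part of the package
wald_ci <- function(est, se, alpha = 0.05,
                    scale = c("identity", "log", "logit")) {
  scale <- match.arg(scale)
  z <- stats::qnorm(1 - alpha / 2)
  if (scale == "identity") {
    est + c(-1, 1) * z * se
  } else if (scale == "log") {
    # delta method: the SE of log(est) is approximately se / est
    exp(log(est) + c(-1, 1) * z * se / est)
  } else {
    # delta method: the SE of logit(est) is approximately se / (est * (1 - est))
    stats::plogis(stats::qlogis(est) + c(-1, 1) * z * se / (est * (1 - est)))
  }
}

# made-up estimate and standard error
ci_identity <- wald_ci(est = 0.2, se = 0.05)
ci_logit <- wald_ci(est = 0.2, se = 0.05, scale = "logit")
```

Intervals built on the log or logit scale respect the natural bounds of the estimate (positive, or between 0 and 1), which the identity-scale interval does not.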

Examples

# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -1, 1)))

# apply the function to the x's
f <- function(x) 0.5 + 0.3*x[1] + 0.2*x[2]
smooth <- apply(x, 1, function(z) f(z))

# generate Y ~ Bernoulli (smooth)
y <- matrix(rbinom(n, size = 1, prob = smooth))

# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
library("vimp")  # provides vim() and get_cv_sl_folds()
learners <- c("SL.glm")

# using Y and X; use class-balanced folds
est_1 <- vim(y, x, indx = 2, type = "accuracy",
           alpha = 0.05, run_regression = TRUE,
           SL.library = learners, cvControl = list(V = 2),
           stratified = TRUE)

# using pre-computed fitted values
set.seed(4747)
V <- 2
full_fit <- SuperLearner::CV.SuperLearner(Y = y, X = x,
                                          SL.library = learners,
                                          cvControl = list(V = 2),
                                          innerCvControl = list(list(V = V)))
full_fitted <- SuperLearner::predict.SuperLearner(full_fit)$pred
# fit the data with only X1
reduced_fit <- SuperLearner::CV.SuperLearner(Y = full_fitted,
                                             X = x[, -2, drop = FALSE],
                                             SL.library = learners,
                                             cvControl = list(V = 2, validRows = full_fit$folds),
                                             innerCvControl = list(list(V = V)))
reduced_fitted <- SuperLearner::predict.SuperLearner(reduced_fit)$pred

est_2 <- vim(Y = y, f1 = full_fitted, f2 = reduced_fitted,
            indx = 2, run_regression = FALSE, alpha = 0.05,
            stratified = TRUE, type = "accuracy",
            sample_splitting_folds = get_cv_sl_folds(full_fit$folds))
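
When fitted values are supplied through f1 and f2, each entry should come from a model trained without that observation. The following base R sketch shows that requirement without SuperLearner, assuming a plain logistic regression as the learner and regressing Y (rather than the full fitted values) on the reduced covariates, which is option (b) in the description of f2. The helper cross_fit and the fold assignment are hypothetical and are not how CV.SuperLearner builds its folds.

```r
set.seed(4747)
n <- 100
x <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
y <- rbinom(n, size = 1, prob = 0.5 + 0.3 * x$x1 + 0.2 * x$x2)
dat <- data.frame(y = y, x)

# assign each observation to one of V folds at random
V <- 2
folds <- sample(rep(seq_len(V), length.out = n))

# hypothetical helper: predict each fold from a model trained on the others
cross_fit <- function(dat, folds) {
  preds <- numeric(nrow(dat))
  for (v in sort(unique(folds))) {
    train <- dat[folds != v, , drop = FALSE]
    test <- dat[folds == v, , drop = FALSE]
    fit <- glm(y ~ ., data = train, family = binomial())
    preds[folds == v] <- predict(fit, newdata = test, type = "response")
  }
  preds
}

# out-of-fold fitted values using all covariates, then withholding x2
full_fitted <- cross_fit(dat, folds)
reduced_fitted <- cross_fit(dat[, c("y", "x1")], folds)
```

Both vectors have the same length as the outcome, matching the shape that f1 and f2 are documented to take.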