vimp (version 2.1.0)

sp_vim: Shapley Population Variable Importance Measure (SPVIM) Estimates and Inference

Description

Compute estimates and confidence intervals for the SPVIMs, using cross-fitting. This essentially involves splitting the data into V train/test splits; training the learners on the training data and evaluating importance on the test data; and averaging over these splits.

Usage

sp_vim(
  Y,
  X,
  V = 5,
  weights = rep(1, length(Y)),
  type = "r_squared",
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  univariate_SL.library = NULL,
  gamma = 1,
  alpha = 0.05,
  delta = 0,
  na.rm = FALSE,
  stratified = FALSE,
  ...
)

Arguments

Y

the outcome.

X

the covariates.

V

the number of folds for cross-validation, defaults to 5.

weights

weights for the computed influence curve (e.g., inverse probability weights for coarsened-at-random settings).

type

the type of parameter (e.g., R-squared-based is "r_squared").

SL.library

a character vector of learners to pass to SuperLearner. Defaults to SL.glmnet, SL.xgboost, and SL.mean.

univariate_SL.library

(optional) a character vector of learners to pass to SuperLearner for estimating univariate regression functions. Defaults to SL.polymars.

gamma

the fraction of the sample size to use when sampling subsets (e.g., gamma = 1 samples the same number of subsets as the sample size).

alpha

the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.

delta

the value of the \(\delta\)-null (i.e., testing if importance < \(\delta\)); defaults to 0.

na.rm

should we remove NAs in the outcome and fitted values before computation? (defaults to FALSE)

stratified

should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-validation folds)?

...

other arguments to the estimation tool; see "See Also".

Value

An object of class vim. See Details for more information.

Details

We define the SPVIM as the weighted average of the population difference in predictiveness over all subsets of features not containing feature \(j\).
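
Concretely, writing \(V(s, P_0)\) for the population predictiveness of a feature subset \(s \subseteq \{1, \ldots, p\}\) (notation as in Williamson and Feng (2020)), the SPVIM for feature \(j\) takes the familiar Shapley form

\[\psi_j = \sum_{s \subseteq \{1, \ldots, p\} \setminus \{j\}} \frac{1}{p} \binom{p - 1}{|s|}^{-1} \left\{V(s \cup \{j\}, P_0) - V(s, P_0)\right\},\]

where the weights are the standard Shapley weights; see the paper for the precise definition.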

This is equivalent to finding the solution to a population weighted least squares problem. This key fact allows us to estimate the SPVIM using weighted least squares, where we first sample subsets from the power set of all possible features using the Shapley sampling distribution; then use cross-fitting to obtain estimators of the predictiveness of each sampled subset; and finally, solve the least squares problem given in Williamson and Feng (2020).
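
As a rough sketch of this recipe (not the internals of sp_vim, which uses cross-fitted SuperLearner estimators and the exact constrained weighted least squares of Williamson and Feng (2020); the plug-in R-squared predictiveness and the unconstrained fit below are simplifying assumptions), one can sample subsets from the Shapley sampling distribution, estimate each subset's predictiveness, and regress those estimates on subset-membership indicators:

set.seed(1234)
n <- 200
p <- 3
x <- data.frame(replicate(p, stats::rnorm(n)))
y <- x[, 1] + 0.5 * x[, 2] + stats::rnorm(n)

## Shapley sampling distribution over subset sizes 1, ..., p - 1:
## P(size = s) is proportional to 1 / (s * (p - s))
sizes <- seq_len(p - 1)
size_probs <- 1 / (sizes * (p - sizes))
size_probs <- size_probs / sum(size_probs)

## draw n subsets (gamma = 1), uniformly at random within each size
sampled <- replicate(n, {
  s <- sample(sizes, 1, prob = size_probs)
  sort(sample.int(p, s))
}, simplify = FALSE)

## weight each unique subset by how often it was drawn
keys <- vapply(sampled, paste, character(1), collapse = ",")
counts <- table(keys)
subsets <- lapply(strsplit(names(counts), ","), as.integer)

## plug-in predictiveness (R-squared) for each sampled subset
v_hat <- vapply(subsets, function(s) {
  summary(stats::lm(y ~ ., data = x[, s, drop = FALSE]))$r.squared
}, numeric(1))

## weighted least squares of v_hat on subset-membership indicators;
## the slope for feature j is a rough SPVIM estimate for that feature
Z <- t(vapply(subsets, function(s) as.numeric(seq_len(p) %in% s), numeric(p)))
fit <- stats::lm(v_hat ~ Z, weights = as.numeric(counts))
round(stats::coef(fit)[-1], 3)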

See the paper by Williamson and Feng (2020) for more details on the mathematics behind this function and the validity of the confidence intervals. In the interest of transparency, we return most of the calculations within the vim object. This results in a list containing:

  • call - the call to sp_vim

  • SL.library - the library of learners passed to SuperLearner

  • v - the estimated predictiveness measure for each sampled subset

  • preds_lst - the predicted values from the chosen method for each sampled subset

  • est - the estimated SPVIM value for each feature

  • ic_lst - the influence functions for each sampled subset

  • ic - a list of the SPVIM influence function contributions

  • se - the standard errors for the estimated variable importance

  • ci - the \((1-\alpha) \times 100\)% confidence intervals based on the variable importance estimates

  • gamma - the fraction of the sample size used when sampling subsets

  • alpha - the level, for confidence interval calculation

  • delta - the delta value used for hypothesis testing

  • y - the outcome

  • weights - the weights

  • mat - a tibble with the estimates, SEs, CIs, hypothesis testing decisions, and p-values

See Also

SuperLearner for specific usage of the SuperLearner function and package.

Examples

library(vimp)
library(SuperLearner)
library(ranger)
n <- 100
p <- 2
## generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))

## apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2

## generate Y ~ Normal (smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))

## set up a library for SuperLearner
learners <- c("SL.mean", "SL.ranger")

## -----------------------------------------
## using Super Learner (with a small number of CV folds,
## for illustration only)
## -----------------------------------------
set.seed(4747)
est <- sp_vim(Y = y, X = x, V = 2, type = "r_squared",
              SL.library = learners, alpha = 0.05)
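
## inspect the output: 'mat' collects the estimates, SEs, CIs,
## testing decisions, and p-values described under Details
est$mat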
