sbf: Selection By Filtering (SBF)

Description

Model fitting after applying univariate filters

Usage

sbf(x, ...)
# S3 method for default
sbf(x, y, sbfControl = sbfControl(), ...)
# S3 method for formula
sbf(form, data, ..., subset, na.action, contrasts = NULL)
# S3 method for recipe
sbf(x, data, sbfControl = sbfControl(), ...)
# S3 method for sbf
predict(object, newdata = NULL, ...)

Arguments

x

a data frame containing training data where samples are in rows and features are in columns. For the recipes method, x is a recipe object.

...

for sbf: arguments passed to the classification or regression routine (such as randomForest). For predict.sbf: arguments cannot be passed to the prediction function, because predict.sbf uses the function originally specified for prediction.

y

a numeric or factor vector containing the outcome for each sample.

sbfControl

a list of values that define how this function acts. See sbfControl. (NOTE: If given, this argument must be named.)

form

A formula of the form y ~ x1 + x2 + ...

data

Data frame from which variables specified in formula are preferentially to be taken.

subset

An index vector specifying the cases to be used in the training sample. (NOTE: If given, this argument must be named.)

na.action

A function to specify the action to be taken if NAs are found. The default action is for the procedure to fail. An alternative is na.omit, which leads to rejection of cases with missing values on any required variable. (NOTE: If given, this argument must be named.)

contrasts

a list of contrasts to be used for some or all the factors appearing as variables in the model formula.

object

an object of class sbf

newdata

a matrix or data frame of predictors. The object must have non-null column names.

Value

for sbf, an object of class sbf with elements:

pred

if sbfControl$saveDetails is TRUE, this is a list of predictions for the hold-out samples at each resampling iteration. Otherwise, it is NULL.

variables

a list of variable names that survived the filter at each resampling iteration

results

a data frame of results aggregated over the resamples

fit

the final model fit with only the filtered variables

optVariables

the names of the variables that survived the filter using the training set

call

the function call

control

the control object

resample

if sbfControl$returnResamp is "all", a data frame of the resampled performance measures. Otherwise, NULL.

metrics

a character vector of names of the performance measures

dots

a list of optional arguments that were passed in

For predict.sbf, a vector of predictions.
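
As a rough illustration of the elements listed above, the sketch below fits sbf with lmSBF on simulated data and inspects a few components. The data, the seed, and the object names (x, y, fit, preds) are made up for this example, and the exact output depends on the installed version of caret.

```r
library(caret)

## Simulated data (hypothetical): ten predictors, only V1 carries signal
set.seed(123)
n <- 120
x <- as.data.frame(matrix(rnorm(n * 10), nrow = n))  # columns V1 ... V10
y <- 3 * x$V1 + rnorm(n)

fit <- sbf(x, y,
           sbfControl = sbfControl(functions = lmSBF,
                                   method = "cv",
                                   number = 5,
                                   verbose = FALSE))

fit$optVariables   # predictors that survived the filter on the training set
fit$results        # performance aggregated over the resamples
fit$variables      # retained variable names at each resampling iteration

## newdata needs column names so the retained predictors can be located
preds <- predict(fit, x[1:5, ])
```

Because the filter is re-run inside every resample, the entries of fit$variables may differ from one another and from fit$optVariables.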

Details

More details on this function can be found at http://topepo.github.io/caret/feature-selection-using-univariate-filters.html.

This function can be used to get resampling estimates for models when simple, filter-based feature selection is applied to the training data.

For each iteration of resampling, the predictor variables are univariately filtered prior to modeling. Performance of this approach is estimated using resampling. The same filter and model are then applied to the entire training set and the final model (and final features) are saved.
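
The procedure described above can be sketched by hand in base R. This is only an illustration of the idea, not caret's implementation: the simulated data, the helper filter_vars, the p-value cutoff of 0.05, and the use of lm as the model are all assumptions chosen for the example.

```r
## Simulated data (hypothetical): 20 predictors, only V1 and V2 carry signal
set.seed(42)
n <- 100
p <- 20
x <- as.data.frame(matrix(rnorm(n * p), nrow = n))  # columns V1 ... V20
y <- 2 * x$V1 - 1.5 * x$V2 + rnorm(n)

## Univariate filter: keep predictors whose simple-regression slope
## has a p-value at or below the cutoff
filter_vars <- function(x, y, cutoff = 0.05) {
  pvals <- vapply(x, function(col) {
    summary(lm(y ~ col))$coefficients[2, 4]
  }, numeric(1))
  names(pvals)[pvals <= cutoff]
}

## 5-fold cross-validation: filter and fit on the training part only,
## then measure RMSE on the held-out part
folds <- sample(rep(1:5, length.out = n))
rmse <- numeric(5)
kept <- vector("list", 5)
for (k in 1:5) {
  train <- folds != k
  kept[[k]] <- filter_vars(x[train, , drop = FALSE], y[train])
  dat <- cbind(x[train, kept[[k]], drop = FALSE], outcome = y[train])
  fit <- lm(outcome ~ ., data = dat)
  pred <- predict(fit, newdata = x[!train, kept[[k]], drop = FALSE])
  rmse[k] <- sqrt(mean((y[!train] - pred)^2))
}
mean(rmse)  # resampling estimate of performance

## Finally, apply the same filter and model to the entire training set
final_vars <- filter_vars(x, y)
final_fit <- lm(outcome ~ .,
                data = cbind(x[, final_vars, drop = FALSE], outcome = y))
```

Keeping the filter inside the loop is the point of the design: if the variables were filtered once on all of the data, the hold-out samples would have influenced the selection and the performance estimate would be optimistic.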

sbf can be used with "explicit parallelism", where different resamples (e.g. cross-validation groups) are split up and run on multiple machines or processors. By default, sbf uses a single processor on the host machine. As of version 4.99 of this package, parallel processing is handled by the foreach package. To run the resamples in parallel, the code for sbf does not change; instead, a parallel backend is registered with foreach before sbf is called (see the examples below).
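
A minimal sketch of registering a backend follows. It assumes the doParallel package is installed and uses it in place of doMC; any foreach backend should serve the same purpose. The simulated data and the object names (cl, n_workers, fit_par) are hypothetical.

```r
library(caret)
library(doParallel)  # assumption: doParallel is the chosen foreach backend

## Start two local worker processes and register them with foreach
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
n_workers <- foreach::getDoParWorkers()  # number of registered workers

## Simulated data (hypothetical)
set.seed(1)
n <- 120
x <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
y <- 3 * x$V1 + rnorm(n)

## The call to sbf is the same as in the sequential case
fit_par <- sbf(x, y,
               sbfControl = sbfControl(functions = lmSBF,
                                       method = "cv",
                                       number = 5,
                                       verbose = FALSE))

## Shut the workers down and return to sequential execution
stopCluster(cl)
foreach::registerDoSEQ()
```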

The modeling and filtering techniques are specified in sbfControl. Example functions are given in lmSBF.
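
One way to customize the filtering step is to copy an existing function set and replace one of its elements. The sketch below assumes that the filter element of lmSBF takes the arguments (score, x, y) and that the scores are p-values; inspect lmSBF in your installed version before relying on this. The name strictLM, the 0.01 cutoff, and the simulated data are hypothetical.

```r
library(caret)

## Copy the built-in linear-model function set and tighten its filter
## (assumes scores are p-values, so smaller means stronger evidence)
strictLM <- lmSBF
strictLM$filter <- function(score, x, y) score <= 0.01

## Simulated data (hypothetical): only V1 carries signal
set.seed(7)
n <- 120
x <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
y <- 3 * x$V1 + rnorm(n)

fit_strict <- sbf(x, y,
                  sbfControl = sbfControl(functions = strictLM,
                                          method = "cv",
                                          number = 5,
                                          verbose = FALSE))
fit_strict$optVariables  # predictors passing the stricter cutoff
```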

Examples

# NOT RUN {
data(BloodBrain)

## Use a GAM as the filter, then fit a random forest model
RFwithGAM <- sbf(bbbDescr, logBBB,
                 sbfControl = sbfControl(functions = rfSBF,
                                         verbose = FALSE,
                                         method = "cv"))
RFwithGAM

predict(RFwithGAM, bbbDescr[1:10,])

## classification example with parallel processing

## library(doMC)

## Note: if the underlying model also uses foreach, the
## number of cores used will be double the number specified
## below (along with the memory requirements)
## registerDoMC(cores = 2)

data(mdrr)
mdrrDescr <- mdrrDescr[,-nearZeroVar(mdrrDescr)]
mdrrDescr <- mdrrDescr[, -findCorrelation(cor(mdrrDescr), .8)]

set.seed(1)
filteredNB <- sbf(mdrrDescr, mdrrClass,
                 sbfControl = sbfControl(functions = nbSBF,
                                         verbose = FALSE,
                                         method = "repeatedcv",
                                         repeats = 5))
confusionMatrix(filteredNB)
# }