ReclassificationFRESA.Model: IDI/NRI-based feature selection procedure for linear, logistic, and Cox proportional hazards regression models

Description

This function uses bootstrap sampling to rank the variables that statistically improve prediction. Within each bootstrap loop, a forward selection procedure computes the IDI/NRI of every candidate and adds the variable with the largest statistically significant IDI/NRI to the model; this step is repeated until no more variables can be inserted. The variables that entered the model are then counted, and the same procedure is repeated for the remaining bootstrap loops. Once the variables are ranked by their frequency of inclusion, a forward selection procedure creates a final model whose terms all make a significant contribution to the integrated discrimination improvement (IDI) or the net reclassification improvement (NRI). The function returns the frequency of variable inclusion as well as the final model built from that ranking.
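The bootstrap ranking described above can be sketched in base R for the logistic case. This is an illustrative approximation, not the package's implementation: the helper names (idi_z, forward_select), the simulated data, and the one-sided z threshold derived from pvalue are all assumptions made for the example.

```r
# Illustrative sketch only; helper names and the toy data are hypothetical.
set.seed(42)

# z-score of the IDI between two vectors of predicted probabilities
idi_z <- function(p_old, p_new, y) {
  d_ev <- (p_new - p_old)[y == 1]  # change in predicted risk among events
  d_ne <- (p_new - p_old)[y == 0]  # change in predicted risk among non-events
  idi <- mean(d_ev) - mean(d_ne)
  idi / sqrt(var(d_ev) / length(d_ev) + var(d_ne) / length(d_ne))
}

# Forward selection on one bootstrap sample: repeatedly add the candidate
# with the largest significant IDI z-score until none qualifies
forward_select <- function(boot, outcome, candidates,
                           pvalue = 0.05, max_size = 10) {
  z_min <- qnorm(1 - pvalue)  # assumed one-sided significance threshold
  selected <- character(0)
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0 || length(selected) >= max_size) break
    base_fit <- glm(reformulate(c("1", selected), outcome),
                    data = boot, family = binomial)
    p_old <- fitted(base_fit)
    z <- sapply(remaining, function(v) {
      fit <- glm(reformulate(c("1", selected, v), outcome),
                 data = boot, family = binomial)
      idi_z(p_old, fitted(fit), boot[[outcome]])
    })
    z[!is.finite(z)] <- -Inf  # ignore degenerate comparisons
    if (max(z) < z_min) break
    selected <- c(selected, names(which.max(z)))
  }
  selected
}

# Toy data: only x1 and x2 are related to the outcome
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - 1.5 * toy$x2))

# Count how often each variable enters the model across bootstrap loops
loops <- 20
counts <- setNames(numeric(4), paste0("x", 1:4))
for (i in seq_len(loops)) {
  boot <- toy[sample(n, replace = TRUE), ]
  chosen <- forward_select(boot, "y", names(counts))
  counts[chosen] <- counts[chosen] + 1
}
inclusion_frequency <- sort(counts / loops, decreasing = TRUE)
print(inclusion_frequency)
```

The sketch stops at the frequency ranking; building the final model from that ranking would be one more forward selection pass over the variables in frequency order.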

Usage

ReclassificationFRESA.Model(size = 100,
	                            fraction = 1,
	                            pvalue = 0.05, 
	                            loops = 100,
	                            covariates = "1",
	                            Outcome,
	                            variableList,
	                            data, 
	                            maxTrainModelSize = 10,
	                            type = c("LM", "LOGIT", "COX"),
	                            timeOutcome = "Time",
	                            selectionType = c("zIDI", "zNRI"),
	                            loop.threshold = 20,
	                            interaction = 1,
	                            cores = 4)

Arguments

size

The number of candidate variables to be tested (the first size variables from variableList)

fraction

The fraction of the data (sampled with replacement) to be used as the training set

pvalue

The maximum p-value, associated with either the IDI or the NRI, allowed for a term in the model

loops

The number of bootstrap loops

covariates

A string of the type "1 + var1 + var2" that defines which variables will always be included in the models (as covariates)

Outcome

The name of the column in data that stores the variable to be predicted by the model

variableList

A data frame with two columns. The first must contain the names of the candidate variables and the second the descriptions of those variables
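As a small illustration of the expected layout, a variableList could be built as follows. The column names Name and Description are hypothetical (the text above only fixes the column order), and the candidate names are assumed to match columns of data.

```r
# Hypothetical candidate variables; each name must match a column of `data`
variableList <- data.frame(
  Name = c("age", "g2", "grade"),
  Description = c("Age in years",
                  "Percent of cells in G2 phase",
                  "Tumor grade"),
  stringsAsFactors = FALSE
)

# With size = 2, only the first two candidates would be tested
size <- 2
candidates <- head(variableList[[1]], size)
print(candidates)
```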

data

A data frame where all variables are stored in different columns

maxTrainModelSize

Maximum number of terms that can be included in the model

type

Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")

timeOutcome

The name of the column in data that stores the time to event (needed only for a Cox proportional hazards regression model fitting)
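To make the roles of covariates, Outcome, type, and timeOutcome concrete, the following sketch shows one plausible way to assemble a model formula from them. The helper build_formula is hypothetical and is not part of the package; how the function combines these arguments internally is an assumption.

```r
# Hypothetical helper: combines the fixed covariates with selected terms
build_formula <- function(Outcome, covariates = "1", terms = character(0),
                          type = "LOGIT", timeOutcome = "Time") {
  # Cox models use a survival response; other types use the outcome directly
  lhs <- if (type == "COX") {
    sprintf("Surv(%s, %s)", timeOutcome, Outcome)
  } else {
    Outcome
  }
  rhs <- paste(c(covariates, terms), collapse = " + ")
  as.formula(paste(lhs, "~", rhs))
}

cox_formula <- build_formula("pgstat", covariates = "1 + age", terms = "g2",
                             type = "COX", timeOutcome = "pgtime")
logit_formula <- build_formula("pgstat", terms = c("age", "g2"))
print(cox_formula)
print(logit_formula)
```

Fitting the Cox formula would additionally require the survival package, which provides Surv and coxph.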

selectionType

The type of index to be evaluated by the improveProb function (Hmisc package): z-score of IDI or of NRI
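For orientation, the two indices can be computed by hand in base R. The sketch below follows the category-free definitions in Pencina et al. (2008); reclass_stats is a hypothetical helper written for this example and is not a copy of improveProb, although it is intended to produce comparable quantities.

```r
# Hypothetical helper: IDI, NRI, and their z-scores for two sets of
# predicted probabilities (p_old, p_new) and a binary outcome y
reclass_stats <- function(p_old, p_new, y) {
  d <- p_new - p_old
  ev <- y == 1
  ne <- y == 0
  n_ev <- sum(ev)
  n_ne <- sum(ne)

  # IDI: mean gain in predicted risk for events minus that for non-events
  idi <- mean(d[ev]) - mean(d[ne])
  se_idi <- sqrt(var(d[ev]) / n_ev + var(d[ne]) / n_ne)

  # NRI: net proportion moving up among events minus that among non-events
  up_ev <- mean(d[ev] > 0)
  down_ev <- mean(d[ev] < 0)
  up_ne <- mean(d[ne] > 0)
  down_ne <- mean(d[ne] < 0)
  nri <- (up_ev - down_ev) - (up_ne - down_ne)
  se_nri <- sqrt((up_ev + down_ev) / n_ev + (up_ne + down_ne) / n_ne)

  list(idi = idi, z.idi = idi / se_idi, nri = nri, z.nri = nri / se_nri)
}

# Tiny worked example: three events followed by three non-events
y <- c(1, 1, 1, 0, 0, 0)
p_old <- rep(0.5, 6)
p_new <- c(0.7, 0.6, 0.4, 0.3, 0.4, 0.6)
res <- reclass_stats(p_old, p_new, y)
print(res)
```

Here two of the three events move up and two of the three non-events move down, so the NRI is 2/3 and the IDI is 2/15.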

loop.threshold

After loop.threshold cycles, only variables that have already been selected in previous cycles remain candidates for selection in subsequent cycles

interaction

Set to either 1 for first order models, or to 2 for second order models

cores

The number of cores to be used for parallel processing

Value

final.model

An object of class lm, glm, or coxph containing the final model

var.names

A vector with the names of the features that were included in the final model

formula

An object of class formula with the formula used to fit the final model

ranked.var

An array with the ranked frequencies of the features

z.selection

A vector in which each element is the z-score of the index defined in selectionType, obtained by comparing the full model with the model that omits the corresponding term

formula.list

A list containing objects of class formula with the formulas used to fit the models found at each cycle

References

Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157-172.

Examples

	# Start the graphics device driver to save all plots in a pdf format
	pdf(file = "Example.pdf")
	# Get the stage C prostate cancer data from the rpart package
	library(rpart)
	data(stagec)
	# Split the stages into several columns
	dataCancer <- cbind(stagec[,c(1:3,5:6)],
	                    gleason4 = 1*(stagec[,7] == 4),
	                    gleason5 = 1*(stagec[,7] == 5),
	                    gleason6 = 1*(stagec[,7] == 6),
	                    gleason7 = 1*(stagec[,7] == 7),
	                    gleason8 = 1*(stagec[,7] == 8),
	                    gleason910 = 1*(stagec[,7] >= 9),
	                    eet = 1*(stagec[,4] == 2),
	                    diploid = 1*(stagec[,8] == "diploid"),
	                    tetraploid = 1*(stagec[,8] == "tetraploid"),
	                    notAneuploid = 1-1*(stagec[,8] == "aneuploid"))
	# Remove the incomplete cases
	dataCancer <- dataCancer[complete.cases(dataCancer),]
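	# Load a pre-established data frame with the names and descriptions of all variables
	# (assumes the package that provides it and ReclassificationFRESA.Model, FRESA.CAD, is installed)
	library(FRESA.CAD)
	data(cancerVarNames)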
	# Get a Cox proportional hazards model using:
	# - 10 bootstrap loops
	# - zIDI as the feature inclusion criterion
	cancerModel <- ReclassificationFRESA.Model(loops = 10,
	                                           Outcome = "pgstat",
	                                           variableList = cancerVarNames,
	                                           data = dataCancer,
	                                           type = "COX",
	                                           timeOutcome = "pgtime",
	                                           selectionType = "zIDI")
	# Shut down the graphics device driver
	dev.off()