
FRESA.CAD (version 2.0.2)

crossValidationFeatureSelection: IDI/NRI-based selection of a linear, logistic, or Cox proportional hazards regression model from a set of candidate variables

Description

This function performs a cross-validation analysis of a feature selection algorithm based on the integrated discrimination improvement (IDI) or the net reclassification improvement (NRI) to return a predictive model. The procedure consists of an IDI/NRI-based feature selection, followed by an update procedure, and ends with a bootstrapped backwards feature elimination. The user can control how many train and blind test sets will be evaluated.

Usage

crossValidationFeatureSelection(size = 10,
	                                fraction = 1.0,
	                                pvalue = 0.05,
	                                loops = 100,
	                                covariates = "1",
	                                Outcome,
	                                timeOutcome = "Time",
	                                variableList,
	                                data,
	                                maxTrainModelSize = 10,
	                                type = c("LM", "LOGIT", "COX"),
	                                selectionType = c("zIDI", "zNRI"),
	                                loop.threshold = 10,
	                                startOffset = 0,
	                                elimination.bootstrap.steps = 25,
	                                trainFraction = 0.67,
	                                trainRepetition = 9,
	                                elimination.pValue = 0.05,
	                                CVfolds = 10,
	                                bootstrap.steps = 25,
	                                interaction = c(1, 1),
	                                nk = 0,
	                                unirank = NULL,
	                                print = TRUE,
	                                plots = TRUE)

Arguments

size
The number of candidate variables to be tested (the first size variables from variableList)
fraction
The fraction of data (sampled with replacement) to be used as train
pvalue
The maximum p-value, associated with either the IDI or the NRI, allowed for a term in the model
loops
The number of bootstrap loops
covariates
A string of the type "1 + var1 + var2" that defines which variables will always be included in the models (as covariates)
Outcome
The name of the column in data that stores the variable to be predicted by the model
timeOutcome
The name of the column in data that stores the time to event (needed only for a Cox proportional hazards regression model fitting)
variableList
A data frame with two columns. The first one must have the names of the candidate variables and the other one the description of such variables
data
A data frame where all variables are stored in different columns
maxTrainModelSize
Maximum number of terms that can be included in the model
type
Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
selectionType
The type of index to be evaluated by the improveProb function (Hmisc package): the z-score of the IDI ("zIDI") or of the NRI ("zNRI"); see the sketch after this argument list
loop.threshold
After loop.threshold cycles, only variables that have already been selected in previous cycles will be candidates for selection in subsequent cycles
startOffset
Only terms whose position in the model is larger than the startOffset are candidates to be removed
elimination.bootstrap.steps
The number of bootstrap loops for the backwards elimination procedure
trainFraction
The fraction of data (sampled with replacement) to be used as train for the cross-validation procedure
trainRepetition
The number of cross-validation folds (it should be at least equal to 1/trainFraction for a complete cross-validation)
elimination.pValue
The maximum p-value, associated with either the IDI or the NRI, allowed for a term to remain in the model during the backward elimination procedure
CVfolds
The number of folds for the final cross-validation.
bootstrap.steps
The number of bootstrap loops for the confidence intervals estimation
interaction
A vector of size two. The first and second elements are used by the search and update procedures, respectively. Set an element to 1 for a first-order model or to 2 for a second-order model
nk
The number of neighbors used to generate a k-nearest neighbors (KNN) classification. If zero, k is set to the square root of the number of cases. If negative, the KNN classification is not performed
unirank
A list with the results yielded by the uniRankVar function, required only if the rank needs to be updated during the cross-validation procedure
print
Logical. If TRUE, information will be displayed
plots
Logical. If TRUE, plots are displayed
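
As a minimal sketch of what the two selection indices measure, the following code (simulated data; the variable and model names are illustrative and not part of FRESA.CAD) obtains the z-scores of the IDI and the NRI from the Hmisc improveProb function, corresponding to selectionType = "zIDI" and selectionType = "zNRI":

	# Minimal sketch with simulated data (not part of FRESA.CAD)
	library(Hmisc)
	set.seed(1)
	n  <- 200
	v1 <- rnorm(n)
	v2 <- rnorm(n)
	y  <- rbinom(n, 1, plogis(0.8*v1 + 0.6*v2))
	baseModel <- glm(y ~ v1, family = binomial)       # model without the candidate term
	newModel  <- glm(y ~ v1 + v2, family = binomial)  # model with the candidate term
	imp <- improveProb(x1 = fitted(baseModel),        # predicted probabilities, base model
	                   x2 = fitted(newModel),         # predicted probabilities, extended model
	                   y  = y)
	imp$z.idi   # z-score of the IDI (selectionType = "zIDI")
	imp$z.nri   # z-score of the NRI (selectionType = "zNRI")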

Value

  • formula.list: A list containing objects of class formula with the formulas used to fit the models found at each cycle
  • Models.testPrediction: A data frame with the blind test set predictions made at each fold of the cross-validation, where the models used to generate such predictions (formula.list) were generated via a feature selection process which included only the train set. It also includes a column with the Outcome of each prediction, and a column with the number of the fold at which the prediction was made.
  • FullModel.testPrediction: A data frame similar to Models.testPrediction, but where the model used to generate the predictions was the full model, generated via a feature selection process which included all data.
  • TestRetrained.blindPredictions: A data frame similar to Models.testPrediction, but where the models were retrained on an independent set of data (only if enough samples are given at each fold)
  • LastTrainedModel.bootstrapped: An object of class bootstrapValidation containing the results of the bootstrap validation in the last trained model
  • Test.accuracy: The global blind test accuracy of the cross-validation procedure
  • Test.sensitivity: The global blind test sensitivity of the cross-validation procedure
  • Test.specificity: The global blind test specificity of the cross-validation procedure
  • Train.correlationsToFull: The Spearman ρ rank correlation coefficient between the predictions made with each model from formula.list and the full model in the train set
  • Blind.correlationsToFull: The Spearman ρ rank correlation coefficient between the predictions made with each model from formula.list and the full model in the test set
  • FullModelAtFoldAccuracies: The blind test accuracy for the full model at each cross-validation fold
  • FullModelAtFoldSpecificties: The blind test specificity for the full model at each cross-validation fold
  • FullModelAtFoldSensitivities: The blind test sensitivity for the full model at each cross-validation fold
  • AtCVFoldModelBlindAccuracies: The blind test accuracy for the full model at each final cross-validation fold
  • AtCVFoldModelBlindSpecificities: The blind test specificity for the full model at each final cross-validation fold
  • AtCVFoldModelBlindSensitivities: The blind test sensitivity for the full model at each final cross-validation fold
  • Models.CVblindMeanSensitivites: The mean ROC sensitivities at certain specificities for all test final cross-validation folds (i.e. 1.00, 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05, and 0.00)
  • varIDISelection: A list containing the values returned by ReclassificationFRESA.Model using all data
  • updateIDISelection: A list containing the values returned by updateModel using all data and the model from varIDISelection
  • backIDIElimination: A list containing the values returned by bootstrapVarElimination using all data and the model from updateIDISelection
  • FullModel.bootstrapped: An object of class bootstrapValidation containing the results of the bootstrap validation in the full model
  • Models.testSensitivities: A matrix with the mean ROC sensitivities at certain specificities for each train and all test cross-validation folds using the cross-validation models (i.e. 0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, and 0.05)
  • FullKNN.testPrediction: A data frame similar to Models.testPrediction, but where a KNN classifier with the same features as the full model was used to generate the predictions
  • KNN.testPrediction: A data frame similar to Models.testPrediction, but where KNN classifiers with the same features as the cross-validation models were used to generate the predictions at each cross-validation fold
  • fullenet: An object of class cv.glmnet containing the results of an elastic net cross-validation fit
  • enet.testPredictions: A data frame similar to Models.testPrediction, but where the predictions were made by the elastic net model
  • enetVariables: A list with the elastic net full model and the models found at each cross-validation fold

Details

This function produces a set of data and plots that can be used to inspect the degree of over-fitting or shrinkage of a model. It uses bootstrapped data, cross-validation data, and, if possible, retrained data. During each cycle, a train and a test ROC are generated from the bootstrapped data. At the end of the cross-validation feature selection procedure, a set of three plots may be produced, depending on the specifications of the analysis. The first plot shows the ROC for each cross-validation blind test. The second plot, if enough samples are given, shows the ROC of each model trained and tested in the blind test partition. The final plot shows ROC curves generated with the train, the bootstrapped blind test, and the cross-validation test data; it also contains the ROC of the cross-validation mean test data and of the cross-validation coherence. This set of plots gives an overall perspective of the expected model shrinkage. Along with the plots, the function reports the overall performance of the system (accuracy, sensitivity, and specificity). The function also reports the expected performance of a KNN classifier trained with the selected features of the model and of an elastic net fit. The test predictions obtained with these algorithms can then be compared with the predictions generated by the logistic, linear, or Cox proportional hazards regression model.
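
As a rough sketch of such a comparison, the following code assumes cv is the object returned by crossValidationFeatureSelection, and uses hypothetical column names ("Outcome", "Prediction") and an arbitrary 0.5 cut-off, none of which are guaranteed by the function:

	# Sketch only: the column names and the 0.5 cut-off are assumptions
	accuracyOf <- function(pred, cut = 0.5) {
	  mean((pred[, "Prediction"] > cut) == (pred[, "Outcome"] > 0))
	}
	c(regression = accuracyOf(cv$Models.testPrediction),
	  knn        = accuracyOf(cv$KNN.testPrediction),
	  elasticNet = accuracyOf(cv$enet.testPredictions))
	# Global blind test performance reported directly by the function
	cv$Test.accuracy
	cv$Test.sensitivity
	cv$Test.specificity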

References

Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine, 27(2), 157-172.

See Also

crossValidationNeRIFeatureSelection, ReclassificationFRESA.Model, NeRIBasedFRESA.Model

Examples

	# Load the FRESA.CAD package
	library(FRESA.CAD)
	# Start the graphics device driver to save all plots in PDF format
	pdf(file = "Example.pdf")
	# Get the stage C prostate cancer data from the rpart package
	library(rpart)
	data(stagec)
	# Split the stages into several columns
	dataCancer <- cbind(stagec[,c(1:3,5:6)],
	                    gleason4 = 1*(stagec[,7] == 4),
	                    gleason5 = 1*(stagec[,7] == 5),
	                    gleason6 = 1*(stagec[,7] == 6),
	                    gleason7 = 1*(stagec[,7] == 7),
	                    gleason8 = 1*(stagec[,7] == 8),
	                    gleason910 = 1*(stagec[,7] >= 9),
	                    eet = 1*(stagec[,4] == 2),
	                    diploid = 1*(stagec[,8] == "diploid"),
	                    tetraploid = 1*(stagec[,8] == "tetraploid"),
	                    notAneuploid = 1-1*(stagec[,8] == "aneuploid"))
	# Remove the incomplete cases
	dataCancer <- dataCancer[complete.cases(dataCancer),]
	# Load a pre-established data frame with the names and descriptions of all variables
	data(cancerVarNames)
	# Rank the variables:
	# - Analyzing the raw data
	# - According to the zIDI
	rankedDataCancer <- univariateRankVariables(variableList = cancerVarNames,
	                                           formula = "Surv(pgtime, pgstat) ~ 1",
	                                           Outcome = "pgstat",
	                                           data = dataCancer, 
	                                           categorizationType = "Raw", 
	                                           type = "COX", 
	                                           rankingTest = "zIDI",
	                                           description = "Description")
	# Get a Cox proportional hazards model using:
	# - The top 7 ranked variables
	# - 10 bootstrap loops in the feature selection procedure
	# - The zIDI as the feature inclusion criterion
	# - 5 bootstrap loops in the backward elimination procedure
	# - A 5-fold cross-validation in the feature selection, 
	#           update, and backward elimination procedures
	# - A 10-fold cross-validation in the model validation procedure
	# - First order interactions in the update procedure
	cancerModel <- crossValidationFeatureSelection(size = 7,
	                                               loops = 10,
	                                               Outcome = "pgstat",
	                                               timeOutcome = "pgtime",
	                                               variableList = rankedDataCancer,
	                                               data = dataCancer,
	                                               type = "COX",
	                                               selectionType = "zIDI",
	                                               elimination.bootstrap.steps = 5,
	                                               trainRepetition = 5,
	                                               CVfolds = 10,
	                                               interaction = c(1,2))
	# Shut down the graphics device driver
	dev.off()
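
	# Once the example has finished, the components listed in the Value
	# section can be inspected, for instance:
	# Formulas of the models found at each cross-validation cycle
	cancerModel$formula.list
	# Spearman correlation of each fold model with the full model
	cancerModel$Train.correlationsToFull
	cancerModel$Blind.correlationsToFull
	# Bootstrap validation of the full model (class bootstrapValidation)
	cancerModel$FullModel.bootstrapped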
