medianPredict: The median prediction from a list of models

Description

Given a list of model formulas, this function fits each model on the training data and returns the median of their predictions on a test data set. It can also provide a k-nearest neighbours (KNN) prediction using the features that appear in those models.
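The idea of a median prediction can be sketched in base R. This is only a minimal illustration under stated assumptions, not the package's implementation: it fits plain linear models with lm on the built-in mtcars data, and the formulas, the train/test split, and the object names (formulas, preds, medianPred) are chosen purely for the example.

```r
# Illustrative sketch only: median of the predictions from several lm fits.
# The formulas and the mtcars split below are assumptions for this example.
formulas <- list(mpg ~ wt, mpg ~ hp, mpg ~ wt + hp)
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Fit each formula on the training rows and predict the test rows.
# sapply binds the results into a matrix with one column per model.
preds <- sapply(formulas, function(f) {
  fit <- lm(f, data = train)
  predict(fit, newdata = test)
})

# Row-wise median across the models: one value per test case
medianPred <- apply(preds, 1, median)
```

Here preds plays the role of a per-model prediction matrix and medianPred the role of the median prediction vector; the median is less sensitive than the mean to a single model that predicts far from the others.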

Usage

medianPredict(formulaList,
	              trainData,
	              testData = NULL, 
	              predictType = c("prob", "linear"),
	              type = c("LOGIT", "LM", "COX"),
	              Outcome = "CLASS",
	              nk=0,
	              ...)

Arguments

formulaList

A list of objects of class formula, each representing a model to be fitted and used for prediction

trainData

A data frame with the data to train the model, where all variables are stored in different columns

testData

A data frame similar to trainData, but with the data set to be predicted. If NULL, trainData will be used

predictType

Prediction type: Probability ("prob") or linear predictor ("linear")

type

Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")

Outcome

The name of the column in the data that stores the outcome variable to be predicted by the models

nk

The number of neighbours used to generate the KNN classification. If zero, k is set to the square root of the number of cases. If less than zero, the KNN classification is not performed

...

Additional parameters for fitting a glm object
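The rule for nk described above can be sketched as a small helper. resolveK is a hypothetical function written for this illustration and is not part of the package; in particular, rounding the square root to the nearest integer is an assumption, since the text does not say how a non-integer root is handled.

```r
# Hypothetical helper, not part of the package: maps nk to the k used by KNN.
resolveK <- function(nk, nCases) {
  if (nk < 0) {
    return(NA_integer_)                       # negative nk: KNN is skipped
  }
  if (nk == 0) {
    return(as.integer(round(sqrt(nCases))))   # rounding is an assumption
  }
  as.integer(nk)                              # positive nk: used as given
}

resolveK(0, 100)   # 10
resolveK(5, 100)   # 5
resolveK(-1, 100)  # NA
```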

Value

medianPredict

A vector with the median prediction for the testData data set, using the models from formulaList

medianKNNPredict

A vector with the median prediction for the testData data set, using the KNN models

predictions

A matrix, where each column represents the predictions made with each model from formulaList

KNNpredictions

A matrix, where each column represents the predictions made with a different KNN model

Examples

# Start the graphics device driver to save all plots in a pdf format
	pdf(file = "Example.pdf")
	# Get the stage C prostate cancer data from the rpart package
	library(rpart)
	data(stagec)
	# Split the stages into several columns
	dataCancer <- cbind(stagec[,c(1:3,5:6)],
	                    gleason4 = 1*(stagec[,7] == 4),
	                    gleason5 = 1*(stagec[,7] == 5),
	                    gleason6 = 1*(stagec[,7] == 6),
	                    gleason7 = 1*(stagec[,7] == 7),
	                    gleason8 = 1*(stagec[,7] == 8),
	                    gleason910 = 1*(stagec[,7] >= 9),
	                    eet = 1*(stagec[,4] == 2),
	                    diploid = 1*(stagec[,8] == "diploid"),
	                    tetraploid = 1*(stagec[,8] == "tetraploid"),
	                    notAneuploid = 1-1*(stagec[,8] == "aneuploid"))
	# Remove the incomplete cases
	dataCancer <- dataCancer[complete.cases(dataCancer),]
	# Load a pre-established data frame with the names and descriptions of all variables
	data(cancerVarNames)
	# Rank the variables:
	# - Analyzing the raw data
	# - According to the zIDI
	rankedDataCancer <- univariateRankVariables(variableList = cancerVarNames,
	                                            formula = "Surv(pgtime, pgstat) ~ 1",
	                                            Outcome = "pgstat",
	                                            data = dataCancer,
	                                            categorizationType = "Raw",
	                                            type = "COX",
	                                            rankingTest = "zIDI",
	                                            description = "Description")
	# Get a Cox proportional hazards model using:
	# - The top 7 ranked variables
	# - 10 bootstrap loops in the feature selection procedure
	# - The zIDI as the feature inclusion criterion
	# - 5 bootstrap loops in the backward elimination procedure
	# - A 5-fold cross-validation in the feature selection, 
	#            update, and backward elimination procedures
	# - A 10-fold cross-validation in the model validation procedure
	# - First order interactions in the update procedure
	cancerModel <- crossValidationFeatureSelection(size = 7,
	                                               loops = 10,
	                                               Outcome = "pgstat",
	                                               timeOutcome = "pgtime",
	                                               variableList = rankedDataCancer,
	                                               data = dataCancer,
	                                               type = "COX",
	                                               selectionType = "zIDI",
	                                               elimination.bootstrap.steps = 5,
	                                               trainRepetition = 5,
	                                               CVfolds = 10,
	                                               interaction = c(1,2))
	# Get the median prediction:
	# - Without an independent test set
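	# - With nk = 0: per the Arguments section, k is then set to the square root of the
	#   number of cases (pass a negative nk to skip the KNN classification)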
	mp <- medianPredict(formulaList = cancerModel$formula.list,
	                    trainData = dataCancer,
	                    predictType = "prob",
	                    type = "COX",
	                    Outcome = "pgstat",
	                    nk=0)
	# Shut down the graphics device driver
	dev.off()
