FRESA.CAD (version 2.0.2)

univariateRankVariables: Univariate analysis of features

Description

This function reports the mean and standard deviation for each feature in a model, and ranks them according to a user-specified score. Additionally, it does a Kolmogorov-Smirnov (KS) test on the raw and z-standardized data. It also reports the raw and z-standardized t-test score, the p-value of the Wilcoxon rank-sum test, the integrated discrimination improvement (IDI), the net reclassification improvement (NRI), the net residual improvement (NeRI), and the area under the ROC curve (AUC). Furthermore, it reports the z-value of the variable significance on the fitted model.

Usage

univariateRankVariables(variableList,
	                        formula,
	                        Outcome,
	                        data, 
	                        categorizationType = c("Raw",
	                                               "Categorical",
	                                               "ZCategorical",
	                                               "RawZCategorical",
	                                               "RawTail",
	                                               "RawZTail"), 
	                        type = c("LOGIT", "LM", "COX"), 
	                        rankingTest = c("zIDI",
	                                        "zNRI",
	                                        "IDI",
	                                        "NRI",
	                                        "NeRI",
	                                        "Ztest",
	                                        "AUC",
	                                        "CStat",
	                                        "Kendall"), 
	                        cateGroups = c(0.1, 0.9),
	                        raw.dataFrame = NULL,
	                        description = ".",
	                        uniType = c("Binary","Regression"),
	                        fullAnalysis=TRUE)

Arguments

variableList
A data frame with the candidate variables to be ranked
formula
An object of class formula with the formula to be fitted
Outcome
The name of the column in data that stores the variable to be predicted by the model
data
A data frame where all variables are stored in different columns
categorizationType
How variables will be analyzed: As given in data ("Raw"); broken into the p-value categories given by cateGroups ("Categorical"); broken into the p-value categories given by cateGroups, and weighted
type
Fit type: Logistic ("LOGIT"), linear ("LM"), or Cox proportional hazards ("COX")
rankingTest
Variables will be ranked based on: The z-score of the IDI ("zIDI"), the z-score of the NRI ("zNRI"), the IDI ("IDI"), the NRI ("NRI"), the NeRI ("NeRI"), the z-score of the model fit ("Ztest"), the AUC ("AUC"), the Somers' rank
cateGroups
A vector of percentiles to be used for the categorization procedure
raw.dataFrame
A data frame similar to data, but with unadjusted data, used to get the means and variances of the unadjusted data
description
The name of the column in variableList that stores the variable description
uniType
Type of univariate analysis: Binary classification ("Binary") or regression ("Regression")
fullAnalysis
If FALSE, the function only orders the features according to their z-statistics from the linear model
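
To illustrate what ordering features by a model-fit statistic looks like, here is a minimal base R sketch. It is an assumption-laden illustration, not FRESA.CAD's internal code: it uses the built-in mtcars data, the hypothetical feature set wt, hp, and qsec, and the t statistic reported by lm as a stand-in for the z-statistic.

```r
# Sketch only: rank candidate features by the absolute test statistic of a
# one-variable linear model. The data set (mtcars) and the feature names are
# illustrative assumptions, not part of FRESA.CAD.
features <- c("wt", "hp", "qsec")

fitStat <- sapply(features, function(f) {
  # Fit mpg ~ <feature> and extract the statistic for that feature's slope
  fit <- lm(reformulate(f, response = "mpg"), data = mtcars)
  summary(fit)$coefficients[f, "t value"]
})

# Larger absolute statistic = stronger univariate association
ranked <- sort(abs(fitStat), decreasing = TRUE)
print(names(ranked))
```

The sign of the statistic is dropped before sorting so that strong negative associations rank as high as strong positive ones.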

Value

A sorted data frame. In the case of a binary classification analysis, the data frame will have the following columns:

  • Name: Name of the raw variable or of the dummy variable if the data has been categorized
  • parent: Name of the raw variable from which the dummy variable was created
  • descrip: Description of the parent variable, as defined in description
  • cohortMean: Mean value of the variable
  • cohortStd: Standard deviation of the variable
  • cohortKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable
  • cohortKSP: p-value associated with cohortKSD
  • caseMean: Mean value of cases (subjects with Outcome equal to 1)
  • caseStd: Standard deviation of cases
  • caseKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable only for cases
  • caseKSP: p-value associated with caseKSD
  • caseZKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable only for cases
  • caseZKSP: p-value associated with caseZKSD
  • controlMean: Mean value of controls (subjects with Outcome equal to 0)
  • controlStd: Standard deviation of controls
  • controlKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable only for controls
  • controlKSP: p-value associated with controlKSD
  • controlZKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable only for controls
  • controlZKSP: p-value associated with controlZKSD
  • t.Rawvalue: Normal inverse p-value (z-value) of the t-test performed on raw.dataFrame
  • t.Zvalue: z-value of the t-test performed on data
  • wilcox.Zvalue: z-value of the Wilcoxon rank-sum test performed on data
  • ZGLM: z-value returned by the lm, glm, or coxph functions for the z-standardized variable
  • zNRI: z-value returned by the improveProb function (Hmisc package) when evaluating the NRI
  • zIDI: z-value returned by the improveProb function (Hmisc package) when evaluating the IDI
  • zNeRI: z-value returned by the improvedResiduals function when evaluating the NeRI
  • ROCAUC: Area under the ROC curve returned by the roc function (pROC package)
  • cStatCorr: c index of Somers' rank correlation returned by the rcorr.cens function (Hmisc package)
  • NRI: NRI returned by the improveProb function (Hmisc package)
  • IDI: IDI returned by the improveProb function (Hmisc package)
  • NeRI: NeRI returned by the improvedResiduals function
  • kendall.r: Kendall $\tau$ rank correlation coefficient between the variable and the binary outcome
  • kendall.p: p-value associated with kendall.r
  • TstudentRes.p: p-value of the improvement in residuals, as evaluated by the paired t-test
  • WilcoxRes.p: p-value of the improvement in residuals, as evaluated by the paired Wilcoxon rank-sum test
  • FRes.p: p-value of the improvement in residual variance, as evaluated by the F-test
  • caseN_Z_Low_Tail: Number of cases in the low tail
  • caseN_Z_Hi_Tail: Number of cases in the top tail
  • controlN_Z_Low_Tail: Number of controls in the low tail
  • controlN_Z_Hi_Tail: Number of controls in the top tail

In the case of regression analysis, the data frame will have the following columns:

  • Name: Name of the raw variable or of the dummy variable if the data has been categorized
  • parent: Name of the raw variable from which the dummy variable was created
  • descrip: Description of the parent variable, as defined in description
  • cohortMean: Mean value of the variable
  • cohortStd: Standard deviation of the variable
  • cohortKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the variable
  • cohortKSP: p-value associated with cohortKSD
  • cohortZKSD: D statistic of the KS test when comparing a normal distribution and the distribution of the z-standardized variable
  • cohortZKSP: p-value associated with cohortZKSD
  • ZGLM: z-value returned by the glm or Cox procedure for the z-standardized variable
  • zNRI: z-value returned by the improveProb function (Hmisc package) when evaluating the NRI
  • NeRI: NeRI returned by the improvedResiduals function
  • cStatCorr: c index of Somers' rank correlation returned by the rcorr.cens function (Hmisc package)
  • spearman.r: Spearman $\rho$ rank correlation coefficient between the variable and the outcome
  • pearson.r: Pearson r product-moment correlation coefficient between the variable and the outcome
  • kendall.r: Kendall $\tau$ rank correlation coefficient between the variable and the outcome
  • kendall.p: p-value associated with kendall.r
  • TstudentRes.p: p-value of the improvement in residuals, as evaluated by the paired t-test
  • WilcoxRes.p: p-value of the improvement in residuals, as evaluated by the paired Wilcoxon rank-sum test
  • FRes.p: p-value of the improvement in residual variance, as evaluated by the F-test
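
For readers unfamiliar with the KS and Kendall columns, the following base R sketch shows how a D statistic with its p-value, and a Kendall correlation with its p-value, can be obtained for a single feature. The simulated feature x and binary outcome y are hypothetical, and comparing against a normal distribution with the sample mean and standard deviation is an assumption; FRESA.CAD may parameterize these tests differently.

```r
# Sketch only: simulated data, not FRESA.CAD output.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)       # hypothetical feature
y <- rbinom(100, size = 1, prob = 0.5)  # hypothetical binary outcome

# KS test of x against a normal distribution with matching mean and sd
ks  <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
ksD <- unname(ks$statistic)  # analogous to a *KSD column
ksP <- ks$p.value            # analogous to a *KSP column

# Kendall rank correlation between the feature and the outcome;
# exact = FALSE because a binary outcome always produces ties
kt       <- cor.test(x, y, method = "kendall", exact = FALSE)
kendallR <- unname(kt$estimate)  # analogous to kendall.r
kendallP <- kt$p.value           # analogous to kendall.p
```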

Details

This function will create valid dummy categorical variables if, and only if, data has been z-standardized. The p-values provided in cateGroups are converted to their corresponding z-scores, which are then used to create the categories. If the data are not z-standardized, the categorization analysis will return wrong results.
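
The conversion described above can be sketched in base R. This is an illustration of the idea under stated assumptions, not the package's internal implementation: the feature x is simulated, and the names zCuts, zTail, and rawTail are hypothetical.

```r
# Sketch only: convert percentile cut-offs to z-score thresholds and use
# them to split a feature into low / middle / high categories.
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)  # hypothetical raw feature
z <- (x - mean(x)) / sd(x)           # z-standardized version

cateGroups <- c(0.1, 0.9)
zCuts <- qnorm(cateGroups)           # percentiles -> z-score thresholds

# On z-standardized data the thresholds separate the two tails
zTail <- cut(z, breaks = c(-Inf, zCuts, Inf),
             labels = c("low", "middle", "high"))
print(table(zTail))

# On raw data the same thresholds are meaningless: every value of x is far
# above the upper threshold, so all observations land in "high"
rawTail <- cut(x, breaks = c(-Inf, zCuts, Inf),
               labels = c("low", "middle", "high"))
print(table(rawTail))
```

The second table shows why non-standardized input yields wrong categories: the thresholds live on the z-score scale, so they only make sense for data with mean 0 and standard deviation 1.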

References

Pencina, M. J., D'Agostino, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in medicine 27(2), 157-172.

Examples

# Start the graphics device driver to save all plots in a pdf format
	pdf(file = "Example.pdf")
	# Load FRESA.CAD, which provides univariateRankVariables and cancerVarNames
	library(FRESA.CAD)
	# Get the stage C prostate cancer data from the rpart package
	library(rpart)
	data(stagec)
	# Split the stages into several columns
	dataCancer <- cbind(stagec[,c(1:3,5:6)],
	                    gleason4 = 1*(stagec[,7] == 4),
	                    gleason5 = 1*(stagec[,7] == 5),
	                    gleason6 = 1*(stagec[,7] == 6),
	                    gleason7 = 1*(stagec[,7] == 7),
	                    gleason8 = 1*(stagec[,7] == 8),
	                    gleason910 = 1*(stagec[,7] >= 9),
	                    eet = 1*(stagec[,4] == 2),
	                    diploid = 1*(stagec[,8] == "diploid"),
	                    tetraploid = 1*(stagec[,8] == "tetraploid"),
	                    notAneuploid = 1-1*(stagec[,8] == "aneuploid"))
	# Remove the incomplete cases
	dataCancer <- dataCancer[complete.cases(dataCancer),]
	# Load a pre-established data frame with the names and descriptions of all variables
	data(cancerVarNames)
	# Rank the variables:
	# - Analyzing the raw data
	# - According to the zIDI
	rankedDataCancer <- univariateRankVariables(variableList = cancerVarNames,
	                                            formula = "Surv(pgtime, pgstat) ~ 1",
	                                            Outcome = "pgstat",
	                                            data = dataCancer, 
	                                            categorizationType = "Raw", 
	                                            type = "COX", 
	                                            rankingTest = "zIDI",
	                                            description = "Description")
	# Shut down the graphics device driver
	dev.off()