Takes a model object, makes predictions, runs model diagnostics, and creates graphs and tables of the results.
model.diagnostics(model.obj = NULL, qdata.trainfn = NULL, qdata.testfn = NULL,
folder = NULL, MODELfn = NULL, response.name = NULL, unique.rowname = NULL,
diagnostic.flag=NULL, seed = NULL, prediction.type=NULL, MODELpredfn = NULL,
na.action = NULL, v.fold = 10, device.type = NULL, DIAGNOSTICfn = NULL,
res=NULL, jpeg.res = 72, device.width = 7, device.height = 7, units="in",
pointsize=12, cex=par()$cex, req.sens, req.spec, FPC, FNC, quantiles=NULL,
all=TRUE, subset = NULL, weights = NULL, mtry = NULL, controls = NULL,
xtrafo = NULL, ytrafo = NULL, scores = NULL)
The function will return a dataframe of the row ID and the observed and predicted values.
For binary response models, the predicted probability of presence is returned.
For categorical response models, the predicted category (by majority vote) is returned, as well as a column for each category giving the probability of that category. If necessary, make.names
is applied to the categories to create valid column names.
For continuous response models, the predicted value is returned.
If prediction.type = "CV"
the dataframe also includes a column indicating which cross-validation fold each datapoint was in.
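As a minimal hedged sketch (the model object model.obj.ex, the training dataframe qdata.train, and the "ID" column are hypothetical stand-ins, not objects supplied by this package), a cross-validation diagnostic run and the returned dataframe might look like:

# Hypothetical example: cross-validation diagnostics for a previously built
# random forest model (model.obj.ex) on the training dataframe qdata.train.
diag.pred <- model.diagnostics(model.obj       = model.obj.ex,
                               qdata.trainfn   = qdata.train,
                               folder          = getwd(),
                               MODELfn         = "VModelMapEx",
                               unique.rowname  = "ID",   # hypothetical row ID column
                               prediction.type = "CV",
                               v.fold          = 10,
                               seed            = 44,
                               device.type     = "pdf")

# Row ID, observed and predicted values, plus (because prediction.type = "CV")
# a column giving the cross-validation fold of each data point.
head(diag.pred)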
R
model object. The model object to use for prediction. The model object must be of type "RF"
(random forest), "QRF"
(quantile random forest), or "CF"
(conditional forest). The ModelMap
package does not currently support SGB
models.
String. The name (full path or base name with path specified by folder
) of the training data file used for building the model (file should include columns for both response and predictor variables). The file must be a comma-delimited file *.csv
with column headings. qdata.trainfn
can also be an R
dataframe. If predictions will be made (predict = TRUE
or map=TRUE
) the predictor column headers must match the names of the raster layer files, or a rastLUT
must be provided to match predictor columns to the appropriate raster and band. If qdata.trainfn = NULL
(the default), a GUI interface prompts user to browse to the training data file.
String. The name (full path or base name with path specified by folder
) of the independent data set for testing (validating) the model's predictions. The file must be a comma-delimited file ".csv"
with column headings and the column headings must be the same as those in the training data file. qdata.testfn
can also be an R
dataframe. If qdata.testfn = NULL
(default), a GUI interface asks the user if there is a test set available, then prompts the user to browse to the test data file. If no test set is desired (for example, if cross-fold validation will be performed or, for RF models, Out-Of-Bag estimation will be used), set qdata.testfn = FALSE
. If no test set is given, and qdata.testfn
is not set to FALSE
, the GUI interface asks if a proportion of the data should be set aside as an independent test set. If this is desired, the user will be prompted to specify the proportion to set aside as test data, and two new data files will be generated in the output folder. The new file names will be the original data file name with "_train"
and "_test"
appended to the end of the file names.
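For illustration only (the file paths below are hypothetical), the test set can be supplied as a *.csv file or a dataframe, or suppressed with FALSE when no independent test set is wanted:

# Hypothetical: validate against an independent test set stored as a *.csv file.
model.diagnostics(model.obj       = model.obj.ex,
                  qdata.trainfn   = "C:/ModelMap/qdata_train.csv",
                  qdata.testfn    = "C:/ModelMap/qdata_test.csv",
                  folder          = "C:/ModelMap",
                  unique.rowname  = "ID",
                  prediction.type = "TEST")

# Setting qdata.testfn = FALSE suppresses the GUI prompt when, for example,
# Out-Of-Bag estimation or cross validation will be used instead of a test set.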
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If folder = NULL
(default), a GUI interface prompts user to browse to a folder. To use the working directory, specify folder = getwd()
.
String. The file name to use to save the generated model object. If MODELfn = NULL
(the default), a default name is generated by pasting model.type_response.type_response.name
. If the other output filenames are left unspecified, MODELfn
will be used as the basic name to generate other output filenames. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folder
.
String. The name of the response variable used to build the model. The response.name
must be a column name from the training/test data files. If the model.obj
was constructed in ModelMap
with the model.build()
function, then the model.diagnostics()
can extract the response.name
from the model.obj
. If the model was constructed outside of ModelMap
then you may need to specify the response.name
. In particular, if an SGB model was constructed with the aid of Elith's code, it is necessary to specify the response.name
argument, as all models constructed with this code are given a response name of "y.data"
. If the response.name
argument differs from the response name in the model.obj
, the specified argument is given preference and a warning is generated.
String. The name of the unique identifier used to identify each row in the training data. If unique.rowname = NULL
, a GUI interface prompts user to select a variable from the list of column names from the training data file. If unique.rowname = FALSE
, a variable is generated of numbers from 1
to nrow(qdata)
to index each row.
String. The name of a column used to identify a subset of rows in the training data or test data to
use for model diagnostics. This column must be either a logical vector (TRUE
and FALSE
) or a vector of zeros and ones (where 0=FALSE
and 1=TRUE
). If this argument is used, model diagnostics that depend on predicted and observed values will be calculated from a subset of the training or test data. These include the confusion matrix and threshold criteria for binary response models and the scatterplot for continuous response models. The output file of predicted and observed values will have an additional column indicating which rows were used in the diagnostic calculations. Note that for cross validation, the entire training dataset will be used to create cross-validation predictions, but only the predictions on the rows indicated by diagnostic.flag
will be used for the diagnostics.
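An illustrative sketch (the SAMPLE_YEAR and DIAG column names are hypothetical): diagnostic.flag simply names a logical or 0/1 column already present in the data:

# Hypothetical: restrict diagnostic calculations to plots sampled in 2010 or later.
qdata.train$DIAG <- qdata.train$SAMPLE_YEAR >= 2010   # logical TRUE/FALSE column

model.diagnostics(model.obj       = model.obj.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  diagnostic.flag = "DIAG",
                  prediction.type = "CV")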
Integer. The number used to initialize randomization to build RF or SGB models. If you want to produce the same model later, use the same seed. If seed = NULL
(the default), a new seed is created each run.
String. Prediction type. "TEST"
, "CV"
, "OOB"
or "TRAIN"
. If prediction.type = "TEST"
, validation predictions will be made on the test set provided by qdata.testfn
. If prediction.type = "CV"
, cross validation will be used on the training data provided by qdata.trainfn
. If model.obj
is a Random Forest model and prediction.type = "OOB"
, the Out-of-Bag predictions will be calculated on the training data. If model.obj
is a Stochastic Gradient Boosting model and prediction.type = "TRAIN"
, the predictions will be calculated on the training data, but these predictions should be used with caution as they will lead to overly optimistic estimates of model quality. A *.csv
file of the unique id, observed, and predicted values is generated and put in the specified (or default) folder.
String. Model validation. A character string used to construct the output file names for the validation diagnostics, for example the prediction *.csv
file, and the graphics *.jpg
, *.pdf
and *.ps
files. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folder
. If MODELpredfn = NULL
(the default), a default name is created by pasting MODELfn
and "_pred"
.
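Assuming the default naming convention just described (MODELfn pasted with "_pred", so the file name below is an assumption rather than a documented constant), the prediction file can be read back in for further analysis:

# Assumes MODELfn = "VModelMapEx", the default "_pred" suffix, and output to getwd().
pred.table <- read.csv("VModelMapEx_pred.csv")
str(pred.table)   # unique id, observed, and predicted value columns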
String. Model validation. Specifies the action to take if there are NA
values in the predictor data or if there is a level or class of a categorical predictor variable in the validation test set, but not in the training data set. By default, model.diagnostics()
will use the same na.action
as was given to model.build
. There are 2 options: (1) na.action = "na.omit"
where any data point with NA
or any new levels for any of the factored predictors is removed from the data; (2) na.action = "na.roughfix"
where a missing categorical predictor is replaced with the most common category, and a missing continuous predictor is replaced with the median. Note: data points with missing response values will always be omitted.
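A brief hedged sketch of the second option (the data and model object names are hypothetical):

# Impute missing predictors: most common category for factors, median for
# continuous predictors. Use na.action = "na.omit" to drop such rows instead.
model.diagnostics(model.obj       = model.obj.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "OOB",
                  na.action       = "na.roughfix")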
Integer (or logical FALSE
). Model validation. The number of cross validation folds to use when making validation predictions on the training data. Only used if prediction.type = "CV"
.
String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics.
Current choices:
"default" | default graphics device | |||
"jpeg" | *.jpg files | |||
"none" | no graphics device generated | |||
"pdf" | *.pdf files | |||
"png" | *.png files | |||
"postscript" | *.ps files | |||
"tiff" | *.tif files |
String. Model validation. Name used as base to create names for output files from model validation diagnostics. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folder
. Defaults to DIAGNOSTICfn = MODELfn
followed by the appropriate suffixes (i.e. ".csv"
, ".jpg"
, etc...).
Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72 dpi, suitable for on-screen viewing. For printing, the suggested setting is 300 dpi.
Integer. Model validation. Deprecated. Ignored unless res
is not provided.
Integer. Model validation. The device width for diagnostic plots in inches.
Integer. Model validation. The device height for diagnostic plots in inches.
Model validation. The units in which device.height
and device.width
are given. Can be "px"
(pixels), "in"
(inches, the default), "cm"
or "mm"
.
Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at res
ppi.
Integer. Model validation. The cex for diagnostic plots.
Numeric. Model validation. The required sensitivity for threshold optimization for binary response model evaluation.
Numeric. Model validation. The required specificity for threshold optimization for binary response model evaluation.
Numeric. Model validation. The False Positive Cost for threshold optimization for binary response model evaluation.
Numeric. Model validation. The False Negative Cost for threshold optimization for binary response model evaluation.
Numeric Vector. QRF models. The quantiles to predict. A numeric vector with values between zero and one. If the model was built without specifying quantiles, quantile importance cannot be calculated, but quantiles
can still be used to specify prediction quantiles. If the model was built with quantiles specified, then the model quantiles will be used for the importance graph. If quantiles are not specified for model building or diagnostics, prediction quantiles will default to quantiles=c(0.1,0.5,0.9).
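A sketch for QRF models (the qrf.model.ex object is hypothetical), requesting specific prediction quantiles:

# Hypothetical QRF model: predict the 5th, 50th, and 95th percentiles.
model.diagnostics(model.obj       = qrf.model.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "OOB",
                  quantiles       = c(0.05, 0.50, 0.95),
                  all             = TRUE)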
Logical. QRF models. all=TRUE
uses all observations for prediction. all=FALSE
uses only a certain number of observations per node for prediction (set with argument obs). Unlike in the quantregForest package itself, the default in ModelMap is all=TRUE
, to more closely parallel ordinary random forest models.
CF models. NOT SUPPORTED. Only needed for prediction.type="CV"
for CF models. An optional vector specifying a subset of observations to be used in the fitting process. Note: subset
is not yet supported for cross validation diagnostics.
CF models. NOT SUPPORTED. Only needed for prediction.type="CV"
for CF models. An optional vector of weights to be used in the fitting process. Non-negative integer valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities weights/sum(weights)
. The fraction of observations to be sampled (without replacement) is computed based on the sum of the weights if all weights are integer-valued, and based on the number of weights greater than zero otherwise. Alternatively, weights
can be a double matrix defining case weights for all ncol(weights)
trees in the forest directly. This requires more storage but gives the user more control. Note: weights
is not yet supported for cross validation diagnostics.
Integer. Only needed for prediction.type="CV"
for CF models (for RF and QRF models mtry will be determined from the model object). Number of variables to try at each node of Random Forest trees.
CF models. Only needed for prediction.type="CV"
for CF models. An object of class ForestControl-class
, which can be obtained using cforest_control (and its convenience interfaces cforest_unbiased and cforest_classical). If controls
is specified, then the stand-alone arguments mtry
and ntree
are ignored and these parameters must be specified as part of the controls
argument. If controls
is not specified, model.build
defaults to cforest_unbiased(mtry=mtry, ntree=ntree)
with the values of mtry
and ntree
specified by the stand alone arguments.
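For cross validation of a CF model, the sketch below mirrors the default described above; the cf.model.ex object and the ntree and mtry values are arbitrary illustrations:

# Hypothetical CF model: cross-validation refits controlled by party::cforest_unbiased().
library(party)
model.diagnostics(model.obj       = cf.model.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "CV",
                  v.fold          = 10,
                  controls        = cforest_unbiased(ntree = 500, mtry = 5))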
CF models. Only needed for prediction.type="CV"
for CF models. A function to be applied to all input variables. By default, the ptrafo
function from the party
package is applied.
CF models. Only needed for prediction.type="CV"
for CF models. A function to be applied to all response variables. By default, the ptrafo
function from the party
package is applied.
CF models. NOT SUPPORTED. Only needed for prediction.type="CV"
for CF models. An optional named list of scores to be attached to ordered factors. Note: scores
is not yet supported for cross validation diagnostics.
Elizabeth Freeman and Tracey Frescino
model.diagnostics()
takes a model object, makes predictions, runs model diagnostics, and creates graphs and tables of the results.
model.diagnostics()
can be run in a traditional R command mode, where all arguments are specified in the function call. However, it can also be used in a full push-button mode, where you type in the simple command model.diagnostics()
, and GUI pop-up windows will ask questions about the type of model, the file locations of the data, etc...
When running model.diagnostics()
on non-Windows platforms, file names and folders need to be specified in the argument list, but other pushbutton selections are handled by the select.list()
function, which is platform independent.
Diagnostic predictions are made by one of four methods, and a text file is generated consisting of three columns: observation ID, observed values, and predicted values. If prediction.type = "CV"
, an additional column indicates which cross-validation fold each observation fell into. If the model's response type is categorical then, in addition to a column giving the category predicted by majority vote, there are also columns for each possible response category giving the proportion of trees that predicted that category.
A variable importance graph is made. If response.type = "categorical"
, category-specific graphs are generated for variable importance. These show how much the model accuracy for each category is affected when the values of each predictor variable are randomly permuted.
The package corrplot
is used to generate a plot of correlation between predictor variables. If there are highly correlated predictor variables, then the variable importances of "RF"
and "QRF"
models need to be interpreted with care, and users may want to consider looking at the conditional variable importances available for "CF"
models produced by the party
package.
If model.type = "RF"
, the OOB error is plotted as a function of number of trees in the model. If response.type = "binary"
or response.type = "categorical"
, category-specific graphs are generated for OOB error as a function of the number of trees.
If response.type = "binary"
, a summary graph is made using the PresenceAbsence
package, and *.csv
spreadsheets are created of thresholds optimized by several methods, with their associated error statistics and predicted prevalence.
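For binary models, the threshold optimization can be steered with a required sensitivity or specificity, or with relative error costs; the model object and values below are illustrative only:

# Hypothetical binary model: require 85% sensitivity and specificity when
# optimizing thresholds, and weight false negatives twice as heavily as
# false positives.
model.diagnostics(model.obj       = bin.model.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "OOB",
                  req.sens        = 0.85,
                  req.spec        = 0.85,
                  FPC             = 1,
                  FNC             = 2)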
If response.type = "continuous"
a scatterplot of observed vs. predicted values is created, with a simple linear regression line. The graph is labeled with the slope and intercept of this line, as well as Pearson's and Spearman's correlation coefficients.
If response.type = "categorical"
, a confusion matrix is generated that includes errors of omission and commission, as well as Kappa, Percent Correctly Classified (PCC), and the Multicategorical Area Under the Curve (MAUC) as defined by Hand and Till (2001) and calculated by the package HandTill2001
.
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.
Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171-186.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18--22.
Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat. 31:172-181.
get.test
, model.build
, model.mapmake