Takes a model object, makes predictions, runs model diagnostics, and creates graphs and tables of the results.
model.diagnostics(model.obj = NULL, qdata.trainfn = NULL, qdata.testfn = NULL,
folder = NULL, MODELfn = NULL, response.name = NULL, unique.rowname = NULL,
diagnostic.flag=NULL, seed = NULL, prediction.type=NULL, MODELpredfn = NULL,
na.action = NULL, v.fold = 10, device.type = NULL, DIAGNOSTICfn = NULL,
res=NULL, jpeg.res = 72, device.width = 7, device.height = 7, units="in",
pointsize=12, cex=par()$cex, req.sens, req.spec, FPC, FNC, quantiles=NULL,
all=TRUE, subset = NULL, weights = NULL, mtry = NULL, controls = NULL,
xtrafo = NULL, ytrafo = NULL, scores = NULL)
The function will return a dataframe of the row ID and the observed and predicted values.
For binary response models, the predicted probability of presence is returned.
For categorical response models, the predicted category (by majority vote) is returned, as well as a column for each category giving the probability of that category. If necessary, make.names
is applied to the categories to create valid column names.
For continuous response models, the predicted value is returned.
If prediction.type = "CV"
the dataframe also includes a column indicating which cross-validation fold each datapoint was in.
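As a minimal hedged sketch (the model object model.obj.ex, the training dataframe qdata.train, and the "ID" column are hypothetical stand-ins, not objects supplied by this package), a cross-validation diagnostic run and the returned dataframe might look like:

# Hypothetical example: cross-validation diagnostics for a previously built
# random forest model (model.obj.ex) on the training dataframe qdata.train.
diag.pred <- model.diagnostics(model.obj       = model.obj.ex,
                               qdata.trainfn   = qdata.train,
                               folder          = getwd(),
                               MODELfn         = "VModelMapEx",
                               unique.rowname  = "ID",   # hypothetical row ID column
                               prediction.type = "CV",
                               v.fold          = 10,
                               seed            = 44,
                               device.type     = "pdf")

# Row ID, observed and predicted values, plus (because prediction.type = "CV")
# a column giving the cross-validation fold of each data point.
head(diag.pred)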
R
model object. The model object to use for prediction. The model object must be of type "RF"
(random forest), "QRF"
(quantile random forest), or "CF"
(conditional forest). The ModelMap
package does not currently support SGB
models.
String. The name (full path or base name with path specified by folder
) of the training data file used for building the model (file should include columns for both response and predictor variables). The file must be a comma-delimited file *.csv
with column headings. qdata.trainfn
can also be an R
dataframe. If predictions will be made (predict = TRUE
or map=TRUE
) the predictor column headers must match the names of the raster layer files, or a rastLUT
must be provided to match predictor columns to the appropriate raster and band. If qdata.trainfn = NULL
(the default), a GUI interface prompts user to browse to the training data file.
String. The name (full path or base name with path specified by folder
) of the independent data set for testing (validating) the model's predictions. The file must be a comma-delimited file ".csv"
with column headings and the column headings must be the same as those in the training data file. qdata.testfn
can also be an R
dataframe. If qdata.testfn = NULL
(default), a GUI interface asks the user if there is a test set available, then prompts the user to browse to the test data file. If no test set is desired (for example, if cross-fold validation will be performed or, for RF models, Out-Of-Bag estimation will be used), set qdata.testfn = FALSE
. If no test set is given, and qdata.testfn
is not set to FALSE
, the GUI interface asks if a proportion of the data should be set aside as an independent test set. If this is desired, the user will be prompted to specify the proportion to set aside as test data, and two new data files will be generated in the output folder. The new file names will be the original data file name with "_train"
and "_test"
appended to the end of the file names.
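For illustration only (the file paths below are hypothetical), the test set can be supplied as a *.csv file or a dataframe, or suppressed with FALSE when no independent test set is wanted:

# Hypothetical: validate against an independent test set stored as a *.csv file.
model.diagnostics(model.obj       = model.obj.ex,
                  qdata.trainfn   = "C:/ModelMap/qdata_train.csv",
                  qdata.testfn    = "C:/ModelMap/qdata_test.csv",
                  folder          = "C:/ModelMap",
                  unique.rowname  = "ID",
                  prediction.type = "TEST")

# Setting qdata.testfn = FALSE suppresses the GUI prompt when, for example,
# Out-Of-Bag estimation or cross validation will be used instead of a test set.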
String. The folder used for all output from predictions and/or maps. Do not add ending slash to path string. If folder = NULL
(default), a GUI interface prompts user to browse to a folder. To use the working directory, specify folder = getwd()
.
String. The file name to use to save the generated model object. If MODELfn = NULL
(the default), a default name is generated by pasting model.type_response.type_response.name
. If the other output filenames are left unspecified, MODELfn
will be used as the basic name to generate other output filenames. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folder
.
String. The name of the response variable used to build the model. The response.name
must be a column name from the training/test data files. If the model.obj
was constructed in ModelMap
with the model.build()
function, then the model.diagnostics()
can extract the response.name
from the model.obj
. If the model was constructed outside of ModelMap
then you may need to specify the response.name
. In particular, if an SGB model was constructed with the aid of Elith's code, it is necessary to specify the response.name
argument, as all models constructed with this code are given a response name of "y.data"
. If the response.name
argument differs from the response name in the model.obj
, the specified argument is given preference and a warning is generated.
String. The name of the unique identifier used to identify each row in the training data. If unique.rowname = NULL
, a GUI interface prompts user to select a variable from the list of column names from the training data file. If unique.rowname = FALSE
, a variable is generated of numbers from 1
to nrow(qdata)
to index each row.
String. The name of a column used to identify a subset of rows in the training data or test data to
use for model diagnostics. This column must be either a logical vector (TRUE
and FALSE
) or a vector of zeros and ones (where 0=FALSE
and 1=TRUE
). If this argument is used, model diagnostics that depend on predicted and observed values will be calculated from a subset of the training or test data. These include the confusion matrix and threshold criteria for binary response models and the scatterplot for continuous response models. The output file of predicted and observed values will have an additional column indicating which rows were used in the diagnostic calculations. Note that for cross validation, the entire training dataset will be used to create cross-validation predictions, but only the predictions on the rows indicated by diagnostic.flag
will be used for the diagnostics.
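An illustrative sketch (the SAMPLE_YEAR and DIAG column names are hypothetical): diagnostic.flag simply names a logical or 0/1 column already present in the data:

# Hypothetical: restrict diagnostic calculations to plots sampled in 2010 or later.
qdata.train$DIAG <- qdata.train$SAMPLE_YEAR >= 2010   # logical TRUE/FALSE column

model.diagnostics(model.obj       = model.obj.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  diagnostic.flag = "DIAG",
                  prediction.type = "CV")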
Integer. The number used to initialize randomization to build RF or SGB models. If you want to produce the same model later, use the same seed. If seed = NULL
(the default), a new seed is created each run.
String. Prediction type. "TEST"
, "CV"
, "OOB"
or "TRAIN"
. If prediction.type = "TEST"
, validation predictions will be made on the test set provided by qdata.testfn
. If prediction.type = "CV"
, cross validation will be used on the training data provided by qdata.trainfn
. If model.obj
is a Random Forest model and prediction.type = "OOB"
, the Out-of-Bag predictions will be calculated on the training data. If model.obj
is a Stochastic Gradient Boosting model and prediction.type = "TRAIN"
, the predictions will be calculated on the training data, but these predictions should be used with caution as they will lead to overly optimistic estimates of model quality. A *.csv
file of the unique id, observed, and predicted values is generated and put in the specified (or default) folder.
String. Model validation. A character string used to construct the output file names for the validation diagnostics, for example the prediction *.csv
file, and the graphics *.jpg
, *.pdf
and *.ps
files. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folder
. If MODELpredfn = NULL
(the default), a default name is created by pasting MODELfn
and "_pred"
.
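Assuming the default naming convention just described (MODELfn pasted with "_pred", so the file name below is an assumption rather than a documented constant), the prediction file can be read back in for further analysis:

# Assumes MODELfn = "VModelMapEx", the default "_pred" suffix, and output to getwd().
pred.table <- read.csv("VModelMapEx_pred.csv")
str(pred.table)   # unique id, observed, and predicted value columns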
String. Model validation. Specifies the action to take if there are NA
values in the predictor data or if there is a level or class of a categorical predictor variable in the validation test set, but not in the training data set. By default, model.diagnostics()
will use the same na.action
as was given to model.build
. There are 2 options: (1) na.action = "na.omit"
where any data point with NA
or any new levels for any of the factored predictors is removed from the data; (2) na.action = "na.roughfix"
where a missing categorical predictor is replaced with the most common category, and a missing continuous predictor is replaced with the median. Note: data points with missing response values will always be omitted.
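A brief hedged sketch of the second option (the data and model object names are hypothetical):

# Impute missing predictors: most common category for factors, median for
# continuous predictors. Use na.action = "na.omit" to drop such rows instead.
model.diagnostics(model.obj       = model.obj.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "OOB",
                  na.action       = "na.roughfix")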
Integer (or logical FALSE
). Model validation. The number of cross validation folds to use when making validation predictions on the training data. Only used if prediction.type = "CV"
.
String or vector of strings. Model validation. One or more device types for graphical output from model validation diagnostics.
Current choices:
"default" | default graphics device | |||
"jpeg" | *.jpg files | |||
"none" | no graphics device generated | |||
"pdf" | *.pdf files | |||
"png" | *.png files | |||
"postscript" | *.ps files | |||
"tiff" | *.tif files |
String. Model validation. Name used as base to create names for output files from model validation diagnostics. The filename can be the full path, or it can be the simple basename, in which case the output will be to the folder specified by folder
. Defaults to DIAGNOSTICfn = MODELfn
followed by the appropriate suffixes (i.e. ".csv"
, ".jpg"
, etc...).
Integer. Model validation. Pixels per inch for jpeg, png, and tiff plots. The default is 72 dpi, suitable for on-screen viewing. For printing, the suggested setting is 300 dpi.
Integer. Model validation. Deprecated. Ignored unless res
is not provided.
Integer. Model validation. The device width for diagnostic plots in inches.
Integer. Model validation. The device height for diagnostic plots in inches.
Model validation. The units in which device.height
and device.width
are given. Can be "px"
(pixels), "in"
(inches, the default), "cm"
or "mm"
.
Integer. Model validation. The default pointsize of plotted text, interpreted as big points (1/72 inch) at res
ppi.
Integer. Model validation. The cex for diagnostic plots.
Numeric. Model validation. The required sensitivity for threshold optimization for binary response model evaluation.
Numeric. Model validation. The required specificity for threshold optimization for binary response model evaluation.
Numeric. Model validation. The False Positive Cost for threshold optimization for binary response model evaluation.
Numeric. Model validation. The False Negative Cost for threshold optimization for binary response model evaluation.
Numeric Vector. QRF models. The quantiles to predict. A numeric vector with values between zero and one. If the model was built without specifying quantiles, quantile importance cannot be calculated, but quantiles
can still be used to specify prediction quantiles. If the model was built with quantiles specified, then the model quantiles will be used for the importance graph. If quantiles are not specified for model building or diagnostics, prediction quantiles will default to quantiles=c(0.1,0.5,0.9).
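A sketch for QRF models (the qrf.model.ex object is hypothetical), requesting specific prediction quantiles:

# Hypothetical QRF model: predict the 5th, 50th, and 95th percentiles.
model.diagnostics(model.obj       = qrf.model.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "OOB",
                  quantiles       = c(0.05, 0.50, 0.95),
                  all             = TRUE)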
Logical. QRF models. all=TRUE
uses all observations for prediction. all=FALSE
uses only a certain number of observations per node for prediction (set with argument obs). Unlike in the quantregForest package itself, the default in ModelMap is all=TRUE
, to more closely parallel ordinary random forest models.
CF models. NOT SUPPORTED. Only needed for prediction.type="CV"
for CF models. An optional vector specifying a subset of observations to be used in the fitting process. Note: subset
is not yet supported for cross validation diagnostics.
CF models. NOT SUPPORTED. Only needed for prediction.type="CV"
for CF models. An optional vector of weights to be used in the fitting process. Non-negative integer valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities weights/sum(weights)
. The fraction of observations to be sampled (without replacement) is computed based on the sum of the weights if all weights are integer-valued, and based on the number of weights greater than zero otherwise. Alternatively, weights
can be a double matrix defining case weights for all ncol(weights)
trees in the forest directly. This requires more storage but gives the user more control. Note: weights
is not yet supported for cross validation diagnostics.
Integer. Only needed for prediction.type="CV"
for CF models (for RF and QRF models mtry will be determined from the model object). Number of variables to try at each node of Random Forest trees.
CF models. Only needed for prediction.type="CV"
for CF models. An object of class ForestControl-class
, which can be obtained using cforest_control (and its convenience interfaces cforest_unbiased and cforest_classical). If controls
is specified, then the stand-alone arguments mtry
and ntree
are ignored and these parameters must be specified as part of the controls
argument. If controls
is not specified, model.build
defaults to cforest_unbiased(mtry=mtry, ntree=ntree)
with the values of mtry
and ntree
specified by the stand alone arguments.
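For cross validation of a CF model, the sketch below mirrors the default described above; the cf.model.ex object and the ntree and mtry values are arbitrary illustrations:

# Hypothetical CF model: cross-validation refits controlled by party::cforest_unbiased().
library(party)
model.diagnostics(model.obj       = cf.model.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "CV",
                  v.fold          = 10,
                  controls        = cforest_unbiased(ntree = 500, mtry = 5))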
CF models. Only needed for prediction.type="CV"
for CF models. A function to be applied to all input variables. By default, the ptrafo
function from the party
package is applied.
CF models. Only needed for prediction.type="CV"
for CF models. A function to be applied to all response variables. By default, the ptrafo
function from the party
package is applied.
CF models. NOT SUPPORTED. Only needed for prediction.type="CV"
for CF models. An optional named list of scores to be attached to ordered factors. Note: scores
is not yet supported for cross validation diagnostics.
Elizabeth Freeman and Tracey Frescino
model.diagnostics()
takes a model object, makes predictions, runs model diagnostics, and creates graphs and tables of the results.
model.diagnostics()
can be run in a traditional R command mode, where all arguments are specified in the function call. However, it can also be used in a full push-button mode, where you type in the simple command model.diagnostics()
, and GUI pop-up windows will ask questions about the type of model, the file locations of the data, etc...
When running model.diagnostics()
on non-Windows platforms, file names and folders need to be specified in the argument list, but other pushbutton selections are handled by the select.list()
function, which is platform independent.
Diagnostic predictions are made by one of four methods, and a text file is generated consisting of three columns: observation ID, observed values, and predicted values. If prediction.type = "CV"
, an additional column indicates which cross-validation fold each observation fell into. If the model's response type is categorical then, in addition to a column giving the category predicted by majority vote, there are also columns for each possible response category giving the proportion of trees that predicted that category.
A variable importance graph is made. If response.type = "categorical"
, category-specific graphs are generated for variable importance. These show how much the model accuracy for each category is affected when the values of each predictor variable are randomly permuted.
The package corrplot
is used to generate a plot of correlation between predictor variables. If there are highly correlated predictor variables, then the variable importances of "RF"
and "QRF"
models need to be interpreted with care, and users may want to consider looking at the conditional variable importances available for "CF"
models produced by the party
package.
If model.type = "RF"
, the OOB error is plotted as a function of number of trees in the model. If response.type = "binary"
or response.type = "categorical"
, category-specific graphs are generated for OOB error as a function of the number of trees.
If response.type = "binary"
, a summary graph is made using the PresenceAbsence
package, and *.csv
spreadsheets are created of thresholds optimized by several methods, with their associated error statistics and predicted prevalence.
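For binary models, the threshold optimization can be steered with a required sensitivity or specificity, or with relative error costs; the model object and values below are illustrative only:

# Hypothetical binary model: require 85% sensitivity and specificity when
# optimizing thresholds, and weight false negatives twice as heavily as
# false positives.
model.diagnostics(model.obj       = bin.model.ex,
                  qdata.trainfn   = qdata.train,
                  unique.rowname  = "ID",
                  prediction.type = "OOB",
                  req.sens        = 0.85,
                  req.spec        = 0.85,
                  FPC             = 1,
                  FNC             = 2)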
If response.type = "continuous"
a scatterplot of observed vs. predicted values is created, with a simple linear regression line. The graph is labeled with the slope and intercept of this line, as well as Pearson's and Spearman's correlation coefficients.
If response.type = "categorical"
, a confusion matrix is generated that includes errors of omission and commission, as well as Kappa, Percent Correctly Classified (PCC), and the Multicategorical Area Under the Curve (MAUC) as defined by Hand and Till (2001) and calculated by the package HandTill2001
.
Breiman, L. (2001) Random Forests. Machine Learning, 45:5-32.
Elith, J., Leathwick, J. R. and Hastie, T. (2008). A working guide to boosted regression trees. Journal of Animal Ecology. 77:802-813.
Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171-186.
Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News 2(3), 18--22.
Ridgeway, G. (1999). The state of boosting. Comp. Sci. Stat. 31:172-181.
get.test
, model.build
, model.mapmake