Build an ensemble SDM that assembles multiple algorithms for a single species. The function takes as inputs an occurrence data frame made of presence/absence or presence-only records and a raster object for data extraction and projection. The function returns an S4 Ensemble.SDM class object containing the habitat suitability map, the binary map, the between-algorithm variance map and the associated evaluation tables (model evaluation, algorithm evaluation, algorithm correlation matrix and variable importance).
ensemble_modelling(algorithms, Occurrences, Env, Xcol = "Longitude",
Ycol = "Latitude", Pcol = NULL, rep = 10, name = NULL,
save = FALSE, path = getwd(), PA = NULL, cv = "holdout",
cv.param = c(0.7, 1), thresh = 1001, metric = "SES",
axes.metric = "Pearson", uncertainty = TRUE, tmp = FALSE,
ensemble.metric = c("AUC"), ensemble.thresh = c(0.75),
weight = TRUE, verbose = TRUE, GUI = FALSE, ...)
An S4 Ensemble.SDM class object viewable with the plot.model function.
algorithms: character. A character vector specifying the algorithm name(s) to be run (see details below).
Occurrences: data frame. Occurrences table (can be processed first by load_occ).
Env: raster object. RasterStack object of environmental variables (can be processed first by load_var).
Xcol: character. Name of the column in the occurrence table containing Longitude or X coordinates.
Ycol: character. Name of the column in the occurrence table containing Latitude or Y coordinates.
Pcol: character. Name of the column in the occurrence table specifying whether a line is a presence or an absence. A value of 1 is a presence and a value of 0 is an absence. If NULL, a presence-only dataset is assumed.
rep: integer. Number of repetitions for each algorithm.
name: character. Optional name given to the final Ensemble.SDM produced (by default 'Ensemble.SDM').
save: logical. If TRUE, the ensemble SDM is automatically saved.
path: character. If save is TRUE, the path to the directory in which the ensemble SDM will be saved.
PA: list(nb, strat) defining the pseudo-absence selection strategy used for a presence-only dataset. If PA is NULL, the recommended PA selection strategy is used depending on the algorithm (see details below).
cv: character. Method of cross-validation used to evaluate the ensemble SDM (see details below).
cv.param: numeric. Parameters associated with the method of cross-validation used to evaluate the ensemble SDM (see details below).
thresh: numeric. A single integer value representing the number of equal interval threshold values between 0 and 1.
metric: character. Metric used to compute the binary map threshold (see details below).
axes.metric: Metric used to evaluate variable relative importance (see details below).
uncertainty: logical. If TRUE, generates an uncertainty map and an algorithm correlation matrix.
tmp: logical. If TRUE, the habitat suitability map of each algorithm is saved in a temporary file to release memory. Beware: if you close R, temporary files will be deleted. To avoid any loss, you can save your ensemble SDM with save.model. Depending on the number, resolution and extent of models, temporary files can take a lot of disk space. Temporary files are written in the R environment temporary folder.
ensemble.metric: character. Metric(s) used to select the best SDMs that will be included in the ensemble SDM (see details below).
ensemble.thresh: numeric. Threshold(s) associated with the metric(s) used to compute the selection.
weight: logical. If TRUE, SDMs are weighted using the ensemble metric or, alternatively, the mean of the selection metrics.
verbose: logical. If TRUE, allows the function to print text in the console.
GUI: logical. Do not take this argument into account (parameter for the user interface).
...: additional parameters for the algorithm modelling function (see details below).
Generalized linear model (GLM): Uses the glm function from the package 'stats'. You can set the following parameters (see glm for more details):
character. Test used to evaluate the SDM, default 'AIC'.
numeric. Positive convergence tolerance eps; the iterations converge when |dev - dev_old|/(|dev| + 0.1) < eps. By default, set to 10e-08.
numeric. Integer giving the maximal number of IWLS (Iteratively Weighted Least Squares) iterations, default 500.
Generalized additive model (GAM): Uses the gam function from the package 'mgcv'. You can set the following parameters (see gam for more details):
character. Test used to evaluate the model, default 'AIC'.
numeric. This is used for judging convergence of the GLM IRLS (Iteratively Reweighted Least Squares) loop, default 10e-08.
numeric. Maximum number of IRLS iterations to perform, default 500.
Multivariate adaptive regression splines (MARS): Uses the earth function from the package 'earth'. You can set the following parameters (see earth for more details):
integer. Maximum degree of interaction (Friedman's mi); 1 builds an additive model (i.e., no interaction terms). By default, set to 2.
Generalized boosted regression model (GBM): Uses the gbm function from the package 'gbm'. You can set the following parameters (see gbm for more details):
integer. The total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. By default, set to 2500.
integer. Minimum number of observations in the trees' terminal nodes. Note that this is the actual number of observations, not the total weight. By default, set to 1.
integer. Number of cross-validations, default 3.
numeric. Shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. By default, set to 1e-03.
Classification tree analysis (CTA): Uses the rpart function from the package 'rpart'. You can set the following parameters (see rpart for more details):
integer. The minimum number of observations in any terminal node, default 1.
integer. Number of cross-validations, default 3.
Random forest (RF): Uses the randomForest function from the package 'randomForest'. You can set the following parameters (see randomForest for more details):
integer. Number of trees to grow. This should not be set too small, to ensure that every input row gets predicted at least a few times. By default, set to 2500.
integer. Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). By default, set to 1.
Maximum entropy (MAXENT): Uses the maxent function from the package 'dismo'. Make sure that the maxent.jar file, available at https://www.cs.princeton.edu/~schapire/maxent/, is correctly installed in the folder ~\R\library\version\dismo\java (see maxent for more details).
Artificial neural network (ANN): Uses the nnet function from the package 'nnet'. You can set the following parameters (see nnet for more details):
integer. Maximum number of iterations, default 500.
Support vector machines (SVM): Uses the svm function from the package 'e1071'. You can set the following parameters (see svm for more details):
float. Epsilon parameter in the insensitive loss function, default 1e-08.
integer. If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the quality of the model: the accuracy rate for classification and the Mean Squared Error for regression. By default, set to 3.
Depending on the raster object resolution, the process can be more or less time and memory consuming.
algorithms: 'all' calls all the following algorithms. Algorithms include Generalized linear model (GLM), Generalized additive model (GAM), Multivariate adaptive regression splines (MARS), Generalized boosted regression model (GBM), Classification tree analysis (CTA), Random forest (RF), Maximum entropy (MAXENT), Artificial neural network (ANN), and Support vector machines (SVM). Each algorithm has its own parameters, settable with the ... argument (see each algorithm section to set their parameters).
PA: list with two values: nb, the number of pseudo-absences selected, and strat, the strategy used to select pseudo-absences (either random selection or disk selection). The default follows the recommendations of Barbet-Massin et al. (2012) (see references).
cv: Cross-validation method used to split the occurrence dataset used for evaluation. holdout: data are partitioned into a training set and an evaluation set using a fraction (cv.param[1]), and the operation can be repeated (cv.param[2]) times. k-fold: data are partitioned into k (cv.param[1]) folds, each fold being k-1 times in the training set and once in the evaluation set, and the operation can be repeated (cv.param[2]) times. LOO (Leave One Out): each point is successively taken as evaluation data.
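The partitioning schemes described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's internal code: the helpers holdout_split and kfold_split are hypothetical names, and the record count is a toy stand-in for an occurrence table.

```r
# Hypothetical helpers illustrating the 'holdout' and 'k-fold' schemes;
# they are not part of the package.
holdout_split <- function(n, fraction) {
  # Draw round(fraction * n) row indices for training; the rest is for evaluation.
  train <- sort(sample.int(n, size = round(fraction * n)))
  list(train = train, eval = setdiff(seq_len(n), train))
}

kfold_split <- function(n, k) {
  # Assign each row to one of k folds of near-equal size, in random order.
  fold <- sample(rep_len(seq_len(k), n))
  # Fold i is the evaluation set once; the other k - 1 folds form the training set.
  lapply(seq_len(k), function(i) {
    list(train = which(fold != i), eval = which(fold == i))
  })
}

set.seed(42)
n <- 20                          # toy number of occurrence records
h <- holdout_split(n, 0.7)       # analogous to cv = "holdout", cv.param = c(0.7, 1)
f <- kfold_split(n, 5)           # analogous to cv = "k-fold", cv.param = c(5, 1)
length(h$train)                  # 14 training records
lengths(lapply(f, `[[`, "eval")) # 4 evaluation records per fold
```

Leave One Out corresponds to calling kfold_split with k equal to the number of records, so that each fold holds a single evaluation point.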
metric: Choice of the metric used to compute the binary map threshold and the confusion matrix (by default SES, as recommended by Liu et al. (2005), see references below): Kappa maximizes the Kappa; CCR maximizes the proportion of correctly predicted observations; TSS (True Skill Statistic) maximizes the sum of sensitivity and specificity; SES uses the sensitivity-specificity equality; LW uses the lowest occurrence prediction probability; ROC minimizes the distance between the ROC (receiver operating characteristic) curve and the upper left corner (0,1) of the ROC plot.
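To make the threshold search concrete, the sketch below scans equally spaced candidate thresholds between 0 and 1 and picks one by the sensitivity-specificity equality (SES) or by the sum of sensitivity and specificity (TSS). It is only an illustration: pick_threshold is a hypothetical helper, the observations and predictions are toy values, and the package may compute its thresholds differently.

```r
# Hypothetical helper: choose a binary-map threshold from equally spaced
# candidates. Not the package's implementation.
pick_threshold <- function(obs, pred, n_thresh = 1001, metric = c("SES", "TSS")) {
  metric <- match.arg(metric)
  cand <- seq(0, 1, length.out = n_thresh)   # equal-interval thresholds in [0, 1]
  sens <- vapply(cand, function(t) mean(pred[obs == 1] >= t), numeric(1))
  spec <- vapply(cand, function(t) mean(pred[obs == 0] <  t), numeric(1))
  best <- switch(metric,
    SES = which.min(abs(sens - spec)),       # sensitivity-specificity equality
    TSS = which.max(sens + spec - 1)         # maximize sensitivity + specificity
  )
  c(threshold = cand[best], sensitivity = sens[best], specificity = spec[best])
}

# Toy data: four presences and four absences with predicted suitability.
obs  <- c(1, 1, 1, 1, 0, 0, 0, 0)
pred <- c(0.95, 0.85, 0.65, 0.45, 0.75, 0.35, 0.25, 0.15)
ses  <- pick_threshold(obs, pred, n_thresh = 11, metric = "SES")
ses  # threshold 0.5, where sensitivity and specificity both equal 0.75
```

With only 11 candidates the grid is coarse; the default of 1001 candidates in the sketch mirrors the thresh argument and gives a step of 0.001.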
axes.metric: Metric used to evaluate the variable relative importance (difference between a full model and one with each variable successively omitted): Pearson (computes a simple Pearson's correlation r between predictions of the full model and the one without a variable, and returns the score 1-r: the higher the value, the more influence the variable has on the model), AUC, Kappa, sensitivity, specificity, and prop.correct (proportion of correctly predicted occurrences).
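The Pearson score 1-r can be illustrated with a plain GLM on simulated data. This is a sketch only: the variable names temp and rain and the simulated responses are assumptions made for the example, and the package applies the same idea to whichever algorithm is being run.

```r
# Simulated presence/absence data with one influential and one weak predictor.
set.seed(1)
n  <- 200
df <- data.frame(temp = rnorm(n), rain = rnorm(n))
df$presence <- rbinom(n, 1, plogis(2 * df$temp + 0.1 * df$rain))

# Full model and its predictions.
full      <- glm(presence ~ temp + rain, data = df, family = binomial)
pred_full <- predict(full, type = "response")

# Refit with each variable omitted in turn and score it as 1 - r.
vars <- c("temp", "rain")
importance <- vapply(vars, function(v) {
  reduced <- glm(reformulate(setdiff(vars, v), response = "presence"),
                 data = df, family = binomial)
  1 - cor(pred_full, predict(reduced, type = "response"), method = "pearson")
}, numeric(1))

importance   # the larger the score, the more the variable influences the model
```

Because temp drives the simulated response, dropping it changes the predictions far more than dropping rain, so its score is the larger of the two.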
ensemble.metric: Ensemble metric(s) used to select SDMs: AUC, Kappa, sensitivity, specificity, and prop.correct (proportion of correctly predicted occurrences).
...: See the individual algorithm sections for details.
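The selection and weighting step can be pictured with a few lines of base R. The sketch below is an assumption-laden illustration rather than the package's code: the suitability values and AUC scores are toy numbers, the strict comparison against the threshold is a guess, and a per-cell variance stands in for the uncertainty map.

```r
# Toy suitability predictions (rows = cells, columns = individual SDMs).
suitability <- cbind(
  GLM = c(0.2, 0.6, 0.9),
  CTA = c(0.4, 0.5, 0.7),
  SVM = c(0.9, 0.1, 0.3)
)
auc <- c(GLM = 0.90, CTA = 0.80, SVM = 0.60)   # toy evaluation scores

# Keep SDMs whose score exceeds the threshold (analogous to
# ensemble.metric = "AUC", ensemble.thresh = 0.75). Whether the package uses
# a strict or non-strict comparison is an assumption here.
keep <- auc > 0.75

# Weighted mean of the retained SDMs, with weights proportional to AUC
# (analogous to weight = TRUE).
w <- auc[keep] / sum(auc[keep])
ensemble <- as.vector(suitability[, keep, drop = FALSE] %*% w)

# Between-model variance per cell, a simple stand-in for an uncertainty map.
uncertainty <- apply(suitability[, keep, drop = FALSE], 1, var)

ensemble
uncertainty
```

Here the SVM column is discarded because its toy AUC of 0.60 falls below the threshold, and the two remaining models contribute in proportion to their scores.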
M. Barbet-Massin, F. Jiguet, C. H. Albert, & W. Thuiller (2012) "Selecting pseudo-absences for species distribution models: how, where and how many?" Methods in Ecology and Evolution 3:327-338 http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00172.x/full
C. Liu, P. M. Berry, T. P. Dawson, & R. G. Pearson (2005) "Selecting thresholds of occurrence in the prediction of species distributions." Ecography 28:385-393 http://www.researchgate.net/publication/230246974_Selecting_Thresholds_of_Occurrence_in_the_Prediction_of_Species_Distributions
modelling to build SDMs with a single algorithm, stack_modelling to build SSDMs.
if (FALSE) {
  library(SSDM)

  # Load the example environmental rasters and occurrence table
  data(Env)
  data(Occurrences)
  # Keep the records of a single species
  Occurrences <- subset(Occurrences, Occurrences$SPECIES == 'elliptica')

  # Build the ensemble SDM from two algorithms, with one repetition each
  ESDM <- ensemble_modelling(c('CTA', 'MARS'), Occurrences, Env, rep = 1,
                             Xcol = 'LONGITUDE', Ycol = 'LATITUDE',
                             ensemble.thresh = c(0.6))

  # Plot the results
  plot(ESDM)
}