This function rearranges the user's input data so that it can be used within biomod2. It can also select pseudo-absences or background data when true absence data are not available, or add pseudo-absences to an existing set of absences (see details).
BIOMOD_FormatingData(resp.var,
                     expl.var,
                     resp.xy = NULL,
                     resp.name = NULL,
                     eval.resp.var = NULL,
                     eval.expl.var = NULL,
                     eval.resp.xy = NULL,
                     PA.nb.rep = 0,
                     PA.nb.absences = 1000,
                     PA.strategy = 'random',
                     PA.dist.min = 0,
                     PA.dist.max = NULL,
                     PA.sre.quant = 0.025,
                     PA.table = NULL,
                     na.rm = TRUE)
resp.var
a vector, SpatialPointsDataFrame (or SpatialPoints if you work with ‘only presences’ data) containing species data (a single species) in binary format (ones for presences, zeros for true absences and NA for indeterminate) that will be used to build the species distribution models.
expl.var
a matrix, data.frame, SpatialPointsDataFrame or RasterStack containing your explanatory variables that will be used to build your models.
resp.xy
optional 2-column matrix containing the X and Y coordinates of resp.var (only considered if resp.var is a vector) that will be used to build your models.
eval.resp.var
a vector or SpatialPointsDataFrame containing your species data (a single species) in binary format (ones for presences, zeros for true absences and NA for indeterminate) that will be used to evaluate the models with independent data (or past data for instance).
eval.expl.var
a matrix, data.frame, SpatialPointsDataFrame or RasterStack containing your explanatory variables that will be used to evaluate the models with independent data (or past data for instance).
eval.resp.xy
optional 2-column matrix containing the X and Y coordinates of eval.resp.var (only considered if eval.resp.var is a vector) that will be used to evaluate the models with independent data (or past data for instance).
resp.name
response variable name (character). The species name.
PA.nb.rep
number of required pseudo-absence selections (if needed). 0 by default.
PA.nb.absences
number of pseudo-absences selected for each repetition of the selection (when PA.nb.rep > 0), true absences included.
PA.strategy
strategy for selecting the pseudo-absences (must be ‘random’, ‘sre’, ‘disk’ or ‘user.defined’).
PA.dist.min
minimal distance to presences for ‘disk’ pseudo-absence selection (in meters if the explanatory variables are an unprojected raster (+proj=longlat), and in map units (typically also meters) when the raster is projected or when the explanatory variables are stored in a table).
PA.dist.max
maximal distance to presences for ‘disk’ pseudo-absence selection (in meters if the explanatory variables are an unprojected raster (+proj=longlat), and in map units (typically also meters) when the raster is projected or when the explanatory variables are stored in a table).
PA.sre.quant
quantile used for ‘sre’ pseudo-absence selection.
PA.table
a matrix (or a data.frame) having as many rows as resp.var has values. Each column corresponds to one pseudo-absence selection. It contains TRUE or FALSE, indicating which values of resp.var will be considered to build models. It must be used with the ‘user.defined’ PA.strategy.
na.rm
logical, if TRUE, all points having one or several missing values for environmental data will be removed from the analysis.
A 'data.formatted.Biomod.object' for BIOMOD_Modeling.
It is strongly advised to check whether this formatted data corresponds to what was expected. A summary is easily printed by simply typing the name of the object. A generic plot function is also available to display the different datasets in geographic space.
This function homogenizes the initial data to make sure the modelling exercise will be completed with all the required data. It supports different kinds of inputs.
IMPORTANT: When the explanatory data are given as rasterLayer or rasterStack objects, biomod2 will extract the variables at the XY coordinates of the presence (and absence, if any) vector. Be sure to give the XY coordinates (‘resp.xy’) in the same projection system as the raster objects. The same applies to the evaluation data when some sort of independent (or past) data are available (‘eval.resp.xy’).
When the explanatory variables are given as a SpatialPointsDataFrame, the same requirements apply as for the raster objects. The XY coordinates must be given to make sure biomod2 can extract the explanatory variables at the presence (absence) data.
When the explanatory variables are stored in a data.frame, make sure they are in the same order as the response variable. biomod2 will simply merge the datasets without considering the XY coordinates.
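Because the merge is purely positional, it can help to align the two tables explicitly before calling the function. The following is a minimal base R sketch under stated assumptions: the `site` identifier column and all values are hypothetical and are not something biomod2 requires.

```r
# Hypothetical toy data: 'site' is an illustrative ID column, not a biomod2 requirement.
resp_df <- data.frame(site = c("s3", "s1", "s2"),
                      presence = c(1, 0, NA))
expl_df <- data.frame(site = c("s1", "s2", "s3"),
                      bio3 = c(30.1, 28.4, 41.0),
                      bio12 = c(800, 650, 1200))

# Reorder the explanatory rows so that row i describes the same site as response row i.
idx <- match(resp_df$site, expl_df$site)
stopifnot(!anyNA(idx))  # every response site must have explanatory data

expl_aligned <- expl_df[idx, ]

# Objects that could then be passed as resp.var and expl.var.
my_resp <- resp_df$presence
my_expl <- expl_aligned[, c("bio3", "bio12")]
stopifnot(length(my_resp) == nrow(my_expl))
```

Dropping the identifier column before passing the table keeps it from being treated as an explanatory variable.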
When both presence and absence data are available, and there are enough absences: set ‘PA.nb.rep’ to 0. No pseudo-absences will be extracted.
When no true absences are given, or when there are too few of them, it is advised to make several pseudo-absence selections. That way, the influence of the pseudo-absence selection can be estimated later on. If the user does not want to run several repetitions, make sure to select a relatively high number of pseudo-absences. Make sure the number of pseudo-absences is not higher than the maximum number of potential pseudo-absences (e.g. do not select 10,000 pseudo-absences when the rasterStack or data.frame does not contain more than 2000 pixels or rows).
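The check on the number of available candidates can be done before formatting the data. This base R sketch assumes the explanatory variables are stored in a table, so that only rows coded NA in the response are treated as candidates; the variable names are illustrative, and the count deliberately ignores true absences.

```r
# Toy response vector: 1 = presence, 0 = true absence, NA = no information.
my_resp <- c(1, 1, 0, NA, NA, NA, 1, NA, 0, NA)

# Assumption: with tabular explanatory data, candidates are the NA rows only.
n_candidates <- sum(is.na(my_resp))

requested <- 1000  # value initially intended for PA.nb.absences

# Cap the request at the number of rows that can actually be drawn.
pa_nb_absences <- min(requested, n_candidates)

if (requested > n_candidates) {
  message("Requested ", requested, " pseudo-absences but only ",
          n_candidates, " candidates exist; using ", pa_nb_absences, ".")
}
```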
Response variable encoding
BIOMOD_FormatingData concerns a single species at a time, so resp.var must be a uni-dimensional object.
The response variable must be a vector or a one-column data.frame/matrix/SpatialPointsDataFrame object (SpatialPoints are also allowed if you work with ‘only presences’ data).
As most biomod2 models need presence AND absence data, the response variable must contain some absences (if there are none, make sure to select pseudo-absences). In the input resp.var argument, the data should be coded in the following way:
Presences : 1
True Absences : 0 (if any)
No Information : NA (if any, might later be used for pseudo-absence)
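As an illustration of this coding, the sketch below recodes a hypothetical vector of survey labels into the 1 / 0 / NA format. The label values ("present", "absent", "unknown") are assumptions made for the example only.

```r
# Hypothetical raw survey records; the labels are illustrative.
raw_status <- c("present", "absent", "unknown", "present", NA)

# Start from 'no information' everywhere, then fill in what is known.
my_resp <- rep(NA_real_, length(raw_status))
my_resp[raw_status %in% "present"] <- 1  # presences
my_resp[raw_status %in% "absent"]  <- 0  # true absences

# Quick tally of each state, including NA.
table(my_resp, useNA = "ifany")
```

Using `%in%` rather than `==` keeps missing labels from producing NA indices during the assignment.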
If resp.var is a non-spatial object (vector, matrix/data.frame) and some of the models used require spatial data (e.g. MAXENT.Phillips) and/or the pseudo-absence selection is spatially dependent (i.e. 'disk'), make sure to give the XY coordinates of the sites/rows (‘resp.xy’).
Explanatory variables encoding
Explanatory variables must be stored together in a multi-dimensional object. It may be a matrix, a data.frame, a SpatialPointsDataFrame or a rasterStack object. Factor variables are allowed here, even though this can lead to some models being omitted.
Evaluation Data
If you have enough data, we strongly recommend splitting your dataset into two parts: one for training/calibrating and testing the models, and another for evaluating them. If you do so, fill the eval.resp.var, eval.expl.var and optionally the eval.resp.xy arguments with this data. The advantage of working with a specific dataset for evaluating your models is that you will be able to evaluate your ‘ensemble modeled’ models more properly. That being said, these arguments are optional, and you may prefer to test (a kind of evaluation) your models only with a ‘cross-validation’ procedure (see Models function). The best practice is to use one set of data for training/calibrating, one for testing and one for evaluating. The calibration and testing of the data can be done automatically in biomod2 in the Models function. The dataset for evaluation must be entered in BIOMOD_FormatingData.
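One possible way to prepare such a split is sketched below in base R. The 70/30 proportion, the simulated data and the object names are assumptions for illustration; the list element names simply mirror the arguments they would be passed to.

```r
set.seed(42)  # reproducible split

# Simulated stand-in data: 100 sites, two explanatory variables.
n <- 100
resp <- rbinom(n, size = 1, prob = 0.4)
expl <- data.frame(bio3 = rnorm(n), bio12 = rnorm(n))
xy <- data.frame(x = runif(n), y = runif(n))

# Hold out 30% of the rows for evaluation (the proportion is an arbitrary choice).
eval_idx <- sample(n, size = round(0.3 * n))
calib_idx <- setdiff(seq_len(n), eval_idx)

# Calibration/testing part: candidates for resp.var, expl.var, resp.xy.
calib <- list(resp.var = resp[calib_idx],
              expl.var = expl[calib_idx, ],
              resp.xy = xy[calib_idx, ])

# Evaluation part: candidates for eval.resp.var, eval.expl.var, eval.resp.xy.
evaluation <- list(eval.resp.var = resp[eval_idx],
                   eval.expl.var = expl[eval_idx, ],
                   eval.resp.xy = xy[eval_idx, ])
```

A purely random split is the simplest option; a spatially or temporally independent evaluation set, when available, corresponds more closely to the ‘independent (or past) data’ mentioned above.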
Pseudo Absences selection
The PA.xxx arguments let you parameterize your pseudo-absence selection if you want one. It is an optional step.
Pseudo absences will be selected within the ‘background data’ and might be constrained by a defined ‘strategy’.
background data
‘Background data’ represents data for which there is no information on whether the species of interest occurs or not. It is defined by the ‘No Information’ data of your resp.var if you give some. If not (i.e. only presence data, or all cells with a defined presence or absence state), the background will be taken from your expl.var object if it is a RasterStack.
strategy
The strategy allows you to constrain the choice of pseudo-absences within the ‘background data’.
Four ways are currently implemented to select the pseudo-absence candidate cells (PA.strategy argument):
‘random’: all cells of the initial background are pseudo-absence candidates. The choice is made randomly given the number of pseudo-absences to select (PA.nb.absences).
‘disk’: you may define a minimal (PA.dist.min) and a maximal (PA.dist.max) distance to presence points for selecting your pseudo-absence candidates. This may be useful if you do not want to select pseudo-absences too close to your presences (same niche, and to avoid pseudo-replication) or too far from your presences (localized sampling strategy).
‘sre’: pseudo-absence candidates have to be selected in conditions that differ from a defined proportion (PA.sre.quant) of the presence data. It forces pseudo-absences to be selected outside of the broadly defined environmental conditions for the species. This means that a surface range envelope model (sre, similar to BIOCLIM) is first carried out (using the specified quantile) on the species of interest, and the pseudo-absence data are then extracted outside of this envelope. This particular case may lead to over-optimistic model evaluations.
‘user.defined’: in this case, the pseudo-absence selection should have been done in a previous step. These pseudo-absences have to be referenced in a well-formatted data.frame (i.e. the PA.table argument).
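For the ‘user.defined’ strategy, a table of the expected shape can be built by hand. The sketch below is one possible construction in base R, assuming three repetitions of four randomly drawn pseudo-absences each; the column names and the choice to keep every known presence and absence in all repetitions are assumptions, not requirements stated above.

```r
set.seed(1)  # reproducible draws

# Toy response: 5 presences, 2 true absences, 13 sites without information.
my_resp <- c(rep(1, 5), rep(0, 2), rep(NA, 13))

n_rep <- 3  # number of pseudo-absence selections (one column each)
n_pa <- 4   # pseudo-absences drawn per selection
candidates <- which(is.na(my_resp))

# One row per value of resp.var, one column per selection, FALSE by default.
pa_table <- matrix(FALSE, nrow = length(my_resp), ncol = n_rep,
                   dimnames = list(NULL, paste0("PA", seq_len(n_rep))))

# Assumption: known presences and true absences are used in every selection.
pa_table[!is.na(my_resp), ] <- TRUE

# Flag a different random subset of the NA rows in each column.
for (j in seq_len(n_rep)) {
  pa_table[sample(candidates, n_pa), j] <- TRUE
}

colSums(pa_table)  # rows used per selection
```

Such a matrix has as many rows as resp.var has values and contains only TRUE or FALSE, which is the shape the PA.table argument is described as expecting.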
# NOT RUN {
# species occurrences
DataSpecies <- read.csv(system.file("external/species/mammals_table.csv",
package="biomod2"), row.names = 1)
head(DataSpecies)
# the name of studied species
myRespName <- 'GuloGulo'
# the presence/absences data for our species
myResp <- as.numeric(DataSpecies[,myRespName])
# the XY coordinates of species data
myRespXY <- DataSpecies[,c("X_WGS84","Y_WGS84")]
# Environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
myExpl <- raster::stack(system.file("external/bioclim/current/bio3.grd",
                                    package = "biomod2"),
                        system.file("external/bioclim/current/bio4.grd",
                                    package = "biomod2"),
                        system.file("external/bioclim/current/bio7.grd",
                                    package = "biomod2"),
                        system.file("external/bioclim/current/bio11.grd",
                                    package = "biomod2"),
                        system.file("external/bioclim/current/bio12.grd",
                                    package = "biomod2"))
# 1. Formatting Data
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName)
myBiomodData
plot(myBiomodData)
# }