enmSdmX (version 1.2.12)

trainNS: Calibrate a natural splines model

Description

This function constructs a natural-spline model by evaluating all possible models given the available predictors and constraints. "Constraints" in this case include the degrees of freedom for a spline, whether or not interaction terms are included, minimum number of presence sites per model term, and maximum number of terms to include in the model. Its output is any or all of: the most parsimonious model (lowest AICc); all models evaluated; and/or a table with AICc for all evaluated models.

Usage

trainNS(
  data,
  resp = names(data)[1],
  preds = names(data)[2:ncol(data)],
  scale = NA,
  df = 1:4,
  interaction = TRUE,
  interceptOnly = TRUE,
  method = "glm.fit",
  presPerTermFinal = 10,
  maxTerms = 8,
  w = TRUE,
  family = "binomial",
  removeInvalid = TRUE,
  failIfNoValid = TRUE,
  out = "model",
  cores = 1,
  verbose = FALSE,
  ...
)

Value

The object that is returned depends on the value of the out argument. It can be a model object, a data frame, a list of models, or a list containing two or more of these. If scale is TRUE, any model object will also have an element named $scale, which contains the means and standard deviations of predictors that are not factors. The data frame reports the AICc for all of the models evaluated, sorted from best to worst. The converged column indicates whether the model converged ("TRUE" is good), and the boundary column indicates whether the model parameters are near the boundary (usually, negative or positive infinity; "FALSE" is good).
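
For instance, a minimal sketch (assuming the returned list names its elements after the values passed to out, and that env is a data frame with a binary response in its first column):

fits <- trainNS(env, out = c('model', 'tuning'))

best <- fits$model        # model with the lowest AICc
aiccTable <- fits$tuning  # one row per candidate model, sorted by AICc

# convergence and boundary flags for the top-ranked models
head(aiccTable[ , c('converged', 'boundary')])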

Arguments

data

Data frame.

resp

Response variable. This is either the name of the column in data or an integer indicating the column in data that has the response variable. The default is to use the first column in data as the response.

preds

Character vector or integer vector. Names of columns or column indices of predictors. The default is to use the second and subsequent columns in data.

scale

Either NA (default), TRUE, or FALSE. If TRUE, the predictors are centered and scaled by subtracting their means and then dividing by their standard deviations. The means and standard deviations are returned in the model object under an element named "scales". For example, if you call model <- trainNS(data, scale=TRUE), then you can retrieve the means and standard deviations using model$scales$means and model$scales$sds. If FALSE, no scaling is done. If NA (default), the function checks whether the non-factor predictors have means of ~0 and standard deviations of ~1. If not, a warning is printed, but the function continues with its operations.
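
A short, hypothetical sketch of this behavior (element names as described in this paragraph; env is a data frame with the response in its first column):

model <- trainNS(env, scale = TRUE)

# means and standard deviations used for centering and scaling
model$scales$means
model$scales$sds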

df

A vector of integers > 0, or NULL. Sets the flexibility of the model fit. See the documentation for splines::ns.

interaction

If TRUE (default), include two-way interaction terms.

interceptOnly

If TRUE (default) and model selection is enabled, then include an intercept-only model.

method

Character, name of function used to solve. This can be 'glm.fit' (default), 'brglmFit' (from the brglm2 package), or another function.

presPerTermFinal

Minimum number of presence sites per term in the final ("full") model.

maxTerms

Maximum number of terms to be used in any model, not including the intercept (default is 8).

w

Weights. Any of:

  • TRUE: Causes the total weight of presences to equal the total weight of absences (if family='binomial'); see the sketch after this list.

  • FALSE: Each datum is assigned a weight of 1.

  • A numeric vector of weights, one per row in data.

  • The name of the column in data that contains site weights.
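
The weighting implied by w = TRUE can be sketched as follows (a hedged illustration of the documented behavior, not the package's internal code; presBg is a hypothetical 0/1 response vector):

nPres <- sum(presBg == 1)
nBg <- sum(presBg == 0)

# presences weighted 1; background weighted so both groups have equal total weight
w <- ifelse(presBg == 1, 1, nPres / nBg)

sum(w[presBg == 1]) == sum(w[presBg == 0]) # TRUE (up to floating-point error)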

family

Name of family for data error structure (see family).

removeInvalid

Logical. If TRUE (default), remove models that either did not converge or have parameter estimates near the boundaries (usually negative or positive infinity).

failIfNoValid

Logical. If TRUE (default) and the "full" model either does not converge or has parameters near the boundary, the function fails with an error. If FALSE, the function returns NULL in this case.

out

Character vector. One or more values:

  • 'model': Model with the lowest AICc.

  • 'models': All models evaluated, sorted from lowest to highest AICc (lowest is best).

  • 'tuning': Data frame with tuning parameters, one row per model, sorted by AICc.

cores

Number of cores to use. Default is 1. If you have issues when cores > 1, please see the troubleshooting_parallel_operations guide.

verbose

Logical. If TRUE, display intermediate results. The default is FALSE.

...

Arguments to pass to glm().
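
For example (a hedged sketch, assuming that arguments not used by trainNS() itself are forwarded to glm() unchanged):

# raise the maximum number of IWLS iterations used by glm()
model <- trainNS(
  data = env,
  resp = 'presBg',
  preds = c('bio1', 'bio12'),
  control = glm.control(maxit = 100) # passed through to glm() via '...'
)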

Details

This function is designed to find the most parsimonious model given the amount of calibration data available to it. `trainNS()` can work with any data, but it has been designed specifically for use as a species distribution model, where the response is either binary (the default) or abundance. Specifically, it 1) identifies the most parsimonious model (lowest AICc), 2) chooses the optimal flexibility (optimal degrees of freedom in the splines), and 3) allows for (but does not require) interaction terms between predictors. If the defaults are used, the following procedure is applied (a simplified sketch of the term-ranking step appears after this list):

  • Constructing a set of simple model terms, each with 1 to 4 degrees of freedom. Terms can be univariate or bivariate (two-way interactions). Predictors can be continuous or factors. Any simple model with convergence or boundary issues (coefficients that approach negative or positive infinity) is removed.

  • Constructing a series of models, each with one of the terms, then using the models to rank terms by AICc.

  • From the top set of terms, creating a "full" model. The full model will ensure the maximum number of terms is <= `maxTerms`, and that for each term, there are at least `presPerTermFinal` data points.

  • All possible submodels, plus the full model, are evaluated and ranked by AICc. If a model has convergence or boundary issues, it is removed from the set. The most parsimonious model (lowest AICc) is returned.
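
The following is a simplified, hypothetical sketch of the term-ranking step only (it is not the package's actual implementation): fit one single-term natural-spline GLM per predictor and degrees-of-freedom combination, compute AICc, and rank the terms. It uses the env data frame built in the Examples below.

library(splines)

# candidate terms: each predictor crossed with 1 to 4 degrees of freedom
candidates <- expand.grid(pred = c('bio1', 'bio12'), df = 1:4,
	stringsAsFactors = FALSE)

# small-sample corrected AIC
aicc <- function(fit) {
	k <- length(coef(fit))
	n <- nobs(fit)
	AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)
}

candidates$AICc <- sapply(seq_len(nrow(candidates)), function(i) {
	form <- as.formula(paste0('presBg ~ ns(', candidates$pred[i],
		', df = ', candidates$df[i], ')'))
	fit <- glm(form, data = env, family = binomial)
	aicc(fit)
})

candidates[order(candidates$AICc), ] # best term first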

See Also

Examples

# \donttest{

# NB: The examples below show a very basic modeling workflow. They have been 
# designed to work fast, not produce accurate, defensible models. They can
# take a few minutes to run.

library(mgcv)
library(sf)
library(terra)
set.seed(123)

### setup data
##############

# environmental rasters
rastFile <- system.file('extdata/madClim.tif', package='enmSdmX')
madClim <- rast(rastFile)

# coordinate reference system
wgs84 <- getCRS('WGS84')

# lemur occurrence data
data(lemurs)
occs <- lemurs[lemurs$species == 'Eulemur fulvus', ]
occs <- vect(occs, geom=c('longitude', 'latitude'), crs=wgs84)

occs <- elimCellDuplicates(occs, madClim)

occEnv <- extract(madClim, occs, ID = FALSE)
occEnv <- occEnv[complete.cases(occEnv), ]
	
# create 10000 background sites (or as many as raster can support)
bgEnv <- terra::spatSample(madClim, 20000)
bgEnv <- bgEnv[complete.cases(bgEnv), ]
bgEnv <- bgEnv[1:min(10000, nrow(bgEnv)), ]

# collate occurrences and background sites
presBg <- data.frame(
  presBg = c(
    rep(1, nrow(occEnv)),
    rep(0, nrow(bgEnv))
  )
)

env <- rbind(occEnv, bgEnv)
env <- cbind(presBg, env)

predictors <- c('bio1', 'bio12')

### calibrate models
####################

# Note that all of the trainXYZ functions can be made to go faster using the
# "cores" argument (set to 1 by default). The examples below will not
# go much faster with more cores because they are simplified, but
# you can try!
cores <- 1

# MaxEnt
mx <- trainMaxEnt(
	data = env,
	resp = 'presBg',
	preds = predictors,
	regMult = 1, # too few values for reliable model, but fast
	verbose = TRUE,
	cores = cores
)

# MaxNet
mn <- trainMaxNet(
	data = env,
	resp = 'presBg',
	preds = predictors,
	regMult = 1, # too few values for reliable model, but fast
	verbose = TRUE,
	cores = cores
)

# generalized linear model (GLM)
gl <- trainGLM(
	data = env,
	resp = 'presBg',
	preds = predictors,
	scale = TRUE, # automatic scaling of predictors
	verbose = TRUE,
	cores = cores
)

# generalized additive model (GAM)
ga <- trainGAM(
	data = env,
	resp = 'presBg',
	preds = predictors,
	verbose = TRUE,
	cores = cores
)

# natural splines
ns <- trainNS(
	data = env,
	resp = 'presBg',
	preds = predictors,
	scale = TRUE, # automatic scaling of predictors
	df = 1:2, # too few values for reliable model(?)
	verbose = TRUE,
	cores = cores
)

# boosted regression trees
envSub <- env[1:1049, ] # subsetting data to run faster
brt <- trainBRT(
	data = envSub,
	resp = 'presBg',
	preds = predictors,
	learningRate = 0.001, # too few values for reliable model(?)
	treeComplexity = c(2, 3), # too few values for reliable model, but fast
	minTrees = 1200, # minimum trees for reliable model(?), but fast
	maxTrees = 1200, # too small for reliable model(?), but fast
	tryBy = 'treeComplexity',
	anyway = TRUE, # return models that did not converge
	verbose = TRUE,
	cores = cores
)

# random forests
rf <- trainRF(
	data = env,
	resp = 'presBg',
	preds = predictors,
	numTrees = c(100, 500), # using at least 500 recommended, but fast!
	verbose = TRUE,
	cores = cores
)

### make maps of models
#######################

# NB We do not have to scale rasters before predicting GLMs and NSs because we
# used the `scale = TRUE` argument in trainGLM() and trainNS().

mxMap <- predictEnmSdm(mx, madClim)
mnMap <- predictEnmSdm(mn, madClim) 
glMap <- predictEnmSdm(gl, madClim)
gaMap <- predictEnmSdm(ga, madClim)
nsMap <- predictEnmSdm(ns, madClim)
brtMap <- predictEnmSdm(brt, madClim)
rfMap <- predictEnmSdm(rf, madClim)

maps <- c(
	mxMap,
	mnMap,
	glMap,
	gaMap,
	nsMap,
	brtMap,
	rfMap
)

names(maps) <- c('MaxEnt', 'MaxNet', 'GLM', 'GAM', 'NSs', 'BRTs', 'RFs')
fun <- function() plot(occs, col='black', pch=3, add=TRUE)
plot(maps, fun = fun, nc = 4)

### compare model responses to BIO12 (mean annual precipitation)
################################################################

# make a data frame holding all other variables at mean across occurrences,
# varying only BIO12
occEnvMeans <- colMeans(occEnv, na.rm=TRUE)
occEnvMeans <- rbind(occEnvMeans)
occEnvMeans <- as.data.frame(occEnvMeans)
climFrame <- occEnvMeans[rep(1, 100), ]
rownames(climFrame) <- NULL

minBio12 <- min(env$bio12)
maxBio12 <- max(env$bio12)
climFrame$bio12 <- seq(minBio12, maxBio12, length.out=100)

predMx <- predictEnmSdm(mx, climFrame)
predMn <- predictEnmSdm(mn, climFrame)
predGl <- predictEnmSdm(gl, climFrame)
predGa <- predictEnmSdm(ga, climFrame)
predNat <- predictEnmSdm(ns, climFrame)
predBrt <- predictEnmSdm(brt, climFrame)
predRf <- predictEnmSdm(rf, climFrame)


plot(climFrame$bio12, predMx,
	xlab='BIO12', ylab='Prediction', type='l', ylim=c(0, 1))

lines(climFrame$bio12, predMn, lty='solid', col='red')
lines(climFrame$bio12, predGl, lty='dotted', col='blue')
lines(climFrame$bio12, predGa, lty='dashed', col='green')
lines(climFrame$bio12, predNat, lty=4, col='purple')
lines(climFrame$bio12, predBrt, lty=5, col='orange')
lines(climFrame$bio12, predRf, lty=6, col='cyan')

legend(
   'topleft',
   inset = 0.01,
   legend = c(
	'MaxEnt',
	'MaxNet',
	'GLM',
	'GAM',
	'NS',
	'BRT',
	'RF'
   ),
   lty = c(1, 1:6),
   col = c(
	'black',
	'red',
	'blue',
	'green',
	'purple',
	'orange',
	'cyan'
   ),
   bg = 'white'
)

### compare response curves ###
###############################

modelNames <- c('MaxEnt', 'MaxNet', 'GLM', 'GAM', 'NS', 'BRT', 'RF')

responses <- responseCurves(
	models = list(mx, mn, gl, ga, ns, brt, rf),
	env = bgEnv,
	ref = occEnv,
	vars = predictors,
	modelNames = modelNames
)

print(responses)

# }
