varSel: Variable Selection

Description

The function performs data-driven variable selection. Starting from the provided model, it iterates through the variables in decreasing order of importance (permutation importance or Maxent percent contribution). If a variable is correlated with other variables (according to the given method and threshold), it performs a jackknife test and, among the correlated variables, removes the one whose removal yields the best-performing model (according to the given metric). The process is repeated until no highly correlated variables remain.
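The selection loop described above can be pictured with a toy sketch in base R. This is only an illustration under stated assumptions, not the package's implementation: the data are synthetic, the importance scores are fixed by hand, and the helpers `score_without()` and `toy_var_sel()` are hypothetical names that use the R-squared of a linear model as a stand-in for retraining and evaluating a species distribution model.

```r
# Toy illustration only: NOT the package's implementation.
set.seed(1)
n <- 200
a <- rnorm(n)
env <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.3),  # built to be strongly correlated with `a`
  c = rnorm(n),
  d = rnorm(n)
)
y <- env$a + env$c + rnorm(n, sd = 0.2)  # synthetic response

# Assumed, fixed importance scores (the real function derives them from the model)
importance <- c(a = 40, b = 30, c = 20, d = 10)

# Hypothetical stand-in for "retrain without one variable and evaluate":
# the R-squared of a linear model fitted on the remaining variables
score_without <- function(vars, drop) {
  keep <- setdiff(vars, drop)
  fit <- lm(reformulate(keep, response = "y"), data = cbind(env, y = y))
  summary(fit)$r.squared
}

toy_var_sel <- function(cor_th = 0.7, method = "spearman") {
  # Visit variables from the most to the least important
  vars <- names(sort(importance, decreasing = TRUE))
  repeat {
    cm <- abs(cor(env[, vars, drop = FALSE], method = method))
    removed <- FALSE
    for (v in vars) {
      # Variables whose absolute correlation with `v` reaches the threshold
      partners <- setdiff(vars[cm[v, vars] >= cor_th], v)
      if (length(partners) > 0) {
        # Jackknife-style test: drop each candidate in turn and keep the
        # removal that leaves the best-scoring model
        candidates <- c(v, partners)
        scores <- vapply(candidates,
                         function(drop) score_without(vars, drop),
                         numeric(1))
        vars <- setdiff(vars, candidates[which.max(scores)])
        removed <- TRUE
        break
      }
    }
    # Stop once a full pass finds no highly correlated pair
    if (!removed) break
  }
  vars
}

selected <- toy_var_sel()
print(selected)
```

In this toy setup only `a` and `b` are correlated above the threshold, so a single variable is removed and the loop stops on the next pass.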

Usage

varSel(model, metric, bg4cor, test = NULL, env = NULL,
  parallel = FALSE, method = "spearman", cor_th = 0.7, permut = 10,
  use_pc = FALSE)

Arguments

model

SDMmodel or SDMmodelCV object.

metric

character. The metric used to evaluate the models, possible values are: "auc", "tss" and "aicc".

bg4cor

SWD object. Background locations used to test the correlation between environmental variables.

test

SWD object. Test dataset used to evaluate the model; not used with "aicc" or with SDMmodelCV objects, default is NULL.

env

stack containing the environmental variables, used only with "aicc", default is NULL.

parallel

logical. If TRUE, parallel computation is used; this applies only when the metric is "aicc". Default is FALSE.

method

character. The method used to compute the correlation matrix, default "spearman".

cor_th

numeric. The correlation threshold used to select highly correlated variables, default is 0.7.

permut

integer. Number of permutations, default is 10.

use_pc

logical, use percent contribution. If TRUE and the model is trained using the Maxent method, the algorithm uses the percent contribution computed by Maxent software to score the variable importance, default is FALSE.

Value

The SDMmodel or SDMmodelCV object trained using the selected variables.

Details

Parallel computation increases the speed only for large datasets, because of the time needed to create the cluster. Two variables are considered highly correlated when the absolute value of their correlation coefficient reaches the threshold: $$| coeff | \ge cor\_th$$
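As a minimal sketch of the correlation check, the snippet below flags pairs of variables whose absolute Spearman coefficient is at or above `cor_th`, which is the sense in which the examples speak of "correlation higher than 0.7". It uses only base R on synthetic data; the column names `bio1`, `bio2` and `bio3` and the object `bg_values` are assumptions made for illustration and do not come from the package.

```r
# Synthetic stand-in for the environmental values at background locations
set.seed(42)
n <- 500
base <- rnorm(n)
bg_values <- data.frame(
  bio1 = base,
  bio2 = base + rnorm(n, sd = 0.2),  # built to correlate strongly with bio1
  bio3 = rnorm(n)                    # independent of the other two
)

cor_th <- 0.7
cm <- cor(bg_values, method = "spearman")

# Keep each pair once (upper triangle) when |coeff| reaches the threshold
idx <- which(abs(cm) >= cor_th & upper.tri(cm), arr.ind = TRUE)
high_cor <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  coeff = cm[idx]
)
print(high_cor)
```

Restricting the test to the upper triangle avoids reporting each pair twice and skips the diagonal, where every variable is trivially correlated with itself.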

Examples

# NOT RUN {
# Acquire environmental variables
files <- list.files(path = file.path(system.file(package = "dismo"), "ex"),
                    pattern = "grd", full.names = TRUE)
predictors <- raster::stack(files)

# Prepare presence locations
p_coords <- condor[, 1:2]

# Prepare background locations
bg_coords <- dismo::randomPoints(predictors, 10000)

# Create SWD object
presence <- prepareSWD(species = "Vultur gryphus", coords = p_coords,
                       env = predictors, categorical = "biome")
bg <- prepareSWD(species = "Vultur gryphus", coords = bg_coords,
                 env = predictors, categorical = "biome")

# Get subsample of background to train the model, we will use the full
# dataset to compute the correlation among the environmental variables
bg_model <- getSubsample(bg, 5000, seed = 25)

# Split presence locations in training (80%) and testing (20%) datasets
datasets <- trainValTest(presence, test = 0.2)
train <- datasets[[1]]
test <- datasets[[2]]

# Train a Maxent model
model <- train(method = "Maxent", p = train, a = bg_model, fc = "l")

# Remove variables with correlation higher than 0.7 accounting for the AUC,
# in the following example the variable importance is computed as permutation
# importance
vs <- varSel(model, metric = "auc", bg4cor = bg, test = test, cor_th = 0.7,
             permut = 1)
vs

# Remove variables with correlation higher than 0.7 accounting for the TSS,
# in the following example the variable importance is the MaxEnt percent
# contribution
vs <- varSel(model, metric = "tss", bg4cor = bg, test = test, cor_th = 0.7,
             use_pc = TRUE)
vs

# Remove variables with correlation higher than 0.7 accounting for the AICc,
# in the following example the variable importance is the MaxEnt percent
# contribution
vs <- varSel(model, metric = "aicc", bg4cor = bg, cor_th = 0.7,
             use_pc = TRUE, env = predictors)
vs
# }
