nndm: Nearest Neighbour Distance Matching (NNDM) algorithm

Description

This function implements the NNDM algorithm and returns the indices needed to perform an NNDM leave-one-out cross-validation (LOO CV) for map validation.

Usage

nndm(
  tpoints,
  modeldomain = NULL,
  ppoints = NULL,
  samplesize = 1000,
  sampling = "regular",
  phi = "max",
  min_train = 0.5
)

Value

An object of class nndm consisting of a list of seven elements: indx_train, indx_test, and indx_exclude (indices of the observations to use as training/test/excluded data in each NNDM LOO CV iteration), Gij (distances for G function construction between prediction and training points), Gj (distances for G function construction during LOO CV), Gjstar (distances for modified G function during NNDM LOO CV), and phi (landscape autocorrelation range). indx_train and indx_test can be used directly as "index" and "indexOut" in caret's trainControl function, or to set up a custom validation strategy in mlr3.

Arguments

tpoints: sf or sfc point object. Contains the training point samples.
modeldomain: sf polygon object defining the prediction area (see Details).
ppoints: sf or sfc point object. Contains the target prediction points. Optional. Alternative to modeldomain (see Details).
samplesize: numeric. How many points in the modeldomain should be sampled as prediction points? Only required if modeldomain is used instead of ppoints.
sampling: character. How to draw prediction points from the modeldomain? See `sf::st_sample`. Only required if modeldomain is used instead of ppoints.
phi: Numeric, or the string "max". Estimate of the landscape autocorrelation range, in the same units as tpoints and ppoints for projected CRS and in meters for geographic CRS. By default (phi="max"), the size of the prediction area is used. See Details.
min_train: Numeric between 0 and 1. Minimum proportion of training data that must be used in each CV fold. Defaults to 0.5 (i.e. half of the training points).

Author

Carles Milà

Details

NNDM proposes a LOO CV scheme such that the nearest neighbour distance distribution function between the test and training data during the CV process is matched to the nearest neighbour distance distribution function between the prediction and training points. Details of the method can be found in Milà et al. (2022).
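
The two distance distributions being compared can be sketched by hand. The snippet below is only an illustration under simple assumptions (a toy 100x100 square, no CRS, Euclidean distances); the object names Gij and Gj mirror the elements described in Value, but the values are computed here with sf and base R rather than taken from nndm, and the package's internal computation may differ.

```r
library(sf)

set.seed(1)
# Toy prediction area: a 100x100 square without a CRS (illustrative only)
square <- st_polygon(list(matrix(c(0,0, 0,100, 100,100, 100,0, 0,0),
                                 ncol = 2, byrow = TRUE)))
train <- st_sample(square, 50, type = "random")
pred  <- st_sample(square, 200, type = "regular")

# Nearest neighbour distance from each prediction point to the training points
d_pt <- st_distance(pred, train)
Gij <- apply(d_pt, 1, min)

# Nearest neighbour distance from each training point to the remaining
# training points, i.e. the test-to-training distance in a plain LOO CV
d_tt <- st_distance(train, train)
diag(d_tt) <- NA  # a point is not its own neighbour
Gj <- apply(d_tt, 1, min, na.rm = TRUE)

# Empirical nearest neighbour distance distribution functions
Gij_hat <- ecdf(Gij)
Gj_hat  <- ecdf(Gj)

# Proportion of nearest neighbour distances not exceeding r = 5 in each setting
r <- 5
c(prediction = Gij_hat(r), loo = Gj_hat(r))
```

If the two proportions differ strongly across a range of r, a plain LOO CV evaluates the model at distances that are not representative of the prediction task, which is the mismatch NNDM aims to reduce.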

Specifying phi limits distance matching to the range over which it is assumed to be relevant due to spatial autocorrelation. Distances are only matched up to phi. Beyond that range, all data points are used for training, without exclusions. When phi is set to "max", nearest neighbour distance matching is performed for the entire prediction area. Euclidean distances are used for projected and undefined CRS; great circle distances are used for geographic CRS (units in meters).
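
As a rough illustration of what "matched up to phi" means, the base R sketch below compares the two empirical distribution functions only on distances in the interval from 0 to phi. The toy coordinates, the helper nn_dist and the summary statistic are assumptions made for this example; this is not the procedure nndm runs internally.

```r
set.seed(7)
# Toy planar coordinates in a 100x100 square (illustrative only)
train_xy <- cbind(runif(50, 0, 100), runif(50, 0, 100))
pred_xy  <- as.matrix(expand.grid(x = seq(5, 95, 10), y = seq(5, 95, 10)))

# Hypothetical helper: Euclidean distance from each row of `from`
# to its nearest row of `to`
nn_dist <- function(from, to) {
  apply(from, 1, function(p) {
    min(sqrt((to[, 1] - p[1])^2 + (to[, 2] - p[2])^2))
  })
}

# Prediction-to-training nearest neighbour distances
Gij <- nn_dist(pred_xy, train_xy)

# LOO CV nearest neighbour distances among training points
d_tt <- as.matrix(dist(train_xy))
diag(d_tt) <- Inf  # exclude each point from its own neighbourhood
Gj <- apply(d_tt, 1, min)

# Compare the two distribution functions only for distances up to phi
phi <- 10
r <- seq(0, phi, length.out = 101)
gap <- ecdf(Gj)(r) - ecdf(Gij)(r)

# Largest absolute discrepancy within the autocorrelation range
max(abs(gap))
```

Discrepancies at distances greater than phi are ignored in this comparison, mirroring the statement that no exclusions are made beyond that range.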

The modeldomain is an sf polygon that defines the prediction area. The function draws a point sample from it (sample size set by samplesize, sampling scheme set by sampling, regular by default). Alternatively, use ppoints instead of modeldomain if you have already defined the prediction locations (e.g. raster pixel centroids). When using either modeldomain or ppoints, we advise plotting the study area polygon and the training/prediction points beforehand to ensure they are aligned.
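
To see what such a sample looks like before calling nndm, prediction points can be drawn manually with sf::st_sample and then passed as ppoints. This is a minimal sketch on a toy polygon; the object names domain and ppoints_manual are made up for the example, and the call is an assumption about how a comparable sample can be produced, not a copy of the package's internals.

```r
library(sf)

set.seed(42)
# Toy model domain: a 100x100 square without a CRS (illustrative only)
domain <- st_sfc(st_polygon(list(matrix(c(0,0, 0,100, 100,100, 100,0, 0,0),
                                        ncol = 2, byrow = TRUE))))

# Draw prediction points using the same defaults as nndm's
# samplesize and sampling arguments
ppoints_manual <- st_sample(domain, size = 1000, type = "regular")

# Regular sampling returns approximately, not necessarily exactly, `size` points
length(ppoints_manual)

# Visual check that the points cover the domain as expected
plot(domain)
plot(ppoints_manual, add = TRUE, pch = ".")
```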

References

Milà, C., Mateu, J., Pebesma, E., Meyer, H. (2022): Nearest Neighbour Distance Matching Leave-One-Out Cross-Validation for map validation. Methods in Ecology and Evolution 00, 1–13.
Meyer, H., Pebesma, E. (2022): Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications. 13.

Examples

########################################################################
# Example 1: Simulated data - Randomly-distributed training points
########################################################################

library(sf)
library(CAST)  # provides nndm() and clustered_sample()

# Simulate 100 random training points in a 100x100 square
set.seed(123)
poly <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE))
sample_poly <- sf::st_polygon(poly)
train_points <- sf::st_sample(sample_poly, 100, type = "random")
pred_points <- sf::st_sample(sample_poly, 100, type = "regular")
plot(sample_poly)
plot(pred_points, add = TRUE, col = "blue")
plot(train_points, add = TRUE, col = "red")

# Run NNDM for the whole domain, here the prediction points are known
nndm_pred <- nndm(train_points, ppoints=pred_points)
nndm_pred
plot(nndm_pred)

# ...or run NNDM with a known autocorrelation range of 10
# to restrict the matching to distances lower than that.
nndm_pred <- nndm(train_points, ppoints=pred_points, phi = 10)
nndm_pred
plot(nndm_pred)

########################################################################
# Example 2: Simulated data - Clustered training points
########################################################################

library(sf)

# Simulate 100 clustered training points in a 100x100 square
set.seed(123)
poly <- list(matrix(c(0,0,0,100,100,100,100,0,0,0), ncol=2, byrow=TRUE))
sample_poly <- sf::st_polygon(poly)
train_points <- clustered_sample(sample_poly, 100, 10, 5)
pred_points <- sf::st_sample(sample_poly, 100, type = "regular")
plot(sample_poly)
plot(pred_points, add = TRUE, col = "blue")
plot(train_points, add = TRUE, col = "red")

# Run NNDM for the whole domain
nndm_pred <- nndm(train_points, ppoints=pred_points)
nndm_pred
plot(nndm_pred)

########################################################################
# Example 3: Real-world example; using a modeldomain instead of previously
# sampled prediction locations
########################################################################
if (FALSE) {
library(sf)
library(terra)

### prepare sample data:
dat <- readRDS(system.file("extdata","Cookfarm.RDS",package="CAST"))
dat <- aggregate(dat[,c("DEM","TWI", "NDRE.M", "Easting", "Northing","VW")],
   by=list(as.character(dat$SOURCEID)),mean)
pts <- dat[,-1]
pts <- st_as_sf(pts,coords=c("Easting","Northing"))
st_crs(pts) <- 26911
studyArea <- rast(system.file("extdata","predictors_2012-03-25.tif",package="CAST"))
studyArea[!is.na(studyArea)] <- 1
studyArea <- as.polygons(studyArea, values = FALSE, na.all = TRUE) |>
    st_as_sf() |>
    st_union()
pts <- st_transform(pts, crs = st_crs(studyArea))
plot(studyArea)
plot(st_geometry(pts), add = TRUE, col = "red")

nndm_folds <- nndm(pts, modeldomain= studyArea)
plot(nndm_folds)

#use for cross-validation:
library(caret)
ctrl <- trainControl(method="cv",
   index=nndm_folds$indx_train,
   indexOut=nndm_folds$indx_test,
   savePredictions='final')
model_nndm <- train(dat[,c("DEM","TWI", "NDRE.M")],
   dat$VW,
   method="rf",
   trControl = ctrl)
global_validation(model_nndm)
}