enmSdmX (version 1.2.12)

geoFold: Assign geographically-distinct k-folds

Description

This function generates geographically-distinct cross-validation folds, or "geo-folds" ("g-folds" for short). Points are grouped by proximity to one another. Folds can be forced to have at least a minimum number of points in them. Results are deterministic (i.e., the same every time for the same data).

More specifically, g-folds are created using this process:

  • To start, all pairwise distances between points are calculated. These are used in a clustering algorithm to create a dendrogram of relationships by distance. The dendrogram is then "cut" so it has k groups (folds). If each fold has at least the minimum desired number of points (minIn), then the process stops and fold assignments are returned.

  • However, if at least one fold has fewer than the desired number of points, a series of steps is executed.

    • First, the fold with a centroid that is farthest from all others is selected. If it has sufficient points, then the next-most distant fold is selected, and so on.

    • Once a fold is identified that has fewer than the desired number of points, it is grown by adding to it the points closest to its centroid, one at a time. Each time a point is added, the fold centroid is recalculated. The fold is grown until it has the desired number of points. Call this "fold #1". From here on, these points are considered "assigned" and are not eligible for re-assignment.

    • The remaining "unassigned" points are then clustered again, but this time into k - 1 folds. As before, the most distant group that has fewer than the desired number of points is selected. This fold is then grown as before, using only unassigned points. It becomes "fold #2".

    • The process repeats iteratively until there are k folds assigned, each with at least the desired number of points.
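The first two stages of the process above can be sketched in base R. This is an illustration only, not the package's internal code. It assumes planar coordinates with Euclidean distances (geoFold itself works with geographic coordinates), the points in xy are made up, and growFold is a hypothetical helper. For simplicity it grows the smallest fold rather than the most distant undersized one.

```r
# Illustrative sketch only; not the internals of geoFold().
set.seed(1)
xy <- cbind(x = runif(30), y = runif(30))  # hypothetical planar points
k <- 3
minIn <- 8

# Stage 1: pairwise distances -> dendrogram -> cut into k groups
tree <- hclust(dist(xy), method = "complete")
folds <- cutree(tree, k = k)
sizes <- table(folds)
allBigEnough <- all(sizes >= minIn)  # if TRUE, the process would stop here

# Stage 2 (simplified): grow one fold by repeatedly adding the nearest
# outside point to its centroid, recalculating the centroid each time.
growFold <- function(xy, inFold, minIn) {
  while (sum(inFold) < minIn && any(!inFold)) {
    centroid <- colMeans(xy[inFold, , drop = FALSE])
    dists <- sqrt((xy[, 1] - centroid[1])^2 + (xy[, 2] - centroid[2])^2)
    dists[inFold] <- Inf               # points already in the fold are skipped
    inFold[which.min(dists)] <- TRUE   # add the closest remaining point
  }
  inFold
}

# Simplification: pick the smallest fold instead of the most distant one
smallest <- as.integer(names(sizes)[which.min(sizes)])
grown <- growFold(xy, folds == smallest, minIn)

print(sizes)
print(sum(grown))
```

In the full procedure, the points in the grown fold would be set aside and the remaining points clustered again into k - 1 groups.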

The potential downside of this approach is that the last fold is assigned all of the remaining points, so it will be the largest. One way to avoid gross imbalance is to select the value of minIn such that it divides the points into nearly equally-sized groups.
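As a rough illustration of why minIn matters, the arithmetic below bounds the size of the largest possible fold. The point count n is a made-up value. The bound follows only from every fold holding at least minIn points, not from any detail of the clustering.

```r
n <- 100  # hypothetical number of points
k <- 3

# Choose minIn so that the folds are as close to equal-sized as possible
minIn <- floor(n / k)

# If every fold has at least minIn points, no fold can exceed this size
maxLargest <- n - (k - 1) * minIn

# For comparison, the same bound with the default minIn = 1
maxLargestDefault <- n - (k - 1) * 1

print(c(minIn = minIn, maxLargest = maxLargest, maxLargestDefault = maxLargestDefault))
```

With minIn = floor(n / k), the largest fold can exceed the smallest by only a few points, whereas the default allows one fold to hold nearly all of them.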

Usage

geoFold(x, k, minIn = 1, longLat = 1:2, method = "complete", ...)

Value

A vector of integers the same length as the number of points in x. Each integer indicates which fold a point in x belongs to.
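A minimal sketch of a call on a plain data.frame and of the shape of the result, assuming the enmSdmX package is installed. The sites data frame and its coordinates are invented for illustration, and the call is skipped if the package is not available.

```r
# Hypothetical sites: three loose groups of unprojected (WGS84) coordinates
sites <- data.frame(
  id = 1:12,
  long = c(47.1, 47.3, 47.2, 47.4, 44.0, 44.2, 44.1, 44.3, 49.5, 49.6, 49.4, 49.7),
  lat = c(-18.9, -19.0, -18.8, -19.1, -22.5, -22.4, -22.6, -22.3, -14.0, -14.2, -14.1, -13.9)
)

folds <- NULL
if (requireNamespace("enmSdmX", quietly = TRUE)) {
  # longLat names the longitude and latitude columns, in that order
  folds <- enmSdmX::geoFold(sites, k = 3, minIn = 3, longLat = c("long", "lat"))
  print(length(folds) == nrow(sites))  # one fold label per point
  print(table(folds))                  # number of points in each fold
}
```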

Arguments

x

A "spatial points" object of class SpatVector, sf, data.frame, or matrix. If x is a data.frame or matrix, then the points will be assumed to have the WGS84 coordinate system (i.e., unprojected).

k

Numeric: Number of folds to create.

minIn

Numeric: Minimum number of points required to be in a fold.

longLat

Character or integer vector: This is ignored if x is a SpatVector or sf object. However, if x is a data.frame or matrix, then this should be a character or integer vector specifying the columns in x corresponding to longitude and latitude (in that order). For example, c('long', 'lat') or c(1, 2). The default is to assume that the first two columns in x represent coordinates.

method

Character: Method used by hclust to cluster points. By default, this is 'complete', but other methods may give more reasonable results, depending on the case.

...

Additional arguments (unused)
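To get a feel for how the method argument can change the outcome, the sketch below compares group sizes produced by several hclust linkage methods in base R. It does not call geoFold. The points are random planar coordinates invented for illustration, and Euclidean distances stand in for geographic distances.

```r
set.seed(42)
xy <- cbind(x = runif(40), y = runif(40))  # hypothetical planar points
d <- dist(xy)
k <- 4

# Linkage methods accepted by stats::hclust()
methods <- c("complete", "average", "single", "ward.D2")

# One column per method, one row per group: number of points in each group
sizesByMethod <- sapply(methods, function(m) {
  groups <- cutree(hclust(d, method = m), k = k)
  tabulate(groups, nbins = k)
})

print(sizesByMethod)
```

Comparing the columns gives a quick indication of which linkage yields the most balanced groups for a given set of points before any growing of undersized folds takes place.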

Details

Note that in general it is probably mathematically impossible to cluster points in 2-dimensional space into k groups, each with at least minIn points, in a manner that seems "reasonable" to the eye in all cases. In experimentation, "unreasonable" results often appear when the number of groups is high.

See Also

geoFoldContrast

Examples

library(enmSdmX)
library(sf)
library(terra)

# lemur occurrence data
data(mad0)
data(lemurs)
crs <- getCRS('WGS84')
ll <- c('longitude', 'latitude')

# use occurrences of all species... easier to see on map
occs <- st_as_sf(lemurs, coords = ll, crs = crs)

# create 100 background points
mad0 <- vect(mad0)
bg <- spatSample(mad0, 100)

### assign 3 folds to occurrences and to background sites
k <- 3
minIn <- floor(nrow(occs) / k) # maximally spread between folds

presFolds <- geoFold(occs, k = k, minIn = minIn)
bgFolds <- geoFoldContrast(bg, pres = occs, presFolds = presFolds)

# number of sites per fold
table(presFolds)
table(bgFolds)

# map
plot(mad0, border = 'gray', main = paste(k, 'geo-folds'))
plot(bg, pch = 3, col = bgFolds + 1, add = TRUE)
plot(st_geometry(occs), pch = 20 + presFolds, bg = presFolds + 1, add = TRUE)

legend(
	'bottomright',
	legend = c(
		'presence fold 1',
		'presence fold 2',
		'presence fold 3',
		'background fold 1',
		'background fold 2',
		'background fold 3'
	),
	# one entry per legend label: 3 presence folds, then 3 background folds
	pch = c(21, 22, 23, 3, 3, 3),
	col = c(rep('black', 3), 2, 3, 4),
	pt.bg = c(2, 3, 4, NA, NA, NA)
)
