splitSample: Select samples from along an environmental gradient

Description

Select samples from along an environmental gradient by splitting the gradient into discrete chunks and sampling within each chunk. This allows, for example, a test set to be selected that covers the environmental gradient of the training set.

Usage

splitSample(env, chunk = 10, take, nchunk,
            fill = c("head", "tail", "random"),
            maxit = 1000)

Value

A numeric vector of indices of selected samples. This vector has attribute lengths which indicates how many samples were actually chosen from each chunk.
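The per-chunk counts can be read back with attr(). The following is a minimal sketch, assuming the analogue package (which supplies splitSample and the swappH data used in the Examples) is installed; the object names sel and chosen are illustrative only.

```r
## Runs only if the analogue package is available (an assumption)
if (requireNamespace("analogue", quietly = TRUE)) {
  data("swappH", package = "analogue")
  sel <- analogue::splitSample(swappH, chunk = 10, take = 20)

  ## Number of samples actually chosen from each chunk
  chosen <- attr(sel, "lengths")
  chosen
}
```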

Arguments

env: numeric; vector of samples representing the gradient values.
chunk: numeric; number of chunks to split the gradient into.
take: numeric; how many samples to take from the gradient. Cannot be missing.
nchunk: numeric; number of samples per chunk. Must be a vector of length chunk and sum(nchunk) must equal take. Can be missing (the default), in which case some simple heuristics are used to determine the number of samples chosen per chunk. See Details.
fill: character; the type of filling of chunks to perform. See Details.
maxit: numeric; maximum number of iterations in which to try to sample take observations. This guards against the sampling loop running indefinitely.

Author

Gavin L. Simpson

Details

The gradient is split into chunk sections and samples are selected from each chunk to result in a sample of length take. If take is divisible by chunk without remainder then an equal number of samples will be selected from each chunk. Where take is not a multiple of chunk and nchunk is not specified, extra samples have to be allocated to some of the chunks to reach the required number of selected samples.

An additional complication is that some chunks of the gradient may have fewer than nchunk samples and therefore more samples need to be selected from the remaining chunks until take samples are chosen.
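How unevenly samples are spread over chunks can be inspected directly. The following is a minimal sketch in base R, assuming the gradient is cut into equal-width intervals with cut(); the source does not state how chunk boundaries are chosen, so that choice, the simulated env vector and the names splt and avail are assumptions made for illustration.

```r
## Simulated gradient values (illustrative only, not real data)
set.seed(1)
env <- c(rnorm(80, mean = 5, sd = 0.5), runif(5, min = 7, max = 8))

## ASSUMPTION: chunks are equal-width intervals along the gradient
chunk <- 10
cuts <- cut(env, breaks = chunk)

## Indices of the samples falling in each chunk, and their counts
splt <- split(seq_along(env), cuts)
avail <- lengths(splt)
avail

## Chunks that cannot supply an even share of take = 40 samples
take <- 40
names(avail)[avail < floor(take / chunk)]
```

Any chunk listed by the last line holds fewer samples than an even split would ask of it, so the shortfall has to be made up from other chunks.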

If nchunk is supplied, it must be a vector stating exactly how many samples to select from each chunk. If nchunk is not supplied, then the number of samples per chunk is determined as follows:

An initial allocation of floor(take / chunk) is assigned to each chunk.
If any chunks have fewer samples than this initial allocation, these elements of nchunk are reset to the number of samples in those chunks.
Sequentially an extra sample is allocated to each chunk with sufficient available samples until take samples are selected.
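
The three steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's own code: allocate() is a hypothetical helper, avail is an assumed vector giving the number of samples available in each chunk, and extra samples are handed out from the low end of the gradient only.

```r
## Hypothetical helper, not part of the package: allocate `take` samples
## over chunks, given `avail`, the number of samples available per chunk.
allocate <- function(avail, take) {
  stopifnot(take <= sum(avail))
  chunk <- length(avail)

  ## Step 1: equal initial allocation
  nchunk <- rep(floor(take / chunk), chunk)

  ## Step 2: cap each chunk at the number of samples it actually holds
  nchunk <- pmin(nchunk, avail)

  ## Step 3: hand out one extra sample at a time, low end to high end,
  ## starting again from the low end after each complete pass
  while (sum(nchunk) < take) {
    for (i in seq_len(chunk)) {
      if (sum(nchunk) >= take) break
      if (nchunk[i] < avail[i]) nchunk[i] <- nchunk[i] + 1
    }
  }
  nchunk
}

avail <- c(1, 3, 10, 10, 2)
allocate(avail, take = 15)
## 1 3 5 4 2
```

Here the initial allocation of 3 per chunk is capped to 1, 3, 3, 3, 2, leaving three samples to place; chunks 3 and 4 each receive one on the first pass and chunk 3 receives the last on the second pass.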

Argument fill controls the order in which the chunks are filled. fill = "head" fills from the low to the high end of the gradient, whilst fill = "tail" fills in the opposite direction. Chunks are filled in random order if fill = "random". In all cases no chunk is filled by more than one extra sample until all chunks that can supply one extra sample are filled. In the case of fill = "head" or fill = "tail" this entails moving along the gradient from one end to the other, allocating an extra sample to available chunks, before starting along the gradient again. For fill = "random", a random order of chunks to fill is determined. If an extra sample has been allocated to each chunk in the random order and take samples are still not selected, filling begins again using the same random ordering. In other words, the random order of chunks to fill is chosen only once.
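
The visiting order implied by each value of fill can be sketched as below. fill_order() is a hypothetical helper written for illustration and is not part of the package; it assumes chunks are numbered from the low end of the gradient to the high end.

```r
## Hypothetical helper, not part of the package: the order in which
## chunks are visited on every pass for a given `fill`.
fill_order <- function(chunk, fill = c("head", "tail", "random")) {
  fill <- match.arg(fill)
  switch(fill,
         head = seq_len(chunk),        # low end to high end
         tail = rev(seq_len(chunk)),   # high end to low end
         random = sample(chunk))       # one random permutation
}

fill_order(5, "head")   # 1 2 3 4 5
fill_order(5, "tail")   # 5 4 3 2 1

## The random order is drawn once and then reused on every pass
set.seed(3)
ord <- fill_order(5, "random")
passes <- rep(ord, times = 2)
passes
```

Drawing the permutation once, outside any filling loop, is what makes the second pass repeat the first.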

Examples

data(swappH)

## take a test set of 20 samples along the pH gradient
test1 <- splitSample(swappH, chunk = 10, take = 20)
test1
swappH[test1]

## take a larger sample where some chunks don't have many samples
## do random filling
set.seed(3)
test2 <- splitSample(swappH, chunk = 10, take = 70, fill = "random")
test2
swappH[test2]
