CoordinateCleaner (version 1.0-7)

tc_outl: Flag Fossil Outlier Records in Space and Time

Description

Flags fossil records that are spatio-temporal outliers based on interquantile ranges. Records are flagged if they are extreme in time, in space, or both.

Usage

tc_outl(x, lon = "lng", lat = "lat", 
        min.age = "min_ma", max.age = "max_ma", taxon = "accepted_name", 
        method = "quantile", size.thresh = 7, mltpl = 5, 
        replicates = 5, flag.thresh = 0.5, 
        uniq.loc = FALSE, value = "clean", verbose = TRUE)

Arguments

x

a data.frame. Containing geographical coordinates and species names.

lon

a character string. The column with the longitude coordinates. Default = “lng”.

lat

a character string. The column with the latitude coordinates. Default = “lat”.

min.age

a character string. The column with the minimum age. Default = “min_ma”.

max.age

a character string. The column with the maximum age. Default = “max_ma”.

taxon

a character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”.

method

a character string. Defining the method for outlier selection. See details. Either “quantile” or “mad”. Default = “quantile”.

size.thresh

numeric. The minimum number of records needed for a dataset to be tested. Default = 7.

mltpl

numeric. The multiplier of the interquartile range (method == 'quantile') or median absolute deviation (method == 'mad') to identify outliers. See details. Default = 5.

replicates

numeric. The number of replications for the distance matrix calculation. See details. Default = 5.

flag.thresh

numeric. The fraction of replicates necessary to flag a record. See details. Default = 0.5.

uniq.loc

logical. If TRUE, only a single record per location and time point (and taxon if taxon != "") is used for the outlier testing. Default = FALSE.

value

a character string. Defining the output value. See value.

verbose

logical. If TRUE reports the name of the test and the number of records flagged.

Value

Depending on the ‘value’ argument, either a data.frame containing the records considered correct by the test (“clean”) or a logical vector, with TRUE = test passed and FALSE = test failed/potentially problematic (“flags”). Default = “clean”.
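The relationship between the two output types can be illustrated in base R. This is a minimal sketch with a made-up data.frame and a hand-written logical vector standing in for the result of value = "flags"; it assumes that the “clean” output corresponds to the rows where the flag is TRUE, as the description above implies.

```r
# Hypothetical records and flags, for illustration only
x <- data.frame(accepted_name = c("a", "a", "b"),
                lng = c(5, 6, 75),
                lat = c(0, 1, 2))

# Stand-in for the vector returned with value = "flags":
# TRUE = test passed, FALSE = potentially problematic
flags <- c(TRUE, TRUE, FALSE)

clean   <- x[flags, ]   # records considered correct by the test
flagged <- x[!flags, ]  # records to inspect before discarding
```

Keeping the logical vector instead of the cleaned data.frame makes it possible to inspect the flagged records before removing them.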

Details

The outlier detection is based on an interquantile range test. In a first step, a distance matrix of geographic distances among all records is calculated. Subsequently, a similar distance matrix of temporal distances among all records is calculated, based on a single point selected at random between the minimum and maximum age of each record. The mean distance from each point to all neighbours is calculated for both matrices, and the spatial and temporal distances are scaled to the same range. The sum of these distances is then tested against the interquantile range, and a record is flagged as an outlier if $x > q_{75} + IQR(x) * mltpl$. The test is replicated ‘replicates’ times to account for temporal uncertainty. Records are flagged as outliers if they are flagged in more than a fraction ‘flag.thresh’ of the replicates.

Only datasets/taxa comprising more than ‘size.thresh’ records are tested. Note that geographic distances are calculated as geospheric distances for datasets (or taxa) with fewer than 10,000 records and approximated as Euclidean distances for datasets/taxa with 10,000 to 25,000 records. Datasets/taxa comprising more than 25,000 records are skipped.
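The procedure described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names (haversine_matrix, rescale01, sketch_is_outlier) are hypothetical, great-circle distances are computed with a hand-written haversine formula, scaling is done by min-max rescaling, and mltpl is taken to multiply the interquartile range, as the mltpl argument description states. The sketch returns TRUE for outliers, which is the opposite sense of the “flags” output of tc_outl.

```r
# Illustrative sketch only; not the CoordinateCleaner source code.

# Pairwise great-circle (haversine) distances in km
haversine_matrix <- function(lon, lat, radius = 6371) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  dlon <- outer(lon, lon, "-")
  dlat <- outer(lat, lat, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  a[a > 1] <- 1  # guard against rounding slightly above 1
  2 * radius * asin(sqrt(a))
}

# Min-max rescaling to [0, 1]; a constant vector maps to 0
rescale01 <- function(v) {
  rng <- range(v)
  if (rng[1] == rng[2]) return(rep(0, length(v)))
  (v - rng[1]) / (rng[2] - rng[1])
}

sketch_is_outlier <- function(lon, lat, min_age, max_age,
                              mltpl = 5, replicates = 5, flag_thresh = 0.5) {
  n_rec <- length(lon)
  # Mean geographic distance from each record to all other records
  geo_mean <- rowSums(haversine_matrix(lon, lat)) / (n_rec - 1)

  hits <- replicate(replicates, {
    # One random age per record between its minimum and maximum age
    age <- runif(n_rec, min = min_age, max = max_age)
    tmp_mean <- rowSums(as.matrix(dist(age))) / (n_rec - 1)
    # Bring both distances to the same range and add them
    score <- rescale01(geo_mean) + rescale01(tmp_mean)
    score > quantile(score, 0.75) + IQR(score) * mltpl
  })

  # hits has one row per record and one column per replicate
  rowMeans(hits) > flag_thresh
}

# Ten identical records plus one record that is distant in space and time
lon <- c(rep(10, 10), 75)
lat <- rep(0, 11)
age <- c(rep(20, 10), 60)
out <- sketch_is_outlier(lon, lat, min_age = age, max_age = age)
which(unname(out))
```

Because the temporal distances depend on a random draw whenever the minimum and maximum ages differ, repeated replicates can disagree; the fraction threshold turns those votes into a single decision per record.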

Examples

# NOT RUN {
minages <- c(runif(n = 11, min = 10, max = 25), 62.5)
x <- data.frame(species = c(letters[1:10], rep("z", 2)),
                lng = c(runif(n = 10, min = 4, max = 16), 75, 7),
                lat = c(runif(n = 12, min = -5, max = 5)),
                min_ma = minages, 
                max_ma = c(minages[1:11] + runif(n = 11, min = 0, max = 5), 65))

tc_outl(x, value = "flags", taxon = "")


# }