CoordinateCleaner (version 1.0-7)

CleanCoordinatesFOS: Geographic and Temporal Cleaning of Records from Fossil Collections

Description

Cleans records using multiple empirical tests that flag potentially erroneous coordinates and time spans, addressing issues common in fossil collection databases.

Usage

CleanCoordinatesFOS(x, lon = "lng", lat = "lat", min.age = "min_ma", max.age = "max_ma", 
                    taxon = "accepted_name", countries = "cc", 
                    centroids = TRUE, countrycheck = TRUE, 
                    equal = TRUE, GBIF = TRUE, institutions = TRUE, 
                    temp.range.outliers = TRUE, spatio.temp.outliers = TRUE, 
                    temp.ages.equal = TRUE, 
                    zeros = TRUE, centroids.rad = 0.05, 
                    centroids.detail = "both", 
                    inst.rad = 0.001, outliers.method = "quantile", 
                    outliers.threshold = 5, outliers.size = 7, 
                    outliers.replicates = 5,
                    zeros.rad = 0.5, centroids.ref, country.ref, inst.ref, 
                    value = "spatialvalid", verbose = TRUE, report = FALSE)

Arguments

x

a data.frame containing geographical coordinates and species names.

lon

a character string. The column with the longitude coordinates. Default = “lng”.

lat

a character string. The column with the latitude coordinates. Default = “lat”.

min.age

a character string. The column with the minimum age. Default = “min_ma”.

max.age

a character string. The column with the maximum age. Default = “max_ma”.

taxon

a character string. The column with the taxon name. If “”, searches for outliers over the entire dataset, otherwise per specified taxon. Default = “accepted_name”.

countries

a character string. The column with the country information for each record, in ISO3 format. If missing, the countries test is skipped. Default = “cc”.

centroids

logical. If TRUE, tests a radius around country centroids. The radius is centroids.rad. Default = TRUE.

countrycheck

logical. If TRUE, tests if coordinates are from the country indicated in the country column. Default = TRUE.

equal

logical. If TRUE, tests for equal absolute longitude and latitude. Default = TRUE.

GBIF

logical. If TRUE, tests a one-degree radius around the GBIF headquarters in Copenhagen, Denmark. Default = TRUE.

institutions

logical. If TRUE, tests a radius around known biodiversity institutions from institutions. The radius is inst.rad. Default = TRUE.

temp.range.outliers

logical. If TRUE, tests for records with unexpectedly large temporal ranges, using a quantile-based outlier test. Default = TRUE.

spatio.temp.outliers

logical. If TRUE, tests for records that are outliers in time and space. See dc_round for details. Default = TRUE.

temp.ages.equal

logical. If TRUE, flags records with equal minimum and maximum age. Default = TRUE.

zeros

logical. If TRUE, tests for plain zeros, equal latitude and longitude and a radius around the point 0/0. The radius is zeros.rad. Default = TRUE.

centroids.rad

numeric. The side length of the rectangle around country centroids in degrees. Default = 0.05.

centroids.detail

a character string. If set to ‘country’ only country (adm-0) centroids are tested, if set to ‘provinces’ only province (adm-1) centroids are tested. Default = ‘both’.

inst.rad

numeric. The radius around biodiversity institutions coordinates in degrees. Default = 0.001.

outliers.method

a character string. The method used for outlier testing. See Details. Default = “quantile”.

outliers.threshold

numerical. The multiplier for the interquartile range for outlier detection. The higher the number, the more conservative the outlier tests. See dc_round for details. Default = 5.

outliers.size

numerical. The minimum number of records in a dataset to run the taxon-specific outlier test. Default = 7.

outliers.replicates

numeric. The number of replications for the distance matrix calculation. See details. Default = 5.

zeros.rad

numeric. The radius around 0/0 in degrees. Default = 0.5.

centroids.ref

a data.frame with alternative reference data for the centroid test. If missing, the centroids dataset is used. Alternatives must be identical in structure.

country.ref

a SpatialPolygonsDataFrame as alternative reference for the countrycheck test. If missing, the rnaturalearth::ne_countries('medium') dataset is used.

inst.ref

a data.frame with alternative reference data for the biodiversity institution test. If missing, the institutions dataset is used. Alternatives must be identical in structure.

value

a character string defining the output value. See the Value section for details. One of ‘spatialvalid’, ‘summary’, ‘cleaned’. Default = ‘spatialvalid’.

verbose

logical. If TRUE, reports the name of each test and the number of records flagged. Default = TRUE.

report

logical or character. If TRUE a report file is written to the working directory, summarizing the cleaning results. If a character, the path to which the file should be written. Default = FALSE.
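
Several of the simpler tests described above can be approximated in a few lines of base R. The following is a minimal sketch, not the package's implementation: the example records are made up, the variable names (`equal_ok`, `zeros_ok`, `ages_ok`, `zeros_rad`) are hypothetical, and the radius around 0/0 is assumed to be a plain planar distance in degrees.

```r
# Base R approximations of three simple tests (illustrative, not the package code).
recs <- data.frame(
  lng    = c(45.2, 12.5, 0.1, -30.0),
  lat    = c(-20.1, -12.5, 0.2, 10.0),
  min_ma = c(5.3, 23.0, 0.0, 33.9),
  max_ma = c(11.6, 28.1, 2.6, 33.9)
)

zeros_rad <- 0.5  # hypothetical variable mirroring the zeros.rad default

# Convention used here: TRUE = record passes the test, FALSE = record is flagged.

# 'equal' test: equal absolute longitude and latitude
equal_ok <- abs(recs$lng) != abs(recs$lat)

# 'zeros' test: plain zeros or a point within zeros_rad degrees of 0/0
# (planar distance in degrees is an assumption of this sketch)
zeros_ok <- !(recs$lng == 0 | recs$lat == 0 |
                sqrt(recs$lng^2 + recs$lat^2) <= zeros_rad)

# 'temp.ages.equal' test: identical minimum and maximum age
ages_ok <- recs$min_ma != recs$max_ma

data.frame(equal_ok, zeros_ok, ages_ok)
```

In this made-up data, the second record fails the equal-coordinates check, the third lies within the radius around 0/0, and the fourth has identical minimum and maximum ages.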

Value

Depending on the value argument:

“spatialvalid”

an object of class spatialvalid with one column for each test. TRUE = clean coordinate, FALSE = potentially problematic coordinates. The summary column is FALSE if any test flagged the respective coordinate.

“flags”

a logical vector with the same order as the input data summarizing the results of all tests. TRUE = clean coordinate, FALSE = potentially problematic (= at least one test failed).

“cleaned”

a data.frame of cleaned coordinates if species = NULL or a data.frame with cleaned coordinates and species ID otherwise
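
The TRUE/FALSE convention above means that an overall summary is the logical AND of the individual test results. The sketch below illustrates this with a hypothetical table of per-test results; the column names and values are invented for illustration and are not output from the function.

```r
# Hypothetical per-test results for five records
# (TRUE = passed the test, FALSE = flagged as potentially problematic)
tests <- data.frame(
  zeros     = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  equal     = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  centroids = c(TRUE, TRUE, TRUE, TRUE, TRUE)
)

# A record is clean overall only if every individual test passed
tests$summary <- Reduce(`&`, tests)

# Use the summary as a row filter to keep only the clean records
recs <- data.frame(id = 1:5)
cleaned <- recs[tests$summary, , drop = FALSE]
cleaned
```

Filtering by the summary column in this way keeps only records that no test flagged, which is the idea behind returning a cleaned data.frame.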

Details

The outlier detection is based on an interquartile range (IQR) test. First, a distance matrix of geographic distances among all records is calculated. Next, a similar matrix of temporal distances among all records is calculated, based on a single age per record drawn at random between its minimum and maximum age. For both matrices, the mean distance from each record to all its neighbours is calculated, and the spatial and temporal distances are scaled to the same range. The sum of these distances is then tested against the interquartile range, and a record is flagged as an outlier if $x > IQR(x) + q_{75} \cdot mltpl$. The test is replicated ‘replicates’ times to account for temporal uncertainty. Records are flagged as outliers if they are flagged in a fraction of more than ‘flag.thres’ replicates. Only datasets/taxa comprising more than ‘size.thresh’ records are tested. Note that geographic distances are calculated as geospheric distances for datasets (or taxa) with fewer than 10,000 records and approximated as Euclidean distances for datasets/taxa with 10,000 to 25,000 records. Datasets/taxa comprising more than 25,000 records are skipped.
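
The procedure described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's code: it uses planar Euclidean distances on the coordinates instead of geospheric distances, the helper `flag_once` and the 0.5 replicate fraction are hypothetical, and the threshold is written as the conventional upper fence (third quartile plus a multiple of the IQR), which is an assumption about the intended rule rather than a transcription of the formula as printed.

```r
# Simplified sketch of a spatio-temporal IQR outlier test (not the package code).
set.seed(1)
n <- 50
recs <- data.frame(lng = runif(n, 42, 51),
                   lat = runif(n, -26, -11),
                   min_ma = runif(n, 0, 10))
recs$max_ma <- recs$min_ma + runif(n, 0.1, 5)

# Add one record that is far from the rest in both space and time
recs <- rbind(recs, data.frame(lng = 120, lat = 60, min_ma = 400, max_ma = 410))

# Hypothetical helper: run the test once and return TRUE for flagged records
flag_once <- function(d, mltpl) {
  # Draw one age per record between its minimum and maximum age
  age <- runif(nrow(d), d$min_ma, d$max_ma)
  # Pairwise distances in space (planar approximation) and in time
  sp <- as.matrix(dist(d[, c("lng", "lat")]))
  tm <- as.matrix(dist(age))
  # Mean distance of each record to all other records
  sp_mean <- rowSums(sp) / (nrow(d) - 1)
  tm_mean <- rowSums(tm) / (nrow(d) - 1)
  # Scale both to the range [0, 1] and add them
  rescale <- function(v) (v - min(v)) / (max(v) - min(v))
  x <- rescale(sp_mean) + rescale(tm_mean)
  # Conventional upper fence (an assumption of this sketch)
  x > quantile(x, 0.75) + mltpl * IQR(x)
}

# Replicate the test to account for temporal uncertainty
replicates <- 5
flags <- replicate(replicates, flag_once(recs, mltpl = 5))

# Flag records that were outliers in more than half of the replicates
# (0.5 is an assumed fraction for illustration)
outlier <- rowMeans(flags) > 0.5
which(outlier)
```

Because each replicate draws new ages, a record with a wide age range may be flagged in some replicates and not in others; requiring a minimum fraction of replicates makes the result less sensitive to any single draw.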

Examples

# NOT RUN {
minages <- runif(250, 0, 65)
exmpl <- data.frame(accepted_name = sample(letters, size = 250, replace = TRUE),
                    lng = runif(250, min = 42, max = 51),
                    lat = runif(250, min = -26, max = -11),
                    min_ma = minages,
                    max_ma = minages + runif(250, 0.1, 65))

test <- CleanCoordinatesFOS(x = exmpl)

summary(test)
# }
