CleanCoordinates: Geographic Cleaning of Coordinates from Biologic Collections

Description

Cleans geographic coordinates by applying multiple empirical tests that flag potentially erroneous records, addressing issues common in biological collection databases.

Usage

CleanCoordinates(x, lon = "decimallongitude", lat = "decimallatitude", 
                species = "species", countries = NULL, 
                capitals = TRUE, centroids = TRUE, 
                countrycheck = FALSE, duplicates = FALSE, equal = TRUE, 
                GBIF = TRUE, institutions = TRUE, outliers = FALSE, seas = TRUE,
                urban = FALSE, zeros = TRUE, 
                capitals.rad = 0.05, centroids.rad = 0.01,
                centroids.detail = "both", inst.rad = 0.001, 
                outliers.method = "quantile", outliers.mtp = 3,
                outliers.td = 1000, outliers.size = 7, zeros.rad = 0.5,
                capitals.ref, centroids.ref, country.ref,
                inst.ref, seas.ref, urban.ref,
                value = "spatialvalid", verbose = TRUE,
                report = FALSE)

Arguments

x

a data.frame containing geographical coordinates and species names.

lon

a character string. The column with the longitude coordinates. Default = “decimallongitude”.

lat

a character string. The column with the latitude coordinates. Default = “decimallatitude”.

species

a character string. A vector of the same length as rows in x, with the species identity for each record. If missing, the outliers test is skipped.

countries

a character string. A vector of the same length as rows in x, with country information for each record in ISO3 format. If missing, the countries test is skipped.

capitals

logical. If TRUE, tests a radius around adm-0 capitals. The radius is capitals.rad. Default = TRUE.

centroids

logical. If TRUE, tests a radius around country centroids. The radius is centroids.rad. Default = TRUE.

countrycheck

logical. If TRUE, tests if coordinates are from the country indicated in the country column. Default = FALSE.

duplicates

logical. If TRUE, tests for duplicate records. This checks for identical coordinates or, if a species vector is provided, for identical coordinates within a species. All but the first record are flagged as duplicates. Default = FALSE.

equal

logical. If TRUE, tests for equal absolute longitude and latitude. Default = TRUE.

GBIF

logical. If TRUE, tests a one-degree radius around the GBIF headquarters in Copenhagen, Denmark. Default = TRUE.

institutions

logical. If TRUE, tests a radius around known biodiversity institutions from the institutions dataset. The radius is inst.rad. Default = TRUE.

outliers

logical. If TRUE, tests each species for outlier records. Depending on the outliers.mtp and outliers.td arguments, either flags records that are a minimum distance away from all other records of this species (outliers.td) or records that are outside a multiple of the interquartile range of minimum distances to the next neighbour of this species (outliers.mtp). Default = FALSE.

seas

logical. If TRUE, tests if coordinates fall into the ocean. Default = TRUE.

urban

logical. If TRUE, tests if coordinates are from urban areas. Default = FALSE.

zeros

logical. If TRUE, tests for plain zeros, equal latitude and longitude and a radius around the point 0/0. The radius is zeros.rad. Default = TRUE.

capitals.rad

numeric. The radius around capital coordinates in degrees. Default = 0.05.

centroids.rad

numeric. The side length of the rectangle around country centroids in degrees. Default = 0.01.

centroids.detail

a character string. If set to ‘country’ only country (adm-0) centroids are tested, if set to ‘provinces’ only province (adm-1) centroids are tested. Default = ‘both’.

inst.rad

numeric. The radius around biodiversity institutions coordinates in degrees. Default = 0.001.

outliers.method

The method used for outlier testing. See details.

outliers.mtp

numeric. The multiplier for the interquartile range of the outlier test. If NULL outliers.td is used. Default = 3.

outliers.td

numeric. The minimum distance of a record to all other records of a species to be identified as outlier, in km. Default = 1000.

outliers.size

numeric. The minimum number of records in a dataset to run the taxon-specific outlier test. Default = 7.

zeros.rad

numeric. The radius around 0/0 in degrees. Default = 0.5.

capitals.ref

a data.frame with alternative reference data for the country capitals test. If missing, the capitals dataset is used. Alternatives must be identical in structure.

centroids.ref

a data.frame with alternative reference data for the centroid test. If missing, the centroids dataset is used. Alternatives must be identical in structure.

country.ref

a SpatialPolygonsDataFrame as alternative reference for the countrycheck test. If missing, the rnaturalearth::ne_countries('medium') dataset is used.

inst.ref

a data.frame with alternative reference data for the biodiversity institution test. If missing, the institutions dataset is used. Alternatives must be identical in structure.

seas.ref

a SpatialPolygonsDataFrame as alternative reference for the seas test. If missing, the landmass dataset is used.

urban.ref

a SpatialPolygonsDataFrame as alternative reference for the urban test. If missing, the test is skipped. See details for a reference gazetteer.

value

a character string defining the output value. One of ‘spatialvalid’, ‘summary’, ‘cleaned’. See the Value section for details. Default = ‘spatialvalid’.

verbose

logical. If TRUE, reports the name of each test and the number of records flagged. Default = TRUE.

report

logical or character. If TRUE a report file is written to the working directory, summarizing the cleaning results. If a character, the path to which the file should be written. Default = FALSE.
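
As a rough illustration of what the simpler tests described above check, the following base R sketch re-creates the duplicates, equal and zeros rules on a toy data.frame. It is not the package's implementation: the helper name flag_simple, the toy data, and the use of plain Euclidean distance in degrees for the radius around 0/0 are assumptions made for this example. As in the function's output, TRUE marks a record that passes a test.

```r
# Illustrative sketch only; flag_simple() is a hypothetical helper,
# not part of CoordinateCleaner.
flag_simple <- function(x, lon = "decimallongitude", lat = "decimallatitude",
                        species = "species", zeros.rad = 0.5) {
  lo <- x[[lon]]
  la <- x[[lat]]

  # duplicates: identical coordinates within a species; duplicated() marks
  # all but the first occurrence, so the first record stays unflagged
  dupl <- !duplicated(x[, c(species, lon, lat)])

  # equal: identical absolute longitude and latitude
  equal <- abs(lo) != abs(la)

  # zeros: a plain zero in either coordinate, equal longitude and latitude,
  # or a point within zeros.rad degrees of 0/0 (Euclidean distance in
  # degrees is an assumption of this sketch)
  zeros <- !(lo == 0 | la == 0 | lo == la | sqrt(lo^2 + la^2) <= zeros.rad)

  data.frame(duplicates = dupl, equal = equal, zeros = zeros)
}

# Toy records: row 2 duplicates row 1, row 3 lies close to 0/0,
# row 4 has equal absolute longitude and latitude
toy <- data.frame(
  species = c("a", "a", "a", "b", "b"),
  decimallongitude = c(10.5, 10.5, 0.1, 20, -33.2),
  decimallatitude = c(45.2, 45.2, 0.2, -20, 12.7)
)
flags <- flag_simple(toy)
```

Each column of flags corresponds to one rule, which mirrors the one-column-per-test layout described in the Value section.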

Value

Depending on the value argument:

“spatialvalid”: an object of class spatialvalid with one column for each test. TRUE = clean coordinate, FALSE = potentially problematic coordinates. The summary column is FALSE if any test flagged the respective coordinate.
“flags”: a logical vector with the same order as the input data summarizing the results of all tests. TRUE = clean coordinate, FALSE = potentially problematic (= at least one test failed).
“cleaned”: a data.frame of cleaned coordinates if species = NULL, or a data.frame with cleaned coordinates and species ID otherwise.
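
The relationship between the per-test columns, the overall summary and the cleaned data can be sketched in base R. The test results and records below are made up for illustration, and the object names (tests, records, cleaned) are assumptions of this example rather than objects produced by the package.

```r
# Hypothetical per-test results; TRUE = passed, FALSE = flagged
tests <- data.frame(
  capitals = c(TRUE, TRUE, FALSE, TRUE),
  seas     = c(TRUE, FALSE, TRUE, TRUE),
  zeros    = c(TRUE, TRUE, TRUE, TRUE)
)

# A record is clean overall only if every individual test passed,
# i.e. the summary is FALSE as soon as any test flagged the record
overall <- Reduce(`&`, tests)

# Hypothetical input records in the same row order as the test results
records <- data.frame(
  species = c("a", "a", "b", "b"),
  decimallongitude = c(45.1, 46.3, 47.5, 48.2),
  decimallatitude = c(-20.4, -18.9, -15.2, -12.6)
)

# Keeping only the unflagged rows gives the equivalent of a cleaned data.frame
cleaned <- records[overall, ]
```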

Details

The function needs all coordinates to be formally valid according to WGS84. If the data contains invalid coordinates, the function will stop and return a vector flagging the invalid records. TRUE = non-problematic coordinate, FALSE = potentially problematic coordinates.

A reference gazetteer for the urban test is available at https://github.com/azizka/CoordinateCleaner/tree/master/extra_gazetteers.

Three different methods are available for the outlier test:

“outlier”: a boxplot method is used and records are flagged as outliers if their mean distance to all other records of the same species is larger than mltpl * the interquartile range of the mean distance of all records of this species.
“mad”: the median absolute deviation is used. A record is flagged as an outlier if the mean distance to all other records of the same species is larger than the median of the mean distance of all points plus/minus the mad of the mean distances of all records of the species * mltpl.
“distance”: records are flagged as outliers if the minimum distance to the next record of the species is > tdi.
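
The three outlier rules can be sketched in base R for the records of a single species. This is an illustration under stated assumptions, not the package's code: the helpers haversine_km and flag_outliers are hypothetical, distances are great-circle distances on a sphere of radius 6371 km, and for the boxplot rule the sketch assumes the usual convention of adding mltpl * IQR to the third quartile, since the description above does not state the baseline. The names mltpl and tdi follow the text above and presumably correspond to outliers.mtp and outliers.td.

```r
# Great-circle distance in km (haversine formula, spherical Earth)
haversine_km <- function(lon1, lat1, lon2, lat2) {
  rad <- pi / 180
  dlon <- (lon2 - lon1) * rad
  dlat <- (lat2 - lat1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * 6371 * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: returns TRUE for records that are NOT flagged
flag_outliers <- function(lon, lat, method = c("outlier", "mad", "distance"),
                          mltpl = 3, tdi = 1000) {
  method <- match.arg(method)
  idx <- seq_along(lon)
  # pairwise distance matrix between all records of one species
  d <- outer(idx, idx,
             function(i, j) haversine_km(lon[i], lat[i], lon[j], lat[j]))
  diag(d) <- NA  # ignore the distance of a record to itself

  if (method == "distance") {
    # flagged if even the nearest neighbour is further away than tdi km
    return(apply(d, 1, min, na.rm = TRUE) <= tdi)
  }

  mean_d <- rowMeans(d, na.rm = TRUE)
  limit <- if (method == "outlier") {
    # boxplot rule; the third-quartile baseline is an assumption
    quantile(mean_d, 0.75, names = FALSE) + mltpl * IQR(mean_d)
  } else {
    # median plus mltpl times the median absolute deviation
    median(mean_d) + mltpl * mad(mean_d)
  }
  mean_d <= limit
}

# Seven clustered records and one record several thousand km away
lon <- c(45.0, 45.2, 45.4, 45.6, 45.8, 46.0, 46.2, 10)
lat <- c(-20.0, -19.8, -20.2, -19.9, -20.1, -20.0, -19.7, 50)

res_box  <- flag_outliers(lon, lat, method = "outlier")
res_mad  <- flag_outliers(lon, lat, method = "mad")
res_dist <- flag_outliers(lon, lat, method = "distance", tdi = 1000)
```

In this toy data the isolated eighth record is flagged by all three rules. The distance rule depends only on an absolute threshold, whereas the other two adapt to the spread of the species' own records.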

Examples

# NOT RUN {
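# Simulate 250 records with random species labels and coordinates
exmpl <- data.frame(species = sample(letters, size = 250, replace = TRUE),
                    decimallongitude = runif(250, min = 42, max = 51),
                    decimallatitude = runif(250, min = -26, max = -11))

# Run the tests that are enabled by default
test <- CleanCoordinates(x = exmpl)

# Summarise the test results
summary(test)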
# }
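
Because the function stops when coordinates are not formally valid, it can help to screen the input beforehand. The following base R sketch shows one way to do so. The helper valid_wgs84 and the toy records are assumptions of this example, and the check covers only missing values and the usual longitude and latitude ranges.

```r
# Hypothetical helper: TRUE = formally valid coordinate pair
valid_wgs84 <- function(x, lon = "decimallongitude", lat = "decimallatitude") {
  # non-numeric entries become NA and are then treated as invalid
  lo <- suppressWarnings(as.numeric(x[[lon]]))
  la <- suppressWarnings(as.numeric(x[[lat]]))
  !is.na(lo) & !is.na(la) & abs(lo) <= 180 & abs(la) <= 90
}

# Toy records: row 2 has an out-of-range longitude, row 3 a missing
# longitude, row 4 an out-of-range latitude
recs <- data.frame(
  species = c("a", "b", "c", "d"),
  decimallongitude = c(45.3, 181, NA, -60),
  decimallatitude = c(-20.1, 10, 5, -95)
)

ok <- valid_wgs84(recs)
recs_valid <- recs[ok, ]  # only these rows would be passed on for cleaning
```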
