dc_ddmm: Flag Datasets with a Degree Conversion Error

Description

This test identifies datasets in which a significant fraction of records has been subject to a common degree-minute to decimal-degree conversion error, where the degree sign is read as the decimal delimiter.

Usage

dc_ddmm(x, lon = "decimallongitude", lat = "decimallatitude", ds = "dataset", 
        pvalue = 0.025, diff = 1, mat.size = 1000, min.span = 2,
        value = "clean", verbose = TRUE, diagnostic = FALSE)

Arguments

x

a data.frame containing geographical coordinates and species names.

lon

a character string. The column with the longitude coordinates. Default = “decimallongitude”.

lat

a character string. The column with the latitude coordinates. Default = “decimallatitude”.

ds

a character string. The column with the dataset of each record. If x should be treated as a single dataset, this column must be identical for all records. Default = “dataset”.

pvalue

numeric. The p-value threshold of the one-sided t-test used to decide whether a dataset passes the test. Both the pvalue and the diff criteria must be met. Default = 0.025.

diff

numeric. The threshold difference for the ddmm test. Indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. Default = 1.

min.span

numeric. The minimum geographic extent of datasets to be tested. Default = 2.

mat.size

numeric. The size of the matrix for the binomial test. Must be changed in powers of ten (e.g. 100, 1000, 10000). Adapt it to the dataset size: generally, 100 is better for datasets with < 10,000 records, and 1000 is better for datasets with 10,000 to 1 M records. Higher values also work reasonably well for smaller datasets, hence the default of 1000. For large datasets, try 10000.

value

a character string. Defines the output value. See the Value section.

verbose

logical. If TRUE, reports the name of the test and the number of records flagged. Default = TRUE.

diagnostic

logical. If TRUE, plots the analysis matrix for each dataset. Default = FALSE.

Value

Depending on the ‘value’ argument, either a data.frame with summary statistics and flags for each dataset (“dataset”) or a data.frame containing the records considered correct by the test (“clean”) or a logical vector, with TRUE = test passed and FALSE = test failed/potentially problematic (“flags”). Default = “clean”.

Details

If the degree sign is read as the decimal delimiter during coordinate conversion, no coordinate decimals above 0.59 (59') are possible. This function uses a binomial test to check whether a significant proportion of the records in a dataset has been subject to this problem.

The test is best adjusted via the diff argument: the lower diff, the stricter the test. The appropriate value also scales with dataset size. Empirically, for datasets with < 5,000 unique coordinate records, diff = 0.1 has proven reasonable, flagging most datasets with >25% problematic records and all datasets with >50% problematic records. For datasets with 5,000 to 100,000 unique coordinate records, diff = 0.01 is recommended; for datasets with 100,000 to 1 M records, diff = 0.001; and so on. See https://github.com/azizka/CoordinateCleaner/wiki/3.-Identifying-problematic-data-sets:-CleanCoordinatesDS for explanation and simulation results.
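The conversion error described above can be reproduced in a few lines of base R. This is only an illustrative sketch: the helpers ddmm_to_decimal and ddmm_misread are hypothetical names and are not part of CoordinateCleaner, and the code does not reproduce the statistical test that dc_ddmm performs. It only shows why misread coordinates never have a decimal part of 0.6 or more.

```r
# Hypothetical helper (not part of CoordinateCleaner): correct conversion
# of degrees and minutes to decimal degrees.
ddmm_to_decimal <- function(degrees, minutes) degrees + minutes / 60

# Hypothetical helper: the erroneous conversion, where 12°45' is read as 12.45
# because the degree sign is taken as the decimal delimiter.
ddmm_misread <- function(degrees, minutes) degrees + minutes / 100

set.seed(1)
degrees <- sample(0:89, size = 1000, replace = TRUE)
minutes <- sample(0:59, size = 1000, replace = TRUE)

correct <- ddmm_to_decimal(degrees, minutes)
wrong   <- ddmm_misread(degrees, minutes)

# Decimal part of each coordinate
dec_correct <- correct - floor(correct)
dec_wrong   <- wrong - floor(wrong)

# Misread coordinates never reach a decimal part of 0.6,
# because minutes only run from 0 to 59.
max(dec_wrong) < 0.6

# Correctly converted coordinates fill the whole 0 to 1 range, so a
# substantial share (about 0.4) has a decimal part of 0.6 or more.
mean(dec_correct >= 0.6)
```

A dataset in which the decimals above 0.6 are nearly absent is therefore a candidate for having been affected by this error, which is the pattern the test looks for.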

Examples


# NOT RUN {
clean <- data.frame(species = letters[1:10], 
                decimallongitude = runif(100, -180, 180), 
                decimallatitude = runif(100, -90,90),
                dataset = "FR")
                
dc_ddmm(x = clean, value = "flags")

# problematic dataset: all coordinate decimals drawn from 0 to 0.59
lon <- sample(-180:180, size = 100, replace = TRUE) + runif(100, 0,0.59)
lat <- sample(-90:90, size = 100, replace = TRUE) + runif(100, 0,0.59)

prob <-  data.frame(species = letters[1:10], 
                decimallongitude = lon, 
                decimallatitude = lat,
                dataset = "FR")
                
dc_ddmm(x = prob, value = "flags")
# }
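The empirical guidance in Details on scaling diff with dataset size could be wrapped in a small helper, sketched below. The function suggest_diff is a hypothetical name introduced only for illustration and is not part of CoordinateCleaner. The thresholds are the ones quoted in Details; how the boundary values (exactly 5,000 or 100,000 records) and datasets above 1 M records are handled is an assumption, since Details only says "and so on".

```r
# Hypothetical helper (not part of CoordinateCleaner): suggest a starting
# value for `diff` from the number of unique coordinate records.
suggest_diff <- function(n_unique) {
  if (n_unique < 5000) {
    0.1
  } else if (n_unique < 100000) {
    0.01
  } else {
    # Stated in Details for 100,000 to 1 M records; reusing it for larger
    # datasets is an assumption, smaller values may be more appropriate.
    0.001
  }
}

# Count unique coordinate pairs in a simulated dataset
dat <- data.frame(decimallongitude = runif(100, -180, 180),
                  decimallatitude = runif(100, -90, 90),
                  dataset = "FR")
n_unique <- nrow(unique(dat[, c("decimallongitude", "decimallatitude")]))

suggest_diff(n_unique)

# The result could then be passed on, for example:
# dc_ddmm(x = dat, diff = suggest_diff(n_unique), value = "flags")
```

Treat the returned value as a starting point only, and check it against the simulation results linked in Details before relying on it.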
