CoordinateCleaner (version 3.0.1)

cd_ddmm: Identify Datasets with a Degree Conversion Error

Description

This test flags datasets in which a significant fraction of records has been subject to a common degree-minute to decimal-degree conversion error, where the degree sign was read as the decimal delimiter.

Usage

cd_ddmm(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "dataset",
  pvalue = 0.025,
  diff = 1,
  mat_size = 1000,
  min_span = 2,
  value = "clean",
  verbose = TRUE,
  diagnostic = FALSE
)

Value

Depending on the ‘value’ argument, either a data.frame with summary statistics and flags for each dataset (“dataset”), a data.frame containing the records considered correct by the test (“clean”), or a logical vector (“flags”), with TRUE = test passed and FALSE = test failed/potentially problematic. Default = “clean”.

Arguments

x

data.frame. Containing geographical coordinates and species names.

lon

character string. The column with the longitude coordinates. Default = “decimalLongitude”.

lat

character string. The column with the latitude coordinates. Default = “decimalLatitude”.

ds

character string. The column with the dataset identifier of each record. If x should be treated as a single dataset, use a column that is identical for all records. Default = “dataset”.

pvalue

numeric. The p-value threshold for the one-sided test (described as a binomial test under Details) used to decide whether a dataset passes. Both the pvalue and the diff criteria must be met. Default = 0.025.

diff

numeric. The threshold difference for the ddmm test. Indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. Default = 1.

mat_size

numeric. The size of the matrix for the binomial test. Change it in powers of ten (e.g. 100, 1000, 10000) and adapt it to dataset size: generally 100 is better for datasets with < 10000 records, and 1000 is better for datasets with 10000 - 1M records. Higher values also work reasonably well for smaller datasets, hence the default of 1000. For large datasets try 10000.

min_span

numeric. The minimum geographic extent of datasets to be tested. Default = 2.

value

character string. Defines the output value. See the Value section.

verbose

logical. If TRUE, reports the name of the test and the number of records flagged.

diagnostic

logical. If TRUE, plots the analysis matrix for each dataset.
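
The ds and mat_size descriptions above can be sketched in base R as follows. This is only an illustration: the data frame occ, the identifier "all_records", and the helper suggest_mat_size() are hypothetical names that are not part of CoordinateCleaner, and the size cut-offs simply restate the rule of thumb given for mat_size.

```r
# Hypothetical occurrence table that has no dataset column yet.
occ <- data.frame(
  species = rep(c("a", "b"), each = 50),
  decimalLongitude = runif(100, -180, 180),
  decimalLatitude = runif(100, -90, 90)
)

# Treat all records as a single dataset: the same identifier on every row.
occ$dataset <- "all_records"

# Hypothetical helper (not a CoordinateCleaner function) that restates the
# mat_size guidance: 100 below 10000 records, 1000 up to 1M, 10000 beyond.
suggest_mat_size <- function(n_records) {
  if (n_records < 10000) {
    100
  } else if (n_records <= 1e6) {
    1000
  } else {
    10000
  }
}

suggest_mat_size(nrow(occ))
```

Since higher values are said to work reasonably well for smaller datasets too, leaving mat_size at its default of 1000 is an equally defensible choice for a table of this size.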

Details

If the degree sign is recognized as the decimal delimiter during coordinate conversion, no coordinate decimals above 0.59 (59') are possible. This function uses a binomial test to check whether a significant proportion of records in a dataset is affected by this problem. The test is best adjusted via the diff argument: the lower diff, the stricter the test. The appropriate value also scales with dataset size. Empirically, for datasets with < 5,000 unique coordinate records, diff = 0.1 has proven reasonable, flagging most datasets with >25% problematic records and all datasets with >50% problematic records. For datasets with between 5,000 and 100,000 geographically unique records, diff = 0.01 is recommended; for datasets with between 100,000 and 1 M records, diff = 0.001; and so on.
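
The following base R sketch illustrates the error described above and a deliberately simplified version of the check. It is not the package's implementation, which works on an analysis matrix controlled by mat_size. The helpers dm_to_dd(), dm_to_dd_faulty(), frac() and suggest_diff() are hypothetical names, and the null proportion of 0.6 assumes that the decimals of unaffected coordinates are roughly uniform.

```r
# Correct conversion from degrees and minutes to decimal degrees.
dm_to_dd <- function(deg, mins) deg + mins / 60

# Faulty conversion: the degree sign is read as the decimal delimiter,
# so 45 degrees 30 minutes becomes 45.30 instead of 45.5.
dm_to_dd_faulty <- function(deg, mins) deg + mins / 100

set.seed(1)
deg  <- sample(0:89, size = 1000, replace = TRUE)
mins <- sample(0:59, size = 1000, replace = TRUE)
bad  <- dm_to_dd_faulty(deg, mins)   # affected coordinates
good <- runif(1000, 0, 90)           # unaffected, uniform decimals

# Decimal part of each coordinate.
frac <- function(x) abs(x) %% 1

# Affected coordinates never have decimals of 0.6 or more.
max(frac(bad))

# Simplified one-sided binomial test (assumption: under uniform decimals,
# a share of 0.6 of the records falls below 0.6).
p_bad <- binom.test(sum(frac(bad) < 0.6), length(bad),
                    p = 0.6, alternative = "greater")$p.value
p_good <- binom.test(sum(frac(good) < 0.6), length(good),
                     p = 0.6, alternative = "greater")$p.value

# Hypothetical helper restating the diff rule of thumb. The interval
# boundaries are an assumption; the text gives no value beyond 1 M records.
suggest_diff <- function(n_unique) {
  if (n_unique < 5000) {
    0.1
  } else if (n_unique < 1e5) {
    0.01
  } else if (n_unique <= 1e6) {
    0.001
  } else {
    NA_real_
  }
}
```

In this sketch p_bad falls far below the default pvalue of 0.025 because every affected record has decimals below 0.6, whereas p_good depends on the random draw.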

See Also

Other Datasets: cd_round()

Examples

# clean dataset: decimals are spread over the full 0-1 range
clean <- data.frame(species = letters[1:10],
                    decimalLongitude = runif(100, -180, 180),
                    decimalLatitude = runif(100, -90, 90),
                    dataset = "FR")

cd_ddmm(x = clean, value = "flagged")

# problematic dataset: no decimals above 0.59
lon <- sample(0:180, size = 100, replace = TRUE) + runif(100, 0, 0.59)
lat <- sample(0:90, size = 100, replace = TRUE) + runif(100, 0, 0.59)

prob <- data.frame(species = letters[1:10],
                   decimalLongitude = lon,
                   decimalLatitude = lat,
                   dataset = "FR")

cd_ddmm(x = prob, value = "flagged")