revtools (version 0.4.1)

find_duplicates: Locate duplicated information within a data.frame

Description

Identify potential duplicates within a data.frame.

Usage

find_duplicates(data, match_variable, group_variables,
  match_function, method, threshold,
  to_lower = FALSE, remove_punctuation = FALSE)

Arguments

data

a data.frame containing data to be matched

match_variable

a length-1 integer or string giving the column in which duplicates should be sought. Defaults to doi where available, followed by title. If neither is found, the function will fail.

group_variables

an optional vector listing the columns to use as grouping variables; that is, categories within which duplicates should be sought (see 'note'). Optionally NULL to compare all entries against one another.
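The effect of grouping can be sketched in base R with exact matching. This is only an illustration under stated assumptions: the records data.frame and its title and year columns are hypothetical, and duplicated() stands in for whatever find_duplicates does internally.

```r
# Hypothetical records: the same title appears three times, in two years.
records <- data.frame(
  title = c("Owl diets", "Owl diets", "Owl diets"),
  year = c(2001, 2001, 2015),
  stringsAsFactors = FALSE
)

# No grouping: every entry is compared with every other entry,
# so both repeats of the title are flagged.
dup_ungrouped <- duplicated(records$title)

# Grouping by year: a title only counts as a repeat if it recurs
# within the same year, so the 2015 entry is not flagged.
dup_grouped <- duplicated(records[, c("title", "year")])
```

Grouping narrows the comparison, which reduces both the number of comparisons and the chance of merging distinct records that happen to share a title.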

match_function

a function to calculate dissimilarity between strings. Defaults to "exact" if DOIs are available or "stringdist" otherwise.

method

the required 'method' option that corresponds with match_function. Defaults to NULL if match_function is "exact", "osa" for match_function == "stringdist", or "fuzz_m_ratio" for match_function == "fuzzdist".

threshold

an upper limit above which similar articles are not recognized as duplicates. Defaults to 5 for stringdist and 0.1 for fuzzdist. Ignored if match_function == "exact".
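To make the role of threshold concrete, here is a minimal sketch using base R's adist(), which computes a generalized Levenshtein distance. It is a stand-in for the stringdist "osa" method, not the same algorithm, and the titles are hypothetical. Treating a distance equal to the threshold as a match is also an assumption.

```r
# Hypothetical titles; not taken from the revtools example data.
titles <- c(
  "Habitat use by forest birds",
  "Habitat use by forest bird",
  "Nest predation in grasslands"
)

# Pairwise edit distances (generalized Levenshtein, not OSA).
d <- adist(titles)

# Pairs whose distance does not exceed the threshold are treated as
# potential duplicates; the inclusive boundary is an assumption.
threshold <- 5
is_match <- d <= threshold
```

Here the first two titles differ by a single character and fall under the threshold, while the third is far from both and does not.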

to_lower

logical: should text be made lower case prior to searching? Defaults to FALSE.

remove_punctuation

logical: should punctuation be removed prior to searching? Defaults to FALSE.
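The kind of normalization that to_lower and remove_punctuation describe can be approximated in base R as follows. This is a sketch only: the titles are hypothetical, and the exact rules revtools applies internally may differ.

```r
# Two hypothetical titles that differ only in case and punctuation.
titles <- c("Bird Song: A Review.", "bird song a review")

# Lower-case the text, then strip punctuation characters.
normalised <- gsub("[[:punct:]]", "", tolower(titles))

# After normalization the two titles are identical strings.
same_after_cleaning <- identical(normalised[1], normalised[2])
```

Without this step, an exact comparison of the raw titles would treat them as different records.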

Value

an integer vector, in which entries with the same integer have been selected as duplicates by the selected algorithm.
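A vector of this form can be used directly to flag or drop repeated entries. The sketch below uses a hypothetical groups vector and records data.frame rather than real output from find_duplicates.

```r
# Hypothetical result: entries 1 and 4 share a value, as do 2 and 5.
groups <- c(1L, 2L, 3L, 1L, 2L)
records <- data.frame(
  title = c("A", "B", "C", "A", "B"),
  stringsAsFactors = FALSE
)

# Entries that belong to any group with more than one member.
flagged <- groups %in% groups[duplicated(groups)]

# Keep only the first entry from each group.
unique_records <- records[!duplicated(groups), , drop = FALSE]
```

For anything beyond a quick check, screen_duplicates and extract_unique_references (see below) are the package's own tools for this step.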

See Also

screen_duplicates and extract_unique_references for manual and automated screening (respectively) of results from this function.

Examples

library(revtools)

# import data
file_location <- system.file(
  "extdata",
  "avian_ecology_bibliography.ris",
  package = "revtools")
x <- read_bibliography(file_location)

# generate then locate some 'fake' duplicates
x_duplicated <- rbind(x, x[1:5,])
x_check <- find_duplicates(x_duplicated)
# returns an integer vector; entries sharing a value are potential duplicates