Version : 0.2.3.9; Copyright (C) 2014-2023: ICAR-NBPGR; License: GPL-2|GPL-3
Aravind, J. [1], Radhamani, J. [1], Kalyani Srinivasan [1], Ananda Subhash, B. [2], and Tyagi, R. K. [1]
  1. ICAR-National Bureau of Plant Genetic Resources, New Delhi, India
  2. Centre for Development of Advanced Computing, Thiruvananthapuram, Kerala, India


Introduction

The R package PGRdup was developed as a tool to aid genebank managers in the identification of probable duplicate accessions from plant genetic resources (PGR) passport databases.

This package primarily implements a workflow designed to fetch groups or sets of germplasm accessions with similar passport data, particularly in fields associated with accession names, within or across PGR passport databases.

The functions in this package are primarily built using the following R packages:

Installation

The package can be installed from CRAN as follows:

# Install from CRAN
install.packages('PGRdup', dependencies=TRUE)

The development version can be installed from GitHub as follows:

# Install development version from GitHub
devtools::install_github("aravind-j/PGRdup")

Workflow

The series of steps involved in the workflow, along with the associated functions, are illustrated below:

Step 1

Function(s) :

  • DataClean
  • MergeKW
  • MergePrefix
  • MergeSuffix

Use these functions for the appropriate data standardisation of the relevant fields in the passport databases to harmonize punctuation, leading zeros, prefixes, suffixes etc. associated with accession names.
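
As a rough illustration of what this kind of standardisation involves, the base R sketch below upper-cases names, replaces punctuation with spaces and strips leading zeros from numeric parts. The helper name standardise_names and its rules are assumptions made for this example only; they are not the rules applied by DataClean, which are documented in ?DataClean.

```r
# Hypothetical helper (not part of PGRdup): a simplified standardisation
standardise_names <- function(x) {
  x <- toupper(x)                                        # ignore case differences
  x <- gsub("[[:punct:]]+", " ", x)                      # punctuation -> single space
  x <- gsub("(?<![0-9])0+(?=[0-9])", "", x, perl = TRUE) # drop leading zeros of numbers
  x <- gsub("\\s+", " ", x)                              # collapse repeated spaces
  trimws(x)
}

# Three spellings of the same accession name
standardise_names(c("ec-0012", "EC 12", "Ec.12"))
# all three become "EC 12"
```

After such a step, variants that differ only in case, punctuation or zero padding compare as equal, which is what makes the later keyword matching meaningful.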

Step 2

Function(s) :

  • KWIC

Use this function to extract the information in the relevant fields as keywords or text strings in the form of a searchable Keyword in Context (KWIC) index.
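
To make the idea of a KWIC index concrete, here is a minimal base R sketch that lists every keyword together with the record ID and the full string it occurs in. The toy data, the column names and the helper toy_kwic are all assumptions for illustration; the actual KWIC function returns its own object class and handles several fields at once (see ?KWIC).

```r
# Toy passport data; column names are illustrative assumptions
passport <- data.frame(
  ID   = c("A1", "A2"),
  Name = c("GUJARAT DWARF", "DWARF RED"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not the package's KWIC): one row per keyword occurrence
toy_kwic <- function(df, key, field) {
  rows <- lapply(seq_len(nrow(df)), function(i) {
    words <- strsplit(df[[field]][i], "\\s+")[[1]]
    data.frame(
      PRIM_ID = df[[key]][i],
      KEYWORD = words,
      CONTEXT = df[[field]][i],  # full string in which the keyword occurs
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  out[order(out$KEYWORD), ]      # sort by keyword so the index is searchable
}

index <- toy_kwic(passport, key = "ID", field = "Name")

# Look up every record in which the keyword "DWARF" appears
index[index$KEYWORD == "DWARF", ]
```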

Step 3

Function(s) :

  • ProbDup

Use this function to run fuzzy, phonetic and semantic matching of keywords and identify probable duplicate sets, either within a single KWIC index or between two indexes. Fuzzy matching uses the Levenshtein edit distance, while phonetic matching uses the Double Metaphone algorithm. For semantic matching, synonym sets or ‘synsets’ of accession names can be supplied as an input, and the text strings in such sets will be treated as identical for matching. Various options to tweak the matching strategies are also available in this function.
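
The fuzzy matching part can be pictured with a small base R sketch. It uses utils::adist, which computes a generalized Levenshtein distance, to compare a handful of made-up keywords and report the pairs that fall within a chosen threshold. The keywords and the threshold max_dist are assumptions for this example, not package defaults, and the sketch does not reproduce how ProbDup itself forms sets.

```r
# Made-up keywords for illustration
keywords <- c("SHULAMITH", "SHULAMIT", "SHULAMITH1", "GEORGIA")

# Pairwise generalized Levenshtein distances (base R, utils package)
d <- utils::adist(keywords)
dimnames(d) <- list(keywords, keywords)

# Treat pairs within an assumed maximum edit distance as candidate matches
max_dist <- 2  # illustrative threshold only
hits <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)

data.frame(
  a        = keywords[hits[, "row"]],
  b        = keywords[hits[, "col"]],
  distance = d[hits]
)
```

Here the three SHULAMITH variants pair up with each other, while GEORGIA matches nothing. A lower threshold gives fewer false positives at the cost of missing more distant spelling variants.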

Step 4

Function(s) :

  • DisProbDup
  • ReviewProbDup
  • ReconstructProbDup

Inspect, revise and improve the retrieved sets using these functions. If considerable intersections exist between the initially identified sets, then DisProbDup may be used to get the disjoint sets. The identified sets may be subjected to clerical review after transforming them into an appropriate spreadsheet format which contains the raw data from the original database(s) using ReviewProbDup and subsequently converted back using ReconstructProbDup.
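
One way to think about obtaining disjoint sets is to keep merging any sets that share a member until no two sets overlap. The base R sketch below does this for a few toy sets of accession IDs. The helper merge_overlapping is hypothetical and is only meant to show the idea; it makes no claim about how DisProbDup is implemented.

```r
# Hypothetical helper (not DisProbDup itself): merge sets that share members
merge_overlapping <- function(sets) {
  merged <- list()
  for (s in sets) {
    # which already-merged sets intersect the current set?
    overlap <- vapply(merged, function(m) length(intersect(m, s)) > 0, logical(1))
    # absorb them into the current set, keep the rest unchanged
    s <- sort(unique(c(s, unlist(merged[overlap]))))
    merged <- c(merged[!overlap], list(s))
  }
  merged
}

# Toy sets of accession IDs; the first, second and fourth overlap in a chain
sets <- list(c("A1", "A2"), c("A2", "A3"), c("B1", "B2"), c("A3", "A4"))
disjoint <- merge_overlapping(sets)
disjoint
```

In this toy input the chain A1-A2, A2-A3, A3-A4 collapses into one set of four IDs, and the B set is left untouched.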

Adjuncts

Function(s) :

  • ValidatePrimKey
  • DoubleMetaphone
  • ParseProbDup
  • AddProbDup
  • SplitProbDup
  • MergeProbDup
  • ViewProbDup
  • KWCounts
  • read.genesys

Use these helper functions if needed. ValidatePrimKey can be used to check whether a column chosen in a data.frame as the primary key/ID conforms to the constraints of absence of duplicates and NULL values.
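
The two constraints can be checked by hand in base R, as in the sketch below. The helper check_key and the toy data are assumptions for illustration and do not reproduce the output format of ValidatePrimKey. In a data.frame column, missing entries appear as NA, so that is what the sketch tests for.

```r
# Hypothetical check (not ValidatePrimKey itself)
check_key <- function(df, key) {
  x <- df[[key]]
  list(
    has_missing    = anyNA(x),              # any missing (NA) IDs?
    has_duplicates = anyDuplicated(x) > 0   # any repeated IDs?
  )
}

# Toy data with one repeated and one missing ID
passport <- data.frame(ID = c("A1", "A2", "A2", NA), stringsAsFactors = FALSE)
result <- check_key(passport, "ID")
result
```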

DoubleMetaphone is an implementation of the Double Metaphone phonetic algorithm in R and is used for phonetic matching.

ParseProbDup and AddProbDup work with objects of class ProbDup. The former can be used to parse the probable duplicate sets in a ProbDup object to a data.frame, while the latter can be used to add the data fields of these sets to the passport databases. SplitProbDup can be used to split an object of class ProbDup according to set counts. MergeProbDup can be used to merge two objects of class ProbDup. ViewProbDup can be used to plot summary visualizations of the probable duplicate sets retrieved in an object of class ProbDup.

KWCounts can be used to compute keyword counts from PGR passport database fields (columns), which can give a rough indication of the completeness of the data.
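
Counting keywords amounts to splitting each field into words and tabulating them, which a few lines of base R can sketch. The field values below are made up, and the sketch does not mirror the arguments or output of KWCounts (see ?KWCounts).

```r
# Made-up values from a single passport field
fields <- c("GUJARAT DWARF", "DWARF RED", "RED LOCAL", "LOCAL")

# Split on whitespace and tabulate keyword frequencies
words  <- unlist(strsplit(fields, "\\s+"))
counts <- sort(table(words), decreasing = TRUE)
counts
```

Very frequent keywords found this way are often generic terms that carry little identifying information.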

read.genesys can be used to import PGR data in a Darwin Core - germplasm zip archive downloaded from the Genesys database into the R environment.

Tips

  • Use fread from the data.table package to rapidly read large files instead of read.csv or read.table in base R.
  • In case the PGR passport data is in any DBMS, use the appropriate R-database interface packages to get the required table as a data.frame in R.

Note

  • The ProbDup function can be memory hungry with large passport databases. In such cases, ensure that the system has sufficient memory for smooth functioning (See ?ProbDup).

Detailed tutorial

For a detailed tutorial (vignette) on how to use this package, type:

browseVignettes(package = 'PGRdup')

The vignette for the latest version is also available online.

What’s new

To know what's new in this version, type:

news(package='PGRdup')

Links

CRAN page

Github page

Documentation website

Zenodo DOI

CRAN checks

r-devel-linux-x86_64-debian-clang
r-devel-linux-x86_64-debian-gcc
r-devel-linux-x86_64-fedora-clang
r-devel-linux-x86_64-fedora-gcc
r-patched-linux-x86_64
r-release-linux-x86_64

r-devel-windows-x86_64
r-release-windows-x86_64
r-oldrel-windows-x86_64

r-release-macos-x86_64
r-oldrel-macos-x86_64

Citing PGRdup

To cite the methods in the package use:

citation("PGRdup")

To cite the R package 'PGRdup' in publications use:

  Aravind, J., Radhamani, J., Kalyani Srinivasan, Ananda Subhash, B., and Tyagi, R. K. (). PGRdup:
  Discover Probable Duplicates in Plant Genetic Resources Collections. R package version 0.2.3.9,
  https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections},
    author = {J. Aravind and J. Radhamani and {Kalyani Srinivasan} and B. {Ananda Subhash} and Rishi Kumar Tyagi},
    note = {R package version 0.2.3.9 https://github.com/aravind-j/PGRdup, https://cran.r-project.org/package=PGRdup},
  }

This free and open-source software implements academic research by the authors and co-workers. If you use
it, please support the project by citing the package.

Last Published

August 31st, 2023

Functions in PGRdup (0.2.3.9)

  • ParseProbDup: Parse an object of class ProbDup to a data frame
  • SplitProbDup: Split an object of class ProbDup
  • print.ProbDup: Prints summary of ProbDup object
  • read.genesys: Convert 'Darwin Core - Germplasm' zip archive to a flat file
  • print.KWIC: Prints summary of KWIC object
  • ProbDup: Identify probable duplicates of accessions
  • ViewProbDup: Visualize the probable duplicate sets retrieved in a ProbDup object
  • ValidatePrimKey: Validate if a data frame column conforms to primary key/ID constraints
  • ReconstructProbDup: Reconstruct an object of class ProbDup
  • ReviewProbDup: Retrieve probable duplicate set information from PGR passport database for review
  • MergeKW: Merge keyword strings
  • KWIC: Create a KWIC index
  • MergeProbDup: Merge two objects of class ProbDup
  • PGRdup-package: The PGRdup Package
  • KWCounts: Generate keyword counts
  • DoubleMetaphone: 'Double Metaphone' phonetic algorithm
  • DisProbDup: Get disjoint probable duplicate sets
  • GN1000: Sample groundnut PGR passport data
  • DataClean: Clean PGR passport data
  • AddProbDup: Add probable duplicate sets fields to the PGR passport database