coil

An R package for pre-processing and error evaluation of COI-5P barcode data

coil is an R package designed for the cleaning, contextualization, and assessment of cytochrome c oxidase I DNA barcode data (COI-5P, the five-prime portion of COI). It contains functions for placing COI-5P barcode sequences into a common reading frame, translating DNA sequences to amino acids, and assessing the likelihood that a given barcode sequence includes an insertion or deletion error. These functions are provided as a single-function analysis pipeline and are also available individually for efficient, targeted analysis of barcode data.

Installation

coil can be installed directly from CRAN.

install.packages("coil")
library(coil)

You can also install the development version of coil directly from GitHub. You'll need to have the R package devtools installed. Note that if the build_vignettes option is set to TRUE, you will also need to have the R package knitr installed.

#install.packages("devtools")
#install.packages("knitr") #required if build_vignettes = TRUE
#library(devtools) 
devtools::install_github("CNuge/coil", build_vignettes = TRUE)
library(coil)

The vignette can then be accessed from R using the following command:

vignette("coil-vignette")

How to use it

Below is a brief demonstration to get the user started. Please consult the package's vignette for a more detailed explanation of coil's functionality.

The package is built around the custom coi5p object, which takes a COI-5P DNA barcode sequence as input. The package contains functions for:

  • setting the sequence in reading frame
  • translating the sequence to amino acids
  • checking the sequence for evidence of insertion or deletion errors

The basic coi5p analysis pipeline is as follows:

example_nt_string #an input DNA string, contained in the coil package for demonstration purposes

#step 1: build the coi5p object
dat = coi5p(example_nt_string, name="example_sequence_1")

#step 2: frame the sequence
dat = frame(dat)

#step 3: by default censored translation is performed - see vignette for details
dat = translate(dat)

#step 3a: if the taxonomy is known but the translation table is not, a helper function
#can be used to look up the proper translation table.
which_trans_table("Scyliorhinidae")

#step 3b: the proper translation table can then be passed to the translation function
dat = translate(dat, trans_table = 2)

#step 4: check to see if an insertion or deletion is likely
dat = indel_check(dat)
dat
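To see what step 3 changes in practice, the default censored translation can be compared with a translation that uses an explicit table. This is a minimal sketch, assuming coil is installed and using only the functions shown above; the variable names are illustrative.

```r
library(coil)

# Build and frame the example sequence once, then translate it two ways.
framed_dat = frame(coi5p(example_nt_string, name = "example_sequence_1"))

# Default: censored translation (ambiguous codons appear as "?" in the output).
censored_aa = translate(framed_dat)$aaSeq

# Explicit table: translation table 2, as used in step 3b above.
table2_aa = translate(framed_dat, trans_table = 2)$aaSeq

# Print both amino acid sequences for a side-by-side comparison.
print(censored_aa)
print(table2_aa)
```

Positions that differ between the two strings are the codons that the censored translation left unresolved.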

All of the steps of the pipeline can be called at once through the coi5p_pipe function.

output = coi5p_pipe(example_nt_string)
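The pipeline function can also be given the same optional arguments used in the individual steps. The following is a sketch, assuming coil is installed and that name and trans_table are forwarded by coi5p_pipe as they are in the step-by-step calls; the variable name named_output is illustrative.

```r
library(coil)

# Run the full pipeline with a sequence name and an explicit translation table.
named_output = coi5p_pipe(example_nt_string,
                          name = "example_sequence_1",
                          trans_table = 2)

# The name supplied above is stored on the resulting coi5p object.
named_output$name
```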

Calling the variable name prints the coi5p object's summary and shows all of the important information, including: the original raw sequence, the sequence set in reading frame, the amino acid sequence and the summary stats regarding the likelihood of the sequence containing an error.

output 
#coi5p barcode sequence
#raw sequence:
#ctctacttgatttttggtgcatgag...ggacccaattctctatcaacactta
#framed sequence:
#---ctctacttgatttttggtgcat...ggacccaattctctatcaacactta
#Amino acid sequence:
#-LYLIFGAWAG?VG?ALSLLIRAEL...LTDRNLNTTFFDPAGGGDPILYQHL
#Raw sequence was trimmed: FALSE
#Stop codon present: FALSE, Amino acid PHMM score:-206.22045
#The sequence likely does not contain an insertion or deletion.
#Base pair 1 of the raw sequence is base pair 4 of the COI-5P region.

The coi5p object has the following components that can be extracted by the user using the dollar sign notation.

output$name         #the name of the sequence 
output$raw          #the input DNA sequence
output$framed       #the DNA sequence set in reading frame
output$aaSeq        #the amino acid sequence
output$aaScore      #the log likelihood score of the amino acid sequence - see vignette for details
output$indel_likely #a boolean indicating whether the sequence should be double checked for indel errors
output$stop_codons  #a boolean indicating whether the amino acid sequence contains stop codons
output$data         #contains the generated nucleotide and amino acid hidden state paths
output$was_trimmed  #a boolean indicating if part of raw DNA sequence was trimmed due to not matching the COI-5P region
output$align_report #a report indicating the first positional match between the raw sequence and the COI-5P region
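These components can be combined into a simple quality check. The sketch below assumes coil is installed and treats a sequence as needing review when either the indel flag or the stop codon flag is set; that review rule is an illustrative choice, not a recommendation from the package.

```r
library(coil)

output = coi5p_pipe(example_nt_string)

# Flag the sequence if either boolean component reports a possible problem.
needs_review = isTRUE(output$indel_likely) || isTRUE(output$stop_codons)

if (needs_review) {
  message("Sequence should be double checked for errors.")
} else {
  message("No indel or stop codon flags were raised.")
}
```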

Most use cases will involve the analysis of multiple sequences. Please consult the package's vignette for a suggested workflow for batch analysis and demonstration of how the batch analysis helper function can be used to build dataframes out of multiple coi5p objects.
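As a small illustration of batch analysis, the pipeline can be applied to several sequences with lapply and the results combined with flatten_coi5p. This is a sketch only, assuming coil is installed; the two-element input vector reuses the example sequence twice and the names seq_a and seq_b are hypothetical. The vignette remains the reference for the recommended workflow.

```r
library(coil)

# Hypothetical batch input: a named character vector of DNA barcode strings.
seqs = c(seq_a = example_nt_string, seq_b = example_nt_string)

# Run the full pipeline on each sequence, keeping the name with each result.
results = lapply(names(seqs), function(n) {
  coi5p_pipe(seqs[[n]], name = n)
})

# Flatten the list of coi5p objects into a single data frame.
batch_df = flatten_coi5p(results)
batch_df
```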

Citation

If you use coil in your research, please consider citing the following publication:

Nugent, C. M., Elliott, T. A., Ratnasingham, S., & Adamowicz, S. J. (2020). coil: an R package for cytochrome c oxidase I (COI) DNA barcode data cleaning, translation, and error evaluation. Genome, 63(6), 291-305. https://doi.org/10.1139/gen-2019-0206

Acknowledgements

Funding for the development of this software was provided by grants in Bioinformatics and Computational Biology from the Government of Canada through Genome Canada and Ontario Genomics and from the Ontario Research Fund. Funders played no role in the study design or preparation of this software. Thank you to Sarah J. Adamowicz and Sujeevan Ratnasingham who contributed to the conceptualization of this software. Thank you to Tyler A. Elliott for aiding in the acquisition and curation of data. Thank you to Samantha Majoros for aiding in the initial testing of this package. Thank you to Suz Bateson for designing the logo for the coil package.

Version: 1.2.4

License: GPL-3

Last Published: January 11th, 2024

Monthly Downloads: 262

Functions in coil (1.2.4)

  • print.coi5p: Print a summary of a coi5p object.
  • example_barcode_data: Example barcode data.
  • censored_translation: Censored translation of a DNA string.
  • set_frame: Take an input sequence and get it into the reading frame.
  • aa_coi_PHMM: Amino acid profile hidden Markov model for coi5p.
  • translate: Translate a coi5p sequence.
  • trans_df: Data frame containing the translation table recommendation.
  • subsetPHMM: Subset an existing PHMM.
  • individual_AAbin: Build an AAbin with ape.
  • translate_codon: Censored translation of a codon.
  • individual_DNAbin: Build a DNAbin with ape.
  • nt_coi_PHMM: Nucleotide profile hidden Markov model for coi5p.
  • new_coi5p: Build a new coi5p class instance.
  • validate_coi5p: Validate the new coi5p class instance.
  • ins_front_trim: Check a sequence for an early large string of deletions. If one exists, return the starting index by which to slice the path and the string.
  • which_trans_table: Determine the translation table to use for a given taxonomic group.
  • leading_ins: Check for a large number of leading inserted bases.
  • coi5p: Build a coi5p object from a DNA sequence string.
  • frame: Take a coi5p sequence and place it in reading frame.
  • coi5p_pipe: Run the entire coi5p pipeline for an input sequence.
  • indel_check: Check if a coi5p sequence likely contains an error.
  • flatten_coi5p: Flatten a list of coi5p output objects into a dataframe.
  • coil: coil: evaluation of COI-5P barcode data.
  • example_nt_string: Example coi5p DNA sequence string.