read_RDP reads and filters RDP results for all samples in a study.
The input file should be created by RDP Classifier using the -h option to create a hierarchical listing.
Classifications for multiple samples can be combined into a single file using the merge-count command of RDP Classifier.
Only rows (lineages) with count > 0 for at least one sample are retained.
If drop.groups is TRUE:
Sequences with classifications at only the root or domain level (Root, Archaea, or Bacteria) are omitted because they provide poor taxonomic resolution.
Sequences classified to the class Chloroplast or genera Chlorophyta or Bacillariophyta are also omitted because they have little correspondence with the NCBI taxonomy.
These actions were hard-coded in earlier versions of chem16S (based on using the RDP Classifier taxonomy) but were made optional with adoption of the GTDB.
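The omissions made when drop.groups is TRUE can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy table, its column names (name, rank, count), and the counts are invented for illustration, and each row is assumed to be one terminal classification.

```r
# Hypothetical table of classifications (names and counts are made up)
taxa <- data.frame(
  name = c("Root", "Bacteria", "Chloroplast", "Bacillariophyta", "Escherichia/Shigella"),
  rank = c("rootrank", "domain", "class", "genus", "genus"),
  count = c(5, 40, 12, 3, 120),
  stringsAsFactors = FALSE
)

# Classifications that stop at the root or domain level
is_low_resolution <- taxa$rank %in% c("rootrank", "domain")

# Class Chloroplast and genera Chlorophyta or Bacillariophyta
is_excluded_taxon <- (taxa$rank == "class" & taxa$name == "Chloroplast") |
  (taxa$rank == "genus" & taxa$name %in% c("Chlorophyta", "Bacillariophyta"))

# Keep everything that falls in neither group
kept <- taxa[!(is_low_resolution | is_excluded_taxon), ]
print(kept)
```

In this toy table only the Escherichia/Shigella row remains after both groups are dropped.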
Then, only columns (samples) with classification count >= mincount are retained.
All remaining sequences (those classified to genus or higher levels) can be used for mapping to the NCBI taxonomy.
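The row and column filters described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by read_RDP: the table layout, the sample names, the counts, and the threshold of 200 are all hypothetical, and "classification count" is interpreted here as the column total for each sample.

```r
# Toy table: lineage metadata followed by one count column per sample
counts <- data.frame(
  lineage = c("Root;rootrank;Archaea;domain;Euryarchaeota;phylum;",
              "Root;rootrank;Bacteria;domain;Firmicutes;phylum;",
              "Root;rootrank;Bacteria;domain;Proteobacteria;phylum;"),
  sampleA = c(50, 300, 0),
  sampleB = c(10, 20, 0),
  stringsAsFactors = FALSE
)
sample_cols <- c("sampleA", "sampleB")
mincount <- 200  # hypothetical threshold for this sketch

# Keep rows (lineages) with count > 0 for at least one sample
keep_rows <- rowSums(counts[, sample_cols, drop = FALSE] > 0) > 0
counts <- counts[keep_rows, ]

# Keep columns (samples) whose total count is at least mincount
totals <- colSums(counts[, sample_cols, drop = FALSE])
keep_samples <- names(totals)[totals >= mincount]
filtered <- counts[, c("lineage", keep_samples), drop = FALSE]
print(filtered)
```

Here the all-zero Proteobacteria row is dropped first, and sampleB is then removed because its total (30) is below the threshold.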
The lineage text of the RDP Classifier looks like “Root;rootrank;Archaea;domain;Diapherotrites;phylum;Diapherotrites Incertae Sedis AR10;genus;”, so you can use lineage = "Archaea" to select the archaeal classifications or lineage = "genus" to select genus-level classifications.
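To see why these two values select what they do, the following sketch matches each value against a few lineage strings with grepl. It assumes that the lineage argument is treated as a pattern searched for in the lineage text; that is an assumption about the behavior, not a statement of how read_RDP is implemented. The second and third lineages are invented examples in the same format.

```r
lineages <- c(
  "Root;rootrank;Archaea;domain;Diapherotrites;phylum;Diapherotrites Incertae Sedis AR10;genus;",
  "Root;rootrank;Bacteria;domain;Firmicutes;phylum;",
  "Root;rootrank;Bacteria;domain;Firmicutes;phylum;Bacilli;class;Bacillales;order;Bacillaceae;family;Bacillus;genus;"
)

# Lineages containing the text "Archaea" (archaeal classifications)
is_archaea <- grepl("Archaea", lineages)

# Lineages containing the text "genus" (classified down to genus level)
is_genus <- grepl("genus", lineages)

print(data.frame(is_archaea, is_genus))
```

Because the rank names are part of the lineage text, a rank name such as "genus" works as a selector in the same way as a taxon name.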
Use the lowest.level argument to truncate the classifications to a level higher than genus.
This argument does not reduce the number of classifications, but only trims the RDP lineages to the specified level.
This may create duplicate lineages, for which the classification counts are summed, and only unique lineages are present in the returned data frame.
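The trimming and summing can be sketched as follows. This is a minimal illustration, not the package's code: truncate_lineage is a hypothetical helper, the sample names and counts are invented, and lineages that do not reach the requested level are left unchanged.

```r
# Hypothetical helper: trim "name;rank;name;rank;..." after the given rank
truncate_lineage <- function(lineage, level) {
  parts <- strsplit(lineage, ";", fixed = TRUE)[[1]]
  ranks <- parts[c(FALSE, TRUE)]   # rank names sit at even positions
  k <- match(level, ranks)
  if (is.na(k)) return(lineage)    # lineage is already above this level
  paste0(paste(parts[seq_len(2 * k)], collapse = ";"), ";")
}

counts <- data.frame(
  lineage = c(
    "Root;rootrank;Bacteria;domain;Firmicutes;phylum;Bacilli;class;",
    "Root;rootrank;Bacteria;domain;Firmicutes;phylum;Clostridia;class;",
    "Root;rootrank;Archaea;domain;Euryarchaeota;phylum;"
  ),
  sampleA = c(10, 5, 7),
  sampleB = c(0, 2, 1),
  stringsAsFactors = FALSE
)

# Trim every lineage to phylum level; the two Firmicutes rows become identical
counts$lineage <- vapply(counts$lineage, truncate_lineage, character(1),
                         level = "phylum", USE.NAMES = FALSE)

# Sum counts over duplicate lineages so each lineage appears once
summed <- aggregate(cbind(sampleA, sampleB) ~ lineage, data = counts, FUN = sum)
print(summed)
```

The total count per sample is unchanged by this step; only the number of distinct lineages decreases.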
Change quiet to TRUE to suppress printing of messages about percentage classification to genus level, omitted sequences, and final range of total counts among all samples.