read_RDP: Read and filter RDP Classifier output

Description

Reads RDP Classifier output and optionally extracts lineages, removes classifications below a certain count, or truncates classifications to lowest taxonomic level.

Usage

read_RDP(file, lineage = NULL, mincount = 200,
    lowest.level = NULL, drop.groups = FALSE, quiet = FALSE)

Value

Data frame with columns inherited from file (from RDP output files: taxid, rank, lineage, name, and one column of classification counts for each sample). If any sample has less than mincount total counts, that column is omitted from the result. The rows in the result are subset from those in file by filtering operations.

NB: taxids in RDP files are not NCBI taxids.

Arguments

file: character, file name
lineage: character, regular expression for filtering lineages
mincount: integer, samples with less than this number of RDP classifications are excluded
lowest.level: character, truncate classifications at this taxonomic level
drop.groups: logical, remove some taxonomic groups deemed to be uninformative?
quiet: logical, suppress printed messages?

Details

read_RDP reads and filters RDP results for all samples in a study. The input file should be created by RDP Classifer using the -h option to create a hierarchical listing. Classifications for multiple samples can be combined into a single file using the merge-count command of RDP Classifier.

Only rows (lineages) with count > 0 for at least one sample are retained.

If drop.groups is TRUE: Sequences with classifications at only the root or domain level (Root, Archaea, or Bacteria) are omitted because they provide poor taxonomic resolution. Sequences classified to the class Chloroplast or genera Chlorophyta or Bacillariophyta are also omitted because they have little correspondence with the NCBI taxonomy. These actions were hard-coded in earlier versions of chem16S (based on using the RDP Classifier taxonomy) but were made optional with adoption of the GTDB.

Then, only columns (samples) with classification count >= mincount are retained. All remaining sequences (those classified to genus or higher levels) can be used for mapping to the NCBI taxonomy.

The lineage text of the RDP Classifier looks like “Root;rootrank;Archaea;domain;Diapherotrites; phylum;Diapherotrites Incertae Sedis AR10;genus;”, so you can use lineage = "Archaea" to select the archaeal classifications or lineage = "genus" to select genus-level classifications.

Use the lowest.level argument to truncate the classifications to a level higher than genus. This argument does not reduce the number of classifications, but only trims the RDP lineages to the specified level. This may create duplicate lineages, for which the classification counts are summed, and only unique lineages are present in the returned data frame.

Change quiet to TRUE to suppress printing of messages about percentage classification to genus level, omitted sequences, and final range of total counts among all samples.

Examples

Run this code

# An example file
file <- system.file("extdata/RDP/SMS+12.tab.xz", package = "chem16S")

# Default settings
r0 <- read_RDP(file)

# Suppress messages
r1 <- read_RDP(file, quiet = TRUE)

# Increase minimum count threshold
r2 <- read_RDP(file, mincount = 500)
# This should be 3 (i.e., 3 samples have less than 500 counts)
ncol(r1) - ncol(r2)

# Keep only Archaea
r3 <- read_RDP(file, lineage = "Archaea")
# This should be TRUE
all(grepl("Archaea", r3$lineage))

# Truncate taxonomy at phylum
r4 <- read_RDP(file, lowest.level = "phylum")
# This should be TRUE
all(sapply(strsplit(r4$lineage, ";"), tail, 1) == "phylum")

Run the code above in your browser using DataLab