chem16S (version 1.0.0)

map_taxa: Map taxonomic names to RefSeq or GTDB taxonomy

Description

Maps taxonomic names to RefSeq (NCBI) or GTDB taxonomy by automatic matching of taxonomic names, with manual mappings for some groups.

Usage

map_taxa(taxacounts = NULL, refdb = "GTDB", quiet = FALSE)

Value

Integer vector with length equal to the number of rows of taxacounts. Values are row numbers in the data frame generated by reading taxon_AA.csv.xz, or NA for no matching taxon. The attributes unmapped_groups and unmapped_percent hold the input names of unmapped groups and their percentages of the total classification count.
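
As a rough illustration of how a return value of this shape can be read, the base R sketch below builds a small made-up mapping vector and derives the unmapped groups and their share of the total count. The vectors map, counts, and input_names are hypothetical stand-ins, and the percentage formula is an assumption about what the attributes represent, not the package's internal code.

```r
# Hypothetical mapping result: row numbers in a reference table, NA = no match
map <- c(2L, NA, 1L, NA)
# Hypothetical classification counts and input names for the same four groups
counts <- c(60, 25, 10, 5)
input_names <- c("genus_A", "genus_B", "genus_C", "genus_D")

# Groups with NA in the mapping are unmapped
unmapped <- is.na(map)
unmapped_groups <- input_names[unmapped]

# Assumed definition: each unmapped group's count as a percentage of all counts
unmapped_percent <- counts[unmapped] / sum(counts) * 100

print(unmapped_groups)
print(sum(unmapped_percent))
```

With a real result from map_taxa, the corresponding values would be read with attr(map, "unmapped_groups") and attr(map, "unmapped_percent").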

Arguments

taxacounts

data frame with taxonomic names and abundances

refdb

character, name of reference database (GTDB or RefSeq)

quiet

logical, suppress printed messages?

Details

This function maps taxonomic names to the RefSeq (NCBI) or GTDB taxonomy. taxacounts should be a data frame generated by either read_RDP or ps_taxacounts. Input names are made by combining the taxonomic rank and name with an underscore separator (e.g. genus_Escherichia/Shigella). Input names are then matched to the taxa listed in taxon_AA.csv.xz found under extdata/RefSeq or extdata/GTDB. The protein and organism columns in these files hold the rank and taxon name extracted from the RefSeq or GTDB database. Only exactly matching names are automatically mapped.
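
The exact-matching step described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the two data frames are made-up stand-ins, and the column names rank, name, and count in taxacounts are assumptions (the protein and organism columns follow the description of taxon_AA.csv.xz).

```r
# Hypothetical stand-in for a taxacounts data frame (column names are assumed)
taxacounts <- data.frame(
  rank = c("genus", "genus", "phylum"),
  name = c("Escherichia/Shigella", "Bacillus", "Cyanobacteria"),
  count = c(10, 5, 3)
)

# Hypothetical stand-in for the reference table:
# "protein" holds the rank and "organism" holds the taxon name
reference <- data.frame(
  protein = c("genus", "genus", "phylum"),
  organism = c("Bacillus", "Escherichia", "Cyanobacteria")
)

# Build rank_name strings on both sides
input_names <- paste(taxacounts$rank, taxacounts$name, sep = "_")
reference_names <- paste(reference$protein, reference$organism, sep = "_")

# Exact matching: row number in the reference table, or NA if there is no match
map <- match(input_names, reference_names)
print(map)
```

In this sketch genus_Escherichia/Shigella has no exact counterpart in the reference names, so it stays NA unless a manual mapping is applied first.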

For mapping to the RefSeq (NCBI) taxonomy, some group names are manually mapped as follows (see Dick and Tan, 2023):

Input (i.e., RDP) | RefSeq
------------------------------------------ | ------------------------------------------
genus_Escherichia/Shigella | genus_Escherichia
phylum_Cyanobacteria/Chloroplast | phylum_Cyanobacteria
genus_Marinimicrobia_genera_incertae_sedis | species_Candidatus Marinimicrobia bacterium
class_Cyanobacteria | phylum_Cyanobacteria
genus_Spartobacteria_genera_incertae_sedis | species_Spartobacteria bacterium LR76
class_Planctomycetacia | class_Planctomycetia
class_Actinobacteria | phylum_Actinobacteria
order_Rhizobiales | order_Hyphomicrobiales
genus_Gp1 | genus_Acidobacterium
genus_Gp6 | genus_Luteitalea
genus_GpI | genus_Nostoc
genus_GpIIa | genus_Synechococcus
genus_GpVI | genus_Pseudanabaena
family_Family II | family_Synechococcaceae
genus_Subdivision3_genera_incertae_sedis | family_Verrucomicrobia subdivision 3
order_Clostridiales | order_Eubacteriales
family_Ruminococcaceae | family_Oscillospiraceae
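
One way to picture how manual mappings like these could be applied before exact matching is a named character vector used as a lookup table. The sketch below uses three of the pairs listed above; the variable names manual, input_names, and mapped_names are hypothetical, and this is an illustration of the idea rather than the code used inside map_taxa.

```r
# Lookup table: names are input (RDP) names, values are RefSeq names
manual <- c(
  "genus_Escherichia/Shigella" = "genus_Escherichia",
  "order_Clostridiales" = "order_Eubacteriales",
  "family_Ruminococcaceae" = "family_Oscillospiraceae"
)

# Hypothetical input names; genus_Bacillus has no manual mapping
input_names <- c("genus_Escherichia/Shigella", "genus_Bacillus", "order_Clostridiales")

# Replace only the names that have a manual mapping and keep the rest unchanged
is_manual <- input_names %in% names(manual)
mapped_names <- input_names
mapped_names[is_manual] <- unname(manual[input_names[is_manual]])
print(mapped_names)
```

The rewritten names can then go through the same exact-matching step as any other input name.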

To avoid manual mapping, GTDB can be used for both taxonomic assignments and reference proteomes. Taxonomic assignments can be made using the RDP Classifier with this GTDB SSU training set (doi:10.5281/zenodo.7633100) or DADA2 with this GTDB training set (doi:10.5281/zenodo.6655692). Example files created using the RDP Classifier are provided under extdata/RDP-GTDB. An example dataset created with DADA2 is data(mouse.GTDB); this is a phyloseq-class object that can be processed with functions described at physeq.

Change quiet to TRUE to suppress printing of messages about manual mappings, most abundant unmapped groups, and overall percentage of mapped names.

References

Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85: 1338-1355. doi:10.1007/s00248-022-01988-9

Examples

# Mapping taxonomic classifications from RDP training set to NCBI taxonomy
file <- system.file("extdata/RDP/SMS+12.tab.xz", package = "chem16S")
RDP <- read_RDP(file)
map <- map_taxa(RDP, refdb = "RefSeq")
# About 24% of classifications are unmapped
sum(attributes(map)$unmapped_percent)

# Mapping from GTDB training set to GTDB taxonomy
file <- system.file("extdata/RDP-GTDB/SMS+12.tab.xz", package = "chem16S")
RDP.GTDB <- read_RDP(file)
map.GTDB <- map_taxa(RDP.GTDB)
# There is 100% mapping (zero unmapped classifications)
sum(attributes(map.GTDB)$unmapped_percent)
