contax.trim is a data.frame object containing 38 781 full-length 16S rRNA sequences. It is the
trimmed version of the full data set (see below). Large taxa (those with many sequences) have been
trimmed as described in Vinje et al. (2016) to obtain a data set with a more even representation of
the prokaryotic taxonomy.
contax.full is the full consensus taxonomy data set as described in Vinje et al. (2016). The data
set is too large for CRAN and is therefore available in a separate package, microcontax.data. See
the example below for how to obtain contax.full.
The Header of every sequence starts with a unique tag, in this case the text "ConTax" followed by an
integer. This is followed by a token describing the origin of the sequence. It is typically
"Intersection=SRG", meaning the sequence is found in all three of the Silva, RDP and Greengenes data
repositories. The intersection can also be SR, SG or RG if the sequence was found in only two of the
repositories. The taxonomy information for each sequence is found in the third token. It follows a
commonly used format:
"k__<...>;p__<...>;c__<...>;o__<...>;f__<...>;g__<...>;"
where <...> is some proper text. The letters, followed by a double underscore, refer to the taxonomic levels
Domain (Kingdom), Phylum, Class, Order, Family and Genus.
Here is an example of a proper string:
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;"
As long as this format is used, the taxonomy information can be extracted by the supplied
extractor functions getDomain, getPhylum, ..., getGenus.
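
To illustrate the format itself, the following is a minimal base R sketch of how one level can be pulled out of such a string with a regular expression. It is not the implementation behind getDomain, ..., getGenus; the helper name extract_level and the variable names are hypothetical and used for illustration only.

```r
# Hypothetical helper, NOT part of the package: return the text that follows
# a level prefix such as "g__", up to the next semicolon.
extract_level <- function(header, prefix) {
  # The prefix must sit at the start of the string or right after a
  # semicolon or whitespace, so that it is not matched inside a taxon name.
  pattern <- paste0("(^|[;[:space:]])", prefix, "__([^;]*);")
  hit <- regmatches(header, regexec(pattern, header))
  # Each element holds the full match, then the two captured groups;
  # a header without the requested level gives NA.
  vapply(hit,
         function(m) if (length(m) >= 3) m[3] else NA_character_,
         character(1))
}

tax <- "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;"

# One-letter prefixes for the six taxonomic levels described above.
prefixes <- c(Domain = "k", Phylum = "p", Class = "c",
              Order = "o", Family = "f", Genus = "g")

tax_levels <- vapply(prefixes, function(p) extract_level(tax, p), character(1))
print(tax_levels)
```

The helper is vectorised over header, so it could in principle be applied to a whole column of Header texts at once. For real work the supplied extractor functions are the better choice.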