Training the RDP presence/absence K-mer method on sequence data.
rdpTrain(sequence, taxon, K = 8, cnames = FALSE)
sequence: Character vector of 16S sequences.
taxon: Character vector of taxon labels for each sequence.
K: Word length (integer).
cnames: Logical indicating if column names should be added to the trained model matrix.
A list with two elements. The first element is Method, which is the text "RDPclassifier" in this case. The second element is Fitted, which is a matrix with one row for each unique taxon and one column for each possible word of length K. The value in row i and column j is the probability that word j is present in taxon i.
The training step of the RDP method scans all sequences for K-mers and computes, for each unique taxon, the probability of each K-mer being present. This is an attempt to re-implement the method described by Wang et al. (2007), but without the bootstrapping. See that publication for all details.
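The idea of the training step can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper names all_words, kmer_presence and toy_train are hypothetical, K is set to 2 to keep the matrix small, and the probabilities are raw per-taxon proportions, whereas the actual method may apply smoothing as described by Wang et al. (2007).

```r
# Hypothetical helper: all 4^K DNA words of length K.
all_words <- function(K) {
  grid <- expand.grid(rep(list(c("A", "C", "G", "T")), K),
                      stringsAsFactors = FALSE)
  do.call(paste0, grid)
}

# Hypothetical helper: for one sequence, TRUE/FALSE for each word
# depending on whether the word occurs at least once.
kmer_presence <- function(sequence, K, words) {
  n <- nchar(sequence) - K + 1
  found <- if (n > 0) substring(sequence, 1:n, K:(K + n - 1)) else character(0)
  words %in% found
}

# Toy training: one row per unique taxon, one column per word.
# Entry [i, j] is the fraction of sequences in taxon i containing word j
# (an unsmoothed estimate, assumed here for illustration only).
toy_train <- function(sequence, taxon, K = 2) {
  words <- all_words(K)
  present <- t(vapply(sequence, kmer_presence, logical(length(words)),
                      K = K, words = words, USE.NAMES = FALSE))
  taxa <- sort(unique(taxon))
  fitted <- t(vapply(taxa,
                     function(tx) colMeans(present[taxon == tx, , drop = FALSE]),
                     numeric(length(words))))
  dimnames(fitted) <- list(taxa, words)
  fitted
}

# Three made-up sequences in two made-up taxa.
fit <- toy_train(c("ACGT", "ACGA", "TTTT"), c("a", "a", "b"), K = 2)
fit["a", c("AC", "GT")]
```

Both sequences in taxon "a" contain the word AC, while only one of them contains GT, so the two estimates differ accordingly.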
The word length K is by default 8, since this is the value used by Wang et al. Larger values may lead to memory problems, since the trained model is a matrix with 4^K columns. Adding the K-mers as column names will slow down all computations.
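To get a feeling for how the memory requirement grows with K, the size of the model matrix can be estimated by simple arithmetic. The sketch below assumes the matrix is stored as ordinary double-precision numbers (8 bytes per entry); the function name model_size_mb and the figure of 1000 taxa are hypothetical and used only for illustration.

```r
# Hypothetical helper: approximate size in MB of a numeric matrix
# with n_taxa rows and 4^K columns, at 8 bytes per entry.
model_size_mb <- function(n_taxa, K) {
  n_taxa * 4^K * 8 / 2^20
}

size_k8  <- model_size_mb(1000, 8)   # 500 MB
size_k10 <- model_size_mb(1000, 10)  # 8000 MB
```

Each increment of K by one multiplies the number of columns, and hence the memory use, by four.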
The relative taxon sizes are also computed, and returned as an attribute to the model matrix. They may be used as empirical priors in the classification step.
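As a rough illustration of what relative taxon sizes mean, they can be computed from the label vector alone. The labels below are made up, and the sketch does not show how the package stores or names the attribute.

```r
# Hypothetical taxon labels, one per training sequence.
taxon <- c("Bacillus", "Bacillus", "Bacillus", "Listeria")

# Relative taxon sizes: the fraction of training sequences in each taxon.
taxon_counts <- table(taxon)
prior <- as.numeric(taxon_counts) / sum(taxon_counts)
names(prior) <- names(taxon_counts)
prior
```

Used as empirical priors, these values favour taxa that are well represented in the training data.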
Wang, Q, Garrity, GM, Tiedje, JM, Cole, JR (2007). Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73: 5261-5267.
# NOT RUN {
# See examples for rdpClassify.
# }