rdpTrain: Training the RDP classifier

Description

Training the RDP presence/absence K-mer method on sequence data.

Usage

rdpTrain(sequence, taxon, K = 8, cnames = FALSE)

Arguments

sequence

Character vector of 16S sequences.

taxon

Character vector of taxon labels for each sequence.

K

Word length (integer).

cnames

Logical indicating if column names should be added to the trained model matrix.

Value

A list with two elements. The first element is Method, which is the text "RDPclassifier" in this case. The second element is Fitted, which is a matrix with one row for each unique taxon and one column for each possible word of length K. The value in row i and column j is the probability that word j is present in taxon i.

Details

The training step of the RDP method scans all sequences for K-mers and computes, for each unique taxon, the probability of each K-mer being present. This is an attempt to re-implement the method described by Wang et al. (2007), but without the bootstrapping. See that publication for all details.
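To make the presence/absence idea concrete, here is a minimal base R sketch on toy data. The helper names `all_words` and `train_presence` are hypothetical, the estimates are raw proportions with no smoothing, and nothing here is taken from the internals of rdpTrain; it only illustrates the shape of the trained matrix described under Value.

```r
# Toy sketch in base R; it does not reproduce the internals of rdpTrain().

# Hypothetical helper: all 4^K DNA words in lexicographic order.
all_words <- function(K) {
  grid <- expand.grid(rep(list(c("A", "C", "G", "T")), K),
                      stringsAsFactors = FALSE)
  do.call(paste0, unname(as.list(rev(grid))))
}

# Hypothetical helper: per-taxon probability that each word is present.
# Assumes every sequence has at least K characters.
train_presence <- function(sequence, taxon, K) {
  words <- all_words(K)
  # Logical matrix: one row per sequence, one column per word.
  present <- t(vapply(sequence, function(s) {
    starts <- seq_len(nchar(s) - K + 1)
    words %in% substring(s, starts, starts + K - 1)
  }, logical(length(words)), USE.NAMES = FALSE))
  taxa <- sort(unique(taxon))
  # Fraction of each taxon's sequences that contain each word.
  fitted <- t(vapply(taxa, function(tx) {
    colMeans(present[taxon == tx, , drop = FALSE])
  }, numeric(length(words))))
  dimnames(fitted) <- list(taxa, words)
  fitted
}

# Made-up sequences and labels, with K = 2 to keep the matrix small.
toy_fit <- train_presence(sequence = c("ACGT", "ACGG", "TTTT"),
                          taxon = c("taxonA", "taxonA", "taxonB"),
                          K = 2)
toy_fit[, c("AC", "GT", "TT")]
```

The result has one row per unique taxon and 4^2 = 16 columns. Word "GT" occurs in one of the two taxonA sequences, so its entry in that row is 0.5.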

The word length K is 8 by default, since this is the value used by Wang et al. Larger values may lead to memory problems, since the trained model is a matrix with 4^K columns (65,536 for K = 8). Adding the K-mers as column names will slow down all computations.
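A rough way to judge whether a given K is feasible is to estimate the size of the model matrix before training. The sketch below assumes the matrix is stored as a dense numeric (double precision, 8 bytes per entry) matrix and uses a hypothetical count of 1000 taxa; both are illustrative assumptions, not statements about the package.

```r
# Back-of-the-envelope memory estimate for the trained model matrix.
# Assumptions: dense double-precision storage (8 bytes per entry) and
# a hypothetical training set with 1000 unique taxa.
n_taxa <- 1000
K <- 6:12
n_columns <- 4^K                      # one column per possible word
size_gib <- n_taxa * n_columns * 8 / 2^30
memory_table <- data.frame(K = K,
                           columns = n_columns,
                           GiB = round(size_gib, 2))
memory_table
```

Each increment of K multiplies the number of columns, and hence the estimated memory, by four.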

The relative taxon sizes are also computed, and returned as an attribute to the model matrix. They may be used as empirical priors in the classification step.
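The relative taxon sizes are simply the fraction of training sequences carrying each label. A minimal base R sketch with made-up labels follows; the name of the attribute under which rdpTrain stores these values is not stated here, so the sketch only computes them.

```r
# Relative taxon sizes from a vector of made-up taxon labels.
toy_taxon <- c("taxonA", "taxonA", "taxonA", "taxonB")
taxon_prior <- table(toy_taxon) / length(toy_taxon)
taxon_prior
```

These proportions sum to 1, which is what lets them serve as empirical priors.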

References

Wang, Q, Garrity, GM, Tiedje, JM, Cole, JR (2007). Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Applied and Environmental Microbiology, 73: 5261-5267.

Examples


# NOT RUN {
# See examples for rdpClassify.

# }
