multinomTrain: Training multinomial model

Description

Training the multinomial K-mer method on sequence data.

Usage

multinomTrain(sequence, taxon, K = 8, col.names = FALSE, n.pseudo = 100)

Arguments

sequence

Character vector of 16S sequences.

taxon

Character vector of taxon labels for each sequence.

K

Word length (integer).

col.names

Logical indicating if column names should be added to the trained model matrix.

n.pseudo

Number of pseudo-counts to use (a positive number, need not be an integer). The special case -1 returns only word counts, not log-probabilities.

Value

A list with two elements. The first element is Method, which is the text "multinom" in this case. The second element is Fitted, which is a matrix of probabilities with one row for each unique taxon and one column for each possible word of length K. The sum of each row is 1.0. No probabilities are 0 if n.pseudo > 0.

The matrix Fitted has an attribute named "prior", retrieved with attr(Fitted, "prior"), that contains the relative taxon sizes.

Details

The training step of the multinomial method (Vinje et al., 2015) consists of counting K-mers in all sequences and computing the multinomial probability of each K-mer for each unique taxon. A total of n.pseudo pseudo-counts is added, divided equally over all K-mers, before the probabilities are estimated. The optimal choice of n.pseudo depends on K and the training data set. The default value n.pseudo=100 has worked well for K=8 and the contax.trim data set (see the microcontax R-package).
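The counting and smoothing described above can be illustrated with a small base-R sketch. This is not the package's implementation, which may differ in word ordering, handling of ambiguous symbols, and output scale. The function name kmer_multinom_sketch is hypothetical, taxon is assumed to be a character vector, and small values of K and n.pseudo are used only to keep the example readable.

```r
# Hypothetical helper, NOT part of the package: counts overlapping K-mers
# per taxon and turns them into smoothed multinomial probabilities.
kmer_multinom_sketch <- function(sequence, taxon, K = 2, n.pseudo = 1) {
  alphabet <- c("A", "C", "G", "T")
  # Enumerate all 4^K possible words (the ordering here is arbitrary)
  grid <- expand.grid(rep(list(alphabet), K), stringsAsFactors = FALSE)
  words <- apply(grid, 1, paste, collapse = "")
  taxa <- sort(unique(taxon))
  counts <- matrix(0, nrow = length(taxa), ncol = length(words),
                   dimnames = list(taxa, words))
  for (i in seq_along(sequence)) {
    s <- toupper(sequence[i])
    n <- nchar(s)
    if (n < K) next
    starts <- seq_len(n - K + 1)
    kmers <- substring(s, starts, starts + K - 1)
    # Assumption: words containing non-ACGT symbols are simply skipped
    kmers <- kmers[kmers %in% words]
    if (length(kmers) == 0) next
    tab <- table(kmers)
    counts[taxon[i], names(tab)] <- counts[taxon[i], names(tab)] + as.numeric(tab)
  }
  # Divide n.pseudo equally over all words, then normalise each row to sum to 1
  smoothed <- counts + n.pseudo / length(words)
  smoothed / rowSums(smoothed)
}

# Toy data: two short sequences, one per taxon
P <- kmer_multinom_sketch(c("ACGT", "AAAA"), c("a", "b"), K = 2, n.pseudo = 1)
rowSums(P)
```

With n.pseudo > 0 every word receives a small positive probability, even if it never occurs in the training sequences of a taxon.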

Adding the actual K-mers as column names (col.names=TRUE) will slow down the computations.

The relative taxon sizes are also computed, and may be used as an empirical prior in the classification step (see the "prior" attribute described under Value).
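As a rough illustration of an empirical prior, the sketch below computes relative taxon sizes from a vector of labels. It assumes that the "size" of a taxon means its number of training sequences, which is not stated explicitly here; the labels are made up for the example.

```r
# Toy labels, one per training sequence (illustrative only)
taxon <- c("Bacillus", "Bacillus", "Escherichia", "Bacillus")

# Assumption: relative taxon size = share of training sequences per taxon
tab <- table(taxon)
prior <- setNames(as.numeric(tab) / sum(tab), names(tab))
prior
```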

References

Vinje, H, Liland, KH, Almøy, T, Snipen, L. (2015). Comparing K-mer based methods for improved classification of 16S sequences. BMC Bioinformatics, 16:205.

Examples

# NOT RUN {
# See examples for multinomClassify

# }