Training the multinomial K-mer method on sequence data.
multinomTrain(sequence, taxon, K = 8, col.names = FALSE, n.pseudo = 100)
sequence: Character vector of 16S sequences.
taxon: Character vector of taxon labels, one for each sequence.
K: Word length (integer).
col.names: Logical indicating whether column names should be added to the trained model matrix.
n.pseudo: Number of pseudo-counts to use (a positive number, need not be an integer). The special case -1 returns only word counts, not log-probabilities.
A list with two elements. The first element is Method, which is the text "multinom" in this case. The second element is Fitted, which is a matrix of probabilities with one row for each unique taxon and one column for each possible word of length K. The sum of each row is 1.0. No probabilities are 0 if n.pseudo > 0.0.
The matrix Fitted has an attribute "prior" that contains the relative taxon sizes.
The training step of the multinomial method (Vinje et al., 2015) consists of counting K-mers in all sequences and computing the multinomial probability of each K-mer for each unique taxon. A total of n.pseudo pseudo-counts are added, divided equally over all K-mers, before the probabilities are estimated. The optimal choice of n.pseudo will depend on K and the training data set. The default value n.pseudo=100 has proven good for K=8 and the contax.trim data set (see the microcontax R-package).
Adding the actual K-mers as column names (col.names=TRUE) will slow down the computations.
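The counting and smoothing described above can be illustrated with a small base-R sketch. This is not the package's implementation: the helper name train_sketch is hypothetical, and the choices to count overlapping K-mers and to skip words containing ambiguous bases are assumptions made only for the illustration. With n.pseudo pseudo-counts spread over the 4^K possible words, each word in a taxon gets n.pseudo / 4^K added to its count before the row is normalised.

```r
# Hypothetical helper for illustration only; not part of any package.
train_sketch <- function(sequence, taxon, K = 2, n.pseudo = 1) {
  taxon <- as.character(taxon)
  # All 4^K possible words over the DNA alphabet
  grid <- expand.grid(rep(list(c("A", "C", "G", "T")), K),
                      stringsAsFactors = FALSE)
  words <- sort(apply(grid, 1, paste, collapse = ""))
  taxa <- sort(unique(taxon))
  counts <- matrix(0, nrow = length(taxa), ncol = length(words),
                   dimnames = list(taxa, words))
  for (i in seq_along(sequence)) {
    n <- nchar(sequence[i]) - K + 1
    if (n < 1) next                      # sequence shorter than K
    starts <- seq_len(n)
    # Overlapping K-mers (an assumption of this sketch)
    kmers <- substring(sequence[i], starts, starts + K - 1)
    # Words outside the A/C/G/T alphabet become NA and are not counted
    tab <- table(factor(kmers, levels = words))
    counts[taxon[i], ] <- counts[taxon[i], ] + as.vector(tab)
  }
  # Spread n.pseudo equally over all words, then normalise each row
  smoothed <- counts + n.pseudo / length(words)
  smoothed / rowSums(smoothed)
}

seqs <- c("ACGTAC", "ACGGAC", "TTTTGA")
taxa <- c("A", "A", "B")
fitted <- train_sketch(seqs, taxa, K = 2, n.pseudo = 1)
```

In this toy input, taxon B has five 2-mers, three of which are TT, so its smoothed probability for TT is (3 + 1/16) / (5 + 1). Every row sums to 1, and no entry is 0 because n.pseudo is positive.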
The relative taxon sizes are also computed, and may be used as an empirical prior in the classification step (see "prior" below).
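As a rough illustration of what "relative taxon sizes" means, the sketch below computes them from a label vector and attaches them to a placeholder matrix as an attribute named "prior". The taxon labels are made-up example data, and the exact form in which the package stores the prior (named or unnamed, ordering) is an assumption here.

```r
# Made-up labels for illustration
taxon <- c("Bacillus", "Bacillus", "Escherichia", "Bacillus")

# Relative taxon sizes: fraction of training sequences in each taxon
tab <- table(taxon)
prior <- as.vector(tab) / sum(tab)
names(prior) <- names(tab)

# Placeholder for a trained matrix, one row per taxon (assumed layout)
fitted <- matrix(0, nrow = length(prior), ncol = 1,
                 dimnames = list(names(prior), NULL))
attr(fitted, "prior") <- prior

# Reading the attribute back with base R
stored <- attr(fitted, "prior")
```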
Vinje, H, Liland, KH, Almøy, T, Snipen, L. (2015). Comparing K-mer based methods for improved classification of 16S sequences. BMC Bioinformatics, 16:205.
# NOT RUN {
# See examples for multinomClassify
# }