Read document-term matrices stored in special file formats.
read_dtm_Blei_et_al(file, vocab = NULL)
read_dtm_MC(file, scalingtype = NULL)
A document-term matrix.
file: a character string with the name of the file to read.
vocab: a character string with the name of a vocabulary file (giving the terms, one per line), or NULL.
scalingtype: a character string specifying the type of scaling to be used, or NULL (default), in which case the scaling will be inferred from the names of the files with non-zero entries found (see Details).
read_dtm_Blei_et_al reads the (List of Lists type sparse matrix) format employed by the Latent Dirichlet Allocation and Correlated Topic Model C codes by Blei et al. (http://www.cs.columbia.edu/~blei/).
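As a rough illustration of this format, the sketch below assumes the LDA-C input convention: one document per line, starting with the number of distinct terms, followed by term:count pairs with 0-based term indices. The helper parse_ldac is hypothetical and is not how read_dtm_Blei_et_al is implemented; it builds a dense matrix only to keep the example short.

```r
# Write a tiny two-document corpus in the assumed LDA-C layout:
# "<number of distinct terms> <term index>:<count> ..." (0-based indices).
f <- tempfile(fileext = ".dat")
writeLines(c("2 0:1 2:3", "1 1:2"), f)

# Hypothetical helper (not part of tm): parse the file into a dense
# documents-by-terms count matrix using base R only.
parse_ldac <- function(file) {
  lines <- readLines(file)
  docs <- lapply(strsplit(lines, " ", fixed = TRUE), function(tok) {
    # Drop the leading count, then split each "index:count" pair.
    pairs <- strsplit(tok[-1], ":", fixed = TRUE)
    list(j = as.integer(vapply(pairs, `[`, "", 1)) + 1L,  # shift to 1-based
         v = as.integer(vapply(pairs, `[`, "", 2)))
  })
  nterms <- max(unlist(lapply(docs, `[[`, "j")))
  m <- matrix(0L, nrow = length(docs), ncol = nterms)
  for (i in seq_along(docs)) m[i, docs[[i]]$j] <- docs[[i]]$v
  m
}

m <- parse_ldac(f)
```

With tm installed, the same file could instead be passed as the file argument of read_dtm_Blei_et_al, which returns a document-term matrix.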
MC is a toolkit for creating vector models from text documents (see https://www.cs.utexas.edu/~dml/software/mc/). It employs a variant of Compressed Column Storage (CCS) sparse matrix format, writing data into several files with suitable names: e.g., a file with _dim appended to the base file name stores the matrix dimensions. The non-zero entries are stored in a file the name of which indicates the scaling type used: e.g., _tfx_nz indicates scaling by term frequency (t), inverse document frequency (f) and no normalization (x). See README in the MC sources for more information.
read_dtm_MC reads such sparse matrix information with argument file giving the path with the base file name.
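The way a scaling type might be inferred from the file names can be sketched as follows. This is a minimal illustration under the naming scheme described above (base name, scaling code, then _nz); the helper infer_scaling and the base name "corpus" are hypothetical, and the actual logic used by read_dtm_MC may differ.

```r
# Hypothetical helper (not part of tm): list the files next to the base
# file name and extract the scaling code from names like "<base>_tfx_nz".
infer_scaling <- function(base) {
  prefix <- paste0(basename(base), "_")
  files <- list.files(dirname(base))
  hits <- files[startsWith(files, prefix) & endsWith(files, "_nz")]
  # Keep only the part between "<base>_" and "_nz".
  substr(hits, nchar(prefix) + 1L, nchar(hits) - nchar("_nz"))
}

# Create empty placeholder files that mimic MC output names.
d <- tempfile("mc")
dir.create(d)
base <- file.path(d, "corpus")
invisible(file.create(paste0(base, c("_dim", "_tfx_nz"))))

scaling <- infer_scaling(base)
```

Here scaling holds the code "tfx", which is the kind of value one would otherwise pass explicitly as scalingtype.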
read_stm_MC in package slam.