Constructs or coerces to a term-document matrix or a document-term matrix.
TermDocumentMatrix(x, control = list())
DocumentTermMatrix(x, control = list())
as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)
An object of class TermDocumentMatrix or class DocumentTermMatrix (both inheriting from a simple triplet matrix in package slam) containing a sparse term-document matrix or document-term matrix. The attribute weighting contains the weighting applied to the matrix.
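As a minimal sketch of the returned object, the following builds both matrix types from a toy corpus and inspects the class and the weighting attribute. The three documents are made-up illustration data, and VCorpus is used here only as one convenient way to obtain a corpus.

```r
library(tm)

# Hypothetical toy documents, for illustration only
docs <- c("oil prices rise", "oil output falls", "prices fall as output rises")
corp <- VCorpus(VectorSource(docs))

tdm <- TermDocumentMatrix(corp)   # terms in rows, documents in columns
dtm <- DocumentTermMatrix(corp)   # documents in rows, terms in columns

class(tdm)                 # class vector includes "simple_triplet_matrix"
attr(tdm, "weighting")     # weighting applied to the matrix (default: term frequency)
dim(tdm)                   # number of terms, number of documents
dim(dtm)                   # the same dimensions, transposed
```

Because both classes inherit from a simple triplet matrix, only the non-zero entries are stored; as.matrix() can be used to obtain a dense matrix when the data are small.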
x
for the constructors, a corpus or an R object from which a corpus can be generated via Corpus(VectorSource(x)); for the coercing functions, either a term-document matrix or a document-term matrix or a simple triplet matrix (package slam) or a term frequency vector.
control
a named list of control options. There are local options which are evaluated for each document and global options which are evaluated once for the constructed matrix. Available local options are documented in termFreq and are internally delegated to a termFreq call.
This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost (https://www.boost.org) Tokenizer (via Rcpp) and takes no custom functions as option arguments.
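The delegation of local options to termFreq can be sketched as follows. The documents and the particular option values are assumptions chosen for illustration; the same control list is passed once to the matrix constructor and once directly to termFreq for a single document.

```r
library(tm)

# Hypothetical toy documents, for illustration only
docs <- c("Oil prices rise; oil output falls.",
          "The prices of crude oil are rising!")
corp <- VCorpus(VectorSource(docs))

# Local options: evaluated for each document
ctrl <- list(removePunctuation = TRUE,
             stopwords = TRUE,
             wordLengths = c(4, Inf))   # keep terms with at least 4 characters

tdm <- TermDocumentMatrix(corp, control = ctrl)
Terms(tdm)

# The same options applied to a single document via termFreq
tf <- termFreq(corp[[1]], control = ctrl)
names(tf)
```

The terms found by termFreq for the first document are a subset of the terms of the full matrix, since the matrix collects the per-document results.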
Available global options are:
bounds
A list with a tag global whose value must be an integer vector of length 2. Terms that appear in fewer documents than the lower bound bounds$global[1] or in more documents than the upper bound bounds$global[2] are discarded. Defaults to list(global = c(1, Inf)) (i.e., every term will be used).
weighting
A weighting function capable of handling a TermDocumentMatrix. It defaults to weightTf for term frequency weighting. Available weighting functions shipped with the tm package are weightTf, weightTfIdf, weightBin, and weightSMART.
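The two global options above can be sketched on a toy corpus. The documents are made-up illustration data; the first one repeats "oil" so that the effect of binary weighting is visible.

```r
library(tm)

# Hypothetical toy documents, for illustration only
docs <- c("oil oil prices rise", "oil output falls", "oil prices fall")
corp <- VCorpus(VectorSource(docs))

# bounds: keep only terms that occur in at least 2 documents
tdm_bounds <- TermDocumentMatrix(
  corp, control = list(bounds = list(global = c(2, Inf))))
Terms(tdm_bounds)

# weighting: default term frequency versus binary weighting
tdm_tf  <- TermDocumentMatrix(corp)
tdm_bin <- TermDocumentMatrix(corp, control = list(weighting = weightBin))

as.matrix(tdm_tf)["oil", 1]    # raw count in the first document
as.matrix(tdm_bin)["oil", 1]   # presence indicator in the first document
```

Note that the bounds refer to the number of documents containing a term, not to the total number of occurrences, so the repeated "oil" in the first document counts once.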
...
the additional argument weighting (typically a WeightFunction) is allowed when coercing a simple triplet matrix to a term-document or document-term matrix.
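Coercion from a simple triplet matrix might look like the sketch below. The count matrix, its term names, and its document names are hypothetical; weightTf is passed as the weighting argument to mark the entries as plain term frequencies.

```r
library(tm)
library(slam)

# Hypothetical counts: rows are terms, columns are documents
counts <- matrix(c(2, 0, 1,
                   0, 3, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("oil", "price"), c("d1", "d2", "d3")))

stm <- as.simple_triplet_matrix(counts)

# Coerce to a term-document matrix, declaring term frequency weighting
tdm <- as.TermDocumentMatrix(stm, weighting = weightTf)

# A term-document matrix can in turn be coerced to a document-term matrix
dtm <- as.DocumentTermMatrix(tdm)

Terms(tdm)
Docs(tdm)
dim(dtm)
```

This route is useful when counts were computed outside of tm and only the matrix class and its methods (inspect, findFreqTerms, and so on) are needed.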
termFreq for available local control options.
library("tm")
data("crude")
# Term-document matrix with punctuation and stopwords removed
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
# Document-term matrix weighted by unnormalized tf-idf
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                         function(x)
                                         weightTfIdf(x, normalize =
                                                     FALSE),
                                         stopwords = TRUE))
# Subset by position or by term and document names
inspect(tdm[202:205, 1:5])
inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")])
inspect(dtm[1:5, 273:276])
# Stemming requires the SnowballC package
if(requireNamespace("SnowballC")) {
    s <- SimpleCorpus(VectorSource(unlist(lapply(crude, as.character))))
    m <- TermDocumentMatrix(s,
                            control = list(removeNumbers = TRUE,
                                           stopwords = TRUE,
                                           stemming = TRUE))
    inspect(m[c("price", "texa"), c("127", "144", "191", "194")])
}