Compute a simple correspondence analysis on the document-term matrix of a tm corpus.
runCorpusCa(corpus, dtm = NULL, variables = NULL, sparsity = 0.9, ...)
A tm corpus.
An optional document-term matrix to use; if missing, DocumentTermMatrix will be called on corpus to create it.
A character vector giving the names of the meta-data variables by which to aggregate the document-term matrix (see “Details” below).
Optional sparsity threshold (between 0 and 1) above which terms should be skipped. See removeSparseTerms from tm.
Additional parameters passed to ca.
A ca object as returned by the ca function.
The function runCorpusCa runs a correspondence analysis (CA) on the
document-term matrix that can be extracted from a tm corpus by calling
the DocumentTermMatrix function, or directly from the dtm object if present.
If no variable is passed via the variables argument, a CA is run on the
full document-term matrix (possibly skipping sparse terms, see below). If one or more
variables are chosen, the CA is based on a stacked table whose rows correspond to
the levels of the variables: each cell contains the sum of occurrences of a given term in
all the documents of the level. Documents with an NA value for a variable are skipped
for that variable, but taken into account for the others, if any.
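The stacking described above can be sketched in base R. This is only an illustration under stated assumptions, not the code runCorpusCa uses: the toy matrix, the meta data frame and the helper aggregateByVariable are all hypothetical.

```r
# Toy document-term matrix: 5 documents (rows) by 3 terms (columns).
dtm <- matrix(c(2, 0, 1,
                0, 3, 0,
                1, 1, 0,
                0, 0, 4,
                1, 2, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("doc", 1:5), c("tax", "vote", "war")))

# Hypothetical meta-data: one row per document, NA where a value is unknown.
meta <- data.frame(party = c("A", "B", "A", NA, "B"),
                   year  = c("2001", "2001", "2002", "2002", "2002"),
                   stringsAsFactors = FALSE)

# Sum term counts over the documents of each level of one variable,
# dropping documents whose value is NA for that variable only.
aggregateByVariable <- function(dtm, meta, variable) {
  keep <- !is.na(meta[[variable]])
  agg <- rowsum(dtm[keep, , drop = FALSE], group = meta[[variable]][keep])
  rownames(agg) <- paste(variable, rownames(agg))
  agg
}

# Stack the per-variable tables on top of each other.
stacked <- do.call(rbind, lapply(c("party", "year"), aggregateByVariable,
                                 dtm = dtm, meta = meta))
print(stacked)
```

Here doc4 has no party, so it is left out of the party rows but still counted in the "year 2002" row.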
In all cases, variables that have not been selected are added as supplementary rows. If at least one variable is passed, documents are also supplementary rows, while they are active otherwise.
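As a rough illustration of what supplementary rows mean, the sketch below passes a toy table to ca with its suprow argument, so that the last two rows are projected onto the axes without contributing to them. The table and its row names are invented for the example, and the way runCorpusCa itself builds and labels supplementary rows may differ.

```r
# Toy stacked table: 4 active rows (levels of a selected variable),
# followed by 2 rows meant to be supplementary (e.g. individual documents).
tab <- matrix(c(5, 1, 2, 1,
                1, 6, 1, 2,
                2, 1, 7, 1,
                1, 2, 1, 4,
                3, 0, 1, 0,
                0, 2, 0, 1),
              nrow = 6, byrow = TRUE,
              dimnames = list(c("party A", "party B", "party C", "party D",
                                "doc1", "doc2"),
                              c("tax", "vote", "war", "peace")))

res <- NULL
# Only run the analysis if the ca package is installed.
if (requireNamespace("ca", quietly = TRUE)) {
  # Rows 5 and 6 are treated as supplementary rows.
  res <- ca::ca(tab, suprow = 5:6)
  print(res)
}
```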
The sparsity argument is passed to removeSparseTerms
to remove less significant terms from the document-term matrix. This is
especially useful for big corpora, whose matrices can grow very large, causing
ca to use too much memory.
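The effect of a sparsity threshold can be sketched in base R, assuming that a term's sparsity is the share of documents in which it does not occur. The toy matrix and the threshold of 0.7 are invented, and this is an approximation of the idea rather than the implementation of removeSparseTerms (behaviour exactly at the threshold may differ).

```r
# Toy document-term matrix: 5 documents (rows) by 3 terms (columns).
dtm <- matrix(c(1, 2, 0,
                3, 0, 0,
                1, 1, 0,
                2, 0, 5,
                1, 0, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("doc", 1:5), c("the", "vote", "zeppelin")))

# Share of documents in which each term is absent.
termSparsity <- colMeans(dtm == 0)
print(termSparsity)

# Keep only the terms whose sparsity is below the threshold.
threshold <- 0.7
reduced <- dtm[, termSparsity < threshold, drop = FALSE]
print(colnames(reduced))
```

With these counts, "zeppelin" appears in a single document out of five, so it is dropped, while "the" and "vote" are kept.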