A data.frame with the following named columns, in this order: a document "id" column, a "date" column, a "text" column
(i.e., the column in which all texts to analyze reside), and a series of feature columns of type numeric, with values
indicating how applicable a particular feature is to a particular text. The latter columns are often binary (1 means the
feature is applicable to the document in the same row) or expressed as a percentage specifying the degree of
connectedness of a feature to a document. Features could be topics (e.g., legal,
political, or economic), but also article sources (e.g., online or printed press), amongst many more options. If you have no
knowledge about features or no particular features are of interest to your analysis, provide no feature columns. In that
case, the corpus constructor automatically adds a feature column named "dummy". Provide the date column in
"yyyy-mm-dd" format. The id column should be in character mode. All spaces in the names of the features are replaced
by underscores.
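
As an illustration of the layout described here, the following is a minimal sketch in base R of an input data.frame and a few sanity checks on it. The object name corpusdf, the document contents, and the feature names "economy" and "online press" are all hypothetical. The second feature is written as a fraction between 0 and 1, which is an assumption about how a percentage would be encoded. The corpus constructor itself is not called; the last step only mimics the described replacement of spaces by underscores.

```r
# Hypothetical input: id, date, and text columns first, then numeric features.
corpusdf <- data.frame(
  id = c("doc1", "doc2", "doc3"),
  date = c("2020-01-15", "2020-02-03", "2020-02-20"),
  text = c("Markets rallied today.",
           "The court issued its ruling.",
           "Inflation remains a concern."),
  economy = c(1, 0, 1),                # binary feature: 1 means applicable
  "online press" = c(0.8, 0.2, 0),     # degree of connectedness (assumed 0-1 scale)
  check.names = FALSE,                 # keep the space in the feature name
  stringsAsFactors = FALSE             # keep id and text as character
)

# Checks that mirror the requirements stated in the description.
stopifnot(identical(names(corpusdf)[1:3], c("id", "date", "text")))
stopifnot(is.character(corpusdf$id))
stopifnot(!anyNA(as.Date(corpusdf$date, format = "%Y-%m-%d")))

features <- names(corpusdf)[-(1:3)]
stopifnot(all(vapply(corpusdf[features], is.numeric, logical(1))))

# Mimic the described renaming: spaces in feature names become underscores.
names(corpusdf) <- gsub(" ", "_", names(corpusdf), fixed = TRUE)
```

If no features are of interest, the same data.frame with only the first three columns would be the input, and the "dummy" column would then be added by the constructor as described above.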