sento_corpus: Create a sentocorpus object

Description

Formalizes a collection of texts into a well-defined corpus object by calling, amongst others, the corpus function from the quanteda package, a (very) fast text mining package; for more info, see quanteda. Its formal corpus structure is required for better memory management, corpus manipulation, and sentiment calculation. This function mainly performs a set of checks on the input data and prepares the corpus for further sentiment analysis.

Usage

sento_corpus(corpusdf, do.clean = FALSE)

Arguments

corpusdf

a data.frame with the following named columns, in this order: a document "id" column, a "date" column, a "text" column (i.e., the column where all texts to analyze reside), and a series of feature columns of type numeric, with values indicating how applicable a particular feature is to a particular text. The feature columns are often binary (1 means the feature applies to the document in the same row) or expressed as a percentage that specifies the degree of connectedness of a feature to a document. Features could be topics (e.g., legal, political, or economic), but also article sources (e.g., online or printed press), amongst many more options. If you have no knowledge about features, or no particular features are of interest to your analysis, provide no feature columns. In that case, the corpus constructor automatically adds a feature column named "dummy". Provide the date column as "yyyy-mm-dd". The id column should be in character mode. All spaces in the names of the features are replaced by underscores.
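As a minimal sketch of an input that meets the layout described above, the base R snippet below builds a small data.frame and checks the stated requirements. The documents, dates, and the feature names "economic" and "legal" are made up for illustration; they are not part of the package.

```r
# Hypothetical three-document input; all values are invented for illustration.
corpusdf <- data.frame(
  id = c("doc1", "doc2", "doc3"),
  date = c("2018-01-15", "2018-01-16", "2018-02-01"),
  text = c("Markets rallied on strong earnings.",
           "The court ruling unsettled investors.",
           "Parliament passed the new budget."),
  economic = c(1, 0.4, 0.5),  # degree of connectedness to the feature
  legal = c(0, 1, 0),         # binary applicability
  stringsAsFactors = FALSE
)

# Sanity checks mirroring the requirements stated for corpusdf.
stopifnot(identical(names(corpusdf)[1:3], c("id", "date", "text")))
stopifnot(is.character(corpusdf$id))
stopifnot(!anyNA(as.Date(corpusdf$date, format = "%Y-%m-%d")))
stopifnot(all(vapply(corpusdf[-(1:3)], is.numeric, logical(1))))

# With the sentometrics package installed, the call would then be:
# corpus <- sentometrics::sento_corpus(corpusdf = corpusdf)
```

Running the checks before calling the constructor can help surface column-order or type problems early.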

do.clean

a logical; if TRUE, all texts undergo a cleaning routine to eliminate common textual garbage. This includes a brute-force replacement of HTML tags and non-alphanumeric characters by an empty string.
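To give a rough idea of what such a cleaning step does, here is a base R approximation. It is not the package's actual routine: the helper name clean_sketch, the exact regular expressions, the choice to keep whitespace, and the final whitespace collapse are all assumptions made for this example.

```r
# Illustrative approximation only; not the cleaning code used by sento_corpus.
clean_sketch <- function(x) {
  x <- gsub("<[^>]+>", "", x)                 # remove HTML tags
  x <- gsub("[^[:alnum:][:space:]]", "", x)   # remove non-alphanumerics, keep whitespace (assumption)
  trimws(gsub("[[:space:]]+", " ", x))        # collapse repeated whitespace (assumption)
}

clean_sketch("<p>Stocks rose 3%!</p>  <b>Bonds</b> fell.")
# returns "Stocks rose 3 Bonds fell"
```

Because the replacement is brute force, punctuation and symbols such as "%" are lost, so it may be worth keeping do.clean = FALSE when they matter to the analysis.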

Value

A sentocorpus object, derived from a quanteda corpus: a classed list with the elements "documents", "metadata", and "settings". The first element holds the corpus represented as a data.frame.

Details

A sentocorpus object can be regarded as a specialized instance of a quanteda corpus. In theory, all quanteda functions applicable to its corpus object can also be applied to a sentocorpus object. However, changing a sentocorpus object too drastically with some of quanteda's functions might alter the structure the corpus must have (as defined in the corpusdf argument) to serve as input to other functions of the sentometrics package. Some functions, including corpus_sample and corpus_subset, do not change the actual corpus structure and may come in handy. To add features, we recommend using add_features.

Examples

library("sentometrics")

# load the example news data shipped with the package
data("usnews", package = "sentometrics")

# corpus construction
corpus <- sento_corpus(corpusdf = usnews)

# take a random subset making use of quanteda
corpusSmall <- quanteda::corpus_sample(corpus, size = 500)

# delete a feature
quanteda::docvars(corpus, field = "wapo") <- NULL

# corpus creation when no features are present (a "dummy" feature is added)
corpusDummy <- sento_corpus(corpusdf = usnews[, 1:3])