sento_corpus: Create a sentocorpus object

Description

Formalizes a collection of texts into a well-defined corpus object derived from the quanteda corpus object. The quanteda package provides a robust text mining infrastructure (see quanteda), and its corpus structure brings, for example, a handy corpus manipulation toolset. This function performs a set of checks on the input data and prepares the corpus for further analysis.

Usage

sento_corpus(corpusdf, do.clean = FALSE)

Arguments

corpusdf

a data.frame (or a data.table, or a tbl) with the following named columns: a document "id" column (in character mode), a "date" column (formatted as "yyyy-mm-dd"), a "texts" column (in character mode), and a series of numeric feature columns, with values between 0 and 1 that specify the degree of connectedness of a feature to a document. Features could be topics (e.g., legal or economic) or article sources (e.g., online or print), amongst many more options. When no feature column is provided, a feature named "dummyFeature" is added. All spaces in the names of the features are replaced by '_'. Feature columns with values outside the range from 0 to 1 are rescaled column-wise.
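As an illustration of the expected layout, the following sketch builds a small input by hand using base R only. The ids, dates, texts and the feature names economy and noneconomy are invented for the example, and the checks at the end are only a rough approximation of the column requirements described above, not the checks the function itself performs.

```r
# Hypothetical toy input; all values are invented for illustration.
corpusdf <- data.frame(
  id = c("doc1", "doc2", "doc3"),
  date = c("2017-01-05", "2017-01-12", "2017-02-01"),
  texts = c("Markets rallied on strong earnings.",
            "The court ruled on the merger case.",
            "Unemployment fell while prices rose."),
  economy = c(1, 0, 1),     # binary feature: 1 if the text is about the economy
  noneconomy = c(0, 1, 0),  # complementary feature
  stringsAsFactors = FALSE  # keep "id" and "texts" in character mode
)

# Rough sanity checks mirroring the documented column requirements
stopifnot(is.character(corpusdf$id), is.character(corpusdf$texts))
stopifnot(!any(is.na(as.Date(corpusdf$date, format = "%Y-%m-%d"))))
featureCols <- setdiff(names(corpusdf), c("id", "date", "texts"))
stopifnot(all(sapply(corpusdf[featureCols], is.numeric)))
```

A data.frame of this shape could then be passed on as sento_corpus(corpusdf = corpusdf).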

do.clean

a logical; if TRUE, all texts undergo a cleaning routine that eliminates common textual garbage. This includes a brute-force replacement of HTML tags and non-alphanumeric characters by an empty string. Use with care if the text is meant to contain non-alphanumeric characters! Preferably, cleaning is done outside of this function call.
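Cleaning outside of the function call could look like the sketch below, which relies on base R only. The raw texts and the regular expressions are assumptions made for illustration; they are deliberately gentler than the routine triggered by do.clean = TRUE and are not a reproduction of it.

```r
# Hypothetical raw texts containing HTML remnants
rawTexts <- c("<p>Stocks <b>rose</b> 2.5% today.</p>",
              "Oil &amp; gas prices fell<br/>sharply.")

# Replace HTML tags by a space, so that words around a tag do not get glued
cleanTexts <- gsub("<[^>]+>", " ", rawTexts)
# Decode one common HTML entity (extend as needed for your own data)
cleanTexts <- gsub("&amp;", "&", cleanTexts, fixed = TRUE)
# Collapse repeated whitespace and trim both ends
cleanTexts <- trimws(gsub("\\s+", " ", cleanTexts))
```

The cleaned vector would then replace the "texts" column of the input before the corpus is created with do.clean = FALSE, which keeps characters such as "%" and "&" intact.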

Value

A sentocorpus object, derived from a quanteda corpus classed list, with the elements "documents", "metadata", and "settings" kept. The first element, "documents", holds the corpus represented as a data.frame.

Details

A sentocorpus object is a specialized instance of a quanteda corpus. Any quanteda function applicable to its corpus object can also be applied to a sentocorpus object. However, changing a given sentocorpus object too drastically using some of quanteda's functions might alter the very structure the corpus is meant to have (as defined in the corpusdf argument) in order to be usable as an input in other functions of the sentometrics package. Some functions, including corpus_sample and corpus_subset, do not change the actual corpus structure and may come in handy. To add additional features, use add_features. Binary features are useful as a mechanism to select the texts that have to be integrated in the respective feature-based sentiment measure(s), but this selection applies only when do.ignoreZeros = TRUE. Because of this (implicit) selection, having complementary features (e.g., "economy" and "noneconomy") makes sense.
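The idea of complementary binary features can be sketched in base R as follows. The indicator values are hypothetical, and the sketch only prepares the feature columns; it does not call add_features, whose signature is not documented on this page.

```r
# Hypothetical binary indicator: 1 if a text is about the economy, 0 otherwise
features <- data.frame(economy = c(1, 0, 1, 0))

# The complementary feature selects exactly the texts the first one leaves out
features$noneconomy <- 1 - features$economy

# Every text is selected by exactly one of the two features
stopifnot(all(features$economy + features$noneconomy == 1))
```

With such a pair, each text contributes to exactly one of the two feature-based measures when zero-valued features are ignored.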

Examples

# NOT RUN {
data("usnews", package = "sentometrics")

# corpus construction
corp <- sento_corpus(corpusdf = usnews)

# take a random subset making use of quanteda
corpusSmall <- quanteda::corpus_sample(corp, size = 500)

# deleting a feature
quanteda::docvars(corp, field = "wapo") <- NULL

# deleting all features results in the addition of a dummy feature
quanteda::docvars(corp, field = c("economy", "noneconomy", "wsj")) <- NULL

# }
# NOT RUN {
# to add or replace features, use the add_features() function...
quanteda::docvars(corp, field = c("wsj", "new")) <- 1
# }
# NOT RUN {
# corpus creation when no features are present
corpusDummy <- sento_corpus(corpusdf = usnews[, 1:3])

# }