sento_corpus: Create a sento_corpus object

Description

Formalizes a collection of texts into a sento_corpus object derived from the quanteda corpus object. The quanteda package provides a robust text mining infrastructure (see their website), including a handy corpus manipulation toolset. This function performs a set of checks on the input data and prepares the corpus for further analysis by structurally integrating a date dimension and numeric metadata features.

Usage

sento_corpus(corpusdf, do.clean = FALSE)

Value

A sento_corpus object, derived from a quanteda corpus object. The corpus is ordered by date.

Arguments

corpusdf: a data.frame (or a data.table, or a tbl) with the following named columns: a document "id" column (coercible to character mode), a "date" column (formatted as "yyyy-mm-dd"), a "texts" column (in character mode), an optional "language" column (in character mode), and a series of numeric feature columns, with values between 0 and 1 that specify the degree of connectedness of a feature to a document. Features could be, for instance, topics (e.g., legal or economic) or article sources (e.g., online or print). When no feature column is provided, a feature named "dummyFeature" is added. All spaces in the feature names are replaced by '_'. Feature columns with values outside the range 0 to 1 are rescaled column-wise.
do.clean: a logical; if TRUE, all texts undergo a cleaning routine to eliminate common textual garbage. This includes a brute-force replacement of HTML tags and non-alphanumeric characters by an empty string. Use with care if the text is meant to contain non-alphanumeric characters! Preferably, cleaning is done outside of this function call.
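
As a rough illustration of the layout the corpusdf argument expects, the base R sketch below builds a small data.frame with the required columns and two feature columns, then mimics the documented preparation steps by hand. The ids, dates, texts and feature names are invented, and rescale01() is a hypothetical helper: min-max scaling is only an assumption about what "rescaled column-wise" amounts to, so the function itself may rescale differently.

```r
# Hypothetical toy input; ids, dates, texts and feature names are invented.
corpusdf <- data.frame(
  id = c("doc1", "doc2", "doc3"),
  date = c("2020-01-03", "2020-01-01", "2020-01-02"),  # "yyyy-mm-dd"
  texts = c("Markets rallied.", "Court rules on merger.", "Growth slows."),
  economy = c(1, 0, 1),           # already between 0 and 1
  `legal score` = c(2, 10, 6),    # outside 0-1, and the name contains a space
  stringsAsFactors = FALSE,
  check.names = FALSE
)

# Assumed min-max rescaling to [0, 1]; not a sentometrics function.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Every column that is not id, date, texts or language counts as a feature.
featureCols <- setdiff(names(corpusdf), c("id", "date", "texts", "language"))
for (col in featureCols) {
  rng <- range(corpusdf[[col]])
  if (rng[1] < 0 || rng[2] > 1) corpusdf[[col]] <- rescale01(corpusdf[[col]])
}

# Spaces in feature names become underscores; rows are ordered by date.
names(corpusdf) <- gsub(" ", "_", names(corpusdf), fixed = TRUE)
corpusdf <- corpusdf[order(as.Date(corpusdf$date)), ]
```

Since these steps are described as part of what sento_corpus does, doing them by hand is mainly useful for checking what the input will look like before it is turned into a corpus.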

Author

Samuel Borms

Details

A sento_corpus object is a specialized instance of a quanteda corpus. Any quanteda function applicable to its corpus object can also be applied to a sento_corpus object. However, changing a given sento_corpus object too drastically using some of quanteda's functions might alter the very structure the corpus is meant to have (as defined in the corpusdf argument) to be able to be used as an input in other functions of the sentometrics package. There are functions, including corpus_sample or corpus_subset, that do not change the actual corpus structure and may come in handy.

To add additional features, use add_features. Binary features are useful as a mechanism to select the texts that have to be integrated in the respective feature-based sentiment measure(s), but this selection applies only when do.ignoreZeros = TRUE. Because of this (implicit) selection, having complementary features (e.g., "economy" and "noneconomy") makes sense.
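
The selection effect of binary features can be sketched in base R. This is a deliberately simplified illustration and not the package's aggregation code: the sentiment scores are invented, aggregate_feature() is a hypothetical helper, and equal weighting across documents is an assumption.

```r
# Invented document-level sentiment scores and complementary binary features.
sentiment <- c(doc1 = 0.2, doc2 = -0.1, doc3 = 0.4)
economy <- c(1, 0, 1)
noneconomy <- 1 - economy

# Hypothetical helper: equal-weight average of feature-weighted sentiment.
# With ignoreZeros = TRUE, documents with a zero feature weight are dropped
# instead of entering the average as zeros.
aggregate_feature <- function(s, w, ignoreZeros = TRUE) {
  weighted <- unname(s * w)
  if (ignoreZeros) weighted <- weighted[w != 0]
  mean(weighted)
}

aggregate_feature(sentiment, economy, ignoreZeros = TRUE)     # doc1 and doc3 only
aggregate_feature(sentiment, economy, ignoreZeros = FALSE)    # doc2 enters as a zero
aggregate_feature(sentiment, noneconomy, ignoreZeros = TRUE)  # doc2 only
```

In this sketch the complementary pair splits the corpus cleanly: every document contributes to exactly one of the two feature-based averages when zeros are ignored.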

It is also possible to add one non-numerical feature, that is, "language", to designate the language of the corpus texts. When this feature is provided, a list of lexicons for different languages is expected in the compute_sentiment function.
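
A minimal sketch of what such a list of lexicons might look like is given below. It assumes the sentometrics package is installed, that its built-in list_lexicons data contains entries named "LM_en" and "GI_nl_tr", and that the list passed to compute_sentiment is named after the values in the "language" column; these details should be checked against the compute_sentiment documentation.

```r
# Assumes the sentometrics package is installed; the block is skipped otherwise.
if (requireNamespace("sentometrics", quietly = TRUE)) {
  library("sentometrics")
  data("usnews", package = "sentometrics")
  data("list_lexicons", package = "sentometrics")

  # tag part of the corpus as Dutch, purely for illustration
  usnews[["language"]] <- "en"
  usnews[["language"]][200:400] <- "nl"
  corpusLang <- sento_corpus(corpusdf = usnews)

  # one set of lexicons per language, named after the "language" values
  # (lexicon choice and list naming are assumptions, see the lead-in)
  lexicons <- list(
    en = sento_lexicons(list_lexicons["LM_en"]),
    nl = sento_lexicons(list_lexicons["GI_nl_tr"])
  )

  sentiment <- compute_sentiment(corpusLang, lexicons, how = "proportional")
}
```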

Examples

data("usnews", package = "sentometrics")

# corpus construction
corp <- sento_corpus(corpusdf = usnews)

# take a random subset making use of quanteda
corpusSmall <- quanteda::corpus_sample(corp, size = 500)

# deleting a feature
quanteda::docvars(corp, field = "wapo") <- NULL

# deleting all features results in the addition of a dummy feature
quanteda::docvars(corp, field = c("economy", "noneconomy", "wsj")) <- NULL

if (FALSE) {
  # to add or replace features, use the add_features() function...
  quanteda::docvars(corp, field = c("wsj", "new")) <- 1
}

# corpus creation when no features are present
corpusDummy <- sento_corpus(corpusdf = usnews[, 1:3])

# corpus creation with a qualitative language feature
usnews[["language"]] <- "en"
usnews[["language"]][200:400] <- "nl"
corpusLang <- sento_corpus(corpusdf = usnews)
