
sentometrics (version 0.5.6)

compute_sentiment: Compute document-level sentiment across features and lexicons

Description

Given a corpus of texts, computes (net) sentiment per document using the bag-of-words approach, based on the lexicons provided and a choice of how to aggregate across words per document.

Usage

compute_sentiment(x, lexicons, how = "proportional", tokens = NULL,
  nCore = 1)

Arguments

x

either a sentocorpus object created with sento_corpus, a quanteda corpus object, or a character vector. The latter two do not incorporate a date dimension. In case of a corpus object, the numeric columns from the docvars are considered as features over which sentiment will be computed. In case of a character vector, sentiment is only computed across lexicons.

lexicons

a sentolexicons object created using sento_lexicons.

how

a single character string defining how aggregation within documents should be performed. For the currently available options, see get_hows()$words (a short sketch listing them follows the arguments below).

tokens

a list of tokenized documents, to specify your own tokenization scheme. Can result from quanteda's tokens function, the tokenizers package, or other tokenization tools. Make sure the tokens are constructed from (the texts from) the x argument, are unigrams, and are preferably set to lowercase; otherwise, results may be spurious and errors could occur. By default set to NULL.

nCore

a positive numeric that will be passed on to the numThreads argument of the setThreadOptions function from the RcppParallel package, to parallelize the sentiment computation across texts. A value of 1 (default) implies no parallelization. Parallelization is expected to improve the speed of the sentiment computation only for sufficiently large corpora.
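
As referenced in the how argument, the available aggregation options can be inspected directly. A minimal sketch (the printed values depend on the installed package version):

library("sentometrics")
get_hows()$words  # within-document aggregation options accepted by the how argument
get_hows()$docs   # across-document aggregation options (used by the aggregation functions)
get_hows()$time   # across-time aggregation options (used by the aggregation functions)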

Value

If x is a sentocorpus object, a sentiment object, i.e., a data.table containing the sentiment scores, with an "id", a "date" and a "word_count" column, and all lexicon--feature sentiment scores columns. A sentiment object can be used for aggregation into time series with the aggregate.sentiment function.

If x is a quanteda corpus object, a sentiment scores data.table with an "id" and a "word_count" column, and all lexicon--feature sentiment scores columns.

If x is a character vector, a sentiment scores data.table with a "word_count" column, and all lexicon--feature sentiment scores columns.
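
The reference to aggregate.sentiment above can be illustrated with a minimal, hedged sketch; the specific ctr_agg() control settings shown here, and passing its output directly to aggregate(), are assumptions of this example rather than requirements stated on this help page:

library("sentometrics")
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
corpus <- sento_corpus(corpusdf = usnews)
lex <- sento_lexicons(list_lexicons[c("LM_en")])
sent <- compute_sentiment(corpus, lex, how = "proportional")  # a sentiment object
# aggregate the document-level scores into monthly sentiment time series
ctr <- ctr_agg(howWithin = "proportional", howDocs = "equal_weight",
               howTime = "equal_weight", by = "month", lag = 3)
measures <- aggregate(sent, ctr)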

Calculation

If the lexicons argument has no "valence" element, the sentiment computed corresponds to simple unigram matching with the lexicons [unigrams approach]. If valence shifters are included in lexicons with a corresponding "y" column, they modify the polarity of a word detected from the lexicon if they appear right before that word (examples: not good, very bad or can't defend) [bigrams approach]. If the valence table contains a "t" column, valence shifters are searched for in a cluster centered around a detected polarity word [clusters approach]. The latter approach is similar to the one utilized by the sentimentr package, but simplified. A cluster amounts to four words before and two words after a polarity word, and a cluster never overlaps with a preceding one.

Roughly speaking, the polarity of a cluster is calculated as \(n(1 + 0.80d)S + \sum s\), where \(S\) is the polarity score of the detected word, \(s\) represents the polarities of eventual other sentiment words in the cluster, and \(d\) is the difference between the number of amplifiers (t = 2) and the number of deamplifiers (t = 3). If there is an odd number of negators (t = 1), \(n = -1\) and amplifiers are counted as deamplifiers; else \(n = 1\).

All scores, whether per unigram, per bigram or per cluster, are summed within a document before the scaling defined by the how argument is applied. The how = "proportionalPol" option divides each document's sentiment score by the number of detected polarized words (counting words that appear multiple times by their frequency), whereas the how = "proportional" option divides by the total number of words. The how = "counts" option applies no normalization. See the vignette for more details.
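
As a purely illustrative, hand-computed sketch of the cluster formula above (the lexicon value of +1 for "good", and the classification of "not" as a negator and "very" as an amplifier, are assumptions of this example), the polarity of the snippet "not very good" works out as follows:

# hypothetical cluster "not very good": "good" is the detected polarity word,
# "not" a negator (t = 1) and "very" an amplifier (t = 2)
S <- 1                      # assumed lexicon polarity of "good"
s <- 0                      # no other sentiment words in the cluster
n <- -1                     # odd number of negators
d <- 0 - 1                  # with odd negators, the amplifier counts as a deamplifier
n * (1 + 0.80 * d) * S + s  # cluster polarity: -1 * (1 - 0.80) * 1 = -0.20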

Details

For a separate calculation of positive (resp. negative) sentiment, one has to provide distinct positive (resp. negative) lexicons. This can be done using the do.split option in the sento_lexicons function, which splits out the lexicons into a positive and a negative polarity counterpart. All NAs are converted to 0, under the assumption that this is equivalent to no sentiment. If tokens = NULL (as per default), texts are tokenized as unigrams using the tokenize_words function. Punctuation and numbers are removed, but not stopwords. The number of words for each document is computed based on that same tokenization. All tokens are converted to lowercase, in line with what the sento_lexicons function does for the lexicons and valence shifters.
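
A short sketch of the do.split route described above; the exact names given to the split lexicons are determined by sento_lexicons() and are not assumed here:

library("sentometrics")
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
# split the input lexicon into a positive and a negative polarity counterpart
lexSplit <- sento_lexicons(list_lexicons[c("LM_en")], do.split = TRUE)
names(lexSplit)  # one positive and one negative lexicon per input lexicon
# positive and negative sentiment now appear as separate lexicon--feature columns
sentSplit <- compute_sentiment(usnews[["texts"]][1:50], lexSplit, how = "counts")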

Examples

# NOT RUN {
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]][, c("x", "t")])

# from a sentocorpus object, unigrams approach
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol")

# from a character vector, bigrams approach
sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts")

# from a corpus object, clusters approach
corpusQ <- quanteda::corpus(usnews, text_field = "texts")
corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200)
sent3 <- compute_sentiment(corpusQSample, l3, how = "counts")

# from an already tokenized corpus, using the 'tokens' argument
toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword"))
sent4 <- compute_sentiment(corpusQSample, l1[1], how = "counts", tokens = toks)

# }
