coherence: Coherence of a text

Description

Computes coherence of a given paragraph/document

Usage

coherence(x, split = c(".", "!", "?"), tvectors = tvectors, remove.punctuation = TRUE,
          stopwords = NULL, method = "Add")

Value

A list of two elements: the first element ($local) contains the local coherences as a numeric vector, the second element ($global) contains the global coherence as a single numeric value.

Arguments

x: a character vector of length(x) = 1 containing the document
split: a vector of expressions that determine where to split sentences
tvectors: the semantic space in which the computation is to be done (a numeric matrix where every row is a word vector)
remove.punctuation: removes punctuation from x after splitting the sentences; TRUE by default
stopwords: a character vector defining a list of words that are not used to compute the sentence vectors for x
method: the compositional model to compute the document vector from its word vectors. The default option method = "Add" computes the document vector as the vector sum. With method = "Multiply", the document vector is computed via element-wise multiplication (see compose).

Author

Fritz Guenther

Details

This function applies the method described in Landauer & Dumais (1997): The local coherence is the cosine between the vectors of two adjacent sentences, so a document with k sentences yields k - 1 local coherences. The global coherence is then computed as the mean value of these local coherences.
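
As a rough illustration of this computation, the following base R sketch works on made-up two-dimensional sentence vectors. The names sentence_vecs and cosine_sim are hypothetical and not part of the package; this is not the function's internal code.

```r
# Toy sentence vectors (made-up values, one row per sentence)
sentence_vecs <- rbind(
  s1 = c(1, 0),
  s2 = c(1, 1),
  s3 = c(0, 1)
)

# Cosine between two numeric vectors (hypothetical helper)
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Local coherence: cosine between each pair of adjacent sentences
n <- nrow(sentence_vecs)
local <- sapply(seq_len(n - 1), function(i) {
  cosine_sim(sentence_vecs[i, ], sentence_vecs[i + 1, ])
})

# Global coherence: mean of the local coherences
global <- mean(local)
```

With three sentences, local has two entries, one per adjacent pair.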

The format of x should be of the kind x <- "sentence1. sentence2. sentence3" Every sentence can also just consist of one single word.
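
To make the expected input format concrete, here is a minimal base R sketch of splitting such a string at the default split characters. It only illustrates the idea and is not necessarily how the function splits its input internally.

```r
x <- "sentence1. sentence2! sentence3?"

# Split at ".", "!" or "?"; inside a character class they are matched literally
sentences <- strsplit(x, "[.!?]")[[1]]

# Drop surrounding whitespace and any empty pieces
sentences <- trimws(sentences)
sentences <- sentences[nzchar(sentences)]
```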

To import a document such as Alice_in_Wonderland.txt from a directory for coherence computation, set your working directory to that directory using setwd(). Then use the following commands:

fileName1 <- "Alice_in_Wonderland.txt"

x <- readChar(fileName1, file.info(fileName1)$size)

In the traditional LSA approach, the vector D for a document (or a sentence) consisting of the words (t1, ..., tn) is computed as $$D = \sum\limits_{i=1}^n t_i$$ This is the default method (method="Add") for this function. Alternatively, this function provides the possibility of computing the document vector from its word vectors using element-wise multiplication (see Mitchell & Lapata, 2010 and compose).
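
The two composition methods can be sketched in base R on a made-up toy space. The matrix toy_space and its values are assumptions for illustration only; in practice the package's compose function performs this step.

```r
# Toy semantic space (made-up values); each row is a word vector
toy_space <- rbind(
  cat = c(1, 2, 0),
  sat = c(0, 1, 3),
  mat = c(2, 1, 1)
)
words <- c("cat", "sat", "mat")

# method = "Add": vector sum of the word vectors
d_add <- colSums(toy_space[words, , drop = FALSE])

# method = "Multiply": element-wise product of the word vectors
d_mult <- apply(toy_space[words, , drop = FALSE], 2, prod)
```

Note that under multiplication a single zero in any word vector sets that dimension of the document vector to zero, which is one reason the two methods can give quite different coherence values.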

A note will be displayed whenever not all words of one input string are found in the semantic space. Caution: In that case, the function will still produce a result, by omitting the words not found in the semantic space. Depending on the specific requirements of a task, this may compromise the results. Please check your input when you receive this message.

A warning message will be displayed whenever no word of one input string is found in the semantic space.
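
To check an input before computing coherence, one can list the words that have no vector in the semantic space. The sketch below assumes that words are looked up by the row names of the matrix; toy_space and its values are made up for illustration.

```r
# Toy semantic space (made-up values); row names are the known words
toy_space <- rbind(
  duchess = c(0.2, 0.1),
  baby    = c(0.4, 0.3),
  cat     = c(0.1, 0.5)
)

sentence <- "the duchess and the baby"
words <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Words without a row in the semantic space (these would be omitted)
missing_words <- setdiff(words, rownames(toy_space))
```

If missing_words contains every word of a sentence, no sentence vector can be built for it, which corresponds to the warning case described above.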

References

Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.

Mitchell, J., & Lapata, M. (2010). Composition in Distributional Models of Semantics. Cognitive Science, 34, 1388-1429.

Examples

library(LSAfun)   # package providing coherence() and the wonderland space
data(wonderland)

coherence ("there was certainly too much of it in the air. even the duchess
sneezed occasionally; and as for the baby, it was sneezing and howling
alternately without a moment's pause. the only things in the kitchen
that did not sneeze, were the cook, and a large cat which was sitting on
the hearth and grinning from ear to ear.",
tvectors=wonderland)