multidocs: Comparison of sentence sets

Description

Computes the cosines between two sets of sentences and/or documents.

Usage

multidocs(x, y = x, chars = 10, tvectors = tvectors, remove.punctuation = TRUE,
          stopwords = NULL, method = "Add")

Value

A list of three elements:

cosmat: A numeric matrix giving the cosines between the input sentences/documents
xdocs: A legend for the row.names of cosmat
ydocs: A legend for the col.names of cosmat

Arguments

x: a character vector containing different sentences/documents
y: a character vector containing different sentences/documents (y = x by default)
chars: an integer specifying how many letters (starting from the first) of each sentence/document are to be printed in the row.names and col.names of the output matrix
tvectors: the semantic space in which the computation is to be done (a numeric matrix where every row is a word vector)
remove.punctuation: logical; if TRUE (the default), punctuation is removed from x and y
stopwords: a character vector defining a list of words that are not used to compute the document/sentence vector for x and y
method: the compositional model to compute the document vector from its word vectors. The default option method = "Add" computes the document vector as the vector sum. With method = "Multiply", the document vector is computed via element-wise multiplication (see compose).

Author

Fritz Guenther

Details

In the traditional LSA approach, the vector D for a document (or a sentence) consisting of the words (t1, ..., tn) is computed as $$D = \sum\limits_{i=1}^n t_i$$ This is the default method (method="Add") for this function. Alternatively, this function provides the possibility of computing the document vector from its word vectors using element-wise multiplication (see Mitchell & Lapata, 2010 and compose).
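The two composition methods can be illustrated with a minimal base R sketch. The matrix tv and its values are invented for illustration and are not taken from any real semantic space; the snippet mirrors the formulas above rather than the package's internal code.

```r
# Toy semantic space: 3 words, 2 dimensions (made-up values)
tv <- matrix(c(1,   2,
               3,   4,
               0.5, 2),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("alice", "queen", "tea"), NULL))

words <- c("alice", "queen", "tea")

# Additive composition: the document vector is the sum of the word vectors
d_add <- colSums(tv[words, , drop = FALSE])

# Multiplicative composition: element-wise product of the word vectors
d_mult <- apply(tv[words, , drop = FALSE], 2, prod)

print(d_add)   # components 4.5 and 8
print(d_mult)  # components 1.5 and 16
```

Note that with multiplication a single zero entry in any word vector sets that dimension of the document vector to zero, which is one reason the two methods can give quite different cosines.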

This function computes the cosines between two sets of documents (or sentences).
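As a rough sketch of what a cosine matrix between two sets of document vectors looks like, the following base R code normalises each row to unit length and takes the matrix product. The helper cosine_matrix and the toy vectors are hypothetical and only illustrate the computation; they are not part of the package.

```r
# Hypothetical helper: cosine between every row of A and every row of B
cosine_matrix <- function(A, B) {
  A_unit <- A / sqrt(rowSums(A^2))  # scale each row of A to length 1
  B_unit <- B / sqrt(rowSums(B^2))  # scale each row of B to length 1
  A_unit %*% t(B_unit)              # dot products of unit vectors = cosines
}

# Two sets of made-up document vectors (one row per document)
X <- rbind(doc1 = c(1, 0), doc2 = c(1, 1))
Y <- rbind(doc3 = c(0, 2), doc4 = c(3, 3))

cosmat <- cosine_matrix(X, Y)
print(round(cosmat, 3))
```

Rows correspond to the documents in the first set and columns to those in the second, which matches the layout described for cosmat in the Value section.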

Both x and y should be character vectors with one sentence or document per element, for example x <- c("this is the first text","here is another text") and y <- c("this is a third text","and here is yet another text").

A note will be displayed whenever not all words of one input string are found in the semantic space. Caution: In that case, the function will still produce a result, by omitting the words not found in the semantic space. Depending on the specific requirements of a task, this may compromise the results. Please check your input when you receive this message.

A warning message will be displayed whenever no word of one input string is found in the semantic space.
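To check the input before calling the function, one can compare the words of each string with the row names of the semantic space. The sketch below uses a toy space and plain whitespace splitting, which is an assumption and may differ from the tokenisation the function applies internally.

```r
# Toy semantic space; only the row names matter for this check
tv <- matrix(0, nrow = 3, ncol = 2,
             dimnames = list(c("alice", "queen", "tea"), NULL))

docs <- c("alice greeted the queen", "zzz qqq")

# Split each string on whitespace (a simplification, see the note above)
tokens <- strsplit(tolower(docs), "\\s+")

# Words of each document that have no row in the semantic space
not_found <- lapply(tokens, function(w) setdiff(w, rownames(tv)))

# TRUE for documents in which no word at all is found
none_found <- vapply(tokens, function(w) !any(w %in% rownames(tv)), logical(1))

print(not_found)
print(none_found)
```

In this toy case the first string is only partly covered (the situation that triggers the note), while the second has no covered words (the situation that triggers the warning).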

References

Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.

Dennis, S. (2007). How to use the LSA Web Site. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 35-56). Mahwah, NJ: Erlbaum.

Mitchell, J., & Lapata, M. (2010). Composition in Distributional Models of Semantics. Cognitive Science, 34, 1388-1429.

http://wordvec.colorado.edu/

Examples

library(LSAfun)

# load the example semantic space shipped with the package
data(wonderland)

multidocs(x = c("alice was beginning to get very tired.",
                "the red queen greeted alice."),
          y = c("the mad hatter and the march hare are having a party.",
                "the hatter sliced the cup of tea in half."),
          tvectors = wonderland)
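The elements of the returned list can be inspected individually. The following sketch assumes the LSAfun package is installed and relies only on the list structure described in the Value section; the object name res is arbitrary.

```r
library(LSAfun)
data(wonderland)

res <- multidocs(x = c("alice was beginning to get very tired.",
                       "the red queen greeted alice."),
                 y = c("the hatter sliced the cup of tea in half."),
                 tvectors = wonderland)

res$cosmat  # cosines: rows correspond to x, columns to y
res$xdocs   # legend for the row names of cosmat
res$ydocs   # legend for the column names of cosmat
```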
