plot_doclist: 2D- or 3D-Plot of a list of sentences/documents

Description

2D- or 3D-Plot of the mutual similarities within a given list of sentences/documents

Usage

plot_doclist(x,connect.lines="all",method="PCA",dims=3,
   axes=F,box=F,cex=1,chars=10,legend=T, size = c(800,800),
   alpha="graded",alpha.grade=1,col="rainbow",
   tvectors=tvectors,remove.punctuation=TRUE,...)

Value

see plot3d: this function is called for the side effect of drawing the plot; a vector of object IDs is returned.

plot_doclist further prints a list with two elements:

coordinates: the coordinate vectors of the sentences/documents in the plot as a data frame
xdocs: A legend for the sentence/document labels in the plot and in the coordinates

Arguments

x: a character vector of length(x) > 1 that contains multiple sentences/documents
dims: the dimensionality of the plot; set either dims = 2 or dims = 3
method: the method to be applied; either a Principal Component Analysis (method="PCA") or a Multidimensional Scaling (method="MDS")
connect.lines: (3d plot only) the number of closest associates each sentence/document is connected to by a line. Setting connect.lines="all" (default) will draw all connecting lines and will automatically apply alpha="graded"
axes: (3d plot only) whether axes shall be included in the plot
box: (3d plot only) whether a box shall be drawn around the plot
cex: (2d Plot only) A numerical value giving the amount by which plotting text should be magnified relative to the default.
chars: an integer specifying how many letters (starting from the first) of each sentence/document are to be printed in the plot
legend: (3d plot only) whether a legend shall be drawn illustrating the color scheme of the connect.lines. The legend is inserted as a background bitmap to the plot using bgplot3d. Therefore, it does not resize very gracefully (see the bgplot3d documentation for more information).
size: (3d plot only) A numeric vector with two elements, the first specifying the width and the second specifying the height of the plot device.
tvectors: the semantic space in which the computation is to be done (a numeric matrix where every row is a word vector)
remove.punctuation: removes punctuation from x; TRUE by default
alpha: (3d plot only) A numeric vector specifying the luminance of the connect.lines. By setting alpha="graded", the luminance of every line will be adjusted to the cosine between the two sentences/documents it connects.
alpha.grade: (3d plot only) Only relevant if alpha="graded". Specify a numeric value for alpha.grade to scale the luminance of all connect.lines up (alpha.grade > 1) or down (alpha.grade < 1) by that factor.
col: (3d plot only) A vector specifying the color of the connect.lines. With col="rainbow" (default), the color of every line will be adjusted to the cosine between the two sentences/documents it connects, according to the rainbow palette. Other available color palettes for this purpose are heat.colors, terrain.colors, topo.colors, and cm.colors (see rainbow). Additionally, you can define a custom color scale by providing more than one color (for example col = c("black","blue","red")).
...: additional arguments which will be passed to plot3d (in a three-dimensional plot only)
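
The chars and col arguments can be illustrated with base R alone. The following is a minimal sketch, not the package's internal code: it assumes labels are formed by taking the first chars letters of each document, and that a custom color vector is interpolated into a continuous scale onto which cosines are mapped. The helper names (labels, palette_fun, line_cols) and the cosine values are hypothetical.

```r
docs <- c("alice was beginning to get very tired.",
          "the red queen greeted alice.")

# chars = 10: keep only the first 10 characters of each document as its label.
chars <- 10
labels <- substr(docs, 1, chars)

# col = c("black","blue","red"): interpolate the colors into a 100-step scale
# (assumption: this mirrors what a custom color scale does conceptually).
palette_fun <- colorRampPalette(c("black", "blue", "red"))
cols <- palette_fun(100)

# Map made-up cosines in (0, 1] to a position on that scale.
cosines <- c(0.1, 0.5, 0.9)
line_cols <- cols[pmax(1, ceiling(cosines * 100))]
```

Low cosines end up near the first color and high cosines near the last one, which is the behavior the col description suggests.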

Author

Fritz Guenther, Taylor Fedechko

Details

Computes all pairwise similarities within a given list of sentences/documents. On this similarity matrix, a Principal Component Analysis (PCA) or a Multidimensional Scaling (MDS) is applied to get a two- or three-dimensional solution that best captures the similarity structure. This solution is then plotted.
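
The dimensionality reduction step can be sketched with base R as follows. This is an illustration under stated assumptions: the similarity matrix is made up, and the use of cmdscale and prcomp here is an assumption about how such a reduction can be done, not a statement about the routines plot_doclist calls internally.

```r
# Hypothetical cosine similarity matrix for three documents.
sims <- matrix(c(1.0, 0.8, 0.1,
                 0.8, 1.0, 0.2,
                 0.1, 0.2, 1.0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("d1", "d2", "d3"), c("d1", "d2", "d3")))

# MDS: turn similarities into dissimilarities, then apply classical scaling.
mds_coords <- cmdscale(as.dist(1 - sims), k = 2)

# PCA: principal components of the similarity matrix, keeping the first two.
pca_coords <- prcomp(sims)$x[, 1:2]
```

In both cases each row holds the two-dimensional coordinates of one document, so similar documents (here d1 and d2) end up close together.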

In the traditional LSA approach, the vector D for a document (or a sentence) consisting of the words (t1, ..., tn) is computed as $$D = \sum\limits_{i=1}^n t_i$$ This function then computes the cosines between all pairs of documents (or sentences) in x.
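
As a minimal sketch of this computation, the following base R code builds document vectors by summing word vectors and compares two documents by their cosine. The toy semantic space and the helpers doc_vector and cosine are hypothetical and only illustrate the formula; they are not the package's implementation.

```r
# Toy semantic space: every row is a word vector (values are made up).
tvectors <- rbind(
  alice  = c(1, 0, 2),
  queen  = c(0, 1, 1),
  hatter = c(2, 1, 0),
  tea    = c(1, 1, 0)
)

# Document vector D: sum of the vectors of all words found in the space.
doc_vector <- function(doc, tvectors) {
  words <- strsplit(tolower(doc), "[^a-z]+")[[1]]
  words <- words[words %in% rownames(tvectors)]
  colSums(tvectors[words, , drop = FALSE])
}

# Cosine similarity between two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

d1 <- doc_vector("alice queen", tvectors)  # alice + queen
d2 <- doc_vector("hatter tea", tvectors)   # hatter + tea
sim <- cosine(d1, d2)
```

Words that do not occur in the semantic space are simply skipped in this sketch.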

The format of x should be of the kind x <- c("this is the first text","here is another text")

To create plots that best show the similarity structure within this list of sentences/documents, set connect.lines="all" and col="rainbow".

References

Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.

Mardia, K.V., Kent, J.T., & Bibby, J.M. (1979). Multivariate Analysis, London: Academic Press.

Examples

data(wonderland)

## Standard Plot

docs <- c("alice was beginning to get very tired.",
          "the red queen greeted alice.",
          "the mad hatter and the mare hare are having a party.",
          "the hatter sliced the cup of tea in half.")
          
plot_doclist(docs,tvectors=wonderland,method="MDS",dims=2)