documents.compare: Compare the documents in two corpora/dtms

Description

Compare the documents in corpus dtm with the reference corpus dtm.y.

Usage

documents.compare(
  dtm,
  dtm.y = NULL,
  measure = c("cosine", "overlap_pct", "overlap", "crossprod", "softcosine",
    "query_lookup", "query_lookup_pct"),
  min.similarity = 0,
  n.topsim = NULL,
  max_p = 1,
  pvalue = c("none", "normal", "lognormal", "nz_normal", "nz_lognormal", "disparity"),
  simmat = NULL,
  simmat_thres = NULL
)

Arguments

dtm

A quanteda dfm. Alternatively, a DocumentTermMatrix from the tm package can be used.

dtm.y

Optional. If given, documents from dtm will only be compared to the documents in dtm.y.

measure

The measure used to calculate similarity/distance/adjacency. Currently supported are the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical "softcosine" measure (experimental). The regular "crossprod" (inner product) is also supported. If the DTMs were prepared with the create_queries function, the special measures "query_lookup" and "query_lookup_pct" can be used.
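To make the difference between the symmetrical and asymmetrical measures concrete, here is a minimal base R sketch on a toy matrix. The helper functions cosine_sim and overlap_pct are hypothetical names written for this illustration. They follow one reading of the descriptions above and are not the package's internal code.

```r
# Toy document-term matrix: rows are documents, columns are terms.
m <- matrix(c(2, 1, 0,
              1, 0, 3),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("d1", "d2"), c("a", "b", "c")))

# Cosine similarity: inner product divided by the product of the vector norms.
# (hypothetical helper, not part of the package)
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Overlap percentage (assumed reading): the share of x's term scores that
# belongs to terms that also occur in y. (hypothetical helper)
overlap_pct <- function(x, y) sum(x[y > 0]) / sum(x)

cos_12 <- cosine_sim(m["d1", ], m["d2", ])
cos_21 <- cosine_sim(m["d2", ], m["d1", ])
ovl_12 <- overlap_pct(m["d1", ], m["d2", ])
ovl_21 <- overlap_pct(m["d2", ], m["d1", ])
```

In this sketch cos_12 and cos_21 are identical, while ovl_12 (2/3) and ovl_21 (1/4) differ, which is what "asymmetrical" means here.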

min.similarity

A threshold for similarity. Document pairs with a lower similarity are dropped. Defaults to 0.

n.topsim

An alternative or additional threshold for similarity. Only the [n.topsim] highest similarity scores are kept for each document in dtm. More than [n.topsim] scores can be returned if there are tied similarities.
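How the two thresholds might combine can be sketched in base R on a hand-made table of document pairs. The column names x, y and similarity, and the variables min_similarity and n_topsim, are assumptions for this illustration. The filtering logic is one possible reading of the descriptions above, not the package's implementation.

```r
# Hypothetical table of document pairs (column names are assumed).
pairs <- data.frame(
  x = c("d1", "d1", "d1", "d2", "d2"),
  y = c("d2", "d3", "d4", "d1", "d3"),
  similarity = c(0.9, 0.5, 0.5, 0.2, 0.7),
  stringsAsFactors = FALSE
)

min_similarity <- 0.3  # illustrative value
n_topsim <- 2          # illustrative value

# Step 1: drop pairs below the similarity threshold.
kept <- pairs[pairs$similarity >= min_similarity, ]

# Step 2: rank similarities within each x, highest first. Tied scores share
# the lowest rank, so ties at the cut-off are all kept.
rnk <- ave(-kept$similarity, kept$x,
           FUN = function(s) rank(s, ties.method = "min"))
kept <- kept[rnk <= n_topsim, ]
```

Here d1 keeps three pairs even though n_topsim is 2, because the second and third scores are tied.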

max_p

A threshold for the maximum p value.

pvalue

The method used to calculate the p value. If max_p < 1, edges are removed based on this p value. For each document in dtm, a p value is calculated over its outward edges. The "disparity" option is the p value based on a uniform distribution, akin to a disparity filter (see Serrano et al.), but without filtering on inward edges.
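As a rough illustration of the disparity idea, the sketch below applies the standard disparity-filter formula to the outward edges of a single document: for an edge with weight w out of k edges with total weight s, p = (1 - w/s)^(k - 1). The variable names are hypothetical, and the package's own calculation may differ in detail.

```r
# Outward edge weights (similarities) of one document; values are made up.
w <- c(0.6, 0.3, 0.1)

k <- length(w)   # number of outward edges
s <- sum(w)      # total outward weight

# Disparity-filter p value per edge under a uniform null model.
p <- (1 - w / s)^(k - 1)

max_p <- 0.2          # illustrative threshold
keep <- p <= max_p    # edges that survive the filter
```

With these values only the strongest edge has a p value below the threshold, so it is the only one kept.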

simmat

If softcosine is used, a symmetrical matrix with the similarity scores of terms. If NULL, the cosine similarity of terms in dtm will be used.

simmat_thres

If softcosine is used, a threshold for the similarity scores of terms.

Value

A data frame with pairs of documents and their similarities.

Details

Document similarity is calculated using a vector space model approach, with inner-product based similarity measures such as cosine similarity. It is recommended to weight the DTM beforehand, for instance using term frequency-inverse document frequency (tf-idf).
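The effect of weighting before comparing can be sketched in base R. This is a minimal illustration with a plain matrix and one common tf-idf variant (idf = log(N / df)). In practice the weighting would be applied to the dfm itself, for instance with quanteda's dfm_tfidf, whose defaults may differ from this sketch.

```r
# Toy document-term matrix: rows are documents, columns are terms.
m <- matrix(c(2, 1, 0, 1,
              1, 0, 3, 1,
              0, 2, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "d")))

# Inverse document frequency: terms that occur in every document get weight 0.
n_docs <- nrow(m)
doc_freq <- colSums(m > 0)
idf <- log(n_docs / doc_freq)

# Multiply each column (term) by its idf.
weighted <- sweep(m, 2, idf, `*`)

# Scale each row to unit length; the inner products of the rows are then
# the cosine similarities between documents.
unit <- weighted / sqrt(rowSums(weighted^2))
cos_mat <- tcrossprod(unit)
```

Term d occurs in all three documents, so it no longer contributes to any similarity after weighting.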

Examples


# NOT RUN {
## documents.compare is deprecated. Please use compare_documents
comp = compare_documents(rnewsflow_dfm, min_similarity=0.4)
head(comp)

# }
