document_similarities: Calculate sequence-based document similarities

Description

Calculates a number of similarity and difference statistics between two document versions based on n-gram sequence matching.

Usage

document_similarities(filenames = NULL, documents = NULL,
  input_directory = NULL, ngram_size = 10, output_directory = NULL,
  doc_pairs = NULL, cores = 1, max_block_size = NULL, prehash = FALSE,
  ngram_match_only = FALSE, document_block_size = NULL,
  add_ngram_comparisons = NULL, unigram_similarity_threshold = NULL,
  doc_lengths = NULL)

Arguments

filenames

An optional character vector of filenames (with .txt extension), one per document. One of filenames and documents must be provided.

documents

An optional character vector of documents with one entry per document. One of filenames and documents must be provided.

input_directory

If filenames are provided, then a valid path to the directory containing the document text files must also be provided.

ngram_size

The length of n-grams on which to base comparisons. Defaults to 10, but 5 may be appropriate if stopwords have been removed.

output_directory

An optional directory where chunks of results will be saved in the form: Document_Similarity_Results_x.RData. If NULL, then a data.frame is returned from the function.

doc_pairs

An optional two-column matrix indicating the document indices in filenames or documents to be compared in each comparison. By default this is generated automatically to include all pairs, but it can be user-specified if only a subset of pairs is desired. If providing filenames, it is also possible to use the filenames to generate this matrix, but this will be slower.
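As an illustration of the expected shape, the following base R sketch builds a two-column index matrix and then keeps a subset of it. The corpus size and the choice of unordered pairs are assumptions made for the example; the matrix the function generates by default may be ordered differently.

```r
# Hypothetical corpus of four documents; only the count matters here.
n_docs <- 4

# Every unordered pair of document indices, one pair per row.
all_pairs <- t(combn(n_docs, 2))

# Keep only the comparisons that involve document 1.
subset_pairs <- all_pairs[all_pairs[, 1] == 1, , drop = FALSE]
```

A matrix such as subset_pairs could then be passed as doc_pairs to restrict the comparisons.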

cores

The number of cores to be used for parallelization. Defaults to 1 but can be any number less than or equal to the number of logical cores available on your computer.

max_block_size

Defaults to NULL, but can be set to an integer value indicating the maximum number of pairs to be compared in each parallel process. Can be useful to limit the intermediate data.frame sizes. A maximum of 10-50 million is suggested.

prehash

Logical, defaults to FALSE. If TRUE, then a pre-hashing scheme is used, which may greatly speed up computation but may also dramatically increase memory usage.

ngram_match_only

Defaults to FALSE. If TRUE, then only the proportion of n-grams in version a that are also present in version b, and vice versa, is calculated. Can be a useful first step when searching for near-exact matches.
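To make the two proportions concrete, here is a minimal base R sketch. The helpers word_ngrams and prop_in are hypothetical, tokenization by whitespace is an assumption, and this is not the function's internal implementation, which may count or tokenize differently.

```r
# Hypothetical helper: all word n-grams of a text, split on whitespace.
word_ngrams <- function(text, n) {
  tokens <- strsplit(text, "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: share of the n-grams in `a` that also occur in `b`.
prop_in <- function(a, b) {
  if (length(a) == 0) return(NA_real_)
  mean(a %in% b)
}

version_a <- "the quick brown fox jumps over the lazy dog"
version_b <- "the quick brown fox leaps over the lazy dog"

# Trigrams are used only to keep the example short.
grams_a <- word_ngrams(version_a, 3)
grams_b <- word_ngrams(version_b, 3)

a_in_b <- prop_in(grams_a, grams_b)  # share of a's n-grams found in b
b_in_a <- prop_in(grams_b, grams_a)  # share of b's n-grams found in a
```

Changing one word in the middle of a nine-word sentence breaks three of its seven trigrams, so both proportions here are 4/7.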

document_block_size

Overrides other arguments and breaks the documents up into chunks of size `document_block_size` to be compared. This argument is suggested if a large number of comparisons are to be completed. By only comparing subsets of documents at a time, the process is memory-optimized so that very large sets of documents can be compared. Automatically sets `prehash = TRUE` and `doc_pairs = NULL`. Setting `output_directory` is also recommended, because with a large number of comparisons the resulting data.frame is likely to be too large to hold in memory. Defaults to NULL.

add_ngram_comparisons

Defaults to NULL, but can optionally be a numeric vector containing n-gram sizes on which to compare documents. If this argument is provided, then a_in_b and b_in_a comparisons will be appended to the output for each n-gram size.

unigram_similarity_threshold

Defaults to NULL. If not NULL, must be a number greater than 0 and less than 1. This argument first filters potential document comparisons to those where at least one of the documents contains more than a unigram_similarity_threshold proportion of the unigrams in the other version. For example, if this argument were set to 0.8, then only those document pairs with a unigram similarity above 0.8 would be given a full comparison. This approach is particularly useful when looking for very similar documents, such as hitchhiker bills.
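The filter can be pictured with the following base R sketch. The helper unigram_share is hypothetical, and treating unigrams as unique whitespace-separated words is an assumption, so the function's own calculation may differ in detail.

```r
# Hypothetical helper: share of unique words in `a` that also occur in `b`.
unigram_share <- function(a, b) {
  words_a <- unique(strsplit(a, "\\s+")[[1]])
  words_b <- unique(strsplit(b, "\\s+")[[1]])
  mean(words_a %in% words_b)
}

doc_1 <- "a bill to amend the tax code"
doc_2 <- "a bill to amend the tax code and for other purposes"
threshold <- 0.8

# A pair passes the filter if either direction exceeds the threshold.
passes <- max(unigram_share(doc_1, doc_2),
              unigram_share(doc_2, doc_1)) > threshold
```

Here every word of doc_1 appears in doc_2, so the pair passes even though only 7 of the 11 words in doc_2 appear in doc_1.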

doc_lengths

Defaults to NULL. If not NULL, then this must be a numeric vector of length equal to the number of input documents, giving the number of tokens in each.
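One way to build such a vector is sketched below in base R. Counting tokens by splitting on whitespace is an assumption; if the documents were tokenized differently upstream, the counts should come from that tokenization instead.

```r
# Hypothetical input documents, one entry per document.
documents <- c("one two three", "four five", "six seven eight nine")

# Number of whitespace-separated tokens in each document.
doc_lengths <- lengths(strsplit(documents, "\\s+"))
```

The result has one entry per document, in the same order as documents.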

Value

A data.frame of comparison results, or NULL if output_directory is not NULL (in which case the results are saved to that directory instead).
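A minimal call might look like the sketch below. The example documents and the choice of ngram_size = 3 are assumptions made for illustration, and the call is guarded so that it only runs when document_similarities is available in the current session.

```r
# Hypothetical short documents, supplied directly rather than as files.
docs <- c(
  "the committee shall report the bill to the house without amendment",
  "the committee shall report the bill to the senate without amendment",
  "nothing in this section applies to agreements signed before enactment"
)

# Only run the comparison if the function is available in this session.
if (exists("document_similarities", mode = "function")) {
  # A small ngram_size is used because the example documents are short.
  results <- document_similarities(documents = docs, ngram_size = 3)
  # output_directory was left NULL, so a data.frame is expected back.
  str(results)
}
```

Because neither doc_pairs nor output_directory is given, all pairs should be compared and the results returned directly.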