Calculates a number of similarity and difference statistics between two document versions based on n-gram sequence matching
document_similarities(filenames = NULL, documents = NULL,
input_directory = NULL, ngram_size = 10, output_directory = NULL,
doc_pairs = NULL, cores = 1, max_block_size = NULL, prehash = FALSE,
ngram_match_only = FALSE, document_block_size = NULL,
add_ngram_comparisons = NULL, unigram_similarity_threshold = NULL,
doc_lengths = NULL)
An optional character vector of filenames (with .txt extension), one per document. One of filenames and documents must be provided.
An optional character vector of documents with one entry per document. One of filenames and documents must be provided.
If filenames are provided, then a valid path to the directory where the document text files are located must also be provided.
The length of n-grams on which to base comparisons. Defaults to 10, but 5 may be appropriate if stopwords have been removed.
An optional directory where chunks of results will be saved as files named Document_Similarity_Results_x.RData. If NULL, then a data.frame is returned from the function.
An optional two-column matrix giving the indices (in filenames or documents) of the two documents to be compared in each comparison. By default it is generated automatically to include all pairs, but it can be user-specified if only a subset of pairs is desired. If providing filenames, it is also possible to use the filenames to generate this matrix, but this will be slower.
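As an illustration of the two-column index matrix described above, the following base R sketch builds an all-pairs matrix and then a subset of it. It assumes that "all pairs" means every unordered pair of document indices; the object names (`all_pairs`, `subset_pairs`) are hypothetical and the function's internal ordering may differ.

```r
# Four example documents (contents are placeholders).
documents <- c("first text", "second text", "third text", "fourth text")

# Every unordered pair of document indices, one pair per row.
# Assumption: this mirrors what is generated when doc_pairs is left NULL.
all_pairs <- t(combn(length(documents), 2))

# A user-specified subset: only comparisons that involve document 1.
subset_pairs <- all_pairs[all_pairs[, 1] == 1, , drop = FALSE]
```

A matrix such as `subset_pairs` is the kind of object that would be passed as `doc_pairs`.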
The number of cores to be used for parallelization. Defaults to 1 but can be any number less than or equal to the number of logical cores available on your computer.
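One way to pick a value that respects the limit described above is to cap a requested number of cores at the number of logical cores reported by the `parallel` package. This is only a sketch; `requested` is a hypothetical variable.

```r
# Number of logical cores on this machine (may be NA on some platforms).
available <- parallel::detectCores(logical = TRUE)

# Hypothetical number of cores the user would like to use.
requested <- 4L

# Fall back to a single core if the count is unknown, otherwise cap it.
cores <- if (is.na(available)) 1L else min(requested, available)
```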
Defaults to NULL, but can be set to an integer value indicating the maximum number of pairs to be compared in each parallel process. Can be useful to limit the intermediate data.frame sizes. A maximum of 10-50 million is suggested.
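The effect of capping the number of pairs per block can be pictured with a small base R sketch. It is an assumption about the general idea only, not the function's actual implementation, and it uses a tiny block size so the result is easy to inspect.

```r
# All unordered pairs among six documents: 15 comparisons.
n_docs <- 6
pairs <- t(combn(n_docs, 2))

# Illustrative cap; real values would be in the millions.
max_block_size <- 4

# Assign each row of pairs to a block of at most max_block_size rows.
block_id <- ceiling(seq_len(nrow(pairs)) / max_block_size)
blocks <- lapply(
  split(seq_len(nrow(pairs)), block_id),
  function(rows) pairs[rows, , drop = FALSE]
)

# Number of comparisons in each block.
block_sizes <- vapply(blocks, nrow, integer(1))
```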
Logical which defaults to FALSE. If TRUE, then a pre-hashing scheme is used which may greatly speed up computation but dramatically increase memory usage as well.
Defaults to FALSE. If TRUE, then only the proportion of n-grams in version a that are also present in version b, and vice versa, is calculated. Can be a useful first step when searching for near-exact matches.
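The two proportions described above can be sketched in base R as follows. This is a minimal illustration that assumes whitespace tokenization and counts every n-gram occurrence; the helper names `make_ngrams` and `prop_in` are hypothetical, and the function itself may tokenize or count differently.

```r
# Build the word n-grams of a whitespace-tokenized document.
make_ngrams <- function(text, n) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Proportion of the n-grams in x that also occur in y.
prop_in <- function(x, y) {
  if (length(x) == 0) return(NA_real_)
  mean(x %in% y)
}

version_a <- "the quick brown fox jumps over the lazy dog"
version_b <- "the quick brown fox leaps over the lazy dog"

# Trigrams are used here only because the example texts are short.
ngrams_a <- make_ngrams(version_a, 3)
ngrams_b <- make_ngrams(version_b, 3)

a_in_b <- prop_in(ngrams_a, ngrams_b)  # 4 of the 7 trigrams in a occur in b
b_in_a <- prop_in(ngrams_b, ngrams_a)  # 4 of the 7 trigrams in b occur in a
```

Changing a single word removes every n-gram that spans it, which is why longer n-grams make this a strict test for near-exact matches.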
Overrides other arguments. Breaks the documents up into chunks of `document_block_size` documents to be compared. This argument is suggested if a large number of comparisons are to be completed. By only comparing subsets of documents at a time, the process is memory-optimized so that very large sets of documents can be compared. Automatically sets `prehash = TRUE` and `doc_pairs = NULL`. Setting `output_directory` is also suggested, because with this many comparisons the resulting data.frame is likely to be too large to hold in memory. Defaults to NULL.
Defaults to NULL, but can optionally be a numeric vector containing n-gram sizes on which to compare documents. If this argument is provided, then a_in_b and b_in_a comparisons will be appended to the output for each n-gram size.
Defaults to NULL. If not NULL, can be any number greater than 0 and less than 1. This argument allows the user to first filter potential document comparisons to those where at least one of the documents contains more than unigram_similarity_threshold proportion of the unigrams in the other version. For example, if this argument were set to 0.8, then only those document pairs with a unigram similarity above 0.8 would be given a full comparison. This approach is particularly useful if one is looking for very similar documents, such as hitchhiker bills.
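The pre-filter described above can be sketched in base R as below. The sketch assumes whitespace tokenization and measures overlap on unique unigrams; both are assumptions, and the helper names `unigram_overlap` and `passes_filter` are hypothetical.

```r
# Proportion of the unique unigrams of each document found in the other.
unigram_overlap <- function(a, b) {
  ua <- unique(strsplit(trimws(a), "\\s+")[[1]])
  ub <- unique(strsplit(trimws(b), "\\s+")[[1]])
  c(a_in_b = mean(ua %in% ub), b_in_a = mean(ub %in% ua))
}

# A pair gets a full comparison if either direction exceeds the threshold.
passes_filter <- function(a, b, threshold) {
  any(unigram_overlap(a, b) > threshold)
}

version_a <- "the quick brown fox jumps over the lazy dog"
version_b <- "the quick brown fox leaps over the lazy dog"
unrelated <- "completely unrelated text"

keep_ab <- passes_filter(version_a, version_b, 0.8)  # 7 of 8 unigrams shared
keep_au <- passes_filter(version_a, unrelated, 0.8)  # no unigrams shared
```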
Defaults to NULL. If not NULL, then this must be a numeric vector of length equal to the number of input documents, giving the number of tokens in each.
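A vector of this shape could be computed in base R as shown below, assuming that tokens are separated by whitespace. The tokenization used inside the function may differ, so counts produced this way are only illustrative.

```r
documents <- c("the quick brown fox jumps over the lazy dog",
               "  a short   document ")

# Split each document on runs of whitespace after trimming the ends.
tokens <- strsplit(trimws(documents), "\\s+")

# One token count per input document, in the same order.
doc_lengths <- lengths(tokens)
```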
A data.frame of results, or NULL if output_directory is not NULL (in which case the results are saved to that directory instead).