delete.duplicates: Delete duplicate (or similar) documents from a document term matrix

Description

This function is deprecated and will be removed at some point. Use delete_duplicates instead.

Usage

delete.duplicates(
  dtm,
  meta = NULL,
  date.var = "date",
  hour.window = c(-24, 24),
  group.var = NULL,
  measure = c("cosine", "overlap_pct"),
  similarity = 1,
  keep = "first",
  tf.idf = FALSE,
  dup_csv = NULL,
  verbose = F
)

Value

A dtm with the duplicate documents deleted

Arguments

dtm: A quanteda dfm. Alternatively, a DocumentTermMatrix from the tm package can be used, but then the meta parameter needs to be specified manually
meta: If dtm is a quanteda dfm and meta is NULL (the default), the meta data is obtained with docvars(dtm). Otherwise, the user has to supply the meta data.frame, with its rows matching the rows of the dtm (i.e. each row is a document)
date.var: The name of the column in meta that specifies the document date. Default is "date". The values should be of type POSIXlt or POSIXct
hour.window: A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.
group.var: Optionally, the name of the column in meta that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.
measure: The measure used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity) and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document).
similarity: A threshold for similarity. Documents with a similarity equal to or higher than this threshold are deleted
keep: A character string indicating whether to keep the 'first' or 'last' published document of a set of duplicates.
tf.idf: if TRUE, weight the dtm with tf.idf before comparing documents. The original (non-weighted) DTM is returned.
dup_csv: Optionally, a path for writing a csv file with the duplicates edgelist. For each duplicate pair it is noted if "from" or "to" is the duplicate, or if "both" are duplicates (of other documents)
verbose: if TRUE, report progress
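
The difference between the two measure options can be illustrated with a small base R sketch. This is not the package's implementation: the helper names cosine_sim and overlap_pct, the toy term scores, and the exact formula used for the overlap (here expressed as a proportion between 0 and 1) are assumptions based on the argument descriptions above.

```r
# Illustrative base R only; not the RNewsflow implementation.
# Term scores for two toy documents over the same five-term vocabulary.
doc_a <- c(apple = 2, banana = 1, cherry = 0, date = 0, elder = 1)
doc_b <- c(apple = 2, banana = 1, cherry = 3, date = 1, elder = 0)

# Cosine similarity: symmetrical, so the argument order does not matter.
cosine_sim <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Overlap (assumed reading of the description): the share of x's total
# term score that falls on terms which also occur in y. Asymmetrical.
overlap_pct <- function(x, y) {
  sum(x[y > 0]) / sum(x)
}

cos_ab <- cosine_sim(doc_a, doc_b)
cos_ba <- cosine_sim(doc_b, doc_a)   # same value as cos_ab
ovl_ab <- overlap_pct(doc_a, doc_b)  # 3 of doc_a's 4 score points: 0.75
ovl_ba <- overlap_pct(doc_b, doc_a)  # 3 of doc_b's 7 score points
```

With an asymmetrical measure, a short document that is fully contained in a longer one can score high in one direction and low in the other, which is worth keeping in mind when choosing the similarity threshold.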

Details

Delete duplicate (or similar) documents from a document term matrix. Duplicates are defined by three criteria: having high content similarity, occurring within a given time distance, and being published by the same source.

Note that this can also be used to delete "updates" of articles (e.g., on news sites, news agencies). This should be considered if the temporal order of publications is relevant for the analysis.
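
The duplicate definition above (similar content, within the time window, same group) can be sketched in base R as follows. This is a hypothetical illustration of the logic only, not the algorithm the package uses: the toy data, the column name source, the 0.9 threshold, and the rule for keeping the first published document are all assumptions made for the example.

```r
# Illustrative base R only; a hypothetical sketch, not the package's algorithm.
meta <- data.frame(
  id     = c("a1", "a2", "b1", "a3"),
  source = c("paper_A", "paper_A", "paper_B", "paper_A"),
  date   = as.POSIXct(c("2020-01-01 08:00:00", "2020-01-01 20:00:00",
                        "2020-01-01 09:00:00", "2020-01-05 08:00:00"),
                      tz = "UTC"),
  stringsAsFactors = FALSE
)

# Toy document-by-term matrix; all four documents have identical content.
m <- matrix(c(1, 1, 0,
              1, 1, 0,
              1, 1, 0,
              1, 1, 0), nrow = 4, byrow = TRUE,
            dimnames = list(meta$id, c("t1", "t2", "t3")))

cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

hour_window <- c(-24, 24)  # compare with documents up to 24 hours before/after
threshold   <- 0.9         # assumed similarity threshold for this example
is_duplicate <- rep(FALSE, nrow(m))

for (i in seq_len(nrow(m))) {
  for (j in seq_len(nrow(m))) {
    if (i == j) next
    # Hours from document i to document j (negative: j was published earlier).
    hours <- as.numeric(difftime(meta$date[j], meta$date[i], units = "hours"))
    in_window  <- hours >= hour_window[1] && hours <= hour_window[2]
    same_group <- meta$source[i] == meta$source[j]
    similar    <- cosine_sim(m[i, ], m[j, ]) >= threshold
    # Keep the first published: flag i only if a matching j is earlier.
    if (in_window && same_group && similar && hours < 0) is_duplicate[i] <- TRUE
  }
}

kept <- meta$id[!is_duplicate]
```

Here a2 is flagged because it matches a1 from the same source 12 hours earlier, b1 is kept because it belongs to a different source, and a3 is kept because it falls outside the 24-hour window. Documents with exactly equal timestamps are not handled by this sketch.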

Examples


## example with very low similarity threshold (normally not recommended!)

## delete.duplicates is deprecated. Please use delete_duplicates
dtm2 = delete_duplicates(rnewsflow_dfm, similarity = 0.5, keep='first', tf_idf = TRUE)
