Delete duplicate (or similar) documents from a document-term matrix. Duplicates are defined by having high content similarity, occurring within a given time distance of each other, and being published by the same source.
delete_duplicates(
dtm,
date_var = NULL,
hour_window = c(-24, 24),
group_var = NULL,
measure = c("cosine", "overlap_pct"),
similarity = 1,
keep = "first",
tf_idf = FALSE,
dup_csv = NULL,
verbose = F
)
Returns a dtm with the duplicate documents deleted.
dtm: A quanteda dfm.
date_var: The name of the column in docvars(dtm) that specifies the document date. The values should be of type POSIXlt or POSIXct.
hour_window: A vector of length 2, in which the first and second values determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents published between the previous 10 and the next 36 hours.
group_var: Optionally, a column name in docvars(dtm) that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.
measure: The measure used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity) and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document).
similarity: A threshold for similarity. Documents whose similarity is equal to or higher than this threshold are deleted.
keep: A character value indicating whether to keep the 'first' or 'last' published of the duplicate documents.
tf_idf: If TRUE, weight the dtm with tf_idf before comparing documents. The original (non-weighted) DTM is returned.
dup_csv: Optionally, a path for writing a csv file with the duplicates edgelist. For each duplicate pair it is noted whether "from" or "to" is the duplicate, or whether "both" are duplicates (of other documents).
verbose: If TRUE, report progress.
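To make the difference between the two measures concrete, here is a minimal base R sketch on two toy term-score vectors. The helper functions cosine_sim and overlap_share are hypothetical illustrations written for this example, not the package's internal code, and it is an assumption that the overlap is expressed as a proportion between 0 and 1.

```r
# Toy term-score vectors for two documents over the same vocabulary
doc_a <- c(election = 2, vote = 1, economy = 0, tax = 1)
doc_b <- c(election = 2, vote = 1, economy = 3, tax = 0)

# Cosine similarity: symmetrical, so the order of the documents does not matter
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Overlap: share of x's total term score that falls on terms also present in y.
# This is asymmetrical, because the denominator depends on x only.
overlap_share <- function(x, y) {
  sum(x[y > 0]) / sum(x)
}

cosine_sim(doc_a, doc_b)     # same value as cosine_sim(doc_b, doc_a)
overlap_share(doc_a, doc_b)  # 0.75: 3 of doc_a's 4 score points are on shared terms
overlap_share(doc_b, doc_a)  # 0.5:  3 of doc_b's 6 score points are on shared terms
```

With an asymmetrical measure, a short document can overlap strongly with a longer one while the reverse comparison stays below the similarity threshold.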
Note that this can also be used to delete "updates" of articles (e.g., on news sites, news agencies). This should be considered if the temporal order of publications is relevant for the analysis.
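The interplay of hour_window and keep can be sketched in base R as follows. This is only an illustration under stated assumptions: the publication times are made up, and the selection logic is a simplified stand-in for what the function does, not its actual implementation.

```r
# Hypothetical publication times for four documents
dates <- as.POSIXct(c("2024-01-01 08:00", "2024-01-01 20:00",
                      "2024-01-02 12:00", "2024-01-03 09:00"), tz = "UTC")
hours <- as.numeric(dates) / 3600
hour_window <- c(-24, 24)

# Hours from document i (rows) to document j (columns)
hour_diff <- outer(hours, hours, function(from, to) to - from)

# Only pairs inside the window are candidates for comparison
in_window <- hour_diff >= hour_window[1] & hour_diff <= hour_window[2]
diag(in_window) <- FALSE  # a document is not compared to itself

# Suppose documents 1 and 2 exceed the similarity threshold (assumed here)
dup_pair <- c(1, 2)
keep <- "first"
drop <- if (keep == "first") {
  dup_pair[which.max(hours[dup_pair])]  # drop the later document
} else {
  dup_pair[which.min(hours[dup_pair])]  # drop the earlier document
}
drop
```

In this toy setup, keep = "first" retains the original publication and removes the later update, which matters when the analysis depends on who published first.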
## example with very low similarity threshold (normally not recommended!)
dtm2 = delete_duplicates(rnewsflow_dfm, similarity = 0.5, keep='first', tf_idf = TRUE)
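A stricter call that also uses the date, group and csv arguments might look like the sketch below. It assumes that rnewsflow_dfm has docvars named "date" (POSIXct) and "source"; the threshold of 0.95 and the temporary file path are illustrative choices, not recommendations from the package.

```r
library(RNewsflow)

# Assumption: docvars(rnewsflow_dfm) contains "date" and "source" columns
dup_file <- tempfile(fileext = ".csv")  # hypothetical output path for the edgelist

dtm3 <- delete_duplicates(rnewsflow_dfm,
                          date_var = "date",
                          hour_window = c(-24, 24),
                          group_var = "source",
                          measure = "cosine",
                          similarity = 0.95,
                          keep = "first",
                          dup_csv = dup_file)

# Number of documents removed as duplicates
quanteda::ndoc(rnewsflow_dfm) - quanteda::ndoc(dtm3)
```

Inspecting the csv written to dup_file before discarding documents is a reasonable way to check whether the chosen threshold flags the pairs you expect.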