newsflow.compare: Compare the documents in a dtm with a sliding window over time

Description

Given a document-term matrix (DTM) with a date for each document, calculates document similarities over time using a sliding window.

Usage

newsflow.compare(
  dtm,
  dtm.y = NULL,
  meta = NULL,
  meta.y = NULL,
  date.var = "date",
  hour.window = c(-24, 24),
  group.var = NULL,
  measure = c("cosine", "overlap_pct", "overlap", "crossprod", "softcosine",
    "query_lookup", "query_lookup_pct"),
  min.similarity = 0,
  n.topsim = NULL,
  only.from = NULL,
  only.to = NULL,
  only.complete.window = TRUE,
  pvalue = c("disparity", "normal", "lognormal", "nz_normal", "nz_lognormal"),
  max_p = 1,
  return_as = c("igraph", "edgelist", "matrix"),
  batchsize = 1000,
  simmat = NULL,
  simmat_thres = NULL,
  margin_attr = T,
  verbose = FALSE
)

Value

A network/graph in the igraph class, or an edgelist data.frame, or a sparse matrix.

Arguments

dtm: A quanteda dfm. Alternatively, a DocumentTermMatrix from the tm package can be used, but then the meta parameter needs to be specified manually
dtm.y: Optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm.y. This cannot be combined with only.from and only.to
meta: If dtm is a quanteda dfm, docvars(dtm) is used by default (i.e. when meta is NULL) to obtain the meta data. Otherwise, the user has to provide the meta data.frame, with its rows matching the rows of the dtm (i.e. each row is a document).
meta.y: Like meta, but for dtm.y (only necessary if dtm.y is used)
date.var: The name of the column in meta that specifies the document date. Default is "date". The values should be of type POSIXct.
hour.window: A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.
group.var: Optionally, the name of the column in meta that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared.
measure: The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical "softcosine" measure (experimental). The regular "crossprod" (inner product) is also supported. If the DTMs are prepared with the create_queries function, the special "query_lookup" and "query_lookup_pct" measures can be used.
min.similarity: A threshold for similarity. Lower values are deleted. Set to 0 by default.
n.topsim: An alternative or additional sort of threshold for similarity. Only keep the [n.topsim] highest similarity scores for x. Can return more than [n.topsim] similarity scores in the case of duplicate similarities.
only.from: A vector with names/ids of documents (dtm rownames), or a logical vector that matches the rows of the dtm. Use to compare only these documents to other documents.
only.to: A vector with names/ids of documents (dtm rownames), or a logical vector that matches the rows of the dtm. Use to compare other documents to only these documents.
only.complete.window: If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, there will be no results for x at the start and end of the date range, where the window given by hour.window is not fully covered.
pvalue: If max_p < 1, edges are removed based on a p value. For each document in dtm, a p value is calculated over its outward edges. Default is the p-value based on uniform distribution, akin to a "disparity" filter (see Serrano et al.) but without filtering on inward edges.
max_p: A threshold for the maximum p value.
return_as: Determine whether output is returned as an "edgelist", an "igraph" network or a sparse "matrix".
batchsize: If group and/or date are used, size of batches.
simmat: If softcosine is used, a symmetrical matrix with the similarity scores of terms. If NULL, the cosine similarity of terms in dtm will be used
simmat_thres: If softcosine is used, a threshold for the similarity scores of terms.
margin_attr: By default, margin attributes are added to meta (see details). This can be turned off for slightly faster computation and less memory usage.
verbose: If TRUE, report progress
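
To make the hour.window argument more concrete, the following base R sketch shows which documents would fall inside the window of a given document. It does not reproduce the package's implementation: the dates are made up, in_window is a hypothetical helper, and treating the window boundaries as inclusive and skipping the document itself are assumptions made for illustration.

```r
# Hypothetical document dates (POSIXct, as required for date.var)
dates <- as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 12:00:00",
                      "2020-01-02 06:00:00", "2020-01-02 23:00:00"),
                    tz = "UTC")

# Illustrative helper (not part of RNewsflow): for document i, return the
# indices of the other documents inside hour.window = c(left, right)
in_window <- function(i, dates, hour.window) {
  diff_hours <- as.numeric(difftime(dates, dates[i], units = "hours"))
  inside <- diff_hours >= hour.window[1] & diff_hours <= hour.window[2]
  inside[i] <- FALSE  # assumption: skip the document itself
  which(inside)
}

# Document 2 compared to everything from 10 hours before to 36 hours after
in_window(2, dates, c(-10, 36))
```

Here document 1 is 12 hours older than document 2 and therefore falls outside the left side of the window, while documents 3 and 4 (18 and 35 hours later) fall inside it.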

Details

The calculation of document similarity is performed using a vector space model approach. Inner-product based similarity measures are used, such as cosine similarity. It is recommended to weight the DTM beforehand, for instance using term frequency-inverse document frequency (tf-idf).
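
As a rough illustration of an inner-product based measure, the base R sketch below computes cosine similarity for two toy documents. The term counts and the cosine helper are invented for this example, weighting is skipped for brevity, and the code is not how the package computes similarities internally.

```r
# Two toy documents as rows of a small term matrix (hypothetical counts)
m <- rbind(doc1 = c(2, 0, 1, 0),
           doc2 = c(1, 1, 0, 0))

# Cosine similarity: inner product divided by the product of the vector norms
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cosine(m["doc1", ], m["doc2", ])

# Equivalent matrix form: scale each row to unit length, then take the
# cross product of the rows to get all pairwise similarities at once
norm_m <- m / sqrt(rowSums(m^2))
tcrossprod(norm_m)
```

The matrix form is why these measures are called inner-product based: after normalizing the rows, a single cross product yields every pairwise similarity.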

Meta data is included in the output. Margin attributes can also be added to meta with the margin_attr argument, as described further below.

For the "igraph" output the meta data is stored as vertex attributes; for the "matrix" output as the attributes "row_meta" and "col_meta"; for the "edgelist" output as the attributes "from_meta" and "to_meta". Note that attributes are removed if you perform certain operations on a matrix or data.frame, so if you want to use this information it is best to assign it immediately.
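
The advice to assign the meta data immediately can be sketched with a stand-in object. The example below uses a plain base R matrix rather than the sparse matrix that the function returns, and the document_id column is a made-up placeholder, so it only illustrates how attributes behave.

```r
# Stand-in for a "matrix" result: a plain matrix carrying a row_meta attribute
m <- matrix(c(0, 0.4, 0.4, 0), nrow = 2,
            dimnames = list(c("a", "b"), c("a", "b")))
attr(m, "row_meta") <- data.frame(document_id = c("a", "b"))

# Store the meta data right away ...
row_meta <- attr(m, "row_meta")

# ... because ordinary subsetting of a base matrix drops custom attributes
m_sub <- m[1, , drop = FALSE]
is.null(attr(m_sub, "row_meta"))
```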

Margin attributes can be added to the meta data with the margin_attr argument. The reason for including this is that some values that are normally available in a similarity matrix are missing if certain filter options are used. If group or date is used, we don't know how many columns a row has been compared to (normally this is all columns). If a min/max or top_n filter is used, we don't know the true row sums (and thus row means). margin_attr adds the "row_n", "row_sum", "col_n", and "col_sum" data to the meta data. In addition, there are "lag_n" and "lag_sum". This is a special case where row_n and row_sum are calculated only for matches where the column date < row date (lag). This can be used for more refined calculations of edge probabilities before (lag_n) and after (row_n - lag_n) a row document, which is for instance useful for event matching.
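
The meaning of these margin values can be sketched in base R on a toy similarity matrix. The matrix, the dates, and the use of NA for pairs that were never compared are assumptions for illustration only. The package computes these values internally during the comparison.

```r
# Toy similarity matrix: rows are documents x, columns are documents y.
# NA marks pairs that were never compared (an assumption of this sketch).
sim <- rbind(a = c(NA, 0.5, 0.2),
             b = c(0.5, NA, 0.0),
             c = c(0.2, 0.0, NA))
colnames(sim) <- rownames(sim)

# Hypothetical document dates, in hours since some origin
date <- c(a = 0, b = 5, c = 9)

compared <- !is.na(sim)
row_n   <- rowSums(compared)           # columns each row was compared to
row_sum <- rowSums(sim, na.rm = TRUE)  # sum of similarities per row

# Lag: only matches where the column date is earlier than the row date
is_lag  <- outer(date, date, function(row_date, col_date) col_date < row_date)
lag_n   <- rowSums(compared & is_lag)
lag_sum <- rowSums(sim * is_lag, na.rm = TRUE)

data.frame(row_n, row_sum, lag_n, lag_sum)
```

In this sketch, row_n - lag_n gives the number of compared documents that are not older than the row document, which corresponds to the "after" side mentioned above.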

Examples

library(RNewsflow)

## weight the example DTM with tf-idf before comparing
dtm = quanteda::dfm_tfidf(rnewsflow_dfm)

## newsflow.compare is deprecated. Please use newsflow_compare()
g = newsflow_compare(dtm, hour_window = c(0.1, 36))

igraph::vcount(g) # number of documents, or vertices
igraph::ecount(g) # number of document pairs, or edges

head(igraph::get.data.frame(g, 'vertices'))
head(igraph::get.data.frame(g, 'edges'))
