create_queries: Automatically infer queries from combinations of terms in a dtm

Description

This function was designed for the task of matching short event descriptions to news articles, but can more generally be used for document matching tasks. Note, however, that memory use grows steeply with the number of unique terms in the dtm (if all terms are combined, the number of columns increases quadratically), which makes it less suitable for matching larger documents. This only applies to the dtm, not the ref_dtm. Thus, if your goal is to match smaller documents, such as event descriptions, to news articles, this function might be useful.

Usage

create_queries(
  dtm,
  ref_dtm = NULL,
  min_docfreq = 2,
  max_docprob = 0.01,
  weight = c("tfidf", "tfidf_sq", "binary"),
  norm_weight = c("max", "doc_max", "dtm_max", "none"),
  min_obs_exp = NA,
  union_sim_thres = NA,
  combine_all = T,
  only_dtm_combs = T,
  use_dtm_and_ref = T,
  verbose = F
)

Arguments

dtm

A quanteda dfm

ref_dtm

Optionally, another quanteda dfm. If given, the ref_dtm will be used to calculate the docfreq/docprob scores.

min_docfreq

The minimum document frequency for terms or combinations of terms

max_docprob

The maximum probability (document frequency / N) for terms or combinations of terms

weight

Determines how to weight the queries (if ref_dtm is used, the idf of the ref_dtm is used). Default is "tfidf", which uses common tf-idf weighting (actually just idf, since scores are binary). "binary" only records whether a term does or does not occur. The ref_dtm will always be binary. "tfidf_sq" uses the squared tfidf. This weights even more heavily by idf, and makes sense because the query_lookup measure only counts the occurrences in the query_dtm (if both the query_dtm and ref_dtm were weighted and a crossprod-based similarity measure were used, the term weights would also be multiplied).

norm_weight

Normalize the weight scores so that the highest value is 1. "max" divides by the highest possible value, "doc_max" by the highest value within each document, and "dtm_max" by the highest observed value in the dtm. "none" applies no normalization.

min_obs_exp

The minimum ratio of the observed and expected frequency of a term combination

union_sim_thres

If given, a number between 0 and 1, used as the cosine similarity threshold for combining clusters of terms

combine_all

If TRUE (default), combine all terms. If FALSE, terms that are included as unigrams (i.e. that are within the min_docfreq and max_docprob) are not combined with other terms.

only_dtm_combs

Only include term combinations that occur in dtm. This makes sense (and saves a lot of memory) if you are only interested in asymmetric similarity measures based on the query.

use_dtm_and_ref

If a ref_dtm is used, both the dtm and ref_dtm are used to compute the docfreq and docprob values used for filtering and weighting. If use_dtm_and_ref is set to FALSE, only the ref_dtm is used.

verbose

If TRUE, report progress

Value

A list with a query_dtm and a ref_dtm. Designed for use in compare_documents with the special `query_lookup` measure.

Details

The main purpose of the function is to intersect the terms in a dtm in order to increase sparsity. This can improve certain document matching tasks, but at the cost of creating a bigger dtm. If all terms were combined, the number of columns would increase quadratically. However, only term combinations that occur in dtm (not ref_dtm) will be used. This is not a problem as long as the similarity of the documents in dtm to the documents in dtm_y is calculated with an asymmetric similarity measure (i.e. one in which the sum of terms in dtm_y is not used).
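The idea of intersecting terms can be illustrated with a small base R sketch. This is not the package's implementation; the toy matrix `m` and the `" & "` column naming are made up for illustration, and only pairs of terms are considered.

```r
# Toy binary document-term matrix: 3 documents, 4 terms (made-up data).
m <- matrix(c(1, 1, 0, 1,
              1, 0, 1, 0,
              0, 1, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("a", "b", "c", "d")))

# All unordered term pairs: choose(4, 2) = 6 candidate columns,
# which is where the quadratic growth comes from.
pairs <- combn(colnames(m), 2)

# A combination column is 1 only where both terms occur (the intersection).
comb <- apply(pairs, 2, function(p) m[, p[1]] * m[, p[2]])
colnames(comb) <- apply(pairs, 2, paste, collapse = " & ")

# Keep only the combinations that actually occur in the toy dtm.
comb <- comb[, colSums(comb) > 0, drop = FALSE]
```

Each combination column is at most as dense as its least frequent term, so the resulting matrix is sparser per column, but it has more columns.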

To emphasize that this feature preparation step is geared towards the task of 'looking up' documents, we use the terminology of a 'query'. The output of the function is a list of two dtms: query_dtm and ref_dtm. Both dtms have the exact same columns, which contain the query terms. The values in query_dtm are by default tfidf weighted, and the values in ref_dtm are binary.

The special `query_lookup` measure in the compare_documents function can be used to perform the lookup. Note that a more common approach is to weight both the queries and documents and then match queries to documents with cosine similarity. However, for event matching we only want to see whether a query 'sufficiently' matches a document. The query_lookup measure calculates a query->document weight as the sum of the weights of the query terms that occur in the document.
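The lookup score described above can be sketched in base R as a matrix product between a weighted query matrix and a binary reference matrix. The matrices `query` and `ref` and their values are hypothetical, and the actual `query_lookup` measure may apply additional steps that are not shown here.

```r
# Hypothetical weighted query matrix (queries x query terms).
query <- matrix(c(0.5, 1.0, 0.0,
                  0.0, 0.8, 0.4),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("event1", "event2"), c("t1", "t2", "t3")))

# Hypothetical binary reference matrix (documents x the same query terms).
ref <- matrix(c(1, 1, 0,
                0, 0, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("news1", "news2"), c("t1", "t2", "t3")))

# For each query/document pair: the sum of the query weights of the
# terms that occur in the document. Because ref is binary, the
# document's own term weights do not enter the score.
score <- query %*% t(ref)
```

Here `score["event1", "news1"]` sums the weights of t1 and t2, since both occur in news1, while `score["event1", "news2"]` is zero because news2 contains none of the terms that event1 weights.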

Several options are given to only create term combinations that are informative. Firstly, a minimum and maximum document frequency of term combinations can be defined. Secondly, a minimum observed/expected ratio can be given. The expected probability of a combination of term A and term B is their joint probability if the terms were independent, i.e. P(A) * P(B). If the observed probability is not higher, the combination is not more informative than chance. Thirdly, before intersecting terms, one can first cluster very similar terms together as single columns to reduce the number of possible combinations.
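As a worked illustration of the observed/expected ratio, the following base R sketch computes it for a single pair of terms. The occurrence vectors are made up, and the way create_queries computes the ratio internally may differ in detail.

```r
# Made-up binary occurrence of two terms across 5 documents.
a <- c(1, 1, 0, 0, 0)
b <- c(1, 1, 1, 0, 0)

# Document probabilities of each term (document frequency / N).
p_a <- mean(a)
p_b <- mean(b)

# Observed probability that both terms occur in the same document.
observed <- mean(a * b)

# Expected probability if the two terms were independent.
expected <- p_a * p_b

# A ratio above 1 means the pair co-occurs more often than chance.
ratio <- observed / expected
```

A combination would pass a filter such as min_obs_exp = 1.5 only if `ratio` is at least that value.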

Examples

# NOT RUN {
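 library(RNewsflow)  # provides create_queries() and the rnewsflow_dfm example data
 q = create_queries(rnewsflow_dfm, min_docfreq = 2, union_sim_thres = 0.9, 
                    max_docprob = 0.05, verbose = FALSE)
 # inspect the first 100 query terms
 head(colnames(q$query_dtm), 100)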
# }