- voc
A character vector that gives the vocabulary (e.g., colnames of a dtm)
- type
Either "bi" (bigrams) or "tri" (trigrams)
- min_overlap
The minimum overlap percentage. Works together with max_diff to determine the required overlap.
- max_diff
The maximum number of bi/tri-grams that may differ.
- pad
If True, pad the left side (ls) and right side (rs) of bi/tri-grams. So, the trigrams for "pad" would be: "ls_ls_p", "ls_p_a", "p_a_d", "a_d_rs", "d_rs_rs".
- as_lower
If True, ignore case
- same_start
Should terms start with the same character(s)? Given as the number of leading characters that must be the same. This also greatly speeds up the calculation.
- drop_non_alpha
If True, ignore non-alpha terms (e.g., numbers, punctuation). They will still appear in the output matrix, but only with zeros.
- min_length
The minimum number of characters in a term. Terms with fewer characters are ignored. They will appear in the output matrix, but only with zeros.
- allow_asym
If True, the match only needs to hold for at least one of the two terms. In practice, this means that "America" would match perfectly with "Southern-America".
- verbose
If True, report progress
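
To make the pad and allow_asym descriptions above concrete, here is a minimal base R sketch. The helpers `char_ngrams` and `overlap_share` are hypothetical names introduced only for illustration; they are not part of the documented function, and the overlap measure shown (share of one term's n-grams found in the other) is an assumption about how the asymmetric match could be read, not the actual implementation.

```r
# Hypothetical helper: split a term into character n-grams joined by "_".
# With pad = TRUE, n - 1 "ls" markers are added on the left and
# n - 1 "rs" markers on the right, as in the `pad` description.
char_ngrams <- function(term, n = 3, pad = FALSE) {
  chars <- strsplit(term, "")[[1]]
  if (pad) chars <- c(rep("ls", n - 1), chars, rep("rs", n - 1))
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1)
  vapply(starts,
         function(i) paste(chars[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Hypothetical helper (an assumed reading of the asymmetric match):
# the share of n-grams of `a` that also occur in `b`, ignoring case.
overlap_share <- function(a, b, n = 3) {
  grams_a <- char_ngrams(tolower(a), n)
  grams_b <- char_ngrams(tolower(b), n)
  mean(grams_a %in% grams_b)
}

char_ngrams("pad", n = 3, pad = TRUE)
# "ls_ls_p" "ls_p_a" "p_a_d" "a_d_rs" "d_rs_rs"

overlap_share("America", "Southern-America")  # 1: every trigram of "America" is found
overlap_share("Southern-America", "America")  # below 1: most trigrams are not found
```

Under this reading, a symmetric match would require both shares to pass the overlap criteria, whereas allow_asym would accept the pair because one direction matches perfectly.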