This function computes n-grams from text.
topicsGrams(
data,
ngram_window = c(1, 3),
stopwords = stopwords::stopwords("en", source = "snowball"),
occurance_rate = 0,
removal_mode = "frequency",
removal_rate_most = NULL,
removal_rate_least = NULL,
pmi_threshold = 0,
top_frequent = 200
)
A list containing a tibble of the n-grams with their frequency and probability, and a tibble containing the relative frequency of the n-grams for each user.
(tibble) The data
(list) The minimum and maximum n-gram length, e.g. c(1, 3)
(stopwords) The stopwords to remove, e.g. stopwords::stopwords("en", source = "snowball")
(numerical) The occurrence rate (0-1). Terms that occur in fewer than (occurance_rate) * (number of documents) documents are removed. Example: if the training dataset has 1000 documents and the occurrence rate is set to 0.05, terms that appear in fewer than 50 documents are removed.
(character) The mode of removal, either "term", "frequency" or "percentage"
(numeric) The rate of most frequent ngrams to remove
(numeric) The rate of least frequent ngrams to remove
(integer) The PMI threshold; set to 0 to disable it.
(integer) The number of most frequently occurring n-grams to include in the output.
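
The interaction between ngram_window and occurance_rate can be illustrated with a small base R sketch. It is not the package's implementation: make_ngrams, docs, doc_freq, min_docs and kept are hypothetical names introduced here, the tokenisation is deliberately naive, and the filter assumes that "occurrence" means the number of documents containing an n-gram, as in the 1000-document example above.

```r
# Hypothetical helper (not part of the package): return every n-gram of one
# document for n running from ngram_window[1] to ngram_window[2].
make_ngrams <- function(text, ngram_window = c(1, 3)) {
  # Naive tokenisation (assumption): lower-case, split on non-letters
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  grams <- character(0)
  for (n in seq(ngram_window[1], ngram_window[2])) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- c(grams, vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  grams
}

# Toy corpus of four documents (made up for illustration)
docs <- c("the cat sat", "the cat ran", "a dog ran", "the dog sat")
grams_per_doc <- lapply(docs, make_ngrams)

# Document frequency: in how many documents does each n-gram appear?
doc_freq <- table(unlist(lapply(grams_per_doc, unique)))

# With 4 documents and a rate of 0.5 the threshold is 0.5 * 4 = 2 documents;
# n-grams found in fewer documents than that are dropped.
occurance_rate <- 0.5
min_docs <- occurance_rate * length(docs)
kept <- names(doc_freq)[doc_freq >= min_docs]
print(kept)
```

In this toy corpus only "the", "cat", "dog", "ran", "sat" and "the cat" appear in at least two documents, so all other n-grams (including every trigram) are removed. Setting occurance_rate to 0, the default, keeps everything.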