- data
- (list) A list containing the text data with each entry belonging to a unique id 
- ngram_window
- (list) The minimum and maximum n-gram length, e.g., c(1,3) for unigrams through trigrams
- stopwords
- (stopwords) The stopwords to remove, e.g., stopwords::stopwords("en", source = "snowball") 
- removalword
- (string) The word to remove 
- pmi_threshold
- (integer; experimental) Threshold on Pointwise Mutual Information (PMI). PMI measures the association
between terms by comparing their co-occurrence probability to their individual probabilities,
highlighting term pairs that occur together more often than expected by chance. In this implementation,
terms with an average PMI below pmi_threshold are removed from the document-term matrix.
- occurance_rate
- (numerical) The occurrence rate (0-1). Words that occur in fewer than (occurance_rate)*(number of documents) documents are removed. Example: If the training dataset has 1000 documents and the occurrence rate is set to 0.05, the threshold is 0.05 * 1000 = 50, so the code will remove terms that appear in fewer than 50 documents.
- removal_mode
- (string) Mode of removal; one of c("none", "frequency", "term", "percentage"). "frequency" removes all words below or above certain frequencies, as indicated by removal_rate_least and removal_rate_most. "term" removes an absolute number of the most frequent and least frequent terms. "percentage" removes the number of terms indicated by removal_rate_least and removal_rate_most relative to the number of terms in the matrix.
- removal_rate_most
- (integer) The rate of most frequent words to be removed; its interpretation depends on removal_mode
- removal_rate_least
- (integer) The rate of least frequent words to be removed; its interpretation depends on removal_mode
- shuffle
- (boolean) Shuffle the data before analyses 
- seed
- (integer) A seed to set for reproducibility 
- threads
- (integer) The number of threads to use; also called cpu in (CreateDtm).
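
The effect of occurance_rate and removal_mode can be sketched in base R on a toy document-term matrix. This is only an illustration under stated assumptions, not the package's own implementation: the helper remove_terms and the toy matrix are hypothetical, the frequency bounds are assumed to be inclusive, and for "percentage" the rates are assumed to be fractions between 0 and 1.

```r
# Toy document-term matrix: 4 documents (rows) x 5 terms (columns).
dtm <- matrix(
  c(1, 0, 2, 1, 0,
    1, 0, 0, 1, 0,
    1, 1, 0, 1, 0,
    1, 0, 0, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("the", "cat", "sat", "on", "mat"))
)

# occurance_rate: keep terms found in at least rate * n_docs documents.
occurance_rate <- 0.5
doc_freq <- colSums(dtm > 0)              # documents containing each term
min_docs <- occurance_rate * nrow(dtm)    # 0.5 * 4 = 2 documents
dtm_rate <- dtm[, doc_freq >= min_docs, drop = FALSE]

# Hypothetical helper (not part of the package) illustrating removal_mode.
remove_terms <- function(dtm,
                         mode = c("none", "frequency", "term", "percentage"),
                         rate_most = 0, rate_least = 0) {
  mode <- match.arg(mode)
  if (mode == "none") return(dtm)
  term_freq <- colSums(dtm)               # total count of each term
  if (mode == "frequency") {
    # Assumption: bounds are inclusive, so terms with a total frequency
    # above rate_most or below rate_least are dropped.
    keep <- term_freq <= rate_most & term_freq >= rate_least
    return(dtm[, keep, drop = FALSE])
  }
  # "term" takes absolute numbers of terms; "percentage" scales the rates
  # (assumed here to be fractions in 0-1) by the number of terms.
  n_most  <- if (mode == "term") rate_most  else floor(rate_most  * ncol(dtm))
  n_least <- if (mode == "term") rate_least else floor(rate_least * ncol(dtm))
  ranked <- order(term_freq, decreasing = TRUE)   # most frequent first
  drop <- c(head(ranked, n_most), tail(ranked, n_least))
  dtm[, setdiff(seq_len(ncol(dtm)), drop), drop = FALSE]
}

dtm_term <- remove_terms(dtm, "term", rate_most = 1, rate_least = 1)
dtm_freq <- remove_terms(dtm, "frequency", rate_most = 3, rate_least = 2)
dtm_pct  <- remove_terms(dtm, "percentage", rate_most = 0.2, rate_least = 0.2)
```

With this toy matrix, the occurrence filter keeps only "the" and "on", which appear in at least two of the four documents. The "term" call drops the single most frequent term ("the") and the single least frequent term ("cat").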