feature_selection: A function that implements a number of feature selection methods for finding top words which distinguish between two classes.

Description

Implements several feature selection methods for finding the top words that distinguish between two classes.

Usage

feature_selection(contingency_table, rows_to_compare = NULL, alpha = 1,
  method = c("informed Dirichlet", "TF-IDF", "TF-IDF-log(tf)",
  "TF-IDF-augmented(tf)"), maximum_top_words = 5000,
  document_term_matrix = NULL, subsume_ngrams = FALSE,
  ngram_subsumption_correlation_threshold = 0.9, rank_by_log_odds = FALSE)

Arguments

contingency_table

A contingency table generated by the `contingency_table()` function.

rows_to_compare

A numeric vector containing the indices of the rows in the contingency table to compare against each other. Defaults to NULL, in which case all rows are compared against each other.

alpha

The Dirichlet hyperparameter to be used if method = "informed Dirichlet". The suggested value is the average number of terms that appear in a document. If a small value is selected, then more (globally) common terms may be selected as top words. Increasing the value will select for less globally common words. Defaults to 1, which is not usually a good choice for most analyses.
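As a rough illustration of the suggested value, the sketch below computes the average number of terms per document from a toy document term matrix. The matrix and its term names are hypothetical, and it assumes "number of terms" means total term occurrences per document (rather than unique terms), which the text above does not specify.

```r
# Toy document term matrix: rows are documents, columns are terms.
# (Hypothetical data for illustration only.)
dtm <- matrix(
  c(2, 0, 1,
    1, 3, 0,
    0, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("tax", "health", "vote"))
)

# Total number of term occurrences in each document.
doc_lengths <- rowSums(dtm)

# Candidate alpha: the average number of terms per document.
alpha <- mean(doc_lengths)
print(alpha)
```

The resulting value could then be passed as the alpha argument in place of the default.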

method

Defaults to "informed Dirichlet", which implements the model described in section 3.5.1 of Monroe et al. "Fightin Words...". Can also be "TF-IDF", in which case canonical TF-IDF ranking is used. The user may also select "TF-IDF-log(tf)", in which case the TF term is logged, or "TF-IDF-augmented(tf)", in which case the TF term is augmented. Both variants follow Manning and Schutze (1999, p.544).
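To make the three TF-IDF variants concrete, here is a minimal base R sketch using the commonly cited weightings: raw tf, 1 + log(tf), and 0.5 + 0.5 * tf / max(tf), each multiplied by log(N / df). These formulas, the toy counts, and the choice to leave absent terms at zero are assumptions of this sketch; the function's internal computation may differ.

```r
# Hypothetical raw term counts: rows are documents, columns are terms.
counts <- matrix(
  c(4, 0, 1,
    2, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("tax", "health", "vote"))
)

# Inverse document frequency: log(N / df), where df is the number of
# documents that contain the term.
n_docs <- nrow(counts)
doc_freq <- colSums(counts > 0)
idf <- log(n_docs / doc_freq)

# Three term-frequency weightings. Absent terms are kept at zero
# (a choice made for this sketch).
tf_raw <- counts
tf_log <- ifelse(counts > 0, 1 + log(counts), 0)
max_tf <- apply(counts, 1, max)  # largest count within each document
tf_aug <- ifelse(counts > 0, 0.5 + 0.5 * counts / max_tf, 0)

# Multiply every column by its idf weight.
tfidf_raw <- sweep(tf_raw, 2, idf, "*")
tfidf_log <- sweep(tf_log, 2, idf, "*")
tfidf_aug <- sweep(tf_aug, 2, idf, "*")
```

Note that "tax" appears in both toy documents, so its idf (and therefore its TF-IDF score under all three weightings) is zero.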

maximum_top_words

Controls the maximum number of top words returned in each category. Defaults to 5000.

document_term_matrix

The document term matrix used to construct the contingency_table. Necessary if the user selects method = "TF-IDF". Defaults to NULL.

subsume_ngrams

Optional argument allowing the user to combine highly correlated n-grams in the resulting output. Only useful if terms in the document term matrix can overlap. Defaults to FALSE.

ngram_subsumption_correlation_threshold

Defaults to 0.9. Can be set higher or lower depending on the correlation threshold at which the user would like to subsume n-grams.
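The sketch below illustrates what a correlation threshold of 0.9 means for overlapping terms. It only flags pairs of term columns whose correlation across documents exceeds the threshold; how the function actually merges such n-grams is not described here, so the matrix, the term names, and the pairing logic are all hypothetical.

```r
# Hypothetical document term matrix in which "new" and "new_york"
# always occur together (overlapping n-grams).
dtm <- cbind(
  new      = c(3, 0, 2, 1, 0),
  new_york = c(3, 0, 2, 1, 0),
  tax      = c(0, 4, 1, 0, 2)
)

threshold <- 0.9

# Pairwise correlations between term columns.
term_cor <- cor(dtm)

# Keep each pair once (upper triangle) if it exceeds the threshold.
high <- which(term_cor > threshold & upper.tri(term_cor), arr.ind = TRUE)
candidate_pairs <- data.frame(
  term_1 = rownames(term_cor)[high[, "row"]],
  term_2 = colnames(term_cor)[high[, "col"]]
)
print(candidate_pairs)
```

Raising the threshold flags fewer pairs as candidates for subsumption, and lowering it flags more.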

rank_by_log_odds

Only applicable for the "informed Dirichlet" method. Defaults to FALSE. If TRUE, then terms are ranked by log odds instead of z-score.
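The difference between the two rankings can be seen in a small sketch of the log-odds-ratio with a Dirichlet prior, in the spirit of Monroe et al. The counts are hypothetical, and the prior construction (alpha spread across terms in proportion to their overall frequency) and ranking by absolute magnitude are assumptions of this sketch rather than a description of the function's internals.

```r
# Hypothetical term counts for two categories (two contingency table rows).
counts_a <- c(tax = 30, health = 5, vote = 15)
counts_b <- c(tax = 10, health = 25, vote = 15)

# Assumed prior: alpha pseudo-counts spread across terms in proportion
# to their overall frequency.
alpha <- 10
total <- counts_a + counts_b
prior <- alpha * total / sum(total)

# Log odds of each term within a category, with the prior added as
# pseudo-counts.
log_odds <- function(y, prior) {
  log((y + prior) / (sum(y) + sum(prior) - y - prior))
}

# Difference in log odds between the two categories.
delta <- log_odds(counts_a, prior) - log_odds(counts_b, prior)

# Approximate variance of the difference, and the resulting z-score.
variance <- 1 / (counts_a + prior) + 1 / (counts_b + prior)
z_score <- delta / sqrt(variance)

# Two alternative rankings, both by absolute magnitude.
ranking_z <- names(sort(abs(z_score), decreasing = TRUE))
ranking_log_odds <- names(sort(abs(delta), decreasing = TRUE))

# Terms whose z-score exceeds 1.96 in magnitude.
significant <- names(z_score)[abs(z_score) > 1.96]
```

In this toy example the z-score ranking places "tax" first, while the log-odds ranking places "health" first, because the z-score also accounts for how much evidence (how many counts) supports each difference.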

Value

A list object containing two data frames (one for each comparison category) with ranked top words. All words included in each data frame have a z-score greater in magnitude than 1.96.