dfm_select: select features from a dfm or fcm

Description

This function selects or discards features from a dfm or fcm, based on a pattern match with the feature names. The most common usages are to eliminate features from a dfm already constructed, such as stopwords, or to select only terms of interest from a dictionary.

Usage

dfm_select(x, features = NULL, documents = NULL, selection = c("keep",
  "remove"), valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, min_nchar = 1, max_nchar = 63,
  padding = FALSE, verbose = quanteda_options("verbose"), ...)
dfm_remove(x, features = NULL, documents = NULL, ...)
fcm_select(x, features = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = TRUE, ...)
fcm_remove(x, features, ...)

Arguments

x

the dfm or fcm object whose features will be selected

features

one of: a character vector of features to be selected, a dfm whose features will be used for selection, or a dictionary class object whose values (not keys) will provide the features to be selected. For dfm objects, see details in the Value section below.

documents

select documents based on their document names. Works exactly the same as features.

selection

whether to keep or remove the features

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore the case of dictionary values if TRUE

min_nchar, max_nchar

numerics specifying the minimum and maximum length in characters for features to be removed or kept; defaults are 1 and 63 (see https://en.wikipedia.org/wiki/Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft). (Set max_nchar to NULL for no upper limit.) These are applied after (and hence, in addition to) any selection based on pattern matches. These arguments are ignored when padding is TRUE.

padding

if TRUE, features or documents that do not exist in x are added to the dfm. This option is available only when selection is "keep" and valuetype is "fixed".

verbose

if TRUE, print a message about how many features were removed

...

supplementary arguments passed to the underlying functions in stri_detect_regex
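
The three valuetype settings and the case_insensitive flag described above can be illustrated with a small base-R sketch. This is not quanteda's implementation: match_features is a hypothetical helper written only for this illustration, and the glob translation relies on utils::glob2rx, which is an assumption about how glob patterns might reasonably be interpreted.

```r
# Base-R sketch of the three valuetype interpretations; NOT quanteda's code.
feats <- c("tax", "taxation", "Taxes", "plan", "my", "by")

# Hypothetical helper: return the feature names matched by `pattern`
match_features <- function(feats, pattern,
                           valuetype = c("glob", "regex", "fixed"),
                           case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  if (valuetype == "fixed") {
    # exact, whole-feature matching
    if (case_insensitive) {
      hit <- tolower(feats) %in% tolower(pattern)
    } else {
      hit <- feats %in% pattern
    }
    return(feats[hit])
  }
  # glob patterns are translated to anchored regular expressions
  if (valuetype == "glob") pattern <- utils::glob2rx(pattern)
  # a feature is selected if it matches any of the patterns
  hit <- Reduce(`|`,
                lapply(pattern, grepl, x = feats, ignore.case = case_insensitive),
                rep(FALSE, length(feats)))
  feats[hit]
}

match_features(feats, "tax*", "glob")    # "tax" "taxation" "Taxes"
match_features(feats, "tax*", "glob", case_insensitive = FALSE)  # "tax" "taxation"
match_features(feats, "y$", "regex")     # "my" "by"
match_features(feats, "tax", "fixed")    # "tax"
```

In this sketch a regex pattern may match anywhere inside a feature name, whereas glob and fixed patterns must account for the whole name.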

Value

A dfm or fcm object, after the feature selection has been applied.

When features is a dfm object and padding is TRUE, the returned object will be identical in its feature set to the dfm supplied as the features argument. This means that any features in x that are not in features will be discarded, and that any features found in the dfm supplied as features but not found in x will be added with all zero counts. Because selecting on a dfm is designed to produce a selected dfm with an exact feature match, the following settings are always used when features is a dfm object: padding = TRUE, case_insensitive = FALSE, and valuetype = "fixed".

Selecting on a dfm is useful when you have trained a model on one dfm, and need to project this onto a test set whose features must be identical. It is also used in bootstrap_dfm. See examples.
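
The effect of selecting on a dfm can be sketched with plain count matrices in base R. This is only an illustration of the behaviour described above, not quanteda's implementation; conform_features is a hypothetical helper, and the column order it produces (that of the target feature set) is an assumption of the sketch.

```r
# Base-R sketch with ordinary matrices; NOT quanteda's implementation.
# Hypothetical helper: make the columns of `x` identical to `target_feats`
conform_features <- function(x, target_feats) {
  # start from an all-zero matrix that has exactly the target features
  out <- matrix(0, nrow = nrow(x), ncol = length(target_feats),
                dimnames = list(rownames(x), target_feats))
  # copy counts for features present in both; all others stay zero
  shared <- intersect(colnames(x), target_feats)
  out[, shared] <- x[, shared, drop = FALSE]
  out
}

# counts for "This is text one" and "The second text"
x <- matrix(c(1, 1, 1, 1, 0, 0,
              0, 0, 1, 0, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("text1", "text2"),
                            c("this", "is", "text", "one", "the", "second")))
# feature set of a second matrix built from other documents
target <- c("the", "second", "text", "this", "is", "three")

out <- conform_features(x, target)
out
```

Here the feature "one" is dropped because it is absent from the target set, and "three" is added as a column of zeros, so a model fitted on the target features can be applied to the result.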

Details

dfm_remove and fcm_remove are simply convenience wrappers for calling dfm_select and fcm_select with selection = "remove".

Examples

myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.", 
               "Does the United_States or Sweden have more progressive taxation?"),
             tolower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                          wordsEndingInY = c("by", "my"),
                          notintext = "blahblah"))
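# keep features matching the dictionary values (glob matching, case-insensitive by default)
dfm_select(myDfm, mydict)
# same selection, but case-sensitive
dfm_select(myDfm, mydict, case_insensitive = FALSE)
# keep, then remove, features matching regular expressions
dfm_select(myDfm, c("s$", ".y"), selection = "keep", valuetype = "regex")
dfm_select(myDfm, c("s$", ".y"), selection = "remove", valuetype = "regex")
# keep, then remove, English stopwords using exact matching
dfm_select(myDfm, stopwords("english"), selection = "keep", valuetype = "fixed")
dfm_select(myDfm, stopwords("english"), selection = "remove", valuetype = "fixed")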

# select based on character length
dfm_select(myDfm, min_nchar = 5)

# selecting on a dfm
txts <- c("This is text one", "The second text", "This is text three")
(dfm1 <- dfm(txts[1:2]))
(dfm2 <- dfm(txts[2:3]))
(dfm3 <- dfm_select(dfm1, dfm2, valuetype = "fixed", padding = TRUE, verbose = TRUE))
setequal(featnames(dfm2), featnames(dfm3))

# removing stopwords from a dfm
tmpdfm <- dfm(c("This is a document with lots of stopwords.",
                "No if, and, or but about it: lots of stopwords."),
              verbose = FALSE)
tmpdfm
dfm_remove(tmpdfm, stopwords("english"))

# removing stopwords from an fcm
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
tmpfcm <- fcm(toks)
tmpfcm
fcm_remove(tmpfcm, stopwords("english"))