quanteda (version 0.9.7-17)

selectFeatures: select features from an object

Description

This function selects or discards features from a variety of objects, such as tokenized texts, a dfm, or a list of collocations. The most common uses are to eliminate stop words from a text or text-based object, or to keep only the features that match a list of regular expressions.

Usage

selectFeatures(x, features, ...)

## S3 method for class 'dfm'
selectFeatures(x, features, selection = c("keep", "remove"), valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, verbose = TRUE, ...)

## S3 method for class 'tokenizedTexts'
selectFeatures(x, features, selection = c("keep", "remove"), valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, padding = FALSE, indexing = FALSE, verbose = FALSE, ...)

## S3 method for class 'collocations'
selectFeatures(x, features, selection = c("keep", "remove"), valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE, verbose = TRUE, pos = 1:3, ...)

Arguments

x
object whose features will be selected
features
one of: a character vector of features to be selected, a dfm whose features will be used for selection, or a dictionary class object whose values (not keys) will provide the features to be selected. For dfm objects, see details in the Value section below.
...
supplementary arguments passed to the underlying functions in stri_detect_regex. (This is how case_insensitive is passed, but you may wish to pass others.)
selection
whether to keep or remove the features
valuetype
how to interpret the feature vector: "fixed" for words as is; "regex" for regular expressions; or "glob" for glob-style wildcards
case_insensitive
ignore the case of dictionary values if TRUE
verbose
if TRUE print message about how many features were removed
padding
(only for tokenizedTexts objects) if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected features, for instance if a window of adjacency needs to be computed.
indexing
use dfm-based index to efficiently process large tokenizedTexts object
pos
indexes of the word positions to check when called on collocations: a collocation is removed if the word at any position in pos is a stopword
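
As a rough illustration of how the three valuetype settings and case_insensitive interact, the following base R sketch may help. Note that match_features is a hypothetical helper written for this page, not a quanteda function, and it assumes that glob patterns are matched against whole features; the package's internal matching may differ in detail.

```r
# Hypothetical helper (not part of quanteda): returns TRUE for each feature
# that matches at least one pattern under the given valuetype.
match_features <- function(features, patterns,
                           valuetype = c("glob", "regex", "fixed"),
                           case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  if (valuetype == "fixed") {
    # exact, whole-word comparison
    if (case_insensitive) {
      return(tolower(features) %in% tolower(patterns))
    }
    return(features %in% patterns)
  }
  if (valuetype == "glob") {
    # translate wildcards such as "tax*" into anchored regular expressions
    patterns <- utils::glob2rx(patterns)
  }
  hits <- lapply(patterns, grepl, x = features, ignore.case = case_insensitive)
  # a feature is selected if any pattern matched it
  Reduce(`|`, hits, rep(FALSE, length(features)))
}

feats <- c("My", "Christmas", "tax", "taxation", "by")
feats[match_features(feats, "tax*", "glob")]   # wildcard match
feats[match_features(feats, "y$", "regex")]    # regular expression match
feats[match_features(feats, "my", "fixed")]    # exact match, ignoring case
feats[match_features(feats, "my", "fixed", case_insensitive = FALSE)]
```

With selection = "keep" the matched features would be retained, and with selection = "remove" they would be dropped.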

Value

A dfm after the feature selection has been applied. When features is a dfm-class object, the returned object will be identical in its feature set to the dfm supplied as the features argument. This means that any features in x not in features will be discarded, and that any features found in the dfm supplied as features but not found in x will be added with all zero counts. This is useful when you have trained a model on one dfm and need to project it onto a test set whose features must be identical.
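
The feature-matching behaviour described above can be pictured with ordinary matrices. The sketch below uses base R only; conform_features is a hypothetical name chosen for illustration, and plain numeric matrices with column names stand in for dfm objects.

```r
# Hypothetical helper (not part of quanteda): give x exactly the columns of
# reference, dropping extra features and zero-filling missing ones.
conform_features <- function(x, reference) {
  target <- colnames(reference)
  out <- matrix(0, nrow = nrow(x), ncol = length(target),
                dimnames = list(rownames(x), target))
  # copy counts only for features present in both matrices
  shared <- intersect(colnames(x), target)
  out[, shared] <- x[, shared, drop = FALSE]
  out
}

# "training" matrix with features text and one
train <- matrix(c(1, 1, 0, 1), nrow = 2,
                dimnames = list(c("d1", "d2"), c("text", "one")))
# "test" matrix with features text and new
test <- matrix(c(2, 1, 1, 0), nrow = 2,
               dimnames = list(c("t1", "t2"), c("text", "new")))

conformed <- conform_features(test, train)
conformed
```

Here the feature new is discarded because it is absent from train, and the feature one is added as a column of zeros, so a model fitted on train can be applied to conformed.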

See Also

removeFeatures, trim

Examples

myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.", 
               "Does the United_States or Sweden have more progressive taxation?"),
             toLower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                          wordsEndingInY = c("by", "my"),
                          notintext = "blahblah"))
selectFeatures(myDfm, mydict)
selectFeatures(myDfm, mydict, case_insensitive = FALSE)
selectFeatures(myDfm, c("s$", ".y"), "keep")
selectFeatures(myDfm, c("s$", ".y"), "keep", valuetype = "regex")
selectFeatures(myDfm, c("s$", ".y"), "remove", valuetype = "regex")
selectFeatures(myDfm, stopwords("english"), "keep", valuetype = "fixed")
selectFeatures(myDfm, stopwords("english"), "remove", valuetype = "fixed")

# selecting on a dfm
textVec1 <- c("This is text one.", "This, the second text.", "Here: the third text.")
textVec2 <- c("Here are new words.", "New words in this text.")
(dfm1 <- dfm(textVec1, verbose = FALSE))
(dfm2a <- dfm(textVec2, verbose = FALSE))
(dfm2b <- selectFeatures(dfm2a, dfm1))
setequal(features(dfm1), features(dfm2b))

# more selection on a dfm
selectFeatures(dfm1, dfm2a)
selectFeatures(dfm1, dfm2a, selection = "remove")
## Not run: ## performance comparisons
# data(SOTUCorpus, package = "quantedaData")
# toks <- tokenize(SOTUCorpus, removePunct = TRUE)
# # toks <- tokenize(tokenize(SOTUCorpus, what='sentence', simplify = TRUE), removePunct = TRUE)
# # head to head, old v. new
# system.time(selectFeaturesOLD(toks, stopwords("english"), "remove", verbose = FALSE))
# system.time(selectFeatures(toks, stopwords("english"), "remove", verbose = FALSE))
# system.time(selectFeaturesOLD(toks, c("and", "of"), "remove", verbose = FALSE, valuetype = "regex"))
# system.time(selectFeatures(toks, c("and", "of"), "remove", verbose = FALSE, valuetype = "regex"))
# microbenchmark::microbenchmark(
#     old = selectFeaturesOLD(toks, stopwords("english"), "remove", verbose = FALSE),
#     new = selectFeatures(toks, stopwords("english"), "remove", verbose = FALSE),
#     times = 5, unit = "relative")
# microbenchmark::microbenchmark(
#     old = selectFeaturesOLD(toks, c("and", "of"), "remove", verbose = FALSE, valuetype = "regex"),
#     new = selectFeatures(toks, c("and", "of"), "remove", verbose = FALSE, valuetype = "regex"),
#     times = 2, unit = "relative")
#     
# types <- unique(unlist(toks))
# numbers <- types[stringi::stri_detect_regex(types, '[0-9]')]
# microbenchmark::microbenchmark(
#     old = selectFeaturesOLD(toks, numbers, "remove", verbose = FALSE, valuetype = "fixed"),
#     new = selectFeatures(toks, numbers, "remove", verbose = FALSE, valuetype = "fixed"),
#     times = 2, unit = "relative")  
#     
# # removing tokens before dfm, versus after
# microbenchmark::microbenchmark(
#     pre = dfm(selectFeaturesOLD(toks, stopwords("english"), "remove"), verbose = FALSE),
#     post = dfm(toks, ignoredFeatures = stopwords("english"), verbose = FALSE),
#     times = 5, unit = "relative")
# ## End(Not run)

## with simple examples
toks <- tokenize(c("This is a sentence.", "This is a second sentence."), 
                 removePunct = TRUE)
selectFeatures(toks, c("is", "a", "this"), selection = "remove", 
               valuetype = "fixed", padding = TRUE, case_insensitive = TRUE)

# how case_insensitive works
selectFeatures(toks, c("is", "a", "this"), selection = "remove", 
               valuetype = "fixed", padding = TRUE, case_insensitive = FALSE)
selectFeatures(toks, c("is", "a", "this"), selection = "remove", 
               valuetype = "fixed", padding = TRUE, case_insensitive = TRUE)
selectFeatures(toks, c("is", "a", "this"), selection = "remove", 
               valuetype = "glob", padding = TRUE, case_insensitive = TRUE)
selectFeatures(toks, c("is", "a", "this"), selection = "remove", 
               valuetype = "glob", padding = TRUE, case_insensitive = FALSE)

# with longer texts
txts <- c(exampleString, inaugTexts[2])
toks <- tokenize(txts)
selectFeatures(toks, stopwords("english"), "remove")
selectFeatures(toks, stopwords("english"), "keep")
selectFeatures(toks, stopwords("english"), "remove", padding = TRUE)
selectFeatures(toks, stopwords("english"), "keep", padding = TRUE)
selectFeatures(tokenize(encodedTexts[1]), stopwords("english"), "remove", padding = TRUE)
 

## example for collocations
(myCollocs <- collocations(inaugTexts[1:3], n=20))
selectFeatures(myCollocs, stopwords("english"), "remove")
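
Returning to the padding argument used in the tokenized-text examples above, the following base R sketch shows the difference it makes on a single character vector of tokens. The function remove_tokens is a hypothetical stand-in, not a quanteda function, and it assumes case-insensitive fixed matching.

```r
# Hypothetical helper (not part of quanteda): remove tokens, optionally
# leaving "" in their place so the remaining tokens keep their positions.
remove_tokens <- function(tokens, to_remove, padding = FALSE) {
  drop <- tolower(tokens) %in% tolower(to_remove)
  if (padding) {
    tokens[drop] <- ""   # keep the slot, blank out the token
    tokens
  } else {
    tokens[!drop]        # shorten the vector
  }
}

toks_demo <- c("This", "is", "a", "second", "sentence")
unpadded <- remove_tokens(toks_demo, c("is", "a", "this"))
padded <- remove_tokens(toks_demo, c("is", "a", "this"), padding = TRUE)

length(unpadded)  # shorter than the input
length(padded)    # same length as the input
```

Because the padded result has the same length as the input, the index of each surviving token still reflects its original position, which is what a window of adjacency needs.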
