This function selects or removes features from a dfm or fcm,
based on feature name matches with pattern. The most common usages
are to eliminate features from a dfm already constructed, such as stopwords,
or to select only terms of interest from a dictionary.
dfm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
min_nchar = NULL,
max_nchar = NULL,
verbose = quanteda_options("verbose")
)
dfm_remove(x, ...)
dfm_keep(x, ...)
fcm_select(
x,
pattern = NULL,
selection = c("keep", "remove"),
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
verbose = quanteda_options("verbose"),
...
)
fcm_remove(x, pattern = NULL, ...)
fcm_keep(x, pattern = NULL, ...)
x: the dfm or fcm object whose features will be selected.
pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
selection: whether to keep or remove the features.
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values.
min_nchar, max_nchar: optional numerics specifying the minimum and maximum length in characters for features to be removed or kept; defaults are NULL for no limits. These are applied after (and hence, in addition to) any selection based on pattern matches.
verbose: if TRUE, print a message about how many features were removed.
...: used only for passing arguments from dfm_remove or dfm_keep to dfm_select. Cannot include selection.
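As an illustration of how the length limits combine with pattern matching, here is a minimal sketch. The sentence and the object names (toks, dfmat, dfmat_long) are made up for this example, and it uses the default selection = "keep" only.

```r
library(quanteda)

# hypothetical one-document example
toks <- tokens("The cat sat on a very comfortable mat", remove_punct = TRUE)
dfmat <- dfm(toks)

# the glob "c*" matches "cat" and "comfortable";
# min_nchar = 4 then drops the three-character "cat"
dfmat_long <- dfm_select(dfmat, pattern = "c*", min_nchar = 4)
featnames(dfmat_long)
```

The pattern and the character-length limit both have to be satisfied for a feature to be kept here.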
A dfm or fcm object, after the feature selection has been applied.
For compatibility with earlier versions, when pattern is a dfm object and
selection = "keep", then this will be equivalent to calling dfm_match().
In this case, the following settings are always used:
case_insensitive = FALSE, and valuetype = "fixed". This functionality is
deprecated, however, and you should use dfm_match() instead.
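A minimal sketch of the recommended replacement, using dfm_match() to make one dfm conform to the feature set of another. The toy documents and the names dfmat_train, dfmat_test and dfmat_matched are assumptions made for this example.

```r
library(quanteda)

# two small dfms with partly overlapping features (hypothetical data)
dfmat_train <- dfm(tokens(c(d1 = "a b c", d2 = "b c d")))
dfmat_test  <- dfm(tokens(c(d3 = "c d e f")))

# conform dfmat_test to the features of dfmat_train:
# "e" and "f" are dropped, "a" and "b" are added as all-zero columns
dfmat_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train))
featnames(dfmat_matched)
```

Unlike dfm_select(), dfm_match() also pads features that are absent from x, so the result has the same columns as the reference dfm.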
dfm_remove and fcm_remove are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for calling
dfm_select and fcm_select with selection = "keep".
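The wrapper relationship can be checked with a short sketch like the following. The sentence, the small hand-written word list and the object names are assumptions made for this example.

```r
library(quanteda)

dfmat <- dfm(tokens("No if, and, or but about it", remove_punct = TRUE))
words <- c("if", "and", "or", "but")  # hypothetical list of words to drop

# the wrapper and the explicit call should select the same features
via_wrapper <- dfm_remove(dfmat, pattern = words)
via_select  <- dfm_select(dfmat, pattern = words, selection = "remove")
identical(featnames(via_wrapper), featnames(via_select))
```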
# NOT RUN {
library(quanteda)
# tokenize first: passing a tokens object to dfm() works across quanteda versions
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
                      "Does the United_States or Sweden have more progressive taxation?")),
             tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))
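# keep only features that match a dictionary value (case-insensitive by default)
dfm_select(dfmat, pattern = dict)
# the same selection, but case-sensitive
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
# keep, then remove, features matching either regular expression
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
# keep, then remove, English stopwords using exact matching
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")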
# select based on character length
dfm_select(dfmat, min_nchar = 5)
dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
                      "No if, and, or but about it: lots of stopwords.")))
dfmat
dfm_remove(dfmat, stopwords("english"))
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
fcm_remove(fcmat, stopwords("english"))
# }