These functions are provided for compatibility with older versions of corpus only, and may be defunct as soon as the next release.
token_filter(map_case = TRUE, map_compat = TRUE,
             map_quote = TRUE, remove_ignorable = TRUE,
             stemmer = NULL, stem_except = drop,
             combine = abbreviations("english"),
             drop_letter = FALSE, drop_mark = FALSE,
             drop_number = FALSE, drop_punct = FALSE,
             drop_symbol = FALSE, drop_other = FALSE,
             drop = NULL, drop_except = NULL)

sentence_filter(crlf_break = FALSE,
                suppress = abbreviations("english"))
map_case: a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
map_compat: a logical value indicating whether to apply Unicode compatibility mappings to the characters, namely those required for the NFKC and NFKD normal forms.
map_quote: a logical value indicating whether to replace Unicode quote characters, such as single quotes, double quotes, and apostrophes, with an ASCII single quote (').
remove_ignorable: a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
stemmer: a character value giving the name of the stemming
algorithm, or NULL to leave words unchanged. The stemming
algorithms are provided by the
Snowball stemming library;
the following stemming algorithms are available:
"arabic", "danish", "dutch",
"english", "finnish", "french",
"german", "hungarian", "italian",
"norwegian", "porter", "portuguese",
"romanian", "russian", "spanish",
"swedish", "tamil", and "turkish".
stem_except: a character vector of exception words to exempt from
stemming, or NULL. If left unspecified, stem_except
is set equal to the drop argument.
combine: a character vector of multi-word phrases to combine, or
NULL; see ‘Combining words’.
drop_letter: a logical value indicating whether to replace
"letter" tokens (cased letters, kana, ideographic characters, letter-like
numeric characters, and other letters) with NA.
drop_mark: a logical value indicating whether to replace
"mark" tokens (subscripts, superscripts, modifier letters,
modifier symbols, and other marks) with NA.
drop_number: a logical value indicating whether to replace
"number" tokens (decimal digits, words appearing to be
numbers, and other numeric characters) with NA.
drop_punct: a logical value indicating whether to replace
"punct" tokens (punctuation) with NA.
drop_symbol: a logical value indicating whether to replace
"symbol" tokens (emoji, math, currency, and other symbols)
with NA.
drop_other: a logical value indicating whether to replace
"other" tokens (types that do not fall into any other
category) with NA.
drop: a character vector of types to replace with NA,
or NULL.
drop_except: a character vector of types to exempt from the drop
rules specified by the drop_letter, drop_mark,
drop_number, drop_punct, drop_symbol,
drop_other, and drop arguments, or NULL.
crlf_break: a logical value indicating whether to break sentences on carriage returns or line feeds.
suppress: a character vector of sentence break suppressions.
The token_filter and sentence_filter functions are
deprecated; use text_filter instead.
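
As a rough sketch of the migration, the token-level options above can be passed to text_filter under the same names, and the resulting filter handed to text_tokens or text_split. This assumes an installed version of corpus in which the sentence option crlf_break is spelled sent_crlf (and suppress is spelled sent_suppress); that renaming is an assumption here and should be checked against ?text_filter. The sample strings and the variable names f, toks, and sents are illustrative only.

```r
library(corpus)

# Token-level options keep the names they had in token_filter().
# Assumption: the sentence option crlf_break is spelled sent_crlf
# in text_filter(); verify against ?text_filter for your version.
f <- text_filter(map_case = TRUE, drop_punct = TRUE, sent_crlf = FALSE)

# Tokenize with the filter: case is mapped and punctuation tokens
# are handled according to drop_punct.
toks <- text_tokens("Hello, World!", filter = f)

# Split into sentences with the same filter; the default
# suppressions are left in place.
sents <- text_split("Mr. Smith arrived. He sat down.",
                    units = "sentences", filter = f)
```

A single filter object carries both the token and the sentence settings, so code that previously built a token_filter and a sentence_filter separately can usually be reduced to one text_filter call.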