These functions are provided for compatibility with older versions of corpus only, and may be defunct as soon as the next release.
token_filter(map_case = TRUE, map_compat = TRUE,
             map_quote = TRUE, remove_ignorable = TRUE,
             stemmer = NULL, stem_except = drop,
             combine = abbreviations("english"),
             drop_letter = FALSE, drop_mark = FALSE,
             drop_number = FALSE, drop_punct = FALSE,
             drop_symbol = FALSE, drop_other = FALSE,
             drop = NULL, drop_except = NULL)

sentence_filter(crlf_break = FALSE,
                suppress = abbreviations("english"))
map_case: a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.
map_compat: a logical value indicating whether to apply Unicode compatibility mappings to the characters, those required for NFKC and NFKD normal forms.
map_quote: a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe, with an ASCII single quote (').
remove_ignorable: a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.
stemmer: a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following stemming algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".
stem_except: a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.
combine: a character vector of multi-word phrases to combine, or NULL; see ‘Combining words’.
drop_letter: a logical value indicating whether to replace "letter" tokens (cased letters, kana, ideographic, letter-like numeric characters and other letters) with NA.
drop_mark: a logical value indicating whether to replace "mark" tokens (subscripts, superscripts, modifier letters, modifier symbols, and other marks) with NA.
drop_number: a logical value indicating whether to replace "number" tokens (decimal digits, words appearing to be numbers, and other numeric characters) with NA.
drop_punct: a logical value indicating whether to replace "punct" tokens (punctuation) with NA.
drop_symbol: a logical value indicating whether to replace "symbol" tokens (emoji, math, currency, and other symbols) with NA.
drop_other: a logical value indicating whether to replace "other" tokens (types that do not fall into any other categories) with NA.
drop: a character vector of types to replace with NA, or NULL.
drop_except: a character vector of types to exempt from the drop rules specified by the drop_letter, drop_mark, drop_number, drop_punct, drop_symbol, drop_other, and drop arguments, or NULL.
crlf_break: a logical value indicating whether to break sentences on carriage returns or line feeds.
suppress: a character vector of sentence break suppressions.
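
The drop rules and the stem_except default described in the arguments can be sketched in base R. This is only an illustration under stated assumptions, not the corpus implementation: apply_drop and filter_defaults are hypothetical helpers, the input is an already-tokenized character vector, and the type-based drop_* flags are ignored.

```r
# Hypothetical helper (not part of corpus): replace tokens listed in `drop`
# with NA unless they also appear in `drop_except`.
apply_drop <- function(tokens, drop = NULL, drop_except = NULL) {
  dropped <- tokens %in% drop & !(tokens %in% drop_except)
  tokens[dropped] <- NA_character_
  tokens
}

# Hypothetical helper: a default of `stem_except = drop` is evaluated lazily,
# so it picks up whatever value the caller supplied for `drop`.
filter_defaults <- function(drop = NULL, stem_except = drop) {
  list(drop = drop, stem_except = stem_except)
}

tokens <- c("the", "quick", "fox", "the")
result <- apply_drop(tokens, drop = c("the", "fox"), drop_except = "fox")
print(result)

# stem_except left unspecified: it takes the value of drop
inherited <- filter_defaults(drop = c("a", "the"))
# stem_except given explicitly: the default is not used
overridden <- filter_defaults(drop = c("a", "the"), stem_except = "the")
```

In this sketch "fox" survives because drop_except takes precedence over drop, and inherited$stem_except matches the drop vector because R resolves the default only when the argument is first used.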
The token_filter and sentence_filter functions are deprecated; use text_filter instead.
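
A minimal migration sketch follows, assuming a corpus release in which text_filter accepts the map_case and drop_punct properties and text_tokens takes a filter as its second argument. Other options listed on this page may have been renamed or removed in later releases, so check the text_filter help page for the installed version. The example is guarded so that it does nothing when corpus is not installed.

```r
tokens <- NULL

if (requireNamespace("corpus", quietly = TRUE)) {
  # Previously: token_filter(map_case = TRUE, drop_punct = TRUE)
  f <- corpus::text_filter(map_case = TRUE, drop_punct = TRUE)

  # Apply the filter when tokenizing; the result has one element per input text
  tokens <- corpus::text_tokens("The quick brown fox, right?", f)
  print(tokens)
}
```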