corpus (version 0.8.0)

corpus-deprecated: Deprecated Functions in Package corpus

Description

These functions are provided for compatibility with older versions of corpus only, and may be defunct as soon as the next release.

Usage

token_filter(map_case = TRUE, map_compat = TRUE,
             map_quote = TRUE, remove_ignorable = TRUE,
             stemmer = NULL, stem_except = drop,
             combine = abbreviations("english"),
             drop_letter = FALSE, drop_mark = FALSE,
             drop_number = FALSE, drop_punct = FALSE,
             drop_symbol = FALSE, drop_other = FALSE,
             drop = NULL, drop_except = NULL)

sentence_filter(crlf_break = FALSE, suppress = abbreviations("english"))

Arguments

map_case

a logical value indicating whether to apply Unicode case mapping to the text. For most languages, this transformation changes uppercase characters to their lowercase equivalents.

map_compat

a logical value indicating whether to apply the Unicode compatibility mappings to the characters, that is, the mappings required for the NFKC and NFKD normal forms.

map_quote

a logical value indicating whether to replace Unicode quote characters like single quote, double quote, and apostrophe, with an ASCII single quote (').

remove_ignorable

a logical value indicating whether to remove Unicode "default ignorable" characters like zero-width spaces and soft hyphens.

stemmer

a character value giving the name of the stemming algorithm, or NULL to leave words unchanged. The stemming algorithms are provided by the Snowball stemming library; the following algorithms are available: "arabic", "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "porter", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", and "turkish".

stem_except

a character vector of exception words to exempt from stemming, or NULL. If left unspecified, stem_except is set equal to the drop argument.

combine

a character vector of multi-word phrases to combine, or NULL; see ‘Combining words’.

drop_letter

a logical value indicating whether to replace "letter" tokens (cased letters, kana, ideographic characters, letter-like numeric characters, and other letters) with NA.

drop_mark

a logical value indicating whether to replace "mark" tokens (subscripts, superscripts, modifier letters, modifier symbols, and other marks) with NA.

drop_number

a logical value indicating whether to replace "number" tokens (decimal digits, words appearing to be numbers, and other numeric characters) with NA.

drop_punct

a logical value indicating whether to replace "punct" tokens (punctuation) with NA.

drop_symbol

a logical value indicating whether to replace "symbol" tokens (emoji, math, currency, and other symbols) with NA.

drop_other

a logical value indicating whether to replace "other" tokens (types that do not fall into any other categories) with NA.

drop

a character vector of types to replace with NA, or NULL.

drop_except

a character vector of types to exempt from the drop rules specified by the drop_letter, drop_mark, drop_number, drop_punct, drop_symbol, drop_other, and drop arguments, or NULL.

crlf_break

a logical value indicating whether to break sentences on carriage returns or line feeds.

suppress

a character vector of sentence break suppressions.
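
The interaction between drop, drop_except, and the stem_except default can be illustrated with a minimal base R sketch. Both apply_drop and make_filter are hypothetical helpers written for this illustration; they are not part of corpus, and they operate on an already-tokenized character vector rather than on the package's internal types.

```r
# Hypothetical helper (not part of corpus): mimics the documented drop rules.
# A token is replaced with NA when it appears in `drop`, unless it is
# also listed in `drop_except`.
apply_drop <- function(tokens, drop = NULL, drop_except = NULL) {
  dropped <- tokens %in% drop & !(tokens %in% drop_except)
  tokens[dropped] <- NA_character_
  tokens
}

# Hypothetical constructor (not part of corpus): shows how a default of
# `stem_except = drop` resolves lazily to the value of the `drop` argument
# when `stem_except` is left unspecified.
make_filter <- function(stem_except = drop, drop = NULL) {
  list(stem_except = stem_except, drop = drop)
}

tokens <- c("the", "cats", "and", "the", "dogs")

# "the" is dropped; "and" is listed in drop but exempted by drop_except.
result <- apply_drop(tokens, drop = c("the", "and"), drop_except = "and")

# stem_except omitted: it takes the value of drop.
f_default <- make_filter(drop = c("the", "and"))

# stem_except supplied: it overrides the default.
f_custom <- make_filter(stem_except = "cats", drop = c("the", "and"))
```

Because R evaluates default arguments lazily inside the function, the name drop in the default refers to the drop argument, not to the base function of the same name.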

Details

The token_filter and sentence_filter functions are deprecated; use text_filter instead.
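
A migration might look like the following sketch. It assumes that text_filter accepts the same drop_punct argument name as token_filter and that the resulting filter can be passed to text_tokens through its filter argument; check the text_filter help page for the installed version before relying on this. The sample string is made up for illustration.

```r
# Runs only when the corpus package is installed; otherwise toks stays NULL.
if (requireNamespace("corpus", quietly = TRUE)) {
  # Deprecated form:
  #   f <- corpus::token_filter(drop_punct = TRUE)
  # Assumed replacement, using the same argument name:
  f <- corpus::text_filter(drop_punct = TRUE)

  # Tokenize an illustrative string with the new filter.
  toks <- corpus::text_tokens("Hello, world!", filter = f)
} else {
  toks <- NULL
}
```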

See Also

Deprecated, text_filter