stri_opts_collator: Generate a List with Collator Settings

Description

A convenience function to tune the ICU Collator's behavior, e.g., in stri_compare, stri_order, stri_unique, stri_duplicated, as well as stri_detect_coll and other stringi-search-coll functions.

Usage

stri_opts_collator(
  locale = NULL,
  strength = 3L,
  alternate_shifted = FALSE,
  french = FALSE,
  uppercase_first = NA,
  case_level = FALSE,
  normalization = FALSE,
  normalisation = normalization,
  numeric = FALSE
)
stri_coll(
  locale = NULL,
  strength = 3L,
  alternate_shifted = FALSE,
  french = FALSE,
  uppercase_first = NA,
  case_level = FALSE,
  normalization = FALSE,
  normalisation = normalization,
  numeric = FALSE
)

Value

Returns a named list object; missing settings are left with default values.
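
As a minimal sketch of how the returned list might be reused, the snippet below builds one set of options and passes it to two collation-aware functions through their opts_collator argument. The vector x is made-up illustration data, and the printed structure of opts may vary between stringi versions.

```r
library(stringi)

# Build the settings once; the result is a plain named list.
opts <- stri_opts_collator(numeric = TRUE, strength = 2L)
str(opts)

# Hypothetical sample data, for illustration only.
x <- c("file10", "file2", "File1")

# Reuse the same list across calls via opts_collator.
sorted <- stri_sort(x, opts_collator = opts)
ord <- stri_order(x, opts_collator = opts)
print(sorted)
print(ord)
```

Keeping the settings in one object avoids repeating the same arguments in every call and makes it harder for two calls to end up with inconsistent collation rules.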

Arguments

locale: single string, NULL or '' for default locale
strength: single integer in {1,2,3,4}, which defines collation strength; 1 for the most permissive collation rules, 4 for the strictest ones
alternate_shifted: single logical value; FALSE treats all the code points with non-ignorable primary weights in the same way, TRUE causes code points with primary weights that are equal to or below the variable top value to be ignored on the primary level and moved to the quaternary level
french: single logical value; used in Canadian French; TRUE results in secondary weights being considered backwards
uppercase_first: single logical value; NA orders upper and lower case letters according to their tertiary weights, TRUE forces upper case letters to sort before lower case letters, FALSE does the opposite
case_level: single logical value; controls whether an extra case level (positioned before the third level) is generated or not
normalization: single logical value; if TRUE, an incremental check is performed to see whether the input data is in the FCD form. If the data is not in the FCD form, incremental NFD normalization is performed
normalisation: alias of normalization
numeric: single logical value; when turned on, this attribute generates a collation key for the numeric value of substrings of digits; this is a way to get '100' to sort AFTER '2'; note that negative or non-integer numbers will not be ordered properly
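
The effect of a few of these settings can be sketched as follows. The strings are arbitrary illustration data; the settings are passed directly to the comparison and sorting functions, which forward them to stri_opts_collator. The result of the first comparison may depend on the default locale, so only its being non-zero is assumed here.

```r
library(stringi)

# strength: at the default tertiary level (3), case differences count.
cmp_tertiary <- stri_cmp("a", "A")                   # non-zero
# At the secondary level (2), case differences are ignored.
cmp_secondary <- stri_cmp("a", "A", strength = 2L)   # 0

# At the primary level (1), accents are ignored as well.
eq_primary <- stri_cmp_equiv("resume", "r\u00e9sum\u00e9", strength = 1L)
eq_secondary <- stri_cmp_equiv("resume", "r\u00e9sum\u00e9", strength = 2L)

# uppercase_first: choose which case comes first among otherwise equal letters.
letters_mixed <- c("b", "A", "a", "B")
upper_first <- stri_sort(letters_mixed, uppercase_first = TRUE)
lower_first <- stri_sort(letters_mixed, uppercase_first = FALSE)

print(c(cmp_tertiary, cmp_secondary))
print(c(eq_primary, eq_secondary))
print(upper_first)
print(lower_first)
```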

Author

Marek Gagolewski and other contributors

Details

ICU's collator performs a locale-aware string comparison that follows natural-language conventions. This is a more reliable way of establishing relationships between strings than the one provided by base R, and it is considerably more sophisticated and appropriate than ordinary bytewise comparison.
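
To illustrate what locale-aware means in practice, the sketch below sorts the same two words under Polish and Slovak rules (in Slovak, "ch" is treated as a single letter that follows "h") and contrasts a collated sort with a C-locale sort from base R. The word lists are illustration data, and the results assume that the ICU build in use ships collation data for these locales.

```r
library(stringi)

words <- c("hladny", "chladny")

# Polish: "c" sorts before "h", so "chladny" comes first.
polish <- stri_sort(words, locale = "pl_PL")
# Slovak: the digraph "ch" sorts after "h", so "hladny" comes first.
slovak <- stri_sort(words, locale = "sk_SK")

# Collation versus byte order: base R's radix method sorts character
# vectors in the C locale, which places all upper case letters first.
letters_mixed <- c("b", "A", "a", "B")
collated <- stri_sort(letters_mixed, locale = "en_US")
bytewise <- sort(letters_mixed, method = "radix")

print(polish)
print(slovak)
print(collated)
print(bytewise)
```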

References

Collation -- ICU User Guide, https://unicode-org.github.io/icu/userguide/collation/

ICU Collation Service Architecture -- ICU User Guide, https://unicode-org.github.io/icu/userguide/collation/architecture.html

icu::Collator Class Reference -- ICU4C API Documentation, https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/classicu_1_1Collator.html

Examples


stri_cmp('number100', 'number2')
stri_cmp('number100', 'number2', opts_collator=stri_opts_collator(numeric=TRUE))
stri_cmp('number100', 'number2', numeric=TRUE) # equivalent
stri_cmp('above mentioned', 'above-mentioned')
stri_cmp('above mentioned', 'above-mentioned', alternate_shifted=TRUE)
