stri_split_boundaries: Split a String at Text Boundaries

Description

This function locates text boundaries (like character, word, line, or sentence boundaries) and splits strings at the indicated positions.

Usage

stri_split_boundaries(str, n = -1L, tokens_only = FALSE,
  simplify = FALSE, ..., opts_brkiter = NULL)

Arguments

str

character vector or an object coercible to

integer vector, maximal number of strings to return

tokens_only

single logical value; may affect the result if n is positive, see Details

simplify

single logical value; if TRUE or NA, then a character matrix is returned; otherwise (the default), a list of character vectors is given, see Value

...

additional settings for opts_brkiter

opts_brkiter

a named list with ICU BreakIterator's settings, see stri_opts_brkiter; NULL for the default break iterator, i.e., line_break

Value

If simplify=FALSE (the default), then the functions return a list of character vectors.

Otherwise, stri_list2matrix with byrow=TRUE and n_min=n arguments is called on the resulting object. In such a case, a character matrix with length(str) rows is returned. Note that stri_list2matrix's fill argument is set to an empty string and NA, for simplify equal to TRUE and NA, respectively.

Details

Vectorized over str and n.

If n is negative (the default), then all text pieces are extracted.

Otherwise, if tokens_only is FALSE (this is the default, for compatibility with the stringr package), then n-1 tokens are extracted (if possible) and the n-th string gives the (non-split) remainder (see Examples). On the other hand, if tokens_only is TRUE, then only full tokens (up to n pieces) are extracted.

For more information on text boundary analysis performed by ICU's BreakIterator, see stringi-search-boundaries.

Examples

Run this code

# NOT RUN {
test <- "The\u00a0above-mentioned    features are very useful. " %s+%
   "Kudos to their developers. 123 456 789"
stri_split_boundaries(test, type="line")
stri_split_boundaries(test, type="word")
stri_split_boundaries(test, type="word", skip_word_none=TRUE)
stri_split_boundaries(test, type="word", skip_word_none=TRUE, skip_word_letter=TRUE)
stri_split_boundaries(test, type="word", skip_word_none=TRUE, skip_word_number=TRUE)
stri_split_boundaries(test, type="sentence")
stri_split_boundaries(test, type="sentence", skip_sentence_sep=TRUE)
stri_split_boundaries(test, type="character")

# a filtered break iterator with the new ICU:
stri_split_boundaries("Mr. Jones and Mrs. Brown are very happy.
So am I, Prof. Smith.", type="sentence", locale="en_US@ss=standard") # ICU >= 56 only

# }

Run the code above in your browser using DataLab