Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text.
Examples of the boundary analysis process include:
Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing, see stri_wrap and stri_split_boundaries.
Counting characters, words, sentences, or paragraphs, see stri_count_boundaries.
Making a list of the unique words in a document, cf. stri_extract_all_words and then stri_unique.
Capitalizing the first letter of each word or sentence, see also stri_trans_totitle.
Locating a particular unit of the text (for example, finding the third word in the document), see stri_locate_all_boundaries.
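As a rough illustration of the tasks listed above, the following sketch applies the referenced functions to a short English sample. The sample string and the variable names are made up for this example, and stri_count_words and stri_locate_all_words are used here as convenience wrappers in place of the more general boundary functions. Results on other text depend on the locale and on the ICU version in use.

```r
library(stringi)

# Hypothetical sample text, chosen only for illustration.
txt <- "The quick brown fox. The lazy dog!"

# Word-wrap the text to a narrow width (returns one string per output line).
wrapped <- stri_wrap(txt, width = 20)

# Count words (whitespace and punctuation are not counted) and sentences.
n_words     <- stri_count_words(txt)
n_sentences <- stri_count_boundaries(txt, type = "sentence")

# List the unique words in the text.
words      <- stri_extract_all_words(txt)[[1]]
uniq_words <- stri_unique(words)

# Capitalize the first letter of each word.
titled <- stri_trans_totitle("the quick brown fox")

# Locate the third word: row 3 of the start/end position matrix.
loc        <- stri_locate_all_words(txt)[[1]][3, ]
third_word <- stri_sub(txt, loc[1], loc[2])
```

Here each helper returns a list or matrix per input string, which is why the first element is taken with [[1]] before indexing.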
Generally, text boundary analysis is a locale-dependent operation. For example, Japanese and Chinese do not separate words with spaces, so a line break can occur even in the middle of a word. These languages also have punctuation and diacritical marks that cannot start or end a line, which must be taken into account as well.
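To see this locale dependence in practice, one might split a Japanese string that contains no spaces. This is only a sketch: the sample phrase is an assumption made for illustration, and the exact segmentation is not shown because it depends on the dictionaries shipped with the installed ICU version.

```r
library(stringi)

# Hypothetical sample: a Japanese phrase written without spaces,
# given as Unicode escapes so the script does not depend on file encoding.
ja <- "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8\u3067\u3059"

# Word segmentation under Japanese rules; no spaces are needed.
ja_words <- stri_split_boundaries(ja, type = "word", locale = "ja")

# Possible line-break opportunities under the same locale.
ja_lines <- stri_split_boundaries(ja, type = "line_break", locale = "ja")

print(ja_words[[1]])
print(ja_lines[[1]])
```

The type and locale arguments are passed on to stri_opts_brkiter, so the same settings could also be supplied through the opts_brkiter argument.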
stringi uses ICU's BreakIterator to locate specific text boundaries. Note that the BreakIterator's behavior may be controlled in some cases, see stri_opts_brkiter.
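A minimal sketch of such control, assuming the skip_word_none option of stri_opts_brkiter, which drops segments that are not words (whitespace and punctuation). The input string is made up for the example.

```r
library(stringi)

x <- "Hello, world!"

# Default word iterator: every segment between boundaries is returned,
# including punctuation and spaces.
all_tokens <- stri_split_boundaries(x, type = "word")

# Same iterator, but segments classified as "none" (spaces, punctuation)
# are skipped, leaving only the words.
words_only <- stri_split_boundaries(
  x,
  opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)

print(all_tokens[[1]])
print(words_only[[1]])  # "Hello" "world"
```

Settings can be given either through opts_brkiter, as in the second call, or directly as named arguments, as in the first.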
The character boundary iterator tries to match what a user would think of as a "character" (a basic unit of a writing system for a language), which may be more than just a single Unicode code point.
The word boundary iterator locates the boundaries of words, for purposes such as "Find whole words" operations.
The line_break iterator locates positions that would be appropriate points to wrap lines when displaying the text.
On the other hand, a break iterator of type sentence locates sentence boundaries.
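The four iterator types can be compared on small inputs, as in the sketch below. The sample strings are assumptions chosen for illustration, and the line_break result is printed rather than stated, since the break opportunities follow ICU's line-breaking rules.

```r
library(stringi)

# character: "e" followed by a combining acute accent is two code points
# but a single user-perceived character.
accented  <- "e\u0301"
n_codepts <- stri_length(accented)                               # 2
n_chars   <- stri_count_boundaries(accented, type = "character") # 1

# word: count words only, skipping spaces and punctuation.
n_words <- stri_count_boundaries("Find whole words",
                                 type = "word", skip_word_none = TRUE)

# line_break: segments that end at possible wrapping points.
wrap_src    <- "Wrap this text, please."
line_pieces <- stri_split_boundaries(wrap_src, type = "line_break")[[1]]
print(line_pieces)

# sentence: one segment per sentence.
sentences <- stri_split_boundaries("One. Two? Three!", type = "sentence")[[1]]
```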
For technical details on different classes of text boundaries refer to the ICU User Guide, see below.
Boundary Analysis -- ICU User Guide, http://userguide.icu-project.org/boundaryanalysis
Other locale_sensitive: %s<%, stri_compare, stri_count_boundaries, stri_duplicated, stri_enc_detect2, stri_extract_all_boundaries, stri_locate_all_boundaries, stri_opts_collator, stri_order, stri_split_boundaries, stri_trans_tolower, stri_unique, stri_wrap, stringi-locale, stringi-search-coll
Other text_boundaries: stri_count_boundaries, stri_extract_all_boundaries, stri_locate_all_boundaries, stri_opts_brkiter, stri_split_boundaries, stri_split_lines, stri_trans_tolower, stri_wrap, stringi-search
Other stringi_general_topics: stringi-arguments, stringi-encoding, stringi-locale, stringi-package, stringi-search-charclass, stringi-search-coll, stringi-search-fixed, stringi-search-regex, stringi-search