stri_enc_isutf8: Check If a Data Stream Is Possibly in UTF-8
Description
Checks whether each given sequence of bytes forms
a valid UTF-8 string.
Usage
stri_enc_isutf8(str)
Arguments
str
a character vector, a raw vector, or
a list of raw vectors
Value
Returns a logical vector.
Its i-th element indicates whether the i-th string
corresponds to a valid UTF-8 byte sequence.
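As a minimal sketch of the three accepted input forms and the shape of the result (assuming the stringi package is installed; the byte values and variable names are illustrative only):

```r
library(stringi)

# Character vector: one logical value per string.
# ASCII-only strings are trivially valid UTF-8.
ascii_result <- stri_enc_isutf8(c("abc", "123"))

# Single raw vector: treated as one byte sequence, so one logical value.
# 0xC4 0x85 is a complete two-byte UTF-8 sequence.
raw_result <- stri_enc_isutf8(as.raw(c(0xc4, 0x85)))

# List of raw vectors: one logical value per list element.
# A lone 0xC4 lead byte is truncated; 0xFF never occurs in UTF-8.
list_result <- stri_enc_isutf8(list(
    as.raw(c(0xc4, 0x85)),
    as.raw(0xc4),
    as.raw(0xff)
))
```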
Details
A negative answer means that a string is certainly not valid UTF-8.
A positive answer does not guarantee that the data was meant to be UTF-8.
For example, the bytes (c4,85) are valid UTF-8 for
"Polish a with ogonek", but in WINDOWS-1250 the same bytes
represent two characters, "A umlaut" followed by "Ellipsis".
Also note that ASCII is a subset of UTF-8, as well as of most 8-bit encodings
(so stri_enc_isascii => stri_enc_isutf8); an ASCII-only string is therefore
valid in all of them. However, the longer a sequence containing non-ASCII bytes,
the more likely it is that a positive result
means the data is indeed in UTF-8, because not all sequences of bytes
are valid UTF-8.
This function is independent of the way R marks encodings in
character strings (see Encoding and stringi-encoding).
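The ambiguity of a positive answer can be illustrated with the (c4,85) bytes mentioned above. This is only a sketch: it assumes stringi is installed and that its stri_encode function accepts the encoding name "windows-1250"; the variable names are hypothetical.

```r
library(stringi)

bytes <- as.raw(c(0xc4, 0x85))

# The bytes pass the UTF-8 check ...
is_utf8 <- stri_enc_isutf8(bytes)

# ... yet they decode to different text depending on the assumed encoding.
as_utf8   <- stri_encode(bytes, from = "UTF-8", to = "UTF-8")         # a with ogonek
as_cp1250 <- stri_encode(bytes, from = "windows-1250", to = "UTF-8")  # A umlaut, ellipsis

# ASCII-only input satisfies both checks, so it says nothing about UTF-8 specifically.
ascii_ok <- stri_enc_isascii("abc") && stri_enc_isutf8("abc")

# A byte that can never appear in UTF-8 makes the negative answer definitive.
not_utf8 <- stri_enc_isutf8(as.raw(c(0xc4, 0x85, 0xff)))
```

A positive result is best read as "consistent with UTF-8"; if the source encoding is unknown, the check alone cannot distinguish the two readings of these bytes.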