stri_enc_toutf8(str, is_unknown_8bit = FALSE, validate = FALSE)
validate: a logical value (can be NA), see Details.

If is_unknown_8bit is set to FALSE (the default), then R encoding marks are used, see stri_enc_mark.
Bytes-marked strings will cause the function to fail.

If a string is in UTF-8 and has a byte order mark (BOM), then the BOM will be silently removed from the output string.

If the default encoding is UTF-8, see stri_enc_get, then strings marked with native are, for efficiency reasons, returned as-is, i.e., with unchanged markings. A similar behavior is observed when calling enc2utf8.

For is_unknown_8bit=TRUE, if a string is declared to be neither in ASCII nor in UTF-8, then all byte codes > 127 are replaced with the Unicode REPLACEMENT CHARACTER (\Ufffd). Note that the REPLACEMENT CHARACTER may be interpreted as the Unicode missing value for single characters. Here, a bytes-marked string is assumed to use an 8-bit encoding that has ASCII as a subset.

What is more, in both cases, setting validate to TRUE or NA validates the resulting UTF-8 byte stream.
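
As a rough illustration of the two is_unknown_8bit cases, the sketch below converts the same Latin-1 bytes under both settings. It assumes the stringi package is installed. The variable names (x, b, marked, unknown, and so on) are made up for this example, and the outcomes noted in the comments are what the description above leads one to expect, not verified output.

```r
library(stringi)

# Bytes of "cafe" with an e-acute in Latin-1 (0xE9), built from raw values
# so that the example does not depend on the encoding of this file.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
Encoding(x) <- "latin1"

# Default mode (is_unknown_8bit = FALSE): the latin1 mark is honoured,
# so 0xE9 should be converted to U+00E9.
marked <- stri_enc_toutf8(x)

# is_unknown_8bit = TRUE: the string is declared neither ASCII nor UTF-8,
# so the byte above 127 should become U+FFFD.
unknown <- stri_enc_toutf8(x, is_unknown_8bit = TRUE)

# A bytes-marked copy should be rejected in the default mode ...
b <- x
Encoding(b) <- "bytes"
default_on_bytes <- tryCatch(stri_enc_toutf8(b), error = function(e) e)

# ... but treated as an ASCII-compatible 8-bit string otherwise.
unknown_on_bytes <- stri_enc_toutf8(b, is_unknown_8bit = TRUE)
```

Comparing marked with unknown shows the trade-off: the default mode trusts the declared encoding and keeps the character, while is_unknown_8bit = TRUE keeps only the ASCII part and marks everything else as unknown.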
If validate=TRUE, then any incorrect byte sequences will be replaced with the REPLACEMENT CHARACTER. This option may be used in the (very rare in practice) case in which you want to fix an invalid UTF-8 byte sequence. For NA, a bogus string will be replaced with a missing value.

stri_enc_fromutf32, stri_enc_toascii, stri_enc_tonative, stri_enc_toutf32, stri_encode, stringi-encoding
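
A minimal sketch of the validate settings described above, assuming the stringi package is installed. The input string and the names bad, fixed, and dropped are invented for illustration. The call with validate = NA is wrapped in suppressWarnings() in case a warning is signalled for the invalid input.

```r
library(stringi)

# The byte 0xFF never occurs in well-formed UTF-8. Mark the string as
# UTF-8 anyway to simulate a corrupted input.
bad <- rawToChar(as.raw(c(0x61, 0xff, 0x62)))
Encoding(bad) <- "UTF-8"

# validate = TRUE: the incorrect byte sequence should be replaced
# with U+FFFD, leaving the surrounding ASCII letters intact.
fixed <- stri_enc_toutf8(bad, validate = TRUE)

# validate = NA: the whole bogus string should become a missing value.
dropped <- suppressWarnings(stri_enc_toutf8(bad, validate = NA))
```

Use validate = TRUE when a partially readable string is still useful, and validate = NA when invalid input should be easy to detect afterwards with is.na().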