Encoding(x)
Encoding(x) <- value
enc2native(x)
enc2utf8(x)

Character strings in R can be declared to be encoded in
"latin1"
or "UTF-8" or "bytes". These declarations can be read by Encoding,
which will return a character vector of values "latin1", "UTF-8",
"bytes" or "unknown", or set, when value is recycled as needed and
other values are silently treated as "unknown". ASCII strings will
never be marked with a declared encoding, since their representation
is the same in all supported encodings. Strings marked as "bytes" are
intended to be non-ASCII strings which should be manipulated as bytes,
and never converted to a character encoding.

enc2native and enc2utf8 convert elements of character vectors to the
native encoding or UTF-8 respectively, taking any marked encoding into
account. They are primitive functions, designed to do minimal copying.
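
As a minimal sketch of reading and setting declarations, the following marks a vector, converts it with enc2utf8, and then assigns an unsupported value. The variable names are illustrative only and are not part of the help page.

```r
# Illustrative sketch; all variable names here are hypothetical.
x <- c("abc", "fa\xE7ile")   # second element carries a raw latin1 byte (0xE7)
Encoding(x) <- "latin1"      # a single value is recycled over both elements
marks <- Encoding(x)         # the pure-ASCII element is never marked

u <- enc2utf8(x)             # re-encodes, taking the latin1 mark into account
utf8_marks <- Encoding(u)

y <- x[2]
Encoding(y) <- "nonsense"    # unsupported values are treated as "unknown"
reset_mark <- Encoding(y)

print(marks)
print(utf8_marks)
print(reset_mark)
```

The first two results should differ only in the second element ("latin1" before conversion, "UTF-8" after), while the ASCII element reports "unknown" throughout.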

There are other ways for character strings to acquire a declared
encoding apart from explicitly setting it (and these have changed as
R has evolved). Functions scan, read.table, readLines, and parse have
an encoding argument that is used to declare encodings, iconv declares
encodings from its to argument, and console input in suitable locales
is also declared. intToUtf8 declares its output as "UTF-8", and output
text connections (see textConnection) are marked if running in a
suitable locale. Under some circumstances (see its help page)
source(encoding=) will mark encodings of character strings it outputs.
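
A short sketch of two of these implicit routes: intToUtf8 marks its non-ASCII output as "UTF-8", and iconv (with its default mark = TRUE) marks its result according to the target encoding. The variable names are illustrative, and the sketch assumes the platform's iconv supports "latin1".

```r
# Illustrative sketch; assumes "latin1" is available to iconv on this platform.
s <- intToUtf8(c(233L, 116L, 233L))            # code points for e-acute, t, e-acute
mark_s <- Encoding(s)                          # declared by intToUtf8

l <- iconv(s, from = "UTF-8", to = "latin1")   # re-encode to single bytes
mark_l <- Encoding(l)                          # declared by iconv

print(c(mark_s, mark_l))
print(charToRaw(l))                            # three bytes: e9 74 e9
```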

Most character manipulation functions will set the encoding on output
strings if it was declared on the corresponding input. These include
chartr, strsplit(useBytes = FALSE), tolower and toupper as well as
sub(useBytes = FALSE) and gsub(useBytes = FALSE). Note that such
functions do not preserve the encoding, but if they know the input
encoding and that the string has been successfully re-encoded (to the
current encoding or UTF-8), they mark the output.

substr does preserve the encoding, and chartr, tolower and toupper
preserve UTF-8 encoding on systems with Unicode wide characters. With
their fixed and perl options, strsplit, sub and gsub will give a
marked UTF-8 result if any of the inputs are UTF-8.
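
The behaviour described above can be checked with a small sketch. It only inspects the marks, not the converted characters, since case conversion of non-ASCII characters can depend on the platform; the toupper result assumes a system with Unicode wide characters, as noted in the text. Variable names are illustrative.

```r
# Illustrative sketch; inspects encoding marks only.
s <- intToUtf8(c(233L, 116L, 233L))          # non-ASCII string marked "UTF-8"

first2  <- substr(s, 1, 2)                   # still contains a non-ASCII character
upper   <- toupper(s)                        # assumes Unicode wide characters
swapped <- gsub("t", "T", s, fixed = TRUE)   # fixed = TRUE with a UTF-8 input
marks   <- Encoding(c(first2, upper, swapped))

ascii_part <- substr(s, 2, 2)                # just "t": pure ASCII
ascii_mark <- Encoding(ascii_part)           # ASCII results are never marked

print(marks)
print(ascii_mark)
```

Note that the ASCII-only substring loses its mark, consistent with the rule that ASCII strings are never marked.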

paste and sprintf return elements marked as bytes if any of the
corresponding inputs is marked as bytes, and otherwise marked as UTF-8
if any of the inputs is marked as UTF-8.
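
A minimal sketch of this precedence for paste, using one UTF-8 string and one string declared as bytes (the variable names are illustrative):

```r
# Illustrative sketch of how paste() chooses the mark of its result.
u <- intToUtf8(233L)           # single non-ASCII character, marked "UTF-8"
b <- "\xE9"
Encoding(b) <- "bytes"         # same code point as a raw byte, marked "bytes"

mark_ascii <- Encoding(paste("a", "b"))   # all-ASCII inputs
mark_utf8  <- Encoding(paste("a", u))     # one UTF-8 input
mark_bytes <- Encoding(paste(u, b))       # "bytes" takes precedence over UTF-8

print(c(mark_ascii, mark_utf8, mark_bytes))
```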

match, pmatch, charmatch, duplicated and unique all match in UTF-8 if
any of the elements are marked as UTF-8.
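
To illustrate, the sketch below builds the same text in two encodings. The byte representations differ, yet matching treats the two strings as equal because the comparison is done in UTF-8. Variable names are illustrative, and the sketch assumes iconv supports "latin1" on the platform.

```r
# Illustrative sketch: same text, two encodings, compared by match() and friends.
u <- intToUtf8(c(102L, 97L, 231L, 105L, 108L, 101L))   # marked "UTF-8"
l <- iconv(u, from = "UTF-8", to = "latin1")           # marked "latin1"

same_bytes <- identical(charToRaw(u), charToRaw(l))    # representations differ
pos  <- match(l, u)                                    # compared after translation
dups <- duplicated(c(u, l))
uniq <- unique(c(u, l))

print(same_bytes)
print(pos)
print(dups)
print(length(uniq))
```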
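
## x is intended to be in latin1
x <- "fa\xE7ile"
Encoding(x)                         # read the current mark
Encoding(x) <- "latin1"             # declare the encoding; the bytes are unchanged
x
xx <- iconv(x, "latin1", "UTF-8")   # re-encode to UTF-8; the result is marked
Encoding(c(x, xx))                  # "latin1" "UTF-8"
c(x, xx)
Encoding(xx) <- "bytes"             # from now on treat xx as raw bytes
xx # will be encoded in hex
cat("xx = ", xx, "\n", sep = "")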