This page gives an overview of encoding handling in stringdist.
All character strings are stored as sequences of bytes. An encoding system relates a byte, or a short sequence of bytes, to a symbol. Over the years, many encoding systems have been developed, and not all operating systems and software packages use the same encoding by default. Similarly, depending on the system R is running on, R may use a different encoding for storing strings internally.
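As an illustration of how one symbol maps to different bytes, the sketch below stores the u-umlaut in two encodings using base R only. It assumes the local iconv implementation supports conversion to "latin1", which is common but not guaranteed.

```r
# The same symbol, u-umlaut (Unicode code point U+00FC), in two encodings.
x_utf8 <- enc2utf8("\u00fc")
# Assumption: the platform's iconv knows the "latin1" encoding name.
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# Inspect the raw bytes behind each string.
bytes_utf8 <- charToRaw(x_utf8)      # two bytes: c3 bc
bytes_latin1 <- charToRaw(x_latin1)  # one byte: fc

print(bytes_utf8)
print(bytes_latin1)
```

Both strings print as the same symbol, yet a byte-wise comparison would treat them as different.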
The stringdist package is designed so that users, in principle, need not worry about this. By default, strings are converted to UTF-32 (stored as unsigned integers) prior to any further computation. This means that results are encoding-independent and that strings are interpreted as sequences of symbols, not as sequences of raw bytes. In functions where this is relevant, this conversion may be switched off by setting the useBytes option to TRUE. However, keep in mind that results will then likely depend on the system R is running on, except when your strings are pure ASCII. Also, for multi-byte encodings, results of byte-wise computations will usually differ from results of computations on the decoded symbols.
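The difference between symbol-wise and byte-wise computation can be sketched as follows. The base R part counts symbols and bytes of a UTF-8 string. The stringdist part only runs if the package is installed, and the byte-wise distance it prints may vary with how the string is stored on your system.

```r
# One symbol, stored as two bytes in UTF-8.
x <- enc2utf8("\u00fc")
n_symbols <- nchar(x, type = "chars")  # 1
n_bytes <- nchar(x, type = "bytes")    # 2

if (requireNamespace("stringdist", quietly = TRUE)) {
  # Default: compare sequences of symbols (one substitution, u-umlaut -> u).
  d_symbols <- stringdist::stringdist(x, "u", method = "lv")
  # Byte-wise: compare the stored bytes instead; system-dependent.
  d_bytes <- stringdist::stringdist(x, "u", method = "lv", useBytes = TRUE)
  print(c(symbols = d_symbols, bytes = d_bytes))
}
```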
Prior to stringdist version 0.9, setting useBytes=TRUE could give a significant performance enhancement. Since version 0.9, translation to integer is done by C code internal to stringdist and the difference in performance is now negligible.
In UTF-8, the same (accented) character may be represented by several byte sequences. For example, a u-umlaut can be represented as a single precomposed code point, or as the code point for 'u' followed by a combining code point that adds the umlaut. The stringi package of Gagolewski and Tartanus offers Unicode normalisation tools.
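A minimal sketch of the two representations, and of normalising one into the other, is given below. The code point inspection uses base R. The normalisation step uses stringi::stri_trans_nfc and only runs if stringi is installed. The variable names are illustrative.

```r
# Two representations of u-umlaut.
composed <- enc2utf8("\u00fc")  # single precomposed code point
decomposed <- "u\u0308"         # 'u' followed by a combining diaeresis

cp_composed <- utf8ToInt(composed)      # 252
cp_decomposed <- utf8ToInt(decomposed)  # 117 776

if (requireNamespace("stringi", quietly = TRUE)) {
  # NFC normalisation composes 'u' + combining mark into one code point.
  normalised <- stringi::stri_trans_nfc(decomposed)
  print(normalised == composed)  # TRUE
}
```

Normalising all input to one form before computing distances avoids counting visually identical strings as different.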
Some algorithms (like soundex) are defined only on the printable ASCII character set. This excludes any character with accents, for example. Translating accented characters to the non-accented ones is a form of transliteration. On many systems running R (but not all!) you can achieve this with iconv(x, to="ASCII//TRANSLIT"), where x is your character vector. See the documentation of iconv for details.
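The iconv approach might look like the sketch below. The example vector is made up for illustration, and no expected output is shown because the result depends on the platform's iconv implementation: some systems drop the accent, some substitute other characters, and some return NA.

```r
# Hypothetical input with accented characters, stored as UTF-8.
x <- enc2utf8(c("M\u00fcller", "caf\u00e9", "plain"))

# Platform-dependent transliteration to ASCII.
y <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

print(y)
# Elements that could not be converted are returned as NA.
print(is.na(y))
```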
The stringi package (Gagolewski and Tartanus) should work on any system. The command stringi::stri_trans_general(x,"Latin-ASCII") transliterates character vector x to ASCII.
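A short sketch of the stringi route, using a made-up example vector. It only runs if stringi is installed.

```r
# Hypothetical input with accented characters.
x <- c("M\u00fcller", "caf\u00e9", "plain")

if (requireNamespace("stringi", quietly = TRUE)) {
  # Transliterate Latin-script characters to their closest ASCII form.
  y <- stringi::stri_trans_general(x, "Latin-ASCII")
  print(y)  # "Muller" "cafe" "plain"
}
```

The transliterated vector can then be passed to algorithms that are defined on printable ASCII only.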
Functions using re-encoding: stringdist, stringdistmatrix, amatch, ain, qgrams
Encoding related: printable_ascii