stringdist-encoding: Encoding handling in stringdist

Description

This page gives an overview of encoding handling in stringdist.

Encoding in stringdist

All character strings are stored as a sequence of bytes. An encoding system relates a byte, or a short sequence of bytes, to a symbol. Over the years, many encoding systems have been developed, and not all operating systems and software use the same default encoding. Similarly, depending on the system R is running on, R may use a different encoding for storing strings internally.
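As an illustration of how one symbol maps to different byte sequences, the following sketch uses only base R. The variable names are illustrative, and the choice of u-umlaut with UTF-8 and Latin-1 is just one convenient example.

```r
# "\u00fc" is u-umlaut; enc2utf8() makes sure it is stored as UTF-8.
x_utf8 <- enc2utf8("\u00fc")
# The same symbol re-encoded as Latin-1.
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

charToRaw(x_utf8)    # two bytes: c3 bc
charToRaw(x_latin1)  # one byte:  fc

# One symbol either way, but a different number of bytes.
nchar(x_utf8, type = "chars")    # 1
nchar(x_utf8, type = "bytes")    # 2
nchar(x_latin1, type = "bytes")  # 1
```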

The stringdist package is designed so that users in principle need not worry about this. By default, strings are converted to UTF-32 (unsigned integers) prior to any further computation. This means that results are encoding-independent and that strings are interpreted as a sequence of symbols, not as a sequence of raw bytes. In functions where this is relevant, the conversion may be switched off by setting the useBytes option to TRUE. However, keep in mind that results will then likely depend on the system R is running on, except when your strings are pure ASCII. Also, for multi-byte encodings, results of byte-wise computations will usually differ from results of symbol-wise computations.
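The difference between symbol-wise and byte-wise computation can be seen in a small sketch. It assumes the stringdist package is installed and uses the Levenshtein distance (method = "lv") purely as an example. The input is forced to UTF-8 with enc2utf8() so that the byte-wise result does not depend on the platform.

```r
library(stringdist)

a <- enc2utf8("\u00fc")  # u-umlaut, stored as the two UTF-8 bytes c3 bc
b <- "u"                 # plain ASCII, one byte

# Default: compare symbols. One substitution turns a into b.
d_symbols <- stringdist(a, b, method = "lv")               # 1
# Byte-wise: two bytes against one byte needs two edits.
d_bytes <- stringdist(a, b, method = "lv", useBytes = TRUE)  # 2

# Pure ASCII input gives the same answer either way.
d_ascii <- stringdist("kitten", "sitting", method = "lv")
d_ascii_bytes <- stringdist("kitten", "sitting", method = "lv", useBytes = TRUE)
```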

Prior to stringdist version 0.9, setting useBytes=TRUE could give a significant performance enhancement. Since version 0.9, translation to integer is done by C code internal to stringdist and the difference in performance is now negligible.

Unicode normalisation

In UTF-8, the same (accented) character may be represented by several different byte sequences. For example, a u-umlaut can be represented by a single precomposed code point (U+00FC), or by the code point for 'u' followed by a combining code point (U+0308) that adds the umlaut. The two forms are different sequences of symbols, so they are not treated as equal unless the strings are normalised first. The stringi package of Gagolewski and Tartanus offers Unicode normalisation tools.
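A minimal sketch of the effect of normalisation, assuming both stringi and stringdist are installed. The function stri_trans_nfc() is stringi's NFC normaliser. The Levenshtein distance is again only an example choice.

```r
library(stringi)
library(stringdist)

composed   <- enc2utf8("\u00fc")   # single code point U+00FC
decomposed <- enc2utf8("u\u0308")  # 'u' followed by combining diaeresis U+0308

utf8ToInt(composed)    # 252
utf8ToInt(decomposed)  # 117 776

# Without normalisation the two strings are two edits apart.
d_raw <- stringdist(composed, decomposed, method = "lv")   # 2

# After NFC normalisation the decomposed form becomes the single code point.
normalised <- stri_trans_nfc(decomposed)
d_nfc <- stringdist(composed, normalised, method = "lv")   # 0
```

Normalising all input to one form before computing distances avoids such spurious differences.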

Some tips on character encoding and transliteration

Some algorithms (like soundex) are defined only on the printable ASCII character set. This excludes, for example, any accented character. Translating accented characters to their non-accented counterparts is a form of transliteration. On many systems running R (but not all!) you can achieve this with

iconv(x,to="ASCII//TRANSLIT"),

where x is your character vector. See the documentation of iconv for details.

The stringi package (Gagolewski and Tartanus) should work on any system. The command stringi::stri_trans_general(x,"Latin-ASCII") transliterates character vector x to ASCII.
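Both routes can be tried side by side. The sketch below assumes stringi and stringdist are installed, and the example names are made up for illustration. The iconv() call states the source encoding explicitly, and no output is shown for it because the result depends on the system.

```r
library(stringi)
library(stringdist)

x <- enc2utf8(c("M\u00fcller", "caf\u00e9"))  # "Müller", "café"

# Portable route via stringi.
x_ascii <- stri_trans_general(x, "Latin-ASCII")  # "Muller" "cafe"

# Soundex codes computed on the transliterated strings.
codes <- phonetic(x_ascii, method = "soundex")

# iconv route; the output is system-dependent.
x_iconv <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
```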

References