latexify: Convert Character Strings for Use with LaTeX

Description

Some characters cannot be entered directly into a LaTeX document. This function converts the input character vector to a form suitable for inclusion in a LaTeX document in text mode. It can be used together with \Sexpr in vignettes.

Usage

latexify(x, doublebackslash = TRUE, dashdash = TRUE,
         quotes = c("straight", "curved"),
         packages = c("fontenc", "textcomp"))

Value

A character vector

Arguments

x: a character vector
doublebackslash: a logical flag. If TRUE, backslashes in the output are doubled. It seems that Sweave needs TRUE and knitr FALSE.
dashdash: a logical flag. If TRUE (the default), consecutive dashes (“-”) in the input will be rendered as separate characters. If FALSE, they will not be given any special treatment, which will usually mean that two dashes are rendered as an en dash and three dashes make an em dash.
quotes: a character string specifying how single and double quotes (ASCII codes 39 and 34) and stand-alone grave accents (ASCII code 96) in the input are treated. The default is to use straight quotes and the proper symbol for the grave accent. The other option is to use curved right side (closing) quotes, and let LaTeX convert the grave accent to opening curved quotes. Straight double quotes are not available in the default OT1 font encoding of LaTeX. Straight single quotes and the grave accent symbol require the “textcomp” package. See packages.
packages: a character vector specifying the LaTeX packages allowed. The use of some symbols in LaTeX requires commands or characters made available by an add-on package. If a package required for a given character is not marked as available, a fallback solution is silently used. For example, curved quotes may be used instead of straight quotes. The supported packages are "eurosym" (not used by default), "fontenc" and "textcomp". Including "fontenc" in the vector means that some other encoding than OT1 is going to be used. Note that straight quotes are an exception in the sense that a reasonable substitute (curved quotes) is available. In many other cases, "textcomp" and "fontenc" are silently assumed.

Author

Mikko Korpela

Details

The function is intended for use with unformatted inline text. Newlines, tabs and other whitespace characters ("[:space:]" in regex) are converted to spaces. Control characters ("[:cntrl:]") that are not whitespace are removed. Other more or less special characters in the ASCII set are ‘{’, ‘}’, ‘\’, ‘#’, ‘$’, ‘%’, ‘^’, ‘&’, ‘_’, ‘~’, double quote, ‘/’, single quote, ‘<’, ‘>’, ‘|’, grave and ‘-’. They are converted to the corresponding LaTeX commands. Some of the conversions are affected by user options, e.g. dashdash.

Before applying the substitutions described above, input elements with Encoding set to "bytes" are printed and the output is stored using captureOutput. The result of this intermediate stage is ASCII text where some characters are shown as their byte codes using a hexadecimal pair prefixed with "\x". This set includes tabs, newlines and control characters. The substitutions are then applied to the intermediate result.

The quoting functions sQuote and dQuote may use non-ASCII quote characters, depending on the locale. Also these quotes are converted to LaTeX commands. This means that the quoting functions are safe to use with any LaTeX input encoding. Similarly, some other non-ASCII characters, e.g. letters, currency symbols, punctuation marks and diacritics, are converted to commands.

Adding "eurosym" to packages enables the use of the euro sign as provided by the "eurosym" package (\euro).

The result is converted to UTF-8 encoding, Normalization Form C (NFC). Note that this function will not add any non-ASCII characters that were not already present in the input. On the contrary, some non-ASCII characters, e.g. all characters in the "latin1" (ISO-8859-1) Encoding (character set), are removed when converted to LaTeX commands. Any remaining non-ASCII character has a good chance of working when the document is processed with XeTeX or LuaTeX, but the Unicode support available with pdfTeX is limited.

Assuming that pdflatex is used for compilation, suggested package loading commands in the document preamble are:

\usepackage[T1]{fontenc}    % no '"' in OT1 font encoding
\usepackage{textcomp}       % some symbols e.g. straight single quote
\usepackage[utf8]{inputenx} % UTF-8 input encoding
\input{ix-utf8enc.dfu}      % more supported characters

References

INRIA. Tralics: a LaTeX to XML translator, HTML documentation of all TeX commands. https://www-sop.inria.fr/marelle/tralics/.

Levitt, N., Persch, C., and Unicode, Inc. (2013) GNOME Character Map, application version 3.10.1.

Mittelbach, F., Goossens, M., Braams, J., Carlisle, D., and Rowley, C. (2004) The LaTeX Companion. Addison-Wesley, second edition. ISBN-13: 978-0-201-36299-2.

Pakin, S. (2009) The Comprehensive LaTeX Symbol List. https://www.ctan.org/tex-archive/info/symbols/comprehensive.

The Unicode Consortium. The Unicode Standard. https://home.unicode.org/.

Examples

Run this code

x1 <- "clich\xe9\nma\xf1ana"
Encoding(x1) <- "latin1"
x1
x2 <- x1
Encoding(x2) <- "bytes"
x2
x3 <- enc2utf8(x1)
testStrings <-
    c("different     kinds\nof\tspace",
      "control\a characters \ftoo",
      "{braces} and \\backslash",
      '#various$ %other^ &characters_ ~escaped"/coded',
      x1,
      x2,
      x3)
latexStrings <- latexify(testStrings, doublebackslash = FALSE)
## All should be "unknown"
Encoding(latexStrings)
cat(latexStrings, sep="\n")
## Input encoding does not matter
identical(latexStrings[5], latexStrings[7])

Run the code above in your browser using DataLab