utf8: UTF-8 Text Handling

Description

UTF-8 text conversion, formatting, and printing.

Usage

    as_utf8(x)
    utf8_encode(x, display = FALSE)
    utf8_format(x, trim = FALSE, chars = NULL, justify = "left",
                width = NULL, na.encode = TRUE, quote = FALSE,
                na.print = NULL, print.gap = NULL, ...)
    utf8_print(x, chars = NULL, quote = TRUE, na.print = NULL,
               print.gap = NULL, right = FALSE, max = NULL,
               display = TRUE, ...)
    utf8_width(x, encode = TRUE)
    utf8_valid(x)

Arguments

x

character object.

display

logical scalar indicating whether to optimize the encoding for display, not byte-for-byte data transmission.

trim

logical scalar indicating whether to suppress padding spaces around elements.

chars

integer scalar indicating the maximum number of character units to display. Wide characters like emoji take two character units; combining marks and default ignorables take none. Longer strings get truncated and suffixed or prefixed with an ellipsis ("..." in C locale, "\u2026" in others). Set to NULL to limit output to the line width as determined by getOption("width").

justify

justification; one of "left", "right", "centre", or "none". Can be abbreviated.

width

the minimum field width; set to NULL or 0 for no restriction.

na.encode

logical scalar indicating whether to encode NA values as character strings.

quote

logical scalar indicating whether to put surrounding quotes around character strings.

na.print

character string (or NULL) indicating the encoding for NA values. Ignored when na.encode is FALSE.

print.gap

non-negative integer (or NULL) giving the number of spaces in gaps between columns; set to NULL or 1 for a single space.

right

logical scalar indicating whether to right-justify character strings.

max

non-negative integer (or NULL) indicating the maximum number of elements to print; set to getOption("max.print") if argument is NULL.

encode

logical scalar indicating whether to encode the object before measuring its width.

...

further arguments passed from other methods. Ignored.

Value

For as_utf8 or utf8_encode, a character object with the same attributes as x but with Encoding set to "UTF-8".

For utf8_print, the function returns x invisibly.

For utf8_valid or utf8_width, a logical or integer object, respectively, with the same names, dim, and dimnames as x.

Details

as_utf8 converts a character object from its declared encoding to a valid UTF-8 character object, or throws an error if no conversion is possible.
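The conversion and the failure case can be sketched as follows. This is a minimal illustration that assumes the utf8 package is installed and attached; the variable names good, bad, converted, and result are illustrative only.

```r
library(utf8)

# Latin-1 bytes with the correct declared encoding: conversion succeeds
good <- "fa\xE7ile"
Encoding(good) <- "latin1"
converted <- as_utf8(good)
Encoding(converted)

# The same bytes wrongly declared as UTF-8: as_utf8 signals an error,
# which is caught here so that the script keeps running
bad <- "fa\xE7ile"
Encoding(bad) <- "UTF-8"
result <- tryCatch(as_utf8(bad), error = function(e) conditionMessage(e))
result
```

Catching the error with tryCatch is one option among several; for bulk input it may be simpler to check validity first and convert only the elements that pass.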

utf8_encode encodes a character object for printing on a UTF-8 device by escaping control characters and other non-printable characters. When display = TRUE, the function optimizes the encoding for display by removing default ignorable characters (soft hyphens, zero-width spaces, etc.) and placing zero-width spaces after wide emoji. When LC_CTYPE = "C", the function escapes all non-ASCII characters and gives the same results on all platforms.
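As a rough sketch of the escaping step, assuming the utf8 package is attached: a string containing a newline is encoded so that the control character no longer appears literally. The names raw_text and escaped are illustrative, and the exact escape sequence shown on screen is left to the package.

```r
library(utf8)

# A string containing a newline (a control character)
raw_text <- "line one\nline two"

# Escape the control character so the result prints on a single line
escaped <- utf8_encode(raw_text)
escaped

# The escaped form is longer, because the control character has been
# replaced by a visible escape sequence
nchar(raw_text)
nchar(escaped)
```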

utf8_format formats a character object for printing, optionally truncating long character strings.

utf8_print prints a character object after formatting it with utf8_format.
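A short sketch of formatting and printing together, assuming the utf8 package is attached. The input vector words and the limit of 5 character units are arbitrary choices for illustration.

```r
library(utf8)

words <- c("internationalization", "utf8")

# Truncate each element to at most 5 character units; with the default
# left justification, a longer string keeps its first characters and
# gains a trailing ellipsis
short <- utf8_format(words, chars = 5)
short

# utf8_print formats and prints, then returns its argument invisibly
returned <- utf8_print(words, chars = 5)
identical(returned, words)
```

The ellipsis is "..." in the C locale and "\u2026" elsewhere, so the exact truncated string depends on the session's LC_CTYPE setting.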

utf8_valid tests whether the elements of a character object can be translated to valid UTF-8 strings.
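One possible use of the validity test is as a filter before conversion. The sketch below assumes the utf8 package is attached; the names mislabelled, inputs, and ok are illustrative.

```r
library(utf8)

# Latin-1 bytes for "cafe" with an acute accent on the final letter;
# declaring them as UTF-8 makes the element invalid
mislabelled <- "caf\xE9"
Encoding(mislabelled) <- "UTF-8"
inputs <- c(ascii = "cafe", wrong = mislabelled)

# Logical vector carrying the same names as the input
ok <- utf8_valid(inputs)
ok

# Convert only the elements that passed the check
as_utf8(inputs[ok])
```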

utf8_width returns the printed widths of the elements of a character object on a UTF-8 device or, when LC_CTYPE = "C", on an ASCII device. If the string is not printable on the device, for example if it contains a control code like "\n", then the result is NA. If encode = TRUE, the default, then the function returns the widths of the encoded elements (via utf8_encode); otherwise, the function returns the widths of the original elements.
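The effect of the encode argument can be sketched with ASCII input, which behaves the same in every locale. This assumes the utf8 package is attached; the names samples and raw_widths are illustrative.

```r
library(utf8)

samples <- c(plain = "abc", multiline = "a\nb")

# Default (encode = TRUE): widths of the escaped forms from utf8_encode,
# so the element with a newline still has a defined width
utf8_width(samples)

# encode = FALSE: widths of the original strings; the newline makes the
# second element unprintable, so its width is NA
raw_widths <- utf8_width(samples, encode = FALSE)
raw_widths
```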

Examples

    library(utf8)

    # the second element is encoded in latin-1, but declared as UTF-8
    x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
    Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

    # attempt to convert to UTF-8 (fails with an error, so wrap in try)
    try(as_utf8(x))

    y <- x
    Encoding(y[2]) <- "latin1" # mark the correct encoding
    as_utf8(y) # succeeds

    # test for valid UTF-8
    utf8_valid(x)

    # encoding
    utf8_encode(x)

    # formatting
    utf8_format(x, chars = 3)
    utf8_format(x, chars = 3, justify = "centre", width = 10)
    utf8_format(x, chars = 3, justify = "right")

    # get widths
    utf8_width(x)
    utf8_width(x, encode = FALSE)

    # printing (assumes that output is capable of displaying Unicode 10.0.0)
    print(intToUtf8(0x1F600 + 0:79)) # with default R print function
    utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
    utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit

    # in C locale, output ASCII (same results on all platforms)
    oldlocale <- Sys.getlocale("LC_CTYPE")
    invisible(Sys.setlocale("LC_CTYPE", "C")) # switch to C locale
    utf8_print(intToUtf8(0x1F600 + 0:79))
    invisible(Sys.setlocale("LC_CTYPE", oldlocale)) # switch back to old locale