rock (version 0.8.1)

doc_to_txt: Convert a document (.docx, .pdf, .odt, .rtf, or .html) to a plain text file

Description

This function used to be a thin wrapper around textreadr::read_document() that also wrote the result to output, doing its best to correctly write UTF-8 (based on the approach recommended in this blog post). However, textreadr was archived from CRAN. The function now directly wraps the functions that textreadr wrapped: pdftools::pdf_text() for .pdf files and striprtf::read_rtf() for .rtf files. It uses xml2 to import .docx and .odt files and rvest to import .html files, using the code from the textreadr package.
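The mapping from file type to backend described above can be summarized in a small lookup. This is only a sketch: reader_for() is a hypothetical helper, not part of rock, and it returns the name of the backend rather than calling it, so it runs with base R alone.

```r
# Hypothetical helper (not part of {rock}): report which backend the
# description above associates with a given file extension.
reader_for <- function(path) {
  # tools::file_ext() returns the extension without the leading dot
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    pdf  = "pdftools::pdf_text()",
    rtf  = "striprtf::read_rtf()",
    docx = ,                      # falls through to the next value
    odt  = "xml2",
    html = "rvest",
    stop("Unsupported extension: '", ext, "'")
  )
}

reader_for("interview.PDF")
reader_for("notes.odt")
```

Matching on the lower-cased extension is an assumption of this sketch; the package itself may detect file types differently.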

Usage

doc_to_txt(
  input,
  output = NULL,
  encoding = rock::opts$get("encoding"),
  newExt = NULL,
  preventOverwriting = rock::opts$get("preventOverwriting"),
  silent = rock::opts$get("silent")
)

Value

The converted source, as a character vector.

Arguments

input

The path to the input file.

output

The path and filename to write to. If this is a path to an existing directory (without a filename specified), the input filename will be used, and its extension will be replaced with the new extension (see newExt).

encoding

The encoding to use when writing the text file.

newExt

The extension to append. It is only used if output = NULL and newExt is not NULL, in which case the output is written to a file with the same name as input but with newExt as its extension.

preventOverwriting

Whether to prevent overwriting existing files.

silent

Whether to be silent or chatty.
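How output, newExt, encoding, and preventOverwriting might interact can be sketched in base R. The helper below, write_converted(), is hypothetical and is not the package's implementation. It assumes that newExt may be given with or without a leading dot, and it writes bytes directly (useBytes = TRUE) as one common way to keep UTF-8 intact regardless of the session locale.

```r
# Hypothetical helper (not part of {rock}); illustrates the arguments only.
write_converted <- function(text, input, output = NULL, newExt = NULL,
                            encoding = "UTF-8", preventOverwriting = TRUE) {
  if (is.null(output)) {
    # Without an output path or a new extension there is nothing to write
    if (is.null(newExt)) return(invisible(NULL))
    # Same name as the input, but with newExt as extension
    # (assumption: a leading dot in newExt is optional)
    output <- paste0(tools::file_path_sans_ext(input), ".",
                     sub("^\\.", "", newExt))
  }
  if (preventOverwriting && file.exists(output)) {
    stop("File '", output, "' exists and preventOverwriting is TRUE.")
  }
  # Convert to the target encoding first, then write the bytes untouched
  encoded <- iconv(enc2utf8(text), from = "UTF-8", to = encoding)
  con <- file(output, open = "wb")
  on.exit(close(con))
  writeLines(encoded, con = con, useBytes = TRUE)
  invisible(output)
}

# The input file does not need to exist for this sketch
input   <- tempfile(fileext = ".docx")
written <- write_converted(c("First line", "Caf\u00e9"), input, newExt = "txt")
readLines(written, encoding = "UTF-8")
```

Calling write_converted() a second time with the same arguments stops with an error, because the target file now exists and preventOverwriting defaults to TRUE in this sketch.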

Examples

### This example requires the {xml2} package
if (requireNamespace("xml2", quietly = TRUE)) {
  print(
    rock::doc_to_txt(
      input = system.file(
        "extdata/doc-to-test.docx", package = "rock"
      )
    )
  )
}
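
A variation on the example above writes the converted text to a file by passing output. This is a sketch that uses only the arguments listed under Usage. The temporary file path is an illustrative choice, and the code is skipped when rock or xml2 is not installed.

```r
### Write the converted text to a temporary .txt file
out <- tempfile(fileext = ".txt")

if (requireNamespace("rock", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)) {
  rock::doc_to_txt(
    input = system.file(
      "extdata/doc-to-test.docx", package = "rock"
    ),
    output = out,
    silent = TRUE
  )
  ### Read the written file back to inspect the result
  if (file.exists(out)) {
    print(readLines(out, encoding = "UTF-8"))
  }
}
```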