tm (version 0.5-9.1)

preprocessReut21578XML: Preprocess the Reuters-21578 XML archive.

Description

Preprocess the Reuters-21578 XML archive by correcting invalid UTF-8 encodings and copying each text document into a separate file.

Usage

preprocessReut21578XML(input, output, fixEnc = TRUE)

Arguments

input
A character string giving the path to the input directory.
output
A character string giving the path to the output directory.
fixEnc
A logical value indicating whether invalid UTF-8 encodings in the Reuters-21578 XML dataset should be corrected.
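
This page does not say how the encoding correction is carried out. As a rough illustration of what fixing invalid UTF-8 can involve, the following base R sketch detects a string that is not valid UTF-8 and replaces the undecodable byte. It is an assumption-laden example, not the method tm uses; the sample bytes and the variable names raw_line and fixed are invented for illustration.

```r
# Illustration only: this is NOT taken from the tm sources.
# Build a string containing a byte that is not valid UTF-8:
# 0xE9 is "e with acute accent" in Latin-1, but on its own it is invalid UTF-8.
raw_line <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))

# validUTF8() (base R >= 3.3.0) reports whether the bytes form valid UTF-8.
validUTF8(raw_line)   # FALSE

# Re-encode UTF-8 to UTF-8, substituting "?" for bytes that cannot be decoded.
fixed <- iconv(raw_line, from = "UTF-8", to = "UTF-8", sub = "?")

# The repaired string should now pass the validity check.
validUTF8(fixed)
```

Substituting a placeholder keeps the document length roughly intact; passing sub = "" instead would drop the offending bytes.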

Value

No explicit return value. As a side effect, the directory output contains the corrected dataset.
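
A call might look like the sketch below. The directory paths are hypothetical placeholders, and the guard is there because the function is documented for tm 0.5-9.1 and may be absent from other versions. Since nothing useful is returned, the result is checked by listing the output directory.

```r
# Hypothetical locations; point input_dir at the unpacked Reuters-21578 XML archive.
input_dir  <- file.path(tempdir(), "reuters21578-xml")
output_dir <- file.path(tempdir(), "reuters21578-clean")

# Check that tm is installed and that this version provides the function.
have_fun <- requireNamespace("tm", quietly = TRUE) &&
  exists("preprocessReut21578XML", envir = asNamespace("tm"), inherits = FALSE)

if (have_fun && dir.exists(input_dir)) {
  preprocess <- get("preprocessReut21578XML", envir = asNamespace("tm"))
  # Signature as given in the Usage section above.
  preprocess(input_dir, output_dir, fixEnc = TRUE)
  # Inspect the side effect: files written to the output directory.
  print(head(list.files(output_dir)))
} else {
  message("preprocessReut21578XML or the input directory is unavailable; skipping.")
}
```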

References

Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution 1.0. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

Luz, Saturnino. XML-encoded version of Reuters-21578. http://modnlp.berlios.de/reuters21578.html