
quanteda (version 0.9.7-17)

textfile: read a text corpus source from a file

Description

Read a text corpus from one or more source files. The texts of the corpus come from (some part of) the content of the files, and the document-level metadata (docvars) come from either the file contents or filenames.

Usage

textfile(file, ignoreMissingFiles = FALSE, textField = NULL, cache = FALSE,
  docvarsfrom = c("filenames"), dvsep = "_", docvarnames = NULL,
  encoding = NULL, ...)

"textfile"(file, ignoreMissingFiles = FALSE, textField = NULL, cache = FALSE,
  docvarsfrom = "metadata", dvsep = "_", docvarnames = NULL,
  encoding = NULL, ...)

Arguments

file
the complete filename(s) to be read. This is designed to automagically handle a number of common scenarios, so the value can be a "glob"-type wildcard value. Currently available filetypes are:
txt
plain text files, where the content of each file forms the text of a document

For structured text filetypes, which describe both texts and metadata, the column, field, or node which contains the text must be specified with the textField parameter, and all other fields are treated as docvars. These are:

json
data in some form of JavaScript Object Notation, consisting of the texts and optionally additional docvars. The supported formats are:
  • a single JSON object per file
  • line-delimited JSON, with one object per line
  • line-delimited JSON, of the format produced from a Twitter stream. This type of file has special handling which simplifies the Twitter format into docvars. The correct format for each JSON file is automatically detected.
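The line-delimited case can be sketched as follows (a minimal example, assuming quanteda 0.9.x is loaded; the field names "text" and "party" are illustrative):

```r
library(quanteda)

## write a small line-delimited JSON file: one object per line, with the
## document text in a "text" field and one additional field, "party"
ndjson <- c('{"text": "First speech.",  "party": "A"}',
            '{"text": "Second speech.", "party": "B"}')
tmp <- tempfile(fileext = ".json")
writeLines(ndjson, tmp)

## textField names the JSON field holding the texts;
## the remaining field ("party") becomes a docvar
mytf <- textfile(tmp, textField = "text")
summary(corpus(mytf))
```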

csv, tab, tsv
comma- or tab-separated values

xml
Basic flat XML documents are supported -- those of the kind handled by the xmlToDataFrame function of the XML package.

file can also be something other than a path to a single local file, such as:

a wildcard value
any valid pathname with a wildcard ("glob") expression that can be expanded by the operating system. This may consist of multiple file types.
a URL to a remote file
which is downloaded and then loaded
zip, tar, tar.gz, tar.bz
archive file, which is unzipped. The contained files must be either at the top level or in a single directory. Archives, remote URLs and glob patterns can resolve to any of the other filetypes, so you could have, for example, a remote URL to a zip file which contained Twitter JSON files.
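The non-file inputs above compose freely; as a sketch (both the directory and the URL below are hypothetical, for illustration only):

```r
library(quanteda)

## a glob pattern, expanded by the operating system:
## every .txt file in one local directory
tf_glob <- textfile("~/texts/*.txt")

## a remote archive: downloaded, unzipped, and the contained
## files read as if their paths had been given directly
tf_zip <- textfile("http://example.org/corpus.zip")
```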

ignoreMissingFiles
if FALSE, an error is thrown when the file argument does not resolve to an existing file. Note that this can happen in a number of ways, including passing a path to a file that does not exist, to an empty archive file, or to a glob pattern that matches no files.
textField
a variable (column) name or column number indicating where to find the texts that form the documents for the corpus. This must be specified for file types .csv and .json.
cache
If TRUE, write the object to a temporary file and store the temporary filename in the corpusSource-class object definition. If FALSE, return the data in the object. Caching the file provides a way to read in very large quantities of textual data without storing two copies in memory: one as a corpusSource-class object and the second as a corpus class object. It also provides a way to try different settings of encoding conversion when creating a corpus from a corpusSource-class object, without having to load all of the source data again.
docvarsfrom
used to specify that docvars should be taken from the filenames, when the textfile inputs are filenames and the elements of the filenames are document variables, separated by a delimiter (dvsep). This allows easy assignment of docvars from filenames such as 1789-Washington.txt, 1793-Washington.txt, etc., split on dvsep, or from meta-data embedded in the text file header (headers).
dvsep
separator used in filenames to delimit docvar elements if docvarsfrom="filenames" is used
docvarnames
character vector of variable names for docvars, if docvarsfrom is specified. If this argument is not used, default docvar names will be used (docvar1, docvar2, ...).
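For instance, filename-derived docvars might be set up like this (a sketch assuming files named in the 1789-Washington.txt pattern described above; the directory path is illustrative):

```r
library(quanteda)

## filenames such as "1789-Washington.txt" are split on dvsep ("-"),
## and the resulting pieces become docvars named by docvarnames
mytf <- textfile("~/inaugural/*.txt",
                 docvarsfrom = "filenames",
                 dvsep = "-",
                 docvarnames = c("Year", "President"))
summary(corpus(mytf))
```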
encoding
vector: either a single encoding applied to all files, or one encoding for each file
...
additional arguments passed through to the low-level file reading function, such as file, read.csv, etc. Useful for specifying an input encoding option, which is specified in the same way as it would be given to iconv. See the Encoding section of file for details. Also useful for passing arguments through to read.csv, for instance `quote = ""`, if quotes are causing problems within comma-delimited fields.
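A sketch of both pass-through uses (the file "speeches.csv" and its "text" column are hypothetical):

```r
library(quanteda)

## quote = "" is passed through to read.csv, for when stray quotation
## marks inside comma-delimited fields break the default parsing
mytf <- textfile("speeches.csv", textField = "text", quote = "")

## an input encoding can be given where it differs from the session
## default, in the same form iconv accepts
mytf2 <- textfile("speeches.csv", textField = "text", encoding = "latin1")
```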

Value

an object of class corpusSource-class that can be read by corpus to construct a corpus.

Details

If cache = TRUE, the constructor does not store a copy of the texts, but rather reads in the texts and associated data, and saves them to a temporary disk file whose location is specified in the corpusSource-class object. This prevents a complete copy of the object from cluttering the global environment and consuming additional space. This does mean however that the state of the file containing the source data will not be cross-platform and may not be persistent across sessions. So the recommended usage is to load the data into a corpus in the same session in which textfile is called.
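The recommended same-session pattern can be sketched as (the glob path is illustrative):

```r
library(quanteda)

## with cache = TRUE, the texts are written to a temporary file rather
## than held in the returned object, so only the corpus built from it
## occupies memory
mytf <- textfile("~/texts/*.txt", cache = TRUE)

## build the corpus in the same session, while the temporary file
## referenced by the corpusSource object still exists
mycorpus <- corpus(mytf)
```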

Examples

## Not run: 
# Twitter json
mytf1 <- textfile("http://www.kenbenoit.net/files/tweets.json")
summary(corpus(mytf1), 5)

# generic json - needs a textField specifier
mytf2 <- textfile("http://www.kenbenoit.net/files/sotu.json",
                  textField = "text")
summary(corpus(mytf2))

# text file
mytf3 <- textfile("https://wordpress.org/plugins/about/readme.txt")
summary(corpus(mytf3))

# XML data
mytf6 <- textfile("http://www.kenbenoit.net/files/plant_catalog.xml", 
                  textField = "COMMON")
summary(corpus(mytf6))

# csv file
write.csv(data.frame(inaugSpeech = texts(inaugCorpus), docvars(inaugCorpus)), 
          file = "/tmp/inaugTexts.csv", row.names = FALSE)
mytf7 <- textfile("/tmp/inaugTexts.csv", textField = "inaugSpeech")
summary(corpus(mytf7))

# vector of full filenames for a recursive structure
textfile(list.files(path = "~/Desktop/texts", pattern = "\\.txt$", 
                    full.names = TRUE, recursive = TRUE))
## End(Not run)
