Reads a text file in table format and creates a distributed data frame from it, with cases corresponding to lines and variables to fields in the file.
Usage

# S3 method for table
drRead(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
  skip = 0, fill = !blank.lines.skip, blank.lines.skip = TRUE,
  comment.char = "#", allowEscapes = FALSE, encoding = "unknown",
  autoColClasses = TRUE, rowsPerBlock = 50000, postTransFn = identity,
  output = NULL, overwrite = FALSE, params = NULL, packages = NULL,
  control = NULL, ...)

# S3 method for csv
drRead(file, header = TRUE, sep = ",",
  quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

# S3 method for csv2
drRead(file, header = TRUE, sep = ";",
  quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)

# S3 method for delim
drRead(file, header = TRUE, sep = "\t",
  quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

# S3 method for delim2
drRead(file, header = TRUE, sep = "\t",
  quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...)
Arguments

file: input text file - can either be a character string pointing to a file on local disk, or an "hdfsConn" object pointing to a text file on HDFS (see the output argument below)
header: this and the other parameters below are passed to read.table for each chunk being processed - see read.table for more info. Most have defaults, or appropriate defaults are set through the format-specific functions such as drRead.csv and drRead.delim.
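For instance (a minimal sketch; the headerless file written here is purely for illustration), read.table parameters such as header and sep can be supplied directly:

# write a headerless, space-delimited file
txtFile <- file.path(tempdir(), "iris.txt")
write.table(iris, file = txtFile, row.names = FALSE, col.names = FALSE)

# header and sep are forwarded to read.table for each chunk;
# columns get default names V1..V5
d <- drRead.table(txtFile, header = FALSE, sep = " ",
  output = localDiskConn(file.path(tempdir(), "irisTxt"), autoYes = TRUE),
  rowsPerBlock = 50)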
sep, quote, dec, skip, fill, blank.lines.skip, comment.char, allowEscapes, encoding: see read.table for more info
autoColClasses: should column classes be determined automatically by reading in a sample? This can sometimes be problematic because of the unusual way read.table handles quotes, but keeping the default of TRUE is advantageous for speed.
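If automatic detection misbehaves, one workaround is to declare the classes yourself (a sketch that assumes colClasses, like the other read.table arguments, can be passed through "..."):

# write iris to a CSV file for illustration
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)

# turn off sampling-based detection and state the classes explicitly
d <- drRead.csv(csvFile, autoColClasses = FALSE,
  colClasses = c(rep("numeric", 4), "character"),
  output = localDiskConn(file.path(tempdir(), "irisCC"), autoYes = TRUE),
  rowsPerBlock = 10)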
rowsPerBlock: how many rows of the input file should make up a block (key-value pair) of output?
postTransFn: a function to be applied to each block after it is read in, to provide any additional processing before the block is stored
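For example (a minimal sketch reusing csvFile from above; the derived column is purely illustrative), each block can be augmented as it is read in:

# applied to each block's data frame after read.table parses it
addRatio <- function(x) {
  x$Sepal.Ratio <- x$Sepal.Length / x$Sepal.Width
  x
}

d <- drRead.csv(csvFile, postTransFn = addRatio,
  output = localDiskConn(file.path(tempdir(), "irisRatio"), autoYes = TRUE),
  rowsPerBlock = 10)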
a "kvConnection" object indicating where the output data should reside. Must be a localDiskConn
object if input is a text file on local disk, or a hdfsConn
object if input is a text file on HDFS.
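For HDFS-backed data, both ends are HDFS connections (a sketch only: it assumes a running Hadoop cluster with RHIPE configured, and the paths and the type = "text" argument are assumptions to adjust for your setup):

# connect to a CSV file already sitting on HDFS
csvHdfs <- hdfsConn("/tmp/iris.csv", type = "text")

# output must then also be an HDFS connection
d <- drRead.csv(csvHdfs,
  output = hdfsConn("/tmp/irisBlocks", autoYes = TRUE))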
overwrite: logical; should an existing output location be overwritten? (can also specify overwrite = "backup" to move the existing output to a "_bak" location)
params: a named list of objects external to the input data that are needed in postTransFn
packages: a vector of R package names that contain functions used in postTransFn (most dependencies should be picked up automatically, so this is rarely necessary to specify)
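For instance (a sketch reusing csvFile from above; the lookup vector is illustrative), an object defined outside the data but used inside postTransFn is passed via params:

# external lookup object needed by postTransFn
speciesCodes <- c(setosa = 1, versicolor = 2, virginica = 3)

d <- drRead.csv(csvFile,
  postTransFn = function(x) {
    x$SpeciesCode <- speciesCodes[as.character(x$Species)]
    x
  },
  params = list(speciesCodes = speciesCodes),
  output = localDiskConn(file.path(tempdir(), "irisCoded"), autoYes = TRUE),
  rowsPerBlock = 10)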
control: parameters specifying how the backend should handle things (most likely parameters to rhwatch in RHIPE) - see rhipeControl and localDiskControl
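For the local disk backend, for example (a sketch reusing csvFile from above; passing a cluster to localDiskControl is an assumption here - see its help page for the available options):

library(parallel)

# process blocks in parallel across two local workers
cl <- makeCluster(2)
d <- drRead.csv(csvFile,
  output = localDiskConn(file.path(tempdir(), "irisPar"), autoYes = TRUE),
  control = localDiskControl(cluster = cl))
stopCluster(cl)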
...: see read.table for more info
Value

an object of class "ddf"
Examples

# write iris to a CSV file on local disk
csvFile <- file.path(tempdir(), "iris.csv")
write.csv(iris, file = csvFile, row.names = FALSE, quote = FALSE)

# create a local disk connection to hold the output key-value pairs
irisTextConn <- localDiskConn(file.path(tempdir(), "irisText2"), autoYes = TRUE)

# read the CSV into a distributed data frame, 10 rows per block
a <- drRead.csv(csvFile, output = irisTextConn, rowsPerBlock = 10)