The Hadoop Distributed File System (HDFS) is typically part of a Hadoop
cluster, but it can also be used as a stand-alone, general-purpose
distributed file system (DFS). Several high-level functions provide easy
access to distributed storage.
DFS_cat is useful for producing output in user-defined functions. It
reads from files on the DFS and typically prints the output to the
standard output. Its behaviour is similar to the base function cat.
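For example, a minimal sketch (the file path is hypothetical, and a
configured Hadoop cluster is assumed to be available):

    ## Print the contents of a text file stored on the DFS
    DFS_cat("/tmp/example.txt")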
DFS_dir_create creates directories with the given path names if they do
not already exist. Its behaviour is similar to the base function
dir.create.
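For example (the path is hypothetical; the call mirrors dir.create()):

    ## Create a directory on the DFS (only if it does not already exist)
    DFS_dir_create("/tmp/output")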
DFS_dir_exists and DFS_file_exists return a logical vector indicating
whether the directory or file, respectively, named by its argument
exists. See also the base function file.exists.
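A short sketch of the existence checks (both paths are hypothetical):

    ## Each call returns TRUE or FALSE
    DFS_dir_exists("/tmp/output")        # does the directory exist?
    DFS_file_exists("/tmp/example.txt")  # does the file exist?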
DFS_dir_remove attempts to remove the directory named in its argument
and, if recursive is set to TRUE, also attempts to remove its
subdirectories recursively.
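For example (the path is hypothetical; recursive is assumed to be passed
as a named argument):

    ## Remove the directory and, recursively, everything below it
    DFS_dir_remove("/tmp/output", recursive = TRUE)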
DFS_list produces a character vector of the names of files in the
directory named by its argument.
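For example (the path is hypothetical):

    ## Character vector of the file names in a DFS directory
    files <- DFS_list("/tmp")
    print(files)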
DFS_read_lines is a reader for (plain text) files stored on the DFS. It
returns a vector of character strings representing lines in the (text)
file. If n is given as an argument, it reads that many lines from the
given file. Its behaviour is similar to the base function readLines.
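A sketch, assuming n behaves as in readLines() (the path is
hypothetical):

    ## Read the first 10 lines of a text file stored on the DFS
    first_lines <- DFS_read_lines("/tmp/example.txt", n = 10)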
DFS_put copies files named by its argument to a given path in the DFS.
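A sketch; the argument order (local files first, DFS target path second)
is an assumption, as are the file names:

    ## Copy two local files into a DFS directory
    DFS_put(c("data1.csv", "data2.csv"), "/tmp/input")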
DFS_put_object serializes an R object to the DFS.
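A sketch; the argument order (R object first, DFS target file second) is
an assumption, and the path is hypothetical:

    ## Serialize a fitted model and store it on the DFS
    model <- lm(dist ~ speed, data = cars)
    DFS_put_object(model, "/tmp/model")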
DFS_write_lines writes a given vector of character strings to a file
stored on the DFS. Its behaviour is similar to the base function
writeLines.
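A sketch, assuming the same argument order as writeLines(text, con) (the
path is hypothetical):

    ## Write a character vector to a file on the DFS, one element per line
    DFS_write_lines(c("first line", "second line"), "/tmp/example.txt")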