The Hadoop Distributed File System (HDFS) is typically part of a Hadoop
cluster, but it can also be used as a stand-alone, general-purpose
distributed file system (DFS). Several high-level functions provide easy
access to distributed storage.
DFS_cat is useful for producing output in user-defined functions. It
reads from files on the DFS and typically prints the output to the
standard output. Its behaviour is similar to the base function cat.
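For example, a minimal sketch (the file path is hypothetical, and a
configured Hadoop cluster is assumed to be available):

    ## Print the contents of a text file stored on the DFS
    DFS_cat("/tmp/example.txt")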
DFS_dir_create creates directories with the given path names if they do
not already exist. Its behaviour is similar to the base function
dir.create.
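For example (the path is hypothetical; the call mirrors dir.create()):

    ## Create a directory on the DFS (only if it does not already exist)
    DFS_dir_create("/tmp/output")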
DFS_dir_exists and DFS_file_exists return a logical vector indicating
whether the directory or file, respectively, named by its argument
exists. See also the base function file.exists.
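A short sketch of the existence checks (both paths are hypothetical):

    ## Each call returns TRUE or FALSE
    DFS_dir_exists("/tmp/output")        # does the directory exist?
    DFS_file_exists("/tmp/example.txt")  # does the file exist?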
DFS_dir_remove attempts to remove the directory named in its argument
and, if recursive is set to TRUE, also attempts to remove its
subdirectories recursively.
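For example (the path is hypothetical; recursive is assumed to be passed
as a named argument):

    ## Remove the directory and, recursively, everything below it
    DFS_dir_remove("/tmp/output", recursive = TRUE)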
DFS_list produces a character vector of the names of files in the
directory named by its argument.
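For example (the path is hypothetical):

    ## Character vector of the file names in a DFS directory
    files <- DFS_list("/tmp")
    print(files)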
DFS_read_lines is a reader for (plain text) files stored on the DFS. It
returns a vector of character strings representing lines in the (text)
file. If n is given as an argument, it reads that many lines from the
given file. Its behaviour is similar to the base function readLines.
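A sketch, assuming n behaves as in readLines() (the path is
hypothetical):

    ## Read the first 10 lines of a text file stored on the DFS
    first_lines <- DFS_read_lines("/tmp/example.txt", n = 10)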
DFS_put copies files named by its argument to a given path in the DFS.
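A sketch; the argument order (local files first, DFS target path second)
is an assumption, as are the file names:

    ## Copy two local files into a DFS directory
    DFS_put(c("data1.csv", "data2.csv"), "/tmp/input")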
DFS_put_object serializes an R object to the DFS.
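A sketch; the argument order (R object first, DFS target file second) is
an assumption, and the path is hypothetical:

    ## Serialize a fitted model and store it on the DFS
    model <- lm(dist ~ speed, data = cars)
    DFS_put_object(model, "/tmp/model")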
DFS_write_lines writes a given vector of character strings to a file
stored on the DFS. Its behaviour is similar to the base function
writeLines.
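A sketch, assuming the same argument order as writeLines(text, con) (the
path is hypothetical):

    ## Write a character vector to a file on the DFS, one element per line
    DFS_write_lines(c("first line", "second line"), "/tmp/example.txt")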