reproducible (version 1.1.1)

.robustDigest: Create reproducible digests of objects in R

Description

Not all aspects of R objects are captured by current hashing tools in R (e.g. digest::digest, knitr caching, archivist::cache). This is mostly because many objects have "transient" features (e.g., functions have environments) or "disk-backed" features. Since the goal of reproducibility is to have tools that are not session specific, this function attempts to strip all session-specific information so that the digest works between sessions and operating systems. It is tested under many conditions and object types, but there are bound to be others that don't work correctly.

Usage

.robustDigest(
  object,
  .objects,
  length = getOption("reproducible.length", Inf),
  algo = "xxhash64",
  quick = getOption("reproducible.quick", FALSE),
  classOptions = list(),
  ...
)

# S4 method for ANY
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for `function`
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for expression
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for character
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for Path
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for environment
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for list
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for data.frame
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for Raster
.robustDigest(object, .objects, length, algo, quick, classOptions)

# S4 method for Spatial
.robustDigest(object, .objects, length, algo, quick, classOptions)

Arguments

object

an object to digest.

.objects

Character vector of objects to be digested. This is only applicable if object is a list, environment (or similar) with named objects within it. Only these objects will be considered for caching, i.e., only a subset of the list, environment or similar object is used.

length

Numeric. If the element passed to Cache is a Path class object (from e.g., asPath(filename)) or it is a Raster with file-backing, then this will be passed to digest::digest, essentially limiting the number of bytes to digest (for speed). This will only be used if quick = FALSE. Default is getOption("reproducible.length"), which is set to Inf.
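
The effect of limiting the number of bytes can be sketched directly with digest::digest, which accepts file and length arguments. This is only an illustration, assuming the digest package is installed; it uses digest's own default algorithm and temporary files invented for the example, and it does not show how .robustDigest calls digest internally.

```r
# Two files that share their first 10 bytes and differ afterwards.
f1 <- tempfile()
f2 <- tempfile()
writeLines("0123456789 first file", f1)
writeLines("0123456789 second file", f2)

# Digest the whole file content: the two results should differ.
full1 <- digest::digest(f1, file = TRUE)
full2 <- digest::digest(f2, file = TRUE)
identical(full1, full2)

# Digest only the first 10 bytes: the two results should match.
short1 <- digest::digest(f1, file = TRUE, length = 10)
short2 <- digest::digest(f2, file = TRUE, length = 10)
identical(short1, short2)
```

A finite length trades accuracy for speed: differences beyond the digested bytes go undetected.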

algo

The algorithm to be used; currently available choices are md5, sha1, crc32, sha256, sha512, xxhash32, xxhash64, murmur32 and spookyhash. As shown under Usage, the default here is xxhash64 (md5 is the default of digest::digest).

quick

Logical. If TRUE, little or no disk-based information will be assessed, i.e., mostly the object's memory content. This is currently relevant for objects of class character, Path and Raster. For class character, it is ambiguous whether the object represents a character string or a vector of file paths. The function will first assess whether it is a path to a file or directory. If not, it will treat the object as a character string. If it is a file or directory, then it will digest the file content, or basename(object). If it is known that character strings should not be treated as paths, then quick = TRUE will be much faster, with no loss of information. For class Path objects, the file's metadata (i.e., filename and file size) will be hashed instead of the file contents if quick = TRUE. If set to FALSE (default), the contents of the file(s) are hashed. If quick = TRUE, length is ignored. Raster objects are treated as paths if they are file-backed.
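
The difference between hashing file metadata and hashing file contents can be illustrated with base R alone. In this sketch, file_metadata is a hypothetical helper written for the example (it is not part of reproducible), and tools::md5sum stands in for whatever content hashing the package performs.

```r
# Hypothetical helper (not part of reproducible): collects the metadata that
# the quick = TRUE description mentions, i.e. file name and file size.
file_metadata <- function(path) {
  list(name = basename(path), size = file.size(path))
}

# Two files with identical contents but different names.
f1 <- tempfile()
f2 <- tempfile()
writeLines("same content", f1)
writeLines("same content", f2)

# Content-based comparison: identical bytes give identical checksums.
# tools::md5sum() is only a stand-in for the package's own hashing.
same_content <- unname(tools::md5sum(f1)) == unname(tools::md5sum(f2))

# Metadata-based comparison: the sizes match, but the file names differ.
same_metadata <- identical(file_metadata(f1), file_metadata(f2))

same_content
same_metadata
```

Under these assumptions, a metadata-based digest distinguishes the two files while a content-based digest does not, which mirrors the asPath examples further down.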

classOptions

Optional list. This is passed to the .robustDigest methods for specific classes. It should contain only options that the relevant .robustDigest method knows how to handle.

...

Arguments passed to FUN

objects

Optional character vector indicating which objects are to be considered while making digestible. This argument is not used in the default cases; the only known method that uses this argument is the simList class from SpaDES.core.

Value

A hash, i.e., a digest, of the object passed in.

Classes

Raster* objects have the potential for disk-backed storage and thus require more work. Because a Raster* object can have its data content located on disk, this format will be maintained if the raster is already file-backed, i.e., to create .tif or .grd backed rasters, use writeRaster first, then Cache. The .tif or .grd will be copied to the raster/ subdirectory of the cacheRepo. Their RAM representation (as an R object) will still be in the usual cacheOutputs/ (or formerly gallery/) directory. inMemory raster objects will remain as binary .RData files.

Functions (which are contained within environments) are converted to a text representation via a call to format(FUN).
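
A small base R sketch of why a text representation helps: two closures with the same code but different enclosing environments are distinct objects, yet format() gives the same text for both. The functions f1 and f2 are invented for illustration, and the sketch does not reproduce the rest of what .robustDigest does.

```r
# Two closures with identical code but different enclosing environments.
f1 <- local({ offset <- 1; function(x) x + 1 })
f2 <- local({ offset <- 2; function(x) x + 1 })

# The environments are distinct objects.
identical(environment(f1), environment(f2))

# The text representations are the same, so hashing the text would not
# depend on the environment.
identical(format(f1), format(f2))
```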

Objects contained within a list or environment are recursively hashed using digest, while removing all references to environments.

Character strings are first assessed with dir.exists and file.exists to check for paths. If they are found to be paths, then only the filename, via basename(filename), is hashed. If the string is known to be a path, we suggest using asPath(thePath).
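
The path check described above can be sketched in base R. The helper string_to_hash_input is hypothetical and exists only for this illustration; it shows the decision between "existing path" and "plain string", not the hashing itself or the package's actual implementation.

```r
# Hypothetical helper (not part of reproducible): returns the text that a
# path-aware digest might hash for a single character string.
string_to_hash_input <- function(x) {
  if (dir.exists(x) || file.exists(x)) {
    basename(x)  # existing file or directory: keep only the file name
  } else {
    x            # anything else: treat as an ordinary character string
  }
}

tmp <- tempfile(fileext = ".txt")
writeLines("some data", tmp)

# An existing file is reduced to its file name.
string_to_hash_input(tmp)

# A string that is not an existing path is passed through unchanged.
string_to_hash_input("not-a-path: just some text 12345")
```

Because the check touches the file system for every string, skipping it is cheaper when the strings are known not to be paths.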

Examples

a <- 2
tmpfile1 <- tempfile()
tmpfile2 <- tempfile()
save(a, file = tmpfile1)
save(a, file = tmpfile2)

# treats as character string, so 2 filenames are different
digest::digest(tmpfile1)
digest::digest(tmpfile2)

# tests to see whether character string is representing a file
.robustDigest(tmpfile1)
.robustDigest(tmpfile2) # same

# if you tell it that it is a path, then you can decide if you want it to be
#  treated as a character string or as a file path
.robustDigest(asPath(tmpfile1), quick = TRUE)
.robustDigest(asPath(tmpfile2), quick = TRUE) # different because using file info

.robustDigest(asPath(tmpfile1), quick = FALSE)
.robustDigest(asPath(tmpfile2), quick = FALSE) # same because using file content

# Rasters are interesting because it is not known a priori whether
#   they have a file name associated with them.
library(raster)
r <- raster(extent(0,10,0,10), vals = 1:100)

# write to disk
r1 <- writeRaster(r, filename = tmpfile1)
r2 <- writeRaster(r, filename = tmpfile2)

digest::digest(r1)
digest::digest(r2) # different
.robustDigest(r1)
.robustDigest(r2) # same... data are the same in the file

# note, this is not true for comparing memory and file-backed rasters
.robustDigest(r)
.robustDigest(r1) # different
