maturing
Still experimental and may change. This form (the %<% assignment operator) cannot pass any arguments to Cache, such as cacheRepo, thus it is of limited utility. However, it is a clean alternative for simple cases.
Cache(
FUN,
...,
notOlderThan = NULL,
.objects = NULL,
outputObjects = NULL,
algo = "xxhash64",
cacheRepo = NULL,
length = getOption("reproducible.length", Inf),
compareRasterFileLength,
userTags = c(),
digestPathContent,
omitArgs = NULL,
classOptions = list(),
debugCache = character(),
sideEffect = FALSE,
makeCopy = FALSE,
quick = getOption("reproducible.quick", FALSE),
verbose = getOption("reproducible.verbose", 0),
cacheId = NULL,
useCache = getOption("reproducible.useCache", TRUE),
useCloud = FALSE,
cloudFolderID = getOption("reproducible.cloudFolderID", NULL),
showSimilar = getOption("reproducible.showSimilar", FALSE),
drv = getOption("reproducible.drv", RSQLite::SQLite()),
conn = getOption("reproducible.conn", NULL)
)
# S4 method for ANY
Cache(
FUN,
...,
notOlderThan = NULL,
.objects = NULL,
outputObjects = NULL,
algo = "xxhash64",
cacheRepo = NULL,
length = getOption("reproducible.length", Inf),
compareRasterFileLength,
userTags = c(),
digestPathContent,
omitArgs = NULL,
classOptions = list(),
debugCache = character(),
sideEffect = FALSE,
makeCopy = FALSE,
quick = getOption("reproducible.quick", FALSE),
verbose = getOption("reproducible.verbose", 0),
cacheId = NULL,
useCache = getOption("reproducible.useCache", TRUE),
useCloud = FALSE,
cloudFolderID = getOption("reproducible.cloudFolderID", NULL),
showSimilar = getOption("reproducible.showSimilar", FALSE),
drv = getOption("reproducible.drv", RSQLite::SQLite()),
conn = getOption("reproducible.conn", NULL)
)
lhs %<% rhs
FUN: Either a function or an unevaluated function call (e.g., using quote).
...: Arguments passed to FUN.
notOlderThan: A time. Load an object from the Cache if it was created after this.
.objects: Character vector of objects to be digested. This is only applicable if there is a list, environment (or similar) with named objects within it. Only this/these objects will be considered for caching, i.e., only use a subset of the list, environment or similar objects.
outputObjects: Optional character vector indicating which objects to return. This is only relevant for list, environment (or similar) objects.
algo: The algorithms to be used; currently available choices are md5 (the default of digest::digest), sha1, crc32, sha256, sha512, xxhash32, xxhash64, murmur32 and spookyhash. The default in Cache is "xxhash64".
cacheRepo: A repository used for storing cached objects. This is optional if Cache is used inside a SpaDES module.
length: Numeric. If the element passed to Cache is a Path class object (from e.g., asPath(filename)) or it is a Raster with file-backing, then this will be passed to digest::digest, essentially limiting the number of bytes to digest (for speed). This will only be used if quick = FALSE. Default is getOption("reproducible.length"), which is set to Inf.
compareRasterFileLength: Being deprecated; use length.
userTags: A character vector with descriptions of the Cache function call. These will be added to the Cache so that this entry in the Cache can be found using userTags, e.g., via showCache.
digestPathContent: Being deprecated. Use quick.
omitArgs: Optional character string of arguments in the FUN to omit from the digest.
classOptions: Optional list. This will pass into .robustDigest for specific classes. Should be options that the .robustDigest knows what to do with.
debugCache: Character or Logical. Either "complete" or "quick" (uses partial matching, so "c" or "q" work). TRUE is equivalent to "complete". If "complete", then the returned object from the Cache function will have two attributes, debugCache1 and debugCache2, which are the entire list(...) and that same object, but after all .robustDigest calls, at the moment that it is digested using digest, respectively. This attr(mySimOut, "debugCache2") can then be compared to a subsequent call and individual items within the object attr(mySimOut, "debugCache1") can be compared. If "quick", then it will return the same two objects directly, without evaluating the FUN(...).
sideEffect: Logical or path. Determines where the function will look for new files following function completion. See Details. NOTE: this argument is experimental and may change in future releases.
makeCopy: Logical. If sideEffect = TRUE, and makeCopy = TRUE, a copy of the downloaded files will be made and stored in the cacheRepo to speed up subsequent file recovery in the case where the original copy of the downloaded files is corrupted or missing. Currently only works when set to TRUE during the first run of Cache. Default is FALSE. NOTE: this argument is experimental and may change in future releases.
quick: Logical. If TRUE, little or no disk-based information will be assessed, i.e., mostly its memory content. This is relevant for objects of class character, Path and Raster currently. For class character, it is ambiguous whether this represents a character string or a vector of file paths. The function will assess if it is a path to a file or directory first. If not, it will treat the object as a character string. If it is known that character strings should not be treated as paths, then quick = TRUE will be much faster, with no loss of information. If it is a file or directory, then it will digest the file content, or basename(object). For class Path objects, the file's metadata (i.e., filename and file size) will be hashed instead of the file contents if quick = TRUE. If set to FALSE (default), the contents of the file(s) are hashed. If quick = TRUE, length is ignored. Raster objects are treated as paths, if they are file-backed.
verbose: Numeric, with 0 being off, 1 being a little, 2 being more verbose etc. Above 1 will output much more information about the internals of Caching, which may help diagnose Caching challenges.
cacheId: Character string. If passed, this will override the calculated hash of the inputs, and return the result from this cacheId in the cacheRepo. Setting this is equivalent to manually saving the output of this function, i.e., the object will be on disk, and will be recovered in subsequent calls. This may help in some particularly finicky situations where Cache is not correctly detecting unchanged inputs. This will guarantee the object will be identical each time; this may be useful in operational code.
useCache: Logical, numeric or "overwrite" or "devMode". See Details.
useCloud: Logical. See Details.
cloudFolderID: A googledrive dribble of a folder, e.g., using drive_mkdir(). If left as NULL, the function will create a cloud folder with a name from the last two folder levels of the cacheRepo path: paste0(basename(dirname(cacheRepo)), "_", basename(cacheRepo)). This cloudFolderID will be added to options("reproducible.cloudFolderID"), but this will not persist across sessions. If this is a character string, it will treat this as a folder name to create or use on GoogleDrive.
showSimilar: A logical or numeric. Useful for debugging. If TRUE or 1, then if the Cache does not find an identical archive in the cacheRepo, it will report (via message) the next most similar archive, and indicate which argument(s) is/are different. If a number larger than 1, then it will report the N most similar archived objects.
conn: A DBIConnection object, as returned by dbConnect().
lhs: A name to assign to.
rhs: A function call.
As with archivist::cache, returns the value of the function call or the cached version (i.e., the result from a previous call to this same cached function with identical arguments).
Commonly, Caching is nested, i.e., an outer function is wrapped in a Cache function call, and one or more inner functions are also wrapped in a Cache function call. A user can always specify arguments in every Cache function call, but this can get tedious and can be prone to errors. The normal way that R handles arguments is that it takes the user-passed arguments, if any, and default arguments for all those that have no user-passed arguments. We have inserted a middle step. The order of precedence for any given Cache function call is 1. user arguments, 2. inherited arguments, 3. default arguments. At this time, the top level Cache arguments will propagate to all inner functions unless each individual Cache call has other arguments specified, i.e., "middle" nested Cache function calls don't propagate their arguments to further "inner" Cache function calls. See example.
userTags is unique among all arguments: its values will be appended to the inherited userTags.
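The precedence rule above can be sketched in base R. The helper resolveCacheArgs below is hypothetical and is not part of the reproducible package; it only illustrates, under that assumption, how user, inherited and default arguments might be merged, and how userTags is appended rather than overridden.

```r
# Hypothetical helper (NOT a reproducible function): merge three argument lists
# with precedence user > inherited > defaults.
resolveCacheArgs <- function(user = list(), inherited = list(), defaults = list()) {
  # modifyList() lets the second list win, so apply the lists in increasing priority
  resolved <- utils::modifyList(utils::modifyList(defaults, inherited), user)
  # userTags is the exception: user values are appended to the inherited ones
  tags <- c(inherited$userTags, user$userTags)
  if (length(tags) > 0) resolved$userTags <- tags
  resolved
}

defaults  <- list(quick = FALSE, verbose = 0)        # illustrative defaults only
inherited <- list(quick = TRUE, userTags = "outer")  # set in an outer Cache call
user      <- list(verbose = 2, userTags = "inner")   # set in the inner Cache call

resolved <- resolveCacheArgs(user, inherited, defaults)
str(resolved)
```

Here quick comes from the outer call, verbose from the inner call, and userTags holds both values.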
Caching speed may become a critical aspect of a final product. For example, if the final product is a shiny app, rerunning the entire project may need to take less than a few seconds at most. There are 3 arguments that affect Cache speed: quick, length, and algo. quick is passed to .robustDigest, which currently only affects Path and Raster* class objects. In both cases, quick means that little or no disk-based information will be assessed.
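The trade-off behind quick can be illustrated with a small base R sketch. The two fingerprint helpers are hypothetical stand-ins, not the package's .robustDigest; they assume that a quick fingerprint uses only the file name and file size, while a full fingerprint hashes the file contents.

```r
# Hypothetical stand-ins for the two strategies (NOT the package internals)
fingerprintQuick <- function(path) {
  # metadata only: file name and size, nothing is read from the file
  paste(basename(path), file.size(path), sep = ":")
}
fingerprintFull <- function(path) {
  # full content hash; slower for large files
  unname(tools::md5sum(path))
}

f <- file.path(tempdir(), "layer.txt")

writeLines("1,2", f)
quick1 <- fingerprintQuick(f)
full1  <- fingerprintFull(f)

# Overwrite with different content of the same size
writeLines("3,4", f)
quick2 <- fingerprintQuick(f)
full2  <- fingerprintFull(f)

identical(quick1, quick2)  # the metadata-only fingerprint misses the change
identical(full1, full2)    # the content hash detects it
```

The metadata-only approach is fast but can return a stale result when a file changes without changing its name or size.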
If a function has a path argument, there is some ambiguity about what should be done. Possibilities include:
hash the string as is (this will be very system specific, meaning a Cache call will not work if copied between systems or directories);
hash the basename(path);
hash the contents of the file.
If paths are passed in as is (i.e., character string), the result will not be predictable. Instead, one should use the wrapper function asPath(path), which sets the class of the string to a Path, and one should decide whether one wants to digest the content of the file (using quick = FALSE), or just the filename (using quick = TRUE). See examples.
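The three possibilities listed above can be compared with a short base R sketch. The directory and file names are made up for illustration, and the code does not call Cache; it only shows which candidate keys stay the same when an identical file lives in two different directories.

```r
# Two copies of the same file in different (hypothetical) project directories
dirA <- file.path(tempdir(), "projA")
dirB <- file.path(tempdir(), "projB")
dir.create(dirA, showWarnings = FALSE)
dir.create(dirB, showWarnings = FALSE)
fileA <- file.path(dirA, "input.csv")
fileB <- file.path(dirB, "input.csv")
writeLines(c("x,y", "1,2"), fileA)
writeLines(c("x,y", "1,2"), fileB)

keyAsIs     <- c(fileA, fileB)                        # option 1: the string as is
keyBasename <- basename(c(fileA, fileB))              # option 2: basename(path)
keyContent  <- unname(tools::md5sum(c(fileA, fileB))) # option 3: file contents

keyAsIs[1] == keyAsIs[2]          # differs between directories
keyBasename[1] == keyBasename[2]  # same file name in both
keyContent[1] == keyContent[2]    # same contents in both
```

Only the full path is tied to the directory, which is why hashing the string as is does not carry over between systems or directories.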
In general, it is expected that caching will only be used when stochasticity is not relevant, or if a user has achieved sufficient stochasticity (e.g., via sufficient number of calls to experiment) such that no new explorations of stochastic outcomes are required. It will also be very useful in a reproducible workflow.
useCache is logical or numeric. If FALSE or 0, then the entire Caching mechanism is bypassed and the function is evaluated as if it was not being Cached. Default is getOption("reproducible.useCache"), which is TRUE by default, meaning use the Cache mechanism. This may be useful to turn all Caching on or off in very complex scripts and nested functions. Increasing levels of numeric values will cause deeper levels of Caching to occur. Currently, this is only implemented in postProcess: to do both caching of inner cropInputs, projectInputs and maskInputs, and caching of outer postProcess, use useCache = 2; to skip the inner sequence of 3 functions, use useCache = 1. For large objects, this may prevent many duplicated save to disk events.
If "overwrite" (which can be set with options("reproducible.useCache" = "overwrite")), then the function will invoke the caching mechanism but will purge any entry that is matched, and it will be replaced with the results of the current call.
If "devMode": The point of this mode is to facilitate using the Cache when functions and datasets are continually in flux, and old Cache entries are likely stale very often. In devMode, the cache mechanism will work as normal if the Cache call is the first time for a function OR if it successfully finds a copy in the cache based on the normal Cache mechanism. It differs from the normal Cache if the Cache call does not find a copy in the cacheRepo, but it does find an entry that matches based on userTags. In this case, it will delete the old entry in the cacheRepo (identified based on matching userTags), then continue with normal Cache. For this to work correctly, userTags must be unique for each function call. This should be used with caution as it is still experimental. Currently, if userTags are not unique to a single entry in the cacheRepo, it will default to the behaviour of useCache = TRUE with a message. This means that "devMode" is most useful if used from the start of a project.
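The replacement rule of "devMode" can be mimicked with a toy in-memory cache. Everything below (store, toyKey, toyCache) is hypothetical and greatly simplified; it is a sketch of the logic described above, not the package's implementation, and it assumes the key is simply the deparsed function plus its deparsed arguments.

```r
# Toy in-memory store; NOT the reproducible package's cacheRepo
store <- new.env()

# Hypothetical key: the deparsed function plus its deparsed arguments
toyKey <- function(FUN, args) {
  paste(c(deparse(FUN), deparse(args)), collapse = "\n")
}

toyCache <- function(FUN, ..., userTags, devMode = FALSE) {
  args <- list(...)
  key <- toyKey(FUN, args)
  hit <- store[[key]]
  if (!is.null(hit)) return(hit$value)  # normal hit: userTags are irrelevant
  if (devMode) {
    # no hit by key: look for entries carrying the same userTags
    stale <- Filter(function(k) setequal(store[[k]]$userTags, userTags), ls(store))
    # replace only when the tags identify exactly one entry
    if (length(stale) == 1) rm(list = stale, envir = store)
  }
  value <- do.call(FUN, args)
  assign(key, list(value = value, userTags = userTags), envir = store)
  value
}

funnyData <- c(1, 1, 1, 1, 10)
centralTendency <- function(x) mean(x)
first <- toyCache(centralTendency, funnyData, userTags = "ct", devMode = TRUE)

# Redefine the function: the key changes, the userTags do not
centralTendency <- function(x) median(x)
second <- toyCache(centralTendency, funnyData, userTags = "ct", devMode = TRUE)

# A hit by key is returned even when different userTags are supplied
third <- toyCache(centralTendency, funnyData, userTags = "other", devMode = TRUE)

nEntries <- length(ls(store))  # the old entry was replaced, not kept alongside
```

After the function is redefined, the store still holds a single entry, because the entry with matching userTags was deleted before the new result was saved.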
This is a way to store all or some of the local Cache in the cloud. Currently, the only cloud option is Google Drive, via googledrive. For this to work, the user must be or be able to be authenticated with googledrive::drive_auth. The principle behind useCloud is that it will be a full or partial mirror of a local Cache. It is not intended to be used independently from a local Cache. Sharing objects that are in the cloud with another person requires 2 steps: 1) share the cloudFolderID$id, which can be retrieved by getOption("reproducible.cloudFolderID")$id after at least one Cache call has been made; 2) the other user must then set their cloudFolderID in a Cache(..., cloudFolderID = "the ID here") call, or set their option manually with options("reproducible.cloudFolderID" = "the ID here").
If useCloud is TRUE, then this Cache call will download (if local copy doesn't exist, but cloud copy does exist), upload (local copy does or doesn't exist and cloud copy doesn't exist), or will neither download nor upload if the object exists in both. If TRUE, this will be at least 1 second slower than setting this to FALSE, and likely even slower as the cloud folder gets large. If a user wishes to keep "high-level" control, set this to getOption("reproducible.useCloud", FALSE) or getOption("reproducible.useCloud", TRUE) (if the default behaviour should be FALSE or TRUE, respectively) so it can be turned on and off with this option. NOTE: This argument will not be passed into inner/nested Cache calls.
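The download/upload decision described above amounts to a small truth table. The function cloudAction is a hypothetical illustration written for this page, not a function of the reproducible or googledrive packages.

```r
# Hypothetical decision helper (NOT a package function)
cloudAction <- function(localExists, cloudExists) {
  if (!localExists && cloudExists) {
    "download"  # only the cloud copy exists
  } else if (!cloudExists) {
    "upload"    # no cloud copy yet, whether or not a local copy exists
  } else {
    "none"      # the object exists in both places
  }
}

# Enumerate all four combinations
combos <- expand.grid(localExists = c(TRUE, FALSE), cloudExists = c(TRUE, FALSE))
combos$action <- mapply(cloudAction, combos$localExists, combos$cloudExists)
print(combos)
```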
If sideEffect is not FALSE, then metadata about any files that were added to sideEffect will be added as an attribute to the cached copy. Subsequent calls to this function will assess for the presence of the new files in the sideEffect location. If the files are identical (quick = FALSE) or their file size is identical (quick = TRUE), then the cached copy of the function will be returned (and no files changed). If there are missing or incorrect files, then the function will re-run. This will accommodate the situation where the function call is identical, but somehow the side effect files were modified. If sideEffect is logical, then the function will check the cacheRepo; if it is a path, then it will check the path. The function will assess whether the files to be downloaded are found locally prior to download. If it fails the local test, then it will try to recover from a local copy if makeCopy had been set to TRUE the first time the function was run. Currently, local recovery will only work if makeCopy was set to TRUE the first time Cache was run. Default is FALSE.
A function that can be used to wrap around other functions to cache function calls for later use. This is normally most effective when the function to cache is slow to run, yet the inputs and outputs are small. The benefit of caching, therefore, will decline when the computational time of the "first" function call is fast and/or the argument values and return objects are large. The default setting (and first call to Cache) will always save to disk. The 2nd call to the same function will return from disk. If options("reproducible.useMemoise" = TRUE) is set, then the 3rd time will recover the object from RAM and is normally much faster.
There are other similar functions in the R universe. This version of Cache has been used as part of a robust continuous workflow approach. As a result, we have tested it with many "non-standard" R objects (e.g., RasterLayer objects) and environments, which tend to be challenging for caching as they are always unique.
This version of the Cache function accommodates five special, though quite common, cases (the archivist::cache limitations listed below) by:
converting any environments into list equivalents;
identifying the dispatched S4 method (including those made through inheritance) before hashing so the correct method is being cached;
hashing the linked file, rather than the Raster object. Currently, only file-backed Raster* objects are digested (e.g., not ff objects, or any other R object where the data are on disk instead of in RAM);
using digest (formerly fastdigest, which does not translate between operating systems). This is used for file-backed objects as well;
saving arguments passed by the user in a hidden environment. Any nested Cache functions will use arguments in this order: 1) actual arguments passed at each Cache call, 2) any inherited arguments from an outer Cache call, 3) the default values of the Cache function. See section on Nested Caching.
Caching R objects using archivist::cache has five important limitations:
the archivist package detects different environments as different;
it also does not detect S4 methods correctly due to method inheritance;
it does not detect objects that have file-based storage of information (specifically RasterLayer-class objects);
the default hashing algorithm is relatively slow;
heavily nested function calls may want Cache arguments to propagate through.
As part of the SpaDES ecosystem of R packages, Cache can be used within SpaDES modules. If it is, then the cached entry will automatically get 3 extra userTags: eventTime, eventType, and moduleName. These can then be used in clearCache to selectively remove cached objects by eventTime, eventType or moduleName.
Cache will add a tag to the artifact in the database called accessed, which will assign the time that it was accessed, either read or write. That way, artifacts can be shown (using showCache) or removed (using clearCache) selectively, based on their access dates, rather than only by their creation dates. See example in clearCache.
Cache (uppercase C) is used here so that it is not confused with, and does not mask, the archivist::cache function.
showCache, clearCache, keepCache, CacheDigest, movedCache, .robustDigest
# NOT RUN {
tmpDir <- file.path(tempdir())
# Basic use
ranNumsA <- Cache(rnorm, 10, 16, cacheRepo = tmpDir)
# All same
ranNumsB <- Cache(rnorm, 10, 16, cacheRepo = tmpDir) # recovers cached copy
ranNumsC <- Cache(cacheRepo = tmpDir) %C% rnorm(10, 16) # recovers cached copy
ranNumsD <- Cache(quote(rnorm(n = 10, 16)), cacheRepo = tmpDir) # recovers cached copy
###############################################
# experimental devMode
###############################################
opt <- options("reproducible.useCache" = "devMode")
clearCache(tmpDir, ask = FALSE)
centralTendency <- function(x)
  mean(x)
funnyData <- c(1, 1, 1, 1, 10)
uniqueUserTags <- c("thisIsUnique", "reallyUnique")
ranNumsB <- Cache(centralTendency, funnyData, cacheRepo = tmpDir,
                  userTags = uniqueUserTags) # sets new value to Cache
showCache(tmpDir) # 1 unique artifact -- cacheId is 8be9cf2a072bdbb0515c5f0b3578f474
# During development, we often redefine function internals
centralTendency <- function(x)
  median(x)
# When we rerun, we don't want to keep the "old" cache because the function will
# never again be defined that way. Here, because of userTags being the same,
# it will replace the entry in the Cache, effectively overwriting it, even though
# it has a different cacheId
ranNumsD <- Cache(centralTendency, funnyData, cacheRepo = tmpDir, userTags = uniqueUserTags)
showCache(tmpDir) # 1 unique artifact -- cacheId is bb1195b40c8d37a60fd6004e5d526e6b
# If it finds it by cacheID, doesn't matter what the userTags are
ranNumsD <- Cache(centralTendency, funnyData, cacheRepo = tmpDir, userTags = "thisIsUnique")
options(opt)
# For more in depth uses, see vignette
# }
# NOT RUN {
# To use Postgres, set environment variables with the required credentials
if (requireNamespace("RPostgres")) {
  Sys.setenv(PGHOST = "server.url")
  Sys.setenv(PGPORT = 5432)
  Sys.setenv(PGDATABASE = "mydatabase")
  Sys.setenv(PGUSER = "mydbuser")
  Sys.setenv(PGPASSWORD = "mysecurepassword")
  conn <- DBI::dbConnect(RPostgres::Postgres())
  options("reproducible.conn" = conn)
  # Will use postgres for cache data table, and tempdir() for saved R objects
  Cache(rnorm, 1, cacheRepo = tempdir())
}
browseVignettes(package = "reproducible")
# }
# NOT RUN {
# Equivalent
a <- Cache(rnorm, 1)
b %<% rnorm(1)
# }