cache_rds: Cache the value of an R expression to an RDS file

Description

Save the value of an expression to a cache file (in the RDS format). The next time the function is called, the value is loaded from the file if it exists.

Usage

cache_rds(
  expr = {},
  rerun = FALSE,
  file = "cache.rds",
  dir = "cache/",
  hash = NULL,
  clean = getOption("xfun.cache_rds.clean", TRUE),
  ...
)

Value

The value of expr. If the cache file does not exist, the expression is run and its result is saved to the file; otherwise, the cache file is read and its value is returned.

Arguments

expr: An R expression.
rerun: Whether to delete the RDS file, rerun the expression, and save the result again (i.e., invalidate the cache if it exists).
file: The base (see Details) cache filename under the directory specified by the dir argument. If not specified and this function is called inside a code chunk of a knitr document (e.g., an R Markdown document), the default is the current chunk label plus the extension .rds.
dir: The path of the RDS file is partially determined by paste0(dir, file). If not specified and the knitr package is available, the default value of dir is the knitr chunk option cache.path (so if you are compiling a knitr document, you do not need to provide this dir argument explicitly), otherwise the default is cache/. If you would rather pass a complete path to the file argument than provide a dir, use dir = "".
hash: A list object that contributes to the MD5 hash of the cache filename (see Details). It can also take a special character value "auto". Other types of objects are ignored.
clean: Whether to clean up the old cache files automatically when expr has changed.
...: Other arguments to be passed to saveRDS().

Details

Note that the file argument does not provide the full cache filename. The actual name of the cache file is of the form BASENAME_HASH.rds, where BASENAME is the base name provided via the file argument (e.g., if file = 'foo.rds', BASENAME would be foo), and HASH is the MD5 hash (also called the ‘checksum’) calculated from the R code provided to the expr argument and the value of the hash argument. This means that when the code or the hash argument changes, the HASH string may also change, and the old cache will be invalidated (if it exists). If you want to find the cache file, look for .rds files whose names end with 32 hexadecimal digits (consisting of 0-9 and a-f) just before the .rds extension.
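The naming scheme described above can be checked with a short sketch. The temporary directory d and the base name foo.rds are illustrative choices, not defaults of the function, and the regular expression simply encodes the BASENAME_HASH.rds pattern.

```r
library(xfun)

# hypothetical cache directory; it ends with a slash because the path
# is built with paste0(dir, file)
d = paste0(tempfile(), "/")

res = cache_rds(1:10, file = "foo.rds", dir = d)

# look for foo_<32 hexadecimal digits>.rds in the cache directory
cache_files = list.files(d, pattern = "^foo_[0-9a-f]{32}[.]rds$")
print(cache_files)

unlink(d, recursive = TRUE)  # clean up
```

The exact hash depends on the deparsed code and the installed version of xfun, so the sketch matches only the shape of the filename.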

The possible ways to invalidate the cache are: 1) change the code in the expr argument; 2) delete the cache file manually or automatically through the argument rerun = TRUE; and 3) change the value of the hash argument. The first two ways should be obvious. The third way makes it possible to automatically invalidate the cache based on changes in certain R objects. For example, when you run cache_rds({ x + y }), you may want to invalidate the cache and rerun { x + y } when the value of x or y has changed, and you can tell cache_rds() to do so via cache_rds({ x + y }, hash = list(x, y)).

The value of the argument hash is expected to be a list, but it can also take a special value, "auto", which means cache_rds(expr) will try to automatically figure out the global variables in expr, return a list of their values, and use this list as the actual value of hash. This behavior is most likely to be what you really want: if the code in expr uses an external global variable, you may want to invalidate the cache if the value of the global variable has changed. Here a “global variable” means a variable not created locally in expr, e.g., for cache_rds({ x <- 1; x + y }), x is a local variable, and y is (most likely to be) a global variable, so changes in y should invalidate the cache.

However, you know your own code best. If you want to be completely sure when the cache is invalidated, you can always provide a list of objects explicitly rather than relying on hash = "auto".
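As a minimal sketch of the third way, the following wraps cache_rds() in a helper and counts how often the expression is actually evaluated. The names add, n_runs, and the temporary directory d are hypothetical and exist only for this illustration.

```r
library(xfun)

d = paste0(tempfile(), "/")  # hypothetical cache directory
n_runs = 0                   # counts real evaluations of the expression

add = function(x, y) {
  cache_rds({
    n_runs <<- n_runs + 1
    x + y
  }, file = "sum.rds", dir = d, hash = list(x, y))
}

r1 = add(1, 2)  # no cache yet: the expression is evaluated
r2 = add(1, 2)  # same code and same hash: the value is read from the cache
r3 = add(1, 3)  # y changed, so the hash changes and the expression is rerun

print(n_runs)

unlink(d, recursive = TRUE)  # clean up
```

Passing the list explicitly keeps the invalidation rule visible in the code; hash = "auto" would try to infer the same variables from the expression.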

By default (the argument clean = TRUE), old cache files will be automatically cleaned up. Sometimes you may want to use clean = FALSE (set the R global option options(xfun.cache_rds.clean = FALSE) if you want FALSE to be the default). For example, you may not have decided which version of code to use, and you can keep the cache of both versions with clean = FALSE, so when you switch between the two versions of code, it will still be fast to run the code.
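The effect of clean can be seen by counting the cache files after each call. This is a sketch under the assumption that nothing else writes to the temporary directory d; the base name demo.rds is made up for the example.

```r
library(xfun)

d = paste0(tempfile(), "/")  # hypothetical cache directory

# two versions of the code, both kept because clean = FALSE
v1 = cache_rds({ 1 + 1 }, file = "demo.rds", dir = d, clean = FALSE)
v2 = cache_rds({ 1 + 2 }, file = "demo.rds", dir = d, clean = FALSE)
n_kept = length(list.files(d))

# a third version with clean = TRUE removes the older files for demo.rds
v3 = cache_rds({ 1 + 3 }, file = "demo.rds", dir = d, clean = TRUE)
n_after_clean = length(list.files(d))

print(c(n_kept, n_after_clean))

unlink(d, recursive = TRUE)  # clean up
```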

Examples

f = tempfile()  # base path of the cache file (the hash and .rds are appended)
compute = function(...) {
    res = xfun::cache_rds({
        Sys.sleep(1)
        1:10
    }, file = f, dir = "", ...)
    res
}
compute()  # takes one second
compute()  # returns 1:10 immediately
compute()  # fast again
compute(rerun = TRUE)  # one second to rerun
compute()  # fast again: loaded from the refreshed cache
unlink(paste0(f, "_*.rds"))  # remove the cache file