mlazy: Cacheing objects for lazy-load access

Description

mlazy and friends are designed for handling collections of biggish objects, where only a few of the objects are accessed during any period, and especially where the individual objects might change and the collection might grow or shrink. As with "lazy loading" of packages, and the gdata/ASOR packages, the idea is to avoid the time & memory overhead associated with loading in numerous huge R binary objects when not all will be needed. Unlike lazy loading and gdata, mlazy caches each mlazyed object in a separate file, so it also avoids the overhead that would be associated with changing/adding/deleting objects if all objects lived in the same big file. When a workspace is Saved, the code updates only those individual object files that need updating.

mlazy does not require any special structure for object collections; in particular, the data doesn't have to go into a package. mlazy is particularly useful for users of cd because each cd to/from a task causes a read/write of the binary image file (usually ".RData"), which can be very large if mlazy is not used. Read DETAILS next. Feedback is welcome.

Usage

mlazy( ..., what, envir=.GlobalEnv, save.now=TRUE)
  # cache some objects
mtidy( ..., what, envir=.GlobalEnv)
  # (cache and) purge the cache to disk, freeing memory
demlazy( ..., what, envir=.GlobalEnv)
  # makes 'what' into normal uncached objects
mcachees( envir=.GlobalEnv)
  # shows which objects in  envir are cached
attach.mlazy( dir, pos=2, name=)
  # load mcached workspace into new search environment,
  # or create empty s.e. for cacheing

Arguments

...

unquoted object names, overridden by what if supplied

what

character vector of object names, all from the same environment. For mtidy and demlazy, defaults to all currently-cached objects in envir

envir

environment or position on the search path, defaulting to the environment where what or objs live.

save.now

see DETAILS

dir

name of directory, relative to task.home.

pos

numeric position of environment on search path, 2 or more

name

name to give environment, defaulting to something like "data:current.task:dir".

Value

These functions are used only for their side-effects, except for cachees which returns a character vector of object names.

More details

All this is geared to working with saved images (i.e. ".RData" or "all.rda" files) rather than creating all objects anew each session via source. If you use the latter approach, mlazy will probably be of little value.

The easiest way to set up cacheing is just to create your objects as normal, then call

mlazy( <<objname1>>, <<objname2>>, <<etc>>)

Save()

This will not seem to do much immediately-- your object can be read and changed as normal, and is still taking up memory. The memory and time savings will come in your next R session in this workspace.

You should never see any differences (except in time & memory usage) between working with cached (AKA mlazyed) and normal uncached objects.[One minor exception is that cacheing a function may stuff up the automatic backup system, or at any rate the "backstop" version of it which runs when you cd. This is deliberate, for speeding up cd. But why would you cache a function anyway?]

mlazy itself doesn't save the workspace image (the ".RData" or "all.rda" file), which is where the references live; that's why you need to call Save periodically. save.image and save will not work properly, and nor will load-- see NOTE below. Save doesn't store cached objects directly in the ".RData" file, but instead stores the uncached objects as normal in .RData together with a special object called something like .mcache00 (guaranteed not to conflict with one of your own objects). When the .RData file is subsequently reloaded by cd, the presence of the .mcache00 object triggers the creation of "stub" objects that will load the real cached objects from disk when and only when each one is required; the .mcache00 object is then deleted. Cached objects are loaded & stored in a subdirectory "mlazy" from individual files called "obj*.rda", where "*" is a number.

mlazy and Save do not immediately free any memory, to avoid any unnecessary re-loading from disk if you access the objects again during the current session. To force a "memory purge" during an R session, you need to call mtidy. mtidy purges its arguments from the cache, replacing them by promises just as when loading the workspace; when a reference is next accessed, its cached version will be re-loaded from disk. mtidy can be useful if you are looping over objects, and want to keep memory growth limited-- you can mtidy each object as the last statement in the loop. By default, mtidy purges the cache of all objects that have previously been cached. mtidy also caches any formerly uncached arguments, so one call to mtidy can be used instead of mlazy( ...); mtidy( ...).

move understands cached objects, and will shuffle the files accordingly.

demlazy will delete the corresponding "obj*.rda" file(s), so that only an in-memory copy will then exist; don't forget to Save soon after.

Warning

The system function load does not understand cacheing. If you merely load an image file saved using Save, cached objects will not be there, but there will be an extra object called something like .mcache00. Hence, if you have cached objects in your ROOT task, they will not be visible when you start R until you load the mvbutils library-- another fine reason to do that in your .First. The .First.lib function in mvbutils calls setup.mcache( .GlobalEnv) to automatically prepare any references in the ROOT task.

Cacheing in other search environments

It is possible to cache in search environments other the current top one (AKA the current workspace, AKA .GlobalEnv). This could be useful if, for example, you have a large number of simulated datasets that you might need to access, but you don't want them cluttering up .GlobalEnv. If you weren't worried about cacheing, you'd probably do this by calling attach( "<<filename>>"). The cacheing equivalent is attach.mlazy( "cachedir"). The argument is the name of a directory where the cached objects will be (or already are) stored; the directory will be created if necessary. If there is a ".RData" file in the directory, attach.mlazy will load it and set up any references properly; the ".RData" file will presumably contain mostly references to cached data objects, but can contain normal uncached objects too.

Once you have set up a cacheable search environment via attach.mlazy (typically in search position 2), you can cache objects into it using mlazy with the envir argument set (typically to 2). If the objects are originally somewhere else, they will be transferred to envir before cacheing. Whenever you want to save the cached objects, call Save.pos(2).

You will probably also want to modify or create the .First.task (see cd) of the current task so that it calls attach.mlazy("<<cache directory name>>"). Also, you should create a .Last.task (see cd) containing detach(2), otherwise cd(..) and cd(0/...) won't work.

Options

By default, mlazy now saves & loads into a auto-created subdirectory called "mlazy". In the earliest releases, though, it saved "obj*.rda" files into the same directory as ".RData". It will now move any "obj*.rda" files that it finds alongside ".RData" into the "mlazy" subdirectory. You can (possibly) override this by setting options( mlazy.subdir=FALSE), but the default is likely more reliable.

By default, there is no way to figure out what object is contained in a "obj*.rda" without forcibly loading that file or inspecting the .mcache00 object in the "parent" .RData file-- not that you should ever need to know. However, if you set options( mlazy.index=TRUE) (recommended), then a file "obj.ind" will be maintained in the "mlazy" directory, showing (object name - value) pairs in plain text (tab-separated). For directories with very large numbers of objects, there may be some speed penalty. If you want to create an index file for an existing "mlazy" directory that lacks one, cd to the task and call mvbutils:::mupdate.mcache.index.if.opt(mlazy.index=TRUE).

See Save for how to set compression options, and save for what you can set them to; options(mvbutils.compression_level=1) may save some time, at the expense of disk space.

Troubleshooting

In the unlikely event of needing to manually load a cached image file, use load.refdb-- cd and attach.mlazy do this automatically.

In the unlikely event of lost/corrupted data, you can manually reload individual "obj*.rda" files using load-- each "obj*.rda" file contains one object stored with its correct name. Before doing that, call demlazy( what=mcachees()) to avoid subsequent trouble. Once you have reloaded the objects, you can call mlazy again.

See Options for the easy way to check what object is stored in a particular "obj*.rda" file. If that feature is turned off on your system, the failsafe way is to load the file into a new environment, e.g. e <- new.env(); load( "obj99.rda", e); ls( e).

To see how memory changes when you call mlazy and mtidy, call gc().

To check object sizes without actually loading the cached objects, use lsize. Many functions that iterate over all objects in the environment, such as eapply, will cause mlazy objects to be loaded.

Housekeeping of "obj**.rda" files happens during Save; any obsolete files (i.e. corresponding to objects that have been removed) are deleted.

Inner workings

What happens: each workspace acquires a mcache attribute, which is a named numeric vector. The absolute values of the entries correspond to files-- 53 corresponds to a file "obj53.rda", etc., and the names to objects. When an object myobj is mlazyed, the mcache is augmented by a new element named "myobj" with a new file number, and that file is saved to disk. Also, "myobj" is replaced with an active binding (see makeActiveBinding). The active binding is a function which retrieves or sets the object's data within the function's environment. If the function is called in change-value mode, then it also makes negative the file number in mcache. Hence it's possible to tell whether an object has been changed since last being saved.

When an object is first mlazyed, the object data is placed directly into the active binding function's environment so that the function can find/modify the data. When an object is mtidyed, or when a cached image is loaded from disk, the thing placed into the A.B.fun's environment is not the data itself, but instead a promise saying, in effect, "fetch me from disk when you need me". The promise gets forced when the object is accessed for reading or writing. This is how "lazy loading" of packages works, and also the gdata package. However, for mlazy there is the additional requirement of being able to determine whether an object has been modified; for efficiency, only modified objects should be written to disk when there is a Save.

There is presumably some speed penalty from using a cache, but experience to date suggests that the penalty is small. Cached objects are saved in compressed format, which seems to take a little longer than an uncompressed save, but loading seems pretty quick compared to uncompressed files.

Examples

Run this code

# NOT RUN {
biggo <- matrix( runif( 1e6), 1000, 1000)
gc() # lots of memory
mlazy( biggo)
gc() # still lots of memory
mtidy( biggo)
gc() # better
biggo[1,1]
gc() # worse; it's been reloaded
# }

Run the code above in your browser using DataLab