mlazy: Cacheing objects for lazy-load access

Description

mlazy and friends are designed for handling collections of biggish objects, where only a few of the objects are accessed during any period, and especially where the individual objects might change and the collection might grow or shrink. As with "lazy loading" of packages, and the gdata package, the idea is to avoid the time and memory overhead associated with loading numerous huge R binary objects when not all will be needed. Unlike lazy loading and gdata, mlazy keeps each lazy-load object in a separate file, so it also avoids the overhead associated with changing, adding, or deleting objects when all objects are saved into the same big file. When a workspace is Saved, the code updates only those individual object files that need updating.

mlazy does not require any special structure for object collections; in particular, the data doesn't have to go into a package. mlazy is particularly useful for users of cd because each cd to/from a task causes a read/write of the binary image file (usually ".Rdata"), which can be very large if mlazy is not used.

Read DETAILS next. Feedback is welcome.

Usage

mlazy( ..., what, envir=.GlobalEnv) # cache some objects
mtidy( ..., what, envir=.GlobalEnv) # (cache and) purge the cache to disk, freeing memory
demlazy( ..., what, envir=.GlobalEnv) # makes what into normal uncached objects
mcachees( envir=.GlobalEnv) # shows which objects in envir are cached
attach.mlazy( dir, pos=2, name=) # load mcached workspace into new search environment, or create empty s.e. for cacheing

Arguments

...

unquoted object names, overridden by what if supplied

what

character vector of object names, all from the same environment. For mtidy and demlazy, defaults to all currently-cached objects in envir

envir

environment or position on the search path, defaulting to the environment where the objects named in what or ... live.

dir

name of directory, relative to task.home.

pos

numeric position of environment on search path, 2 or more

name

name to give environment, defaulting to something like "data:current.task:dir".

Value

These functions are used only for their side-effects, except for mcachees, which returns a character vector of object names.

More details

What happens: each workspace acquires a mcache attribute, which is a named numeric vector. The absolute values of the entries correspond to files (53 corresponds to a file "obj53.rda", etc.), and the names to objects. When an object myobj is mlazyed, the mcache is augmented by a new element named "myobj" with a new file number, and that file is saved to disk. Also, "myobj" is replaced with an active binding (see makeActiveBinding). The active binding is a function which retrieves or sets the object's data within the function's environment. If the function is called in change-value mode, then it also makes negative the file number in mcache. Hence it's possible to tell whether an object has been changed since last being saved.

When an object is first mlazyed, the object data is placed directly into the active binding function's environment so that the function can find/modify the data. When an object is mtidyed, or when a cached image is loaded from disk, the thing placed into the active binding function's environment is not the data itself, but instead a promise saying, in effect, "fetch me from disk when you need me". The promise gets forced when the object is accessed for reading or writing. This is how "lazy loading" of packages works, and also the gdata package. However, for mlazy there is the additional requirement of being able to determine whether an object has been modified; for efficiency, only modified objects should be written to disk when there is a Save.

There is presumably some speed penalty from using a cache, but experience to date suggests that the penalty is small. Cached objects are saved in compressed format, which seems to take a little longer than an uncompressed save, but loading seems pretty quick compared to uncompressed files.
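The mechanism described above can be imitated in a few lines of base R. This is only a toy sketch for one object, not the mvbutils implementation: the names cache_dir, ws, state, store and obj_file are made up for illustration, and the real code keeps mcache as an attribute of the workspace rather than in a separate environment.

```r
# Toy illustration in base R; none of these helper names come from mvbutils.
cache_dir <- tempfile("toycache")          # hypothetical cache directory
dir.create(cache_dir)

ws    <- new.env()                         # stands in for the workspace
state <- new.env()                         # holds the toy "mcache" index
store <- new.env()                         # environment the binding reads/writes

obj_file <- function(n) file.path(cache_dir, sprintf("obj%d.rda", abs(n)))

# "mlazy": record a file number, keep the data in 'store', save it to disk,
# and replace the object with an active binding.
state$mcache <- c(myobj = 1)
store$data <- 1:10
save(list = "data", file = obj_file(1), envir = store)
makeActiveBinding("myobj", function(value) {
  if (missing(value)) return(store$data)   # read mode (forces any promise)
  store$data <- value                      # change-value mode
  state$mcache["myobj"] <- -abs(state$mcache["myobj"])  # flag as modified
  invisible(value)
}, ws)

ws$myobj <- 11:20                          # triggers change-value mode
was_modified <- state$mcache[["myobj"]] < 0

# "Save": rewrite only files whose numbers are negative, then reset the sign.
if (state$mcache[["myobj"]] < 0) {
  save(list = "data", file = obj_file(state$mcache[["myobj"]]), envir = store)
  state$mcache["myobj"] <- abs(state$mcache["myobj"])
}

# "mtidy": swap the in-memory data for a promise that loads from disk on demand.
local({
  f <- obj_file(state$mcache[["myobj"]])
  delayedAssign("data", {
    tmp <- new.env()
    load(f, envir = tmp)
    tmp$data
  }, eval.env = environment(), assign.env = store)
})

reloaded <- ws$myobj                       # forces the promise; data comes from disk
```

The sign of the file number acts as the "modified" flag, so a Save only needs to scan the index rather than compare object contents.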

Details

All this is geared to working with saved images (i.e. ".Rdata" or "all.rda" files) rather than creating all objects anew each session via source. If you use the latter approach, mlazy will probably be of little value.

The easiest way to set up cacheing is just to create your objects as normal, then call mlazy( <>, <>, <>) followed by Save(). This will not seem to do much immediately: your object can be read and changed as normal, and is still taking up memory. The memory and time savings will come in your next R session in this workspace.

You should never see any differences (except in time and memory usage) between working with cached and normal uncached objects. [One minor exception is that cacheing a function may stuff up the automatic backup system, or at any rate the "backstop" version of it which runs when you cd. This is deliberate, for speeding up cd. But why would you cache a function anyway?]

mlazy itself doesn't save the workspace image (the ".Rdata" or "all.rda" file), which is where the references live; that's why you need to call Save periodically. save.image and save will not work (and nor will load; see NOTE). Save doesn't store mcached objects directly in the .Rdata file, but instead stores an index object called something like .mcache00 (guaranteed not to conflict with one of your own objects). When the image is loaded, the index triggers the creation of mcached objects with promises-to-load, and is then deleted. The actual load process is handled by load.refdb but you shouldn't need to call this directly.

mlazy and Save do not immediately free any memory, to avoid any unnecessary re-loading from disk if you access the objects again during the current session. To force a "memory purge" during an R session, you need to call mtidy. mtidy purges its arguments from the cache, replacing them by promises just as when loading the workspace; when a reference is next accessed, its cached version will be re-loaded from disk.

mtidy can be useful if you are looping over objects and want to keep memory growth limited: you can mtidy each object as the last statement in the loop. By default, mtidy purges the cache of all objects that have previously been cached. mtidy also caches any formerly uncached arguments, so one call to mtidy can be used instead of mlazy( ...); mtidy( ...).

move understands cached objects, and will shuffle the files accordingly. demlazy will delete the corresponding "obj*.rda" file(s), so that only an in-memory copy will then exist; don't forget to Save soon after.

Cacheing in other search environments

It is possible to cache in search environments other than the current top one (AKA the current workspace, AKA ".GlobalEnv"). This could be useful if, for example, you have a large number of simulated datasets that you might need to access, but you don't want them cluttering up .GlobalEnv. If you weren't worried about cacheing, you'd probably do this by calling attach( "<>") (qv). The cacheing equivalent is attach.mlazy( "cachedir"). The argument is the name of a directory where the cached objects will be (or already are) stored; the directory will be created if necessary. If there is a ".Rdata" file in the directory, attach.mlazy will load it and set up any references properly; the ".Rdata" file will presumably contain mostly references to cached data objects, but can contain normal uncached objects too.

Once you have set up a cacheable search environment via attach.mlazy (typically in search position 2), you can cache objects into it using mlazy with the envir argument set (typically to 2). If the objects are originally somewhere else, they will be transferred to envir before cacheing. Whenever you want to save the cached objects, call Save.pos(2).

You will probably also want to modify or create the .First.task (see cd (qv)) of the current task so that it calls attach.mlazy("<>"). Also, you should create a .Last.task (see cd (qv)) containing detach(2), otherwise cd(..) and cd(0/...) won't work.
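The steps for cacheing into a second search environment might be bundled up roughly as follows. This is an unverified sketch, not code from mvbutils: it assumes the package has already been attached with library(mvbutils), the wrapper name cache_in_pos2 and the directory name "cachedir" are made up, and it uses only the calls named in the text above.

```r
# Sketch only: assumes library(mvbutils) has been called beforehand.
# cache_in_pos2 and "cachedir" are hypothetical names for illustration.
cache_in_pos2 <- function(obj_names, dir = "cachedir") {
  attach.mlazy(dir, pos = 2)            # create or load the cacheable search environment
  mlazy(what = obj_names, envir = 2)    # transfer the named objects there and cache them
  Save.pos(2)                           # save the references and any modified object files
  mcachees(envir = 2)                   # character vector of what is now cached
}
```

A call such as cache_in_pos2( c( "sim1", "sim2")) would then be expected to leave sim1 and sim2 cached in search position 2, provided objects with those (hypothetical) names exist.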

Examples

biggo <- matrix( runif( 1e6), 1000, 1000)
gc() # lots of memory
mlazy( biggo)
gc() # still lots of memory
mtidy( biggo)
gc() # better
biggo[1,1]
gc() # worse; it's been reloaded
