mlazy and friends are designed for handling collections of biggish objects, where only a few of the objects are accessed during any period, and especially where the individual objects might change and the collection might grow or shrink. The approach resembles "lazy loading" of packages and the gdata package; mlazy keeps each lazy-load object in a separate file, so it also avoids the overhead associated with changing/adding/deleting objects when all objects are saved into the same big file. When a workspace is Saved, the code updates only those individual object files that need updating.

mlazy does not require any special structure for object collections; in particular, the data doesn't have to go into a package. mlazy is particularly useful for users of cd because each cd to/from a task causes a read/write of the binary image file (usually ".Rdata"), which can be very large if mlazy is not used. Read DETAILS next. Feedback is welcome.

Usage

mlazy( ..., what, envir=.GlobalEnv) # cache some objects
mtidy( ..., what, envir=.GlobalEnv) # (cache and) purge the cache to disk, freeing memory
demlazy( ..., what, envir=.GlobalEnv) # makes what into normal uncached objects
mcachees( envir=.GlobalEnv) # shows which objects in envir are cached
attach.mlazy( dir, pos=2, name=) # load mcached workspace into new search environment, or create empty s.e. for cacheing
Arguments

what: [...] if supplied. For mtidy and demlazy, defaults to all currently-cached objects in envir.
[...] what or objs live.
[...] task.home.

Value

[...] mcachees, which returns a character vector of object names.

Cached objects are tracked through an mcache attribute, which is a named numeric vector. The absolute values of the entries correspond to files (53 corresponds to a file "obj53.rda", etc.) and the names to objects. When an object myobj is mlazyed, the mcache is augmented by a new element named "myobj" with a new file number, and that file is saved to disk. Also, "myobj" is replaced with an active binding (see makeActiveBinding). The active binding is a function which retrieves or sets the object's data within the function's environment. If the function is called in change-value mode, then it also makes negative the file number in mcache. Hence it's possible to tell whether an object has been changed since last being saved.
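The bookkeeping described above can be imitated in a few lines of base R. This is only a sketch of the idea, not the package's actual code: `state`, `make_cached`, `store` and `accessor` are hypothetical names invented for the illustration, and the index lives in an ordinary environment rather than in an attribute.

```r
# Sketch only: mimics the described bookkeeping with base R.
# 'state', 'make_cached', 'store' and 'accessor' are hypothetical names.
state <- new.env()
state$index <- c( myobj=53)        # file number per object, like the mcache

make_cached <- function( name, value, envir) {
  store <- new.env()               # private environment holding the real data
  store$value <- value
  accessor <- function( new_value) {
    if( missing( new_value))       # read mode: just return the data
      return( store$value)
    store$value <- new_value       # change-value mode: update the data...
    state$index[ name] <- -abs( state$index[ name])   # ...and flag it as modified
    invisible( new_value)
  }
  makeActiveBinding( name, accessor, envir)
}

work <- new.env()
make_cached( "myobj", 1:10, work)

total <- sum( work$myobj)          # reading leaves the index untouched
unchanged <- state$index[[ "myobj"]]
work$myobj <- 11:20                # writing flips the sign of the file number
changed <- state$index[[ "myobj"]]
```

After the assignment, a negative entry in `state$index` marks `myobj` as needing to be rewritten at the next save, which is the role the text gives to negative file numbers.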
When an object is first mlazyed, the object data is placed directly into the active binding function's environment so that the function can find/modify the data. When an object is mtidyed, or when a cached image is loaded from disk, the thing placed into the active binding function's environment is not the data itself, but instead a promise saying, in effect, "fetch me from disk when you need me". The promise gets forced when the object is accessed for reading or writing. This is how "lazy loading" of packages works. For mlazy, there is the additional requirement of being able to determine whether an object has been modified; for efficiency, only modified objects should be written to disk when there is a Save.
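The "fetch me from disk when you need me" behaviour can be reproduced with base R's delayedAssign. The following is a minimal sketch under stated assumptions: `cache_dir`, `obj_file`, `fetch` and the `loads` counter are hypothetical names for this example, and the file name merely echoes the "obj53.rda" pattern mentioned earlier.

```r
# Sketch only: a promise that fetches an object from its own file on first use.
# 'cache_dir', 'obj_file', 'fetch' and 'loads' are hypothetical names.
cache_dir <- tempfile( "mlazy_sketch")
dir.create( cache_dir)
obj_file <- file.path( cache_dir, "obj53.rda")

biggo <- matrix( runif( 1e4), 100, 100)
expected_sum <- sum( biggo)
save( biggo, file=obj_file)        # one object per file
rm( biggo)

loads <- 0                         # counts how often the file is actually read
fetch <- function( file) {
  loads <<- loads + 1
  tmp <- new.env()
  load( file, envir=tmp)
  tmp$biggo
}

work <- new.env()
delayedAssign( "biggo", fetch( obj_file), assign.env=work)

loads_before <- loads              # nothing has been read from disk yet
first <- sum( work$biggo)          # forces the promise, reading the file
second <- sum( work$biggo)         # promise already forced, no second read
loads_after <- loads
```

The file is read once, on first access, and not again afterwards. Detecting later modification is a separate concern that a plain promise does not address.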
There is presumably some speed penalty from using a cache, but experience to date suggests that the penalty is small. Cached objects are saved in compressed format, which seems to take a little longer than an uncompressed save, but loading seems pretty quick compared to uncompressed files.

Cacheing is intended for workflows that keep objects in saved binary image files (".Rdata" or "all.rda" files) rather than creating all objects anew each session via source. If you use the latter approach, mlazy will probably be of little value.
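To get a feel for the compressed-versus-uncompressed trade-off mentioned above, the two kinds of save can be compared directly with base R. This is only a sketch: the object and file names are made up, the data is deliberately repetitive so that it compresses well, and timings will differ between machines.

```r
# Sketch only: compare compressed and uncompressed saves of the same object.
# 'x', 'f_comp' and 'f_raw' are hypothetical names.
x <- rep( seq_len( 100), times=1e4)    # repetitive data, so it compresses well

f_comp <- tempfile( fileext=".rda")
f_raw <- tempfile( fileext=".rda")

t_comp <- system.time( save( x, file=f_comp, compress=TRUE))[[ "elapsed"]]
t_raw <- system.time( save( x, file=f_raw, compress=FALSE))[[ "elapsed"]]

size_comp <- file.size( f_comp)        # bytes on disk, compressed
size_raw <- file.size( f_raw)          # bytes on disk, uncompressed
```

Comparing `t_comp` with `t_raw` and `size_comp` with `size_raw` on your own data gives a rough idea of whether the extra save time matters for your objects.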
The easiest way to set up cacheing is just to create your objects as normal, then call mlazy( <>, <>, <>) followed by Save(). This will not seem to do much immediately-- your object can be read and changed as normal, and is still taking up memory. The memory and time savings will come in your next R session in this workspace.
You should never see any differences (except in time & memory usage) between working with cached and normal uncached objects.
[One minor exception is that cacheing a function may stuff up the automatic backup system, or at any rate the "backstop" version of it which runs when you cd. This is deliberate, for speeding up cd. But why would you cache a function anyway?]
mlazy itself doesn't save the workspace image (the ".Rdata" or "all.rda" file), which is where the references live; that's why you need to call Save periodically. save.image and save will not work (and nor will load -- see NOTE). Save doesn't store mcached objects directly in the .Rdata file, but instead stores an index object called something like .mache00 (guaranteed not to conflict with one of your own objects) that triggers the creation of mcached objects with promises-to-load, and is then deleted. The actual load process is handled by load.refdb, but you shouldn't need to call this directly.
mlazy and Save do not immediately free any memory, to avoid any unnecessary re-loading from disk if you access the objects again during the current session. To force a "memory purge" during an R session, you need to call mtidy. mtidy purges its arguments from the cache, replacing them by promises just as when loading the workspace; when a reference is next accessed, its cached version will be re-loaded from disk. mtidy can be useful if you are looping over objects, and want to keep memory growth limited-- you can mtidy each object as the last statement in the loop. By default, mtidy purges the cache of all objects that have previously been cached. mtidy also caches any formerly uncached arguments, so one call to mtidy can be used instead of mlazy( ...); mtidy( ...).
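The loop pattern described above (process an object, then purge it as the last statement) can be illustrated without the package. In this base-R sketch, `purge` is a hypothetical stand-in for what mtidy is described as doing: it writes the object to its own file and leaves a promise-to-reload behind. The names `cache_dir`, `work` and `sim1` to `sim3` are likewise invented for the example.

```r
# Sketch only: 'purge' is a hypothetical stand-in, not the package's mtidy.
cache_dir <- tempfile( "purge_sketch")
dir.create( cache_dir)

purge <- function( name, envir) {
  file <- file.path( cache_dir, paste0( name, ".rda"))
  save( list=name, envir=envir, file=file)   # write the object to its own file
  rm( list=name, envir=envir)                # drop the in-memory copy
  delayedAssign( name, {                     # leave a promise-to-reload behind
      tmp <- new.env()
      load( file, envir=tmp)
      get( name, envir=tmp)
    }, assign.env=envir)
  invisible( file)
}

work <- new.env()
totals <- numeric( 0)
for( nm in c( "sim1", "sim2", "sim3")) {
  assign( nm, rep( 1, 1000), envir=work)     # stand-in for a large dataset
  totals[ nm] <- sum( get( nm, envir=work))
  purge( nm, work)                           # last statement in the loop body
}

reloaded <- sum( work$sim2)                  # forces a reload from disk
```

Only one dataset's data needs to be held in memory at a time, yet every object remains accessible by name afterwards.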
move understands cached objects, and will shuffle the files accordingly.

demlazy will delete the corresponding "obj*.rda" file(s), so that only an in-memory copy will then exist; don't forget to Save soon after.
Cacheing in other search environments
It is possible to cache in search environments other than the current top one (AKA the current workspace, AKA .GlobalEnv). This could be useful if, for example, you have a large number of simulated datasets that you might need to access, but you don't want them cluttering up .GlobalEnv. If you weren't worried about cacheing, you'd probably do this by calling attach( "<>") (qv). The cacheing equivalent is attach.mlazy( "cachedir"). The argument is the name of a directory where the cached objects will be (or already are) stored; the directory will be created if necessary. If there is a ".Rdata" file in the directory, attach.mlazy will load it and set up any references properly; the ".Rdata" file will presumably contain mostly references to cached data objects, but can contain normal uncached objects too.
Once you have set up a cacheable search environment via attach.mlazy (typically in search position 2), you can cache objects into it using mlazy with the envir argument set (typically to 2). If the objects are originally somewhere else, they will be transferred to envir before cacheing. Whenever you want to save the cached objects, call Save.pos(2).
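For readers unfamiliar with working in search position 2, the underlying idea can be tried with base R alone. This sketch involves no cacheing at all: it only shows an empty search environment being created, filled and removed, and the names "sketch:cachedir" and `sim_data` are hypothetical.

```r
# Sketch only: base-R analogue of keeping data in search position 2.
# "sketch:cachedir" and 'sim_data' are hypothetical names.
attach( NULL, pos=2, name="sketch:cachedir")   # create an empty search environment
assign( "sim_data", rnorm( 100), pos=2)        # store an object there, not in .GlobalEnv

in_global <- exists( "sim_data", envir=globalenv(), inherits=FALSE)
found <- exists( "sim_data")                   # visible via the search path
n <- length( sim_data)

detach( "sketch:cachedir", character.only=TRUE)   # remove the search environment again
gone <- !exists( "sim_data")
```

The object is usable by name while the environment is attached, never appears in .GlobalEnv, and disappears from view once the environment is detached.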
You will probably also want to modify or create the .First.task (see cd (qv)) of the current task so that it calls attach.mlazy("<>"). Also, you should create a .Last.task (see cd (qv)) containing detach(2), otherwise cd(..) and cd(0/...) won't work.

See Also

gc, package [...]

Examples

biggo <- matrix( runif( 1e6), 1000, 1000)
gc() # lots of memory
mlazy( biggo)
gc() # still lots of memory
mtidy( biggo)
gc() # better
biggo[1,1]
gc() # worse; it's been reloaded