make()
drake_config() collects and sanitizes the multitude of parameters and settings that make() needs to do its job: the plan, packages, the environment of functions and initial data objects, parallel computing instructions, verbosity level, etc.
drake_config(plan, targets = NULL, envir = parent.frame(),
verbose = 1L, hook = NULL, cache = drake::drake_cache(verbose =
verbose, console_log_file = console_log_file), fetch_cache = NULL,
parallelism = "loop", jobs = 1L, jobs_preprocess = 1L,
packages = rev(.packages()), lib_loc = NULL,
prework = character(0), prepend = NULL, command = NULL,
args = NULL, recipe_command = NULL, timeout = NULL, cpu = Inf,
elapsed = Inf, retries = 0, force = FALSE, log_progress = TRUE,
graph = NULL, trigger = drake::trigger(), skip_targets = FALSE,
skip_imports = FALSE, skip_safety_checks = FALSE,
lazy_load = "eager", session_info = TRUE, cache_log_file = NULL,
seed = NULL, caching = c("master", "worker"), keep_going = FALSE,
session = NULL, pruning_strategy = NULL, makefile_path = NULL,
console_log_file = NULL, ensure_workers = NULL,
garbage_collection = FALSE, template = list(), sleep = function(i)
0.01, hasty_build = NULL, memory_strategy = "speed", layout = NULL,
lock_envir = TRUE, history = TRUE, recover = FALSE,
recoverable = TRUE)
Workflow plan data frame. A workflow plan data frame is a data frame with a target column and a command column. (See the details in the drake_plan() help file for descriptions of the optional columns.) Targets are the objects that drake generates, and commands are the pieces of R code that produce them. You can create and track custom files along the way (see file_in(), file_out(), and knitr_in()). Use the function drake_plan() to generate workflow plan data frames.
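As a rough illustration of the description above, the sketch below builds the two required columns by hand in base R and then, only if drake happens to be installed, creates a comparable plan with drake_plan(). The target names raw and fit, the file name data.csv, and the commands are all made up for illustration.

```r
# The minimal shape of a workflow plan: one row per target,
# with a `target` column and a `command` column.
# (Target names, file name, and commands are hypothetical.)
plan_shape <- data.frame(
  target = c("raw", "fit"),
  command = c("read.csv(file_in(\"data.csv\"))", "lm(y ~ x, data = raw)"),
  stringsAsFactors = FALSE
)
print(plan_shape)

# The recommended route is drake_plan(), which builds the data frame for you.
# The commands are stored, not evaluated, so data.csv need not exist here.
if (requireNamespace("drake", quietly = TRUE)) {
  plan <- drake::drake_plan(
    raw = read.csv(file_in("data.csv")),
    fit = lm(y ~ x, data = raw)
  )
  print(plan)
}
```

The hand-built data frame is only meant to show the shape of a plan; in practice drake_plan() is the safer way to create one.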
Character vector, names of targets to build. Dependencies are built too. Together, the plan and targets comprise the workflow network (i.e. the graph argument). Changing either will change the network.
Environment to use. Defaults to the current workspace, so you should not need to worry about this most of the time. A deep copy of envir is made, so you don't need to worry about your workspace being modified by make(). The deep copy inherits from the global environment. Wherever necessary, objects and functions are imported from envir and the global environment and then reproducibly tracked as dependencies.
Integer, control printing to the console/terminal.
- 0: print nothing.
- 1: print targets, retries, and failures.
- 2: also show a spinner when preprocessing tasks are underway.
Deprecated.
drake cache as created by new_cache(). See also drake_cache().
Deprecated.
Character scalar, type of parallelism to use. For detailed explanations, see the high-performance computing chapter of the user manual.
You could also supply your own scheduler function if you want to experiment or aggressively optimize. The function should take a single config argument (produced by drake_config()). Existing examples from drake's internals are the backend_*() functions:
- backend_loop()
- backend_clustermq()
- backend_future()
However, this functionality is really a back door and should not be used for production purposes unless you really know what you are doing and you are willing to suffer setbacks whenever drake's unexported core functions are updated.
Maximum number of parallel workers for processing the targets.
You can experiment with predict_runtime()
to help decide on an appropriate number of jobs.
For details, visit
https://ropenscilabs.github.io/drake-manual/time.html.
Number of parallel jobs for processing the imports and doing other preprocessing tasks.
Character vector, packages to load, in the order they should be loaded. Defaults to rev(.packages()), so you should not usually need to set this manually. Just call library() to load your packages before make(). However, sometimes packages need to be strictly forced to load in a certain order, especially if parallelism is "Makefile". To do this, do not use library(), require(), loadNamespace(), or attachNamespace() to load any libraries beforehand. Just list your packages in the packages argument in the order you want them to be loaded.
Character vector, optional. Same as in library() or require(). Applies to the packages argument (see above).
Expression (language object), list of expressions, or character vector. Code to run right before targets build. Called only once if parallelism is "loop" and once per target otherwise. This code can be used to set global options, etc.
Deprecated.
Deprecated.
Deprecated.
Deprecated.
Deprecated. Use elapsed and cpu instead.
Same as the cpu argument of setTimeLimit(). Seconds of cpu time before a target times out. Assign target-level cpu timeout times with an optional cpu column in plan.
Same as the elapsed argument of setTimeLimit(). Seconds of elapsed time before a target times out. Assign target-level elapsed timeout times with an optional elapsed column in plan.
Number of retries to execute if the target fails. Assign target-level retries with an optional retries column in plan.
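Because the cpu and elapsed arguments are described in terms of setTimeLimit(), a small base R sketch may help show what such a limit does. The helper run_with_limit() below is hypothetical and is not part of drake; how drake applies the limits internally may differ.

```r
# Evaluate an expression under an elapsed-time limit (base R only).
# `run_with_limit` is a hypothetical helper, not a drake function.
run_with_limit <- function(expr, elapsed = Inf) {
  # Always clear the limit when we leave, whatever happened.
  on.exit(setTimeLimit(cpu = Inf, elapsed = Inf, transient = FALSE))
  tryCatch(
    {
      setTimeLimit(elapsed = elapsed, transient = TRUE)
      expr  # the lazily evaluated argument is forced here, under the limit
    },
    error = function(e) NULL  # NULL signals that the expression did not finish
  )
}

quick <- run_with_limit(1 + 1, elapsed = 5)
slow <- run_with_limit({
  Sys.sleep(3)
  "done"
}, elapsed = 0.2)

print(quick)
print(is.null(slow))
```

In this sketch the fast expression returns its value, while the slow one is interrupted once the elapsed limit is reached.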
Logical. If FALSE (default) then drake imposes checks if the cache was created with an old and incompatible version of drake. If there is an incompatibility, make() stops to give you an opportunity to downgrade drake to a compatible version rather than rerun all your targets from scratch.
Logical, whether to log the progress of individual targets as they are being built. Progress logging creates extra files in the cache (usually the .drake/ folder) and slows down make() a little. If you need to reduce or limit the number of files in the cache, call make(log_progress = FALSE, recoverable = FALSE).
An igraph object from the previous make(). Supplying a pre-built graph could save time.
Name of the trigger to apply to all targets. Ignored if plan has a trigger column. See trigger() for details.
Logical, whether to skip building the targets in plan and just import objects and files.
Logical, whether to totally neglect to process the imports and jump straight to the targets. This can be useful if your imports are massive and you just want to test your project, but it is bad practice for reproducible data analysis. This argument is overridden if you supply your own graph argument.
Logical, whether to skip the safety checks on your workflow. Use at your own peril.
Either a character vector or a logical. Choices:
- "eager": no lazy loading. The target is loaded right away with assign().
- "promise": lazy loading with delayedAssign().
- "bind": lazy loading with active bindings: bindr::populate_env().
- TRUE: same as "promise".
- FALSE: same as "eager".
lazy_load should not be "promise" for "parLapply" parallelism combined with jobs greater than 1. For local multi-session parallelism and lazy loading, try library(future); future::plan(multisession) and then make(..., parallelism = "future_lapply", lazy_load = "bind").
If lazy_load is "eager", drake prunes the execution environment before each target/stage, removing all superfluous targets and then loading any dependencies it will need for building. In other words, drake prepares the environment in advance and tries to be memory efficient.
If lazy_load is "bind" or "promise", drake assigns promises to load any dependencies at the last minute. Lazy loading may be more memory efficient in some use cases, but it may duplicate the loading of dependencies, costing time.
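The difference between the "eager" and "promise" choices can be seen with the two base R functions named above. The following sketch is only an illustration: read_from_cache() is a hypothetical stand-in for reading a dependency from storage, and it does not show how drake itself wires these calls together.

```r
# Count how many times the (pretend) cache is read.
reads <- 0L
read_from_cache <- function() {
  reads <<- reads + 1L
  42
}
env <- new.env()

# "eager" style: the value is loaded right away with assign().
assign("eager_dep", read_from_cache(), envir = env)
reads_after_eager <- reads

# "promise" style: delayedAssign() defers the read until first use.
delayedAssign("lazy_dep", read_from_cache(), assign.env = env)
reads_after_promise <- reads

# First use of the lazy dependency forces the deferred read.
value <- get("lazy_dep", envir = env)
reads_after_use <- reads

print(c(reads_after_eager, reads_after_promise, reads_after_use))
```

The read counter only increases for the lazy dependency once the value is actually requested, which is the trade-off the text describes: nothing is loaded that is never used, but the load happens at the last minute.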
Name of the CSV cache log file to write. If TRUE, the default file name is used (drake_cache.CSV). If NULL, no file is written.
If activated, this option writes a flat text file
to represent the state of the cache
(fingerprints of all the targets and imports).
If you put the log file under version control, your commit history
will give you an easy representation of how your results change
over time as the rest of your project changes. Hopefully,
this is a step in the right direction for data reproducibility.
Integer, the root pseudo-random number generator seed to use for your project. In make(), drake generates a unique local seed for each target using the global seed and the target name. That way, different pseudo-random numbers are generated for different targets, and this pseudo-randomness is reproducible.
To ensure reproducibility across different R sessions, set.seed() and .Random.seed are ignored and have no effect on drake workflows. Conversely, make() does not usually change .Random.seed, even when pseudo-random numbers are generated. The exception to this last point is make(parallelism = "clustermq") because the clustermq package needs to generate random numbers to set up ports and sockets for ZeroMQ.
On the first call to make() or drake_config(), drake uses the random number generator seed from the seed argument. Here, if the seed is NULL (default), drake uses a seed of 0. On subsequent make()s for existing projects, the project's cached seed will be used in order to ensure reproducibility. Thus, the seed argument must either be NULL or the same seed from the project's cache (usually the .drake/ folder). To reset the random number generator seed for a project, use clean(destroy = TRUE).
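The idea of deriving a reproducible local seed from a root seed and a target name can be sketched in a few lines of base R. The mixing rule in local_seed() is invented for this illustration and is not drake's actual algorithm; the target names are hypothetical too. Unlike make(), this sketch does change .Random.seed.

```r
# Hypothetical scheme, NOT drake's algorithm: combine the root seed
# with a simple numeric digest of the target name.
local_seed <- function(root_seed, target) {
  codes <- utf8ToInt(target)
  as.integer((root_seed + sum(codes * seq_along(codes))) %% 1e9)
}

# Draw one pseudo-random number for a target using its local seed.
draw_for <- function(target, root_seed = 0L) {
  set.seed(local_seed(root_seed, target))
  runif(1)
}

first_small <- draw_for("small")
large <- draw_for("large")
second_small <- draw_for("small")

# Same target and root seed: same draw. Different target: different stream.
print(identical(first_small, second_small))
print(identical(first_small, large))
```

The point of the sketch is only that the draw for a target depends on the root seed and the target's name, not on the order in which targets happen to be built.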
Character string, either "master" or "worker".
- "master": Targets are built by remote workers and sent back to the master process. Then, the master process saves them to the cache (config$cache, usually a file system storr). Appropriate if remote workers do not have access to the file system of the calling R session. Targets are cached one at a time, which may be slow in some situations.
- "worker": Remote workers not only build the targets, but also save them to the cache. Here, caching happens in parallel. However, remote workers need to have access to the file system of the calling R session. Transferring target data across a network can be slow.
Logical, whether to still keep running make() if targets fail.
Deprecated. Has no effect now.
Deprecated. See memory_strategy.
Deprecated.
Optional character scalar of a file name or connection object (such as stdout()) to dump maximally verbose log information for make() and other functions (all functions that accept a config argument, plus drake_config()). If you choose to use a text file as the console log, it will persist over multiple function calls until you delete it manually. Fields in each row of the log file, from left to right:
- The node name (short host name) of the computer (from Sys.info()["nodename"]).
- The process ID (from Sys.getpid()).
- A timestamp with the date and time (in microseconds).
- A brief description of what drake was doing.
The fields are separated by pipe symbols ("|").
Deprecated.
Logical, whether to call gc() each time a target is built during make().
A named list of values to fill in the {{ ... }} placeholders in template files (e.g. from drake_hpc_template_file()). Same as the template argument of clustermq::Q() and clustermq::workers(). Enabled for clustermq only (make(parallelism = "clustermq")), not future or batchtools so far. For more information, see the clustermq package: https://github.com/mschubert/clustermq. Some template placeholders such as {{ job_name }} and {{ n_jobs }} cannot be set this way.
Optional function on a single numeric argument i. Default: function(i) 0.01.
To conserve memory, drake assigns a brand new closure to sleep, so your custom function should not depend on in-memory data except from loaded packages.
For parallel processing, drake uses a central master process to check what the parallel workers are doing, and for the affected high-performance computing workflows, wait for data to arrive over a network. In between loop iterations, the master process sleeps to avoid throttling. The sleep argument to make() and drake_config() allows you to customize how much time the master process spends sleeping.
The sleep argument is a function that takes an argument i and returns a numeric scalar, the number of seconds to supply to Sys.sleep() after iteration i of checking. (Here, i starts at 1.) If the checking loop does something other than sleeping on iteration i, then i is reset back to 1.
To sleep for the same amount of time between checks, you might supply something like function(i) 0.01. But to avoid consuming too many resources during heavier and longer workflows, you might use an exponential back-off: say, function(i) { 0.1 + 120 * pexp(i - 1, rate = 0.01) }.
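To get a feel for the two sleep functions mentioned above, the sketch below evaluates them at a few iteration counts. The names constant_sleep and backoff_sleep and the chosen iteration values are just for illustration.

```r
# The two sleep functions from the text, given names for comparison.
constant_sleep <- function(i) 0.01
backoff_sleep <- function(i) {
  0.1 + 120 * pexp(i - 1, rate = 0.01)
}

# Seconds of sleep after selected consecutive idle iterations.
iterations <- c(1, 10, 100, 1000)
comparison <- data.frame(
  i = iterations,
  constant = vapply(iterations, constant_sleep, numeric(1)),
  backoff = vapply(iterations, backoff_sleep, numeric(1))
)
print(comparison)
```

The back-off starts at 0.1 seconds when i is 1, grows with each idle iteration, and levels off just below 120.1 seconds, whereas the constant version never changes. Either function could then be supplied through the sleep argument, for example make(my_plan, sleep = backoff_sleep), where my_plan is a placeholder for your own plan.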
A user-defined function. In "hasty mode" (make(parallelism = "hasty")) this is the function that evaluates a target's command and returns the resulting value. The hasty_build argument has no effect if parallelism is any value other than "hasty". The function you pass to hasty_build must have arguments target and config. Here, target is a character scalar naming the target being built, and config is a configuration list of runtime parameters generated by drake_config().
Character scalar, name of the strategy drake uses to load/unload a target's dependencies in memory. You can give each target its own memory strategy (e.g. drake_plan(x = 1, y = target(f(x), memory_strategy = "lookahead"))) to override the global memory strategy. Choices:
- "speed": Once a target is newly built or loaded in memory, just keep it there. This choice maximizes speed and hogs memory.
- "autoclean": Just before building each new target, unload everything from memory except the target's direct dependencies. After a target is built, discard it from memory. (Set garbage_collection = TRUE to make sure it is really gone.) This option conserves memory, but it sacrifices speed because each new target needs to reload any previously unloaded targets from storage.
- "preclean": Just before building each new target, unload everything from memory except the target's direct dependencies. After a target is built, keep it in memory until drake determines it can be unloaded. This option conserves memory, but it sacrifices speed because each new target needs to reload any previously unloaded targets from storage.
- "lookahead": Just before building each new target, search the dependency graph to find targets that will not be needed for the rest of the current make() session. After a target is built, keep it in memory until the next memory management stage. In this mode, targets are only in memory if they need to be loaded, and we avoid superfluous reads from the cache. However, searching the graph takes time, and it could even double the computational overhead for large projects.
- "unload": Just before building each new target, unload all targets from memory. After a target is built, do not keep it in memory. This mode aggressively optimizes for both memory and speed, but in commands and triggers, you have to manually load any dependencies you need using readd().
- "none": Do not manage memory at all. Do not load or unload anything before building targets. After a target is built, do not keep it in memory. This mode aggressively optimizes for both memory and speed, but in commands and triggers, you have to manually load any dependencies you need using readd().
For even more direct control over which targets drake keeps in memory, see the help file examples of drake_envir(). Also see the garbage_collection argument of make() and drake_config().
config$layout, where config is the return value from a prior call to drake_config(). If your plan or environment has changed since the last make(), do not supply a layout argument. Otherwise, supplying one could save time.
Logical, whether to lock config$envir during make(). If TRUE, make() quits in error whenever a command in your drake plan (or prework) tries to add, remove, or modify non-hidden variables in your environment/workspace/R session. This is extremely important for ensuring the purity of your functions and the reproducibility/credibility/trust you can place in your project. lock_envir will be set to a default of TRUE in drake version 7.0.0 and higher.
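The effect of a locked environment can be seen with base R's lockEnvironment(). This is a minimal sketch of the general mechanism; whether drake uses exactly this call internally is an assumption here, and try_modify() is a hypothetical helper.

```r
# An environment with one existing variable, then locked with its bindings.
e <- new.env()
assign("x", 1, envir = e)
lockEnvironment(e, bindings = TRUE)

# Hypothetical helper: report whether an operation on `e` is permitted.
try_modify <- function(code) {
  tryCatch(
    {
      code  # the operation is evaluated here
      "allowed"
    },
    error = function(err) "blocked"
  )
}

add_result <- try_modify(assign("y", 2, envir = e))      # add a variable
modify_result <- try_modify(assign("x", 99, envir = e))  # change a variable
read_result <- try_modify(get("x", envir = e))           # read-only access

print(c(add = add_result, modify = modify_result, read = read_result))
```

Adding or changing variables raises an error while reading still works, which is the kind of failure the lock_envir description refers to when a command tries to alter the environment.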
Logical, whether to record the build history of your targets. You can also supply a txtq, which is how drake records history. Must be TRUE for drake_history() to work later.
Logical, whether to activate automated data recovery. The default is FALSE because
- Automated data recovery is still experimental.
- It has reproducibility issues. Targets recovered from the distant past may have been generated with earlier versions of R and earlier package environments that no longer exist.
How it works: if recover is TRUE, drake tries to salvage old target values from the cache instead of running commands from the plan. A target is recoverable if
- There is an old value somewhere in the cache that shares the command, dependencies, etc. of the target about to be built.
- The old value was generated with make(recoverable = TRUE).
If both conditions are met, drake will
- Assign the most recently-generated admissible data to the target, and
- skip the target's command.
Functions recoverable() and r_recoverable() show the most upstream outdated targets that will be recovered in this way in the next make() or r_make().
Logical, whether to make target values recoverable with make(recover = TRUE). This requires writing extra files to the cache, and it prevents old metadata from being removed with garbage collection (clean(garbage_collection = TRUE), gc() in storrs). If you need to limit the cache size or the number of files in the cache, consider make(recoverable = FALSE, log_progress = FALSE).
The master internal configuration list of a project.
make(recover = TRUE, recoverable = TRUE) powers automated data recovery.
Once you create a list with drake_config(), do not modify it by hand.
Utility functions such as outdated(), vis_drake_graph(), and predict_runtime() require output from drake_config() for the config argument. If you supply a drake_config() object to the config argument of make(), then drake will ignore all the other arguments because it already has everything it needs in config.
# NOT RUN {
isolate_example("Quarantine side effects.", {
  load_mtcars_example() # Get the code with drake_example("mtcars").
  # Construct the master internal configuration list.
  config <- drake_config(my_plan)
  if (requireNamespace("visNetwork")) {
    vis_drake_graph(config) # See the dependency graph.
    if (requireNamespace("networkD3")) {
      sankey_drake_graph(config) # See the dependency graph.
    }
  }
  # These functions are faster than otherwise
  # because they use the configuration list.
  outdated(config) # Which targets are out of date?
  missed(config) # Which imports are missing?
})
# }