Big data is defined loosely here as data that is too large for
computer memory (RAM). The BigData function uses the
split-apply-combine strategy with a big data set. The unmanageable
big data set is split into smaller, manageable pieces (batches),
a function is applied to each batch, and the results are combined.
Each iteration, the BigData function opens a connection to a
big data set and keeps the connection open while the scan
function reads in each batch of data (elsewhere, batches are often
referred to as chunks). A user-specified function is applied to each
batch of data, the results are combined, the connection is
closed, and the results are returned.
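
The split-apply-combine loop described above can be sketched in base R. This is only an illustration under stated assumptions, not the source of the BigData function: the helper name batch_apply, its arguments, and the temporary CSV file are hypothetical, and results are combined by simple addition.

```r
# Hypothetical example data: 1,000 rows and 3 columns in a temporary CSV file.
set.seed(1)
n <- 1000
p <- 3
X <- matrix(rnorm(n * p), n, p)
path <- tempfile(fileext = ".csv")
write.table(X, path, sep = ",", row.names = FALSE, col.names = FALSE)

# Split-apply-combine over batches of `size` rows (a sketch only).
batch_apply <- function(path, nrow, ncol, size, FUN, ...) {
  con <- file(path, open = "r")   # open the connection once
  on.exit(close(con))             # close it after the last batch
  total <- 0
  rows_left <- nrow
  while (rows_left > 0) {
    k <- min(size, rows_left)
    # scan() resumes from the current position of the open connection
    batch <- matrix(scan(con, sep = ",", nlines = k, quiet = TRUE),
                    k, ncol, byrow = TRUE)
    total <- total + FUN(batch, ...)   # apply, then combine by addition
    rows_left <- rows_left - k
  }
  total
}

batch_sum <- batch_apply(path, nrow = n, ncol = p, size = 100, FUN = sum)
```

Only one batch of 100 rows is held in memory at a time, yet batch_sum agrees with sum(X) up to the rounding introduced by writing the file.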
As an introductory example, suppose a statistician updates a linear
regression model, but the design matrix \(\textbf{X}\) is too
large for computer memory. Suppose the design matrix has 100 million
rows, and the statistician specifies size=1e6. The statistician
combines the dependent variable \(\textbf{y}\) with the design matrix
\(\textbf{X}\). Each iteration in IterativeQuadrature,
LaplaceApproximation, LaplacesDemon, PMC, or VariationalBayes, the
BigData function sequentially reads in one million rows of the
combined data, calculates the expectation vector \(\mu\), and
returns the sum of the log-likelihood for that batch. These sums
are added together across all 100 batches, and the total is returned.
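
The reason this works is that the log-likelihood of independent records is a sum, so it can be accumulated batch by batch. A minimal in-memory sketch, assuming a Gaussian linear regression with known parameter values; the names batch_LL and XY and the small sizes are hypothetical stand-ins for the 100-million-row case:

```r
set.seed(2)
N <- 10000
J <- 3
X <- cbind(1, matrix(rnorm(N * (J - 1)), N, J - 1))   # design matrix
beta <- c(1, -0.5, 2)
sigma <- 1.5
y <- as.vector(X %*% beta + rnorm(N, 0, sigma))
XY <- cbind(y, X)   # dependent variable combined with the design matrix

# Log-likelihood of one batch; column 1 holds y, the remaining columns hold X.
batch_LL <- function(batch, beta, sigma) {
  mu <- tcrossprod(batch[, -1, drop = FALSE], t(beta))   # expectation vector
  sum(dnorm(batch[, 1], mu, sigma, log = TRUE))
}

size <- 1000                                # rows per batch
batch_id <- ceiling(seq_len(N) / size)      # batch label for each row
LL_batches <- sum(sapply(split(seq_len(N), batch_id), function(i) {
  batch_LL(XY[i, , drop = FALSE], beta, sigma)
}))
LL_full <- batch_LL(XY, beta, sigma)        # same quantity in one pass
```

LL_batches and LL_full agree up to floating-point error, which is why the batched total can stand in for the full-data log-likelihood inside the Model function.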
There are many limitations with this function.
This function is not fast, in the sense that the entire big data set
is processed in batches, each iteration. With iterative methods, this
may perform well, albeit slowly.
There are many functions that cannot be performed on batches, though
most models in the Examples vignette may easily be updated with big
data.
Only the data is handled in batches; large matrices of samples are
not addressed.
Although many (but not all) models may be estimated, many additional
functions in this package will not work when applied after the model
has updated. Instead, a batch or random sample of data (see the
read.matrix function for sampling from big data) should
be used in the usual way, in the Data argument, and the
Model function coded in the usual way without the
BigData function.
Parallel processing may be performed when the user specifies
CPUs to be greater than one, implying that the specified number
of CPUs exists and is available. Parallelization may be performed on a
multicore computer or a computer cluster. Either a Simple Network of
Workstations (SNOW) or Message Passing Interface (MPI) is used. Each
call to BigData establishes and closes the parallelization,
which is costly, and unfortunately results in copious output to the
console. With small data sets, parallel processing may be slower, due
to computer network communication. With larger data sets, the user
should experience a faster run-time.
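
The cost of establishing and closing the parallelization on every call can be seen in a small sketch with the base parallel package. This is not how BigData is implemented internally; it assumes two CPUs are available, and the batches are hypothetical in-memory vectors.

```r
library(parallel)

set.seed(3)
batches <- split(rnorm(1e5), rep(1:10, each = 1e4))   # ten in-memory batches

# Timing includes starting and stopping the workers, as on every call.
t_parallel <- system.time({
  cl <- makeCluster(2)                     # start two worker processes
  partial <- parLapply(cl, batches, sum)   # apply the function per batch
  stopCluster(cl)                          # shut the workers down again
})
t_serial <- system.time(serial <- lapply(batches, sum))

total_parallel <- Reduce(`+`, partial)     # combine the batch results
total_serial <- Reduce(`+`, serial)
```

With work this small, t_parallel is typically dominated by worker start-up and communication, which mirrors the remark that parallel processing may be slower with small data sets.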
There have been several alternative approaches suggested for big data.
Huang and Gelman (2005) propose that the user create batches by
sampling from big data, update a separate Bayesian model on each
batch, and combine the results into a consensus posterior. This
many-mini-model approach may be faster when feasible, because multiple
models may be updated in parallel, say one per CPU. Such results will
work with all functions in this package. With the many-mini-model
approach, several methods are proposed for combining posterior samples
from batch-level models, such as using a normal approximation,
updating from prior to posterior sequentially (the posterior from the
last batch becomes the prior of the next batch), sampling from the full
posterior via importance sampling from the batched posteriors, and
more.
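
As an illustration of combining batch-level posteriors with a normal approximation, consider a deliberately simple case: a normal mean with known standard deviation and a flat prior, so that each batch posterior is exactly normal. The model, the batch sizes, and all variable names are assumptions made for this sketch, not taken from Huang and Gelman (2005).

```r
set.seed(4)
sigma <- 2
y <- rnorm(6000, mean = 3, sd = sigma)
batches <- split(y, rep(1:6, each = 1000))

# Batch-level posteriors under a flat prior: N(mean(y_b), sigma^2 / n_b)
post_mean <- sapply(batches, mean)
post_prec <- sapply(batches, length) / sigma^2   # precision = 1 / variance

# Precision-weighted combination of the batch posteriors
comb_prec <- sum(post_prec)
comb_mean <- sum(post_prec * post_mean) / comb_prec
comb_sd <- sqrt(1 / comb_prec)
```

In this special case the combination reproduces the full-data posterior, with mean equal to mean(y) and standard deviation sigma / sqrt(6000). With an informative prior, each batch-level model would include the prior once, so a naive combination would count it once per batch.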
Scott et al. (2013) propose a method that they call Consensus Monte
Carlo, which consists of breaking the data down into chunks, calling
each chunk a shard, and using a many-mini-model approach as well, but
with their own method of weighting the posteriors back together.
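
A rough sketch of the weighting idea follows, assuming a scalar parameter and weights equal to the inverse variance of each shard's draws. The shard posteriors are drawn in closed form here (normal mean, known standard deviation, flat prior) as a stand-in for the output of a sampler run on each shard; the shard sizes and variable names are hypothetical.

```r
set.seed(5)
sigma <- 2
y <- rnorm(4000, mean = 3, sd = sigma)
shards <- split(y, rep(1:4, times = c(500, 1000, 1000, 1500)))
G <- 5000   # draws per shard

# One column of posterior draws per shard (closed form for this toy model).
draws <- sapply(shards, function(s) {
  rnorm(G, mean(s), sigma / sqrt(length(s)))
})

# Weight each shard by the inverse variance of its draws,
# then average the shards draw by draw.
w <- 1 / apply(draws, 2, var)
consensus <- as.vector(draws %*% w) / sum(w)
```

Each element of consensus is a weighted average of one draw from every shard, so larger shards, whose draws vary less, contribute more. Here mean(consensus) lands close to mean(y), and sd(consensus) is close to sigma / sqrt(4000).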
Balakrishnan and Madigan (2006) introduced a Sequential Monte Carlo
(SMC) sampler, a refinement of an earlier proposal, that was designed
for big data. It makes one pass through the massive data set, after an
initial MCMC estimation on a small sample. Each particle is updated
for each record, resulting in numerous evaluations per record.
Welling and Teh (2011) proposed a new class of MCMC sampler in which
only a random sample of big data is used each iteration. The
stochastic gradient Langevin dynamics (SGLD) algorithm is available
in the LaplacesDemon function.
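
The idea of using only a random sample of the data each iteration can be sketched for a toy model. This is not the SGLD implementation in LaplacesDemon; it assumes a normal mean with known standard deviation and a normal prior, and it uses a fixed step size for simplicity, whereas Welling and Teh (2011) let the step size decrease over iterations.

```r
set.seed(6)
N <- 1000
sigma <- 1
tau <- 10                                # prior standard deviation
x <- rnorm(N, mean = 2, sd = sigma)      # the "big" data, kept small here

n <- 100          # mini-batch size
eps <- 1e-4       # fixed step size (a simplification)
iters <- 10000
theta <- numeric(iters)
th <- 0
for (i in seq_len(iters)) {
  xb <- sample(x, n)                               # random sample of the data
  grad_prior <- -th / tau^2                        # gradient of the log-prior
  grad_lik <- (N / n) * sum(xb - th) / sigma^2     # rescaled to full-data size
  # half a gradient step plus Gaussian noise with variance eps
  th <- th + (eps / 2) * (grad_prior + grad_lik) + rnorm(1, 0, sqrt(eps))
  theta[i] <- th
}
draws <- theta[-(1:1000)]   # discard burn-in

# Exact posterior mean of this conjugate model, for comparison
post_mean_exact <- (sum(x) / sigma^2) / (N / sigma^2 + 1 / tau^2)
```

The mean of draws sits close to post_mean_exact even though each iteration touches only 100 of the 1,000 records. With a fixed step size the spread of the draws is somewhat inflated by mini-batch gradient noise.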
An important alternative to consider is using the ff package,
where "ff" stands for fast access file. The ff package has been
tested successfully with updating a model in LaplacesDemon.
Once the big data set, say \(\textbf{X}\), is an object of
class ff_matrix, simply include it in the list of data as
usual, and modify the Model specification function
appropriately. For example, change mu <- tcrossprod(X, t(beta))
to mu <- tcrossprod(X[], t(beta)). The ff package is
not included as a dependency in the LaplacesDemon package, so
it must be installed and loaded separately.
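
A small sketch of the change to the expectation line follows. It assumes the ff package is installed and that ff::as.ff converts an in-memory matrix to a file-backed one; the guard simply skips the comparison when ff is unavailable. The matrix and coefficients are hypothetical.

```r
set.seed(7)
X <- matrix(rnorm(20), 5, 4)
beta <- c(0.5, -1, 2, 0.1)
mu_ram <- tcrossprod(X, t(beta))          # usual in-memory version

if (requireNamespace("ff", quietly = TRUE)) {
  Xff <- ff::as.ff(X)                     # file-backed copy of X
  mu_ff <- tcrossprod(Xff[], t(beta))     # [] pulls the values into RAM
} else {
  mu_ff <- mu_ram                         # ff not installed; nothing to compare
}
```

Both versions compute the same product as X %*% beta. Note that X[] brings the values into RAM for the multiplication, so that intermediate result must itself fit in memory.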