As an alternative to calling collect() on a Dataset query, you can use this function to access the stream of RecordBatches in the Dataset.
This lets you aggregate on each chunk and pull the intermediate results into a data.frame for further aggregation, even if you couldn't fit the whole Dataset result in memory.
map_batches(X, FUN, ..., .data.frame = TRUE)
X: A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.
FUN: A function or purrr-style lambda expression to apply to each batch.
...: Additional arguments passed to FUN.
.data.frame: logical: collect the resulting chunks into a single data.frame? Default TRUE.
This is experimental and not recommended for production use.
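The chunk-wise aggregation pattern described above could look roughly like the following sketch. It assumes the signature shown on this page, that FUN receives each batch as a RecordBatch, and that the per-batch results can be combined into a data.frame. The temporary dataset built from mtcars, the partitioning by cyl, and the column names n and sum_mpg are illustrative choices, not part of the API. Because the function is experimental, the return type may differ between arrow versions, so the result is passed through as.data.frame() before the final step.

```r
library(arrow)
library(dplyr)

# Illustrative data: write mtcars as a small partitioned dataset
tmp <- tempfile()
write_dataset(mtcars, tmp, partitioning = "cyl")
ds <- open_dataset(tmp)

# Partial aggregation: reduce each batch to a one-row data.frame
partials <- map_batches(ds, function(batch) {
  batch %>%
    as.data.frame() %>%
    summarize(n = n(), sum_mpg = sum(mpg))
})

# Ensure the partial results are held in a plain data.frame
# (this changes nothing if they already are one)
partials <- as.data.frame(partials)

# Final aggregation over the small intermediate results
total_n <- sum(partials$n)
mean_mpg <- sum(partials$sum_mpg) / total_n
```

Only the one-row summaries are held in memory at the end, so the final mean is computed from per-batch counts and sums rather than from the full data.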