This function writes a dataset to a directory of files. Writing to a more efficient binary storage format, and choosing partitioning that matches how the data will be queried, can make the dataset much faster to read and query.
write_dataset(
  dataset,
  path,
  format = dataset$format,
  partitioning = dplyr::group_vars(dataset),
  basename_template = paste0("part-{i}.", as.character(format)),
  hive_style = TRUE,
  ...
)
dataset: A Dataset, RecordBatch, Table, arrow_dplyr_query, or data.frame. If an arrow_dplyr_query or grouped_df, schema and partitioning will be taken from the result of any select() and group_by() operations done on the dataset. filter() queries will be applied to restrict written rows. Note that select()-ed columns may not be renamed.
path: A string path, URI, or SubTreeFileSystem referencing a directory to write to (the directory will be created if it does not exist).
format: The file format to write the dataset to. Currently supported formats are "feather" (aka "ipc") and "parquet". Default is to write to the same format as dataset.
partitioning: A Partitioning or a character vector of columns to use as partition keys (to be written as path segments). Default is to use the current group_by() columns.
basename_template: A string template for the names of files to be written. Must contain "{i}", which will be replaced with an autoincremented integer to generate basenames of datafiles. For example, "part-{i}.feather" will yield "part-0.feather" and so on.
hive_style: logical: write partition segments as Hive-style (key1=value1/key2=value2/file.ext) or as just bare values. Default is TRUE.
...: Additional format-specific arguments. For available Parquet options, see write_parquet(). The available Feather options are:
use_legacy_format: logical: write data formatted so that Arrow libraries versions 0.14 and lower can read it. Default is FALSE. You can also enable this by setting the environment variable ARROW_PRE_0_15_IPC_FORMAT=1.
metadata_version: A string like "V5" or the equivalent integer indicating the Arrow IPC MetadataVersion. Default (NULL) will use the latest version, unless the environment variable ARROW_PRE_1_0_METADATA_VERSION=1 is set, in which case it will be V4.
codec: A Codec which will be used to compress body buffers of written files. Default (NULL) will not compress body buffers.
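The Feather options above are passed through the ... argument. The following is a minimal sketch, assuming that Codec$create() and codec_is_available() are available in the installed arrow version and that "zstd" is a codec accepted for Feather files; whether zstd is present depends on how the underlying Arrow library was built, so the sketch checks first and otherwise falls back to the uncompressed default. The name feather_dir is only illustrative.

```r
library(arrow)

# Illustrative output location (a fresh temporary directory path).
feather_dir <- tempfile()

# Assumption: zstd support depends on the Arrow build, so check before
# asking for compressed body buffers.
if (codec_is_available("zstd")) {
  write_dataset(mtcars, feather_dir, format = "feather",
                codec = Codec$create("zstd"))
} else {
  # Default (codec = NULL): body buffers are written uncompressed.
  write_dataset(mtcars, feather_dir, format = "feather")
}

# Inspect the files that were written.
list.files(feather_dir, recursive = TRUE)
```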
Returns the input dataset, invisibly.
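As a rough illustration of how partitioning, hive_style, and dplyr queries interact, the sketch below writes the built-in mtcars data three ways. The directory names (hive_dir, bare_dir, subset_dir) are illustrative, and format = "parquet" is passed explicitly so the example does not depend on the default taken from the input.

```r
library(arrow)
library(dplyr)

# 1. Explicit partitioning on a plain data.frame. With hive_style = TRUE
#    (the default), path segments look like cyl=4/.
hive_dir <- tempfile()
write_dataset(mtcars, hive_dir, format = "parquet", partitioning = "cyl")
list.files(hive_dir, recursive = TRUE)

# 2. Partitioning taken from group_by(), written as bare values
#    (for example 4/4/ rather than cyl=4/gear=4/).
bare_dir <- tempfile()
mtcars %>%
  group_by(cyl, gear) %>%
  write_dataset(bare_dir, format = "parquet", hive_style = FALSE)
list.files(bare_dir, recursive = TRUE)

# 3. A query on an existing dataset: filter() restricts the written rows,
#    select() restricts the columns, and group_by() sets the partitioning.
subset_dir <- tempfile()
open_dataset(hive_dir) %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, gear) %>%
  group_by(gear) %>%
  write_dataset(subset_dir, format = "parquet")
list.files(subset_dir, recursive = TRUE)
```

Because write_dataset() returns its input invisibly, it can sit at the end of a pipeline like the ones above without printing anything.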