arrow (version 0.17.1)

dataset_factory: Create a DatasetFactory

Description

A Dataset can be constructed using one or more DatasetFactorys. This function helps you construct a DatasetFactory that you can pass to open_dataset().

Usage

dataset_factory(
  x,
  filesystem = c("auto", "local"),
  format = c("parquet", "arrow", "ipc", "feather"),
  partitioning = NULL,
  allow_not_found = FALSE,
  recursive = TRUE,
  ...
)

Arguments

x

A string path to a directory containing data files, or a list of DatasetFactory objects whose datasets should be grouped. If a list of DatasetFactory objects is provided, it will be used to construct a UnionDatasetFactory and the other arguments will be ignored.

filesystem

A string identifier for the filesystem corresponding to x. Currently only "local" is supported.

format

A string identifier of the format of the files in x. Currently "parquet" and "ipc"/"arrow"/"feather" (aliases for each other) are supported. For Feather, only version 2 files are supported.

partitioning

One of

  • A Schema, in which case the file paths relative to x will be parsed, and path segments will be matched with the schema fields. For example, schema(year = int16(), month = int8()) would create partitions for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.

  • A character vector that defines the field names corresponding to those path segments (that is, you're providing the names that would correspond to a Schema but the types will be autodetected)

  • A HivePartitioning or HivePartitioningFactory, as returned by hive_partition(), which parses explicit or autodetected fields from Hive-style path segments

  • NULL for no partitioning
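The first two partitioning options can be sketched as follows. This is a minimal illustration written against the signature documented on this page; the temporary directory layout, file names, and column name are hypothetical, and behavior may differ in other arrow versions.

```r
library(arrow)

# Hypothetical layout: <root>/<year>/<month>/file.parquet
root <- tempfile("partitioned_")
dir.create(file.path(root, "2019", "01"), recursive = TRUE)
dir.create(file.path(root, "2019", "02"), recursive = TRUE)
write_parquet(data.frame(x = 1:2), file.path(root, "2019", "01", "file.parquet"))
write_parquet(data.frame(x = 3:4), file.path(root, "2019", "02", "file.parquet"))

# Option 1: a Schema gives both the names and the types of the path segments
typed_factory <- dataset_factory(
  root,
  partitioning = schema(year = int16(), month = int8())
)

# Option 2: a character vector gives only the names; types are autodetected
named_factory <- dataset_factory(root, partitioning = c("year", "month"))

# Pass a factory to open_dataset() in a list to obtain a Dataset
ds_typed <- open_dataset(list(typed_factory))
```

In this sketch the partition fields are expected to appear alongside the file column when the Dataset's schema is inspected, for example with names(ds_typed).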

allow_not_found

logical: is x allowed to not exist? Default FALSE. See FileSelector.

recursive

logical: should files be discovered in subdirectories of x? Default TRUE.

...

Additional arguments passed to the FileSystem $create() method

Value

A DatasetFactory object. Pass this to open_dataset(), in a list potentially with other DatasetFactory objects, to create a Dataset.

Details

If you only have a single DatasetFactory (for example, you have a single directory containing Parquet files), you can call open_dataset() directly. Use dataset_factory() when you want to combine different directories, file systems, or file formats.
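As a rough sketch of the combining case, the following builds one factory per directory, each with its own file format, and passes both to open_dataset() in a list. The directories, file names, and columns are hypothetical and created only for illustration; the calls follow the usage documented on this page.

```r
library(arrow)

# Two hypothetical directories holding the same columns in different formats
parquet_dir <- tempfile("parquet_")
feather_dir <- tempfile("feather_")
dir.create(parquet_dir)
dir.create(feather_dir)

write_parquet(
  data.frame(id = 1:3, value = c(0.1, 0.2, 0.3)),
  file.path(parquet_dir, "part-0.parquet")
)
# Only Feather version 2 files are supported by the dataset reader
write_feather(
  data.frame(id = 4:6, value = c(0.4, 0.5, 0.6)),
  file.path(feather_dir, "part-0.feather")
)

# One DatasetFactory per directory, each with its own format
parquet_factory <- dataset_factory(parquet_dir, format = "parquet")
feather_factory <- dataset_factory(feather_dir, format = "feather")

# A list of factories is combined into a single Dataset
combined <- open_dataset(list(parquet_factory, feather_factory))
```

If only parquet_dir existed, calling open_dataset(parquet_dir) directly would be the simpler route, as noted above.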