dataset_snapshot: Persist the output of a dataset

Description

Persist the output of a dataset

Usage

dataset_snapshot(
  dataset,
  path,
  compression = c("AUTO", "GZIP", "SNAPPY", "None"),
  reader_func = NULL,
  shard_func = NULL
)

Arguments

dataset: A tensorflow dataset
path: Required. A directory to use for storing/loading the snapshot to/from.
compression: Optional. The type of compression to apply to the snapshot written to disk. Supported options are "GZIP", "SNAPPY", "AUTO" or NULL (values of "", NA, and "None" are synonymous with NULL) Defaults to AUTO, which attempts to pick an appropriate compression algorithm for the dataset.
reader_func: Optional. A function to control how to read data from snapshot shards.
shard_func: Optional. A function to control how to shard data when writing a snapshot.

Details

The snapshot API allows users to transparently persist the output of their preprocessing pipeline to disk, and materialize the pre-processed data on a different training run.

This API enables repeated preprocessing steps to be consolidated, and allows re-use of already processed data, trading off disk storage and network bandwidth for freeing up more valuable CPU resources and accelerator compute time.

https://github.com/tensorflow/community/blob/master/rfcs/20200107-tf-data-snapshot.md has detailed design documentation of this feature.

Users can specify various options to control the behavior of snapshot, including how snapshots are read from and written to by passing in user-defined functions to the reader_func and shard_func parameters.

shard_func is a user specified function that maps input elements to snapshot shards.

NUM_SHARDS <- parallel::detectCores()
dataset %>%
  dataset_enumerate() %>%
  dataset_snapshot(
    "/path/to/snapshot/dir",
    shard_func = function(index, ds_elem) x %% NUM_SHARDS) %>%
  dataset_map(function(index, ds_elem) ds_elem)

reader_func is a user specified function that accepts a single argument: a Dataset of Datasets, each representing a "split" of elements of the original dataset. The cardinality of the input dataset matches the number of the shards specified in the shard_func. The function should return a Dataset of elements of the original dataset.

Users may want specify this function to control how snapshot files should be read from disk, including the amount of shuffling and parallelism.

Here is an example of a standard reader function a user can define. This function enables both dataset shuffling and parallel reading of datasets:

user_reader_func <- function(datasets) {
  num_cores <- parallel::detectCores()
  datasets %>%
    dataset_shuffle(num_cores) %>%
    dataset_interleave(function(x) x, num_parallel_calls=AUTOTUNE)
}
dataset <- dataset %>%
  dataset_snapshot("/path/to/snapshot/dir",
                   reader_func = user_reader_func)

By default, snapshot parallelizes reads by the number of cores available on the system, but will not attempt to shuffle the data.