Reads CSV files into a dataset, where each element is a (features, labels) list that corresponds to a batch of CSV rows. The features dictionary maps feature column names to tensors containing the corresponding feature data, and labels is a tensor containing the batch's label data.
make_csv_dataset(
file_pattern,
batch_size,
column_names = NULL,
column_defaults = NULL,
label_name = NULL,
select_columns = NULL,
field_delim = ",",
use_quote_delim = TRUE,
na_value = "",
header = TRUE,
num_epochs = NULL,
shuffle = TRUE,
shuffle_buffer_size = 10000,
shuffle_seed = NULL,
prefetch_buffer_size = 1,
num_parallel_reads = 1,
num_parallel_parser_calls = 2,
sloppy = FALSE,
num_rows_for_inference = 100
)
A dataset, where each element is a (features, labels) list that corresponds to
a batch of batch_size
CSV rows. The features dictionary maps feature column names
to tensors containing the corresponding column data, and labels is a tensor
containing the column data for the label column specified by label_name
.
List of files or glob patterns of file paths containing CSV records.
An integer representing the number of records to combine in a single batch.
An optional list of strings that corresponds to the CSV columns, in order. One per column of the input record. If this is not provided, infers the column names from the first row of the records. These names will be the keys of the features dict of each dataset element.
A optional list of default values for the CSV fields. One item
per selected column of the input record. Each item in the list is either a valid CSV
dtype (integer, numeric, or string), or a tensor with one of the
aforementioned types. The tensor can either be a scalar default value (if the column
is optional), or an empty tensor (if the column is required). If a dtype is provided
instead of a tensor, the column is also treated as required. If this list is not
provided, tries to infer types based on reading the first num_rows_for_inference
rows
of files specified, and assumes all columns are optional, defaulting to 0
for
numeric values and ""
for string values. If both this and select_columns
are
specified, these must have the same lengths, and column_defaults
is assumed to be
sorted in order of increasing column index.
A optional string corresponding to the label column. If provided, the data for this column is returned as a separate tensor from the features dictionary, so that the dataset complies with the format expected by a TF Estiamtors and Keras.
(Ignored if using TensorFlow version 1.8.) An optional list of
integer indices or string column names, that specifies a subset of columns of CSV data
to select. If column names are provided, these must correspond to names provided in
column_names
or inferred from the file header lines. When this argument is specified,
only a subset of CSV columns will be parsed and returned, corresponding to the columns
specified. Using this results in faster parsing and lower memory usage. If both this
and column_defaults
are specified, these must have the same lengths, and
column_defaults
is assumed to be sorted in order of increasing column index.
An optional string. Defaults to ","
. Char delimiter to separate
fields in a record.
An optional bool. Defaults to TRUE
. If false, treats double
quotation marks as regular characters inside of the string fields.
Additional string to recognize as NA/NaN.
A bool that indicates whether the first rows of provided CSV files correspond to header lines with column names, and should not be included in the data.
An integer specifying the number of times this dataset is repeated. If NULL, cycles through the dataset forever.
A bool that indicates whether the input should be shuffled.
Buffer size to use for shuffling. A large buffer size ensures better shuffling, but increases memory usage and startup time.
Randomization seed to use for shuffling.
An int specifying the number of feature batches to prefetch for performance improvement. Recommended value is the number of batches consumed per training step.
Number of threads used to read CSV records from files. If >1, the results will be interleaved.
(Ignored if using TensorFlow version 1.11 or later.) Number of parallel invocations of the CSV parsing function on CSV records.
If TRUE
, reading performance will be improved at the cost of
non-deterministic ordering. If FALSE
, the order of elements produced is
deterministic prior to shuffling (elements are still randomized if shuffle=TRUE
.
Note that if the seed is set, then order of elements after shuffling is
deterministic). Defaults to FALSE
.
Number of rows of a file to use for type inference if
record_defaults is not provided. If NULL
, reads all the rows of all the files.
Defaults to 100.