This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one sample = one string) into either a list of token indices (one sample = 1D tensor of integer token indices) or a dense representation (one sample = 1D tensor of float values representing data about the sample's tokens).
layer_text_vectorization(
object,
max_tokens = NULL,
standardize = "lower_and_strip_punctuation",
split = "whitespace",
ngrams = NULL,
output_mode = c("int", "binary", "count", "tfidf"),
output_sequence_length = NULL,
pad_to_max_tokens = TRUE,
...
)
object: Model or layer object.
max_tokens: The maximum size of the vocabulary for this layer. If NULL, there is no cap on the size of the vocabulary.
standardize: Optional specification for standardization to apply to the input text. Values can be NULL (no standardization), "lower_and_strip_punctuation" (lowercase and remove punctuation), or a callable. Default is "lower_and_strip_punctuation".
split: Optional specification for splitting the input text. Values can be NULL (no splitting), "whitespace" (split on ASCII whitespace), or a callable. Default is "whitespace".
ngrams: Optional specification for ngrams to create from the possibly-split input text. Values can be NULL, an integer, or a list of integers. Passing an integer creates ngrams of every size up to that integer; passing a list of integers creates ngrams only for the sizes given in the list; passing NULL means that no ngrams are created.
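The difference between an integer and a list of integers can be illustrated in base R. This is a minimal sketch, not the layer's implementation: `make_ngrams` is a hypothetical helper, and joining tokens with a single space and ordering the result by ngram size are assumptions made for illustration.

```r
# Hypothetical helper: build ngrams from a character vector of tokens.
make_ngrams <- function(tokens, ngrams) {
  # An integer n means "every size from 1 to n";
  # a list of integers means "exactly these sizes".
  sizes <- if (is.list(ngrams)) unlist(ngrams) else seq_len(ngrams)
  out <- character(0)
  for (n in sizes) {
    if (n > length(tokens)) next  # sample too short for this size
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("the", "quick", "fox")
up_to_two <- make_ngrams(tokens, 2)        # unigrams and bigrams
only_two  <- make_ngrams(tokens, list(2))  # bigrams only

print(up_to_two)  # "the" "quick" "fox" "the quick" "quick fox"
print(only_two)   # "the quick" "quick fox"
```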
output_mode: Optional specification for the output of the layer. Values can be "int", "binary", "count" or "tfidf", which control the outputs as follows:
"int": Outputs integer indices, one integer index per split string token.
"binary": Outputs a single int array per sample (batch item), of either vocab_size or max_tokens size, containing 1s in all elements where the token mapped to that index exists at least once in the sample.
"count": As "binary", but the int array contains a count of the number of times the token at that index appeared in the sample.
"tfidf": As "binary", but the TF-IDF algorithm is applied to find the value in each token slot.
output_sequence_length: Only valid in "int" mode. If set, the output will have its time dimension padded or truncated to exactly output_sequence_length values, resulting in a tensor of shape (batch_size, output_sequence_length) regardless of how many tokens resulted from the splitting step. Defaults to NULL.
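The effect of output_sequence_length on the output shape can be sketched in base R. This is an illustration under stated assumptions, not the layer's code: `pad_or_truncate` is a hypothetical helper, and padding at the end with the value 0 is an assumption.

```r
# Hypothetical helper: force a vector of token indices to a fixed length.
pad_or_truncate <- function(ids, output_sequence_length, pad_value = 0L) {
  n <- length(ids)
  if (n >= output_sequence_length) {
    ids[seq_len(output_sequence_length)]                 # truncate
  } else {
    c(ids, rep(pad_value, output_sequence_length - n))   # pad at the end
  }
}

# Two samples of different lengths, already converted to integer indices.
samples <- list(c(5L, 3L), c(7L, 2L, 9L, 4L, 6L))

# Stack the fixed-length rows into a (batch_size, output_sequence_length) matrix.
batch <- t(vapply(samples, pad_or_truncate, integer(4),
                  output_sequence_length = 4L))

print(dim(batch))  # 2 4
print(batch)
```

Every row has the same length regardless of how many tokens the sample produced, which is what allows the batch to be represented as a single rectangular tensor.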
pad_to_max_tokens: Only valid in "binary", "count", and "tfidf" modes. If TRUE, the output will have its feature axis padded to max_tokens even if the number of unique tokens in the vocabulary is less than max_tokens, resulting in a tensor of shape (batch_size, max_tokens) regardless of vocabulary size. Defaults to TRUE.
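How the "int", "count" and "binary" outputs relate, and what padding the feature axis to max_tokens does, can be sketched in base R. The vocabulary, the max_tokens value and the `count_vector` helper are hypothetical. The sketch ignores any indices the real layer may reserve (for example for padding or out-of-vocabulary tokens) and does not cover "tfidf".

```r
vocab <- c("cat", "dog", "fish")   # hypothetical vocabulary
max_tokens <- 5L                   # hypothetical cap, larger than the vocabulary
tokens <- c("dog", "cat", "dog")   # one sample, already split into tokens

# "int": one index per token.
int_out <- match(tokens, vocab)

# Hypothetical helper: count occurrences of each index in a vector of length `size`.
count_vector <- function(ids, size) tabulate(ids, nbins = size)

count_out  <- count_vector(int_out, length(vocab))  # "count", vocab_size wide
binary_out <- as.integer(count_out > 0)             # "binary", vocab_size wide
padded_out <- count_vector(int_out, max_tokens)     # "count", padded to max_tokens

print(int_out)     # 2 1 2
print(count_out)   # 1 2 0
print(binary_out)  # 1 1 0
print(padded_out)  # 1 2 0 0 0
```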
...: Not used.
The processing of each sample contains the following steps:
1. Standardize each sample (usually lowercasing + punctuation stripping).
2. Split each sample into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique int value with each token).
5. Transform each sample using this index, either into a vector of ints or a dense float vector.
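The steps above can be traced by hand in base R. This is a rough sketch of the idea only and does not call the layer: the regular expressions, the frequency-based ordering of the vocabulary and the 1-based indices are assumptions made for illustration.

```r
samples <- c("The cat sat.", "The dog, the cat!")

# Standardize: lowercase and strip punctuation.
standardized <- gsub("[[:punct:]]", "", tolower(samples))

# Split: break each sample on whitespace.
split_samples <- strsplit(standardized, "[[:space:]]+")

# Recombine: skipped here, which corresponds to ngrams = NULL.

# Index: build a vocabulary, most frequent token first (assumption),
# with ties broken alphabetically.
freq <- table(unlist(split_samples))
vocab <- names(freq)[order(-as.integer(freq), names(freq))]

# Transform: map each token to its position in the vocabulary.
ids <- lapply(split_samples, match, table = vocab)

print(vocab)  # "the" "cat" "dog" "sat"
print(ids)    # 1 2 4  and  1 3 1 2
```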