Converts a text to a sequence of indexes in a fixed-size hashing space.
text_hashing_trick(
text,
n,
hash_function = NULL,
filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
lower = TRUE,
split = " "
)=>
A list of integer word indices (unicity non-guaranteed).
Input text (string).
Dimension of the hashing space.
if NULL
uses the Python hash()
function. Otherwise can be 'md5'
or
any function that takes in input a string and returns an int. Note that
hash
is not a stable hashing function, so it is not consistent across
different runs, while 'md5'
is a stable hashing function.
Sequence of characters to filter out such as punctuation. Default includes basic punctuation, tabs, and newlines.
Whether to convert the input to lowercase.
Sentence split marker (string).
Two or more words may be assigned to the same index, due to possible collisions by the hashing function.
Other text preprocessing:
make_sampling_table()
,
pad_sequences()
,
skipgrams()
,
text_one_hot()
,
text_to_word_sequence()