Converts a text to a sequence of indexes in a fixed-size hashing space.
text_hashing_trick(text, n, hash_function = NULL,
filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
lower = TRUE, split = " ")
text: Input text (string).
n: Dimension of the hashing space.
hash_function: If NULL, uses Python's built-in hash function. Can be 'md5' or any function that takes a string as input and returns an integer. Note that hash is not a stable hashing function, so its output is not consistent across different runs, while 'md5' is a stable hashing function.
filters: Sequence of characters to filter out, such as punctuation. Default includes basic punctuation, tabs, and newlines.
lower: Whether to convert the input to lowercase.
split: Sentence split marker (string).
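The following is a minimal Python sketch of the logic these arguments control, not the package's actual code. The names `hashing_trick_sketch`, `md5_hash`, and `DEFAULT_FILTERS` are hypothetical, and the mapping of hashes into the range 1 to n - 1 (leaving index 0 unused) is an assumption about the underlying Keras behaviour. It uses MD5 so that results are reproducible; Python's built-in `hash` is randomized per process for strings unless `PYTHONHASHSEED` is set.

```python
import hashlib

# Same characters as the default `filters` argument shown in the usage above.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def md5_hash(word):
    """Stable hash: read the MD5 hex digest of the word as an integer."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)


def hashing_trick_sketch(text, n, hash_function=md5_hash,
                         filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Hypothetical re-implementation for illustration only."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split marker.
    text = text.translate({ord(c): split for c in filters})
    # Tokenize and drop the empty strings left by consecutive markers.
    words = [w for w in text.split(split) if w]
    # Assumption: index 0 is reserved, so indices fall in 1..n-1.
    return [hash_function(w) % (n - 1) + 1 for w in words]


indices = hashing_trick_sketch("The quick brown fox, the lazy dog.", n=50)
print(indices)
```

Because the text is lowercased before hashing, both occurrences of "the" receive the same index, and repeated runs give identical output with MD5.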
A list of integer word indices (uniqueness is not guaranteed).
Two or more words may be assigned to the same index, due to possible collisions by the hashing function.
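To make the collision behaviour concrete, here is a small Python sketch under the same assumption that hashes are mapped into 1 to n - 1; `md5_index` and the word list are hypothetical. With seven distinct words and only three usable indices, at least two words must share an index regardless of the hash function.

```python
import hashlib
from collections import defaultdict


def md5_index(word, n):
    """Map a word to an index in 1..n-1 using a stable MD5-based hash."""
    digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return digest % (n - 1) + 1


words = ["apple", "banana", "cherry", "date", "elderberry", "fig", "grape"]
n = 4  # deliberately tiny hashing space: only indices 1, 2, 3 are available

# Group words by the index they hash to.
buckets = defaultdict(list)
for w in words:
    buckets[md5_index(w, n)].append(w)

# Any bucket holding more than one word is a collision.
collisions = {i: ws for i, ws in buckets.items() if len(ws) > 1}
print(collisions)
```

Choosing a larger n makes collisions less frequent but does not rule them out.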
Other text preprocessing: make_sampling_table, pad_sequences, skipgrams, text_one_hot, text_to_word_sequence