
text2vec (version 0.5.0)

tokenizers: Simple tokenization functions for string splitting

Description

Very thin wrappers around base R regular expression functions. For much faster and more feature-rich tokenizers, see the tokenizers package: https://cran.r-project.org/package=tokenizers. It is not included in text2vec in order to keep the number of dependencies small. Also check stringi::stri_split_* and stringr::str_split_*.
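As a rough sketch of what "very thin wrapper" means, a word tokenizer can be built directly on strsplit. The "\\W+" pattern below is an illustrative assumption, not necessarily the exact regex text2vec uses internally:

# split on runs of non-word characters, forwarding extra arguments to strsplit
my_word_tokenizer = function(strings, ...) strsplit(strings, "\\W+", ...)
my_word_tokenizer("first second, third!")
# [[1]]
# [1] "first"  "second" "third"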

Usage

word_tokenizer(strings, ...)

regexp_tokenizer(strings, pattern, ...)

char_tokenizer(strings, ...)

space_tokenizer(strings, sep = " ", xptr = FALSE, ...)

Arguments

strings

character vector

...

other parameters passed to the base strsplit function, which is used under the hood (see the illustration after this list).

pattern

character, the pattern to split on (a regular expression by default, as in strsplit).

sep

character of length 1 with nchar(sep) == 1; strings are split on this single character.

xptr

logical, whether to tokenize at the C++ level; this can speed tokenization up by 15-50% (see the sketch after this list).
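
Two small illustrations of these arguments. The first passes a standard strsplit option through ...; the second turns on xptr (assuming the returned pointers are consumed by downstream text2vec functions rather than inspected directly):

# pass strsplit's fixed = TRUE through ... to treat the pattern literally
regexp_tokenizer("a.b.c", pattern = ".", fixed = TRUE)

# tokenize at the C++ level; returns external pointers instead of plain
# character vectors, intended for downstream text2vec pipelines
tokens_ptr = space_tokenizer("fast tokenization", sep = " ", xptr = TRUE)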

Value

A list of character vectors, one per input string; each element contains the vector of tokens extracted from the corresponding string.
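
The structure is easy to inspect directly, for example:

# one list element per input string, each a character vector of tokens
str(word_tokenizer(c("one two", "three")))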

Examples

doc = c("first  second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - splits on a single fixed whitespace character
space_tokenizer(doc, " ")
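# the remaining tokenizers (added illustrations, not part of the original example)
# split into individual characters
char_tokenizer(doc)
# split on a regular expression: commas followed by optional whitespace
regexp_tokenizer(doc, pattern = ",\\s*")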
