A few simple tokenization functions. For a more comprehensive list see the tokenizers package (https://cran.r-project.org/package=tokenizers); also check stringi::stri_split_*.
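For comparison, the same kind of split can be done directly with stringi (a quick sketch, assuming the stringi package is installed):

stringi::stri_split_regex(c("first second", "bla, bla, blaa"), "\\W+")
# returns a list of character vectors, one element per input string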
Usage

word_tokenizer(strings, ...)

char_tokenizer(strings, ...)

space_tokenizer(strings, sep = " ", xptr = FALSE, ...)

postag_lemma_tokenizer(strings, udpipe_model, tagger = "default",
  tokenizer = "tokenizer", pos_keep = character(0),
  pos_remove = c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ",
    "CCONJ", "AUX", "X", "INTJ"))
Value

A list of character vectors. Each element of the list contains the vector of tokens for the corresponding input string.
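A minimal illustration of that shape (a sketch; the exact tokens shown in the comments assume word_tokenizer's default word-boundary splitting):

tokens = word_tokenizer(c("first second", "third"))
str(tokens)
# List of 2
#  $ : chr [1:2] "first" "second"
#  $ : chr [1:1] "third"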
Arguments

strings
    character vector

...
    other parameters (usually not used; see the source code for details)

sep
    character, nchar(sep) = 1; split strings by this single character

xptr
    logical; if TRUE, tokenize at the C++ level, which can speed processing up by 15-50%

udpipe_model
    a udpipe model, as loaded with udpipe::udpipe_load_model (see ?udpipe::udpipe_load_model)

tagger
    "default"; the tagger parameter, as documented in ?udpipe::udpipe_annotate

tokenizer
    "tokenizer"; the tokenizer parameter, as documented in ?udpipe::udpipe_annotate

pos_keep
    character(0); which part-of-speech tags to keep. character(0) means keep all of them

pos_remove
    c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ", "AUX", "X", "INTJ"); which part-of-speech tags to remove. character(0) means do not remove any
Examples

doc = c("first second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - splits by a fixed single whitespace symbol
space_tokenizer(doc, " ")
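Two more quick calls in the same vein (a sketch, assuming text2vec is attached):

# one token per character
char_tokenizer(c("abc"))
# split on a custom single-character separator
space_tokenizer(c("a|b|c"), sep = "|")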