tokenizers

word_tokenizer

char_tokenizer

space_tokenizer

postag_lemma_tokenizer

strings

<code>character</code> vector of strings to tokenize.

...

Other parameters (usually not used; see the source code for details).

sep

<code>character</code>, <code>nchar(sep)</code> = 1. Strings are split by this character.

xptr

<code>logical</code>. Tokenize at the C++ level, which can speed up tokenization by 15-50%.

udpipe_model

A udpipe model, which can be loaded with <code>udpipe::udpipe_load_model</code> (see <code>?udpipe::udpipe_load_model</code>).

tagger

<code>"default"</code>. Tagger parameter, as per the <code>?udpipe::udpipe_annotate</code> docs.

tokenizer

<code>"tokenizer"</code>. Tokenizer parameter, as per the <code>?udpipe::udpipe_annotate</code> docs.

pos_keep

<code>character(0)</code>. Specifies which tokens to keep. <code>character(0)</code> keeps all of them.

pos_remove

<code>c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ", "AUX", "X", "INTJ")</code>. Specifies which tokens to remove.
<code>character(0)</code> removes none.

A few simple tokenization functions. For a more comprehensive list, see the <code>tokenizers</code> package:
<a href="https://cran.r-project.org/package=tokenizers">https://cran.r-project.org/package=tokenizers</a>.
Also see <code>stringi::stri_split_*</code>.

Fast and memory-friendly tools for text vectorization, topic
modeling (LDA, LSA), word embeddings (GloVe), similarities. This package
provides a source-agnostic streaming API, which allows researchers to perform
analysis of collections of documents which are larger than available RAM. All
core functions are parallelized to benefit from multicore machines.

Dmitriy Selivanov

text2vec

Modern Text Mining Framework for R

Manuel Bickel

Qing Wang

tokenizers function


tokenizers: Simple tokenization functions for string splitting

Description

Usage

Arguments

Value

Examples
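
A minimal usage sketch, assuming the text2vec package is installed. The sample vector <code>doc</code> and the variable names are made up for illustration, and the default <code>xptr = FALSE</code> is assumed, so each call returns an ordinary list of character vectors.

```r
library(text2vec)

# Hypothetical sample input: two short documents.
doc <- c("first  second", "bla, bla, blaa")

# Split into words.
words <- word_tokenizer(doc)

# Split on a single fixed character (here a space).
# Faster, but far less general than word_tokenizer().
spaced <- space_tokenizer(doc, sep = " ")

# Split into individual characters.
chars <- char_tokenizer(doc)

# Inspect the result: one character vector of tokens per input string.
str(words)
```

<code>postag_lemma_tokenizer</code> is left out of the sketch because it additionally needs a udpipe model loaded via <code>udpipe::udpipe_load_model</code>.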