
text2vec (version 0.5.0)

tokenizers: Simple tokenization functions for string splitting

Description

Very thin wrappers around base R regular expression functions. For much faster and more feature-rich tokenizers, see the tokenizers package: https://cran.r-project.org/package=tokenizers. It is not included in text2vec in order to keep the number of dependencies small. Also check stringi::stri_split_* and stringr::str_split_*.
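As a rough sketch of what "very thin wrapper" means, a word tokenizer can be built directly on strsplit. The "\\W+" pattern below is an illustrative assumption, not necessarily the exact regex text2vec uses internally:

# split on runs of non-word characters, forwarding extra arguments to strsplit
my_word_tokenizer = function(strings, ...) strsplit(strings, "\\W+", ...)
my_word_tokenizer("first second, third!")
# [[1]]
# [1] "first"  "second" "third"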

Usage

word_tokenizer(strings, ...)

regexp_tokenizer(strings, pattern, ...)

char_tokenizer(strings, ...)

space_tokenizer(strings, sep = " ", xptr = FALSE, ...)

Arguments

strings

character vector

...

other parameters passed to the base strsplit function, which is used under the hood (see the illustration after this list).

pattern

character, the pattern to split on (a regular expression by default, as in strsplit).

sep

character of length 1 with nchar(sep) == 1; strings are split on this single character.

xptr

logical, whether to tokenize at the C++ level; this can speed tokenization up by 15-50% (see the sketch after this list).
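
Two small illustrations of these arguments. The first passes a standard strsplit option through ...; the second turns on xptr (assuming the returned pointers are consumed by downstream text2vec functions rather than inspected directly):

# pass strsplit's fixed = TRUE through ... to treat the pattern literally
regexp_tokenizer("a.b.c", pattern = ".", fixed = TRUE)

# tokenize at the C++ level; returns external pointers instead of plain
# character vectors, intended for downstream text2vec pipelines
tokens_ptr = space_tokenizer("fast tokenization", sep = " ", xptr = TRUE)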

Value

A list of character vectors, one per input string; each element contains the vector of tokens extracted from the corresponding string.
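
The structure is easy to inspect directly, for example:

# one list element per input string, each a character vector of tokens
str(word_tokenizer(c("one two", "three")))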

Examples

doc = c("first  second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - splits on a single fixed whitespace character
space_tokenizer(doc, " ")
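# the remaining tokenizers (added illustrations, not part of the original example)
# split into individual characters
char_tokenizer(doc)
# split on a regular expression: commas followed by optional whitespace
regexp_tokenizer(doc, pattern = ",\\s*")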
