tokenize_internal

tokenize

tokenize_word

tokenize_word1

tokenize_character

tokenize_sentence

tokenize_fasterword

tokenize_fastestword

logical; if <code>TRUE</code>, split words that are joined by hyphens or
hyphen-like characters, keeping the hyphen as its own token, e.g.
<code>"self-aware"</code> becomes <code>c("self", "-", "aware")</code>

split_hyphens
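
The effect described for <code>split_hyphens</code> can be illustrated with a small base R sketch. It is not quanteda's implementation: the helper name <code>split_hyphens_sketch</code> is hypothetical, and it handles only the plain ASCII hyphen, not the other hyphen-like characters mentioned above.

```r
# Hypothetical helper, NOT part of quanteda: a minimal illustration of
# what splitting on hyphens means for a character vector of tokens.
split_hyphens_sketch <- function(tokens) {
  # Put spaces around every hyphen that sits between two word characters,
  # so the hyphen survives as a separate token after splitting.
  padded <- gsub("(?<=\\w)-(?=\\w)", " - ", tokens, perl = TRUE)
  # Split on whitespace and flatten the result to one character vector.
  unlist(strsplit(padded, "\\s+"))
}

split_hyphens_sketch("self-aware")
```

Under these assumptions the call returns the three tokens shown in the argument description, while a token without an internal hyphen passes through unchanged.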

if <code>TRUE</code>, print timing messages to the console

verbose

used to pass arguments among the functions

Internal methods for tokenization providing default and legacy methods for
text segmentation.

internal

tokens

A fast, flexible, and comprehensive framework for quantitative text analysis
in R. Provides functionality for corpus management, creating and manipulating
tokens and ngrams, exploring keywords in context, forming and manipulating
sparse matrices of documents by features and feature co-occurrences,
analyzing keywords, computing feature similarities and distances, applying
content dictionaries, applying supervised and unsupervised machine learning,
visually representing text and text analyses, and more.

Kenneth Benoit

quanteda

Quantitative Analysis of Textual Data

Kohei Watanabe

Haiyan Wang

Paul Nulty

Adam Obeng

Stefan Müller

Akitaka Matsuo

Jiong Wei Lua

Jouni Kuha

William Lowe

Christian Müller

Lori Young

Stuart Soroka

Ian Fellows

European Research Council 

tokenize_internal function

quanteda tokenizers

Description

Usage

Arguments

Value

Examples