tokenize_custom

breakrules_get

breakrules_set

breakrules_reset

Tokenizes texts using customized boundary rules. See the <a href="https://unicode-org.github.io/icu/userguide/boundaryanalysis/break-rules.html">ICU website</a>
for how to define boundary rules.
Also provides tools to retrieve or set the custom word and sentence break rules,
and to reset them to the package defaults.

internal

A fast, flexible, and comprehensive framework for quantitative text analysis
in R. Provides functionality for corpus management, creating and manipulating
tokens and n-grams, exploring keywords in context, forming and manipulating
sparse matrices of documents by features and feature co-occurrences, analyzing
keywords, computing feature similarities and distances, applying content
dictionaries, applying supervised and unsupervised machine learning, visually
representing text and text analyses, and more.

Kenneth Benoit

quanteda

Quantitative Analysis of Textual Data

Kohei Watanabe

Haiyan Wang

Paul Nulty

Adam Obeng

Stefan Müller

Akitaka Matsuo

William Lowe

Christian Müller

Olivier Delmarcelle

European Research Council 

tokenize_custom function

<dl><dt>x</dt>
<dd>character vector of texts to tokenize</dd>
<dt>rules</dt>
<dd>a list of rules for rule-based boundary detection</dd>
<dt>what</dt>
<dd>character; which set of rules to return, one of <code>"word"</code> or
<code>"sentence"</code></dd></dl>

Arguments

Customizable tokenizer — tokenize_custom

Description

Usage

Value

Arguments

Details

Examples
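
The following is a minimal sketch of how these functions might be used together. It assumes the quanteda package is installed and that the functions are called as `tokenize_custom(x, rules)`, `breakrules_get(what)`, `breakrules_set(x, what)`, and `breakrules_reset()`, matching the arguments listed on this page. The sample text and variable names are illustrative only.

```r
library(quanteda)

# Illustrative input text (an assumption, not taken from the package docs)
txt <- "a well-known http://example.com"

# Retrieve the current word break rules; these are passed on as `rules`
rules <- breakrules_get("word")

# Tokenize the text using the retrieved rules
toks <- tokenize_custom(txt, rules = rules)

# The sentence rules can be retrieved the same way
sent_rules <- breakrules_get("sentence")

# Set the word rules (here an unmodified copy, so behaviour is unchanged),
# then restore the package defaults
breakrules_set(rules, what = "word")
breakrules_reset()
```

To customize tokenization, one would edit or add elements of `rules` following the ICU break rule syntax before passing it to `tokenize_custom()` or `breakrules_set()`. Calling `breakrules_reset()` afterwards avoids leaving modified rules in place for later calls.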