tokens_segment

<a rd-options="" href="/link/tokens?package=quanteda&version=1.5.1" data-mini-rdoc="quanteda::tokens">tokens</a> object whose token elements will be segmented

a character vector, list of character vectors,
<a rd-options="" href="/link/dictionary?package=quanteda&version=1.5.1" data-mini-rdoc="quanteda::dictionary">dictionary</a>, or <a rd-options="" href="/link/collocations?package=quanteda&version=1.5.1" data-mini-rdoc="quanteda::collocations">collocations</a> object. See <a rd-options="" href="/link/pattern?package=quanteda&version=1.5.1" data-mini-rdoc="quanteda::pattern">pattern</a> for
details.

pattern

the type of pattern matching: <code>"glob"</code> for 
"glob"-style wildcard expressions; <code>"regex"</code> for regular expressions;
or <code>"fixed"</code> for exact matching. See <a rd-options="" href="/link/valuetype?package=quanteda&version=1.5.1" data-mini-rdoc="quanteda::valuetype">valuetype</a> for details.

valuetype

ignore case when matching, if <code>TRUE</code>

case_insensitive

remove matched patterns from the texts and save in
<a rd-options="" href="/link/docvars?package=quanteda&version=1.5.1" data-mini-rdoc="quanteda::docvars">docvars</a>, if <code>TRUE</code>

extract_pattern

either <code>"before"</code> or <code>"after"</code>, depending 
on whether the pattern precedes the text (as with a tag) or follows the 
text (as with punctuation delimiters)

pattern_position

if <code>TRUE</code>, repeat the docvar values for each 
segmented text; if <code>FALSE</code>, drop the docvars in the segmented tokens object. 
Dropping the docvars can conserve space, or is useful when they
are not wanted in the segmented output.

use_docvars

Segment tokens by splitting on a pattern match. This is useful for breaking
the tokenized texts into smaller document units, based on a regular pattern
or a user-supplied annotation. While it normally makes more sense to do this
at the corpus level (see <code><a rd-options="" href="/link/corpus_segment?package=quanteda&version=1.5.1" data-mini-rdoc="quanteda::corpus_segment">corpus_segment</a></code>), <code>tokens_segment</code>
provides the option to perform this operation on tokens.
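
As a minimal sketch of the behaviour described above, the following splits one tokenized text into sentence-like segments at each full stop. The input text and the object names <code>toks</code> and <code>segs</code> are illustrative assumptions, not taken from the package documentation; the call uses only the arguments listed on this page.

```r
library(quanteda)

# Illustrative input (an assumption for this sketch)
txt <- "The first sentence ends here. The second sentence follows."
toks <- tokens(txt)

# Split after each "." token; with extract_pattern = FALSE the
# matched delimiter tokens are not removed from the result
segs <- tokens_segment(toks, pattern = ".", valuetype = "fixed",
                       extract_pattern = FALSE,
                       pattern_position = "after")

ndoc(segs)  # number of segments produced
```

Using <code>valuetype = "fixed"</code> here avoids having to escape the full stop, which would otherwise act as a wildcard under <code>"regex"</code>.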

internal

tokens

A fast, flexible, and comprehensive framework for
quantitative text analysis in R.  Provides functionality for corpus management,
creating and manipulating tokens and ngrams, exploring keywords in context,
forming and manipulating sparse matrices
of documents by features and feature co-occurrences, analyzing keywords, computing feature similarities and
distances, applying content dictionaries, applying supervised and unsupervised machine learning,
visually representing text and text analyses, and more.

Kenneth Benoit

quanteda

Quantitative Analysis of Textual Data

Kohei Watanabe

Haiyan Wang

Paul Nulty

Adam Obeng

Stefan Müller

Akitaka Matsuo

Jiong Wei Lua

Patrick O. Perry

Jouni Kuha

Benjamin Lauderdale

William Lowe

Christian Müller

Lori Young

Stuart Soroka

Ian Fellows

European Research Council 


tokens_segment: Segment tokens object by patterns

Description

Usage

Arguments

Value

Examples
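
A hedged sketch of segmenting on a user-supplied annotation, where a tag precedes the text it labels. The <code>##</code> tag convention, the input text, and the object names are assumptions made for illustration; <code>what = "fastestword"</code> is used so that tokens are split on spaces only and each tag stays in one piece.

```r
library(quanteda)

# Illustrative annotated text; the "##TAG" convention is an assumption
txt <- "##INTRO This is the intro. ##BODY This is the body."

# Split on spaces only so that each "##TAG" remains a single token
toks <- tokens(txt, what = "fastestword")

# Each tag opens a new segment (pattern_position = "before");
# extract_pattern = TRUE removes the matched tag from the tokens
# and stores it in the docvars of the result
segs <- tokens_segment(toks, pattern = "##*", valuetype = "glob",
                       extract_pattern = TRUE,
                       pattern_position = "before")

ndoc(segs)     # number of segments produced
docvars(segs)  # inspect the document variables, including the extracted tags
```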