chunk_text

Given a text, or a vector or list of texts, break each text into smaller
segments, each with the same number of words. This allows you to treat a very
long document, such as a novel, as a set of smaller documents.
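The idea of fixed-size word chunking can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `chunk_words` is a hypothetical helper, and it splits on whitespace rather than calling `tokenize_words`. In this sketch the final chunk holds whatever words remain, so it may be shorter than the others.

```r
# Hypothetical helper (not part of 'tokenizers'): split one text into
# chunks of at most `chunk_size` words.
chunk_words <- function(text, chunk_size) {
  stopifnot(is.character(text), length(text) == 1, chunk_size >= 1)
  # Naive whitespace split; an assumption made to keep the sketch dependency-free.
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Assign each word to a group: words 1..chunk_size go to group 1, and so on.
  groups <- ceiling(seq_along(words) / chunk_size)
  # Rejoin the words of each group into a single string.
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

novel <- "one two three four five six seven eight nine ten"
chunk_words(novel, chunk_size = 4)
```

Each element of the result can then be treated as its own small document.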

Convert natural language text into tokens. Includes tokenizers for
shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs,
characters, shingled characters, lines, Penn Treebank, and regular
expressions, as well as functions for counting characters, words, and
sentences, and a function for splitting longer texts into separate documents,
each with the same number of words. The tokenizers have a consistent
interface, and the package is built on the 'stringi' and 'Rcpp' packages for
fast yet correct tokenization in 'UTF-8'.

Lincoln Mullen

tokenizers

Fast, Consistent Tokenization of Natural Language Text

Os Keyes

Dmitriy Selivanov

Jeffrey Arnold

Kenneth Benoit

chunk_text function


Arguments


<dl>

<dt>x</dt>
<dd>A character vector or a list of character vectors to be chunked.
If <code>x</code> is a character vector, it can be of any length,
and each element will be chunked separately. If <code>x</code> is a list of
character vectors, each element of the list should have a length of 1.</dd>


<dt>chunk_size</dt>
<dd>The number of words in each chunk.</dd>


<dt>doc_id</dt>
<dd>The document IDs as a character vector. This will be taken from
the names of the <code>x</code> vector if available. <code>NULL</code> is acceptable.</dd>


<dt>...</dt>
<dd>Arguments passed on to <code>tokenize_words</code>.</dd>

</dl>
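A minimal usage sketch of the arguments above, assuming the 'tokenizers' package is installed. The sample texts and their names (`first`, `second`) are made up for illustration, and `lowercase = FALSE` is shown only as an example of an argument passed on to `tokenize_words`.

```r
library(tokenizers)

# Made-up sample documents; the names serve as document IDs.
docs <- c(
  first  = "The quick brown fox jumps over the lazy dog near the quiet river bank",
  second = "A short text"
)

# Chunk each document into segments of up to 5 words.
# doc_id is left at its default, so IDs are taken from names(docs).
chunks <- chunk_text(docs, chunk_size = 5)
length(chunks)
names(chunks)

# Extra arguments in ... are passed on to tokenize_words();
# here the original letter case is kept.
chunks_cased <- chunk_text(docs, chunk_size = 5, lowercase = FALSE)
```

Inspecting `names(chunks)` shows how each chunk is labeled by its source document.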
