txt_recode_ngram

x

a character vector of words in which tokens should be replaced by compound multi-word expressions.
This is generally the token column of <code>as.data.frame(udpipe_annotate(txt))</code>.

compound

a character vector of compound multi-word expressions, i.e. terms which can be considered as one word.
For example <code>c('New York', 'Brussels Hoofdstedelijk Gewest')</code>.

ngram

an integer vector of the same length as <code>compound</code> giving the number of words in each multi-word expression
in <code>compound</code>, so that <code>compound[i]</code> contains <code>ngram[i]</code> words.
So if <code>compound</code> is <code>c('New York', 'Brussels Hoofdstedelijk Gewest')</code>, <code>ngram</code> would be <code>c(2, 3)</code>.
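
If the compounds are only available as strings, the word counts can be derived by splitting each compound on the separator. The following is a minimal base R sketch, assuming the separator is a single space and never occurs inside a word:

```r
# Multi-word expressions, joined with a space
compound <- c("New York", "Brussels Hoofdstedelijk Gewest")
sep <- " "

# Split each compound on the separator and count the pieces;
# fixed = TRUE treats the separator as a literal string, not a regex
ngram <- lengths(strsplit(compound, split = sep, fixed = TRUE))
ngram
```

The resulting vector has one entry per element of <code>compound</code>, which is the shape described above.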

sep

separator that was used to combine the words into a compound multi-word expression. Defaults to a space: ' '.

Replaces tokens in a character vector with compound multi-word expressions,
so that <code>c("New", "York")</code> becomes <code>c("New York", NA)</code>: the first token carries the compound and the remaining tokens are set to <code>NA</code>.
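
To make the recoding concrete, here is a rough base R sketch of the behaviour. It is not the package's implementation: the function name <code>recode_ngram_sketch</code> is hypothetical, and matching longer compounds first is an assumption made only for this illustration.

```r
# Hypothetical helper illustrating the recoding; NOT udpipe's own code
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  stopifnot(length(compound) == length(ngram))
  out <- x
  used <- rep(FALSE, length(x))  # tokens already absorbed into a compound
  # Assumption: try longer compounds first so they win over shorter ones
  for (k in order(ngram, decreasing = TRUE)) {
    n <- ngram[k]
    if (n < 2 || n > length(x)) next
    for (i in seq_len(length(x) - n + 1)) {
      idx <- i:(i + n - 1)
      if (any(used[idx])) next
      # Join n consecutive tokens and compare with the compound
      if (identical(paste(x[idx], collapse = sep), compound[k])) {
        out[i] <- compound[k]   # first token becomes the compound
        out[idx[-1]] <- NA      # remaining tokens are blanked out
        used[idx] <- TRUE
      }
    }
  }
  out
}

tokens <- c("I", "went", "to", "New", "York", "City")
recoded <- recode_ngram_sketch(tokens, compound = "New York", ngram = 2)
data.frame(tokens, recoded)
```

The output has the same length as the input, so it can be stored as an extra column next to the original tokens.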

This natural language processing toolkit provides language-agnostic
'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency
parsing' of raw text. Next to text parsing, the package also allows you to train
annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided
at <http://universaldependencies.org/format.html>. The techniques are explained
in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0
with UDPipe', available at <doi:10.18653/v1/K17-3009>.

txt_recode_ngram: Recode words with compound multi-word expressions

Description

Usage

Arguments

Value

See Also

Examples
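
A short usage sketch, assuming the udpipe package is installed. The token vector is made up for illustration; the arguments follow the descriptions on this page.

```r
library(udpipe)

# Tokens as they might appear in the token column of an annotation
x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")

# Recode the two-word expression "New York"; words were joined by a space
y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ")

# Show original and recoded tokens side by side
data.frame(x, y)
```

Per the description, the token <code>"New"</code> is replaced by <code>"New York"</code> and the following <code>"York"</code> becomes <code>NA</code>.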