cnlp_annotate: Run the annotation pipeline on a set of documents

Description

Runs the clean_nlp annotators over a given corpus of text using the desired backend. The details for which annotators to run and how to run them are specified by using one of: cnlp_init_stringi, cnlp_init_spacy, cnlp_init_udpipe, or cnlp_init_corenlp.

Usage

cnlp_annotate(
  input,
  backend = NULL,
  verbose = 10,
  text_name = "text",
  doc_name = "doc_id"
)

Value

a named list with components "token", "document" and (when running spacy with NER) "entity".

Arguments

input: an object containing the data to parse. Either a character vector with the texts (optional names can be given to provide document ids) or a data frame. The data frame should have a column named 'text' containing the raw text to parse; if there is a column named 'doc_id', it is treated as a a document identifier. The name of the text and document id columns can be changed by setting text_name and doc_name This conforms with corpus objects respecting the Text Interchange Format (TIF), while allowing for some variation.
backend: name of the backend to use. Will default to the last model to be initalized.
verbose: set to a positive integer n to display a progress message to display every n'th record. The default is 10. Set to a non-positive integer to turn off messages. Logical input is converted to an integer, so it also possible to set to TRUE (1) to display a message for every document and FALSE (0) to turn off messages.
text_name: column name containing the text input. The default is 'text'. This parameter is ignored when input is a character vector.
doc_name: column name containing the document ids. The default is 'doc_id'. This parameter is ignored when input is a character vector.

Author

Taylor B. Arnold, taylor.arnold@acm.org

Details

The returned object is a named list where each element containing a data frame. The document table contains one row for each document, along with with all of the metadata that was passed as an input. The tokens table has one row for each token detected in the input. The first three columns are always "doc_id" (to index the input document), "sid" (an integer index for the sentence number), and "tid" (an integer index to the specific token). Together, these are a primary key for each row.

Other columns provide extracted data about each token, which differ slightly based on which backend, language, and options are supplied.

token: detected token, as given in the original input
token_with_ws: detected token along with white space; in, theory, collapsing this field through an entire document will yield the original text
lemma: lemmatised version of the token; the exact form depends on the choosen language and backend
upos: the universal part of speech code; see https://universaldependencies.org/u/pos/all.html for more information
xpos: language dependent part of speech code; the specific categories and their meaning depend on the choosen backend, model and language
feats: other extracted linguistic features, typically given as Universal Dependencies (https://universaldependencies.org/u/feat/index.html), but can be model dependent; currently only provided by the udpipe backend
tid_source: the token id (tid) of the head word for the dependency relationship starting from this token; for the token attached to the root, this will be given as zero
relation: the dependency relation, usually provided using Universal Dependencies (more information available here https://universaldependencies.org/ ), but could be different for a specific model

Examples

Run this code

cnlp_init_stringi()
cnlp_annotate(un)

Run the code above in your browser using DataLab