Annotator: Annotator (pipeline) objects

Description

Create annotator (pipeline) objects.

Usage

Annotator(f, meta = list(), classes = NULL)
Annotator_Pipeline(..., meta = list())
as.Annotator_Pipeline(x)

Value

For Annotator(), an annotator object inheriting from the given classes and class "Annotator".

For Annotator_Pipeline() and as.Annotator_Pipeline(), an annotator pipeline object inheriting from class "Annotator_Pipeline".

Arguments

f: an annotator function. It must have formals s and a, giving respectively the string with the natural language text to annotate and an annotation object to start from, and it must return an annotation object with the computed annotations.
meta: an empty or named list of annotator (pipeline) metadata tag-value pairs.
classes: a character vector or NULL (default) giving classes to be used for the created annotator object in addition to "Annotator".
...: annotator objects.
x: an R object.

Details

Annotator() checks that the given annotator function has the appropriate formals, and returns an annotator object which inherits from the given classes and "Annotator". There are print() and format() methods for such objects, which use the description element of the metadata if available.
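The formals check and the class inheritance described above can be sketched as follows. This is a minimal illustration that assumes the NLP package is installed; the name identity_annotator and the class "Identity_Annotator" are hypothetical and chosen only for this example.

```r
library(NLP)

## A valid annotator function: its formals are exactly s and a.
## It simply returns the annotations it was given (illustrative only).
identity_annotator <-
    Annotator(function(s, a = Annotation()) a,
              meta = list(description =
                  "Returns the annotations it was given."),
              classes = "Identity_Annotator")

## The object inherits from the extra class as well as "Annotator".
inherits(identity_annotator, "Identity_Annotator")
inherits(identity_annotator, "Annotator")

## A function whose formals are not s and a should be rejected;
## capture the error instead of stopping.
res <- tryCatch(Annotator(function(text) Annotation()),
                error = function(e) e)
inherits(res, "error")
```

Printing identity_annotator should show the description supplied in the metadata.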

Annotator_Pipeline() creates an annotator pipeline object from the given annotator objects. Such pipeline objects can be used by annotate() for successively computing and merging annotations, and can also be obtained by coercion with as.Annotator_Pipeline(), which currently handles annotator objects and lists of such (and of course, annotator pipeline objects).
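A pipeline built from two annotators might look like the sketch below. It assumes the NLP package is installed; make_token_annotator() is a hypothetical helper written for this illustration (it is not part of NLP), and treating whitespace-separated chunks as "word" tokens is a simplification.

```r
library(NLP)

## Hypothetical helper: wrap a span tokenizer as an annotator that
## emits annotations of the given type with fresh consecutive ids.
make_token_annotator <- function(tokenizer, type) {
    Annotator(function(s, a = Annotation()) {
        spans <- tokenizer(s)
        n <- length(spans)
        Annotation(seq(from = next_id(a$id), length.out = n),
                   rep.int(type, n),
                   spans$start,
                   spans$end)
    },
    list(description = paste("A simple", type, "token annotator.")))
}

para_annotator <- make_token_annotator(blankline_tokenizer, "paragraph")
word_annotator <- make_token_annotator(whitespace_tokenizer, "word")

## Build the pipeline directly, or by coercing a list of annotators.
pipeline <- Annotator_Pipeline(para_annotator, word_annotator)
pipeline2 <- as.Annotator_Pipeline(list(para_annotator, word_annotator))

## annotate() runs the annotators in order and merges their results.
s <- String("First paragraph.\n\nSecond paragraph.")
a <- annotate(s, pipeline)
a
```

Because each annotator asks next_id() for the next free id, the word annotations continue numbering after the paragraph annotations, so the merged result has no id clashes.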

Examples

library(NLP)

## Use blankline_tokenizer() for a simple paragraph token annotator:
para_token_annotator <-
    Annotator(function(s, a = Annotation()) {
              spans <- blankline_tokenizer(s)
              n <- length(spans)
              ## Need n consecutive ids, starting with the next "free"
              ## one:
              from <- next_id(a$id)
              Annotation(seq(from = from, length.out = n),
                         rep.int("paragraph", n),
                         spans$start,
                         spans$end)
          },
          list(description = 
              "A paragraph token annotator based on blankline_tokenizer()."))
para_token_annotator
## Alternatively, use Simple_Para_Token_Annotator().

## A simple text with two paragraphs:
s <- String(paste("  First sentence.  Second sentence.  ",
                  "  Second paragraph.  ",
                  sep = "\n\n"))
a <- annotate(s, para_token_annotator)
## Annotations for paragraph tokens.
a
## Extract paragraph tokens.
s[a]