The purpose of these functions is to facilitate the creation of
annotators for basic NLP tasks as described below.

Simple_Para_Token_Annotator() creates “simple” paragraph token
annotators. Argument f should be a paragraph tokenizer, which takes a
string s with the whole text to be processed, and returns the spans of
the paragraphs in s, or an annotation object with these spans and
(possibly) additional features. The generated annotator inherits from
the default classes "Simple_Para_Token_Annotator" and "Annotator". It
uses the results of the simple paragraph tokenizer to create and
return annotations with unique ids and type ‘paragraph’.
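
As an illustration of the paragraph case, the following is a minimal sketch that assumes the NLP package is attached. The tokenizer `para_tokenizer` and its rule (paragraphs separated by two or more newlines) are hypothetical and chosen only for the example; it does not handle text that starts or ends with a separator.

```r
library(NLP)

# Hypothetical paragraph tokenizer (an assumption for this sketch):
# paragraphs are separated by runs of two or more newlines.
# It returns a Span object and provides no ids of its own.
para_tokenizer <- function(s) {
    s <- as.String(s)
    sep <- gregexpr("\n{2,}", s)[[1L]]
    if (sep[1L] == -1L)                  # no separator: a single paragraph
        return(Span(1L, nchar(s)))
    sep_start <- as.integer(sep)
    sep_end <- sep_start + attr(sep, "match.length") - 1L
    Span(c(1L, sep_end + 1L), c(sep_start - 1L, nchar(s)))
}

para_token_annotator <- Simple_Para_Token_Annotator(para_tokenizer)

s <- String("First paragraph.\n\nSecond paragraph.")
a <- annotate(s, para_token_annotator)
a        # annotations of type "paragraph" with generated ids
s[a]     # the paragraph texts
```

The tokenizer only reports spans; the ids and the ‘paragraph’ type are supplied by the generated annotator.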

Simple_Sent_Token_Annotator() creates “simple” sentence token
annotators. Argument f should be a sentence tokenizer, which takes a
string s with the whole text to be processed, and returns the spans of
the sentences in s, or an annotation object with these spans and
(possibly) additional features. The generated annotator inherits from
the default classes "Simple_Sent_Token_Annotator" and "Annotator". It
uses the results of the simple sentence tokenizer to create and return
annotations with unique ids and type ‘sentence’, possibly combined
with sentence constituent features for already available paragraph
annotations.
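
A sentence token annotator can be sketched along the same lines. The example below assumes the NLP package is attached; `sent_tokenizer` is a deliberately trivial, illustrative tokenizer that treats everything from a non-space character up to the next period as one sentence.

```r
library(NLP)

s <- String("  First sentence.  Second sentence.  ")

# Trivial sentence tokenizer (illustrative only): a sentence runs from
# a non-space character up to and including the next period.
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

sent_token_annotator <- Simple_Sent_Token_Annotator(sent_tokenizer)

a1 <- annotate(s, sent_token_annotator)
a1       # annotations of type "sentence" with generated ids
s[a1]    # the sentence texts
```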

Simple_Word_Token_Annotator() creates “simple” word token annotators.
Argument f should be a simple word tokenizer, which takes a string s
giving a sentence to be processed, and returns the spans of the word
tokens in s, or an annotation object with these spans and (possibly)
additional features. The generated annotator inherits from the default
classes "Simple_Word_Token_Annotator" and "Annotator". It uses already
available sentence token annotations to extract the sentences and
obtains the results of the word tokenizer for these. It then adds the
sentence character offsets and unique word token ids, and word token
constituents features for the sentences, and returns the word token
annotations combined with the augmented sentence token annotations.
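
The offset handling described above can be seen in a small sketch, assuming the NLP package is attached. Both tokenizers are trivial and illustrative; note that `word_tokenizer` works on a single sentence and returns spans relative to that sentence, while the resulting word annotations carry offsets relative to the whole text.

```r
library(NLP)

s <- String("  First sentence.  Second sentence.  ")

# Trivial regexp-based tokenizers (illustrative only).
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
word_tokenizer <- function(s) {
    s <- as.String(s)
    s <- substring(s, 1L, nchar(s) - 1L)   # drop the final period
    m <- gregexpr("[^[:space:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

sent_token_annotator <- Simple_Sent_Token_Annotator(sent_tokenizer)
word_token_annotator <- Simple_Word_Token_Annotator(word_tokenizer)

# Sentence annotations must be available before word tokenization.
a1 <- annotate(s, sent_token_annotator)
a2 <- annotate(s, word_token_annotator, a1)
a2

words <- subset(a2, type == "word")
sents <- subset(a2, type == "sentence")
s[words]   # the word tokens
```

In the result, the sentence annotations gain a constituents feature holding the ids of their word tokens.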

Simple_POS_Tag_Annotator() creates “simple” POS tag annotators.
Argument f should be a simple POS tagger, which takes a character
vector giving the word tokens in a sentence, and returns either a
character vector with the tags, or a list of feature maps with the
tags as ‘POS’ feature and possibly other features. The generated
annotator inherits from the default classes "Simple_POS_Tag_Annotator"
and "Annotator". It uses already available sentence and word token
annotations to extract the word tokens for each sentence and obtains
the results of the simple POS tagger for these, and returns
annotations for the word tokens with the features obtained from the
POS tagger.
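
To make the expected shape of a simple POS tagger concrete, here is a minimal sketch assuming the NLP package is attached. `toy_pos_tagger` and its three-word lexicon are hypothetical stand-ins for a real tagger; the tokenizers are the same trivial illustrative ones used for sentence and word tokenization.

```r
library(NLP)

s <- String("  First sentence.  Second sentence.  ")

# Trivial regexp-based tokenizers (illustrative only).
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
word_tokenizer <- function(s) {
    s <- as.String(s)
    s <- substring(s, 1L, nchar(s) - 1L)   # drop the final period
    m <- gregexpr("[^[:space:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
a2 <- annotate(s, list(Simple_Sent_Token_Annotator(sent_tokenizer),
                       Simple_Word_Token_Annotator(word_tokenizer)))

# Hypothetical toy tagger: looks tokens up in a tiny lexicon and falls
# back to "NN". It receives the tokens of one sentence and returns one
# tag per token as a character vector.
toy_pos_tagger <- function(tokens) {
    lexicon <- c(First = "JJ", Second = "JJ", sentence = "NN")
    tags <- unname(lexicon[tokens])
    tags[is.na(tags)] <- "NN"
    tags
}

pos_tag_annotator <- Simple_POS_Tag_Annotator(toy_pos_tagger)
a3 <- annotate(s, pos_tag_annotator, a2)

words <- subset(a3, type == "word")
tags <- vapply(words$features, function(x) x$POS, "")
sprintf("%s/%s", s[words], tags)
```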

Simple_Entity_Annotator() creates “simple” entity annotators. Argument
f should be a simple entity detector (“named entity recognizer”),
which takes a character vector giving the word tokens in a sentence,
and returns an annotation object with the word token spans, a ‘kind’
feature giving the kind of the entity detected, and possibly other
features. The generated annotator inherits from the default classes
"Simple_Entity_Annotator" and "Annotator". It uses already available
sentence and word token annotations to extract the word tokens for
each sentence and obtains the results of the simple entity detector
for these, transforms word token spans to character spans and adds
unique ids, and returns the combined entity annotations.
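
The transformation from word token spans to character spans can be illustrated with a small sketch, assuming the NLP package is attached. `toy_entity_detector` and its gazetteer are hypothetical, and the annotation type "entity" used for its results is an assumption of this example rather than something stated above. The detector reports spans as token indices within the sentence and leaves the ids unset.

```r
library(NLP)

s <- String("  Alice visited Paris.  Bob stayed home.  ")

# Trivial regexp-based tokenizers (illustrative only).
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
word_tokenizer <- function(s) {
    s <- as.String(s)
    s <- substring(s, 1L, nchar(s) - 1L)   # drop the final period
    m <- gregexpr("[^[:space:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
a2 <- annotate(s, list(Simple_Sent_Token_Annotator(sent_tokenizer),
                       Simple_Word_Token_Annotator(word_tokenizer)))

# Hypothetical toy detector: single-token entities from a gazetteer.
# start and end are token positions in the sentence, not characters.
toy_entity_detector <- function(tokens) {
    gazetteer <- c(Alice = "person", Bob = "person", Paris = "location")
    hit <- which(tokens %in% names(gazetteer))
    if (!length(hit))
        return(Annotation())               # no entities in this sentence
    kinds <- unname(gazetteer[tokens[hit]])
    Annotation(NULL,                       # no ids: the annotator adds them
               rep.int("entity", length(hit)),
               hit, hit,
               lapply(kinds, function(k) list(kind = k)))
}

entity_annotator <- Simple_Entity_Annotator(toy_entity_detector)
a3 <- annotate(s, entity_annotator, a2)

ents <- subset(a3, type == "entity")
ents       # character spans in the whole text, with generated ids
s[ents]
```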

Simple_Chunk_Annotator() creates “simple” chunk annotators. Argument f
should be a simple chunker, which takes as arguments character vectors
giving the word tokens and the corresponding POS tags, and returns
either a character vector with the chunk tags, or a list of feature
lists with the tags as ‘chunk_tag’ feature and possibly other
features. The generated annotator inherits from the default classes
"Simple_Chunk_Annotator" and "Annotator". It uses already available
annotations to extract the word tokens and POS tags for each sentence
and obtains the results of the simple chunker for these, and returns
word token annotations with the chunk features (only).
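
A chunker needs both word tokens and POS tags, so the sketch below builds the whole chain. It assumes the NLP package is attached; `toy_pos_tagger` and `toy_chunker` are hypothetical and use made-up rules purely to show the two-argument interface and the resulting ‘chunk_tag’ feature.

```r
library(NLP)

s <- String("  First sentence.  Second sentence.  ")

# Trivial regexp-based tokenizers (illustrative only).
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
word_tokenizer <- function(s) {
    s <- as.String(s)
    s <- substring(s, 1L, nchar(s) - 1L)   # drop the final period
    m <- gregexpr("[^[:space:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}

# Hypothetical toy tagger: capitalized tokens are "JJ", others "NN".
toy_pos_tagger <- function(tokens) {
    ifelse(grepl("^[A-Z]", tokens), "JJ", "NN")
}

# Hypothetical toy chunker: an adjective opens a noun phrase, anything
# else continues it. Takes tokens and tags, returns one tag per token.
toy_chunker <- function(tokens, tags) {
    ifelse(tags == "JJ", "B-NP", "I-NP")
}

a3 <- annotate(s, list(Simple_Sent_Token_Annotator(sent_tokenizer),
                       Simple_Word_Token_Annotator(word_tokenizer),
                       Simple_POS_Tag_Annotator(toy_pos_tagger)))

chunk_annotator <- Simple_Chunk_Annotator(toy_chunker)
a4 <- annotate(s, chunk_annotator, a3)

words <- subset(a4, type == "word")
chunk_tags <- vapply(words$features, function(x) x$chunk_tag, "")
sprintf("%s/%s", s[words], chunk_tags)
```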

Simple_Stem_Annotator() creates “simple” stem annotators. Argument f
should be a simple stemmer, which takes as argument a character vector
giving the word tokens, and returns a character vector with the
corresponding word stems. The generated annotator inherits from the
default classes "Simple_Stem_Annotator" and "Annotator". It uses
already available annotations to extract the word tokens, and returns
word token annotations with the corresponding stem features (only).
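
For stemming, a minimal sketch could look as follows, assuming the NLP package is attached. `toy_stemmer` is a hypothetical suffix stripper, not a real stemming algorithm, and the feature name `stem` used to read the results back is assumed from the description above.

```r
library(NLP)

s <- String("  Dogs barked.  Cats jumped.  ")

# Trivial regexp-based tokenizers (illustrative only).
sent_tokenizer <- function(s) {
    s <- as.String(s)
    m <- gregexpr("[^[:space:]][^.]*\\.", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
word_tokenizer <- function(s) {
    s <- as.String(s)
    s <- substring(s, 1L, nchar(s) - 1L)   # drop the final period
    m <- gregexpr("[^[:space:]]+", s)[[1L]]
    Span(m, m + attr(m, "match.length") - 1L)
}
a2 <- annotate(s, list(Simple_Sent_Token_Annotator(sent_tokenizer),
                       Simple_Word_Token_Annotator(word_tokenizer)))

# Hypothetical toy stemmer: lower-cases and strips a final "ed" or "s".
# Takes a character vector of tokens, returns one stem per token.
toy_stemmer <- function(tokens) {
    tolower(sub("(ed|s)$", "", tokens))
}

stem_annotator <- Simple_Stem_Annotator(toy_stemmer)
a3 <- annotate(s, stem_annotator, a2)

words <- subset(a3, type == "word")
stems <- vapply(words$features, function(x) x$stem, "")
sprintf("%s -> %s", s[words], stems)
```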

In all cases, if the underlying simple processing function returns
annotation objects, these should not provide their own ids (or use
such ids in the features), as the generated annotators will
necessarily provide these (the already available annotations are
available only at the annotator level, not at the simple processing
level).