koRpus (version 0.04-40)

treetag: A function to call TreeTagger

Description

This function calls a local installation of TreeTagger[1] to tokenize and POS tag the given text.

Usage

treetag(file, treetagger = "kRp.env", rm.sgml = TRUE,
    lang = "kRp.env",
    sentc.end = c(".", "!", "?", ";", ":"),
    encoding = NULL, TT.options = NULL, debug = FALSE,
    TT.tknz = TRUE, format = "file", stopwords = NULL,
    stemmer = NULL)

Arguments

file
Either a connection or a character vector giving a valid path to a file that contains the text to be analyzed. If file is a connection, its contents are written to a temporary file, since TreeTagger cannot read from R connection objects.
treetagger
A character vector giving the TreeTagger script to be called. If set to "kRp.env", this is taken from get.kRp.env. Only if set to "manual", it is assume
rm.sgml
Logical, whether SGML tags should be ignored and removed from the output.
lang
A character string naming the language of the analyzed corpus. See kRp.POS.tags for all supported languages. If set to "kRp.env", this is taken from
sentc.end
A character vector of tokens that indicate a sentence ending. These add to TreeTagger's results rather than replace them.
encoding
A character string defining the character encoding of the input file, like "Latin1" or "UTF-8". If NULL, the encoding will either be taken from a preset (if defined in TT.options), or fall back t
TT.options
A list of options to configure how TreeTagger is called. You have two basic choices: Either you choose one of the pre-defined presets or you give a full set of valid options:
  • path: Mandatory: The absolute path to the
  • tokenizer
  • tknz.opts
  • tagger
  • abbrev
  • params
  • lexicon
  • lookup
  • filter
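
The description of sentc.end says its tokens add to TreeTagger's own sentence detection instead of replacing it. A minimal base R sketch of that idea follows. It is not koRpus's actual implementation; the tokens and tags are made up for illustration, and "SENT" is assumed here to be the tag marking a sentence ending in the English tagset.

```r
# toy tokens and tags, invented for illustration (not real TreeTagger output)
tokens <- c("Wait", ";", "it", "works", ".")
tags   <- c("VB", ":", "PP", "VBZ", "SENT")

# the default value of sentc.end, as shown in the usage section
sentc.end <- c(".", "!", "?", ";", ":")

# endings found by the tagger (assumption: "SENT" marks them)
ends.by.tagger <- tags == "SENT"
# endings found by comparing each token with sentc.end
ends.by.token <- tokens %in% sentc.end

# both sources are combined, so sentc.end can only add endings
sentence.end <- ends.by.tagger | ends.by.token

data.frame(token = tokens, tag = tags, sentence.end = sentence.end)
```

In this toy data the semicolon is not tagged "SENT", but it is still counted as a sentence ending because it is listed in sentc.end.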

Value

  • An object of class kRp.tagged-class. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.

Details

Note that the value of lang must match a valid language supported by kRp.POS.tags. It is also stored in the resulting object and might be used by other functions at a later point. For example, treetag is called by freq.analysis, which by default queries this language definition unless explicitly told otherwise. The rationale is that you can keep tokenized and POS tagged objects of various languages in your workspace without having to keep track of each object's language yourself.
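
A hedged sketch of such a mixed-language workspace, assuming the koRpus version documented here and a TreeTagger installation in ~/bin/treetagger. The helper tag.with.preset and both file paths are hypothetical, and access to the language through the lang slot is an assumption about kRp.tagged-class. The calls are guarded, so nothing is tagged unless koRpus and the files are actually available.

```r
# hypothetical helper, not part of koRpus: tag one file with a built-in preset
tag.with.preset <- function(path, lang, tt.path = "~/bin/treetagger") {
  koRpus::treetag(path, treetagger = "manual", lang = lang,
    TT.options = list(path = tt.path, preset = lang))
}

# hypothetical input files, named by their language
texts <- c(en = "~/my.data/speech.txt", de = "~/my.data/rede.txt")

# only run if koRpus is installed and both files exist
if (requireNamespace("koRpus", quietly = TRUE) &&
    all(file.exists(path.expand(texts)))) {
  # one tagged object per language, kept side by side
  tagged <- Map(tag.with.preset, texts, names(texts))
  # assumption: each object stores its language in the slot "lang"
  print(sapply(tagged, function(obj) obj@lang))
  # freq.analysis can then rely on the stored language by default
  freq.en <- koRpus::freq.analysis(tagged$en)
}
```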

References

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44-49.

[1] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

See Also

freq.analysis, get.kRp.env, kRp.tagged-class

Examples

# first way to invoke POS tagging, using a built-in preset:
tagged.results <- treetag("~/my.data/speech.txt", treetagger="manual", lang="en",
   TT.options=list(path="~/bin/treetagger", preset="en"))
# second way, use one of the batch scripts that come with TreeTagger:
tagged.results <- treetag("~/my.data/speech.txt",
   treetagger="~/bin/treetagger/cmd/tree-tagger-english", lang="en")
# third option, set the above batch script in an environment object first:
set.kRp.env(TT.cmd="~/bin/treetagger/cmd/tree-tagger-english", lang="en")
tagged.results <- treetag("~/my.data/speech.txt")

# after tagging, use the resulting object with other functions in this package:
readability(tagged.results)
lex.div(tagged.results)

## enabling stopword detection and stemming
# if you also installed the packages tm and Snowball,
# you can use some of their features with koRpus:
set.kRp.env(TT.cmd="manual", lang="en", TT.options=list(path="~/bin/treetagger",
   preset="en"))
tagged.results <- treetag("~/my.data/speech.txt",
   stopwords=tm::stopwords("en"),
   stemmer=Snowball::SnowballStemmer)
# removing all stopwords now is simple:
tagged.noStopWords <- kRp.filter.wclass(tagged.results, "stopword")
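
The file argument also accepts a connection, whose contents are written to a temporary file before TreeTagger is called. The base R sketch below illustrates that step only. It is not the code koRpus uses internally, and the sample text is made up.

```r
# a connection holding some made-up text
con <- textConnection(c("This is a short test.", "It has two lines."))

# TreeTagger needs a real file, so dump the connection to a temporary one
tmp.file <- tempfile(fileext = ".txt")
writeLines(readLines(con), con = tmp.file)
close(con)

# the temporary file now contains the same lines as the connection did
round.trip <- readLines(tmp.file)
print(round.trip)

# clean up when done
unlink(tmp.file)
```

With koRpus itself this step is not needed, since a connection can be given directly as the file argument.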
