
koRpus (version 0.10-2)

read.tagged: Import already tagged texts

Description

This function can be used on text files or matrices containing already tagged text material, e.g. the results of TreeTagger[1].

Usage

read.tagged(file, lang = "kRp.env", encoding = NULL,
  tagger = "TreeTagger", apply.sentc.end = TRUE, sentc.end = c(".", "!",
  "?", ";", ":"), stopwords = NULL, stemmer = NULL, rm.sgml = TRUE)

Arguments

file

Either a matrix, a connection or a character vector. A character vector must be a valid path to a file containing the previously analyzed text. A matrix must contain three columns named "token", "tag", and "lemma"; only these three columns are used.
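The matrix form described above can be sketched as follows. The toy tokens, tags and lemmas are hypothetical illustration data, not taken from the package; the tags follow the Penn-Treebank-style set that TreeTagger uses for English. The import itself is only attempted when a koRpus release from the 0.10 series documented here is installed, because later releases reorganised language support and may behave differently.

```r
# Hypothetical toy data: one row per token, with the three column
# names that read.tagged() expects for matrix input.
tagged.mat <- matrix(
  c("This", "DT",   "this",
    "is",   "VBZ",  "be",
    "a",    "DT",   "a",
    "test", "NN",   "test",
    ".",    "SENT", "."),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("token", "tag", "lemma"))
)

# Only try the import with the koRpus series this page documents
# (assumption: later releases may need a different setup).
if (requireNamespace("koRpus", quietly = TRUE) &&
    utils::packageVersion("koRpus") < "0.11") {
  tagged.results <- koRpus::read.tagged(tagged.mat, lang = "en")
}
```

Columns other than these three may be present in the matrix, but according to the description above they are not used.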

lang

A character string naming the language of the analyzed corpus. See kRp.POS.tags for all supported languages. If set to "kRp.env", the language is taken from get.kRp.env.

encoding

A character string defining the character encoding of the input file, like "Latin1" or "UTF-8". If NULL, the encoding will either be taken from a preset (if defined in TT.options), or fall back to "". Hence you can override the preset encoding with this parameter.

tagger

The software which was used to tokenize and tag the text. Currently, TreeTagger is the only supported tagger.

apply.sentc.end

Logical, whether the tokens defined in sentc.end should be searched for and set to a sentence ending tag. You could call this a compatibility mode to make sure you get the results you would get if you called treetag on the original file. If set to FALSE, the tags will be imported as they are.

sentc.end

A character vector with tokens indicating a sentence ending. These tokens are added to the sentence endings already present in the tagged results; they do not replace them.
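The effect of apply.sentc.end and sentc.end can be illustrated with a small base R sketch. This is not the koRpus implementation, only an approximation of the described behaviour under two assumptions: that the sentence ending tag is "SENT" (as in the English TreeTagger tag set), and that re-tagging is a plain token lookup. The token and tag vectors are hypothetical.

```r
# Hypothetical tagged input: the semicolon carries the generic
# punctuation tag ":", only the full stop is already tagged "SENT".
tokens <- c("Wait", ";", "stop", ".")
tags   <- c("VB",   ":", "VB",   "SENT")

# Default value of sentc.end from the usage section.
sentc.end <- c(".", "!", "?", ";", ":")

# Tokens listed in sentc.end get the sentence ending tag; all other
# tags, including existing "SENT" tags, are left untouched.
tags.fixed <- ifelse(tokens %in% sentc.end, "SENT", tags)
```

With apply.sentc.end=FALSE no such step would take place, and the semicolon in this sketch would keep its original tag.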

stopwords

A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set stopwords=tm::stopwords("en") to use the English stopwords provided by the tm package.
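What a lower case comparison means in practice can be shown with a minimal base R sketch. It approximates the described behaviour and is not the package's internal code; the token vector and the short stopword list are hypothetical.

```r
# Hypothetical tokens in mixed case and a lower case stopword list.
tokens       <- c("The", "tagger", "AND", "the", "lexicon")
my.stopwords <- c("the", "and")

# Tokens are lowered before matching, so "The" and "AND" both count
# as stopwords even though the list only has lower case entries.
is.stopword <- tolower(tokens) %in% my.stopwords
```

A stopword list should therefore be supplied in lower case; entries containing upper case letters would never match under this scheme.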

stemmer

A function or method to perform stemming. For instance, you can set stemmer=Snowball::SnowballStemmer if you have the Snowball package installed (or SnowballC::wordStem). As of now, you cannot provide further arguments to this function.

rm.sgml

Logical, whether SGML tags should be ignored and removed from the output.

Value

An object of class kRp.tagged-class. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.

Details

Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also get stored in the resulting object and might be used by other functions at a later point.

References

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44--49.

[1] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

See Also

treetag, freq.analysis, get.kRp.env, kRp.tagged-class

Examples

# NOT RUN {
tagged.results <- read.tagged("~/my.data/tagged_speech.txt", lang="en")
# }
