This function can be used on text files or matrices containing already tagged text material, e.g. the results of TreeTagger[1].
read.tagged(file, lang = "kRp.env", encoding = NULL,
tagger = "TreeTagger", apply.sentc.end = TRUE, sentc.end = c(".", "!",
"?", ";", ":"), stopwords = NULL, stemmer = NULL, rm.sgml = TRUE)
Either a matrix, a connection or a character vector. If the latter, that must be a valid path to a file, containing the previously analyzed text. If it is a matrix, it must contain three columns named "token", "tag", and "lemma", and only these three columns are used.
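As a minimal sketch of the matrix form described above, the following builds a three-column character matrix with the required column names. The tokens, tags, and lemmas are made-up illustration data in the style of TreeTagger's English tag set, and the name tagged.mtx is hypothetical.

```r
# Hypothetical example data: one row per token, columns in the order
# token / tag / lemma. The values are invented for illustration.
tagged.mtx <- matrix(
  c("This",  "DT",   "this",
    "works", "VVZ",  "work",
    ".",     "SENT", "."),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("token", "tag", "lemma"))
)

# With koRpus and its English language support available, the matrix can be
# passed in place of a file path (not run here):
# tagged.results <- read.tagged(tagged.mtx, lang = "en")
```

Any additional columns would be ignored, since only the three named columns are used.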
A character string naming the language of the analyzed corpus. See kRp.POS.tags for all supported languages. If set to "kRp.env", the language is taken from get.kRp.env.
A character string defining the character encoding of the input file, like "Latin1" or "UTF-8". If NULL, the encoding will either be taken from a preset (if defined in TT.options), or fall back to "". Hence you can override the preset encoding with this parameter.
The software which was used to tokenize and tag the text. Currently, TreeTagger is the only supported tagger.
Logical, whether the tokens defined in sentc.end should be searched and set to a sentence ending tag. You could call this a compatibility mode to make sure you get the results you would get if you called treetag on the original file. If set to FALSE, the tags will be imported as they are.
A character vector with tokens indicating a sentence ending. This adds to the given results; it doesn't replace them.
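The effect of apply.sentc.end together with sentc.end can be pictured roughly as follows. This is only a sketch in base R, not the package's internal code: it assumes "SENT" is the sentence ending tag (as in TreeTagger's English tag set), and the tokens and tags are invented.

```r
# Invented tokens and tags; ";" arrives with a non-sentence-ending tag.
tokens <- c("Wait", ";", "really", "?")
tags   <- c("VV",   ":", "RB",     "SENT")

# Default value of sentc.end from the usage section.
sentc.end <- c(".", "!", "?", ";", ":")

# Assumption: matching tokens are re-tagged as "SENT". Tags that already
# mark a sentence ending stay as they are, so this adds to the given results.
is.end <- tokens %in% sentc.end
tags[is.end] <- "SENT"
```

After this step both ";" and "?" carry the sentence ending tag, while the other tags are untouched.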
A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set stopwords=tm::stopwords("en") to use the English stopwords provided by the tm package.
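To illustrate what a lower-case comparison means in practice, here is a small base R sketch. It is not the package's implementation; it assumes the stopword list itself is already in lower case, and the example vectors are made up.

```r
# Invented tokens with mixed case, and a lower-case stopword list.
tokens    <- c("The", "cat", "AND", "dog")
stopwords <- c("the", "and")

# Tokens are lower-cased before matching, so "The" and "AND" both count.
is.stopword <- tolower(tokens) %in% stopwords
```

Here is.stopword flags the first and third token, regardless of their original capitalization.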
A function or method to perform stemming. For instance, you can set stemmer=Snowball::SnowballStemmer if you have the Snowball package installed (or SnowballC::wordStem). As of now, you cannot provide further arguments to this function.
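Since further arguments cannot be passed to the stemmer, one possible workaround is to wrap the stemming function so that any extra arguments are fixed inside the wrapper. The sketch below assumes the stemmer is called with a character vector of tokens as its only argument; toy.stemmer and german.stemmer are hypothetical names, and the toy suffix rule is for illustration only.

```r
# Toy stemmer for illustration: strips a few common English suffixes.
toy.stemmer <- function(words) {
  sub("(ing|ed|s)$", "", words)
}
stems <- toy.stemmer(c("walking", "walked", "walks"))

# Wrapper that fixes the language argument of SnowballC::wordStem, so the
# resulting function needs only the words themselves. Defining the wrapper
# does not require SnowballC; calling it does.
german.stemmer <- function(words) {
  SnowballC::wordStem(words, language = "german")
}

# Not run here, needs koRpus and SnowballC:
# tagged.results <- read.tagged("~/my.data/tagged_speech.txt", lang = "de",
#   stemmer = german.stemmer)
```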
Logical, whether SGML tags should be ignored and removed from the output.
An object of class kRp.tagged-class. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.
Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also get stored in the resulting object and might be used by other functions at a later point.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44-49.
[1] http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
# NOT RUN {
tagged.results <- read.tagged("~/my.data/tagged_speech.txt", lang="en")
# }