koRpus (version 0.13-8)

readTagged: Import already tagged texts

Description

This method can be used on text files or matrices containing already tagged text material, e.g. the results of TreeTagger[1].

Usage

readTagged(file, ...)

# S4 method for matrix
readTagged(
  file,
  lang = "kRp.env",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env",
  mtx_cols = c(token = "token", tag = "tag", lemma = "lemma")
)

# S4 method for data.frame
readTagged(
  file,
  lang = "kRp.env",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env",
  mtx_cols = c(token = "token", tag = "tag", lemma = "lemma")
)

# S4 method for kRp.connection
readTagged(
  file,
  lang = "kRp.env",
  encoding = getOption("encoding"),
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env"
)

# S4 method for character
readTagged(
  file,
  lang = "kRp.env",
  encoding = getOption("encoding"),
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env"
)

Arguments

file

Either a matrix, a data frame, a connection, or a character vector. A character vector must be a valid path to a file containing the previously analyzed text. A matrix or data frame must contain three columns named "token", "tag", and "lemma" (see mtx_cols for importing differently named columns); all other columns are ignored.

...

Additional options, currently unused.

lang

A character string naming the language of the analyzed corpus. See kRp.POS.tags for all supported languages. If set to "kRp.env", the value is fetched from get.kRp.env.

tagger

The software which was used to tokenize and tag the text. Currently, "TreeTagger" and "manual" are the only supported values. If "manual", you must also adjust the values of mtx_cols to define the columns to be imported.

apply.sentc.end

Logical, whether the tokens defined in sentc.end should be searched for and assigned a sentence ending tag. You could call this a compatibility mode that makes sure you get the same results as if you had called treetag on the original file. If set to FALSE, the tags will be imported as they are.

sentc.end

A character vector with tokens indicating a sentence ending. These are added to the sentence endings already present in the tagged results; they do not replace them.

stopwords

A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set stopwords=tm::stopwords("en") to use the English stopwords provided by the tm package.

stemmer

A function or method to perform stemming. For instance, you can set stemmer=Snowball::SnowballStemmer if you have the Snowball package installed (or SnowballC::wordStem). As of now, you cannot provide further arguments to this function.

rm.sgml

Logical, whether SGML tags should be ignored and removed from output.

doc_id

Character string, optional identifier of the particular document. Will be added to the desc slot.

add.desc

Logical. If TRUE, the tag description (column "desc" of the data.frame) will be added directly to the resulting object. If set to "kRp.env" this is fetched from get.kRp.env. Only needed if tag=TRUE.

mtx_cols

Character vector with exactly three elements named "token", "tag", and "lemma", the values of which must match the respective column names of the matrix provided via file. It is possible to set lemma=NA if the tagged results only provide token and tag. This argument is ignored unless tagger="manual" and data is provided as either a matrix or data frame.

encoding

A character string defining the character encoding of the input file, like "Latin1" or "UTF-8".
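
The effect of apply.sentc.end, sentc.end, and stopwords described above can be pictured with plain vectors. The following is only a base R sketch of the documented behavior, not the internal code of koRpus; the token and tag vectors, the "SENT" tag, and all variable names are assumptions made for illustration.

```r
# Hypothetical tokens and tags, as they might appear in already tagged material
tokens <- c("The", "result", ":", "it", "works", ".")
tags   <- c("DT",  "NN",     ":", "PP", "VBZ",   "SENT")

# Default value of sentc.end, taken from the usage section
sentc_end <- c(".", "!", "?", ";", ":")

# Assumption: "SENT" stands in for the sentence ending tag of the tagset
sentence_tag <- "SENT"

# Tokens listed in sentc.end get the sentence ending tag;
# tags that already mark a sentence ending stay as they are
tags_compat <- ifelse(tokens %in% sentc_end, sentence_tag, tags)

# Stopword detection compares lower-cased tokens against the stopword vector
stopwords <- c("the", "it")
is_stopword <- tolower(tokens) %in% stopwords
```

In this sketch the colon is retagged as a sentence ending while the full stop keeps the tag it already had, and "The" is matched against the stopword "the" only because the token is lower-cased first.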

Value

An object of class kRp.text. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.

Details

Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also get stored in the resulting object and might be used by other functions at a later point.

References

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44-49.

[1] https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

See Also

treetag, freq.analysis, get.kRp.env, kRp.text

Examples

# NOT RUN {
  # call method on a connection
  text_con <- file("~/my.data/tagged_speech.txt", "r")
  tagged_results <- readTagged(text_con, lang="en")
  close(text_con)

  # call it on the file directly
  tagged_results <- readTagged("~/my.data/tagged_speech.txt", lang="en")
  
  # import the results of RDRPOSTagger, using the "manual" tagger feature
  sample_text <- c("Dies ist ein kurzes Beispiel. Es ergibt wenig Sinn.")
  tagger <- RDRPOSTagger::rdr_model(language="German", annotation="POS")
  tagged_rdr <- RDRPOSTagger::rdr_pos(tagger, x=sample_text)
  tagged_results <- readTagged(
    tagged_rdr,
    lang="de",
    tagger="manual",
    mtx_cols=c(token="token", tag="pos", lemma=NA)
  )
# }
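
The examples above depend on external files or on RDRPOSTagger. As a complement, the sketch below builds a small tagged matrix by hand and imports it, first with the default column names and then as a data frame with other column names mapped through mtx_cols. The tokens, tags, and lemmas are made up for illustration, the object names are hypothetical, and the import itself assumes that koRpus and the English language package koRpus.lang.en are installed, so it is skipped otherwise.

```r
# Hypothetical tagged material using the default column names
tagged_mtx <- matrix(
  c(
    "This", "DT",   "this",
    "is",   "VBZ",  "be",
    "a",    "DT",   "a",
    "test", "NN",   "test",
    ".",    "SENT", "."
  ),
  ncol = 3,
  byrow = TRUE,
  dimnames = list(NULL, c("token", "tag", "lemma"))
)

# The same material as a data frame with other column names and no lemma column
tagged_df <- data.frame(
  word = tagged_mtx[, "token"],
  pos = tagged_mtx[, "tag"],
  stringsAsFactors = FALSE
)

# Map the data frame's column names to the names readTagged expects;
# lemma = NA states that no lemma column is available
custom_cols <- c(token = "word", tag = "pos", lemma = NA)

# Sanity check: every mapped column must exist in the data frame
stopifnot(all(na.omit(custom_cols) %in% colnames(tagged_df)))

# Assumption: koRpus and koRpus.lang.en are installed; skip the import otherwise
if (requireNamespace("koRpus", quietly = TRUE) &&
    requireNamespace("koRpus.lang.en", quietly = TRUE)) {
  library(koRpus.lang.en)

  # Default column names, so mtx_cols does not need to be set
  from_mtx <- koRpus::readTagged(tagged_mtx, lang = "en")

  # Custom column names are only honored with tagger = "manual"
  from_df <- koRpus::readTagged(
    tagged_df,
    lang = "en",
    tagger = "manual",
    mtx_cols = custom_cols
  )
}
```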
