decode: Decode corpus or subcorpus.

Description

Decode a corpus or subcorpus and return an object of the class specified by the argument to.

Usage

decode(.Object, ...)
# S4 method for corpus
decode(
  .Object,
  to = c("data.table", "Annotation"),
  p_attributes = NULL,
  s_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)
# S4 method for character
decode(
  .Object,
  to = c("data.table", "Annotation"),
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)
# S4 method for slice
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)
# S4 method for partition
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)
# S4 method for subcorpus
decode(
  .Object,
  to = "data.table",
  s_attributes = NULL,
  p_attributes = NULL,
  decode = TRUE,
  verbose = TRUE
)
# S4 method for integer
decode(.Object, corpus, p_attributes, boost = NULL)
# S4 method for data.table
decode(.Object, corpus, p_attributes)

Value

The return value will correspond to the class specified by argument to.

Arguments

.Object: The corpus or subcorpus to decode.
...: Further arguments.
to: The class of the returned object, stated as a length-one character vector.
p_attributes: The positional attributes to decode. If NULL (default), all positional attributes will be decoded.
s_attributes: The structural attributes to decode. If NULL (default), all structural attributes will be decoded.
decode: A logical value, whether to decode token ids and struc ids to character strings. If FALSE, the values of columns for p- and s-attributes will be integer vectors. If TRUE (default), the respective columns are character vectors.
verbose: A logical value, whether to output progress messages.
corpus: A CWB indexed corpus, either a length-one character vector, or a corpus object.
boost: A length-one logical value, whether to speed up decoding a long vector of token ids by directly reading in the lexicon file from the data directory of a corpus. If NULL (default), the internal decision rule is that boost will be TRUE if the corpus is larger than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded.

Details

The primary purpose of the method is type conversion. By obtaining the corpus or subcorpus in the format specified by the argument to, the data can be processed with tools that do not rely on the Corpus Workbench (CWB). Supported output formats are data.table (which can be converted to a data.frame or tibble easily) or an Annotation object as defined in the package NLP. Another purpose of decoding the corpus can be to rework it, and to re-import it into the CWB (e.g. using the cwbtools-package).

An earlier version of the method included an option to decode a single s-attribute, which is no longer supported. Use the s_attribute_decode function of the package RcppCWB instead.

If .Object is an integer vector, it is assumed to be a vector of integer ids of a p-attribute. The decode-method will translate token ids to string values as efficiently as possible. The approach taken depends on the corpus size and the share of the corpus that is to be decoded. To decode a large number of integer ids, it is more efficient to read the lexicon file from the data directory directly and to index the lexicon with the ids than to rely on RcppCWB::cl_id2str. The internal decision rule is to use the lexicon file when the corpus is larger than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded. The encoding of the character vector that is returned will be the encoding of the locale (usually ISO-8859-1 on Windows, and UTF-8 on macOS and Linux machines).
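The decision rule and the lexicon lookup described above can be sketched in base R. This is an illustrative sketch, not polmineR's internal code: the helpers use_lexicon_file and decode_ids are hypothetical, the toy lexicon stands in for the lexicon file, and reading the size threshold as 10 million tokens is an assumption. The sketch also assumes that token ids are zero-based, as the ids 0:20 in the examples suggest, so an offset of 1 is needed to index an R vector.

```r
# Hypothetical helper mirroring the documented rule (assumed threshold:
# 10 million tokens; assumed share: more than 5 percent of the corpus).
use_lexicon_file <- function(n_ids, corpus_size,
                             min_size = 1e7, min_share = 0.05) {
  corpus_size > min_size && (n_ids / corpus_size) > min_share
}

# Toy lexicon standing in for the lexicon file of a corpus.
# Position i holds the string for token id i - 1 (ids assumed zero-based).
lexicon <- c("Liebe", "Kolleginnen", "und", "Kollegen")

# Hypothetical helper: decode ids by indexing the lexicon directly.
decode_ids <- function(ids, lexicon) lexicon[ids + 1L]

decoded <- decode_ids(c(2L, 0L, 3L), lexicon)

# Large corpus, 10 percent of it to be decoded: the rule applies.
big <- use_lexicon_file(n_ids = 2e6, corpus_size = 2e7)

# Small corpus: the rule does not apply, whatever the share.
small <- use_lexicon_file(n_ids = 100, corpus_size = 1e5)
```

Indexing one in-memory vector replaces a separate lookup per id, which is why this route pays off only once a sizeable share of a large corpus is decoded.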

The decode-method for data.table objects will decode token ids (column 'p-attribute_id'), adding the corresponding string as a new column. If a column "cpos" with corpus positions is present, the token ids are first derived from these corpus positions. If the data.table has neither a column "cpos" nor columns with token ids (i.e. column names ending with "_id"), the input data.table is returned unchanged. Note that columns are added to the data.table in an in-place operation to use memory sparingly.
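The in-place behaviour matters in practice: a data.table that is passed in gains new columns without being reassigned. The following sketch shows the general data.table mechanism, not the decode-method itself. It assumes the data.table package is installed; the helper add_strings, the toy lexicon and the column name word_id are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical toy lexicon; position i holds the string for token id i - 1.
lexicon <- c("Liebe", "Kolleginnen", "und")

dt <- data.table(cpos = 0:3, word_id = c(0L, 1L, 2L, 0L))

# Hypothetical helper: adds the string column with `:=`, i.e. by reference.
add_strings <- function(x, lexicon) {
  x[, word := lexicon[word_id + 1L]]
  invisible(x)
}

# Take a real copy first if the original table must stay untouched.
dt_before <- copy(dt)

add_strings(dt, lexicon)

names(dt)         # includes "word", although dt was never reassigned
names(dt_before)  # the copy is unaffected
```

Calling data.table::copy() before decoding is the usual way to keep an unmodified version of the input.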

Examples

use("polmineR")
use(pkg = "RcppCWB", corpus = "REUTERS")

# Decode corpus as data.table
dt <- decode("GERMAPARLMINI", to = "data.table")

# Decode corpus selectively
dt <- decode("GERMAPARLMINI", to = "data.table", p_attributes = "word", s_attributes = "party")

# Decode a subcorpus
dt <- corpus("GERMAPARLMINI") %>%
  subset(speaker == "Angela Dorothea Merkel") %>%
  decode(s_attributes = c("speaker", "party", "date"), to = "data.table")

# Decode subcorpus selectively
corpus("GERMAPARLMINI") %>%
  subset(speaker == "Angela Dorothea Merkel") %>%
  decode(to = "data.table", p_attributes = "word", s_attributes = "party")

# Decode partition
P <- partition("REUTERS", places = "kuwait", regex = TRUE)
dt <- decode(P)

# Previous versions of polmineR offered an option to decode a single
# s-attribute. This is how you could proceed to get a table with metadata.
dt <- decode(P, s_attributes = "id", decode = FALSE)
dt[, "word" := NULL]
dt[,{list(cpos_left = min(.SD[["cpos"]]), cpos_right = max(.SD[["cpos"]]))}, by = "id"]

# Decode subcorpus as Annotation object
if (FALSE) {
if (requireNamespace("NLP")){
  library(NLP)
  p <- corpus("GERMAPARLMINI") %>%
    subset(date == "2009-11-10" & speaker == "Angela Dorothea Merkel")
  s <- as(p, "String")
  a <- as(p, "Annotation")
  
  # With this NLP Annotation object, you can use the different annotators
  # of the openNLP package. The following short scenario shows how to
  # inspect the tokenized words and the sentences.

  words <- s[a[a$type == "word"]]
  sentences <- s[a[a$type == "sentence"]] # does not yet work perfectly for plenary protocols 
  
  doc <- as(p, "AnnotatedPlainTextDocument")
}
}
 
# decode vector of token ids
y <- decode(0:20, corpus = "GERMAPARLMINI", p_attributes = "word")
dt <- data.table::data.table(cpos = cpos("GERMAPARLMINI", query = "Liebe")[,1])
decode(dt, corpus = "GERMAPARLMINI", p_attributes = c("word", "pos"))
y <- dt[, .N, by = c("word", "pos")]