textstat-class: S4 textstat superclass.

Description

The textstat S4 class is the superclass for the classes features, context, and partition. Usually, these subclasses, which are designed to serve a specific analytical purpose, will be used rather than the textstat class itself. Common standard generic methods such as head, tail, dim, nrow, and colnames are defined for the textstat class and are available for subclasses by inheritance. The core of textstat and its child classes is a data.table in the slot stat that keeps data on text statistics of a corpus or a partition. The textstat class inherits from the corpus class, keeping information on the corpus available.

Usage

# S4 method for textstat
name(x)
# S4 method for character
name(x)
# S4 method for textstat
name(x) <- value
# S4 method for textstat
round(x, digits = 2L)
# S4 method for textstat
sort(x, by, decreasing = TRUE)
as.bundle(object, ...)
# S4 method for textstat,textstat
+(e1, e2)
# S4 method for textstat
subset(x, subset)
# S3 method for textstat
as.data.table(x, ...)
# S4 method for textstat
show(object)
# S4 method for textstat
p_attributes(.Object)
# S4 method for textstat
knit_print(x, options = knitr::opts_chunk, ...)
# S4 method for textstat
get_corpus(x)
# S4 method for textstat
format(x, digits = 2L)
restore(file)
cp(x)
# S4 method for textstat
view(.Object)

Arguments

x: An object (textstat or class inheriting from textstat).
value: A character vector to assign as name to slot name of a textstat class object.
digits: Number of digits.
by: Column that will serve as the key for sorting.
decreasing: Logical, whether to return decreasing order.
object: A textstat object.
...: Argument that will be passed into a call of the format method on the object x.
e1: A textstat object.
e2: Another textstat object.
subset: A logical expression indicating elements or rows to keep.
.Object: A textstat object.
options: Chunk options.
file: An rds file to restore (filename).

Slots

p_attribute: Object of class character, p-attribute of the query.
corpus: A corpus specified by a length-one character vector.
stat: A data.table with statistical information.
name: The name of the object.
annotation_cols: A character vector, column names of data.table in slot stat that are annotations.
encoding: A length-one character vector, the encoding of the corpus.

Details

A head-method will return the first rows of the data.table in the stat-slot. Use argument n to specify the number of rows.

A tail-method will return the last rows of the data.table in the stat-slot. Use argument n to specify the number of rows.

The methods dim, nrow and ncol will return information on the dimensions, the number of rows, or the number of columns of the data.table in the stat-slot, respectively.
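
The inspection methods described above can be tried on any object inheriting from textstat. The following is a minimal sketch, assuming the REUTERS sample corpus shipped with RcppCWB is available (as in the Examples section); the object name x is chosen for illustration only.

```r
library(polmineR)
use(pkg = "RcppCWB", corpus = "REUTERS")  # sample corpus, as in the Examples

# A count object inherits from textstat
x <- count("REUTERS", p_attribute = "word")

head(x, n = 3)  # first three rows of the data.table in slot 'stat'
tail(x, n = 3)  # last three rows

dim(x)   # number of rows and columns
nrow(x)  # number of rows only
ncol(x)  # number of columns only
```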

Objects derived from the textstat class can be indexed with single square brackets ("[") to get rows specified by a numeric/integer vector, and with double square brackets ("[[") to get specific columns from the data.table in the slot stat.

The colnames-method will return the column names of the data.table in the slot stat.

The methods as.data.table and as.data.frame will extract the data.table in the slot stat as a data.table or a data.frame, respectively.
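
As a short illustration of the two extraction methods, the sketch below pulls the table out of a count object in both forms. It assumes the REUTERS sample corpus is available; data.table is attached explicitly so that the as.data.table generic is certain to be visible.

```r
library(polmineR)
library(data.table)
use(pkg = "RcppCWB", corpus = "REUTERS")

x <- count("REUTERS", p_attribute = "word")

dt <- as.data.table(x)  # the content of slot 'stat' as a data.table
df <- as.data.frame(x)  # the same content as a plain data.frame

class(dt)
class(df)
```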

textstat objects can have a name, which can be retrieved and set using the name-method and name<-, respectively.

The round()-method looks up all numeric columns in the data.table in the stat-slot of the textstat object and rounds values of these columns to the number of decimal places specified by argument digits.

The knit_print method will be called by knitr to render textstat objects, or objects inheriting from the textstat class, as a DataTable htmlwidget when rendering an R Markdown document as HTML. It will usually be necessary to explicitly state "render = knit_print" in the chunk options. The option polmineR.pagelength controls the number of lines displayed in the resulting htmlwidget. Note that including htmlwidgets in HTML documents requires that pandoc is installed. To avoid an error, a formatted data.table is returned by knit_print if pandoc is not available.

The format()-method returns a pretty-printed and minimized version of the data.table in the stat-slot of the textstat object: It rounds all numeric columns to the number of decimal places specified by digits, and drops all columns with token ids. The return value is a data.table.
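
The round() and format() methods described above might be used as follows. This is a sketch only, assuming the GERMAPARLMINI sample corpus is available and reusing the partition and query from the Examples section; the names fmt and y_rounded are illustrative.

```r
library(polmineR)
use(pkg = "polmineR", corpus = "GERMAPARLMINI")

P <- partition("GERMAPARLMINI", date = ".*", p_attribute = "word", regex = TRUE)
y <- cooccurrences(P, query = "Arbeit")

# format() returns a reduced, rounded table meant for display
fmt <- format(y, digits = 1L)
class(fmt)

# round() rounds the numeric columns of the table in slot 'stat'
y_rounded <- round(y, digits = 1L)
head(y_rounded)
```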

Using the reference semantics of data.table objects (i.e. inplace modification) has great advantages for memory efficiency. But there may be unexpected behavior when reloading an S4 textstat object (including classes inheriting from textstat) with a data.table in the stat slot. Use restore to copy the data.table once to have a restored object that works for inplace operations after saving / reloading it.
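
The reference semantics mentioned above can be shown with data.table alone, without any polmineR object. In this minimal sketch (the names dt, alias and deep are illustrative), a plain assignment creates a second name for the same table, whereas copy() creates an independent one.

```r
library(data.table)

dt <- data.table(a = 1:3)
alias <- dt       # no copy: both names refer to the same table
deep <- copy(dt)  # explicit copy: an independent table

dt[, b := a * 2L]  # adds column 'b' in place

"b" %in% colnames(alias)  # TRUE, the alias sees the new column
"b" %in% colnames(deep)   # FALSE, the copy is unaffected
```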

It is not possible to add columns to the data.table in the stat slot of a textstat object when the object has been saved and loaded using save()/load(). This scenario applies, for instance, when the objects of an interactive R session are saved and then loaded when starting the next interactive R session. The cp() function will create a copy of the object, including an explicit copy of the data.table in the stat slot. Inplace modifications of the new object are possible. The function can also be used to avoid unwanted side effects of modifying an object.
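
To illustrate the last point, the following sketch uses cp() to protect an object from side effects. It assumes the REUTERS sample corpus is available; the column name flag is made up for the example. Given that cp() copies the data.table explicitly, the column added to the copy should not appear in the original.

```r
library(polmineR)
library(data.table)
use(pkg = "RcppCWB", corpus = "REUTERS")

x <- count("REUTERS", p_attribute = "word")
x_copy <- cp(x)  # copies the object including the data.table in slot 'stat'

# Modify the copy in place ('flag' is a hypothetical column name)
x_copy@stat[, "flag" := TRUE]

"flag" %in% colnames(x_copy)
"flag" %in% colnames(x)
```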

Examples


use(pkg = "polmineR", corpus = "GERMAPARLMINI")
use(pkg = "RcppCWB", corpus = "REUTERS")

P <- partition("GERMAPARLMINI", date = ".*", p_attribute = "word", regex = TRUE)
y <- cooccurrences(P, query = "Arbeit")

# generics defined in the polmineR package
x <- count("REUTERS", p_attribute = "word")
name(x) <- "count_reuters"
name(x)
get_corpus(x)

# Standard generic methods known from data.frames work for objects inheriting
# from the textstat class

head(y)
tail(y)
nrow(y)
ncol(y)
dim(y)
colnames(y)

# Use brackets for indexing 

if (FALSE) {
y[1:25]
y[,c("word", "ll")]
y[1:25, "word"]
y[1:25][["word"]]
y[which(y[["word"]] %in% c("Arbeit", "Sozial"))]
y[ y[["word"]] %in% c("Arbeit", "Sozial") ]
}

# Subset a textstat object using a logical expression on its columns
sc <- partition("GERMAPARLMINI", speaker = "Angela Dorothea Merkel")
cnt <- count(sc, p_attribute = c("word", "pos"))
cnt_min <- subset(cnt, pos %in% c("NN", "ADJA"))
cnt_min <- subset(cnt, pos == "NE")
use(pkg = "RcppCWB", corpus = "REUTERS")

# Get statistics in textstat object as data.table
count_dt <- corpus("REUTERS") %>%
  subset(grep("saudi-arabia", places)) %>% 
  count(p_attribute = "word") %>%
  as.data.table()

# textstat objects stored as *.rds files should be loaded using restore().
# Before moving on to the examples, a brief technical note on why this is
# recommended: If we load the *.rds file with readRDS(), the data.table in
# the slot 'stat' will have the pointer '0x0', and the data.table cannot be
# augmented without having been copied first.

k <- kwic("REUTERS", query = "oil")
kwicfile <- tempfile(fileext = ".rds")
saveRDS(k, file = kwicfile)
problemprone <- readRDS(file = kwicfile)
problemprone@stat[, "newcol" := TRUE]
"newcol" %in% colnames(problemprone@stat) # is FALSE!

attr(problemprone@stat, ".internal.selfref")
identical(attr(problemprone@stat, ".internal.selfref"), new("externalptr"))

# Restore stored S4 object with copy of data.table in 'stat' slot
k <- kwic("REUTERS", query = "oil")
kwicfile <- tempfile(fileext = ".rds")
saveRDS(k, file = kwicfile)

k2 <- restore(kwicfile)
enrich(k2, s_attribute = "id")
"id" %in% colnames(k2) # is TRUE
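
# Objects stored with save() and reloaded with load() need to be copied
# with cp() before columns can be added
k <- kwic("REUTERS", query = "oil")
rdata_file <- tempfile(fileext = ".RData")
save(k, file = rdata_file)
rm(k)

load(rdata_file)
k <- cp(k) # now it is possible to add columns by reference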
enrich(k, s_attribute = "id")
"id" %in% colnames(k)
