
polmineR (version 0.8.9)

get_token_stream: Get Token Stream.

Description

Auxiliary method to get the full text of a corpus, of subcorpora, etc. Can be used to export corpus data to other tools.

Usage

get_token_stream(.Object, ...)

# S4 method for numeric
get_token_stream(
  .Object,
  corpus,
  registry = NULL,
  p_attribute,
  subset = NULL,
  boost = NULL,
  encoding = NULL,
  collapse = NULL,
  beautify = TRUE,
  cpos = FALSE,
  cutoff = NULL,
  decode = TRUE,
  ...
)

# S4 method for matrix
get_token_stream(.Object, corpus, registry = NULL, split = FALSE, ...)

# S4 method for corpus
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for character
get_token_stream(.Object, left = NULL, right = NULL, ...)

# S4 method for slice
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for partition
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for subcorpus
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)

# S4 method for regions
get_token_stream(
  .Object,
  p_attribute = "word",
  collapse = NULL,
  cpos = FALSE,
  split = FALSE,
  ...
)

# S4 method for partition_bundle
get_token_stream(
  .Object,
  p_attribute = "word",
  vocab = NULL,
  phrases = NULL,
  subset = NULL,
  min_length = NULL,
  collapse = NULL,
  cpos = FALSE,
  decode = TRUE,
  beautify = FALSE,
  verbose = TRUE,
  progress = FALSE,
  mc = FALSE,
  ...
)

Arguments

.Object

Input object.

...

Arguments that will be passed into the get_token_stream method for a numeric vector, the real worker.

corpus

A CWB indexed corpus.

registry

Registry directory with registry file describing the corpus.

p_attribute

A character vector, the p-attribute(s) to decode.

subset

An expression applied to p-attributes, using non-standard evaluation. Note that symbols used in the expression must not clash with names used internally (e.g. 'stopwords').

boost

A length-one logical value, whether to speed up decoding a long vector of token ids by reading in the lexicon file from the data directory of a corpus directly. If NULL (default), the internal decision rule is that boost will be TRUE if the corpus has more than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded (see the sketch following this argument list for setting the value explicitly).

encoding

If not NULL (default), a length-one character vector stating an encoding that will be assigned to the (decoded) token stream.

collapse

If not NULL (default), a length-one character string passed into paste to collapse the character vector into a single string.

beautify

A (length-one) logical value, whether to adjust whitespace before and after punctuation.

cpos

A logical value, whether to return corpus positions as names of the tokens.

cutoff

Maximum number of tokens to be reconstructed.

decode

A (length-one) logical value, whether to decode token ids to character strings. Defaults to TRUE; if FALSE, an integer vector with token ids is returned.

split

A logical value, whether to return a character vector (when split is FALSE, default) or a list of character vectors; each of these vectors will then represent the tokens of a region defined by a row in a regions matrix.

left

Left corpus position.

right

Right corpus position.

vocab

A character vector with an alternative vocabulary to the one stored on disk.

phrases

A phrases object. Defined phrases will be concatenated.

min_length

If not NULL (default), an integer value specifying the minimum length documents must have to be kept in the list object that is returned.

verbose

A length-one logical value, whether to show messages.

progress

A length-one logical value, whether to show a progress bar.

mc

Number of cores to use. If FALSE (default), only one thread will be used.
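
To set 'boost' explicitly rather than relying on the internal decision rule, pass a logical value. A minimal sketch, assuming the REUTERS sample corpus shipped with RcppCWB has been activated (as in the examples below):

use(pkg = "RcppCWB", corpus = "REUTERS")

# Force the fast decoding path irrespective of corpus size; on the small
# REUTERS corpus the speed-up will not be noticeable
get_token_stream(0:100, corpus = "REUTERS", p_attribute = "word", boost = TRUE)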

Details

CWB indexed corpora have a fixed order of tokens which is called the token stream. Every token is assigned a unique corpus position. Subsets of the (entire) token stream defined by a left and a right corpus position are called regions. The get_token_stream method will extract the tokens (for regions) from a corpus.

The primary usage of this method is to return the token stream of a (sub-)corpus as defined by a corpus, subcorpus or partition object. The methods defined for a numeric vector or a (two-column) matrix defining regions (i.e. left and right corpus positions in the first and second column) are the actual workers for this operation.

The get_token_stream method has been introduced to serve as a worker for higher-level methods such as read, html, and as.markdown. It may, however, be useful for decoding a corpus so that it can be exported to other tools.
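
As a minimal sketch of the export scenario (the file name is hypothetical, and the REUTERS corpus is assumed to be activated as in the examples below), the decoded token stream can be written to a plain text file that other tools can read:

words <- get_token_stream(0:200, corpus = "REUTERS", p_attribute = "word")
writeLines(paste(words, collapse = " "), con = "reuters_sample.txt")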

Examples

use(pkg = "RcppCWB", corpus = "REUTERS")

# Decode first words of REUTERS corpus (first sentence)
get_token_stream(0:20, corpus = "REUTERS", p_attribute = "word")

# Decode first sentence and collapse tokens into single string
get_token_stream(0:20, corpus = "REUTERS", p_attribute = "word", collapse = " ")
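
# Return corpus positions as names of the tokens ('cpos' argument; a
# minimal sketch analogous to the call above)
get_token_stream(0:20, corpus = "REUTERS", p_attribute = "word", cpos = TRUE)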

# Decode regions defined by two-column integer matrix
region_matrix <- matrix(c(0L,20L,21L,38L), ncol = 2, byrow = TRUE)
get_token_stream(
  region_matrix,
  corpus = "REUTERS",
  p_attribute = "word",
  encoding = "latin1"
)
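
# A minimal sketch using the documented 'split' argument: return a list of
# character vectors, one per region (row) of the matrix
get_token_stream(
  region_matrix,
  corpus = "REUTERS",
  p_attribute = "word",
  split = TRUE
)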

# Use argument 'beautify' to remove surplus whitespace
if (FALSE) {
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  encoding = "latin1",
  collapse = " ", beautify = TRUE
)
}

# Decode entire corpus (corpus object / specified by corpus ID)
corpus("REUTERS") %>%
  get_token_stream(p_attribute = "word") %>%
  head()

# Decode subcorpus
corpus("REUTERS") %>%
  subset(id == "127") %>%
  get_token_stream(p_attribute = "word") %>%
  head()
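
# With decode = FALSE (passed on to the numeric worker), an integer vector
# of token ids is returned instead of character strings
corpus("REUTERS") %>%
  subset(id == "127") %>%
  get_token_stream(p_attribute = "word", decode = FALSE) %>%
  head()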

# Decode partition_bundle
if (FALSE) {
pb_tokstr <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word")
}
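
if (FALSE) {
# A minimal sketch using the documented 'min_length' argument: drop
# documents shorter than 100 tokens from the list that is returned
pb_long <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word", min_length = 100L)
}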
if (FALSE) {
# Get token stream for partition_bundle
pb <- partition_bundle("REUTERS", s_attribute = "id")
ts_list <- get_token_stream(pb)

# Use two p-attributes
sp <- corpus("GERMAPARLMINI") %>%
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date", progress = FALSE)
p2 <- get_token_stream(sp, p_attribute = c("word", "pos"), verbose = FALSE)

# Apply filter
p_sub <- get_token_stream(
  sp, p_attribute = c("word", "pos"),
  subset = {!grepl("(\\$.$|ART)", pos)}
)

# Concatenate phrases and apply filter
queries <- c('"freiheitliche" "Grundordnung"', '"Bundesrepublik" "Deutschland"')
phr <- corpus("GERMAPARLMINI") %>%
  cpos(query = queries) %>%
  as.phrases(corpus = "GERMAPARLMINI")

kill <- tm::stopwords("de")

ts_phr <- get_token_stream(
  sp,
  p_attribute = c("word", "pos"),
  subset = {!word %in% kill & !grepl("(\\$.$|ART)", pos)},
  phrases = phr,
  progress = FALSE,
  verbose = FALSE
)
}
