get_token_stream: Get Token Stream.

Description

Auxiliary method to get the full text of a corpus, subcorpus etc. Can be used to export corpus data to other tools.

Usage

get_token_stream(.Object, ...)
# S4 method for numeric
get_token_stream(
  .Object,
  corpus,
  p_attribute,
  subset = NULL,
  boost = NULL,
  encoding = NULL,
  collapse = NULL,
  beautify = TRUE,
  cpos = FALSE,
  cutoff = NULL,
  decode = TRUE,
  ...
)
# S4 method for matrix
get_token_stream(.Object, ...)
# S4 method for corpus
get_token_stream(.Object, left = NULL, right = NULL, ...)
# S4 method for character
get_token_stream(.Object, left = NULL, right = NULL, ...)
# S4 method for slice
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)
# S4 method for partition
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)
# S4 method for subcorpus
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)
# S4 method for regions
get_token_stream(
  .Object,
  p_attribute = "word",
  collapse = NULL,
  cpos = FALSE,
  ...
)
# S4 method for partition_bundle
get_token_stream(
  .Object,
  p_attribute = "word",
  phrases = NULL,
  subset = NULL,
  collapse = NULL,
  cpos = FALSE,
  decode = TRUE,
  verbose = TRUE,
  progress = FALSE,
  mc = FALSE,
  ...
)

Arguments

.Object

Input object.

...

Arguments that will be passed into the get_token_stream method for a numeric vector, the real worker.

corpus

A CWB indexed corpus.

p_attribute

A length-one character vector, the p-attribute to decode.

subset

An expression applied to p-attributes, using non-standard evaluation. Note that symbols used in the expression must not clash with names that are used internally (e.g. 'stopwords').

boost

A length-one logical value, whether to speed up decoding a long vector of token ids by reading in the lexicon file directly from the data directory of a corpus. If NULL (default), the internal decision rule is that boost will be TRUE if the corpus is larger than 10 000 000 tokens and more than 5 percent of the corpus is to be decoded.

encoding

A length-one character vector stating the encoding that will be assigned to the (decoded) token stream. Defaults to NULL.

collapse

A length-one character string passed into paste to collapse the character vector into a single string. Defaults to NULL (no collapsing).

beautify

A (length-one) logical value, whether to adjust whitespace before and after punctuation.

cpos

A logical value, whether to return corpus positions as names of the tokens.

cutoff

Maximum number of tokens to be reconstructed.

decode

A (length-one) logical value, whether to decode token ids to character strings. Defaults to TRUE. If FALSE, an integer vector with token ids is returned.

left

Left corpus position.

right

Right corpus position.

phrases

A phrases object. Defined phrases will be concatenated.

verbose

A length-one logical value, whether to show messages.

progress

A length-one logical value, whether to show progress bar.

mc

Number of cores to use. If FALSE (default), only one thread will be used.
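
The decision rule described for boost can be sketched in base R as follows. This is only an illustration of the documented rule, not the package's internal code: the helper should_boost() and its argument names are hypothetical, and the size threshold is read as 10 000 000 tokens.

```r
# Hypothetical helper, not part of the package: mirrors the documented rule
# for boost = NULL, assuming a size threshold of 10 000 000 tokens and a
# share threshold of 5 percent of the corpus.
should_boost <- function(n_to_decode, corpus_size,
                         size_threshold = 1e7, share_threshold = 0.05) {
  corpus_size > size_threshold &&
    (n_to_decode / corpus_size) > share_threshold
}

should_boost(n_to_decode = 1e6, corpus_size = 5e7) # 2 percent of a large corpus
should_boost(n_to_decode = 5e6, corpus_size = 5e7) # 10 percent of a large corpus
should_boost(n_to_decode = 5e5, corpus_size = 1e6) # large share, small corpus
```

Only the second call satisfies both conditions. Passing boost = TRUE or boost = FALSE explicitly bypasses the rule.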

Details

CWB indexed corpora have a fixed order of tokens which is called the token stream. Every token is assigned a unique corpus position. Subsets of the (entire) token stream defined by a left and a right corpus position are called regions. The get_token_stream method will extract the tokens (for regions) from a corpus.
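
To make the notion of regions concrete, the following base R sketch expands a two-column region matrix into the corpus positions it covers. The helper regions_to_cpos() is hypothetical and only illustrates the idea; the matrix holds the same values as the region matrix in the Examples section.

```r
# Two regions, one per row: left corpus position, right corpus position
regions <- matrix(c(0, 9, 10, 25), ncol = 2, byrow = TRUE)

# Hypothetical helper: expand each region into the corpus positions it covers
regions_to_cpos <- function(m) {
  unlist(lapply(seq_len(nrow(m)), function(i) seq.int(m[i, 1], m[i, 2])))
}

cpos_vec <- regions_to_cpos(regions)
length(cpos_vec) # 26 positions: 0 to 9 and 10 to 25
```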

The primary usage of this method is to return the token stream of a (sub-)corpus as defined by a corpus, subcorpus or partition object. The methods defined for a numeric vector or a (two-column) matrix defining regions (i.e. left and right corpus positions in the first and second column) are the actual workers for this operation.
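
What the worker does with the arguments decode, cpos and collapse can be imitated with a toy model in base R. Everything in this sketch (lexicon, token_ids, toy_token_stream()) is a made-up stand-in for illustration and does not access a CWB corpus; it assumes zero-based corpus positions and token ids.

```r
# Toy stand-ins (assumptions for illustration, not real corpus data)
lexicon <- c("Hello", "world", "!") # strings for token ids 0, 1, 2
token_ids <- c(0L, 1L, 2L, 0L)      # ids at corpus positions 0 to 3

# Hypothetical helper imitating the effect of decode, cpos and collapse
toy_token_stream <- function(positions, decode = TRUE, cpos = FALSE,
                             collapse = NULL) {
  ids <- token_ids[positions + 1L] # shift zero-based positions to R indices
  out <- if (decode) lexicon[ids + 1L] else ids
  if (cpos) names(out) <- positions
  if (!is.null(collapse)) out <- paste(out, collapse = collapse)
  out
}

toy_token_stream(0:2)                 # character vector of three tokens
toy_token_stream(0:2, decode = FALSE) # integer ids instead of strings
toy_token_stream(0:2, cpos = TRUE)    # names are the corpus positions
toy_token_stream(0:2, collapse = " ") # one string: "Hello world !"
```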

The get_token_stream method has been introduced to serve as a worker for higher-level methods such as read, html, and as.markdown. It may, however, also be useful for decoding a corpus so that it can be exported to other tools.
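
A minimal sketch of such an export, using base R only. The vector tokens is a made-up stand-in for a decoded token stream; in practice it would be the character vector returned by get_token_stream. File name and layout are assumptions that depend on the target tool.

```r
# Stand-in for a decoded token stream (assumption for illustration)
tokens <- c("Guten", "Morgen", ",", "meine", "Damen", "und", "Herren", "!")

# One token per line, a layout many external tools accept
outfile <- tempfile(fileext = ".txt")
writeLines(tokens, con = outfile)

# Alternatively, write a single running text
textfile <- tempfile(fileext = ".txt")
writeLines(paste(tokens, collapse = " "), con = textfile)

readLines(outfile) # reads the tokens back, one per element
```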

Examples

# NOT RUN {
library(polmineR)
use("polmineR") # make the sample corpora shipped with the package available

# Decode first words of GERMAPARLMINI corpus (first sentence)
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word")

# Decode first sentence and collapse tokens into single string
get_token_stream(0:9, corpus = "GERMAPARLMINI", p_attribute = "word", collapse = " ")

# Decode regions defined by two-column matrix
region_matrix <- matrix(c(0,9,10,25), ncol = 2, byrow = TRUE)
get_token_stream(region_matrix, corpus = "GERMAPARLMINI", p_attribute = "word", encoding = "latin1")

# Use argument 'beautify' to remove surplus whitespace
get_token_stream(
  region_matrix,
  corpus = "GERMAPARLMINI",
  p_attribute = "word",
  encoding = "latin1",
  collapse = " ",
  beautify = TRUE
)

# Decode entire corpus (corpus object / specified by corpus ID)
fulltext <- get_token_stream("GERMAPARLMINI", p_attribute = "word")
corpus("GERMAPARLMINI") %>%
  get_token_stream(p_attribute = "word") %>%
  head()

# Decode subcorpus
corpus("REUTERS") %>%
  subset(id == "127") %>%
  get_token_stream(p_attribute = "word") %>%
  head()

# Decode partition_bundle
pb_tokstr <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  get_token_stream(p_attribute = "word")

# Get token stream for partition_bundle
pb <- partition_bundle("REUTERS", s_attribute = "id")
ts_list <- get_token_stream(pb)

# }
# NOT RUN {
# Workflow to filter decoded subcorpus_bundle
sp <- corpus("GERMAPARLMINI") %>% as.speeches(s_attribute_name = "speaker", progress = FALSE)
queries <- c('"freiheitliche" "Grundordnung"', '"Bundesrepublik" "Deutschland"' )
phr <- corpus("GERMAPARLMINI") %>% cpos(query = queries) %>% as.phrases(corpus = "GERMAPARLMINI")

kill <- tm::stopwords("de")

ts_phr <- get_token_stream(
  sp,
  p_attribute = c("word", "pos"),
  subset = {!word %in% kill  & !grepl("(\\$.$|ART)", pos)},
  phrases = phr,
  progress = FALSE,
  verbose = FALSE
)
# }
