# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  en_corp <- read.corp.custom(
    tokenized.obj,
    caseSens=FALSE
  )
  # look up the frequency entry for the word "winner"
  query(en_corp, var="word", query="winner")
  # show all entries with a frequency of exactly 3 in the corpus
  query(en_corp, "freq", 3)
  # now, which tokens appear more than 40000 times per million?
  query(en_corp, "pmio", 40000, "gt")
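  # a hedged counterpart to the "gt" relation above (assuming "lt" is
  # accepted analogously): tokens appearing fewer than 100 times per million
  query(en_corp, "pmio", 100, "lt")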
  # example for a range request: tokens with a log10 value between 4.2 and 4.7
  # (including these two values)
  query(en_corp, "log10", c(4.2, 4.7))
  # (and excluding them)
  query(en_corp, "log10", c(4.2, 4.7), "gt")
  # example for a list of queries: get words with a frequency between
  # 10000 and 25000 per million and at least four letters
  query(en_corp, query=list(
    list(pmio=c(10000, 25000)),
    list(lttr=4, rel="ge"))
  )
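  # a hedged variation on the list syntax above, reusing only the columns
  # already shown: the same per-million range, but requiring five letters or more
  query(en_corp, query=list(
    list(pmio=c(10000, 25000)),
    list(lttr=5, rel="ge"))
  )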
  # get all instances of "the" in a tokenized text object
  query(tokenized.obj, "token", "the")
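  # a hedged cross-check using base R: assuming the taggedText() accessor
  # returns the token data.frame of the tokenized object, count matches of
  # "the" directly (note that this plain comparison is case sensitive)
  sum(taggedText(tokenized.obj)[["token"]] == "the")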
}