types: Get types and tokens of a given text

Description

These methods return character vectors containing all types or tokens of a given text, where the text can be a character vector itself, a previously tokenized/tagged koRpus object, or an object of class kRp.TTR.

Usage

types(txt, ...)
tokens(txt, ...)
# S4 method for kRp.TTR
types(txt, stats = FALSE)
# S4 method for kRp.TTR
tokens(txt)
# S4 method for kRp.text
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE
)
# S4 method for kRp.text
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c()
)
# S4 method for character
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE,
  lang = NULL
)
# S4 method for character
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  lang = NULL
)

Value

A character vector. For types with stats=TRUE, a data.frame containing all types, their length (in characters), and their frequency. The types result is always sorted by frequency, with more frequent types coming first.
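
To make the distinction between the two return values concrete, here is a minimal base R sketch of the idea. It does not use koRpus and is not how the package computes its results: the sample text, the naive whitespace tokenization, and the column names type, chars and freq are all assumptions made for illustration.

```r
# Hypothetical sample text; not taken from the koRpus package.
txt <- "The cat saw the dog. The dog saw the cat, too!"

# Naive tokenization (an assumption, not koRpus' tokenizer):
# lower-case everything (similar in spirit to case.sens = FALSE),
# strip punctuation, then split on whitespace.
clean <- gsub("[[:punct:]]", "", tolower(txt))
all_tokens <- strsplit(clean, "[[:space:]]+")[[1]]

# Count each distinct token and sort by descending frequency.
freq <- sort(table(all_tokens), decreasing = TRUE)

# A data.frame in the spirit of types(..., stats = TRUE);
# the column names here are hypothetical.
type_stats <- data.frame(
  type = names(freq),
  chars = nchar(names(freq)),
  freq = as.integer(freq),
  stringsAsFactors = FALSE
)

# Types are the distinct word forms, tokens are all running words.
all_types <- type_stats$type
print(type_stats)
```

Tokens are every running word in the text, whereas types are the distinct word forms among them, so a text never has more types than tokens.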

Arguments

txt: An object of either class kRp.text or kRp.TTR, or a character vector.
...: Only used for the method generic.
stats: Logical, whether statistics on the length in characters and frequency of types in the text should also be returned.
case.sens: Logical, whether types should be counted case-sensitively. This option is available for tagged text and character input only.
lemmatize: Logical, whether analysis should be carried out on the lemmatized tokens rather than all running word forms. This option is available for tagged text and character input only.
corp.rm.class: A character vector with word classes which should be dropped. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE) to be used. This option is available for tagged text and character input only.
corp.rm.tag: A character vector with POS tags which should be dropped. This option is available for tagged text and character input only.
lang: Set the language of a text, see the force.lang option of lex.div. This option is available for character input only.
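
The arguments above can be combined as in the following sketch, which applies case.sens and stats to a tokenized object. It assumes that both koRpus and koRpus.lang.en are installed and that the sample file shipped with koRpus is available at the path shown; the code is skipped otherwise.

```r
# Stays NULL if the required packages are not installed.
type_stats <- NULL

if (requireNamespace("koRpus", quietly = TRUE) &&
    requireNamespace("koRpus.lang.en", quietly = TRUE)) {
  library(koRpus)
  library(koRpus.lang.en)

  # Sample text shipped with koRpus (assumed to be present).
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(txt = sample_file, lang = "en")

  # Count types case-sensitively and request the statistics data.frame.
  type_stats <- types(tokenized.obj, case.sens = TRUE, stats = TRUE)
  print(head(type_stats))

  # Number of running tokens under the same case setting.
  print(length(tokens(tokenized.obj, case.sens = TRUE)))
}
```

With case.sens=TRUE, word forms that differ only in capitalization are counted as separate types, so the number of types is at least as large as with the default setting.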

Examples

# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )

  types(tokenized.obj)
  tokens(tokenized.obj)
} else {}