Transforms text in koRpus objects token by token.
Usage

textTransform(txt, ...)

# S4 method for kRp.text
textTransform(
  txt,
  scheme,
  p = 0.5,
  paste = FALSE,
  var = "wclass",
  query = "fullstop",
  method = "replace",
  replacement = ".",
  f = NA,
  ...
)
Arguments

txt: An object of class kRp.text.

...: Parameters passed to query to find matching tokens. Relevant only if scheme="normalize".
scheme: One of the following character strings:

  "minor": Start each word with a lowercase letter.
  "all.minor": Force all letters into lowercase.
  "major": Start each word with an uppercase letter.
  "all.major": Force all letters into uppercase.
  "random": Randomly start words with uppercase or lowercase letters.
  "de.norm": German norm: all names, nouns, and sentence beginnings start with an uppercase letter, everything else with a lowercase letter.
  "de.inv": Inversion of "de.norm".
  "eu.norm": Usual European case: only names and sentence beginnings start with an uppercase letter, everything else with a lowercase letter.
  "eu.inv": Inversion of "eu.norm".
  "normalize": Replace all tokens matching query in column var according to method (see below).
p: Numeric value between 0 and 1, defining the probability of uppercase letters (relevant only if scheme="random").
paste: Logical, see the value section.
var: A character string naming a variable (i.e., column name) in the object. See query for details. Relevant only if scheme="normalize".
query: A character vector (for words), regular expression, or single number naming values to be matched in the variable. See query for details. Relevant only if scheme="normalize".
method: One of the following character strings:

  "shortest": Replace all matches with the shortest value found.
  "longest": Replace all matches with the longest value found.
  "replace": Replace all matches with the token given via replacement.
  "function": Replace all matches with the result of the function provided by f (see the Function section for details).

In the case of "shortest" and "longest", if multiple values of the same length are found, the (first) most prevalent one is used. The actual replacement value is documented in the diff slot of the object, as a list called transfmt.normalize. Relevant only if scheme="normalize".
replacement: Character string defining the exact token to replace all query matches with. Relevant only if scheme="normalize" and method="replace".
f: A function to calculate the replacement for all query matches. Relevant only if scheme="normalize" and method="function".
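To make the "normalize" scheme concrete, here is a minimal sketch that replaces all comma tokens with semicolons. It assumes the koRpus package and an English language package (koRpus.lang.en) are installed; the matching of var="token" against query="," reflects the parameter descriptions above.

```r
# Hedged sketch: assumes koRpus and koRpus.lang.en are installed.
if(require("koRpus.lang.en", quietly = TRUE)){
  tokenized.obj <- tokenize(
    txt="One, two, three.",
    format="obj",
    lang="en"
  )
  # replace all comma tokens with semicolons
  normalized.obj <- textTransform(
    tokenized.obj,
    scheme="normalize",
    var="token",
    query=",",
    method="replace",
    replacement=";"
  )
  pasteText(normalized.obj)
} else {}
```

The effective replacement value is documented afterwards in the diff slot of the returned object, in the list transfmt.normalize.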
Value

By default, an object of class kRp.text with the added feature diff is returned. It provides a list with mostly atomic vectors, describing the amount of differences between both text variants (as percentages):

  all.tokens: Percentage of all tokens, including punctuation, that were altered.
  words: Percentage of altered words only.
  all.chars: Percentage of all characters, including punctuation, that were altered.
  letters: Percentage of altered letters in words only.
  transfmt: Character vector documenting the transformation(s) done to the tokens.
  transfmt.equal: Data frame documenting which token was changed in which transformational step. Only available if more than one transformation was done.
  transfmt.normalize: A list documenting the normalization steps that were applied to the object, one element per transformation. Each entry holds the name of the method, the query parameters, and the effective replacement value.

If paste=TRUE, an atomic character vector is returned instead (via pasteText).
Function

You can dynamically calculate the replacement value for the "normalize" scheme by setting method="function" and providing a function object as f. The function you provide must support the following arguments:

  tokens: The original tokens slot of the txt object (see taggedText).
  match: A logical vector, indicating for each row of tokens whether it is a query match or not.

You can then use these arguments in your function body to calculate the replacement, e.g. tokens[match, "token"] to get all relevant tokens. The return value of the function is used as the replacement for all matched tokens. You probably want to make sure it is a character vector of length one, or of the same length as the number of matches.
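As a sketch of such a custom function, the following hypothetical f upper-cases every matched token. It assumes koRpus and koRpus.lang.en are installed; the function name upcase and the sample sentence are illustrative only.

```r
# Hedged sketch: assumes koRpus and koRpus.lang.en are installed.
if(require("koRpus.lang.en", quietly = TRUE)){
  tokenized.obj <- tokenize(
    txt="the quick brown fox",
    format="obj",
    lang="en"
  )
  # f() receives the full tokens data frame and a logical match vector;
  # it must return one value, or one value per match
  upcase <- function(tokens, match){
    toupper(tokens[match, "token"])
  }
  tokenized.obj <- textTransform(
    tokenized.obj,
    scheme="normalize",
    var="token",
    query="fox",
    method="function",
    f=upcase
  )
  pasteText(tokenized.obj)
} else {}
```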
Details

This method is mainly intended to produce text material for experiments.
Examples

# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  tokenized.obj <- textTransform(
    tokenized.obj,
    scheme="random"
  )
  pasteText(tokenized.obj)

  # diff stats are now part of the object
  hasFeature(tokenized.obj)
  diffText(tokenized.obj)
} else {}