koRpus (version 0.13-8)

hyphen,kRp.text-method: Automatic hyphenation

Description

These methods implement word hyphenation, based on Liang's algorithm. For details, please refer to the documentation for the generic hyphen method in the sylly package.
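The core idea of Liang's pattern matching can be sketched in a few lines of base R. This is a simplified illustration only, not the implementation used by sylly or koRpus: the function names `parse_pattern` and `hyphenate_toy` are made up for this sketch, and `toy_patterns` is a tiny hand-picked subset of TeX-style patterns rather than a real pattern file. Digits in a pattern score the gaps between letters; the highest score per gap wins, and odd scores mark permitted split points.

```r
# Split a pattern such as "hy3ph" into its letters ("hyph") and gap scores.
# scores[i] is the digit standing before the i-th letter; the last entry
# is the digit after the final letter. Missing digits count as 0.
parse_pattern <- function(pat) {
  chars <- strsplit(pat, "")[[1]]
  is_digit <- grepl("[0-9]", chars)
  scores <- integer(sum(!is_digit) + 1)
  pos <- 1
  for (i in seq_along(chars)) {
    if (is_digit[i]) {
      scores[pos] <- as.integer(chars[i])
    } else {
      pos <- pos + 1
    }
  }
  list(key = paste(chars[!is_digit], collapse = ""), scores = scores)
}

# Toy hyphenation: collect the maximum score for every gap, then split
# at odd scores, but never after the first or before the last letter.
hyphenate_toy <- function(word, patterns) {
  word <- tolower(word)
  w <- paste0(".", word, ".")  # dots mark the word boundaries
  n <- nchar(w)
  gaps <- integer(n + 1)       # gaps[i] = score before the i-th character of w
  for (pat in patterns) {
    p <- parse_pattern(pat)
    k <- nchar(p$key)
    if (k > n) next
    for (start in seq_len(n - k + 1)) {
      if (substr(w, start, start + k - 1) == p$key) {
        idx <- start:(start + k)
        gaps[idx] <- pmax(gaps[idx], p$scores)
      }
    }
  }
  letters_vec <- strsplit(word, "")[[1]]
  len <- length(letters_vec)
  # a break after the j-th letter of the word corresponds to gaps[j + 2]
  breaks <- which(gaps %% 2 == 1) - 2
  breaks <- breaks[breaks >= 2 & breaks <= len - 2]
  if (length(breaks) > 0) {
    letters_vec[breaks] <- paste0(letters_vec[breaks], "-")
  }
  paste(letters_vec, collapse = "")
}

# assumption: a small illustrative pattern set, not a complete language file
toy_patterns <- c("hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n")
result <- hyphenate_toy("hyphenation", toy_patterns)
print(result)
```

With real pattern files the same principle applies, only with thousands of patterns per language, which is why results depend on the pattern set selected via hyph.pattern.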

Usage

# S4 method for kRp.text
hyphen(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  quiet = FALSE,
  cache = TRUE,
  as = "kRp.hyphen",
  as.feature = FALSE
)

# S4 method for kRp.text
hyphen_df(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  quiet = FALSE,
  cache = TRUE
)

# S4 method for kRp.text
hyphen_c(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  quiet = FALSE,
  cache = TRUE
)

Arguments

words

Either an object of class kRp.text, or a character vector with words to be hyphenated.

hyph.pattern

Either an object of class kRp.hyph.pat, or a valid character string naming the language of the patterns to be used. See details.

min.length

Integer, the minimum number of letters a word must have to be considered for hyphenation. hyphen will not split words after the first or before the last letter, so values smaller than 4 are not useful.

rm.hyph

Logical, whether hyphens already present in words should be removed before pattern matching.

corp.rm.class

A character vector with word classes which should be ignored. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE) to be used. Relevant only if words is a valid koRpus object.

corp.rm.tag

A character vector with POS tags which should be ignored. Relevant only if words is a valid koRpus object.

quiet

Logical. If FALSE, short status messages will be shown.

cache

Logical. hyphen() can cache results to speed up the process. If this option is set to TRUE, the current cache will be queried and new tokens will also be added to it. Caches are language-specific and reside in an environment, i.e., they are cleared at the end of a session. If you want to save them for later use, see the option hyph.cache.file in set.kRp.env.

as

A character string defining the class of the object to be returned. Defaults to "kRp.hyphen", but can also be set to "data.frame" or "numeric", returning only the central data.frame or the numeric vector of counted syllables, respectively. For the latter two options, you can alternatively use the shortcut methods hyphen_df or hyphen_c. Ignored if as.feature=TRUE.

as.feature

Logical, whether the output should be just the analysis results or the input object with the results added as a feature. Use corpusHyphen to get the results from such an aggregated object. If set to TRUE, as="kRp.hyphen" is set automatically, overriding any other setting of as with a warning.

Value

An object of class kRp.text, kRp.hyphen, or data.frame, or a numeric vector, depending on the values of the as and as.feature arguments.
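The three return types described for the as argument can be compared side by side. The following is a minimal sketch, assuming the koRpus.lang.en package is installed (as in the examples for this method); the word list is arbitrary, and the code only inspects the type of each result rather than relying on specific hyphenation output.

```r
# only run when the English language package can be loaded
if (require("koRpus.lang.en", quietly = TRUE)) {
  library(koRpus)
  words <- c("interference", "hyphenation")

  # default: a full kRp.hyphen object
  full <- hyphen(words, hyph.pattern = "en", quiet = TRUE)
  # shortcut for as="data.frame": only the central data.frame
  df <- hyphen_df(words, hyph.pattern = "en", quiet = TRUE)
  # shortcut for as="numeric": only the syllable counts
  counts <- hyphen_c(words, hyph.pattern = "en", quiet = TRUE)

  print(class(full))
  print(is.data.frame(df))
  print(is.numeric(counts))
}
```

The shortcut methods are convenient when only the syllable counts or the tabular results are needed downstream, for instance as input to other computations.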

References

Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.

[1] http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/

[2] http://www.ctan.org/tex-archive/macros/latex/base/lppl.txt

See Also

read.hyph.pat, manage.hyph.pat

Examples

# NOT RUN {
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # call hyphen() on a given English word
  # "quiet=TRUE" suppresses the progress bar
  hyphen(
    "interference",
    hyph.pattern="en",
    quiet=TRUE
  )

  # call hyphen() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # the language is already defined in the object, so
  # if you call hyphen() without further arguments,
  # you will get its results directly
  hyphen(tokenized.obj)

  # alternatively, you can also store those results as a
  # feature in the object itself
  tokenized.obj <- hyphen(
    tokenized.obj,
    as.feature=TRUE
  )
  # results are now part of the object
  hasFeature(tokenized.obj)
  corpusHyphen(tokenized.obj)
} else {}
# }