hyphen: Automatic hyphenation

Description

These methods implement word hyphenation, based on Liang's algorithm.

Usage

hyphen(words, ...)
# S4 method for character
hyphen(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  quiet = FALSE,
  cache = TRUE,
  as = "kRp.hyphen"
)
hyphen_df(words, ...)
# S4 method for character
hyphen_df(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  quiet = FALSE,
  cache = TRUE
)
hyphen_c(words, ...)
# S4 method for character
hyphen_c(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  quiet = FALSE,
  cache = TRUE
)

Arguments

words

Either a character vector with words/tokens to be hyphenated, or any tagged text object generated with the koRpus package.

...

Only used for the method generic.

hyph.pattern

Either an object of class kRp.hyph.pat, or a valid character string naming the language of the patterns to be used (must already be loaded, see details).

min.length

Integer, the minimum number of letters a word must have to be considered for hyphenation. hyphen will not split words after the first or before the last letter, so values smaller than 4 are not useful.

rm.hyph

Logical, whether hyphens already present in words should be removed before pattern matching.

quiet

Logical. If FALSE, short status messages will be shown.

cache

Logical. hyphen() can cache results to speed up the process. If this option is set to TRUE, the current cache will be queried and new tokens will be added to it. Caches are language-specific and reside in an environment, i.e., they are discarded at the end of a session. If you want to save them for later use, see the option hyph.cache.file in set.sylly.env.

as

A character string defining the class of the object to be returned. Defaults to "kRp.hyphen", but can also be set to "data.frame" or "numeric", returning only the central data.frame or the numeric vector of counted syllables, respectively. For the latter two options, you can alternatively use the shortcut methods hyphen_df or hyphen_c.
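
The caching behaviour described for the cache argument can be tried out with a sketch like the following. It assumes that sylly and sylly.en are installed; the cache file path is a hypothetical location chosen for illustration, and hyph.cache.file is the set.sylly.env option named above.

```r
library(sylly)
library(sylly.en)

# Hypothetical cache location, chosen only for this illustration
cache.file <- file.path(tempdir(), "hyphen_cache_en.RData")
set.sylly.env(hyph.cache.file=cache.file)

words <- c("demonstration", "hyphenation", "demonstration")

# cache=TRUE (the default) queries the cache and adds new tokens to it
cached <- hyphen_c(words, hyph.pattern="en", quiet=TRUE)

# cache=FALSE skips the cache and hyphenates every token from scratch
uncached <- hyphen_c(words, hyph.pattern="en", quiet=TRUE, cache=FALSE)

# Both vectors are expected to hold the same syllable counts
all.equal(cached, uncached)
```

Caching should only affect speed, not results, so comparing both vectors is a simple sanity check.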

Value

An object of class kRp.hyphen, data.frame or a numeric vector, depending on the value of the as argument.
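
As a minimal sketch of the three return types, assuming sylly and sylly.en are installed, the as argument and the shortcut methods can be compared like this:

```r
library(sylly)
library(sylly.en)

words <- c("rather", "stupid", "demonstration")

# Full kRp.hyphen object (the default)
hyph.obj <- hyphen(words, hyph.pattern="en", quiet=TRUE)

# Only the central data.frame
hyph.df <- hyphen(words, hyph.pattern="en", quiet=TRUE, as="data.frame")

# Only the numeric vector of counted syllables
hyph.num <- hyphen(words, hyph.pattern="en", quiet=TRUE, as="numeric")

# Shortcut methods, documented as alternatives to the two calls above
hyph.df2 <- hyphen_df(words, hyph.pattern="en", quiet=TRUE)
hyph.num2 <- hyphen_c(words, hyph.pattern="en", quiet=TRUE)

class(hyph.obj)
is.data.frame(hyph.df)
is.numeric(hyph.num)
```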

Details

For this to work, the function must be told which pattern set it should use to find the right hyphenation spots. The most straightforward way to add support for a particular language during a session is to load an appropriate language package (e.g., the package sylly.en for English or sylly.de for German). See available.sylly.lang and install.sylly.lang for more information on how to get language support packages.

Once such a package has been loaded, you can simply use the language abbreviation as the value for the hyph.pattern argument (like "en" for the English pattern set). If words is an object that was tokenized and tagged with the koRpus package, its language definition can be used instead. In that case you don't need to specify hyph.pattern, because hyphen will pick the language automatically.

If you would rather use your own pattern set, hyph.pattern can alternatively be an object of class kRp.hyph.pat.
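
The workflow for getting language support might look like the sketch below. It assumes network access to the default language repository and that install.sylly.lang accepts a language identifier such as "en"; check both help pages for the exact arguments before relying on this.

```r
library(sylly)

# List the language support packages offered by the repository
# (requires network access)
available.sylly.lang()

# Install English support only if it is missing;
# passing "en" as the identifier is an assumption of this sketch
if (!requireNamespace("sylly.en", quietly=TRUE)) {
  install.sylly.lang("en")
}

# Loading the package makes the "en" pattern set available
library(sylly.en)
res <- hyphen(c("hyphenation", "algorithm"), hyph.pattern="en")
res
```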

References

Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.

Examples

# NOT RUN {
library(sylly.en)
sampleText <- c("This", "is", "a", "rather", "stupid", "demonstration")
hyphen(sampleText, hyph.pattern="en")
hyphen_df(sampleText, hyph.pattern="en")
hyphen_c(sampleText, hyph.pattern="en")

# using a koRpus object; tagged.text is assumed to be an existing
# tagged text object created with the koRpus package
hyphen(tagged.text)
# }