koRpus (version 0.10-2)

hyphen: Automatic hyphenation

Description

These methods implement word hyphenation, based on Liang's algorithm.

Usage

hyphen(words, ...)

# S4 method for kRp.taggedText
hyphen(words, hyph.pattern = NULL, min.length = 4, rm.hyph = TRUE,
  corp.rm.class = "nonpunct", corp.rm.tag = c(), quiet = FALSE, cache = TRUE)

# S4 method for character
hyphen(words, hyph.pattern = NULL, min.length = 4, rm.hyph = TRUE,
  quiet = FALSE, cache = TRUE)

Arguments

words

Either an object of class kRp.tagged-class, kRp.txt.freq-class or kRp.analysis-class, or a character vector with words to be hyphenated.

...

Only used for the method generic.

hyph.pattern

Either an object of class kRp.hyph.pat-class, or a valid character string naming the language of the patterns to be used. See details.

min.length

Integer, the minimum number of letters a word must have to be considered for hyphenation. hyphen will not split words after the first or before the last letter, so values smaller than 4 are not useful.

rm.hyph

Logical, whether hyphens already present in words should be removed before pattern matching.

corp.rm.class

A character vector with word classes which should be ignored. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, c("punct","sentc"), list.classes=TRUE) to be used. Relevant only if words is a valid koRpus object.

corp.rm.tag

A character vector with POS tags which should be ignored. Relevant only if words is a valid koRpus object.

quiet

Logical. If FALSE, short status messages will be shown.

cache

Logical. hyphen() can cache results to speed up the process. If this option is set to TRUE, the current cache will be queried and new tokens will also be added to it. Caches are language-specific and reside in an environment, i.e., they are discarded at the end of a session. If you want to save them for later use, see the option hyph.cache.file in set.kRp.env.

Value

An object of class kRp.hyphen-class

Details

For this to work, the function must be told which pattern set to use to find the right hyphenation spots. If words is already a tagged object, its language definition can be used. Otherwise, in addition to the words to be processed, you must specify hyph.pattern. You have two options. If you want to use one of the built-in language patterns, set hyph.pattern to the corresponding language abbreviation. As of this version, valid choices are:

  • "de" --- German (new spelling, since 1996)

  • "de.old" --- German (old spelling, 1901--1996)

  • "en" --- English (UK)

  • "en.us" --- English (US)

  • "es" --- Spanish

  • "fr" --- French

  • "it" --- Italian

  • "ru" --- Russian

If you would rather use your own pattern set, hyph.pattern can instead be an object of class kRp.hyph.pat.

The built-in hyphenation patterns were derived from the patterns available on CTAN[1] under the terms of the LaTeX Project Public License[2]. See hyph.XX for detailed information.
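
To illustrate how a pattern set marks hyphenation spots, the following is a minimal base R sketch of Liang-style pattern matching. It is not koRpus's implementation: the helper names parse_pattern and hyphenate_toy are hypothetical, and toy_patterns is a tiny hand-picked set (the textbook patterns for the word "hyphenation"), not one of the built-in pattern sets. Digits in a pattern score the gaps between letters, the highest score per gap wins, and odd scores allow a split.

```r
# Toy sketch of Liang-style hyphenation; NOT the koRpus implementation.
# All names below are hypothetical and exist only for illustration.

# Split a pattern such as "hy3ph" into its letters ("hyph") and one score
# per gap (in front of each letter, plus one after the last letter).
parse_pattern <- function(pat) {
  chars <- strsplit(pat, "")[[1]]
  key <- character(0)
  scores <- 0L
  for (ch in chars) {
    if (grepl("[0-9]", ch)) {
      scores[length(scores)] <- as.integer(ch)  # digit scores the current gap
    } else {
      key <- c(key, ch)
      scores <- c(scores, 0L)                   # open the gap after this letter
    }
  }
  list(key = paste(key, collapse = ""), scores = scores)
}

hyphenate_toy <- function(word, patterns, min_length = 4) {
  chars <- strsplit(tolower(word), "")[[1]]
  m <- length(chars)
  if (m < min_length || m < 4) return(word)     # too short to split

  dotted <- paste0(".", tolower(word), ".")     # dots mark the word boundaries
  n <- nchar(dotted)
  points <- integer(n + 1)                      # points[i]: gap in front of dotted[i]

  for (p in lapply(patterns, parse_pattern)) {
    klen <- nchar(p$key)
    if (klen > n) next
    for (start in seq_len(n - klen + 1)) {
      if (substr(dotted, start, start + klen - 1) == p$key) {
        idx <- start:(start + klen)
        points[idx] <- pmax(points[idx], p$scores)  # keep the highest score per gap
      }
    }
  }

  gaps <- points[3:(m + 1)]                     # gap after word letter 1 .. m - 1
  allowed <- gaps %% 2 == 1                     # odd score: hyphenation allowed
  allowed[c(1, m - 1)] <- FALSE                 # never split after first / before last letter
  paste0(chars, c(ifelse(allowed, "-", ""), ""), collapse = "")
}

toy_patterns <- c("hy3ph", "he2n", "hena4", "hen5at", "1na",
                  "n2at", "1tio", "2io", "o2n")
hyphenate_toy("hyphenation", toy_patterns)
# [1] "hy-phen-ation"
```

A real pattern set contains thousands of such patterns per language, which is why hyph.pattern has to name the language or supply a complete kRp.hyph.pat object.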

References

Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.

[1] http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/

[2] http://www.ctan.org/tex-archive/macros/latex/base/lppl.txt

See Also

read.hyph.pat, manage.hyph.pat

Examples

# NOT RUN {
hyphen(tagged.text)
# }
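
The character method can be sketched in the same way. This example assumes that koRpus 0.10-2 is installed with its built-in "en" patterns; the vector sample.words is made up for illustration, and the call is skipped if the package is not available. Only arguments listed in the Usage section are used.

```r
# Hypothetical input; any character vector of words will do
sample.words <- c("hyphenation", "automatic", "pattern")

# Assumption: koRpus 0.10-2 with the built-in "en" pattern set is installed
if (requireNamespace("koRpus", quietly = TRUE)) {
  library(koRpus)
  # hyph.pattern must be given here because a plain character vector
  # carries no language definition
  hyph.en <- hyphen(sample.words, hyph.pattern = "en",
                    min.length = 4, quiet = TRUE)
  hyph.en  # an object of class kRp.hyphen
}
```

Because a character vector has no language attached, leaving hyph.pattern at its default of NULL is only an option for tagged objects.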