These methods implement word hyphenation, based on Liang's algorithm.
hyphen(words, ...)
# S4 method for kRp.taggedText
hyphen(words, hyph.pattern = NULL,
min.length = 4, rm.hyph = TRUE, corp.rm.class = "nonpunct",
corp.rm.tag = c(), quiet = FALSE, cache = TRUE)
# S4 method for character
hyphen(words, hyph.pattern = NULL, min.length = 4,
rm.hyph = TRUE, quiet = FALSE, cache = TRUE)
Either an object of class kRp.tagged-class,
kRp.txt.freq-class or
kRp.analysis-class,
or a character vector with words to be hyphenated.
Only used for the method generic.
Either an object of class kRp.hyph.pat-class, or
a valid character string naming the language of the patterns to be used. See details.
Integer, the minimum number of letters a word must have to be considered
for hyphenation. hyphen will not split words after the first or before the last letter,
so values smaller than 4 are not useful.
Logical, whether hyphens already present in words should be removed before pattern matching.
A character vector with word classes which should be ignored. The default value
"nonpunct" has a special meaning and causes the result of
kRp.POS.tags(lang, c("punct","sentc"), list.classes=TRUE) to be used.
Relevant only if words is a valid koRpus object.
A character vector with POS tags which should be ignored. Relevant only if words
is a valid koRpus object.
Logical. If FALSE, short status messages will be shown.
Logical. hyphen() can cache results to speed up the process. If this option is set to TRUE,
the current cache will be queried and new tokens will be added to it. Caches are
language-specific and reside in an environment, i.e., they are cleared at the end of a session.
If you want to save them for later use, see the option hyph.cache.file in set.kRp.env.
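As a rough illustration of the caching note above, the following sketch sets the hyph.cache.file option through set.kRp.env. The file location is only a placeholder, and the exact behaviour (when the file is written and read) is an assumption that should be checked against the set.kRp.env documentation of the installed version.

```r
library(koRpus)

# Placeholder path (an assumption); any writable file location could be used.
cache.file <- file.path(tempdir(), "hyph_cache.RData")

# Ask koRpus to use a file-backed hyphenation cache, so that cached
# results are not lost at the end of the session.
set.kRp.env(hyph.cache.file = cache.file)
```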
An object of class kRp.hyphen-class
For this to work, the function must be told which pattern set it should use to
find the right hyphenation spots. If words is already a tagged object,
its language definition might be used. Otherwise, in addition to the words to
be processed, you must specify hyph.pattern. You have two options: If you
want to use one of the built-in language patterns, just set it to the corresponding
language abbreviation. As of this version, valid choices are:
"de" --- German (new spelling, since 1996)
"de.old" --- German (old spelling, 1901--1996)
"en" --- English (UK)
"en.us" --- English (US)
"es" --- Spanish
"fr" --- French
"it" --- Italian
"ru" --- Russian
If you would rather use your own pattern set, hyph.pattern can alternatively be an
object of class kRp.hyph.pat.
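A minimal sketch of the first option, calling the character method with a built-in pattern set selected by its abbreviation. The word list and object names are made up for illustration, and the call assumes a koRpus version that matches the signature documented here and ships the built-in patterns.

```r
library(koRpus)

# Hypothetical input; any character vector of words should work.
sample.words <- c("hyphenation", "computer", "algorithm")

# Use the built-in English (UK) patterns, suppressing status messages.
hyph.en <- hyphen(sample.words, hyph.pattern = "en", quiet = TRUE)

# The same words with the English (US) pattern set, for comparison.
hyph.en.us <- hyphen(sample.words, hyph.pattern = "en.us", quiet = TRUE)
```

Both calls should return an object of class kRp.hyphen, as described in the value section.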
The built-in hyphenation patterns were derived from the patterns available on CTAN[1]
under the terms of the LaTeX Project Public License[2],
see hyph.XX
for detailed information.
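To give an idea of how such patterns are applied, here is a small base R sketch of the pattern-matching step in Liang's algorithm. It is not the koRpus implementation: the function names toy.hyphenate and parse.pattern are invented for this example, and the nine patterns are the ones from the worked "hyphenation" example in Liang (1983), whereas real pattern sets contain thousands of entries. Digits in a pattern score the gap at that position, the highest score per gap wins, and odd scores mark permitted breaks.

```r
# Toy pattern set from the worked example in Liang (1983).
toy.patterns <- c("hy3ph", "he2n", "hena4", "hen5at",
                  "1na", "n2at", "1tio", "2io", "o2n")

# Split a pattern such as "hen5at" into its letters ("henat") and one
# score for every gap before, between and after those letters.
parse.pattern <- function(pattern) {
  chars <- strsplit(pattern, "")[[1]]
  is.digit <- grepl("[0-9]", chars)
  letters.only <- chars[!is.digit]
  scores <- integer(length(letters.only) + 1)
  gap <- 1
  for (i in seq_along(chars)) {
    if (is.digit[i]) {
      scores[gap] <- as.integer(chars[i])
    } else {
      gap <- gap + 1
    }
  }
  list(letters = paste(letters.only, collapse = ""), scores = scores)
}

# Hypothetical helper: hyphenate a single word with the toy patterns.
toy.hyphenate <- function(word, patterns = toy.patterns, min.length = 4) {
  n <- nchar(word)
  # Words shorter than min.length (or 4) are returned unchanged.
  if (n < min.length || n < 4) return(word)
  # Liang marks the word boundaries with dots.
  dotted <- paste0(".", tolower(word), ".")
  # One score per gap around the characters of the dotted word.
  points <- integer(nchar(dotted) + 1)
  for (p in lapply(patterns, parse.pattern)) {
    len <- nchar(p$letters)
    for (s in seq_len(max(nchar(dotted) - len + 1, 0))) {
      if (substr(dotted, s, s + len - 1) == p$letters) {
        idx <- s:(s + len)
        # Keep the highest score seen so far for every gap.
        points[idx] <- pmax(points[idx], p$scores)
      }
    }
  }
  chars <- strsplit(word, "")[[1]]
  # A break after letter i is the gap before dotted character i + 2.
  # Never split after the first or before the last letter.
  candidates <- 2:(n - 2)
  breaks <- candidates[points[candidates + 2] %% 2 == 1]
  chars[breaks] <- paste0(chars[breaks], "-")
  paste(chars, collapse = "")
}

toy.hyphenate("hyphenation")  # "hy-phen-ation"
```

With only nine patterns the sketch is useful for little more than this one word; the point is merely to show why a pattern set has to be chosen before any hyphenation can happen.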
Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.
[1] http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/
[2] http://www.ctan.org/tex-archive/macros/latex/base/lppl.txt
# NOT RUN {
hyphen(tagged.text)
# }