These methods implement word hyphenation, based on Liang's algorithm.
hyphen(words, ...)

# S4 method for kRp.taggedText
hyphen(words, hyph.pattern = NULL, min.length = 4, rm.hyph = TRUE,
  corp.rm.class = "nonpunct", corp.rm.tag = c(), quiet = FALSE, cache = TRUE)

# S4 method for character
hyphen(words, hyph.pattern = NULL, min.length = 4, rm.hyph = TRUE,
  quiet = FALSE, cache = TRUE)
words: Either an object of class kRp.tagged-class, kRp.txt.freq-class or kRp.analysis-class, or a character vector with words to be hyphenated.
...: Only used for the method generic.
hyph.pattern: Either an object of class kRp.hyph.pat-class, or a valid character string naming the language of the patterns to be used. See details.
min.length: Integer, the minimum number of letters a word must have to be considered for hyphenation. hyphen will not split words after the first or before the last letter, so values smaller than 4 are not useful.
rm.hyph: Logical, whether hyphens already present in words should be removed before pattern matching.
corp.rm.class: A character vector with word classes which should be ignored. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, c("punct","sentc"), list.classes=TRUE) to be used. Relevant only if words is a valid koRpus object.
corp.rm.tag: A character vector with POS tags which should be ignored. Relevant only if words is a valid koRpus object.
quiet: Logical. If FALSE, short status messages will be shown.
cache: Logical. hyphen() can cache results to speed up the process. If this option is set to TRUE, the current cache will be queried and new tokens will be added to it. Caches are language-specific and reside in an environment, i.e., they are cleared at the end of a session. If you want to save them for later use, see the option hyph.cache.file in set.kRp.env.
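As a rough illustration of the caching note above, the following sketch points the hyph.cache.file option at a file before calling hyphen(). It assumes an installed koRpus version matching this page; the file location and the sample words are made up for the example, and the on-disk format of the cache is not specified here.

```r
# Hypothetical cache location and sample input, chosen only for this example
cache.file <- file.path(tempdir(), "hyph_cache.RData")
sample.words <- c("hyphenation", "computer")

if (requireNamespace("koRpus", quietly = TRUE)) {
  library(koRpus)
  # Option named in the description of 'cache' above
  set.kRp.env(hyph.cache.file = cache.file)
  # With cache = TRUE the cache is queried and new tokens are added to it
  hyphen(sample.words, hyph.pattern = "en", cache = TRUE, quiet = TRUE)
}
```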
An object of class kRp.hyphen-class
For this to work, the function must be told which pattern set to use to find the right hyphenation spots. If words is already a tagged object, its language definition might be used. Otherwise, in addition to the words to be processed, you must specify hyph.pattern. You have two options. If you want to use one of the built-in language patterns, just set it to the corresponding language abbreviation. As of this version, valid choices are:
"de": German (new spelling, since 1996)
"de.old": German (old spelling, 1901-1996)
"en": English (UK)
"en.us": English (US)
"es": Spanish
"fr": French
"it": Italian
"ru": Russian
If you would rather use your own pattern set, hyph.pattern can instead be an object of class kRp.hyph.pat.
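To make the role of a pattern set more concrete, here is a small self-contained sketch of Liang-style pattern matching in base R. It is not the koRpus implementation: the helper names (toy_patterns, parse_pattern, toy_hyphenate) are hypothetical, and the tiny pattern list is modeled on the commonly cited worked example for the word "hyphenation" rather than on a real language file. Digits in a pattern score the gaps between letters, the highest score per gap wins, and odd scores mark permitted break points.

```r
# Toy pattern set for illustration only; NOT a real language pattern file
toy_patterns <- c("hy3ph", "he2n", "hena4", "hen5at",
                  "1na", "n2at", "1tio", "2io", "o2n")

# Split a pattern such as "hy3ph" into its letters and its gap values
parse_pattern <- function(pat) {
  chars <- strsplit(pat, "")[[1]]
  is_digit <- grepl("^[0-9]$", chars)
  letters_only <- chars[!is_digit]
  values <- integer(length(letters_only) + 1)
  gap <- 1  # index of the gap in front of the next letter
  for (k in seq_along(chars)) {
    if (is_digit[k]) {
      values[gap] <- as.integer(chars[k])
    } else {
      gap <- gap + 1
    }
  }
  list(letters = paste(letters_only, collapse = ""), values = values)
}

toy_hyphenate <- function(word, patterns = toy_patterns, min.length = 4) {
  n <- nchar(word)
  if (n < min.length) return(word)
  ext <- paste0(".", tolower(word), ".")  # dots mark the word boundaries
  n_ext <- nchar(ext)
  scores <- integer(n_ext + 1)  # one score per gap around the extended word
  for (pat in patterns) {
    p <- parse_pattern(pat)
    len <- nchar(p$letters)
    if (len > n_ext) next
    for (s in seq_len(n_ext - len + 1)) {
      if (substr(ext, s, s + len - 1) == p$letters) {
        idx <- s:(s + len)
        scores[idx] <- pmax(scores[idx], p$values)  # keep the highest value
      }
    }
  }
  chars <- strsplit(word, "")[[1]]
  out <- character(0)
  for (i in seq_len(n)) {
    out <- c(out, chars[i])
    # never split after the first or before the last letter
    can_break <- i >= 2 && i <= n - 2
    if (can_break && scores[i + 2] %% 2 == 1) out <- c(out, "-")
  }
  paste(out, collapse = "")
}

toy_hyphenate("hyphenation")  # "hy-phen-ation"
```

With a full pattern file the scoring works the same way; only the number of patterns, and therefore the number of competing scores per gap, grows.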
The built-in hyphenation patterns were derived from the patterns available on CTAN[1] under the terms of the LaTeX Project Public License[2]; see hyph.XX for detailed information.
Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.
[1] http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/
[2] http://www.ctan.org/tex-archive/macros/latex/base/lppl.txt
# NOT RUN {
hyphen(tagged.text)
# }
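The character method can be tried without a tagged object. The following is a minimal sketch, assuming an installed koRpus version that matches this page and ships the built-in "en" patterns; the word vector is made up for illustration.

```r
# Hypothetical input words
sample.words <- c("hyphenation", "computer", "dictionary")

if (requireNamespace("koRpus", quietly = TRUE)) {
  library(koRpus)
  # For a character vector, hyph.pattern must be given explicitly
  hyph.result <- hyphen(sample.words, hyph.pattern = "en", quiet = TRUE)
  hyph.result
}
```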