You can use this function to add new languages to be used with koRpus.
set.lang.support(target, value)
One of "hyphen", "kRp.POS.tags", or "treetag", depending on what support is to be added.
A named list that follows exactly the structure defined here for its respective target.
The named list usually has a single entry that gives the new language abbreviation, e.g., set.lang.support("hyphen", list("xyz"="xyz")). However, this will only work if a) the language support script is part of the koRpus package itself, and b) the hyphen pattern is located in its data subdirectory.
For your custom hyphenation patterns to be found automatically, provide them as the value in the named list, e.g., set.lang.support("hyphen", list("xyz"=hyph.xyz)). This will add the patterns directly to koRpus' environment, so they will be found when hyphenation is requested for language "xyz".
If you would like to provide hyphenation as part of a third party language package, you must name the object hyph.<lang>, save it to your package's data subdirectory as hyph.<lang>.rda, and append package="<yourpackage>" to the named list; e.g., set.lang.support("hyphen", list("xyz"=c("xyz", package="koRpus.lang.xyz"))). Only then will koRpus look for the pattern object in your package rather than in its own data directory.
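The three forms of the "hyphen" value described above differ only in what the list entry for "xyz" contains. The following is a minimal sketch in base R that builds and inspects them. Here hyph.xyz is only a placeholder standing in for a real pattern object, koRpus.lang.xyz is the hypothetical package name used in the text, and the variable names are made up for illustration. The actual set.lang.support() call is left as a comment because it needs koRpus to be loaded.

```r
# Placeholder only: a real pattern object would come from read.hyph.pat().
hyph.xyz <- "placeholder for a hyphenation pattern object"

# 1) Pattern shipped in koRpus' own data directory: abbreviation only.
value_builtin <- list("xyz" = "xyz")

# 2) Pattern object handed over directly for the running session.
value_direct <- list("xyz" = hyph.xyz)

# 3) Pattern shipped in a third party package (hypothetical package name).
value_package <- list("xyz" = c("xyz", package = "koRpus.lang.xyz"))

# The third form is a character vector whose second element is named "package".
str(value_package)
names(value_package[["xyz"]])

# With koRpus loaded, each list would be passed as the 'value' argument, e.g.:
# set.lang.support("hyphen", value_package)
```

Note that in the third form only the second element carries a name; the first element stays unnamed and holds the language abbreviation.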
The presets for the treetag() function are basically what the shell (GNU/Linux, MacOS) and batch (Win) scripts define that come with TreeTagger. Look for scripts called "$TREETAGGER/cmd/tree-tagger-xyzedish" and "$TREETAGGER\cmd\tree-tagger-xyzedish.bat", figure out which call resembles which call and then define them in set.lang.support("treetag") accordingly.
Have a look at the commented template in your koRpus installation directory for an elaborate example.
If Xyzedish is supported by TreeTagger, you should find a tagset definition for the language on its homepage. treetag() needs to know all POS tags that TreeTagger might return, otherwise you will get a self-explanatory error message as soon as an unknown tag appears. Notice that this can still happen after you have implemented the full documented tag set: sometimes the contributed TreeTagger parameter files add their own tags, e.g., for special punctuation. So please test your tag set well.
As you can see in the template file, you will also have to add a global word class and an explanation for each tag. The former is especially important for further steps like frequency analysis.
Again, please have a look at the commented template and/or existing language support files in the package sources; most of it should be almost self-explanatory.
To be able to also do syllable counts with the newly added language, you should add a hyphenation pattern file as well. Refer to the documentation of read.hyph.pat() to learn how to produce a pattern object from a downloaded hyphenation pattern file. Make sure you use the correct name scheme (e.g., "hyph.xyz.rda") and good compression. Please refer to the "hyphen" section for details on how to add these patterns to a running koRpus session or a language support package.
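Saving a pattern object under the required name scheme could look like the following sketch in base R. It rests on several assumptions: hyph.xyz is a placeholder rather than a real result of read.hyph.pat(), a temporary directory stands in for a package's data subdirectory, and "xz" is just one reasonable choice for "good compression" among the types that save() offers.

```r
# Placeholder only: in practice this object would be created with
# read.hyph.pat() from a downloaded hyphenation pattern file.
hyph.xyz <- "placeholder for a hyphenation pattern object"

# In a package this would be the package's data subdirectory; a temporary
# directory is used here so the sketch does not touch any real project.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
rda_file <- file.path(data_dir, "hyph.xyz.rda")

# Object name and file name both follow the hyph.<lang> scheme.
# compress = "xz" is an assumption about what counts as good compression.
save(hyph.xyz, file = rda_file, compress = "xz")

# Check that loading the file restores an object called "hyph.xyz".
check_env <- new.env()
restored <- load(rda_file, envir = check_env)
restored
```

Loading into a separate environment keeps the check from overwriting objects in the current session.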
Language support in this package is designed to be extended easily. You could call it modular, although it's actually more "environmental", but never mind.
To add full new language support, say for Xyzedish, you basically have to call this function three times with different targets, and provide respective hyphenation patterns. If you would like to re-use this language support, you should consider making it a package.
Be it a package or a script, it should contain all three calls to this function. If it succeeds, it will fill an internal environment with the information you have defined.
The function set.lang.support() gets called three times because there are three functions of koRpus that need language support:
hyphen() needs to know which language pattern tests are available as data files (which you must provide also)
treetag() needs the preset information from its own start scripts
kRp.POS.tags() needs to learn all possible POS tags that TreeTagger uses for the given language
All the calls follow the same pattern: first, you name one of the three targets explained above, and second, you provide a named list as the value for the respective target function.
# NOT RUN {
set.lang.support("hyphen",
list("xyz"="xyz")
)
# }