read.corp.LCC

A character string, either the path to a .tar/.tar.gz/.zip file in LCC format (flatfile),
or the path to the directory with the unpacked archive.

LCC.path

Either "flatfile" or "MySQL", depending on the type of LCC data.

format

A character string naming the encoding of the LCC files. Old zip archives used "ISO_8859-1".
This option only influences the reading of meta information,
as the actual database encoding is derived from that meta information.

fileEncoding

An integer value defining how many lines of data should be read if <code>format="flatfile"</code>. A value of -1 reads all lines.

n

Logical. If <code>LCC.path</code> is a tarred/zipped archive, setting <code>keep.temp=TRUE</code> will keep
the temporarily unpacked files for further use. By default all temporary files will be removed when
the function ends.

keep.temp

Character string, giving the prefix for the file names in the archive. Needed for newer LCC tar archives
if they are already decompressed (autodetected if <code>LCC.path</code> points to the tar archive directly).

prefix

Logical, whether information on bigrams should be imported.
This is <code>FALSE</code> by default, because it might make the objects quite large.
Note that this will only work with <code>n = -1</code>, because otherwise the tokens cannot be looked up.

bigrams

Logical, like <code>bigrams</code>, but for information on co-occurrences of tokens in a sentence.

cooccurence

Logical. If <code>FALSE</code>, all frequency statistics are calculated regardless of the tokens' case.
Otherwise, if the imported database supports it, you will get different frequencies for the same tokens in different
cases (e.g., "one" and "One").

caseSens
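
To illustrate what case-insensitive counting means for the resulting frequencies, here is a minimal base R sketch. It does not use the package's internals; the token vector is made up for illustration, and how the import itself merges cases is not described in this documentation.

```r
# Hypothetical token sample; not taken from any LCC corpus.
tokens <- c("One", "one", "two", "One", "Two", "three")

# Case-sensitive counts: "One" and "one" are separate entries.
freq_sens <- table(tokens)

# Case-insensitive counts: fold all tokens to lower case before counting.
freq_insens <- table(tolower(tokens))

print(freq_sens)
print(freq_insens)
```

With case folding, the counts for "One" and "one" collapse into a single entry, which is the effect to expect from <code>caseSens=FALSE</code>.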

Read data from LCC formatted corpora (Quasthoff, Richter & Biemann, 2006).
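
A typical import might look like the following sketch. It assumes the koRpus package is installed and that an LCC archive exists at the path shown; that path is hypothetical, and the call is skipped if either the package or the file is missing. The argument names are the ones documented on this page.

```r
# Hypothetical location of an LCC archive; adjust to your own data.
lcc_archive <- file.path("~", "corpora", "example_lcc.tar.gz")

# Arguments as documented on this page.
# Importing bigrams requires reading all lines (n = -1).
lcc_args <- list(
  LCC.path  = lcc_archive,
  format    = "flatfile",
  n         = -1,
  keep.temp = FALSE,
  bigrams   = TRUE,
  caseSens  = TRUE
)

# Only attempt the import if the package and the archive are available.
if (requireNamespace("koRpus", quietly = TRUE) &&
    file.exists(path.expand(lcc_archive))) {
  lcc_data <- do.call(koRpus::read.corp.LCC, lcc_args)
} else {
  message("koRpus or the example archive is not available; skipping the import.")
}
```

Collecting the arguments in a list first makes it easy to inspect or reuse them before the potentially slow import is started.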

corpora

A set of tools to analyze texts. Includes, amongst others, functions for
automatic language detection, hyphenation, several indices of lexical diversity
(e.g., type token ratio, HD-D/vocd-D, MTLD) and readability (e.g., Flesch,
SMOG, LIX, Dale-Chall). Basic import functions for language corpora are also
provided, to enable frequency analyses (supports Celex and Leipzig Corpora
Collection file formats) and measures like tf-idf. Note: For full functionality
a local installation of TreeTagger is recommended. It is also recommended to
not load this package directly, but by loading one of the available language
support packages from the 'l10n' repository
<https://undocumeantit.github.io/repos/l10n/>. 'koRpus' also includes a plugin
for the R GUI and IDE RKWard, providing graphical dialogs for its basic
features. The respective R package 'rkward' cannot be installed directly from a
repository, as it is a part of RKWard. To make full use of this feature, please
install RKWard from <https://rkward.kde.org> (plugins are detected
automatically). Due to some restrictions on CRAN, the full package sources are
only available from the project homepage. To ask for help, report bugs, request
features, or discuss the development of the package, please subscribe to the
koRpus-dev mailing list (<https://korpusml.reaktanz.de>).

read.corp.LCC: Import LCC data
