Read data from a custom corpus into a valid object of class kRp.corp.freq-class.
read.corp.custom(corpus, ...)
# S4 method for kRp.taggedText
read.corp.custom(corpus, quiet = TRUE,
caseSens = TRUE, log.base = 10, ...)
# S4 method for character
read.corp.custom(corpus, format = "file",
quiet = TRUE, caseSens = TRUE, log.base = 10, tagger = "kRp.env",
force.lang = NULL, ...)
# S4 method for list
read.corp.custom(corpus, quiet = TRUE, caseSens = TRUE,
log.base = 10, ...)
corpus: Either the path to a directory with txt files to read and analyze, or a vector object already holding the text corpus. Can also be an already tokenized and tagged text object which inherits class kRp.tagged (then the column "token" of the "TT.res" slot is used).
...: Additional options to be passed through to the tokenize function.
quiet: Logical. If FALSE, short status messages will be shown.
caseSens: Logical. If FALSE, all tokens will be matched in their lower case form.
log.base: A numeric value defining the base of the logarithm used for inverse document frequency (idf). See log for details.
format: Either "file" or "obj", depending on whether you want to scan files or analyze the given object.
tagger: A character string pointing to the tokenizer/tagger command you want to use for basic text analysis. Can be omitted if corpus is already of class kRp.tagged-class. Defaults to tagger="kRp.env" to get the settings by get.kRp.env. Set to "tokenize" to use tokenize.
force.lang: A character string defining the language to be assumed for the text(s), by force.
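To illustrate what matching tokens "in their lower case form" means for caseSens, here is a minimal base R sketch. It does not call the package; the token vector is made-up data, and the package's own counting may differ in detail.

```r
# Toy tokens (hypothetical data); in practice these would come from the tokenizer.
tokens <- c("The", "cat", "saw", "the", "Cat")

# Case-sensitive counting keeps "The" and "the" (and "cat"/"Cat") apart.
freq_sensitive <- table(tokens)

# Case-insensitive counting folds every token to lower case before counting.
freq_insensitive <- table(tolower(tokens))

print(freq_sensitive)
print(freq_insensitive)
```

With case folding, capitalized and lower case spellings of the same word are merged into one type, which usually lowers the number of distinct types and raises their frequencies.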
An object of class kRp.corp.freq-class.
The methods should enable you to perform a basic text corpus frequency analysis. That is, not just to import analysis results like LCC files, but to import the corpus material itself. The resulting object is of class kRp.corp.freq-class, so it can be used for frequency analysis by other functions and methods of this package.
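The role of log.base in the inverse document frequency can be sketched in base R as follows. This uses one common definition, idf = log(N / df), where N is the number of documents and df the number of documents containing a token; that the package computes idf exactly this way is an assumption, and the documents below are made-up data.

```r
# Three toy "documents", already tokenized (hypothetical data).
docs <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "ran"),
  c("the", "cat", "ran")
)

# For each distinct token, count the documents that contain it.
types <- sort(unique(unlist(docs)))
doc_freq <- vapply(
  types,
  function(type) sum(vapply(docs, function(d) type %in% d, logical(1))),
  numeric(1)
)

# Assumed definition: idf = log(N / document frequency),
# with base 10 to mirror the default log.base = 10.
idf <- log(length(docs) / doc_freq, base = 10)
print(round(idf, 3))
```

A token that occurs in every document gets an idf of 0, while rarer tokens get larger values. Changing the base only rescales all values by a constant factor.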
# NOT RUN {
ru.corp <- read.corp.custom("~/mydata/corpora/russian_corpus/")
# }