This class is used for objects that are returned by read.corp.LCC and read.corp.celex.
meta
Metadata on the corpora (see details).
words
Absolute word frequencies. It has at least the following columns:
num
:Some word ID from the DB, integer
word
:The word itself
lemma
:The lemma of the word
tag
:A part-of-speech tag
wclass
:The word class
lttr
:The number of characters
freq
:The frequency of that word in the corpus DB
pct
:Percentage of appearance in DB
pmio
:Appearance per million words in DB
log10
:Base 10 logarithm of word frequency
rank.avg
:Rank in corpus data, rank ties method "average"
rank.min
:Rank in corpus data, rank ties method "min"
rank.rel.avg
:Relative rank, i.e. percentile of "rank.avg"
rank.rel.min
:Relative rank, i.e. percentile of "rank.min"
inDocs
:The absolute number of documents in the corpus containing the word
idf
:The inverse document frequency
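As a rough illustration of how several of these columns relate to the raw counts, the following base R sketch derives them for a toy table. The words, counts, and document total are made up. The rank direction (more frequent words get higher ranks) and the base 10 logarithm used for idf are assumptions made for this example, not confirmed details of how the package computes its values.

```r
# Toy frequency table; words, counts and document numbers are invented
words <- data.frame(
  word = c("the", "of", "corpus", "lemma"),
  freq = c(40, 25, 25, 10),
  inDocs = c(10, 5, 5, 1),
  stringsAsFactors = FALSE
)
n.tokens <- sum(words$freq)  # 100 running words in this toy corpus
n.docs <- 10                 # hypothetical number of documents

words$lttr <- nchar(words$word)                  # number of characters
words$pct <- words$freq / n.tokens * 100         # percentage of all tokens
words$pmio <- words$freq / n.tokens * 1e6        # occurrences per million words
words$log10 <- log10(words$freq)                 # base 10 logarithm of frequency

# The two tied words (freq 25) show the difference between the ties methods
words$rank.avg <- rank(words$freq, ties.method = "average")
words$rank.min <- rank(words$freq, ties.method = "min")

# Assumption: idf as log10 of (number of documents / documents with the word)
words$idf <- log10(n.docs / words$inDocs)

print(words)
```

With "average", the tied words share the mean of the ranks they span (2.5), while "min" assigns both the lowest of those ranks (2).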
desc
Descriptive information. It contains six numbers from the meta information, for convenient accessibility:
tokens
:Number of running word forms
types
:Number of distinct word forms
words.p.sntc
:Average sentence length in words
chars.p.sntc
:Average sentence length in characters
chars.p.wform
:Average word form length
chars.p.word
:Average running word length
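The difference between the type-based and token-based averages is easiest to see on a small example. The sketch below computes the six numbers for two made-up sentences in base R. It only illustrates the definitions; counting sentence length in characters without spaces or punctuation is an assumption of this example.

```r
# Two invented, already tokenized sentences
sentences <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "barked", "loudly")
)
tokens <- unlist(sentences)  # running word forms
types <- unique(tokens)      # distinct word forms

desc <- list(
  tokens = length(tokens),
  types = length(types),
  words.p.sntc = length(tokens) / length(sentences),
  # Assumption: characters of the words only, no spaces or punctuation
  chars.p.sntc = sum(nchar(tokens)) / length(sentences),
  chars.p.wform = mean(nchar(types)),   # each distinct form counted once
  chars.p.word = mean(nchar(tokens))    # each occurrence counted
)
str(desc)
```

Because "the" occurs twice, it contributes twice to chars.p.word but only once to chars.p.wform, so the two averages differ.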
bigrams
A data.frame listing all tokens that co-occurred next to each other in the corpus:
token1
:The first token
token2
:The second token that appeared right next to the first
freq
:How often the co-occurrence was present
sig
:Log-likelihood significance of the co-occurrence
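To make the structure of such a table concrete, here is a minimal base R sketch that counts adjacent token pairs in a made-up token sequence. It only produces the token1, token2 and freq columns. The sig column is left out, and sentence boundaries are ignored for simplicity, so this is not necessarily how the corpus data was compiled.

```r
# Invented token sequence
tokens <- c("the", "cat", "sat", "on", "the", "cat")

# Pair every token with its right-hand neighbour
pairs <- data.frame(
  token1 = head(tokens, -1),
  token2 = tail(tokens, -1),
  freq = 1,
  stringsAsFactors = FALSE
)

# Sum up identical pairs to get their frequency
bigrams <- aggregate(freq ~ token1 + token2, data = pairs, FUN = sum)
print(bigrams)
```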
cooccur
Similar to bigrams, but listing co-occurrences anywhere in one sentence:
token1
:The first token
token2
:The second token that appeared in the same sentence
freq
:How often the co-occurrence was present
sig
:Log-likelihood significance of the co-occurrence
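Sentence-level co-occurrence can be sketched in a similar way. The base R example below pairs every two distinct tokens within each made-up sentence and counts in how many sentences each pair appears. Treating pairs as unordered (sorted alphabetically) and omitting the sig column are simplifying assumptions of this sketch.

```r
# Two invented, already tokenized sentences
sentences <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "sat")
)

# For each sentence, list all unordered pairs of distinct tokens
pairs <- do.call(rbind, lapply(sentences, function(s) {
  combos <- t(combn(sort(unique(s)), 2))
  data.frame(
    token1 = combos[, 1],
    token2 = combos[, 2],
    freq = 1,
    stringsAsFactors = FALSE
  )
}))

# Count the number of sentences in which each pair co-occurs
cooccur <- aggregate(freq ~ token1 + token2, data = pairs, FUN = sum)
print(cooccur)
```

Unlike the adjacent-pair case, "sat" and "the" are counted together here even though they never stand next to each other.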
caseSens
A single logical value indicating whether the frequency statistics were calculated case-sensitively.
Should you need to manually generate objects of this class (which should rarely be the case), the constructor function kRp_corp_freq(...) can be used instead of new("kRp.corp.freq", ...).
The slot meta
simply contains all information from the "meta.txt" of the LCC[1] data and remains
empty for data from a Celex[2] DB.
[1] https://wortschatz.uni-leipzig.de/en/download/
[2] http://celex.mpi.nl