This class is used for objects that are returned by read.corp.LCC and read.corp.celex.
meta
Metadata on the corpora (see details).
words
Absolute word frequencies. It has at least the following columns:
num
:An integer word ID from the DB
word
:The word itself
lemma
:The lemma of the word
tag
:A part-of-speech tag
wclass
:The word class
lttr
:The number of characters
freq
:The frequency of that word in the corpus DB
pct
:Percentage of appearance in DB
pmio
:Appearance per million words in DB
log10
:Base 10 logarithm of word frequency
rank.avg
:Rank in corpus data, rank ties method "average"
rank.min
:Rank in corpus data, rank ties method "min"
rank.rel.avg
:Relative rank, i.e. percentile of "rank.avg"
rank.rel.min
:Relative rank, i.e. percentile of "rank.min"
inDocs
:The absolute number of documents in the corpus containing the word
idf
:The inverse document frequency
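The derived columns above can be illustrated with a minimal base R sketch on a toy table. The words, counts, and document total are made up, and the exact formulas are assumptions rather than what the corpus DB or the reading functions are confirmed to use (in particular the percentile definition, the idf formula, and whether log10 is taken of the raw count).

```r
# Toy frequency table; words and counts are invented for illustration.
words <- data.frame(
  word = c("the", "cat", "sat", "mat"),
  freq = c(6L, 2L, 1L, 1L),
  inDocs = c(4L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)
n.tokens <- sum(words$freq)  # running words in the toy corpus
n.docs <- 4                  # assumed total number of documents

words$lttr <- nchar(words$word)
words$pct <- words$freq / n.tokens * 100
words$pmio <- words$freq / n.tokens * 1e6
# Assumption: logarithm of the raw frequency count.
words$log10 <- log10(words$freq)

# Words with equal frequency are ties; the ties method decides their rank.
words$rank.avg <- rank(words$freq, ties.method = "average")
words$rank.min <- rank(words$freq, ties.method = "min")

# Assumption: relative rank as a percentage of the highest rank.
words$rank.rel.avg <- words$rank.avg / max(words$rank.avg) * 100
words$rank.rel.min <- words$rank.min / max(words$rank.min) * 100

# Assumption: idf as log10(total documents / documents containing the word).
words$idf <- log10(n.docs / words$inDocs)

print(words)
```

The two words with frequency 1 show the difference between the ties methods: "average" assigns both the mean of the ranks they occupy, while "min" assigns both the lower rank.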
desc
Descriptive information. It contains six numbers from the meta information, for convenient accessibility:
tokens
:Number of running word forms
types
:Number of distinct word forms
words.p.sntc
:Average sentence length in words
chars.p.sntc
:Average sentence length in characters
chars.p.wform
:Average word form length
chars.p.word
:Average running word length
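As a rough illustration of how the type-based and token-based values differ, the following base R sketch computes four of the six numbers from a toy frequency table. It assumes that "word form length" averages over distinct types and "running word length" weights each type by its frequency; that reading, like the data, is an assumption. The two sentence-based values need sentence boundaries and are left out.

```r
# Toy frequency table; words and counts are invented for illustration.
words <- data.frame(
  word = c("the", "kitten", "sat", "a"),
  freq = c(6L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)
lttr <- nchar(words$word)

desc <- list(
  tokens = sum(words$freq),   # running word forms
  types = nrow(words),        # distinct word forms
  # Assumption: unweighted mean over types.
  chars.p.wform = mean(lttr),
  # Assumption: mean over tokens, i.e. weighted by frequency.
  chars.p.word = sum(lttr * words$freq) / sum(words$freq)
)

str(desc)
```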
bigrams
A data.frame listing all tokens that co-occurred next to each other in the corpus:
token1
:The first token
token2
:The second token that appeared right next to the first
freq
:How often the co-occurrence was present
sig
:Log-likelihood significance of the co-occurrence
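A table with these four columns could be built from a plain token vector roughly as follows. This is only a sketch in base R: the token sequence is made up, the helper g2() is hypothetical, and the use of Dunning's log-likelihood ratio (G2) on a 2x2 contingency table is an assumption about what the significance value represents, not a confirmed description of how the corpus DB computes it.

```r
# Toy token sequence, invented for illustration.
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran")

# Pair every token with its right-hand neighbour and count the pairs.
pairs <- data.frame(
  token1 = head(tokens, -1),
  token2 = tail(tokens, -1),
  freq = 1L,
  stringsAsFactors = FALSE
)
bigrams <- aggregate(freq ~ token1 + token2, data = pairs, FUN = sum)

# Hypothetical helper: log-likelihood ratio for a 2x2 table of counts.
g2 <- function(k11, k12, k21, k22) {
  obs <- c(k11, k12, k21, k22)
  n <- sum(obs)
  rs <- c(k11 + k12, k21 + k22)  # row sums
  cs <- c(k11 + k21, k12 + k22)  # column sums
  expd <- c(rs[1] * cs[1], rs[1] * cs[2], rs[2] * cs[1], rs[2] * cs[2]) / n
  nz <- obs > 0                  # cells with zero count contribute nothing
  2 * sum(obs[nz] * log(obs[nz] / expd[nz]))
}

n.pairs <- sum(bigrams$freq)
first.n <- table(pairs$token1)   # how often each token starts a pair
second.n <- table(pairs$token2)  # how often each token ends a pair

bigrams$sig <- mapply(function(t1, t2, f) {
  k12 <- first.n[[t1]] - f       # token1 followed by something else
  k21 <- second.n[[t2]] - f      # token2 preceded by something else
  g2(f, k12, k21, n.pairs - f - k12 - k21)
}, as.character(bigrams$token1), as.character(bigrams$token2), bigrams$freq,
USE.NAMES = FALSE)

print(bigrams)
```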
cooccur
Similar to bigrams, but listing co-occurrences anywhere in one sentence:
token1
:The first token
token2
:The second token that appeared in the same sentence
freq
:How often the co-occurrence was present
sig
:Log-likelihood significance of the co-occurrence
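The difference from adjacent pairs can be sketched in base R by counting, for each sentence, every unordered pair of distinct tokens it contains. The sentences are made up, and listing each pair once in alphabetical order is an illustrative choice; the actual data may order or duplicate pairs differently. The significance column is omitted here.

```r
# Toy corpus: one character vector of tokens per sentence (invented data).
sentences <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "sat"),
  c("a", "cat", "ran")
)

# For each sentence, list all unordered pairs of distinct tokens.
pairs <- do.call(rbind, lapply(sentences, function(s) {
  types <- sort(unique(s))
  if (length(types) < 2) return(NULL)
  t(combn(types, 2))
}))

# Count in how many sentences each pair occurs.
cooccur <- as.data.frame(
  table(token1 = pairs[, 1], token2 = pairs[, 2]),
  responseName = "freq",
  stringsAsFactors = FALSE
)
cooccur <- cooccur[cooccur$freq > 0, ]

print(cooccur)
```

Here "sat" and "the" co-occur in two sentences although they are never direct neighbours, so the pair would be missing from a table of adjacent tokens.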
The slot meta simply contains all information from the "meta.txt" of the LCC[1] data and remains empty for data from a Celex[2] DB.
[1] http://corpora.informatik.uni-leipzig.de/download.html
[2] http://celex.mpi.nl