koRpus (version 0.06-5)

kRp.corp.freq,-class: S4 Class kRp.corp.freq

Description

This class is used for objects that are returned by read.corp.LCC and read.corp.celex.

Slots

meta
Metadata on the corpora (see details).
words
Absolute word frequencies. It has at least the following columns:
num:
Some word ID from the DB, integer
word:
The word itself
lemma:
The lemma of the word
tag:
A part-of-speech tag
wclass:
The word class
lttr:
The number of characters
freq:
The frequency of that word in the corpus DB
pct:
Percentage of appearance in DB
pmio:
Appearance per million words in DB
log10:
Base 10 logarithm of word frequency
rank.avg:
Rank in corpus data, rank ties method "average"
rank.min:
Rank in corpus data, rank ties method "min"
rank.rel.avg:
Relative rank, i.e. percentile of "rank.avg"
rank.rel.min:
Relative rank, i.e. percentile of "rank.min"
inDocs:
The absolute number of documents in the corpus containing the word
idf:
The inverse document frequency
The slot might have additional columns, depending on the input material.
desc
Descriptive information. It contains six numbers from the meta information, for convenient access:
tokens:
Number of running word forms
types:
Number of distinct word forms
words.p.sntc:
Average sentence length in words
chars.p.sntc:
Average sentence length in characters
chars.p.wform:
Average word form length
chars.p.word:
Average running word length
The slot might have additional columns, depending on the input material.
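
To make the derived columns of the words slot more concrete, here is a minimal base R sketch that computes several of them from a toy frequency table. The example words, the counts, the total number of documents (n.docs) and the exact formulas for the relative ranks and idf are assumptions made for illustration; the package's own computation may differ.

```r
# Toy frequency table; words and counts are made up for illustration.
words <- data.frame(
  word   = c("the", "of", "and", "corpus", "lemma"),
  freq   = c(50, 30, 30, 5, 5),
  inDocs = c(10, 9, 9, 2, 1),
  stringsAsFactors = FALSE
)
n.tokens <- sum(words$freq)  # total number of running word forms
n.docs   <- 10               # assumed total number of documents

words$lttr  <- nchar(words$word)              # number of characters
words$pct   <- words$freq / n.tokens * 100    # percentage of appearance
words$pmio  <- words$freq / n.tokens * 1e6    # appearance per million words
words$log10 <- log10(words$freq)              # base 10 logarithm of frequency

# The two tie handling methods named in the column descriptions
words$rank.avg <- rank(words$freq, ties.method = "average")
words$rank.min <- rank(words$freq, ties.method = "min")

# Assumption: relative rank expressed as a percentage of the highest rank
words$rank.rel.avg <- words$rank.avg / max(words$rank.avg) * 100
words$rank.rel.min <- words$rank.min / max(words$rank.min) * 100

# Assumption: a common definition of inverse document frequency
words$idf <- log10(n.docs / words$inDocs)

print(words)
```

The tied frequencies (30 and 5) show how the two tie methods differ: "average" assigns each tied word the mean of the ranks they occupy, while "min" assigns all of them the lowest of those ranks.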

Details

The slot meta simply contains all information from the "meta.txt" of the LCC[1] data and remains empty for data from a Celex[2] DB.
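
The slot layout described above can be sketched with a small stand-in S4 class. The class name toyCorpFreq, the choice of data.frame for all three slots and the example values are hypothetical and only mirror the description on this page; this is not the package's own class definition.

```r
library(methods)

# Hypothetical stand-in class with the three slots described above.
# This is NOT the koRpus class definition; slot types are assumed.
setClass(
  "toyCorpFreq",
  representation(
    meta  = "data.frame",
    words = "data.frame",
    desc  = "data.frame"
  )
)

obj <- new(
  "toyCorpFreq",
  meta  = data.frame(),  # left empty, as described for Celex data
  words = data.frame(word = c("a", "b"), freq = c(3, 1),
                     stringsAsFactors = FALSE),
  desc  = data.frame(tokens = 4, types = 2)
)

slotNames(obj)            # names of the available slots
obj@desc$tokens           # slot access with the @ operator
slot(obj, "words")$freq   # equivalent access with slot()
nrow(obj@meta) == 0       # check whether any metadata is present
```

Checking the number of rows in meta is one way to tell whether metadata was supplied before trying to use it.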

References

[1] http://corpora.informatik.uni-leipzig.de/download.html
[2] http://celex.mpi.nl