koRpus (version 0.13-8)

kRp.corp.freq,-class: S4 Class kRp.corp.freq

Description

This class is used for objects that are returned by read.corp.LCC and read.corp.celex.

Slots

meta

Metadata on the corpora (see details).

words

Absolute word frequencies. It has at least the following columns:

num:

An integer word ID taken from the database

word:

The word itself

lemma:

The lemma of the word

tag:

A part-of-speech tag

wclass:

The word class

lttr:

The number of characters

freq:

The frequency of that word in the corpus DB

pct:

Percentage of appearance in DB

pmio:

Appearance per million words in DB

log10:

Base 10 logarithm of word frequency

rank.avg:

Rank in corpus data, rank ties method "average"

rank.min:

Rank in corpus data, rank ties method "min"

rank.rel.avg:

Relative rank, i.e. percentile of "rank.avg"

rank.rel.min:

Relative rank, i.e. percentile of "rank.min"

inDocs:

The absolute number of documents in the corpus containing the word

idf:

The inverse document frequency

The slot might have additional columns, depending on the input material.
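To make the relationship between the derived columns more concrete, the following base R sketch builds a small table with a few of the columns listed above. The words, counts, and document total are made up, and the formulas (taking the logarithm of the absolute count, ranking in ascending order of frequency, relative rank as rank divided by the number of rows, a base 10 idf) are common textbook definitions used here as assumptions; the values koRpus actually stores may be computed differently.

```r
# Toy frequency table; words and counts are invented for illustration only.
words <- data.frame(
  word   = c("the", "of", "corpus", "lemma", "token"),
  freq   = c(40, 30, 10, 10, 10),
  inDocs = c(10, 8, 4, 2, 1),
  stringsAsFactors = FALSE
)

n_tokens <- sum(words$freq)  # running word forms in the toy corpus
n_docs   <- 10               # assumed total number of documents

words$lttr  <- nchar(words$word)              # number of characters
words$pct   <- words$freq / n_tokens * 100    # percentage of all tokens
words$pmio  <- words$freq / n_tokens * 1e6    # occurrences per million tokens
words$log10 <- log10(words$freq)              # assumption: log of the absolute count

# The two tie-handling methods differ only for words with equal frequency.
words$rank.avg <- rank(words$freq, ties.method = "average")
words$rank.min <- rank(words$freq, ties.method = "min")

# Assumption: relative rank expressed as a percentage of the number of rows.
words$rank.rel.avg <- words$rank.avg / nrow(words) * 100
words$rank.rel.min <- words$rank.min / nrow(words) * 100

# Assumption: a common idf definition; koRpus may use another base or smoothing.
words$idf <- log10(n_docs / words$inDocs)

print(words)
```

The three words with a frequency of 10 share one rank, which is where "average" and "min" give different values.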

desc

Descriptive information. It contains six numbers from the meta information, for convenient access:

tokens:

Number of running word forms

types:

Number of distinct word forms

words.p.sntc:

Average sentence length in words

chars.p.sntc:

Average sentence length in characters

chars.p.wform:

Average word form length

chars.p.word:

Average running word length

The slot might have additional columns, depending on the input material.
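The difference between the word form measures and the running word measures can be illustrated with a small base R sketch. The toy data and the way the averages are computed are assumptions based on the descriptions above, not the code koRpus uses; sentence-based values are left out because they require sentence boundaries.

```r
# Toy frequency table; invented for illustration only.
toy <- data.frame(
  word = c("the", "of", "corpus"),
  freq = c(5, 3, 2),
  stringsAsFactors = FALSE
)

desc <- data.frame(
  tokens        = sum(toy$freq),                 # running word forms
  types         = nrow(toy),                     # distinct word forms
  chars.p.wform = mean(nchar(toy$word)),         # each word form counted once
  chars.p.word  = weighted.mean(nchar(toy$word), # each occurrence counted
                                w = toy$freq)
)

print(desc)
```

Because frequent words tend to be short, the average over running words is usually smaller than the average over distinct word forms.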

bigrams

A data.frame listing all tokens that co-occurred next to each other in the corpus:

token1:

The first token

token2:

The second token that appeared right next to the first

freq:

How often the co-occurrence was present

sig:

Log-likelihood significance of the co-occurrence
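As a rough illustration of the layout of this data.frame, the sketch below counts adjacent token pairs in a short, made-up token vector using base R. It ignores sentence boundaries and does not compute the sig column; the LCC data provides these values precomputed, so this only shows the shape of the table, not how it was produced.

```r
# Made-up token sequence for illustration only.
tokens <- c("the", "cat", "sat", "on", "the", "cat")

# Pair every token with its right-hand neighbour.
pairs <- data.frame(
  token1 = head(tokens, -1),
  token2 = tail(tokens, -1),
  freq   = 1L,
  stringsAsFactors = FALSE
)

# Sum the counts per distinct pair and sort by descending frequency.
bigrams <- aggregate(freq ~ token1 + token2, data = pairs, FUN = sum)
bigrams <- bigrams[order(-bigrams$freq), ]

print(bigrams)
```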

cooccur

Similar to bigrams, but listing co-occurrences anywhere in one sentence:

token1:

The first token

token2:

The second token that appeared in the same sentence

freq:

How often the co-occurrence was present

sig:

Log-likelihood significance of the co-occurrence

caseSens

A single logical value indicating whether the frequency statistics were calculated case-sensitively.

Constructor function

Should you need to manually generate objects of this class (which should rarely be the case), the constructor function kRp_corp_freq(...) can be used instead of new("kRp.corp.freq", ...).
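The equivalence between a generator function and new() is standard S4 behaviour and can be tried without koRpus installed. The sketch below defines a hypothetical toy class, toyCorpFreq, whose slot names mirror the ones documented here; the slot types and the default for caseSens are assumptions, and this is not the actual koRpus class definition.

```r
library(methods)

# Hypothetical stand-in class; slot types are assumed, not taken from koRpus.
toyCorpFreq <- setClass(
  "toyCorpFreq",
  representation(
    meta     = "data.frame",
    words    = "data.frame",
    desc     = "data.frame",
    bigrams  = "data.frame",
    cooccur  = "data.frame",
    caseSens = "logical"
  ),
  prototype(caseSens = FALSE)
)

# The generator returned by setClass() and new() build the same kind of object.
obj_a <- toyCorpFreq(caseSens = TRUE)
obj_b <- new("toyCorpFreq", caseSens = TRUE)

# Slots are read with @ or with slot().
print(slotNames(obj_a))
print(obj_a@caseSens)
print(nrow(slot(obj_b, "words")))
```

With koRpus loaded, the same access pattern (slotNames(), @, slot()) should apply to objects returned by read.corp.LCC and read.corp.celex.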

Details

The slot meta simply contains all information from the "meta.txt" of the LCC[1] data and remains empty for data from a Celex[2] DB.

References

[1] https://wortschatz.uni-leipzig.de/en/download/

[2] http://celex.mpi.nl