
koRpus (version 0.13-8)

freq.analysis: Analyze word frequencies

Description

The function freq.analysis analyzes a text with regard to the frequencies of its tokens, word classes, etc.

Usage

freq.analysis(txt.file, ...)

# S4 method for kRp.text
freq.analysis(
  txt.file,
  corp.freq = NULL,
  desc.stat = TRUE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c()
)

Arguments

txt.file

An object of class kRp.text.

...

Additional options for the generic.

corp.freq

An object of class kRp.corp.freq.

desc.stat

Logical, whether an updated descriptive statistical analysis should be conducted.

corp.rm.class

A character vector with word classes which should be ignored for frequency analysis. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE) to be used.

corp.rm.tag

A character vector with POS tags which should be ignored for frequency analysis.

Value

An updated object of class kRp.text with the added feature freq, which is a list with information on the word frequencies of the analyzed text. Use corpusFreq to get that slot.

Details

The function adds new columns with frequency information to the tokens data frame of the input object, describing how often each token occurs in the corpus frequency object provided via corp.freq.

To get the results, you can use taggedText to get the tokens slot, describe to get the raw descriptive statistics (only updated if desc.stat=TRUE), and corpusFreq to get the data from the added freq feature.

If corp.freq provides appropriate idf values for the types in txt.file, the term frequency–inverse document frequency statistic (tf-idf) will also be computed. Missing idf values will result in NA.
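To illustrate how a missing idf value propagates to NA, here is a minimal base R sketch of a textbook tf-idf computation. It does not use koRpus: the toy tokens, the document counts, and the exact weighting (relative term frequency times the natural log of the inverse document frequency) are assumptions made for illustration, and the formula koRpus applies internally may differ.

```r
# Hypothetical toy data: one tokenized document and made-up document
# frequencies from an imaginary corpus of three documents.
tokens   <- c("the", "cat", "sat", "on", "the", "mat")
n_docs   <- 3
doc_freq <- c(the = 3, cat = 1, sat = 2, on = 3)  # no entry for "mat"

# Term frequency: relative frequency of each type in the document.
tf <- table(tokens) / length(tokens)

# Inverse document frequency (textbook form, assumed here);
# types without corpus data yield NA.
idf <- log(n_docs / doc_freq[names(tf)])
names(idf) <- names(tf)

# tf-idf per type; NA wherever the idf value is missing.
tfidf <- as.numeric(tf) * idf
print(tfidf)
```

In this sketch "mat" has no document frequency, so its tf-idf is NA, while types that occur in every document of the toy corpus get a value of 0.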

See Also

get.kRp.env, kRp.text, kRp.corp.freq

Examples

# NOT RUN {
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # call freq.analysis() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # the tokens slot before frequency analysis
  head(taggedText(tokenized.obj))

  # instead of data from a larger corpus, we'll
  # use the token frequencies of the text itself
  tokenized.obj <- freq.analysis(
    tokenized.obj,
    corp.freq=read.corp.custom(tokenized.obj)
  )
  # compare the columns after the analysis
  head(taggedText(tokenized.obj))

  # the object now has further statistics in a
  # new feature slot called freq
  hasFeature(tokenized.obj)
  corpusFreq(tokenized.obj)
} else {}
# }
