polmineR (version 0.8.8)

features: Get features by comparison.

Description

The features of two objects are compared: usually a partition defining a corpus of interest (coi) and a partition defining a reference corpus (ref). The most important purpose is term extraction.
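
To make the comparison concrete, the following is a minimal sketch in base R of a chi-squared statistic for a single term, computed from a 2x2 contingency table of coi and ref counts. All counts are invented for illustration, and the calculation is a textbook Pearson chi-squared, not necessarily the exact computation polmineR performs internally.

```r
# Hypothetical counts, invented for illustration only
count_coi <- 40      # occurrences of the term in the corpus of interest
size_coi  <- 10000   # total number of tokens in the corpus of interest
count_ref <- 100     # occurrences of the term in the reference corpus
size_ref  <- 200000  # total number of tokens in the reference corpus

# 2x2 contingency table: rows are corpora, columns are term vs. all other tokens
observed <- matrix(
  c(count_coi, size_coi - count_coi,
    count_ref, size_ref - count_ref),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("coi", "ref"), c("term", "other"))
)

# Expected counts if the term were equally frequent in both corpora
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic (no continuity correction)
chisquare <- sum((observed - expected)^2 / expected)
chisquare
```

A high value indicates that the term's frequency in the coi deviates strongly from what the reference corpus would lead one to expect, which is why ranking terms by this statistic is useful for term extraction.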

Usage

features(x, y, ...)

# S4 method for partition
features(x, y, included = FALSE, method = "chisquare", verbose = FALSE)

# S4 method for count
features(
  x,
  y,
  by = NULL,
  included = FALSE,
  method = "chisquare",
  verbose = TRUE
)

# S4 method for partition_bundle
features(
  x,
  y,
  included = FALSE,
  method = "chisquare",
  verbose = TRUE,
  mc = getOption("polmineR.mc"),
  progress = FALSE
)

# S4 method for count_bundle
features(
  x,
  y,
  included = FALSE,
  method = "chisquare",
  verbose = !progress,
  mc = getOption("polmineR.mc"),
  progress = FALSE
)

# S4 method for ngrams
features(x, y, included = FALSE, method = "chisquare", verbose = TRUE, ...)

# S4 method for Cooccurrences
features(x, y, included = FALSE, method = "ll", verbose = TRUE)

Arguments

x

A partition or partition_bundle object.

y

A partition object serving as the reference corpus (ref). It is assumed that the coi is a subcorpus of ref.

...

further parameters

included

A logical value: TRUE if the coi is part of ref. Defaults to FALSE.

method

The statistical test to apply: "chisquare" (chi-squared test) or "ll" (log-likelihood test).

verbose

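A logical value. The default differs by method: TRUE for most methods, FALSE for the partition method, and !progress for the count_bundle method (see Usage).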

by

the columns used for merging, if NULL (default), the p-attribute of x will be used

mc

A logical value, whether to use multicore processing. Defaults to getOption("polmineR.mc").

progress

A logical value, whether to show a progress bar. Defaults to FALSE.
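
The interplay of included and method can be sketched in base R as follows. This is an illustration under stated assumptions, not polmineR's implementation: the counts are hypothetical, the adjustment for included = TRUE (subtracting the coi counts and size from ref so that the two corpora no longer overlap) is an assumption about what the argument is meant to achieve, and the log-likelihood formula is the common two-term formulation known from Rayson and Garside.

```r
# Hypothetical term counts, invented for illustration only
coi <- c(und = 500, haushalt = 40, europa = 25)     # counts in the corpus of interest
ref <- c(und = 9000, haushalt = 100, europa = 300)  # counts in the reference corpus
size_coi <- 10000   # total tokens in the coi
size_ref <- 200000  # total tokens in ref (which here contains the coi)

# Assumed handling of included = TRUE: remove the coi's share from ref
# so that the same tokens are not counted on both sides of the comparison
ref_adj <- ref - coi
size_ref_adj <- size_ref - size_coi

# Expected counts if each term were equally frequent in both corpora
total <- size_coi + size_ref_adj
expected_coi <- size_coi * (coi + ref_adj) / total
expected_ref <- size_ref_adj * (coi + ref_adj) / total

# Two-term log-likelihood statistic; a count of zero would need special
# handling because 0 * log(0) evaluates to NaN in R
ll <- 2 * (coi * log(coi / expected_coi) + ref_adj * log(ref_adj / expected_ref))

# Rank terms by the statistic, highest first
sort(ll, decreasing = TRUE)
```

Without the adjustment, tokens of the coi would also contribute to the reference counts, which dampens the contrast between the two corpora.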

Author

Andreas Blaette

References

Baker, Paul (2006): Using Corpora in Discourse Analysis. London: Continuum, pp. 121-149 (ch. 6).

Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. Cambridge, Mass.: MIT Press, pp. 151-189 (ch. 5).

Examples

use("polmineR")

kauder <- partition(
  "GERMAPARLMINI",
  speaker = "Volker Kauder", interjection = "speech",
  p_attribute = "word"
  )
all <- partition("GERMAPARLMINI", interjection = "speech", p_attribute = "word")

terms_kauder <- features(x = kauder, y = all, included = TRUE)
top100 <- subset(terms_kauder, rank_chisquare <= 100)
head(top100)

# a different way is to compare count objects
kauder_count <- as(kauder, "count")
all_count <- as(all, "count")
terms_kauder <- features(kauder_count, all_count, included = TRUE)
top100 <- subset(terms_kauder, rank_chisquare <= 100)
head(top100)

# get matrix with features (dontrun to keep time for examples short)
if (FALSE) {
  use("RcppCWB")
  docs <- partition_bundle("REUTERS", s_attribute = "id") %>%
    enrich(p_attribute = "word")
  all <- corpus("REUTERS") %>%
    count(p_attribute = "word")
  docs_terms <- features(docs[1:5], all, included = TRUE, progress = FALSE)
  dtm <- as.DocumentTermMatrix(docs_terms, col = "chisquare", verbose = FALSE)
}

# Get features of objects in a count_bundle
ref <- corpus("GERMAPARLMINI") %>% count(p_attribute = "word")
cois <- corpus("GERMAPARLMINI") %>%
  subset(speaker %in% c("Angela Dorothea Merkel", "Hubertus Heil")) %>%
  split(s_attribute = "speaker") %>%
  count(p_attribute = "word")
y <- features(cois, ref, included = TRUE, method = "chisquare", progress = TRUE)