fcm: Create a feature co-occurrence matrix

Description

Create a sparse feature co-occurrence matrix, measuring co-occurrences of features within a user-defined context. The context can be defined as a document or a window within a collection of documents, with an optional vector of weights applied to the co-occurrence counts.

Usage

fcm(x, context = c("document", "window"), count = c("frequency", "boolean",
  "weighted"), window = 5L, weights = 1L, ordered = FALSE,
  span_sentence = TRUE, tri = TRUE, ...)

Arguments

x

character, corpus, tokens, or dfm object from which to generate the feature co-occurrence matrix

context

the context in which to consider term co-occurrence: "document" for co-occurrence counts within document; "window" for co-occurrence within a defined window of words, which requires a positive integer value for window. Note: if x is a dfm object, then context can only be "document".

count

how to count co-occurrences:

"frequency": count the number of co-occurrences within the context
"boolean": record only whether a co-occurrence happens within the context, irrespective of how many times it occurs.
"weighted": count a weighted function of counts, typically as a function of distance from the target feature. Only makes sense for context = "window".

window

positive integer value for the size of a window on either side of the target feature, default is 5, meaning 5 words before and after the target feature

weights

a vector of weights applied to each distance from 1:window, strictly decreasing by default; can be a custom-defined vector of length window

ordered

if TRUE, the numbers of times that a term appears before and after the target feature are counted separately. Only makes sense for context = "window".

span_sentence

if FALSE, then word windows will not span sentences

tri

if TRUE, return only the upper triangle (including the diagonal)

...

not used here

Details

The function fcm provides a very general implementation of a "context-feature" matrix, consisting of a count of feature co-occurrence within a defined context. This context, following Momtazi et al. (2010), can be defined as the document, sentences within documents, syntactic relationships between features (nouns within a sentence, for instance), or according to a window. When the context is a window, a weighting function is typically applied that is a function of distance from the target word (see Jurafsky and Martin 2015, Ch. 16) and ordered co-occurrence of the two features is considered (see Church & Hanks 1990).
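To make the window idea concrete, the following base R sketch counts co-occurrences for a single tokenized text, with one weight per distance and optional ordering. window_cooc is a hypothetical helper written for this illustration, not a quanteda function, and its conventions (for instance how self co-occurrences end up on the diagonal) are assumptions, so its numbers need not match fcm output.

```r
# Hypothetical helper, not part of quanteda: for each target token, add
# weights[d] for the token that follows it at distance d (d = 1..window).
window_cooc <- function(tokens, window = 2L, weights = rep(1, window),
                        ordered = FALSE) {
  stopifnot(length(weights) == window)
  feats <- sort(unique(tokens))
  m <- matrix(0, length(feats), length(feats), dimnames = list(feats, feats))
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j > n) break
      # tokens[i] precedes tokens[j] at distance d
      m[tokens[i], tokens[j]] <- m[tokens[i], tokens[j]] + weights[d]
    }
  }
  # Unordered counts ignore direction, so add the "after" counts to the
  # "before" counts (assumption of this sketch: the diagonal is doubled too)
  if (!ordered) m <- m + t(m)
  m
}

toks <- strsplit("a b c a", " ")[[1]]
ordered_counts <- window_cooc(toks, window = 2L, weights = c(2, 1),
                              ordered = TRUE)
unordered_counts <- window_cooc(toks, window = 2L, weights = c(2, 1))

ordered_counts["a", "b"]    # 2: "b" directly follows "a" once (weight 2)
ordered_counts["b", "a"]    # 1: "a" follows "b" at distance 2 (weight 1)
unordered_counts["a", "b"]  # 3: both directions combined
```

With ordered = TRUE the matrix is generally asymmetric, because "a before b" and "b before a" are kept in separate cells. The unordered version in this sketch is symmetric by construction.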

fcm provides all of this functionality, returning a \(V \times V\) matrix (where \(V\) is the vocabulary size, returned by nfeat). The tri = TRUE option will only return the upper triangle of the matrix, including the diagonal.
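As a rough illustration of why returning only the upper triangle loses nothing for an unordered (symmetric) matrix, here is a base R sketch. The counts are made up for the example, and the code uses plain dense matrices rather than the sparse objects that fcm returns.

```r
# A small symmetric co-occurrence matrix with made-up counts
feats <- c("a", "b", "c")
full <- matrix(c(1, 3, 3,
                 3, 0, 2,
                 3, 2, 0),
               nrow = 3, byrow = TRUE, dimnames = list(feats, feats))

# Keep the upper triangle and the diagonal; zero out the redundant lower part
upper <- full
upper[lower.tri(upper)] <- 0

# The full symmetric matrix can be rebuilt from the upper triangle alone
rebuilt <- upper + t(upper) - diag(diag(upper))
all(rebuilt == full)  # TRUE
```

For \(V\) features, the upper triangle including the diagonal holds \(V(V+1)/2\) cells instead of \(V^2\).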

Unlike some implementations of co-occurrences, fcm counts feature co-occurrences with themselves, meaning that the diagonal will not be zero.
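One way to see how a feature can co-occur with itself is to count pairs of distinct token occurrences within a document. The base R sketch below does this for a single document. Treating "co-occurrence" as "number of distinct occurrence pairs" is an assumption made for illustration and is not taken from the fcm source, so the values may differ from what fcm reports.

```r
doc <- strsplit("a a a b b c", " ")[[1]]
n <- table(doc)            # term frequencies: a = 3, b = 2, c = 1
feats <- names(n)
counts <- as.numeric(n)

# Off-diagonal: every occurrence of i pairs with every occurrence of j
pairs <- outer(counts, counts)
# Diagonal: distinct pairs of occurrences of the same feature
diag(pairs) <- choose(counts, 2)
dimnames(pairs) <- list(feats, feats)

pairs["a", "a"]  # 3: three "a" tokens form choose(3, 2) = 3 pairs
pairs["a", "b"]  # 6: 3 occurrences of "a" times 2 occurrences of "b"
pairs["c", "c"]  # 0: a single "c" cannot pair with itself
```

Under this convention the diagonal is nonzero whenever a feature appears more than once in the same document.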

fcm also provides "boolean" counting within the context of "window", which differs from the counting within "document".

is.fcm(x) returns TRUE if and only if x is an object of class fcm.

References

Momtazi, S., Khudanpur, S., & Klakow, D. (2010). "A comparative study of word co-occurrence for term clustering in language model-based sentence retrieval." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, California, June 2010, pp. 325-328.

Jurafsky, D. & Martin, J. H. (2015). Speech and Language Processing. Draft of April 11, 2016. Chapter 16, Semantics with Dense Vectors.

Church, K. W. & Hanks, P. (1990). "Word association norms, mutual information, and lexicography." Computational Linguistics, 16(1):22–29.

Examples

# NOT RUN {
# see http://bit.ly/29b2zOA
txt <- "A D A C E A D F E B A C E D"
fcm(txt, context = "window", window = 2)
fcm(txt, context = "window", count = "weighted", window = 3)
fcm(txt, context = "window", count = "weighted", window = 3,
    weights = c(3, 2, 1), ordered = TRUE, tri = FALSE)

# with multiple documents
txts <- c("a a a b b c", "a a c e", "a c e f g")
fcm(txts, context = "document", count = "frequency")
fcm(txts, context = "document", count = "boolean")
fcm(txts, context = "window", window = 2)


# from tokens
txt <- c("The quick brown fox jumped over the lazy dog.",
         "The dog jumped and ate the fox.")
toks <- tokens(char_tolower(txt), remove_punct = TRUE)
fcm(toks, context = "document")
fcm(toks, context = "window", window = 3)
# }
