dfm: create a document-feature matrix

Description

Create a sparse document-feature matrix from a corpus or a vector of texts. The sparse matrix construction uses the Matrix package, and is both much faster and much more memory efficient than the corresponding dense (regular matrix) representation. For details on the structure of the dfm class, see dfm-class.

Usage

dfm(x, ...)
"dfm"(x, verbose = TRUE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE, removeTwitter = FALSE, stem = FALSE, ignoredFeatures = NULL, keptFeatures = NULL, language = "english", thesaurus = NULL, dictionary = NULL, valuetype = c("glob", "regex", "fixed"), ...)
"dfm"(x, verbose = TRUE, toLower = FALSE, stem = FALSE, ignoredFeatures = NULL, keptFeatures = NULL, language = "english", thesaurus = NULL, dictionary = NULL, valuetype = c("glob", "regex", "fixed"), ...)
"dfm"(x, verbose = TRUE, groups = NULL, ...)
is.dfm(x)
as.dfm(x)

Arguments

x

corpus or character vector from which to generate the document-feature matrix

...

additional arguments passed to tokenize, which can include, for instance, ngrams and concatenator for tokenizing multi-token sequences

verbose

display messages if TRUE

toLower

convert texts to lowercase

removeNumbers

remove numbers, see tokenize

removePunct

remove punctuation, see tokenize

removeSeparators

remove separators (whitespace), see tokenize

removeTwitter

if FALSE, preserve # and @ characters, see tokenize

stem

if TRUE, stem words

ignoredFeatures

a character vector of user-supplied features to ignore, such as "stop words". To access one possible list (you may supply any list you wish), use stopwords(). The pattern matching type will be set by valuetype. For the behaviour of ignoredFeatures with ngrams > 1, see Details.

keptFeatures

a user-supplied pattern defining which features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set

keptFeatures = "@*"

and make sure that removeTwitter = FALSE is passed as an additional argument to tokenize. Note:

keptFeatures = "^@\\w+\\b"

would be the regular expression version of this matching pattern (used with valuetype = "regex"). The pattern matching type will be set by valuetype.

language

Language for stemming. Choices are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish.

thesaurus

A list of character vector "thesaurus" entries, in a dictionary list format, which operates as a dictionary but without excluding values not matched from the dictionary. Thesaurus keys are converted to upper case to create a feature label in the dfm, as a reminder that this was not a type found in the text, but rather the label of a thesaurus key. For more fine-grained control over this and other aspects of converting features into dictionary/thesaurus keys from pattern matches to values, you can use applyDictionary after creating the dfm.

dictionary

A list of character vector dictionary entries, including regular expressions (see examples)

valuetype

"fixed" for words as is; "regex" for regular expressions; or "glob" for "glob"-style wildcards. Glob format is the default. See selectFeatures.

groups

character vector containing the names of document variables for aggregating documents

Value

A dfm-class object containing a sparse matrix representation of the counts of features by document, along with associated settings and metadata.

Details

The default behavior for ignoredFeatures when constructing ngrams using dfm(x, ngrams > 1) is to remove any ngram that contains any item in ignoredFeatures. If you wish to remove these before constructing ngrams, you will need to first tokenize the texts, then remove the features to be ignored, then form the ngrams from the remaining tokens, and finally construct the dfm using this modified tokenization object. See the code examples for an illustration.

is.dfm returns TRUE if and only if its argument is a dfm.

as.dfm coerces a matrix or data.frame to a dfm.
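
The two helpers described above can be sketched as follows. This is a minimal illustration rather than part of the package's own examples: the matrix m, its row and column names, and the object d are hypothetical, and the sketch assumes that quanteda is installed and that as.dfm accepts a plain numeric matrix with dimnames, as the description states.

```r
library(quanteda)

# Hypothetical 2 x 2 count matrix: rows are documents, columns are features
m <- matrix(c(1, 0, 2, 3), nrow = 2,
            dimnames = list(c("doc1", "doc2"), c("tax", "law")))

is.dfm(m)        # m is an ordinary matrix, not a dfm

d <- as.dfm(m)   # coerce the matrix to a dfm
is.dfm(d)        # d should now be recognised as a dfm
```

Coercing an existing count matrix in this way may be useful when the counts were produced outside of dfm() but are to be passed to functions that expect a dfm.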

Examples


# why we phased out dense matrix dfm objects
(size1 <- object.size(dfm(inaugTexts, verbose = FALSE)))
(size2 <- object.size(as.matrix(dfm(inaugTexts, verbose = FALSE))))
cat("Compacted by ", round(as.numeric((1-size1/size2)*100), 1), "%.\n", sep="")

# for a corpus
mydfm <- dfm(subset(inaugCorpus, Year>1980))
mydfm <- dfm(subset(inaugCorpus, Year>1980), toLower=FALSE)

# grouping documents by docvars in a corpus
mydfmGrouped <- dfm(subset(inaugCorpus, Year>1980), groups = "President")

# with English stopwords and stemming
dfmsInaug2 <- dfm(subset(inaugCorpus, Year>1980), 
                  ignoredFeatures=stopwords("english"), stem=TRUE)
# works for both words in ngrams too
dfm("Banking industry", stem = TRUE, ngrams = 2, verbose = FALSE)

# with dictionaries
mycorpus <- subset(inaugCorpus, Year>1900)
mydict <- list(christmas=c("Christmas", "Santa", "holiday"),
               opposition=c("Opposition", "reject", "notincorpus"),
               taxing="taxing",
               taxation="taxation",
               taxregex="tax*",
               country="united states")
dictDfm <- dfm(mycorpus, dictionary=mydict)
dictDfm

# with the thesaurus feature
mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised a taxes: an income tax and a sales tax.")
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
dfm(phrasetotoken(mytexts, mydict), thesaurus = lapply(mydict, function(x) gsub("\\s", "_", x)))
# pick up "taxes" with "tax" as a regex
dfm(phrasetotoken(mytexts, mydict), thesaurus = list(anytax = "tax"), valuetype = "regex")

# removing stopwords
testText <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
             the newspaper from a boy named Seamus, in his mouth."
testCorpus <- corpus(testText)
# note: "also" is not in the default stopwords("english")
features(dfm(testCorpus, ignoredFeatures = stopwords("english")))
# for ngrams
features(dfm(testCorpus, ngrams = 2, ignoredFeatures = stopwords("english")))
features(dfm(testCorpus, ngrams = 1:2, ignoredFeatures = stopwords("english")))

## removing stopwords before constructing ngrams
tokensAll <- tokenize(toLower(testText), removePunct = TRUE)
tokensNoStopwords <- removeFeatures(tokensAll, stopwords("english"))
tokensNgramsNoStopwords <- ngrams(tokensNoStopwords, 2)
features(dfm(tokensNgramsNoStopwords, verbose = FALSE))

# keep only certain words
dfm(testCorpus, keptFeatures = "*s", verbose = FALSE)  # keep only words ending in "s"
dfm(testCorpus, keptFeatures = "s$", valuetype = "regex", verbose = FALSE)

# testing Twitter functions
testTweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
                "2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
                "Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(testTweets, keptFeatures = "#*", removeTwitter = FALSE)  # keep only hashtags
dfm(testTweets, keptFeatures = "^#.*$", valuetype = "regex", removeTwitter = FALSE)
