dfm_trim: trim a dfm using frequency threshold-based feature selection

Description

Returns a document-feature matrix reduced in size based on document and term frequency, usually in terms of minimum frequencies, but optionally in terms of maximum frequencies. Setting both a minimum and a maximum selects features whose frequencies fall within a range.

Usage

dfm_trim(x, min_count = 1, min_docfreq = 1, max_count = NULL, max_docfreq = NULL, sparsity = NULL, verbose = TRUE)

Arguments

x

a dfm object

min_count

minimum count or fraction of features across all documents, below which features will be removed

min_docfreq

minimum number or fraction of documents in which a feature appears, below which features will be removed

max_count

maximum count or fraction of features across all documents, above which features will be removed. (Default is no upper limit.)

max_docfreq

maximum number or fraction of documents in which a feature appears, above which features will be removed. (Default is no upper limit.)

sparsity

equivalent to 1 - min_docfreq, included for comparison with tm

verbose

print messages
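
The threshold arguments above can be sketched on a plain count matrix in base R. This is only an illustration of the selection logic, not the package's implementation: `trim_sketch` and the toy matrix `m` are hypothetical, only document-frequency thresholds are given a fractional reading here, and it is an assumption that a value below 1 means a fraction of the documents.

```r
# Toy document-feature counts: 5 documents (rows) by 4 features (columns).
m <- matrix(c(6, 0, 0, 1,
              5, 2, 0, 0,
              0, 1, 0, 0,
              4, 0, 1, 0,
              3, 0, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("the", "liberty", "union", "duty")))

# Hypothetical helper (not part of any package) mirroring the arguments above.
trim_sketch <- function(m, min_count = 1, min_docfreq = 1,
                        max_count = NULL, max_docfreq = NULL) {
  count   <- colSums(m)      # total occurrences of each feature
  docfreq <- colSums(m > 0)  # number of documents containing each feature

  # Assumption: a document-frequency threshold below 1 is a fraction of documents.
  as_docs <- function(v) if (v < 1) v * nrow(m) else v

  keep <- count >= min_count & docfreq >= as_docs(min_docfreq)
  if (!is.null(max_count))   keep <- keep & count <= max_count
  if (!is.null(max_docfreq)) keep <- keep & docfreq <= as_docs(max_docfreq)

  # Only columns (features) are dropped; every document is kept.
  m[, keep, drop = FALSE]
}

# Minimum thresholds: at least 2 occurrences, in at least 0.4 of the documents.
colnames(trim_sketch(m, min_count = 2, min_docfreq = 0.4))    # "the" "liberty"

# Minimum and maximum together select a range of document frequencies.
colnames(trim_sketch(m, min_docfreq = 2, max_docfreq = 0.75)) # "liberty"
```

In the second call "the" is dropped because it appears in 4 of 5 documents, which exceeds the 0.75 ceiling, while "union" and "duty" fall below the floor of 2 documents.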

Value

A dfm reduced in features (with the same number of documents)

Examples


(myDfm <- dfm(data_corpus_inaugural[1:5]))

# keep only words occurring >=10 times and in >=2 docs
dfm_trim(myDfm, min_count = 10, min_docfreq = 2)

# keep only words occurring >=10 times and in at least 0.4 of the documents
dfm_trim(myDfm, min_count = 10, min_docfreq = 0.4)

# keep only words occurring <=10 times and in <=2 docs
dfm_trim(myDfm, max_count = 10, max_docfreq = 2)

# keep only words occurring <=10 times and in at most 3/4 of the documents
dfm_trim(myDfm, max_count = 10, max_docfreq = 0.75)

# keep only words occurring at least 1 time in 100, and in >=2 documents
dfm_trim(myDfm, min_count = .01, min_docfreq = 2)

# keep only words occurring at least 5 times in 1000, and in at least 2 of 5 documents
dfm_trim(myDfm, min_docfreq = 0.4, min_count = 0.005)

## Not run: 
# # compare to removeSparseTerms from the tm package
# if (require(tm)) {
#     (tmdtm <- convert(myDfm, "tm"))
#     removeSparseTerms(tmdtm, 0.7)
#     dfm_trim(myDfm, min_docfreq = 0.3)
#     dfm_trim(myDfm, sparsity = 0.7)
# }
# }
# ## End(Not run)
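
The tm comparison above rests on the stated relationship sparsity = 1 - min_docfreq. A minimal base R sketch of that arithmetic follows, using a hypothetical toy matrix `m` and taking a feature's sparsity to be the share of documents in which it does not occur. It does not call quanteda or tm, and either package may treat values exactly at the boundary differently.

```r
# Toy document-feature counts: 5 documents (rows) by 4 features (columns).
m <- matrix(c(6, 0, 0, 1,
              5, 2, 0, 0,
              0, 1, 0, 0,
              4, 0, 1, 0,
              3, 0, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("the", "liberty", "union", "duty")))

docfreq_share <- colSums(m > 0) / nrow(m)  # share of documents containing each feature
sparsity      <- 1 - docfreq_share         # share of documents lacking each feature

# Keeping features with sparsity of at most 0.7 ...
by_sparsity <- names(which(sparsity <= 0.7))

# ... selects the same features as requiring a document share of at least 1 - 0.7.
by_docfreq <- names(which(docfreq_share >= 1 - 0.7))

identical(by_sparsity, by_docfreq)  # TRUE
by_sparsity                         # "the" "liberty"
```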
