trim: Trim a dfm using threshold-based or random feature selection

Description

Returns a document-feature matrix reduced in size according to term frequency and document frequency thresholds, and/or random subsampling of features.

Usage

trim(x, minCount = 1, minDoc = 1, sparsity = NULL, nsample = NULL, verbose = TRUE)
"trim"(x, minCount = 1, minDoc = 1, sparsity = NULL, nsample = NULL, verbose = TRUE)
trimdfm(x, ...)

Arguments

x

document-feature matrix of dfm-class

minCount

minimum count or fraction of a feature's occurrences across all documents

minDoc

minimum number or fraction of documents in which a feature appears

sparsity

equivalent to 1 - minDoc, included for comparison with tm

nsample

how many features to retain (based on random selection)

verbose

print messages

...

only included to allow legacy trimdfm to pass arguments to trim
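
The way the minCount, minDoc and sparsity thresholds interact can be sketched in base R on a plain numeric matrix. This is an illustration only, not the package's implementation: `trim_sketch`, `min_count` and `min_doc` are hypothetical names, and the handling of fractional values (a fractional count read as a share of all feature occurrences, a fractional document threshold read as a share of the documents) is an assumption based on the argument descriptions above.

```r
# Toy document-feature matrix: 4 documents (rows) x 5 features (columns).
m <- matrix(c(3, 0, 1, 0, 0,
              2, 1, 0, 0, 5,
              4, 0, 0, 1, 0,
              1, 0, 2, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("the", "rare", "some", "once", "burst")))

# Hypothetical helper (not part of the package): keep only the features
# that meet both the total-count and the document-frequency threshold.
trim_sketch <- function(m, min_count = 1, min_doc = 1) {
  total_count <- colSums(m)      # occurrences of each feature across all documents
  doc_freq    <- colSums(m > 0)  # number of documents containing each feature
  # Assumption: values below 1 are fractions and are converted to counts.
  if (min_count < 1) min_count <- min_count * sum(m)
  if (min_doc < 1)   min_doc   <- min_doc * nrow(m)
  m[, total_count >= min_count & doc_freq >= min_doc, drop = FALSE]
}

# Features occurring at least 3 times overall and in at least 2 documents.
trimmed <- trim_sketch(m, min_count = 3, min_doc = 2)
colnames(trimmed)

# A sparsity of 0.5 corresponds to a document fraction of 1 - 0.5.
from_sparsity <- trim_sketch(m, min_doc = 1 - 0.5)
colnames(from_sparsity)
```

In this sketch "burst" is dropped even though it occurs five times, because all of its occurrences fall in a single document; the two thresholds are applied jointly.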

Value

A dfm-class object reduced in features (with the same number of documents)
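
As a rough illustration of the random selection controlled by nsample, the base R sketch below draws a fixed number of feature columns from a plain matrix and leaves the document rows untouched. The names `n_sample`, `keep` and `sampled` are hypothetical, and the sketch is not the package's own code. The sampling example on this page suggests that thresholds are applied before sampling, but that ordering is an inference and is not shown here.

```r
set.seed(42)  # fix the seed so the random selection is reproducible

# Toy matrix: 3 documents (rows) x 4 features (columns).
m <- matrix(1:12, nrow = 3,
            dimnames = list(paste0("doc", 1:3), paste0("feat", 1:4)))

n_sample <- 2  # hypothetical stand-in for the nsample argument

# Draw column indices without replacement, never more than are available.
keep <- sort(sample.int(ncol(m), size = min(n_sample, ncol(m))))

# Subset the columns only; drop = FALSE keeps the result a matrix.
sampled <- m[, keep, drop = FALSE]
dim(sampled)
```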

Examples

(myDfm <- dfm(inaugCorpus, verbose = FALSE))
# only words occurring >=10 times and in >=2 docs
trim(myDfm, minCount = 10, minDoc = 2)
# only words occurring >=10 times and in at least 0.4 of the documents
trim(myDfm, minCount = 10, minDoc = 0.4)
# only words meeting a fractional minCount of 0.01 and occurring in >=2 documents
trim(myDfm, minCount = .01, minDoc = 2)
# only words occurring 5 times in 1000 and in at least 0.2 of the documents
trim(myDfm, minDoc = 0.2, minCount = 0.005)
# sample 50 words occurring at least 20 times each
(myDfmSampled <- trim(myDfm, minCount = 20, nsample = 50))  
topfeatures(myDfmSampled)
## Not run: 
# if (require(tm)) {
#     (tmdtm <- convert(myDfm, "tm"))
#     removeSparseTerms(tmdtm, 0.7)
#     trim(myDfm, minDoc = 0.3)
#     trim(myDfm, sparsity = 0.7)
# }
# ## End(Not run)
