corpus_sample: randomly sample documents from a corpus

Description

Takes a random sample of documents of the specified size from a corpus, with or without replacement. Works just as sample() works, applied to the documents and their associated document-level variables.

Usage

corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL,
  by = NULL, ...)

Arguments

x

a corpus object whose documents will be sampled

size

a positive number, the number of documents to select

replace

Should sampling be with replacement?

prob

A vector of probability weights for obtaining the elements of the vector being sampled.

by

a grouping variable for sampling. Useful for resampling sub-document units such as sentences, for instance by specifying by = "document"
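To make the idea of sampling within a grouping variable concrete, here is a minimal base-R sketch. It is not the package's implementation: the vectors sentences and doc_id are hypothetical toy data, and the sketch assumes that each group is resampled to its own original size.

```r
# Illustrative only: a base-R sketch of sampling within groups.
# `sentences` and `doc_id` are hypothetical toy data, not quanteda objects.
sentences <- c("Sentence one.", "Sentence two.", "Third sentence.",
               "First sentence, doc2.", "Second sentence, doc2.")
doc_id <- c("one", "one", "one", "two", "two")

set.seed(1)  # make the draw reproducible

# Split the positions by group, then resample each group with replacement.
# Assumption: every group is resampled to its own original size.
idx_by_group <- split(seq_along(sentences), doc_id)
resampled_idx <- unlist(lapply(idx_by_group, function(i) {
  i[sample.int(length(i), size = length(i), replace = TRUE)]
}), use.names = FALSE)

resampled <- sentences[resampled_idx]
table(doc_id[resampled_idx])  # sentence count per document after resampling
```

Because sampling happens separately inside each group, a sentence is never moved into a different document, which is the point of resampling sub-document units this way.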

...

unused

Value

A corpus object with number of documents equal to size, drawn from the corpus x. The returned corpus object will contain all of the meta-data of the original corpus, and the same document variables for the documents selected.
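As a quick check of the return value described here, the following sketch samples from a small toy corpus and inspects the result. The corpus toy and its year variable are hypothetical data made up for illustration; only corpus(), corpus_sample(), ndoc() and docvars() from quanteda are used.

```r
library(quanteda)

# A small toy corpus with one document-level variable (hypothetical data).
toy <- corpus(c(d1 = "First text.", d2 = "Second text.",
                d3 = "Third text.", d4 = "Fourth text."),
              docvars = data.frame(year = c(2001, 2002, 2003, 2004)))

set.seed(42)                         # make the draw reproducible
smp <- corpus_sample(toy, size = 2)  # sample two documents without replacement

ndoc(smp)     # number of documents in the sample
docvars(smp)  # document variables carried over for the sampled documents
```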

Examples

# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, 5))
summary(corpus_sample(data_corpus_inaugural, 10, replace = TRUE))

# sampling sentences within document
doccorpus <- corpus(c(one = "Sentence one.  Sentence two.  Third sentence.",
                      two = "First sentence, doc2.  Second sentence, doc2."))
sentcorpus <- corpus_reshape(doccorpus, to = "sentences")
texts(sentcorpus)
texts(corpus_sample(sentcorpus, replace = TRUE, by = "document"))
