fdm2id (version 0.9.6)

getvocab: Extract words and phrases from a corpus

Description

Extract words and phrases from a corpus of documents.

Usage

getvocab(
  corpus,
  mincount = 5,
  minphrasecount = NULL,
  ngram = 1,
  lang = "en",
  stopwords = lang,
  ...
)

Value

The vocabulary (words and phrases) extracted from the corpus of documents.

Arguments

corpus

The corpus of documents (a character vector).

mincount

Minimum number of occurrences for a word to be considered frequent.

minphrasecount

Minimum number of occurrences for a collocation (phrase) to be considered frequent.

ngram

Maximum size of n-grams.

lang

The language of the documents, used for stemming (NULL for no stemming).

stopwords

The stop words to remove, or the language of the documents. NULL if stop words should not be removed.

...

Other parameters.
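
To make the roles of mincount and ngram more concrete, here is a minimal base R sketch of counting terms and applying a frequency threshold. It is an illustration only, not the fdm2id implementation: the toy corpus docs, the tokenization rule, and the reading of mincount as "keep terms with at least this many occurrences" are all assumptions made for the example.

```r
# Illustrative sketch in base R; NOT how getvocab is implemented.
# 'docs' is a hypothetical toy corpus (one string per document).
docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "the cat saw the dog")

# Split each document into lowercase word tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Count single words over the whole corpus.
counts <- table(unlist(tokens))

# Assumed meaning of a minimum count: keep words seen at least 'mincount' times.
mincount <- 2
frequent <- counts[counts >= mincount]

# Build 2-grams inside each document, the kind of term that
# a maximum n-gram size of 2 would add on top of single words.
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
bigram_counts <- table(bigrams)

print(frequent)
print(bigram_counts[bigram_counts >= mincount])
```

Raising the threshold shrinks the retained vocabulary, which is the trade-off mincount and minphrasecount control: fewer rare terms at the cost of coverage.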

See Also

plotzipf, stopwords, create_vocabulary

Examples

if (FALSE) {
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
vocab1 = getvocab (text) # With stemming
nrow (vocab1)
vocab2 = getvocab (text, lang = NULL) # Without stemming
nrow (vocab2)
}