
text2vec (version 0.5.1)

create_vocabulary: Creates a vocabulary of unique terms

Description

This function collects unique terms and corresponding statistics; see the Value section below for details.

Usage

create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for character
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for itoken
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for list
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", ...)

# S3 method for itoken_parallel
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", ...)

Arguments

it

iterator over a list of character vectors, which are the documents from which the user wants to construct a vocabulary. See itoken. Alternatively, a character vector of user-defined vocabulary terms (which will be used "as is").

ngram

integer vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used.

stopwords

character vector of stopwords to filter out. NOTE that stopwords will be used "as is". This means that if the preprocessing function in itoken performs some text modification (such as stemming), the same preprocessing needs to be applied to the stopwords before passing them here. See https://github.com/dselivanov/text2vec/issues/228 for an example; a short sketch also follows this argument list.

sep_ngram

character. A string used to concatenate the words of an n-gram into a single term.

...

additional arguments passed to the foreach function.
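
A minimal sketch of the ngram and stopwords arguments, using the movie_review data shipped with text2vec; the particular stopword list and n-gram range below are purely illustrative:

library(text2vec)
data("movie_review")
txt = movie_review[['review']][1:100]

# unigrams and bigrams, joined with the default "_" separator
it = itoken(txt, tolower, word_tokenizer)
bigram_vocab = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))

# stopwords are matched "as is": the preprocessor above lowercases the text,
# so the stopwords must be supplied in lowercase as well
it = itoken(txt, tolower, word_tokenizer)
filtered_vocab = create_vocabulary(it, stopwords = c("the", "a", "and", "of"))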

Value

A text2vec_vocabulary object, which is a data.frame with the following columns:

term

character vector of unique terms

term_count

integer vector of term counts across all documents

doc_count

integer vector giving the number of documents that contain the corresponding term

The object also carries meta-information in attributes:

  • ngram: integer vector, the lower and upper boundary of the n-gram range

  • document_count: integer, the number of documents the vocabulary was built from

  • stopwords: character vector of stopwords

  • sep_ngram: character separator used for n-grams
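
A minimal sketch of inspecting the returned object, assuming the columns and attributes are stored under the names documented above (the data and iterator setup are illustrative):

library(text2vec)
data("movie_review")
it = itoken(movie_review[['review']][1:100], tolower, word_tokenizer)
vocab = create_vocabulary(it)

head(vocab)                    # data.frame with term, term_count, doc_count
attr(vocab, "ngram")           # c(ngram_min = 1L, ngram_max = 1L)
attr(vocab, "document_count")  # number of documents the vocabulary was built from
attr(vocab, "stopwords")       # character(0) in this case
attr(vocab, "sep_ngram")       # "_"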

Methods (by class)

  • character: creates a text2vec_vocabulary from a predefined character vector. Terms are inserted as is, without any checks (number of n-grams, n-gram delimiters, etc.); see the sketch after this list.

  • itoken: collects unique terms and corresponding statistics from the object.

  • list: collects unique terms and corresponding statistics from a list of itoken iterators. If a parallel backend is registered, the vocabulary is built in parallel using foreach.

  • itoken_parallel: collects unique terms and corresponding statistics from the iterator. If a parallel backend is registered, the vocabulary is built in parallel using foreach.
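
A minimal sketch of the character and list methods; the custom terms and the two-way split of documents below are purely illustrative:

library(text2vec)

# character method: a predefined vocabulary, terms inserted as is
custom_vocab = create_vocabulary(c("good", "bad", "movie", "film"))

# list method: one itoken iterator per chunk of documents; with a parallel
# backend registered (e.g. via doParallel) the chunks can be processed via foreach
data("movie_review")
txt = movie_review[['review']][1:100]
it_list = list(
  itoken(txt[1:50], tolower, word_tokenizer),
  itoken(txt[51:100], tolower, word_tokenizer)
)
vocab = create_vocabulary(it_list)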

Examples

library(text2vec)
data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
# keep terms that occur at least 10 times and appear in 0.1%-80% of documents,
# capped at 20000 terms
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
                                doc_proportion_min = 0.001, vocab_term_max = 20000)
