This function collects unique terms and corresponding statistics. See below for details.
Usage

create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
# S3 method for character
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
# S3 method for itoken
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
# S3 method for itoken_parallel
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
Value

text2vec_vocabulary object, which is actually a data.frame with the following columns:

term: character vector of unique terms.
term_count: integer vector of term counts across all documents.
doc_count: integer vector with the number of documents that contain the corresponding term.

It also contains meta-information in attributes:

ngram: integer vector, the lower and upper boundary of the range of n-gram values.
document_count: integer, the number of documents the vocabulary was built from.
stopwords: character vector of stopwords.
sep_ngram: character, the separator used for n-grams.
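
For example, a minimal sketch of inspecting the returned object (using the movie_review data shipped with text2vec; the attribute names follow the list above):

library(text2vec)
data("movie_review")
it = itoken(movie_review$review[1:10], tolower, word_tokenizer)
v = create_vocabulary(it)
head(v)                    # columns term, term_count, doc_count
attr(v, "document_count")  # number of documents the vocabulary was built from
attr(v, "ngram")           # c(ngram_min = 1L, ngram_max = 1L) by default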
Arguments

it: iterator over a list of character vectors, which are the documents from which the user wants to construct a vocabulary. See itoken. Alternatively, a character vector of user-defined vocabulary terms (which will be used "as is").

ngram: integer vector. The lower and upper boundary of the range of n-values for the n-grams to be extracted. All values of n such that ngram_min <= n <= ngram_max will be used.

stopwords: character vector of stopwords to filter out. NOTE that the stopwords will be used "as is": if the preprocessing function in itoken performs some text modification (like stemming), the same preprocessing needs to be applied to the stopwords before passing them here. See https://github.com/dselivanov/text2vec/issues/228 for an example, and the sketch after this list.

sep_ngram: character string used to concatenate the words within an n-gram.

window_size: integer (0 by default). If window_size > 0, then the vocabulary will be created from pseudo-documents, which are obtained by virtually splitting each document into chunks of length window_size with a sliding window. This is useful for collecting the statistics used for coherence estimation in topic models. See the sketch after this list.

...: placeholder for additional arguments (not used at the moment).
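
A short sketch illustrating the ngram, stopwords, and window_size arguments together (again on movie_review; the preprocessing here is tolower, so the stopwords are supplied in lowercase to match):

library(text2vec)
data("movie_review")
it = itoken(movie_review$review[1:100], tolower, word_tokenizer)
# unigrams plus bigrams, bigram words joined by "_" (e.g. "not_good");
# stopwords match the lowercased tokens produced by tolower above
v = create_vocabulary(it, ngram = c(1L, 2L),
stopwords = c("the", "a", "an"), sep_ngram = "_")

# a fresh iterator for a second pass: split each document into sliding
# chunks of 10 tokens (pseudo-documents), e.g. for coherence statistics
it = itoken(movie_review$review[1:100], tolower, word_tokenizer)
v_win = create_vocabulary(it, window_size = 10L)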
Methods (by class)

character: creates a text2vec_vocabulary from a predefined character vector. Terms will be inserted as is, without any checks (number of n-grams, n-gram delimiters, etc.). See the sketch after this list.

itoken: collects unique terms and corresponding statistics from the object.

itoken_parallel: collects unique terms and corresponding statistics from the iterator.
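
A sketch of the character method, with a hand-picked term list (the terms are inserted as is):

library(text2vec)
v = create_vocabulary(c("apple", "banana", "orange"))
v$term  # the three terms, unchanged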
data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
doc_proportion_min = 0.001, vocab_term_max = 20000)
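
The pruned vocabulary is typically passed to vocab_vectorizer and then create_dtm to build a document-term matrix. A short continuation of the example above (re-creating the iterator for a second pass over the documents):

vectorizer = vocab_vectorizer(pruned_vocab)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
dtm = create_dtm(it, vectorizer)
dim(dtm)  # 100 documents by the number of retained terms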