This function collects unique terms and their corresponding statistics. See the sections below for details.
Usage

create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for character
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for itoken
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_")

# S3 method for list
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", ...)

# S3 method for itoken_parallel
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
  stopwords = character(0), sep_ngram = "_", ...)
Arguments

it: iterator over a list of character vectors, which are the documents from
  which the user wants to construct a vocabulary. See itoken. Alternatively, a
  character vector of user-defined vocabulary terms (which will be used "as is").

ngram: integer vector. The lower and upper boundary of the range of n-values
  for the n-grams to be extracted. All values of n such that
  ngram_min <= n <= ngram_max will be used.

stopwords: character vector of stopwords to filter out. NOTE that stopwords
  are used "as is". This means that if the preprocessing function in itoken
  modifies the text (for example by stemming), then the same preprocessing must
  be applied to the stopwords before passing them here; see the sketch after
  this list and https://github.com/dselivanov/text2vec/issues/228 for an example.

sep_ngram: character string used to concatenate words within an n-gram.

...: additional arguments passed to the foreach function.
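The stopwords note above matters whenever the itoken preprocessing changes token forms. The following is a minimal sketch, assuming the SnowballC package is available for stemming (it is not part of text2vec); the stem_tokenizer helper and the stopword list are hypothetical and only used for illustration.

library(text2vec)
library(SnowballC)  # assumption: installed; provides wordStem()

data("movie_review")

# hypothetical tokenizer that stems every token
stem_tokenizer = function(x) {
  lapply(word_tokenizer(x), wordStem, language = "english")
}

# apply the same preprocessing (lower-casing + stemming) to the stopwords
raw_stopwords = c("movies", "watching", "actors")
prepared_stopwords = wordStem(tolower(raw_stopwords), language = "english")

it = itoken(movie_review$review[1:100], tolower, stem_tokenizer)
vocab = create_vocabulary(it, stopwords = prepared_stopwords)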
Value

A text2vec_vocabulary object, which is actually a data.frame with the following
columns (a short example of inspecting them follows below):

term: character vector of unique terms

term_count: integer vector of term counts across all documents

doc_count: integer vector of the number of documents that contain the
  corresponding term
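A quick way to look at these columns, as a small sketch using toy documents that are not part of the original documentation:

library(text2vec)

it = itoken(c("a b b", "b c"), tokenizer = word_tokenizer)
vocab = create_vocabulary(it)

vocab$term        # unique terms
vocab$term_count  # total occurrences across all documents
vocab$doc_count   # number of documents containing each term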
Methods (by class)

character: creates a text2vec_vocabulary from a predefined character vector.
  Terms are inserted as is, without any checks (number of n-grams, n-gram
  delimiters, etc.). See the sketch after this list.

itoken: collects unique terms and corresponding statistics from the object.

list: collects unique terms and corresponding statistics from a list of itoken
  iterators. If a parallel backend is registered, the vocabulary is built in
  parallel using foreach.

itoken_parallel: collects unique terms and corresponding statistics from the
  iterator. If a parallel backend is registered, the vocabulary is built in
  parallel using foreach.
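To illustrate the character and list methods described above, here is a hedged sketch; the variable names and the document split are mine, not from the original page.

library(text2vec)
data("movie_review")

# character method: terms from a predefined vector are inserted as is
custom_vocab = create_vocabulary(c("good", "bad", "movie"))

# list method: merge statistics collected from several itoken iterators,
# each built over a disjoint chunk of the documents
it1 = itoken(movie_review$review[1:50], tolower, word_tokenizer)
it2 = itoken(movie_review$review[51:100], tolower, word_tokenizer)
vocab = create_vocabulary(list(it1, it2))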
Examples

data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
                                doc_proportion_min = 0.001, vocab_term_max = 20000)
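A parallel variant of the example above, sketched under the assumption that the doParallel package is available to register a foreach backend (the chunk count and core count are arbitrary choices for illustration):

library(text2vec)
library(doParallel)  # assumption: installed; provides registerDoParallel()

data("movie_review")
registerDoParallel(2)  # register a backend with 2 workers

it_par = itoken_parallel(movie_review[['review']][1:100], tolower,
                         word_tokenizer, n_chunks = 4)
vocab_par = create_vocabulary(it_par)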