This family of functions creates iterators over input objects for building vocabularies and DTM or TCM matrices. The iterators are typically consumed by the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See them for details.
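A minimal end-to-end sketch of this workflow (assuming the movie_review dataset bundled with the package, as in the examples below):

library(text2vec)
data("movie_review")
# iterator over lowercased, word-tokenized reviews
it = itoken(movie_review$review[1:100], tolower, word_tokenizer)
# build a vocabulary from the iterator, then a document-term matrix
v = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(v))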
itoken(iterable, ...)

# S3 method for character
itoken(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 10,
progressbar = interactive(), ids = NULL, ...)
# S3 method for list
itoken(iterable, n_chunks = 10,
progressbar = interactive(), ids = names(iterable), ...)
# S3 method for iterator
itoken(iterable, preprocessor = identity,
tokenizer = space_tokenizer, progressbar = interactive(), ...)
itoken_parallel(iterable, ...)
# S3 method for character
itoken_parallel(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 10, ids = NULL, ...)
# S3 method for iterator
itoken_parallel(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 1L, ...)
# S3 method for list
itoken_parallel(iterable, n_chunks = 10, ids = NULL,
...)
iterable: an object from which to generate an iterator.

...: arguments passed to other methods.
preprocessor: a function which takes a chunk of character vectors and does all pre-processing. Usually preprocessor should return a character vector of preprocessed/cleaned documents. See the "Details" section.
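For example, a hypothetical cleaning function (a sketch; any vectorized character-to-character function works, and txt here is a raw character vector as in the examples below):

# lowercase, then replace everything except letters and spaces
clean_fun = function(x) gsub("[^a-z ]", " ", tolower(x))
it = itoken(txt, preprocessor = clean_fun, tokenizer = word_tokenizer)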
tokenizer: a function which takes a character vector from preprocessor, splits it into tokens and returns a list of character vectors. If you need to perform stemming, call the stemmer inside the tokenizer (see the examples section and the sketch below).
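A sketch of such a stemming tokenizer (assuming the SnowballC package is installed; it mirrors the commented example at the bottom of this page):

stem_tokenizer = function(x) {
  # tokenize first, then stem each document's tokens
  lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
}
it = itoken(txt, tolower, stem_tokenizer)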
n_chunks: integer, the number of pieces the input object should be divided into. Each chunk is then processed independently (and, in the case of itoken_parallel, in parallel if a parallel backend is registered). There is usually a trade-off: a larger number of chunks means a lower memory footprint but slower execution, while a smaller number of chunks means a larger memory footprint but faster execution (in both cases assuming the user-supplied preprocessor and tokenizer functions are efficiently vectorized).
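For instance, to trade speed for a smaller peak memory footprint on a large corpus (a sketch; 100 is an arbitrary choice):

# many small chunks: lower peak memory, more per-chunk overhead
it = itoken(txt, tolower, word_tokenizer, n_chunks = 100)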
progressbar: logical, indicates whether to show a progress bar.
ids: vector of document ids. If ids is not provided, names(iterable) will be used. If names(iterable) == NULL, incremental ids will be assigned.
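A sketch of this fallback behavior, using the list method with hypothetical two-document data:

tokens = list(doc_a = c("first", "doc"), doc_b = c("second", "doc"))
it = itoken(tokens)          # ids taken from names(tokens): "doc_a", "doc_b"
it = itoken(unname(tokens))  # no names, so incremental ids are assigned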
S3 methods for creating an itoken iterator from different input types:

list: all elements of the input list should be character vectors containing tokens.

character: raw text source; the user must provide a tokenizer function.

ifiles: from files; the user must provide a function to read in the file (passed to ifiles) and a function to tokenize it (passed to itoken). A sketch follows this list.

idir: from a directory; the user must provide a function to read in the files (passed to idir) and a function to tokenize it (passed to itoken).

ifiles_parallel: from files in parallel.
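A sketch of the ifiles route (the file paths are hypothetical; readLines is one possible reader function):

it_files = ifiles(c("doc1.txt", "doc2.txt"), reader = readLines)
it = itoken(it_files, preprocessor = tolower, tokenizer = word_tokenizer)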
See also: ifiles, idir, create_vocabulary, create_dtm, vectorizers, create_tcm
Examples:
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)
# Example of stemming tokenizer
# stem_tokenizer = function(x) {
# lapply(word_tokenizer(x), SnowballC::wordStem, language="en")
# }
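# itoken_parallel processes chunks in parallel only if a parallel backend is
# registered; one possible setup (an assumption, not required by this example):
# doParallel::registerDoParallel(2)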
it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2^16), type = 'dgTMatrix'))