This function constructs a term-co-occurrence matrix (TCM). A TCM is usually used with the GloVe word embedding model.
create_tcm(it, vectorizer, skip_grams_window = 5L,
skip_grams_window_context = c("symmetric", "right", "left"),
weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE,
...)
# S3 method for itoken
create_tcm(it, vectorizer, skip_grams_window = 5L,
skip_grams_window_context = c("symmetric", "right", "left"),
weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE,
...)
# S3 method for itoken_parallel
create_tcm(it, vectorizer,
skip_grams_window = 5L, skip_grams_window_context = c("symmetric",
"right", "left"), weights = 1/seq_len(skip_grams_window),
binary_cooccurence = FALSE, ...)
it
list of iterators over tokens from itoken. Each element is a list of tokens, that is, tokenized and normalized strings.
vectorizer
function, a vectorizer function. See vectorizers.
skip_grams_window
integer window size for term-co-occurrence matrix construction. skip_grams_window should be > 0 if you plan to use vectorizer in the create_tcm function. A value of 0L means the TCM is not constructed.
skip_grams_window_context
one of c("symmetric", "right", "left"): which context words to use when counting co-occurrence statistics.
weights
weights for context words, by distance from the current word, used when calculating co-occurrence statistics. By default, weight = 1 / distance_from_current_word. Should have length equal to skip_grams_window.
binary_cooccurence
FALSE by default. If set to TRUE, the function counts only the first appearance of the context word and ignores the remaining occurrences. Useful when creating a TCM for evaluating the coherence of topic models.
skip_grams_window_context is "symmetric" by default: skip_grams_window words to the left and to the right are taken into account.
...
placeholder for additional arguments (not used at the moment).
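To make the roles of the weights and skip_grams_window_context arguments concrete, here is a minimal base-R sketch of weighted co-occurrence counting over a single token sequence. The helper toy_cooccurrence is hypothetical and written only for illustration. It is not how create_tcm is implemented, and the matrix returned by create_tcm may be stored or laid out differently.

```r
# Illustrative only: a plain-R weighted co-occurrence count for one token
# sequence. toy_cooccurrence is a hypothetical helper, NOT part of text2vec.
toy_cooccurrence = function(tokens, window = 2L,
                            context = c("symmetric", "right", "left"),
                            weights = 1 / seq_len(window)) {
  context = match.arg(context)
  # one weight per distance 1..window, as the argument description requires
  stopifnot(length(weights) == window)
  terms = unique(tokens)
  m = matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  n = length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      # context word d positions to the right of the current word
      if (context != "left" && i + d <= n) {
        m[tokens[i], tokens[i + d]] = m[tokens[i], tokens[i + d]] + weights[d]
      }
      # context word d positions to the left of the current word
      if (context != "right" && i - d >= 1) {
        m[tokens[i], tokens[i - d]] = m[tokens[i], tokens[i - d]] + weights[d]
      }
    }
  }
  m
}

tokens = c("a", "b", "c")
right = toy_cooccurrence(tokens, window = 2L, context = "right")
sym = toy_cooccurrence(tokens, window = 2L, context = "symmetric")
```

With the default weights, a context word at distance 1 contributes 1 and a context word at distance 2 contributes 0.5. In this sketch the "symmetric" result equals the "right" result plus its transpose.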
A dgTMatrix containing the TCM.
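Since the return value is a dgTMatrix (the triplet sparse format from the Matrix package), its entries can be read directly from the i, j and x slots. The sketch below builds a small stand-in matrix by hand, with made-up values, rather than calling create_tcm. The names toy_tcm and triplets are illustrative.

```r
library(Matrix)

# A hand-built 3 x 3 stand-in for a TCM; the values are made up for illustration.
terms = c("a", "b", "c")
toy_tcm = new("dgTMatrix",
              i = c(0L, 0L, 1L),   # zero-based row indices
              j = c(1L, 2L, 2L),   # zero-based column indices
              x = c(1, 0.5, 1),    # co-occurrence weights
              Dim = c(3L, 3L),
              Dimnames = list(terms, terms))

# Recover (term, context, weight) triplets from the slots.
triplets = data.frame(
  term = terms[toy_tcm@i + 1L],
  context = terms[toy_tcm@j + 1L],
  weight = toy_tcm@x
)
```

The same slot access should apply to a matrix returned by create_tcm, using rownames() and colnames() of the result in place of terms.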
If a parallel backend is registered, the TCM is constructed in multiple threads.
Keep in mind that you should split the data and provide a list
of itoken iterators. Each element of it will be handled
in a separate thread, and the results are combined at the end of processing.
# NOT RUN {
library(text2vec)
data("movie_review")
# single thread
tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
# build the vocabulary from the iterator defined above
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)
# parallel version
# register a parallel backend first, sized to the number of cores on your machine
N = 100  # number of reviews to process; illustrative value, not from the original example
it = itoken_parallel(movie_review$review[1:N], tolower, word_tokenizer, movie_review$id[1:N])
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = 'dgTMatrix')
tcm = create_tcm(it, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")
# }