Conversion of text documents to word-vector-representation features ( Doc2Vec )
# utl <- Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL,
#                    print_every_rows = 10000, verbose = FALSE,
#                    copy_data = FALSE)
a matrix
Doc2Vec$new(token_list = NULL, word_vector_FILE = NULL, print_every_rows = 10000, verbose = FALSE, copy_data = FALSE)
--------------
doc2vec_methods(method = "sum_sqrt", global_term_weights = NULL, threads = 1)
--------------
pre_processed_wv()
new()
Doc2Vec$new(
token_list = NULL,
word_vector_FILE = NULL,
print_every_rows = 10000,
verbose = FALSE,
copy_data = FALSE
)
token_list
either NULL or a list of tokenized text documents
word_vector_FILE
a valid path to a text file, where the word-vectors are saved
print_every_rows
a numeric value greater than 1 specifying the print intervals. Frequent output in the R session can slow down the function especially in case of big files.
verbose
either TRUE or FALSE. If TRUE then information will be printed out in the R session.
copy_data
either TRUE or FALSE. If FALSE then a pointer will be created and no copy of the initial data takes place (memory efficient especially for big datasets). This is an alternative way to pre-process the data.
doc2vec_methods()
Doc2Vec$doc2vec_methods(
method = "sum_sqrt",
global_term_weights = NULL,
threads = 1
)
method
a character string specifying the method to use. One of sum_sqrt, min_max_norm or idf. See the details section for more information.
global_term_weights
either NULL or the output of the global_term_weights method of the textTinyR package. See the details section for more information.
threads
a numeric value specifying the number of cores to run in parallel
pre_processed_wv()
Doc2Vec$pre_processed_wv()
clone()
The objects of this class are cloneable with this method.
Doc2Vec$clone(deep = FALSE)
deep
Whether to make a deep clone.
The pre_processed_wv method should be used after the initialization of the Doc2Vec class, if the copy_data parameter is set to TRUE, in order to inspect the pre-processed word-vectors.
The global_term_weights method is part of the sparse_term_matrix R6 class of the textTinyR package. The correct global_term_weights can be obtained by using the sparse_term_matrix class with the tf_idf parameter set to FALSE and the normalize parameter set to NULL. In the Doc2Vec class, if method equals idf then the global_term_weights parameter must not be NULL.
Explanation of the various methods :
sum_sqrt : Assuming that a single sublist of the token list is taken into consideration : the word-vectors of each word of the sublist of tokens are accumulated to a vector whose length equals that of a word-vector (INITIAL_WORD_VECTOR). Then a scalar is computed from this INITIAL_WORD_VECTOR in the following way : the INITIAL_WORD_VECTOR is raised to the power of 2.0, the resulting vector is summed and the square-root of the sum is calculated. Finally, the INITIAL_WORD_VECTOR is divided by the resulting scalar.
min_max_norm : Assuming that a single sublist of the token list is taken into consideration : the word-vectors of each word of the sublist of tokens are first min-max normalized and then accumulated to a vector whose length equals that of the initial word-vector.
idf : Assuming that a single sublist of the token list is taken into consideration : the word-vector of each term in the sublist is multiplied by the corresponding idf of the global term weights.
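The three methods described above can be sketched in base R for a single sublist of tokens. This is only an illustration under stated assumptions, not the package's internal implementation: the 3-dimensional word vectors, the idf values and the helper names (sum_sqrt_sketch, min_max_norm_sketch, idf_sketch) are hypothetical. It is also assumed that min-max normalization is applied to each word vector separately, and that the idf-weighted vectors are accumulated in the same way as in the other two methods.

```r
# Hypothetical word vectors (one row per term); the values are made up for illustration
word_vecs <- rbind(
  the    = c(1, 2, 2),
  result = c(0, 1, 0),
  of     = c(2, 0, 4)
)

# A single sublist of the token list
tokens <- c("the", "result", "of")

# Hypothetical idf weights, one per term
idf <- c(the = 0.5, result = 2, of = 1)

# sum_sqrt : accumulate the word vectors, then divide the accumulated
# vector by the square root of the sum of its squared elements
sum_sqrt_sketch <- function(tokens, wv) {
  acc <- colSums(wv[tokens, , drop = FALSE])
  acc / sqrt(sum(acc^2.0))
}

# min_max_norm : min-max normalize each word vector (assumption: per vector),
# then accumulate; a constant vector would give a division by zero here
min_max_norm_sketch <- function(tokens, wv) {
  scaled <- t(apply(wv[tokens, , drop = FALSE], 1,
                    function(v) (v - min(v)) / (max(v) - min(v))))
  colSums(scaled)
}

# idf : multiply each word vector by the idf of its term,
# then accumulate (assumption: accumulation as in the other methods)
idf_sketch <- function(tokens, wv, idf) {
  colSums(wv[tokens, , drop = FALSE] * idf[tokens])
}

res_sum_sqrt <- sum_sqrt_sketch(tokens, word_vecs)
res_min_max  <- min_max_norm_sketch(tokens, word_vecs)
res_idf      <- idf_sketch(tokens, word_vecs, idf)
```

In this sketch each document is reduced to one vector with the same length as a word vector, so a list of documents would yield one row per document, which is consistent with the matrix returned by doc2vec_methods.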
There might be slight differences in the output data of each method depending on whether the copy_data parameter is set to TRUE or FALSE.
library(textTinyR)
#---------------------------------
# tokenized text in form of a list
#---------------------------------
tok_text = list(c('the', 'result', 'of'), c('doc2vec', 'are', 'vector', 'features'))
#-------------------------
# path to the word vectors
#-------------------------
PATH = system.file("example_files", "word_vecs.txt", package = "textTinyR")
init = Doc2Vec$new(token_list = tok_text, word_vector_FILE = PATH)
out = init$doc2vec_methods(method = "sum_sqrt")