The calculations are done with the word2vec package.
word2vec(
text,
tokenizer = text2vec::space_tokenizer,
dim = 50,
type = c("cbow", "skip-gram"),
window = 5L,
min_count = 5L,
loss = c("ns", "hs"),
negative = 5L,
n_iter = 5L,
lr = 0.05,
sample = 0.001,
stopwords = character(),
threads = 1L,
collapse_character = "\t",
composition = c("tibble", "data.frame", "matrix")
)
text: Character string.
tokenizer: Function to perform tokenization. Defaults to text2vec::space_tokenizer.
dim: Dimension of the word vectors. Defaults to 50.
type: The type of algorithm to use, either "cbow" or "skip-gram". Defaults to "cbow".
window: Skip length between words. Defaults to 5.
min_count: Integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.
loss: Character, choice of loss function; must be one of "ns" or "hs". See details for more. Defaults to "ns".
negative: Integer, the number of negative samples. Only used when loss is "ns". Defaults to 5.
n_iter: Integer, number of training iterations. Defaults to 5.
lr: Initial learning rate, also known as alpha. Defaults to 0.05.
sample: Threshold for occurrence of words; words that occur more frequently are randomly down-sampled. Defaults to 0.001.
stopwords: A character vector of stopwords to exclude from training.
threads: Number of CPU threads to use. Defaults to 1.
collapse_character: Character vector of length 1. Character used to glue together tokens after tokenizing. See details for more information. Defaults to "\t".
composition: Character, either "tibble", "matrix", or "data.frame", giving the format of the resulting word vectors.
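The usage lists several arguments with a vector default, such as type = c("cbow", "skip-gram"), while the descriptions name a single default. A minimal sketch of how that usually works in R, assuming the function resolves these arguments with match.arg() (a common idiom, not confirmed by this page); pick_type is a hypothetical helper, not part of the package:

```r
# Hypothetical helper, not part of the package: it only mimics the
# c("cbow", "skip-gram") style of default shown in the usage.
pick_type <- function(type = c("cbow", "skip-gram")) {
  # match.arg() returns the first choice when the argument is left at
  # its default, and errors on a value that is not among the choices
  match.arg(type)
}

pick_type()             # first choice, "cbow"
pick_type("skip-gram")  # explicit choice
```

Under that assumption, leaving type, loss, or composition unset selects the first listed value.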
A tibble, data.frame or matrix containing the token in the first column and word vectors in the remaining columns.
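As an illustration of working with that layout, the sketch below builds a small data.frame by hand with the token in the first column and numeric columns after it, then compares rows by cosine similarity. The values and column names are made up for the example and are not output from the package:

```r
# Made-up vectors in the documented layout: token first, numeric columns after.
# The column names here are placeholders, not the package's actual names.
vectors <- data.frame(
  tokens = c("king", "queen", "apple"),
  V1 = c(0.9, 0.8, 0.1),
  V2 = c(0.1, 0.3, 0.9),
  stringsAsFactors = FALSE
)

# Move the tokens into row names so the rest is a plain numeric matrix
m <- as.matrix(vectors[, -1])
rownames(m) <- vectors[[1]]

# Cosine similarity between two word vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(m["king", ], m["queen", ])
cosine(m["king", ], m["apple", ])
```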
A trade-off has been made to allow for an arbitrary tokenizing function. The
text is first passed through the tokenizer. It is then collapsed back
together into strings using collapse_character as the separator. You
need to pick collapse_character to be a character that will not appear
in any of the tokens after tokenizing is done. The default value is a tab
character. If you pick a character that is present in the tokens, those
tokens will be split.
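The tokenize, collapse, and split round trip can be sketched in base R. This is only an illustration of the behaviour described above, not the package's internal code; tokenize and round_trip are hypothetical names:

```r
# Toy tokenizer standing in for the `tokenizer` argument (illustrative only)
tokenize <- function(x) strsplit(x, " ", fixed = TRUE)

# Collapse each document's tokens into one string, then split it again
round_trip <- function(text, collapse_character = "\t") {
  tokens <- tokenize(text)
  collapsed <- vapply(tokens, paste, character(1), collapse = collapse_character)
  strsplit(collapsed, collapse_character, fixed = TRUE)
}

text <- "once-upon a time"

# With a tab separator the token "once-upon" survives intact
round_trip(text)

# With "-" as the separator, "once-upon" is split into "once" and "upon"
round_trip(text, collapse_character = "-")
```

The second call shows why the separator must not occur inside any token: three tokens go in and four come out.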
The loss function must be one of:
"ns" negative sampling
"hs" hierarchical softmax
Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff. 2013. Distributed Representations of Words and Phrases and their Compositionality
# NOT RUN {
word2vec(fairy_tales)
# Custom tokenizer that splits on non-alphanumeric characters
word2vec(fairy_tales, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))
# }