- text
A character vector of text,
or a file path on disk containing text.
- method
Training algorithm:
"word2vec" (default): using the word2vec package
"glove": using the rsparse and text2vec packages
"fasttext": using the fastTextR package
- dims
Number of dimensions of word vectors to be trained.
Common choices include 50, 100, 200, 300, and 500.
Defaults to 300.
- window
Window size (number of nearby words behind/ahead of the current word).
It defines how many surrounding words are included in training:
[window] words behind and [window] words ahead ([window]*2 in total).
Defaults to 5.
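As a rough illustration of the window definition above, the following base-R sketch lists the context words for one position in a toy sentence. The helper name context_words is hypothetical and not part of any package; it only shows which neighbours a symmetric window covers, not how any training backend samples them.

```r
# Hypothetical helper: return the words within `window` positions
# behind and ahead of position `i` in a token vector.
context_words <- function(tokens, i, window = 5) {
  idx <- seq(max(1, i - window), min(length(tokens), i + window))
  tokens[setdiff(idx, i)]  # drop the current word itself
}

tokens <- c("the", "quick", "brown", "fox", "jumps",
            "over", "the", "lazy", "dog")

# Context of "jumps" (position 5) with window = 2:
context_words(tokens, i = 5, window = 2)
# "brown" "fox" "over" "the"
```

Near the start or end of the text, fewer than [window]*2 context words are available, which the sketch handles by clipping the index range.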
- min.freq
Minimum frequency of words to be included in training.
Words that appear fewer times than this value are excluded from the vocabulary.
Defaults to 5 (keep words that appear at least five times).
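The effect of a minimum-frequency cutoff can be sketched in base R with table(). This is only an illustration on a toy token vector; how the training backends apply the cutoff internally is not shown here.

```r
# Toy tokens; in practice these would come from the tokenized text.
tokens <- c("a", "b", "a", "c", "a", "b", "d")
min.freq <- 2

freq <- table(tokens)                     # word counts
vocab <- names(freq)[freq >= min.freq]    # keep words seen at least min.freq times
vocab
# "a" "b"
```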
- threads
Number of CPU threads used for training.
A moderate number of threads usually gives the fastest training;
adding more threads does not always help.
Defaults to 8.
- model
<Only for Word2Vec / FastText>
Learning model architecture:
"skip-gram" (default): Skip-Gram, which predicts surrounding words given the current word
"cbow": Continuous Bag-of-Words, which predicts the current word based on the context
- loss
<Only for Word2Vec / FastText>
Loss function (computationally efficient approximation):
- negative
<Only for Negative Sampling in Word2Vec / FastText>
Number of negative examples.
Values in the range 5~20 are useful for small training datasets,
while for large datasets the value can be as small as 2~5.
Defaults to 5.
- subsample
<Only for Word2Vec / FastText>
Subsampling of frequent words (threshold for occurrence of words).
Words that appear with a higher frequency in the training data are randomly down-sampled.
Defaults to 0.0001 (1e-04).
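To give a feel for what the threshold does, here is a minimal sketch of the discard probability as formulated in the original word2vec paper (Mikolov et al., 2013). Treat the formula as an assumption about the backend: actual implementations may use a slightly different variant. The function name discard_prob is hypothetical.

```r
# Assumed formula (Mikolov et al., 2013): P(discard) = 1 - sqrt(t / f),
# where f is a word's relative frequency and t is the subsample threshold.
# Negative values are clipped to 0 (the word is never discarded).
discard_prob <- function(f, subsample = 1e-4) {
  pmax(0, 1 - sqrt(subsample / f))
}

# Relative frequencies of a very frequent, a borderline, and a rare word:
discard_prob(c(0.01, 0.0001, 0.00001))
# 0.9 0.0 0.0
```

Under this assumption, a word making up 1% of all tokens is dropped about 90% of the time, while words at or below the threshold are always kept.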
- learning
<Only for Word2Vec / FastText>
Initial (starting) learning rate, also known as alpha.
Defaults to 0.05.
- ngrams
<Only for FastText>
Minimum and maximum n-gram length.
Defaults to c(3, 6).
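The following base-R sketch shows what a minimum and maximum n-gram length mean for a single word, assuming character n-grams with "<" and ">" as word-boundary markers, as described for fastText subword units. The helper char_ngrams is hypothetical, and the real backend may filter or hash n-grams differently.

```r
# Hypothetical helper: all character n-grams of a word whose length lies
# between ngrams[1] and ngrams[2], after adding boundary markers.
char_ngrams <- function(word, ngrams = c(3, 6)) {
  w <- paste0("<", word, ">")
  n_chars <- nchar(w)
  out <- character(0)
  for (n in seq(ngrams[1], ngrams[2])) {
    if (n > n_chars) break                 # word too short for longer n-grams
    starts <- seq_len(n_chars - n + 1)
    out <- c(out, substring(w, starts, starts + n - 1))
  }
  out
}

char_ngrams("cat", ngrams = c(3, 4))
# "<ca" "cat" "at>" "<cat" "cat>"
```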
- x.max
<Only for GloVe>
Maximum number of co-occurrences to use in the weighting function.
Defaults to 10.
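For orientation, the weighting function from the original GloVe paper can be sketched as below. The exponent alpha = 0.75 is the value suggested in that paper and is an assumption here, as is the function name glove_weight; the backend's exact implementation is not shown in this documentation.

```r
# Assumed GloVe weighting: f(x) = (x / x.max)^alpha if x < x.max, else 1.
# Co-occurrence counts at or above x.max all receive the full weight of 1.
glove_weight <- function(x, x.max = 10, alpha = 0.75) {
  pmin(1, (x / x.max)^alpha)
}

glove_weight(c(1, 5, 10, 100))
# approximately 0.178 0.595 1.000 1.000
```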
- convergence
<Only for GloVe>
Convergence tolerance for SGD iterations. Defaults to -1.
- stopwords
<Only for Word2Vec / GloVe>
A character vector of stopwords to be excluded from training.
- encoding
Text encoding. Defaults to "UTF-8".
- tolower
Convert all upper-case characters to lower-case?
Defaults to FALSE.
- normalize
Normalize all word vectors to unit length?
Defaults to FALSE. See normalize.
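Normalizing to unit length means dividing each word vector by its Euclidean norm. A minimal base-R sketch on a toy matrix (rows are words, the values are made up for illustration) might look like this; it does not call the package's own normalize function.

```r
# Toy word-vector matrix: one row per word.
m <- rbind(cat = c(3, 4), dog = c(1, 0))

# Divide every row by its Euclidean (L2) norm.
m_norm <- m / sqrt(rowSums(m^2))

m_norm["cat", ]
# 0.6 0.8
sqrt(rowSums(m_norm^2))  # every row now has length 1
```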
- iteration
Number of training iterations.
More iterations make the model more precise,
but computational cost grows linearly with the number of iterations.
Defaults to 5 for Word2Vec and FastText, and 10 for GloVe.
- tokenizer
Function used to tokenize the text.
Defaults to text2vec::word_tokenizer.
- remove
Strings (as a regular expression) to be removed from the text.
Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.".
You may turn this off by specifying remove=NULL.
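To see what the default pattern matches, it can be tried on a sample string with gsub(). This sketch assumes the matches are simply deleted; whether the function deletes them or replaces them with something else is not stated here.

```r
# The default pattern, written as an R string (backslashes doubled).
remove <- "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\."

text <- "It's a well_known fact,<br/>e.g. in R."

# Assumption: matched strings are deleted outright.
gsub(remove, "", text)
# "Its a wellknown fact, in R."
```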
- file.save
File name of to-be-saved R data (must be .RData).
- compress
Compression method for the saved file. Defaults to "bzip2".
Options include:
1 or "gzip": modest file size (fastest)
2 or "bzip2": small file size (fast)
3 or "xz": minimized file size (slow)
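The three method names correspond to the compression types accepted by base R's save(). As a small sketch of what such a file looks like on disk, the code below saves a random matrix (a stand-in for trained word vectors) to a temporary .RData file with "bzip2" and loads it back. Whether the function calls save() in exactly this way is an assumption.

```r
# Stand-in object; real usage would save the trained word vectors.
x <- matrix(rnorm(1000), nrow = 10)

f <- tempfile(fileext = ".RData")
save(x, file = f, compress = "bzip2")   # also accepts "gzip" or "xz"

# Load into a separate environment to check the round trip.
e <- new.env()
load(f, envir = e)
identical(e$x, x)
```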
- verbose
Print information to the console? Defaults to TRUE.