PsychWordVec (version 2023.9)

train_wordvec: Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm.

Description

Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm with multi-threading.

Usage

train_wordvec(
  text,
  method = c("word2vec", "glove", "fasttext"),
  dims = 300,
  window = 5,
  min.freq = 5,
  threads = 8,
  model = c("skip-gram", "cbow"),
  loss = c("ns", "hs"),
  negative = 5,
  subsample = 1e-04,
  learning = 0.05,
  ngrams = c(3, 6),
  x.max = 10,
  convergence = -1,
  stopwords = character(0),
  encoding = "UTF-8",
  tolower = FALSE,
  normalize = FALSE,
  iteration,
  tokenizer,
  remove,
  file.save,
  compress = "bzip2",
  verbose = TRUE
)

Value

A wordvec (data.table) with three variables: word, vec, freq.

Arguments

text

A character vector of text, or a file path on disk containing text.

method

Training algorithm:

  • "word2vec" (default): using the word2vec package

  • "glove": using the rsparse and text2vec packages

  • "fasttext": using the fastTextR package

dims

Number of dimensions of word vectors to be trained. Common choices include 50, 100, 200, 300, and 500. Defaults to 300.

window

Window size (number of context words on each side of the current word). It defines how many surrounding words are included in training: [window] words behind and [window] words ahead of the current word ([window]*2 in total). Defaults to 5.
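
As an illustration (toy code, not from the package itself), the context words counted for window = 2 around a target token can be sketched as:

tokens = c("the", "quick", "brown", "fox", "jumps")
target = 3  # the current word: "brown"
window = 2
idx = max(1, target - window):min(length(tokens), target + window)
tokens[setdiff(idx, target)]  # "the" "quick" "fox" "jumps"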

min.freq

Minimum frequency of words to be included in training. Words that appear fewer than this number of times will be excluded from the vocabulary. Defaults to 5 (i.e., keep only words that appear at least five times).
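
For illustration only, the effect of min.freq on the vocabulary can be mimicked in base R:

tokens = c("good", "good", "good", "bad", "bad", "rare")
freq = table(tokens)
names(freq)[freq >= 3]  # with min.freq = 3, only "good" is kept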

threads

Number of CPU threads used for training. A modest value is usually fastest; using too many threads does not always help. Defaults to 8.

model

<Only for Word2Vec / FastText>

Learning model architecture:

  • "skip-gram" (default): Skip-Gram, which predicts surrounding words given the current word

  • "cbow": Continuous Bag-of-Words, which predicts the current word based on the context

loss

<Only for Word2Vec / FastText>

Loss function (computationally efficient approximation):

  • "ns" (default): Negative Sampling

  • "hs": Hierarchical Softmax

negative

<Only for Negative Sampling in Word2Vec / FastText>

Number of negative examples. Values in the range 5~20 are useful for small training datasets, while for large datasets the value can be as small as 2~5. Defaults to 5.

subsample

<Only for Word2Vec / FastText>

Subsampling threshold for frequent words. Words whose frequency in the training data exceeds this threshold will be randomly down-sampled. Defaults to 0.0001 (1e-04).
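
As a sketch of the idea, the keep probability in Mikolov et al.'s original word2vec C implementation is approximately the following (the underlying R packages may differ in detail):

keep_prob = function(f, t = 1e-04) {
  # f: a word's relative frequency in the corpus; t: subsample threshold
  pmin(1, (sqrt(f / t) + 1) * t / f)
}
keep_prob(c(1e-04, 1e-03, 1e-02))  # 1.000 0.416 0.110: frequent words are down-sampled more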

learning

<Only for Word2Vec / FastText>

Initial (starting) learning rate, also known as alpha. Defaults to 0.05.

ngrams

<Only for FastText>

Minimal and maximal ngram length. Defaults to c(3, 6).
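
For illustration, FastText-style character n-grams for one word (with boundary markers "<" and ">") can be enumerated as:

word = "<where>"  # "where" with boundary markers added
unlist(lapply(3:6, function(n)
  substring(word, 1:(nchar(word) - n + 1), n:nchar(word))
))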

x.max

<Only for GloVe>

Maximum number of co-occurrences to use in the weighting function. Defaults to 10.
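
This caps GloVe's weighting function; a minimal sketch, with alpha fixed at 0.75 as in the original GloVe paper:

glove_weight = function(x, x.max = 10, alpha = 0.75) {
  ifelse(x < x.max, (x / x.max)^alpha, 1)
}
glove_weight(c(1, 5, 10, 100))  # counts above x.max all get weight 1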

convergence

<Only for GloVe>

Convergence tolerance for SGD iterations. Defaults to -1.

stopwords

<Only for Word2Vec / GloVe>

A character vector of stopwords to be excluded from training.

encoding

Text encoding. Defaults to "UTF-8".

tolower

Convert all upper-case characters to lower-case? Defaults to FALSE.

normalize

Normalize all word vectors to unit length? Defaults to FALSE. See normalize.
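
Unit-length (L2) normalization, sketched in base R, makes dot products equal cosine similarities:

v = c(3, 4)
v / sqrt(sum(v^2))  # 0.6 0.8, a unit-length vector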

iteration

Number of training iterations. More iterations make a more precise model, but the computational cost grows linearly with the number of iterations. Defaults to 5 for Word2Vec and FastText, and 10 for GloVe.

tokenizer

Function used to tokenize the text. Defaults to text2vec::word_tokenizer.
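
A hypothetical custom tokenizer (it should, like text2vec::word_tokenizer, return a list of token vectors):

my_tokenizer = function(x) strsplit(tolower(x), "[^[:alnum:]']+")
my_tokenizer("It's a GOOD movie!")  # list of: "it's" "a" "good" "movie"
# could then be passed as: train_wordvec(text, tokenizer=my_tokenizer)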

remove

Strings (as a regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". You can turn this off by specifying remove=NULL.
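
As a sketch of what the default pattern strips (this is also why "Ive" appears in the Examples below):

pattern = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\."
gsub(pattern, "", "I've seen it twice, e.g. on TV<br />_")
# "Ive seen it twice,  on TV"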

file.save

File name for saving the trained word vectors as R data (must end with .RData).

compress

Compression method for the saved file. Defaults to "bzip2".

Options include:

  • 1 or "gzip": modest file size (fastest)

  • 2 or "bzip2": small file size (fast)

  • 3 or "xz": minimized file size (slow)

verbose

Print information to the console? Defaults to TRUE.

Download

Download pre-trained word vectors data (.RData): https://psychbruce.github.io/WordVector_RData.pdf

See Also

tokenize

Examples

review = text2vec::movie_review  # a data.frame
text = review$review

## Note: All the examples train 50-dim vectors for a faster code check.

## Word2Vec (SGNS)
dt1 = train_wordvec(
  text,
  method="word2vec",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE)

dt1
most_similar(dt1, "Ive")  # evaluate performance
most_similar(dt1, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt1, ~ boy - he + she, topn=5)  # evaluate performance

## GloVe
dt2 = train_wordvec(
  text,
  method="glove",
  dims=50, window=5,
  normalize=TRUE)

dt2
most_similar(dt2, "Ive")  # evaluate performance
most_similar(dt2, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt2, ~ boy - he + she, topn=5)  # evaluate performance

## FastText
dt3 = train_wordvec(
  text,
  method="fasttext",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE)

dt3
most_similar(dt3, "Ive")  # evaluate performance
most_similar(dt3, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt3, ~ boy - he + she, topn=5)  # evaluate performance
