The calculations are done with the text2vec package.
Usage

glove(
text,
tokenizer = text2vec::space_tokenizer,
dim = 10L,
window = 5L,
min_count = 5L,
n_iter = 10L,
x_max = 10L,
stopwords = character(),
convergence_tol = -1,
threads = 1,
composition = c("tibble", "data.frame", "matrix"),
verbose = FALSE
)
Arguments

text: Character string.
tokenizer: Function to perform tokenization. Defaults to text2vec::space_tokenizer.
dim: Integer, number of dimensions of the resulting word vectors. Defaults to 10.
window: Integer, skip length between words. Defaults to 5.
min_count: Integer, number of times a token must appear to be considered in the model. Defaults to 5.
n_iter: Integer, number of training iterations. Defaults to 10.
x_max: Integer, maximum number of co-occurrences to use in the weighting function. Defaults to 10.
stopwords: Character, a vector of stop words to exclude from training. Defaults to character().
convergence_tol: Numeric, defines the early stopping strategy. Fitting stops when one of two conditions is satisfied: (a) all n_iter iterations have been run, or (b) cost_previous_iter / cost_current_iter - 1 < convergence_tol. Defaults to -1, which disables early stopping. A worked sketch of this rule follows the argument list.
threads: Number of CPU threads to use. Defaults to 1.
composition: Character, either "tibble", "matrix", or "data.frame" for the format of the resulting word vectors. Defaults to "tibble".
verbose: Logical, controls whether progress is reported as operations are executed. Defaults to FALSE.
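To make the convergence_tol rule concrete, here is a minimal sketch of the stopping test in plain R. The cost values are made up for illustration; they are not exposed by glove() itself.

# Hypothetical cost values from two successive training iterations:
cost_previous_iter <- 0.0520
cost_current_iter <- 0.0515
convergence_tol <- 0.01

cost_previous_iter / cost_current_iter - 1
#> [1] 0.009708738

# 0.0097 < 0.01, so fitting would stop at this iteration. With the
# default convergence_tol = -1 the ratio can never drop below the
# tolerance (costs are positive), so all n_iter iterations always run.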
Value

A tibble, data.frame, or matrix containing the tokens in the first column and the word vectors in the remaining columns.
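As a hedged sketch of working with this return value: assuming `vectors` holds the tibble returned by a glove() call with dim = 10, and relying on the token column being the first column as described above, a conventional embedding matrix with one row per token can be built like this.

vectors <- glove(fairy_tales, dim = 10L)  # tibble: tokens + 10 vector columns
emb <- as.matrix(vectors[, -1])           # drop the token column
rownames(emb) <- vectors[[1]]             # index rows by token
dim(emb)                                  # number of tokens x 10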
References

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Examples

glove(fairy_tales, x_max = 5)
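A slightly fuller call, sketched with the arguments documented above; the parameter values are illustrative rather than recommended settings.

glove(
  fairy_tales,
  dim = 25L,               # 25-dimensional word vectors
  min_count = 2L,          # keep tokens that appear at least twice
  n_iter = 20L,            # train for more iterations than the default
  convergence_tol = 0.01,  # allow early stopping
  composition = "matrix"   # return a plain matrix
)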