starspace: Interface to Starspace for training a Starspace model

Description

Interface to Starspace for training a Starspace model, providing raw access to the C++ functionality.

Usage

starspace(
  model = "textspace.bin",
  file,
  trainMode = 0,
  fileFormat = c("fastText", "labelDoc"),
  label = "__label__",
  dim = 100,
  epoch = 5,
  lr = 0.01,
  loss = c("hinge", "softmax"),
  margin = 0.05,
  similarity = c("cosine", "dot"),
  negSearchLimit = 50,
  adagrad = TRUE,
  ws = 5,
  minCount = 1,
  minCountLabel = 1,
  ngrams = 1,
  thread = 1,
  ...
)

Value

an object of class textspace which is a list with elements

model: a Rcpp pointer to the model
args: a list with elements
1. file: the binary file of the model saved on disk
2. dim: the dimension of the embedding
3. data: data-specific Starspace training parameters
4. param: algorithm-specific Starspace training parameters
5. dictionary: parameters which define ths dictionary of words and labels in Starspace
6. options: parameters specific to duration of training, the text preparation and the training batch size
7. test: parameters specific to model testing
iter: a list with element epoch, lr, error and error_validation showing the error after each epoch

Arguments

model

the full path to where the model file will be saved. Defaults to 'textspace.bin'.

file

the full path to the file on disk which will be used for training.

trainMode

integer with the training mode. Possible values are 0, 1, 2, 3, 4 or 5. Defaults to 0. The use cases are

0: tagspace (classification tasks) and search tasks
1: pagespace & docspace (interest-based or content-based recommendation)
2: articlespace (sentences within document)
3: sentence embeddings and entity similarity
4: multi-relational graphs
5: word embeddings

fileFormat

either one of 'fastText' or 'labelDoc'. See the documentation of StarSpace

label

labels prefix (character string identifying how a label is prefixed, defaults to '__label__')

dim

the size of the embedding vectors (integer, defaults to 100)

epoch

number of epochs (integer, defaults to 5)

lr

learning rate (numeric, defaults to 0.01)

loss

loss function (either 'hinge' or 'softmax')

margin

margin parameter in case of hinge loss (numeric, defaults to 0.05)

similarity

cosine or dot product similarity in cas of hinge loss (character, defaults to 'cosine')

negSearchLimit

number of negatives sampled (integer, defaults to 50)

adagrad

whether to use adagrad in training (logical)

ws

the size of the context window for word level training - only used in trainMode 5 (integer, defaults to 5)

minCount

minimal number of word occurences for being part of the dictionary (integer, defaults to 1 keeping all words)

minCountLabel

minimal number of label occurences for being part of the dictionary (integer, defaults to 1 keeping all labels)

ngrams

max length of word ngram (integer, defaults to 1, using only unigrams)

thread

integer with the number of threads to use. Defaults to 1.

...

arguments passed on to ruimtehol:::textspace. See the details below.

References

https://github.com/facebookresearch

Examples

Run this code

if (FALSE) {
data(dekamer, package = "ruimtehol")
x <- strsplit(dekamer$question, "\\W")
x <- lapply(x, FUN = function(x) x[x != ""])
x <- sapply(x, FUN = function(x) paste(x, collapse = " "))

idx <- sample.int(n = nrow(dekamer), size = round(nrow(dekamer) * 0.7))
writeLines(x[idx], con = "traindata.txt")
writeLines(x[-idx], con = "validationdata.txt")

set.seed(123456789)
m <- starspace(file = "traindata.txt", validationFile = "validationdata.txt", 
               trainMode = 5, dim = 10, 
               loss = "softmax", lr = 0.01, ngrams = 2, minCount = 5,
               similarity = "cosine", adagrad = TRUE, ws = 7, epoch = 3,
               maxTrainTime = 10)
str(starspace_dictionary(m))              
wordvectors <- as.matrix(m)
wv <- starspace_embedding(m, 
                          x = c("Nationale Loterij", "migranten", "pensioen"),
                          type = "ngram")
wv
mostsimilar <- embedding_similarity(wordvectors, wv["pensioen", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
starspace_knn(m, "koning")

## clean up for cran
file.remove(c("traindata.txt", "validationdata.txt"))
}

Run the code above in your browser using DataLab