doc2vec: Get document vectors based on a word2vec model

Description

Document vectors are computed as the sum of the vectors of the words in the document, standardised by the scale of the vector space. This scale is the square root of the average inner product of the vector elements.
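
The calculation described above can be sketched in base R. This is only one reading of the description, not the package's actual implementation: the toy `embedding` matrix and the helper `doc_vector` are hypothetical, tokens missing from the dictionary are assumed to be skipped, and a document with no known tokens is assumed to yield NA values.

```r
# Toy embedding matrix with made-up numbers: one row per word, 4 dimensions
embedding <- matrix(c(1, 0, 2, 1,
                      0, 3, 1, 1,
                      2, 1, 0, 2),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("bus", "toilet", "no"), NULL))

# Hypothetical helper, not part of the word2vec package
doc_vector <- function(tokens, embedding) {
  # Keep only tokens found in the dictionary (assumption: others are ignored)
  known <- tokens[tokens %in% rownames(embedding)]
  if (length(known) == 0) {
    return(rep(NA_real_, ncol(embedding)))
  }
  # Sum the word vectors of the document
  total <- colSums(embedding[known, , drop = FALSE])
  # Scale: square root of the average squared element of the summed vector
  scale <- sqrt(mean(total^2))
  total / scale
}

v <- doc_vector(c("no", "toilet", "bus", "unknown"), embedding)
v
```

Under this reading, the standardised vector always has an average squared element of 1, so documents of different lengths end up on a comparable scale.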

Usage

doc2vec(object, newdata, split = " ", encoding = "UTF-8", ...)

Value

a matrix with one row per document containing the document vectors; the rownames of this matrix are the document identifiers

Arguments

object: a word2vec model as returned by word2vec or read.word2vec
newdata: either a list of tokens where each list element is a character vector of tokens which form the document and the list name is considered the document identifier; or a data.frame with columns doc_id and text; or a character vector with texts where the character vector names will be considered the document identifier
split: in case newdata is not a list of tokens, the text is split into tokens using the function strsplit with the provided split argument
encoding: set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.
...: not used
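
As a rough illustration of the split argument, the base R sketch below shows how strsplit turns a named character vector into a named list of tokens. It does not call doc2vec itself; the variable names are hypothetical and the texts are borrowed from the examples on this page.

```r
# Named character vector: the names play the role of document identifiers
texts <- c(a = "there is no toilet. on the bus",
           b = "no tokens from dictionary")

# Splitting on a single space (the default split) keeps punctuation attached
tokens_space <- strsplit(texts, " ")
tokens_space$a

# Splitting on spaces and periods strips the period, but leaves an empty
# string where a period is directly followed by a space
tokens_punct <- strsplit(texts, "[ .]")
tokens_punct$a
```

The choice of split therefore decides whether a token such as "toilet." or "toilet" is looked up in the model dictionary.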

Examples

path  <- system.file(package = "word2vec", "models", "example.bin")
model <- read.word2vec(path)
x <- data.frame(doc_id = c("doc1", "doc2", "testmissingdata"), 
                text = c("there is no toilet. on the bus", "no tokens from dictionary", NA),
                stringsAsFactors = FALSE)
emb <- doc2vec(model, x, type = "embedding")
emb

newdoc <- doc2vec(model, "i like busses with a toilet")
word2vec_similarity(emb, newdoc)

## alternative input: a named character vector, where the names are the document identifiers
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA), 
              nm = c("a", "b", "c"))
emb <- doc2vec(model, x, type = "embedding")
emb

## alternative input: a list of tokens, split here on spaces and periods
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA), 
              nm = c("a", "b", "c"))
x <- strsplit(x, "[ .]")
emb <- doc2vec(model, x, type = "embedding")
emb

## show behaviour in case of NA or character data of no length
x <- list(a = character(), b = c("bus", "toilet"), c = NA)
emb <- doc2vec(model, x, type = "embedding")
emb
