This function loads a DSM matrix from a disk file in the specified format (see section 'Formats' for details).
read.dsm.matrix(file, format = c("word2vec"),
                encoding = "UTF-8", batchsize = 1e6, verbose = FALSE)
file: either a character string naming a file or a connection that can be read in text mode
format: input file format (see section 'Formats'). The input file format cannot be guessed automatically and must be specified.
encoding: character encoding of the input file (ignored if file is a connection)
batchsize: for certain input formats, the matrix is read in batches of batchsize cells each in order to limit memory overhead
verbose: if TRUE, show a progress bar when reading in batches
Currently, the only supported file format is word2vec.
word2vec
This widely used text format for word embeddings is only suitable for a dense matrix. Row labels must be unique and may not contain whitespace. Values are usually rounded to a few decimal digits in order to keep file size manageable.
The first line of the file lists the matrix dimensions (rows, columns) separated by a single blank. It is followed by one text line for each matrix row, starting with the row label. The label and the cells are separated by single blanks, so row labels cannot contain whitespace.
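The layout described above can be illustrated with a small sketch in base R. The file contents are toy data invented for this illustration, and the hand-written parser only demonstrates the format; it is not how read.dsm.matrix is implemented.

```r
# Write a tiny 3 x 2 matrix in the word2vec text layout (toy data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("3 2",              # header: rows, columns
             "cat 0.12 -0.40",   # row label followed by the cell values
             "dog 0.10 -0.35",
             "tree -0.80 0.22"), tmp)

# Parse the file by hand to show how the pieces fit together
lines  <- readLines(tmp, encoding = "UTF-8")
dims   <- as.integer(strsplit(lines[1], " ", fixed = TRUE)[[1]])
fields <- strsplit(lines[-1], " ", fixed = TRUE)
labels <- vapply(fields, `[`, character(1), 1)

# One numeric vector per row; vapply returns columns, so transpose
M <- t(vapply(fields, function(x) as.numeric(x[-1]), numeric(dims[2])))
rownames(M) <- labels

# Sanity checks implied by the format: declared row count, unique labels
stopifnot(nrow(M) == dims[1], !anyDuplicated(labels))
M
```

Because fields are split on single blanks, a label containing whitespace would shift the numeric cells, which is why the format rules it out.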
Stephanie Evert (https://purl.org/stephanie.evert)
In order to read text formats from a compressed file, pass a gzfile, bzfile or xzfile connection with appropriate encoding in the argument file. Make sure not to open the connection before passing it to read.dsm.matrix.
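As a minimal sketch of reading from a compressed file, the following writes a gzip-compressed toy file and passes an unopened gzfile connection to the function. The file name and contents are made up for illustration, and the call is guarded so the snippet also runs when wordspace is not installed.

```r
# Toy word2vec-style file, gzip-compressed (hypothetical data)
gz <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz, open = "wt", encoding = "UTF-8")
writeLines(c("2 3",
             "alpha 0.1 0.2 0.3",
             "beta 0.4 0.5 0.6"), con)
close(con)

# Create the connection but do NOT open it (no 'open' argument),
# and set the encoding on the connection itself
if (requireNamespace("wordspace", quietly = TRUE)) {
  M <- wordspace::read.dsm.matrix(gzfile(gz, encoding = "UTF-8"),
                                  format = "word2vec")
  print(dim(M))
}
```

The encoding is given to gzfile rather than to read.dsm.matrix, since the encoding argument of the latter is ignored for connections.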
write.dsm.matrix, read.dsm.triplet, read.dsm.ucs
fn <- system.file("extdata", "word2vec_hiero.txt", package="wordspace")
read.dsm.matrix(fn, format="word2vec")