dtm_svd_similarity

a sparse matrix such as a "dgCMatrix" object which is returned by <code><a rd-options="" href="/link/document_term_matrix?package=udpipe&version=0.8.6" data-mini-rdoc="udpipe::document_term_matrix">document_term_matrix</a></code> containing frequencies of terms for each document

a matrix containing the <code>v</code> element from a singular value decomposition, i.e. the right singular vectors.
The rownames of that matrix should contain terms which are available in the <code>colnames(dtm)</code>. See the examples.

embedding

a numeric vector with weights giving your definition of which terms are positive or negative.
The names of this vector should be terms available in the rownames of the embedding matrix. See the examples.

weights

a character vector of terms to limit the calculation of the similarity for the <code>dtm</code> to the linear combination of the weights. 
Defaults to all terms from the <code>embedding</code> matrix.

terminology

either 'cosine' or 'dot', indicating whether to calculate cosine similarities or inner product similarities, respectively, between the <code>dtm</code> and the SVD embedding space. Defaults to 'cosine'.

type

Calculate the similarity of a document term matrix to a set of terms based on 
a Singular Value Decomposition (SVD) embedding matrix.
This can be used to easily construct a sentiment score based on the latent scale defined by a set of positive or negative terms.
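
The calculation described above can be illustrated with a minimal base R sketch. It is one plausible reading of the description, not the package's implementation: the toy matrix, the terms, the number of dimensions `k`, and the names `docs` and `target` are all assumptions made for illustration, and a dense matrix stands in for the sparse "dgCMatrix".

```r
# Toy document-term matrix: 3 documents x 4 terms (dense here for simplicity).
dtm <- matrix(c(2, 0, 1, 0,
                0, 3, 0, 1,
                1, 1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("doc1", "doc2", "doc3"),
                              c("good", "bad", "clean", "noisy")))

# Embedding: the v element (right singular vectors) of an SVD,
# with rownames set to the terms found in colnames(dtm).
k <- 2  # hypothetical number of latent dimensions
embedding <- svd(dtm, nu = 0, nv = k)$v
rownames(embedding) <- colnames(dtm)

# Weights: which terms count as positive (+1) or negative (-1).
weights <- c(good = 1, clean = 1, bad = -1, noisy = -1)

# Project every document into the embedding space (documents x k).
docs <- dtm[, rownames(embedding), drop = FALSE] %*% embedding

# The weighted terms define a single direction in that space
# (a linear combination of the term embeddings).
target <- as.numeric(weights %*% embedding[names(weights), , drop = FALSE])

# type = 'dot': inner product; type = 'cosine': inner product
# divided by the lengths of both vectors.
dot <- as.numeric(docs %*% target)
cosine <- dot / (sqrt(rowSums(docs^2)) * sqrt(sum(target^2)))
names(dot) <- names(cosine) <- rownames(dtm)

print(round(cosine, 3))
```

In this sketch a document dominated by positively weighted terms scores above zero and one dominated by negatively weighted terms scores below zero, which is the sense in which the similarity can serve as a sentiment score.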

This natural language processing toolkit provides language-agnostic
'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency
parsing' of raw text. Next to text parsing, the package also allows you to train
annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided
at <https://universaldependencies.org/format.html>. The techniques are explained
in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0
with UDPipe', available at <doi:10.18653/v1/K17-3009>.
The toolkit also contains functionalities for commonly used data manipulations on texts
which are enriched with the output of the parser. Namely functionalities and algorithms
for collocations, token co-occurrence, document term matrix handling,
term frequency inverse document frequency calculations,
information retrieval metrics (Okapi BM25), handling of multi-word expressions,
keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns),
sentiment scoring and semantic similarity analysis.

Jan Wijffels

udpipe

Tokenization, Parts of Speech Tagging, Lemmatization and
Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

BNOSAC 

Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic 

Milan Straka 

Jana Straková

dtm_svd_similarity function

dtm_svd_similarity: Semantic Similarity to a Singular Value Decomposition

Description

Usage

Arguments

Value

See Also

Examples