Note: a newer version (0.6.4) of this package is available.

text2vec

Author: Dmitriy Selivanov

You've just discovered text2vec!

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

The goals we set for the development of text2vec:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces, so there is no need to learn a new interface for each task
  • Flexible - make complex tasks easy to solve
  • Fast - maximize efficiency per single thread and scale transparently to multiple threads on multicore machines
  • Memory efficient - use streams and iterators, and avoid keeping data in RAM where possible

See the API section for details.
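To make the "concise" and "consistent" goals concrete, here is a minimal sketch of a typical vectorization pipeline. It assumes the bundled movie_review data set and uses only functions listed in the package index (itoken, create_vocabulary, prune_vocabulary, vocab_vectorizer, create_dtm, TfIdf). The helper make_iterator and the pruning threshold are illustrative choices, not part of the package.

```r
library(text2vec)

# movie_review ships with text2vec (IMDB movie reviews); use a small subset
data("movie_review")
reviews <- movie_review[1:500, ]

# Hypothetical helper: builds a fresh token iterator for each pass over the data
make_iterator <- function() {
  itoken(reviews$review,
         preprocessor = tolower,
         tokenizer = word_tokenizer,
         ids = reviews$id,
         progressbar = FALSE)
}

# Pass 1: collect the vocabulary, then drop rare terms (threshold is arbitrary)
vocab <- create_vocabulary(make_iterator())
vocab <- prune_vocabulary(vocab, term_count_min = 5L)

# Pass 2: map each document onto the vocabulary to get a sparse document-term matrix
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(make_iterator(), vectorizer)

# Models share the same fit_transform() interface; TF-IDF is one example
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)

dim(dtm_tfidf)
```

The same iterator-plus-vectorizer pattern should carry over to create_tcm for term co-occurrence matrices, which is where the unified interface pays off.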

Performance

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts are fully parallelized using OpenMP.

Other embarrassingly parallel tasks (such as vectorization) can use any fork-based parallel backend on UNIX-like machines. They can achieve near-linear scalability with the number of available cores.
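As an illustration of fork-based parallel vectorization, the sketch below splits the documents into chunks by hand and builds one document-term matrix per chunk with parallel::mclapply. This is an assumption about one way to do it, not necessarily the mechanism the package uses internally. The chunk count and core count are arbitrary.

```r
library(text2vec)

data("movie_review")
docs <- movie_review$review[1:400]
ids  <- movie_review$id[1:400]

# Build the vocabulary once, serially, so every worker produces the same columns
it_all <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
                 ids = ids, progressbar = FALSE)
vocab <- create_vocabulary(it_all)
vectorizer <- vocab_vectorizer(vocab)

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# split_into() cuts the index vector into chunks of roughly equal size
chunks <- split_into(seq_along(docs), 2)

# Each worker tokenizes and vectorizes its own chunk
dtm_parts <- parallel::mclapply(chunks, function(idx) {
  it <- itoken(docs[idx], preprocessor = tolower, tokenizer = word_tokenizer,
               ids = ids[idx], progressbar = FALSE)
  create_dtm(it, vectorizer)
}, mc.cores = n_cores)

# Stack the per-chunk sparse matrices back into one document-term matrix
dtm <- do.call(rbind, dtm_parts)
dim(dtm)
```

Because every chunk is vectorized against the same vocabulary, the partial matrices have identical columns and can be stacked row-wise without realignment.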

Finally, a streaming API means that users do not have to load all the data into RAM.
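A minimal sketch of the streaming idea, assuming documents are stored one per line in text files: ifiles iterates over the files and itoken tokenizes them as they are read, so the corpus never has to sit in memory as a single object. The temporary directory and its contents are made up for the example.

```r
library(text2vec)

# Create two small text files in a temporary directory (illustrative data only)
tmp_dir <- tempfile("text2vec_stream_")
dir.create(tmp_dir)
writeLines(c("the cat sat on the mat", "the dog barked"),
           file.path(tmp_dir, "a.txt"))
writeLines("a cat and a dog", file.path(tmp_dir, "b.txt"))

files <- list.files(tmp_dir, full.names = TRUE)

# ifiles() yields the files one at a time; each line is treated as a document
it_files <- ifiles(files, reader = readLines)

# itoken() wraps the file iterator and tokenizes on the fly
it <- itoken(it_files,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             progressbar = FALSE)

# The vocabulary is accumulated file by file
vocab <- create_vocabulary(it)
vocab[, c("term", "term_count", "doc_count")]
```

For corpora larger than memory, the same iterator can be passed to create_dtm or create_tcm, since those consume input through the same interface.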

Contributing

The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome. You can help by:

License

GPL (>= 2)

Install

install.packages('text2vec')

Monthly Downloads

5,621

Version

0.6

License

GPL (>= 2) | file LICENSE

Last Published

February 18th, 2020

Functions in text2vec (0.6)

  • combine_vocabularies - Combines multiple vocabularies into one
  • normalize - Matrix normalization
  • LatentSemanticAnalysis - Latent Semantic Analysis model
  • create_dtm - Document-term matrix construction
  • similarities - Pairwise Similarity Matrix Computation
  • split_into - Split a vector for parallel processing
  • prepare_analogy_questions - Prepares list of analogy questions
  • perplexity - Perplexity of a topic model
  • itoken - Iterators (and parallel iterators) over input objects
  • RelaxedWordMoversDistance - Creates Relaxed Word Movers Distance (RWMD) model
  • jsPCA_robust - (numerically robust) Dimension reduction via Jensen-Shannon Divergence & Principal Components
  • create_tcm - Term co-occurrence matrix construction
  • vectorizers - Vocabulary and hash vectorizers
  • distances - Pairwise Distance Matrix Computation
  • create_vocabulary - Creates a vocabulary of unique terms
  • text2vec - text2vec
  • tokenizers - Simple tokenization functions for string splitting
  • ifiles - Creates iterator over text files from the disk
  • prune_vocabulary - Prune vocabulary
  • reexports - Objects exported from other packages
  • as.lda_c - Converts document-term matrix sparse matrix to 'lda_c' format
  • coherence - Coherence metrics for topic models
  • BNS - BNS
  • check_analogy_accuracy - Checks accuracy of word embeddings on the analogy task
  • GloVe - Re-export of rsparse::GloVe
  • TfIdf - TfIdf
  • LatentDirichletAllocation - Creates Latent Dirichlet Allocation model
  • Collocations - Collocations model
  • movie_review - IMDB movie reviews