textTinyR

The textTinyR package consists of text processing functions for small or big data files. More details on the functionality of textTinyR can be found in blog-post1 and blog-post2. The package can be installed on Linux, Mac and Windows. There is one limitation, however: Chinese, Japanese, Korean, Thai and other languages with ambiguous word boundaries are not supported.

UPDATE 01-04-2018 : boost-locale is no longer a system requirement for the textTinyR package.

Installation of the textTinyR package (CRAN, Github)

To install the package from CRAN use,


install.packages('textTinyR')

and to download the latest version from Github use the install_github function of the devtools package,


devtools::install_github(repo = 'mlampros/textTinyR')

Use the following link to report bugs or issues: https://github.com/mlampros/textTinyR/issues

UPDATE 06-02-2020

Docker images of the textTinyR package are available to download from my dockerhub account. The images come with RStudio and the latest R-development version installed. The whole process was tested on Ubuntu 18.04. To pull and run the image do the following,


docker pull mlampros/texttinyr:rstudiodev

docker run -d --name rstudio_dev -e USER=rstudio -e PASSWORD=give_here_your_password --rm -p 8787:8787 mlampros/texttinyr:rstudiodev

The user can also bind a home directory / folder to the container, to make its files available there, by specifying the -v option,


docker run -d --name rstudio_dev -e USER=rstudio -e PASSWORD=give_here_your_password --rm -p 8787:8787 -v /home/YOUR_DIR:/home/rstudio/YOUR_DIR mlampros/texttinyr:rstudiodev

In the latter case you might first have to grant write access to the YOUR_DIR directory (this is not always necessary) using,


chmod -R 777 /home/YOUR_DIR

The USER defaults to rstudio, but you have to supply a PASSWORD of your choice (see https://rocker-project.org/ for more information).
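As a minimal sketch, the two `docker run` variants above can be folded into one shell function that assembles the command and prints it, so it can be inspected before anything is started. The function name `texttinyr_run_cmd` and its two arguments are hypothetical names chosen for illustration; the container name, port and image tag are the ones used above. The sketch assumes that the password and the directory path contain no spaces.

```shell
#!/usr/bin/env bash
# Sketch: assemble the `docker run` command shown above.
# texttinyr_run_cmd is an illustrative helper, not part of textTinyR or the image.
texttinyr_run_cmd() {
  local password="$1"   # required: the RStudio password of your choice
  local host_dir="$2"   # optional: host directory to bind into the container

  if [ -z "$password" ]; then
    echo "usage: texttinyr_run_cmd PASSWORD [HOST_DIR]" >&2
    return 1
  fi

  local cmd="docker run -d --name rstudio_dev -e USER=rstudio -e PASSWORD=${password} --rm -p 8787:8787"

  if [ -n "$host_dir" ]; then
    # Mount the host directory under /home/rstudio, keeping its base name
    cmd="${cmd} -v ${host_dir}:/home/rstudio/$(basename "$host_dir")"
  fi

  echo "${cmd} mlampros/texttinyr:rstudiodev"
}

# Only prints the command; nothing is started at this point
texttinyr_run_cmd "give_here_your_password" "/home/YOUR_DIR"
```

To actually start the container, copy the printed line into the terminal. Leaving out the second argument gives the variant without the -v bind.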

Open your web browser and, depending on where the docker image was built / run, enter,

1st. Option on your personal computer,

http://0.0.0.0:8787 

2nd. Option on a cloud instance,

http://Public DNS:8787

to access the RStudio console, where you enter your username and password.

Citation:

If you use the code of this repository in your paper or research please cite both textTinyR and the original software https://CRAN.R-project.org/package=textTinyR/citation.html:

@Manual{,
  title = {{textTinyR}: Text Processing for Small or Big Data Files},
  author = {Lampros Mouselimis},
  year = {2021},
  note = {R package version 1.1.8},
  url = {https://CRAN.R-project.org/package=textTinyR},
}

Monthly Downloads: 1,624

Version: 1.1.8

License: GPL-3

Last Published: December 4th, 2023

Functions in textTinyR (1.1.8)

read_rows: read a specific number of rows from a text file
dims_of_word_vecs: dimensions of a word vectors file
dense_2sparse: convert a dense matrix to a sparse matrix
dice_distance: Dice similarity of words using n-grams
levenshtein_distance: Levenshtein distance of two words
select_predictors: exclude highly correlated predictors
save_sparse_binary: save a sparse matrix in binary format
tokenize_transform_text: string tokenization and transformation (character string or path to a file)
text_file_parser: text file parser
utf_locale: utf-locale for the available languages
vocabulary_parser: returns the vocabulary counts for small or medium files (xml and other formats)
tokenize_transform_vec_docs: string tokenization and transformation (vector of documents)
sparse_term_matrix: term matrices and statistics (document-term-matrix, term-document-matrix)
token_stats: token statistics
text_intersect: intersection of words or letters in tokenized text
sparse_Sums: rowSums and colSums for a sparse matrix
sparse_Means: rowMeans and colMeans for a sparse matrix
Count_Rows: number of rows of a file
batch_compute: compute batches
TEXT_DOC_DISSIM: dissimilarity calculation of text documents
cosine_distance: cosine distance of two character strings (each string consists of more than one word)
cluster_frequency: frequencies of an existing cluster object
COS_TEXT: cosine similarity for text documents
Doc2Vec: conversion of text documents to word-vector-representation features (Doc2Vec)
JACCARD_DICE: Jaccard or Dice similarity for text documents
big_tokenize_transform: string tokenization and transformation for big data sets
bytes_converter: bytes converter of a text file (KB, MB or GB)
read_characters: read a specific number of characters from a text file
load_sparse_binary: load a sparse matrix in binary format
matrix_sparsity: sparsity percentage of a sparse matrix