textdata

The goal of textdata is to provide easy access to text-related datasets without bundling them inside a package. Some text datasets are too large to store within an R package, or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the datasets on disk and load them when needed.

Installation

You can install the released version of textdata from CRAN with:

install.packages("textdata")

You can install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("EmilHvitfeldt/textdata")

Example

The first time you use one of the functions for accessing an included text dataset, such as lexicon_afinn() or dataset_sentence_polarity(), the function will prompt you to agree that you understand the dataset’s license or terms of use and then download the dataset to your computer.

After the first use, each time you use a function like lexicon_afinn(), the function will load the dataset from disk.
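
The workflow described above can be sketched as follows. This is a minimal illustration, assuming textdata is installed; the calls are wrapped in `interactive()` because the license prompt needs a console to answer it. `cache_info()` is the package's helper for listing cached folders and their sizes.

```r
library(textdata)

if (interactive()) {
  # First call: asks you to accept the dataset's license or terms of use,
  # then downloads the lexicon and stores it on disk.
  afinn <- lexicon_afinn()

  # Later calls load the stored copy from disk instead of downloading again.
  afinn <- lexicon_afinn()

  # Inspect the first rows of the loaded lexicon.
  print(head(afinn))

  # List the cached dataset folders and their sizes.
  print(cache_info())
}
```

Outside an interactive session the block does nothing, which avoids a failed prompt in scripts or automated checks.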

Included text datasets

The datasets currently included in textdata are:

| Dataset | Function |
| --- | --- |
| v1.0 sentence polarity dataset | dataset_sentence_polarity() |
| AFINN-111 sentiment lexicon | lexicon_afinn() |
| Hu and Liu’s opinion lexicon | lexicon_bing() |
| NRC word-emotion association lexicon | lexicon_nrc() |
| NRC Emotion Intensity Lexicon | lexicon_nrc_eil() |
| The NRC Valence, Arousal, and Dominance Lexicon | lexicon_nrc_vad() |
| Loughran and McDonald’s opinion lexicon for financial documents | lexicon_loughran() |
| AG’s News | dataset_ag_news() |
| DBpedia ontology | dataset_dbpedia() |
| Trec-6 and Trec-50 | dataset_trec() |
| IMDb Large Movie Review Dataset | dataset_imdb() |
| Stanford NLP GloVe pre-trained word vectors | embedding_glove6b(), embedding_glove27b(), embedding_glove42b(), embedding_glove840b() |

Check out each function’s documentation for detailed information (including citations) for the relevant dataset.
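
As a hedged sketch of how to look up that information from the console, the snippet below loads the package's `catalogue` object (described in the function index as a catalogue of all available data sources) and opens one help page. It assumes textdata is installed and that `catalogue` is stored as a data frame; no dataset is downloaded by these calls.

```r
library(textdata)

# Load the catalogue of available data sources that ships with the package.
data("catalogue", package = "textdata")

# Show its structure: one row per data source.
str(catalogue)

# Open the documentation for a single loader, which includes
# details and citations for the underlying dataset.
help("lexicon_nrc", package = "textdata")
```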

Community Guidelines

Note that this project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support in the project's GitHub repository (EmilHvitfeldt/textdata). For details on how to add a new dataset to this package, check out the package vignette.

Version: 0.4.4

License: MIT + file LICENSE

Last Published: September 2nd, 2022

Monthly Downloads: 10,358

Functions in textdata (0.4.4)

| Function | Description |
| --- | --- |
| embedding_glove | Global Vectors for Word Representation |
| lexicon_loughran | Loughran-McDonald sentiment lexicon |
| lexicon_nrc_eil | NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon) v0.5 |
| lexicon_nrc_vad | The NRC Valence, Arousal, and Dominance Lexicon |
| lexicon_nrc | NRC word-emotion association lexicon |
| dataset_sentence_polarity | v1.0 sentence polarity dataset |
| load_dataset | Internal Functions |
| textdata-package | textdata: Download and Load Various Text Datasets |
| lexicon_bing | Bing sentiment lexicon |
| cache_info | List folders and their sizes in cache |
| catalogue | Catalogue of all available data sources |
| dataset_trec | TREC dataset |
| dataset_dbpedia | DBpedia Ontology Dataset |
| dataset_imdb | IMDB Large Movie Review Dataset |
| lexicon_afinn | AFINN-111 dataset |
| dataset_ag_news | AG's News Topic Classification Dataset |