
quanteda v0.99 Important Changes

Many important changes to the package have been underway, including API changes, as we approach the stable "1.0" API release in October 2017. Version 0.99 is the last version that will contain many of the deprecated object types and methods that date back several releases.

v0.99 also implements many enhancements and performance improvements over previous releases. See NEWS.md for details, and Quanteda Structure and Design for a description of the package's underlying logic and design philosophy.

About the package

An R package for managing and analyzing text, created by Kenneth Benoit in collaboration with a team of core contributors: Kohei Watanabe, Paul Nulty, Adam Obeng, Haiyan Wang, Ben Lauderdale, and Will Lowe. Supported by the European Research Council grant ERC-2011-StG 283794-QUANTESS.

For more details, see the package website.

How to cite the package:

To cite package 'quanteda' in publications please use the
following:

  Benoit, Kenneth et al. ().  "quanteda: Quantitative Analysis of
  Textual Data".  R package version: 0.9.99.  http://quanteda.io.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {quanteda: Quantitative Analysis of Textual Data},
    author = {Kenneth Benoit and Kohei Watanabe and Paul Nulty and Adam Obeng and Haiyan Wang and Benjamin Lauderdale and Will Lowe},
    note = {R package version 0.9.99},
    url = {http://quanteda.io},
  }

Leave feedback

If you like quanteda, please consider leaving feedback or a testimonial here.

Features

Powerful text analytics

Generalized, flexible corpus management. quanteda provides a comprehensive workflow and ecosystem for managing, processing, and analyzing texts. Documents and their associated document- and collection-level metadata are easily loaded and stored as a corpus object, although most of quanteda's operations also work on simple character objects. A corpus is designed to store efficiently all of the texts in a collection, along with metadata for individual documents and for the collection as a whole, so that natural language processing tasks such as tokenizing, stemming, or forming ngrams can be performed simply and quickly. quanteda's functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and extremely simple to use, and texts can be segmented easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.
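
As an illustration of this workflow, the sketch below builds a corpus from the package's built-in data_char_ukimmig2010 texts, reshapes it into sentence-level documents, and tokenizes it; argument names follow the v0.99 API.

# build a corpus from a named character vector of UK manifesto texts
mycorpus <- corpus(data_char_ukimmig2010)
summary(mycorpus)

# redefine the document units as sentences
sentcorpus <- corpus_reshape(mycorpus, to = "sentences")

# tokenize, stem, and form bigrams
toks <- tokens(mycorpus, remove_punct = TRUE)
toks_stemmed <- tokens_wordstem(toks)
toks_bigrams <- tokens_ngrams(toks, n = 2)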

Works nicely with UTF-8. Built on the text processing functions in the stringi package, which is in turn built on the C++ implementation of the ICU libraries for Unicode text handling, quanteda pays special attention to fast and correct handling of Unicode and of text in any character set, which is converted internally to UTF-8.
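
Because input is converted to UTF-8 internally, tokenization works the same way for non-ASCII text; a small illustrative sketch (the example strings are arbitrary):

# word tokenization of non-ASCII input, handled internally as UTF-8
tokens(c(de = "Äpfel und Birnen", ja = "日本語のテキスト"), what = "word")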

Built for efficiency and speed. All of the functions in quanteda are built for maximum performance and scale while still being as R-based as possible. The package makes use of three efficient architectural elements: the stringi package for text processing, the Matrix package for sparse matrix objects, and the data.table package for indexing large documents efficiently. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)

Super-fast conversion of texts into a document-feature matrix. quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining and selecting the documents and features. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. A special variation of the "dfm", a feature co-occurrence matrix, is also implemented, for direct use with embedding and representational models such as text2vec.
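
A brief sketch of going from the corpus created above to a dfm and to a feature co-occurrence matrix (fcm); argument names follow the v0.99 API.

# document-feature matrix directly from the corpus (lowercased by default)
immigdfm <- dfm(mycorpus, remove_punct = TRUE)

# feature co-occurrence matrix counting co-occurrences within a 5-token window,
# e.g. for use with embedding models such as text2vec
immigfcm <- fcm(tokens(mycorpus, remove_punct = TRUE), context = "window", window = 5)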

Extensive feature selection capabilities. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined "thesaurus", and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
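
A short sketch of these selection and weighting steps, continuing from the objects above (argument names such as min_count follow the v0.99 API):

# remove English stopwords and stem at construction time
dfm_nostop <- dfm(mycorpus, remove = stopwords("english"), stem = TRUE)

# keep only features matching "glob" patterns
dfm_immig <- dfm_select(dfm_nostop, c("immig*", "asylum*"))

# drop infrequent features, then apply tf-idf weighting
dfm_trimmed <- dfm_trim(dfm_nostop, min_count = 5, min_docfreq = 2)
dfm_weighted <- tfidf(dfm_trimmed)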

Qualitative exploratory tools. Easily search and save keywords in context, for instance, or identify keywords. As with all of quanteda's pattern-matching functions, users have the option of simple "glob" expressions, regular expressions, or fixed pattern matches.
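
For example, keywords in context can be located with the default "glob" matching or with a regular expression; a brief sketch using the corpus created above:

# keywords in context: "glob" match (the default), then a regular expression
kwic(mycorpus, "immig*", window = 4)
kwic(mycorpus, "deport(ed|ation)?", window = 4, valuetype = "regex")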

Dictionary-based analysis. quanteda allows fast and flexible implementation of dictionary methods, including the import and conversion of external dictionary formats such as those from Provalis's WordStat, the Linguistic Inquiry and Word Count (LIWC), Lexicoder, Yoshikoder, and YAML.
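
A minimal sketch of building and applying a small hand-made dictionary (the category names and patterns below are illustrative only):

# a two-category dictionary; patterns use "glob" matching by default
mydict <- dictionary(list(refugees = c("refugee*", "asylum*"),
                          economy  = c("work*", "tax*", "benefit*")))

# apply it at dfm construction time ...
dfm(mycorpus, dictionary = mydict)

# ... or to an already-tokenized object
tokens_lookup(tokens(mycorpus), dictionary = mydict)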

Text analytic methods. Once constructed, a dfm can be easily analyzed using quanteda's built-in tools for scaling document positions (the "wordfish" and "Wordscores" models, or direct use with the ca package for correspondence analysis), fitting predictive models using multinomial and Bernoulli Naive Bayes classifiers, computing distance or similarity matrices of features or documents, or computing readability or lexical diversity indexes.
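
For instance, a scaling model and several textstat functions can be applied directly to a dfm; a hedged sketch using the built-in data_corpus_irishbudget2010 corpus (function and argument names follow v0.99):

# scale documents with the "wordfish" model and plot the estimated positions
budgetdfm <- dfm(data_corpus_irishbudget2010, remove_punct = TRUE)
wf <- textmodel_wordfish(budgetdfm, dir = c(1, 2))  # dir fixes the scale's orientation
textplot_scale1d(wf)

# document distances and readability of the underlying texts
textstat_dist(budgetdfm, method = "euclidean")
textstat_readability(texts(data_corpus_irishbudget2010), measure = "Flesch")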

In addition, a quanteda document-feature matrix is easily used with, or converted for, a number of other text analytic tools (see the sketch after this list), such as:

  • topic models (including converters for direct use with the topicmodels, LDA, and stm packages);

  • machine learning through a variety of other packages that take matrix or matrix-like inputs.
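
A short sketch of such a conversion, using the dfm created earlier and the convert() function ("topicmodels" and "stm" are two of the supported values of its to argument):

# convert the dfm for use with external topic modeling packages
convert(immigdfm, to = "topicmodels")
convert(immigdfm, to = "stm")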

Planned features. Coming soon to quanteda are:

  • Bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data.

  • Additional predictive and analytic methods by expanding the textstat_ and textmodel_ functions. Current textmodel types include correspondence analysis, "Wordscores", "Wordfish", and Naive Bayes; current textstat statistics are readability, lexical diversity, similarity, and distance.

  • Expanded settings for all objects, which will propagate through downstream objects.

  • Object histories, which will propagate through downstream objects, to enhance analytic reproducibility and transparency.

How to Install

  1. From CRAN: Use your GUI's R package installer, or execute:

    install.packages("quanteda") 
  2. From GitHub, using:

    # devtools package required to install quanteda from GitHub
    devtools::install_github("kbenoit/quanteda") 

    Because this compiles some C++ source code, you will need a compiler installed. If you are using a Windows platform, you will also need to install the Rtools software available from CRAN. If you are using OS X, you will need to install Xcode, available for free from the App Store, or, if you prefer a lighter-footprint set of tools, just the Xcode command line tools, installed via the command xcode-select --install from the Terminal.

  3. Additional recommended packages:

    The following packages work well with or extend quanteda and we recommend that you also install them:

    • readtext: An easy way to read text data into R, from almost any input format.

    • spacyr: NLP using the spaCy library, including part-of-speech tagging, entity recognition, and dependency parsing.

    • quantedaData: Additional textual data for use with quanteda.

      devtools::install_github("kbenoit/quantedaData")
    • LIWCalike: An R implementation of the Linguistic Inquiry and Word Count approach to text analysis.

      devtools::install_github("kbenoit/LIWCalike")

Getting Started

See the package website, which includes the Getting Started Vignette.

Demonstration

library(quanteda)
## quanteda version 0.9.99
## Using 4 of 8 threads for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

# create a corpus from the immigration texts from UK party platforms
uk2010immigCorpus <- 
    corpus(data_char_ukimmig2010,
           docvars = data.frame(party = names(data_char_ukimmig2010)),
           metacorpus = list(notes = "Immigration-related sections of 2010 UK party manifestos"))
uk2010immigCorpus
## Corpus consisting of 9 documents and 1 docvar.
summary(uk2010immigCorpus)
## Corpus consisting of 9 documents.
## 
##          Text Types Tokens Sentences        party
##           BNP  1125   3280        88          BNP
##     Coalition   142    260         4    Coalition
##  Conservative   251    499        15 Conservative
##        Greens   322    679        21       Greens
##        Labour   298    683        29       Labour
##        LibDem   251    483        14       LibDem
##            PC    77    114         5           PC
##           SNP    88    134         4          SNP
##          UKIP   346    723        27         UKIP
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Aug 10 18:26:49 2017
## Notes:   Immigration-related sections of 2010 UK party manifestos

# key words in context for "deport", 3 words of context
kwic(uk2010immigCorpus, "deport", 3)
##                                                                     
##   [BNP, 157]        The BNP will | deport | all foreigners convicted
##  [BNP, 1946]                . 2. | Deport | all illegal immigrants  
##  [BNP, 1952] immigrants We shall | deport | all illegal immigrants  
##  [BNP, 2585]  Criminals We shall | deport | all criminal entrants

# create a dfm, removing stopwords
mydfm <- dfm(uk2010immigCorpus, remove = stopwords("english"), remove_punct = TRUE)
mydfm
## Document-feature matrix of: 9 documents, 1,547 features (83.8% sparse).

topfeatures(mydfm, 20)  # 20 top words
## immigration     british      people      asylum     britain          uk 
##          66          37          35          29          28          27 
##      system  population     country         new  immigrants      ensure 
##          27          21          20          19          17          17 
##       shall citizenship      social    national         bnp     illegal 
##          17          16          14          14          13          13 
##        work     percent 
##          13          12

# plot a word cloud
set.seed(100)
textplot_wordcloud(mydfm, min.freq = 6, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Contributing

Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:

Install

install.packages('quanteda')

Monthly Downloads

23,966

Version

0.99

License

GPL-3

Maintainer

Kenneth Benoit

Last Published

August 15th, 2017

Functions in quanteda (0.99)

as.corpus

coerce a compressed corpus to a standard corpus
as.corpus.corpuszip

coerce a compressed corpus to a standard corpus
View

View methods for quanteda
applyDictionary

apply a dictionary or thesaurus to an object
as.list.dist

coerce a dist object into a list
as.list.dist_selection

coerce a dist_selection object into a list
as.dictionary

coercion and checking functions for dictionary objects
as.dist.dist

coerce a dist into a dist
as.matrix.dfm

coerce a dfm to a matrix or data.frame
as.matrix.dist_selection

coerce a dist_selection object to a matrix
as.matrix.simil

Coerce a simil object into a matrix
coef.textmodel

extract text model coefficients
collocations

deprecated function names for textstat_collocations
corpus

construct a corpus object
corpus_reshape

recast the document units of a corpus
bootstrap_dfm

bootstrap a dfm
cbind.dfm

Combine dfm objects by Rows or Columns
corpus_sample

randomly sample documents from a corpus
corpus_segment

segment texts into component elements
data_char_sampletext

a paragraph of text for testing various text-based functions
as.tokens

coercion, checking, and combining functions for tokens objects
corpus_subset

extract a subset of a corpus
corpus_trimsentences

remove sentences based on their token lengths or a pattern match
create

utility function to create an object with a new set of attributes
dfm_compress

recombine a dfm or fcm by combining identical dimension elements
dfm_group

combine documents in a dfm by a grouping variable
docvars

get or set document-level variables
fcm-class

Virtual class "fcm" for a feature co-occurrence matrix
is.dfm

coercion and checking functions for dfm objects
joinTokens

join tokens function
metacorpus

get or set corpus metadata
corpus_trim

remove sentences based on their token lengths or a pattern match
data_char_ukimmig2010

immigration-related sections of 2010 UK party manifestos
dfm

create a document-feature matrix
dfm2lsa

convert a dfm to an lsa "textmatrix"
changeunits

deprecated name for corpus_reshape
char_tolower

convert the case of character objects
compress

compress a dfm by combining similarly named dimensions
convert-wrappers

convenience wrappers for dfm convert
metadoc

get or set document-level meta-data
plot-deprecated

deprecated plotting functions
predict.textmodel_NB_fitted

prediction method for Naive Bayes classifier objects
selectFeatures

select features from an object
selectFeaturesOLD

old version of selectFeatures.tokenizedTexts
data-deprecated

datasets with deprecated or defunct names
data-internal

internal data sets
dfm-class

Virtual class "dfm" for a document-feature matrix
dfm-internal

internal functions for dfm objects
dfm_trim

trim a dfm using frequency threshold-based feature selection
dfm_weight

weight the feature frequencies in a dfm
groups

grouping variable(s) for various functions
textmodel_fitted-class

the fitted textmodel classes
textmodel_wordfish

wordfish text model
textmodel_wordscores

Wordscores text model
textmodel_wordshoal

wordshoal text model
as.yaml

convert quanteda dictionary objects to the YAML format
attributes<-

R-like alternative to reassign_attributes()
convert

convert a dfm to a non-quanteda format
corpus-class

base method extensions for corpus objects
data_dfm_lbgexample

dfm from data in Table 1 of Laver, Benoit, and Garry (2003)
deprecated-textstat

deprecated textstat names
dfm_lookup

apply a dictionary to a dfm
dfm_sample

randomly sample documents or features from a dfm
dictionary2-class

print a dictionary object
tokens_compound

convert token sequences into compound tokens
tokens_group

recombine documents tokens by groups
tokens_wordstem

stem the terms in an object
topfeatures

identify the most frequent features in a dfm
data_corpus_inaugural

US presidential inaugural address texts
data_corpus_irishbudget2010

Irish budget speeches from 2010
dfm_select

select features from a dfm or fcm
dfm_sort

sort a dfm by frequency of one or more margins
docfreq

compute the (weighted) document frequency of a feature
docnames

get or set document names
keyness

compute keyness (internal functions)
dictionary

create a dictionary
fcm

create a feature co-occurrence matrix
fcm_sort

sort an fcm in alphabetical order of the features
pattern

pattern for feature, token and keyword matching
dfm_subset

extract a subset of a dfm
dfm_tolower

convert the case of the features of a dfm and combine
featnames

get the feature labels from a dfm
features

deprecated function for featnames
head.dfm

return the first or last part of a dfm
nsyllable

count syllables in a text
ntoken

count the number of tokens or types
nscrabble

count the Scrabble letter values of text
nsentence

count the number of sentences
remove_attributes

utility function to remove all attributes
sample

randomly sample documents or features
phrase

declare a compound character to be a sequence of separate pattern matches
print.dfm

print a dfm object
print.dist_selection

print a dist_selection object
kwic

locate keywords-in-context
ndoc

count the number of documents or features
ngrams

deprecated function name for forming ngrams and skipgrams
print.phrases

print a phrase object
quanteda-package

An R package for the quantitative analysis of textual data
sort.dfm

sort a dfm by one or more margins
spacyr-methods

extensions of methods defined in the quanteda package
sparsity

compute the sparsity of a document-feature matrix
stopwords

access built-in stopwords
syllables

deprecated name for nsyllable
textfile

old function to read texts from files
textstat_frequency

tabulate feature frequencies
quanteda_options

get or set package options for quanteda
removeFeatures

remove features from an object
settings

Get or set the corpus settings
similarity

compute similarities between documents and/or features
scrabble

deprecated name for nscrabble
segment

segment: deprecated function
textmodel-internal

internal functions for textmodel objects
subset.corpus

deprecated name for corpus_subset
summary.corpus

summarize a corpus
textplot_keyness

plot word keyness
textplot_scale1d

plot a fitted scaling model
tfidf

compute tf-idf weights from a dfm
toLower

Convert texts to lower (or upper) case
tokens_ngrams

create ngrams and skipgrams from tokens
textstat_keyness

calculate keyness statistics
textstat_lexdiv

calculate lexical diversity
textstat_readability

calculate readability
textmodel

fit a text model
texts

get or assign corpus texts
textstat_collocations

identify and score multi-word expressions
tokenize

tokenize a set of texts
tokens

tokenize a set of texts
tokens_select

select or remove tokens from a tokens object
tokens_tolower

convert the case of tokens
tokens_hash

Function to hash list-of-character tokens
textmodel_NB

Naive Bayes classifier for texts
textmodel_ca

correspondence analysis of a document-feature matrix
textplot_wordcloud

plot features as a wordcloud
tokens_recompile

recompile a hashed tokens object
weight

weight or smooth a dfm
wordstem

stem words
tokens_lookup

apply a dictionary to a tokens object
textplot_xray

plot the dispersion of key word(s)
textstat_dist

Similarity and distance computation between documents or features
tf

compute (weighted) term frequency from a dfm
trim

deprecated name for dfm_trim
valuetype

pattern matching using valuetype