About

quanteda is an R package for managing and analyzing text, created and maintained by Kenneth Benoit and Kohei Watanabe. Its creation was funded by the European Research Council grant ERC-2011-StG 283794-QUANTESS and its continued development is supported by the Quanteda Initiative CIC.

For more details, see https://quanteda.io.

quanteda version 4

quanteda 4.0 is a major release that improves functionality and performance, and makes the function set more consistent by removing previously deprecated functions. It also includes significant new tokeniser rules that make the default tokeniser smarter than ever, with new Unicode and ICU-compliant rules enabling it to work more consistently with even more languages.

These significant changes are described more fully in the documentation at https://quanteda.io.

The quanteda family of packages

We completed the trend of splitting quanteda into modular packages with the release of v3. The quanteda family of packages includes the following:

  • quanteda: contains all of the core natural language processing and textual data management functions
  • quanteda.textmodels: contains all of the text models and supporting functions, namely the textmodel_*() functions. This was split from the main package with the v2 release
  • quanteda.textstats: statistics for textual data, namely the textstat_*() functions, split with the v3 release
  • quanteda.textplots: plots for textual data, namely the textplot_*() functions, split with the v3 release

We are working on additional package releases, available in the meantime from our GitHub pages:

  • quanteda.sentiment: Functions and lexicons for sentiment analysis using dictionaries
  • quanteda.tidy: Extensions for manipulating document variables in core quanteda objects using your favourite tidyverse functions

and more to come.

How To…

Install (binaries) from CRAN

Install in the normal way from CRAN, using your R GUI or:

install.packages("quanteda") 

(New for quanteda v4.0) For Linux users: because all installations on Linux are compiled from source, you will first need to install the Intel oneAPI Threading Building Blocks (TBB) library, which quanteda uses for parallel computing, or installation will fail.

To install TBB on Linux:

# Fedora, CentOS, RHEL
sudo yum install tbb-devel

# Debian and Ubuntu
sudo apt install libtbb-dev
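
The two commands above can be folded into one script that picks the package by distribution. The following is a minimal sketch, not part of quanteda: the function name tbb_install_cmd is hypothetical, and it assumes the distribution ID strings found in /etc/os-release (fedora, centos, rhel, debian, ubuntu). It only prints the command, so that you can review it before running it yourself.

```shell
#!/bin/sh
# Print the TBB install command for a given distribution ID.
# tbb_install_cmd is a hypothetical helper, not part of quanteda.
tbb_install_cmd() {
  case "$1" in
    fedora|centos|rhel) echo "sudo yum install tbb-devel" ;;
    debian|ubuntu)      echo "sudo apt install libtbb-dev" ;;
    *)                  echo "unsupported distribution: $1" >&2; return 1 ;;
  esac
}

# Read the distribution ID from /etc/os-release when the file exists
# (assumption: the file defines ID, as on most current distributions).
if [ -r /etc/os-release ]; then
  distro_id=$(. /etc/os-release && echo "${ID:-unknown}")
  tbb_install_cmd "$distro_id" || true
fi
```

Derivative distributions that report a different ID are treated as unsupported here and would need their own case entry.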

Compile from source (macOS and Windows)

Because this compiles some C++ and Fortran source code, you will need to have installed the appropriate compilers to build the development version.

You will also need to install TBB:

macOS:

First, you will need to install the Xcode command line tools.

xcode-select --install

Then, after installing Homebrew, install the TBB libraries and the pkg-config utility:

brew install tbb pkg-config

Finally, you will need to install gfortran.

Windows:

Install RTools, which includes the TBB libraries.

Enable parallelisation

quanteda takes advantage of parallel computing through the TBB (Threading Building Blocks) library to speed up computations. This guide provides step-by-step instructions for setting up your system to use quanteda with parallel capabilities on Windows, macOS, and Linux.

Windows:

Download and install RTools from the RTools download page.

macOS:

  1. Install Xcode Command Line Tools
    • Type the following command in the terminal:

      xcode-select --install
  2. Install Homebrew
    • If Homebrew is not installed, run:

      /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  3. Install TBB and pkg-config
    • After installing Homebrew, run:

      brew install tbb pkg-config
  4. Install gfortran
    • gfortran is required for compiling Fortran code. Homebrew provides it as part of GCC:

      brew install gcc
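
After these four steps, a quick check can confirm that the tools are on your PATH. This is a minimal sketch, assuming the commands are named xcode-select, brew, pkg-config and gfortran as in the steps above. The helper name missing_tools is hypothetical and not part of quanteda.

```shell
#!/bin/sh
# Print each named command that cannot be found on the PATH.
# missing_tools is a hypothetical helper, not part of quanteda.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the tools installed in steps 1 to 4.
missing=$(missing_tools xcode-select brew pkg-config gfortran)
if [ -n "$missing" ]; then
  echo "Not found on PATH:"
  echo "$missing"
else
  echo "All build tools found."
fi
```

The check only tests that each command exists. It does not verify versions or that the TBB libraries themselves are visible to pkg-config.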

Linux:

Install TBB:

  • For Fedora, CentOS, RHEL:

    sudo yum install tbb-devel
  • For Debian and Ubuntu:

    sudo apt install libtbb-dev

More details are provided in the quanteda documentation.

Use quanteda

See the quick start guide to learn how to use quanteda.

Get Help

Cite the package

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software 3(30), 774. https://doi.org/10.21105/joss.00774.

For a BibTeX entry, use the output from citation(package = "quanteda").

Leave Feedback

If you like quanteda, please consider leaving feedback or a testimonial here.

Contribute

Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute:

Package details

  • Version: 4.2.0
  • License: GPL-3
  • Maintainer: Kenneth Benoit
  • Last published: January 8th, 2025
  • Monthly downloads: 19,698

Functions in quanteda (4.2.0)

  • check_integer: Validate input vectors
  • check_class: Check object class for functions
  • check_dots: Check arguments passed to other functions via ...
  • concat: Return the concatenator character from an object
  • corpus-class: Base method extensions for corpus objects
  • convert: Convert quanteda objects to non-quanteda formats
  • data-internal: Internal data sets
  • dfm_compress: Recombine a dfm or fcm by combining identical dimension elements
  • dfm2lsa: Convert a dfm to an lsa "textmatrix"
  • corpus_sample: Randomly sample documents from a corpus
  • corpus_segment: Segment texts on a pattern match
  • data_char_ukimmig2010: Immigration-related sections of 2010 UK party manifestos
  • data_char_sampletext: A paragraph of text for testing various text-based functions
  • data-relocated: Formerly included data objects
  • char_tolower: Convert the case of character objects
  • dfm-internal: Internal functions for dfm objects
  • char_select: Select or remove elements from a character vector
  • dfm_match: Match the feature set of a dfm to given feature names
  • data_dictionary_LSD2015: Lexicoder Sentiment Dictionary (2015)
  • corpus: Construct a corpus object
  • bootstrap_dfm: Bootstrap a dfm
  • dfm-class: Virtual class "dfm" for a document-feature matrix
  • corpus_subset: Extract a subset of a corpus
  • dfm_subset: Extract a subset of a dfm
  • dfm_sort: Sort a dfm by frequency of one or more margins
  • escape_regex: Internal function for select_types() to escape regular expressions
  • docvars: Get or set document-level variables
  • cbind.dfm: Combine dfm objects by Rows or Columns
  • dfm_trim: Trim a dfm using frequency threshold-based feature selection
  • corpus_reshape: Recast the document units of a corpus
  • dfm_weight: Weight the feature frequencies in a dfm
  • corpus_group: Combine documents in corpus by a grouping variable
  • dfm_tfidf: Weight a dfm by tf-idf
  • is.collocations: Check if an object is collocations
  • dfm_tolower: Convert the case of the features of a dfm and combine
  • format_sparsity: format a sparsity value for printing
  • dictionary: Create a dictionary
  • dictionary2-class: dictionary class objects and functions
  • dfm_group: Combine documents in a dfm by a grouping variable
  • flatten_list: Internal function to flatten a nested list
  • dfm: Create a document-feature matrix
  • corpus_trim: Remove sentences based on their token lengths or a pattern match
  • dfm_lookup: Apply a dictionary to a dfm
  • dfm_sample: Randomly sample documents from a dfm
  • dfm_select: Select features from a dfm or fcm
  • expand: Simpler and faster version of expand.grid() in base package
  • data_dfm_lbgexample: dfm from data in Table 1 of Laver, Benoit, and Garry (2003)
  • data_corpus_inaugural: US presidential inaugural address texts
  • is_glob: Check if patterns contains glob wildcard
  • make_meta: Internal functions to create a list of the meta fields
  • print-methods: Print methods for quanteda core objects
  • quanteda-package: An R package for the quantitative analysis of textual data
  • matrix2dfm: Converts a Matrix to a dfm
  • index: Locate a pattern in a tokens object
  • info_tbb: Get information on TBB library
  • fcm_sort: Sort an fcm in alphabetical order of the features
  • fcm: Create a feature co-occurrence matrix
  • kwic: Locate keywords-in-context
  • docnames: Get or set document names
  • docfreq: Compute the (weighted) document frequency of a feature
  • field_system: Shortcut functions to access or assign metadata
  • message_error: Return an error message
  • dfm_replace: Replace features in dfm
  • flatten_dictionary: Flatten a hierarchical dictionary into a list of character vectors
  • message_dfm: Print messages in dfm methods
  • fcm-class: Virtual class "fcm" for a feature co-occurrence matrix
  • get_docvars: Internal function to extract docvars
  • quanteda_options: Get or set package options for quanteda
  • messages: Message parameter documentation
  • msg: Conditionally format messages
  • split_values: Internal function for special handling of multi-word dictionary values
  • message_tokens: Print messages in tokens methods
  • names-quanteda: Special handling for names of quanteda objects
  • summary.corpus: Summarize a corpus
  • meta: Get or set object metadata
  • meta_system: Internal function to get, set or initialize system metadata
  • tokens_restore: Restore special tokens
  • phrase: Declare a pattern to be a sequence of separate patterns
  • %>%: Pipe operator
  • search_glob: Select types without performing slow regex search
  • search_index: Internal function for select_types to search the index using fastmatch.
  • print.phrases: Print a phrase object
  • featfreq: Compute the frequencies of features
  • tokens_sample: Randomly sample documents from a tokens object
  • remove_empty_keys: Utility function to remove empty keys
  • replace_dictionary_values: Internal function to replace dictionary values
  • textstats: Statistics for textual data
  • tokenize_custom: Customizable tokenizer
  • get_object_version: Get the package version that created an object
  • tokens_replace: Replace tokens in a tokens object
  • tokens_recompile: recompile a serialized tokens object
  • textplots: Plots for textual data
  • featnames: Get the feature labels from a dfm
  • is_indexed: Check if a glob pattern is indexed by index_types
  • topfeatures: Identify the most frequent features in a dfm
  • lowercase_dictionary_values: Internal function to lowercase dictionary values
  • make_docvars: Internal function to make new system-level docvars
  • list2dictionary: Internal function to convert a list to a dictionary
  • groups: Grouping variable(s) for various functions
  • tokenize_internal: quanteda tokenizers
  • tokens-class: Base method extensions for tokens objects
  • tokens_wordstem: Stem the terms in an object
  • tokens_lookup: Apply a dictionary to a tokens object
  • tokens_ngrams: Create n-grams and skip-grams from tokens
  • tokens_tolower: Convert the case of tokens
  • is_regex: Check if a string is a regular expression
  • head.dfm: Return the first or last part of a dfm
  • matrix2fcm: Converts a Matrix to a fcm
  • merge_dictionary_values: Internal function to merge values of duplicated keys
  • tokens_trim: Trim tokens using frequency threshold-based feature selection
  • read_dict_functions: Internal functions to import dictionary files
  • object-builders: Object builders
  • ndoc: Count the number of documents or features
  • object2id: Match quanteda objects against token types
  • nest_dictionary: Utility function to generate a nested list
  • pattern2id: Match patterns against token types
  • ntoken: Count the number of tokens or types
  • pattern: Pattern for feature, token and keyword matching
  • valuetype: Pattern matching using valuetype
  • nsentence: Count the number of sentences
  • tokens_xptr: Methods for tokens_xptr objects
  • spacyr-methods: Extensions for and from spacy_parse objects
  • reexports: Objects exported from other packages
  • types: Get word types from a tokens object
  • sparsity: Compute the sparsity of a document-feature matrix
  • texts: Get or assign corpus texts [deprecated]
  • resample: Sample a vector
  • serialize_tokens: Function to serialize list-of-character tokens
  • reshape_docvars: Internal function to subset or duplicate docvar rows
  • summary_metadata: Functions to add or retrieve corpus summary metadata
  • tokens_compound: Convert token sequences into compound tokens
  • tokens_split: Split tokens by a separator pattern
  • set_dfm_dimnames<-: Internal functions to set dimnames
  • tokens_group: Combine documents in a tokens object by a grouping variable
  • tokens_subset: Extract a subset of a tokens
  • textmodels: Models for scaling and classification of textual data
  • tokens: Construct a tokens object
  • tokens_segment: Segment tokens object by patterns
  • tokens_select: Select or remove tokens from a tokens object
  • tokens_chunk: Segment tokens object by chunks of a given size
  • unlist_integer: Unlist a list of integer vectors safely
  • unlist_character: Unlist a list of character vectors safely
  • as.dictionary: Coercion and checking functions for dictionary objects
  • as.yaml: Convert quanteda dictionary objects to the YAML format
  • as.data.frame.dfm: Convert a dfm to a data.frame
  • as.fcm: Coercion and checking functions for fcm objects
  • as.matrix.dfm: Coerce a dfm to a matrix or data.frame
  • attributes<-: Function extending base::attributes()
  • apply_if: Modify only documents matching a logical condition
  • as.character.corpus: Coercion and checking methods for corpus objects
  • as.list.tokens: Coercion, checking, and combining functions for tokens objects
  • convert-wrappers: Convenience wrappers for dfm convert
  • as.dfm: Coercion and checking functions for dfm objects