The Open Corpus Workbench (CWB) is an indexing and querying engine
popular in corpus-assisted research. Its core aim is to support working
efficiently with large, structurally and linguistically annotated corpora.
First, the CWB includes tools to index and compress corpora. Second, the
Corpus Library (CL) offers low-level functionality to retrieve information
from CWB-indexed corpora. Third, the Corpus Query Processor (CQP) offers a
syntax for anything from simple to complex queries that draw on the
different annotation layers of a corpus.
The CWB is a classic tool that has inspired a number of later developments.
A lasting advantage of the CWB is its mature, open source code base, which
is actively maintained by a community of developers. It serves as a robust
and efficient backend for widely used tools such as TXM
(https://txm.gitpages.huma-num.fr/textometrie/) and CQPweb
(https://cwb.sourceforge.io/cqpweb.php). Its uncompromising C
implementation makes it fast and, at the same time, well suited for
integration with R.
The RcppCWB package is a follow-up to the rcqp package, which pioneered
exposing CWB functionality from within R. Indeed, the rcqp package,
published on CRAN in 2015, offers robust access to CWB functionality.
However, the "pure C" implementation of the rcqp package makes it difficult
to port the package to Windows. The primary purpose of the RcppCWB package
is to reimplement a wrapper library for the CWB using a design that makes
it easier to achieve cross-platform portability.
Even though RcppCWB functions may be used directly, the package is designed
to serve as an interface to CWB-indexed corpora for packages with
higher-level functionality. In this regard, RcppCWB is the backend of the
polmineR package, but it is deliberately open to use in other contexts. The
package may encourage the use of linguistically annotated, indexed and
compressed corpora on all platforms, and the paradigm of working with text
as linguistic data may benefit from RcppCWB.