arrow

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

The arrow package exposes an interface to the Arrow C++ library to access many of its features in R. This includes support for working with Parquet (read_parquet(), write_parquet()) and Feather (read_feather(), write_feather()) files, as well as lower-level access to Arrow memory and messages.

Installation

Install the latest release of arrow from CRAN with

install.packages("arrow")

On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the Arrow project installation page for a list of PPAs from which you can obtain it.

If you install the arrow package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call

arrow::install_arrow()

for version- and platform-specific guidance on installing the Arrow C++ library.

Example

library(arrow)
set.seed(24)

tab <- arrow::table(x = 1:10, y = rnorm(10))
tab$schema
#> arrow::Schema 
#> x: int32
#> y: double
tab
#> arrow::Table
as.data.frame(tab)
#>     x            y
#> 1   1 -0.545880758
#> 2   2  0.536585304
#> 3   3  0.419623149
#> 4   4 -0.583627199
#> 5   5  0.847460017
#> 6   6  0.266021979
#> 7   7  0.444585270
#> 8   8 -0.466495124
#> 9   9 -0.848370044
#> 10 10  0.002311942

Installing a development version

To use the development version of the R package, you’ll need to install it from source, which requires the additional C++ library setup. On macOS, you may install the C++ library using Homebrew:

# For the released version:
brew install apache-arrow
# Or for a development version, you can try:
brew install apache-arrow --HEAD

On Windows, you can download a .zip file with the arrow dependencies from the rwinlib project, and then set the RWINLIB_LOCAL environment variable to point to that zip file before installing the arrow R package. That project contains released versions of the C++ library; for a development version, Windows users may be able to find a binary by going to the Apache Arrow project’s Appveyor, selecting an R job from a recent build, and downloading the build\arrow-*.zip file from the “Artifacts” tab.

Linux users can get a released version of the library from our PPAs, as described above. If you need a development version of the C++ library, you will likely need to build it from source. See “Developing” below.

Once you have the C++ library, you can install the R package from GitHub using the remotes package. From within an R session,

# install.packages("remotes") # Or install "devtools", which includes remotes
remotes::install_github("apache/arrow/r")

or if you prefer to stay at the command line,

R -e 'remotes::install_github("apache/arrow/r")'

You can specify a particular commit, branch, or release to install by including a ref argument to install_github().

Developing

If you need to alter both the Arrow C++ library and the R package code, or if you can’t get a binary version of the latest C++ library elsewhere, you’ll need to build it from source too.

First, clone the repository and install a release build of the C++ library.

git clone https://github.com/apache/arrow.git
mkdir arrow/cpp/build && cd arrow/cpp/build
cmake .. -DARROW_PARQUET=ON -DARROW_BOOST_USE_SHARED:BOOL=Off -DARROW_INSTALL_NAME_RPATH=OFF
make install

This will likely require additional system libraries to be installed, the specifics of which are platform dependent. See the C++ developer guide for details.

Once you’ve built the C++ library, you can install the R package and its dependencies, along with additional dev dependencies, from the git checkout:

cd ../../r
R -e 'install.packages("devtools"); devtools::install_dev_deps()'
R CMD INSTALL .

If the package fails to install/load with an error like this:

** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib

try setting the environment variable LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on macOS) to the directory where make install put the Arrow C++ library, e.g. export LD_LIBRARY_PATH=/usr/local/lib, and retry installing the R package.
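
As a rough sketch of that workaround, the snippet below prepends the library directory to whichever variable applies on the current platform. The helper name prepend_lib_path and the ARROW_LIB_DIR variable are hypothetical, introduced only for this illustration, and /usr/local/lib is just the example location mentioned above; substitute the directory your own make install used.

```shell
# Hypothetical helper: prepend a directory to a colon-separated path list,
# without leaving a stray ":" when the list is currently empty.
prepend_lib_path() {
  dir="$1"
  current="$2"
  if [ -z "$current" ]; then
    printf '%s\n' "$dir"
  else
    printf '%s\n' "$dir:$current"
  fi
}

# Assumption: Arrow C++ was installed under /usr/local/lib.
# ARROW_LIB_DIR is a made-up variable name used to override that default.
ARROW_LIB_DIR="${ARROW_LIB_DIR:-/usr/local/lib}"

# macOS reads DYLD_LIBRARY_PATH; other Unix-like systems read LD_LIBRARY_PATH.
if [ "$(uname -s)" = "Darwin" ]; then
  DYLD_LIBRARY_PATH="$(prepend_lib_path "$ARROW_LIB_DIR" "${DYLD_LIBRARY_PATH:-}")"
  export DYLD_LIBRARY_PATH
else
  LD_LIBRARY_PATH="$(prepend_lib_path "$ARROW_LIB_DIR" "${LD_LIBRARY_PATH:-}")"
  export LD_LIBRARY_PATH
fi
```

With the variable exported in the same shell, rerun R CMD INSTALL . from the r directory.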

For any other build/configuration challenges, see the C++ developer guide.

Editing Rcpp code

The arrow package uses some customized tools on top of Rcpp to prepare its C++ code in src/. If you change C++ code in the R package, you will need to set the ARROW_R_DEV environment variable to TRUE (optionally, add it to your ~/.Renviron file to persist across sessions) so that the data-raw/codegen.R file is used for code generation.
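
One possible way to set the variable from a shell is sketched below. The helper enable_arrow_r_dev is a hypothetical name, not part of the arrow tooling; it appends ARROW_R_DEV=TRUE to an Renviron-style file only if no ARROW_R_DEV line is there yet, so running it twice does not duplicate the entry.

```shell
# Hypothetical helper: add ARROW_R_DEV=TRUE to an Renviron-style file
# unless a line setting ARROW_R_DEV is already present.
enable_arrow_r_dev() {
  renviron="$1"
  touch "$renviron"
  if ! grep -q '^ARROW_R_DEV=' "$renviron"; then
    # If the file is non-empty and lacks a trailing newline, add one first
    # so the new entry starts on its own line.
    if [ -n "$(tail -c 1 "$renviron")" ]; then
      printf '\n' >> "$renviron"
    fi
    printf 'ARROW_R_DEV=TRUE\n' >> "$renviron"
  fi
}

# For the current shell session only (inherited by R started from it):
export ARROW_R_DEV=TRUE

# To persist across sessions, uncomment the next line. It is left
# commented out so that nothing is written until you opt in.
# enable_arrow_r_dev "$HOME/.Renviron"
```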

You’ll also need to install the decor package with remotes::install_github("romainfrancois/decor").

Useful functions

Within an R session, these can help with package development:

devtools::load_all() # Load the dev package
devtools::test(filter="^regexp$") # Run the test suite, optionally filtering file names
devtools::document() # Update roxygen documentation
rmarkdown::render("README.Rmd") # To rebuild README.md
pkgdown::build_site(run_dont_run=TRUE) # To preview the documentation website
devtools::check() # All package checks; see also below

Any of these can be run from the command line by wrapping them in R -e '$COMMAND'. There’s also a Makefile to help with some common tasks from the command line (make test, make doc, make clean, etc.).
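
The R -e wrapping can be sketched as a small shell function. The name r_eval is hypothetical and exists only for this illustration; the point is the quoting: single quotes around the R expression keep the shell from touching the double quotes and the $ inside it.

```shell
# Hypothetical wrapper: run a single R expression non-interactively.
# The expression is passed through unchanged as the argument to -e.
r_eval() {
  R -e "$1"
}

# Usage, from the r/ directory of the checkout (not executed here):
#   r_eval 'devtools::document()'
#   r_eval 'devtools::test(filter="^regexp$")'
```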

Full package validation

R CMD build --keep-empty-dirs .
R CMD check arrow_*.tar.gz --as-cran --no-manual

Monthly Downloads: 335,737
Version: 0.14.1
License: Apache License (>= 2.0)
Maintainer: Neal Richardson
Last Published: August 5th, 2019

Functions in arrow (0.14.1)

MockOutputStream
arrow__io__InputStream: class arrow::io::InputStream
read_delim_arrow: Read a CSV or other delimited file with Arrow
arrow__io__ReadableFile: class arrow::io::ReadableFile
csv_read_options: Read options for the Arrow file readers
ReadableFile
read_feather: Read a Feather file
arrow__ipc__Message: class arrow::ipc::Message
csv_table_reader: Arrow CSV and JSON table readers
arrow__io__MemoryMappedFile: class arrow::io::MemoryMappedFile
arrow__Column: class arrow::Column
arrow__DataType: class arrow::DataType
type: infer the arrow Array type from an R vector
RecordBatchFileReader
RecordBatchFileWriter: Create a record batch file writer from a stream
write_arrow: Write Arrow formatted data
arrow__io__OutputStream: OutputStream
arrow__RecordBatchReader: class arrow::RecordBatchReader
arrow__ArrayData: class arrow::ArrayData
arrow__Schema: class arrow::Schema
arrow__Array: class arrow::Array Array base type. Immutable data array with some logical type and some length.
arrow__io__MockOutputStream: class arrow::io::MockOutputStream
arrow__FixedWidthType: class arrow::FixedWidthType
arrow__ipc__RecordBatchFileWriter: class arrow::ipc::RecordBatchFileWriter Writer for the Arrow binary file format
FeatherTableReader: A arrow::ipc::feather::TableReader to read from a file
arrow__RecordBatch: class arrow::RecordBatch
arrow__Table: class arrow::Table
array: create an arrow::Array from an R vector
arrow__ipc__MessageReader: class arrow::ipc::MessageReader
arrow__DictionaryType: class arrow::DictionaryType
arrow__Field: class arrow::Field
arrow-package: arrow: Integration to 'Apache' 'Arrow'
arrow__ipc__RecordBatchFileReader: class arrow::ipc::RecordBatchFileReader
arrow__MemoryPool: class arrow::MemoryPool
buffer: Create a arrow::Buffer from an R object
arrow__ipc__RecordBatchStreamReader: class arrow::ipc::RecordBatchStreamReader
arrow__io__BufferOutputStream: class arrow::io::BufferOutputStream
cast_options: Cast options
mmap_open: Open a memory mapped file
arrow_available: Is the C++ Arrow library available?
mmap_create: Create a new read/write memory mapped file of a given size
arrow__json__TableReader: class arrow::json::TableReader
record_batch
arrow__io__RandomAccessFile: class arrow::io::RandomAccessFile
RecordBatchStreamReader
arrow__io__Readable: class arrow::io::Readable
RecordBatchStreamWriter: Writer for the Arrow streaming binary format
chunked_array: create an arrow::ChunkedArray from various R vectors
arrow__Buffer: class arrow::Buffer
default_memory_pool
read_schema: read a Schema from a stream
field: Factory for a arrow::Field
install_arrow: Help installing the Arrow C++ library
compression_codec: codec
dictionary: dictionary type factory
arrow__io__BufferReader: class arrow::io::BufferReader
arrow__io__FileOutputStream: class arrow::io::FileOutputStream
arrow__ChunkedArray: class arrow::ChunkedArray
arrow__io__FixedSizeBufferWriter: class arrow::io::FixedSizeBufferWriter
read_json_arrow: Read a JSON file
read_table: Read an arrow::Table from a stream
write_parquet: Write Parquet file to disk
parquet_arrow_reader_properties: Create a new ArrowReaderProperties instance
arrow__ipc__RecordBatchStreamWriter: class arrow::ipc::RecordBatchStreamWriter Writer for the Arrow streaming binary format
arrow__ipc__RecordBatchWriter: class arrow::ipc::RecordBatchWriter
parquet_file_reader: Parquet file reader
read_message: Read a Message from a stream
reexports: Objects exported from other packages
read_parquet: Read a Parquet file
read_record_batch: read arrow::RecordBatch as encapsulated IPC message, given a known arrow::Schema
write_feather_RecordBatch: Write a record batch in the feather format
write_feather: Write data in the Feather format
csv_convert_options: Conversion options for the CSV reader
schema: Schema factory
csv_parse_options: Parsing options for Arrow file readers
table: Create an arrow::Table from a data frame
TimeUnit: Apache Arrow data types
CompressedInputStream: Compressed input stream
CompressedOutputStream: Compressed output stream
FeatherTableWriter: Create TableWriter that writes into a stream
MessageReader: Open a MessageReader that reads from a stream
BufferOutputStream
FileOutputStream
BufferReader
FixedSizeBufferWriter