arrow

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

The arrow package exposes an interface to the Arrow C++ library, enabling access to many of its features in R. It provides low-level access to the Arrow C++ library API and higher-level access through a {dplyr} backend and familiar R functions.

What can the arrow package do?

  • Read and write Parquet files (read_parquet(), write_parquet()), an efficient and widely used columnar format
  • Read and write Feather files (read_feather(), write_feather()), a format optimized for speed and interoperability
  • Analyze, process, and write multi-file, larger-than-memory datasets (open_dataset(), write_dataset())
  • Read large CSV and JSON files with excellent speed and efficiency (read_csv_arrow(), read_json_arrow())
  • Write CSV files (write_csv_arrow())
  • Manipulate and analyze Arrow data with dplyr verbs
  • Read and write files in Amazon S3 buckets with no additional function calls
  • Exercise fine control over column types for seamless interoperability with databases and data warehouse systems
  • Use compression codecs including Snappy, gzip, Brotli, Zstandard, LZ4, LZO, and bzip2 for reading and writing data
  • Enable zero-copy data sharing between R and Python
  • Connect to Arrow Flight RPC servers to send and receive large datasets over networks
  • Access and manipulate Arrow objects through low-level bindings to the C++ library
  • Provide a toolkit for building connectors to other applications and services that use Arrow
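
Which of these features are available in a given installation depends on how the Arrow C++ library was built for your platform. As a minimal sketch, assuming arrow is already installed, arrow_info() reports the optional capabilities of the build and codec_is_available() checks a single compression codec:

```r
library(arrow, warn.conflicts = FALSE)

# Named logical vector of optional features compiled into this build
caps <- arrow_info()$capabilities
caps[c("dataset", "parquet", "s3")]

# Check one compression codec before relying on it
snappy_ok <- codec_is_available("snappy")
snappy_ok
```

A FALSE entry means the corresponding feature (for example S3 support) was not compiled in, so functions that depend on it will fail.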

Installation

Installing the latest release version

Install the latest release of arrow from CRAN with

install.packages("arrow")

Conda users can install arrow from conda-forge with

conda install -c conda-forge --strict-channel-priority r-arrow

Installing a released version of the arrow package requires no additional system dependencies. For macOS and Windows, CRAN hosts binary packages that contain the Arrow C++ library. On Linux, source package installation will also build necessary C++ dependencies. For a faster, more complete installation, set the environment variable NOT_CRAN=true. See vignette("install", package = "arrow") for details.

For Windows users of R 3.6 and earlier, note that support for AWS S3 is not available, and the 32-bit version does not support Arrow Datasets. These features are only supported by the rtools40 toolchain on Windows and thus are only available in R >= 4.0.

Installing a development version

Development versions of the package (binary and source) are built nightly and hosted at https://arrow-r-nightly.s3.amazonaws.com. To install from there:

install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")

Conda users can install arrow nightly builds with

conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow

If you already have a version of arrow installed, you can switch to the latest nightly development version with

arrow::install_arrow(nightly = TRUE)

These nightly package builds are not official Apache releases and are not recommended for production use. They may be useful for testing bug fixes and new features under active development.

Usage

Among the many applications of the arrow package, two of the most accessible are:

  • High-performance reading and writing of data files with multiple file formats and compression codecs, including built-in support for cloud storage
  • Analyzing and manipulating bigger-than-memory data with dplyr verbs

The sections below describe these two uses and illustrate them with basic examples. They refer to two Arrow data structures:

  • Table: a tabular, column-oriented data structure capable of storing and processing large amounts of data more efficiently than R’s built-in data.frame and with SQL-like column data types that afford better interoperability with databases and data warehouse systems
  • Dataset: a data structure functionally similar to Table but with the capability to work on larger-than-memory data partitioned across multiple files
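
To make the Table structure concrete, here is a small sketch that builds a Table from R vectors and inspects it. The column names id and label are made up for illustration:

```r
library(arrow, warn.conflicts = FALSE)

# Build an Arrow Table from R vectors (illustrative column names)
tbl <- arrow_table(id = 1:3, label = c("a", "b", "c"))

tbl$num_rows     # number of rows
tbl$num_columns  # number of columns
tbl$schema       # column names with their Arrow data types

# Convert back to an R data.frame when needed
df <- as.data.frame(tbl)
```

The schema shows the Arrow type chosen for each column, which is what makes the types explicit when exchanging data with other systems.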

Reading and writing data files with arrow

The arrow package provides functions for reading single data files in several common formats. By default, calling any of these functions returns an R data.frame. To return an Arrow Table, set argument as_data_frame = FALSE.

  • read_parquet(): read a file in Parquet format
  • read_feather(): read a file in Feather format (the Apache Arrow IPC format)
  • read_delim_arrow(): read a delimited text file (default delimiter is comma)
  • read_csv_arrow(): read a comma-separated values (CSV) file
  • read_tsv_arrow(): read a tab-separated values (TSV) file
  • read_json_arrow(): read a JSON data file
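
As a short sketch of the as_data_frame argument, the following writes the built-in mtcars data to a temporary CSV file and reads it back both ways. The path csv_path is a temporary file created only for this example:

```r
library(arrow, warn.conflicts = FALSE)

csv_path <- tempfile(fileext = ".csv")
write_csv_arrow(mtcars, csv_path)

# Default: the result is an R data.frame
df <- read_csv_arrow(csv_path)

# as_data_frame = FALSE keeps the data in Arrow memory as a Table
tbl <- read_csv_arrow(csv_path, as_data_frame = FALSE)

class(df)
class(tbl)
```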

For writing data to single files, the arrow package provides the functions write_parquet(), write_feather(), and write_csv_arrow(). These can be used with R data.frame and Arrow Table objects.

For example, let’s write the Star Wars characters data that’s included in dplyr to a Parquet file, then read it back in. Parquet is a popular choice for storing analytic data; it is optimized for reduced file sizes and fast read performance, especially for column-based access patterns. Parquet is widely supported by many tools and platforms.

First load the arrow and dplyr packages:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

Then write the data.frame named starwars to a Parquet file at file_path:

file_path <- tempfile()
write_parquet(starwars, file_path)

Then read the Parquet file into an R data.frame named sw:

sw <- read_parquet(file_path)
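
Because Parquet is a columnar format, a reader can load only the columns it needs. A minimal sketch using the col_select argument of read_parquet(); the file pq_path is a fresh temporary file so that the example runs on its own:

```r
library(arrow, warn.conflicts = FALSE)

pq_path <- tempfile(fileext = ".parquet")
write_parquet(dplyr::starwars, pq_path)

# Read only the columns needed instead of the whole file
sw_small <- read_parquet(pq_path, col_select = c("name", "height", "mass"))
names(sw_small)
```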

R object attributes are preserved when writing data to Parquet or Feather files and when reading those files back into R. This enables round-trip writing and reading of sf::sf objects, R data.frames with haven::labelled columns, and data.frames with other custom attributes.
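
The round trip can be sketched with a custom attribute on a plain data.frame. The attribute name source and its value are hypothetical, chosen only for illustration, and how attributes are restored may differ between arrow versions:

```r
library(arrow, warn.conflicts = FALSE)

df <- data.frame(x = 1:3)
# Custom attribute attached to the data.frame (illustrative value)
attr(df, "source") <- "hypothetical sensor A"

attr_path <- tempfile(fileext = ".parquet")
write_parquet(df, attr_path)
df_back <- read_parquet(attr_path)

# The attribute is expected to come back with the data
attr(df_back, "source")
```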

For reading and writing larger files or sets of multiple files, arrow defines Dataset objects and provides the functions open_dataset() and write_dataset(), which enable analysis and processing of bigger-than-memory data, including the ability to partition data into smaller chunks without loading the full data into memory. For examples of these functions, see vignette("dataset", package = "arrow").
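
As a small-scale sketch of the same workflow, the following partitions the built-in mtcars data by the cyl column, then opens the resulting directory as a Dataset and queries it. The directory ds_dir is a temporary location created for the example; real datasets would be far larger:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ds_dir <- tempfile()

# Write one subdirectory of Parquet files per value of cyl
write_dataset(mtcars, ds_dir, format = "parquet", partitioning = "cyl")
list.files(ds_dir)

# Open the directory as a Dataset; the data is not loaded into memory yet
ds <- open_dataset(ds_dir)

# The query runs, and data is read, only when collect() is called
four_cyl <- ds %>%
  filter(cyl == 4) %>%
  select(mpg, hp, cyl) %>%
  collect()
```

Partitioning on a column that queries commonly filter by lets Arrow skip files that cannot contain matching rows.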

All these functions can read and write files in the local filesystem or in Amazon S3 (by passing S3 URIs beginning with s3://). For more details, see vignette("fs", package = "arrow").

Using dplyr with arrow

The arrow package provides a dplyr backend enabling manipulation of Arrow tabular data with dplyr verbs. To use it, first load both packages arrow and dplyr. Then load data into an Arrow Table or Dataset object. For example, read the Parquet file written in the previous example into an Arrow Table named sw:

sw <- read_parquet(file_path, as_data_frame = FALSE)

Next, pipe on dplyr verbs:

result <- sw %>%
  filter(homeworld == "Tatooine") %>%
  rename(height_cm = height, mass_kg = mass) %>%
  mutate(height_in = height_cm / 2.54, mass_lbs = mass_kg * 2.2046) %>%
  arrange(desc(birth_year)) %>%
  select(name, height_in, mass_lbs)

The arrow package uses lazy evaluation to delay computation until the result is required. This speeds up processing by enabling the Arrow C++ library to perform multiple computations in one operation. result is an object with class arrow_dplyr_query which represents all the computations to be performed:

result
#> Table (query)
#> name: string
#> height_in: expr
#> mass_lbs: expr
#>
#> * Filter: equal(homeworld, "Tatooine")
#> * Sorted by birth_year [desc]
#> See $.data for the source Arrow object

To perform these computations and materialize the result, call compute() or collect(). compute() returns an Arrow Table, suitable for passing to other arrow or dplyr functions:

result %>% compute()
#> Table
#> 10 rows x 3 columns
#> $name <string>
#> $height_in <double>
#> $mass_lbs <double>

collect() returns an R data.frame, suitable for viewing or passing to other R functions for analysis or visualization:

result %>% collect()
#> # A tibble: 10 x 3
#>    name               height_in mass_lbs
#>    <chr>                  <dbl>    <dbl>
#>  1 C-3PO                   65.7    165.
#>  2 Cliegg Lars             72.0     NA
#>  3 Shmi Skywalker          64.2     NA
#>  4 Owen Lars               70.1    265.
#>  5 Beru Whitesun lars      65.0    165.
#>  6 Darth Vader             79.5    300.
#>  7 Anakin Skywalker        74.0    185.
#>  8 Biggs Darklighter       72.0    185.
#>  9 Luke Skywalker          67.7    170.
#> 10 R5-D4                   38.2     70.5

The arrow package works with most single-table dplyr verbs, including those that compute aggregates.

sw %>%
  group_by(species) %>%
  summarise(mean_height = mean(height, na.rm = TRUE)) %>%
  collect()

Additionally, equality joins (e.g. left_join(), inner_join()) are supported for joining multiple tables.

jedi <- data.frame(
  name = c("C-3PO", "Luke Skywalker", "Obi-Wan Kenobi"),
  jedi = c(FALSE, TRUE, TRUE)
)

sw %>%
  select(1:11) %>%
  right_join(jedi) %>%
  collect()

Window functions (e.g. ntile()) are not yet supported. Inside dplyr verbs, Arrow offers support for many functions and operators, with common functions mapped to their base R and tidyverse equivalents. The changelog lists many of them. If there are additional functions you would like to see implemented, please file an issue as described in the Getting help section below.

For dplyr queries on Table objects, if the arrow package detects an unimplemented function within a dplyr verb, it automatically calls collect() to return the data as an R data.frame before processing that dplyr verb. For queries on Dataset objects (which can be larger than memory), it raises an error if the function is unimplemented; you need to explicitly tell it to collect().
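
When a query needs an R function that Arrow does not implement, one pattern is to do the supported filtering and column selection in Arrow first, call collect() explicitly, and then apply the R function to the smaller data.frame. A minimal sketch, using an in-memory Table for brevity; power_to_weight() is a hypothetical helper invented for this illustration:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Hypothetical R-only helper that has no Arrow equivalent
power_to_weight <- function(hp, wt) round(hp / wt, 1)

tbl <- arrow_table(mtcars)

res <- tbl %>%
  filter(cyl == 4) %>%          # evaluated by Arrow
  select(hp, wt) %>%            # evaluated by Arrow
  collect() %>%                 # materialize as an R data.frame
  mutate(ptw = power_to_weight(hp, wt))  # evaluated by dplyr in R
```

Placing collect() as late as possible keeps most of the work in Arrow and limits how much data is pulled into R.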

Additional features

Other applications of arrow are described in the following vignettes:

  • vignette("python", package = "arrow"): use arrow and reticulate to pass data between R and Python
  • vignette("flight", package = "arrow"): connect to Arrow Flight RPC servers to send and receive data
  • vignette("arrow", package = "arrow"): access and manipulate Arrow objects through low-level bindings to the C++ library

The Arrow for R cheatsheet and Cookbook are additional resources for getting started with arrow.

Getting help

If you encounter a bug, please file an issue with a minimal reproducible example on the Apache Jira issue tracker. Create an account or log in, then click Create to file an issue. Select the project Apache Arrow (ARROW), select the component R, and begin the issue summary with [R] followed by a space. For more information, see the Report bugs and propose features section of the Contributing to Apache Arrow page in the Arrow developer documentation.

We welcome questions, discussion, and contributions from users of the arrow package. For information about mailing lists and other venues for engaging with the Arrow developer and user communities, please see the Apache Arrow Community page.


All participation in the Apache Arrow project is governed by the Apache Software Foundation’s code of conduct.

Version: 8.0.0
License: Apache License (>= 2.0)
Monthly Downloads: 377,780
Last Published: May 9th, 2022

Functions in arrow (8.0.0)

  • Codec: Compression Codec class
  • ArrayData: ArrayData class
  • FixedWidthType: class arrow::FixedWidthType
  • arrow-package: arrow: Integration to 'Apache' 'Arrow'
  • CsvReadOptions: File reader options
  • Expression: Arrow expressions
  • FeatherReader: FeatherReader class
  • ExtensionArray: class arrow::ExtensionArray
  • ExtensionType: class arrow::ExtensionType
  • OutputStream: OutputStream classes
  • FileSystem: FileSystem classes
  • FileWriteOptions: Format-specific write options
  • ParquetWriterProperties: ParquetWriterProperties class
  • InputStream: InputStream classes
  • Partitioning: Define Partitioning for a Dataset
  • RecordBatch: RecordBatch class
  • MemoryPool: class arrow::MemoryPool
  • RecordBatchReader: RecordBatchReader classes
  • arrow_available: Is the C++ Arrow library available?
  • as_arrow_array: Convert an object to an Arrow Array
  • copy_files: Copy files between FileSystems
  • arrow_info: Report information on the package's capabilities
  • RecordBatchWriter: RecordBatchWriter classes
  • flight_get: Get data from a Flight server
  • ParquetArrowReaderProperties: ParquetArrowReaderProperties class
  • cpu_count: Manage the global CPU thread pool in libarrow
  • flight_put: Send data to a Flight server
  • list_compute_functions: List available Arrow C++ compute functions
  • list_flights: See available resources on a Flight server
  • Dataset: Multi-file datasets
  • DictionaryType: class DictionaryType
  • as_record_batch_reader: Convert an object to an Arrow RecordBatchReader
  • Scalar: Arrow scalars
  • Field: Field class
  • FileFormat: Dataset file formats
  • ParquetFileReader: ParquetFileReader class
  • as_arrow_table: Convert an object to an Arrow Table
  • as_chunked_array: Convert an object to an Arrow ChunkedArray
  • concat_tables: Concatenate one or more Tables
  • install_pyarrow: Install pyarrow for use with reticulate
  • as_data_type: Convert an object to an Arrow DataType
  • dictionary: Create a dictionary type
  • Scanner: Scan the contents of a dataset
  • enums: Arrow enums
  • Schema: Schema class
  • ParquetFileWriter: ParquetFileWriter class
  • as_record_batch: Convert an object to an Arrow RecordBatch
  • io_thread_count: Manage the global I/O thread pool in libarrow
  • read_delim_arrow: Read a CSV or other delimited file with Arrow
  • CsvTableReader: Arrow CSV and JSON table reader classes
  • write_to_raw: Write Arrow data to a raw vector
  • read_feather: Read a Feather file
  • load_flight_server: Load a Python Flight server
  • create_package_with_all_dependencies: Create a source bundle that includes all thirdparty dependencies
  • read_arrow: Read Arrow IPC stream format
  • dataset_factory: Create a DatasetFactory
  • data-type: Apache Arrow data types
  • read_json_arrow: Read a JSON file
  • default_memory_pool
  • as_schema: Convert an object to an Arrow Schema
  • mmap_create: Create a new read/write memory mapped file of a given size
  • make_readable_file: Handle a range of possible input sources
  • cast_options: Cast options
  • map_batches: Apply a function to a stream of RecordBatches
  • match_arrow: match and %in% for Arrow objects
  • codec_is_available: Check whether a compression codec is available
  • flight_disconnect: Explicitly close a Flight client
  • contains_regex: Does this string contain regex metacharacters?
  • flight_connect: Connect to a Flight server
  • infer_type: Infer the arrow Array type from an R object
  • new_extension_type: Extension types
  • open_dataset: Open a multi-file dataset
  • install_arrow: Install or upgrade the Arrow library
  • DataType: class arrow::DataType
  • FileInfo: FileSystem entry info
  • FileSelector: file selector
  • mmap_open: Open a memory mapped file
  • read_schema: read a Schema from a stream
  • to_duckdb: Create a (virtual) DuckDB table from an Arrow object
  • to_arrow: Create an Arrow object from others
  • repeat_value_as_array: Take an object of length 1 and repeat it.
  • reexports: Objects exported from other packages
  • register_binding: Register compute bindings
  • recycle_scalars: Recycle scalar values in a list of arrays
  • write_dataset: Write a dataset
  • write_feather: Write data in the Feather format
  • MessageReader: class arrow::MessageReader
  • Message: class arrow::Message
  • array: Arrow Arrays
  • Table: Table class
  • write_arrow: Write Arrow IPC stream format
  • s3_bucket: Connect to an AWS S3 bucket
  • write_parquet: Write Parquet file to disk
  • buffer: Buffer class
  • call_function: Call an Arrow compute function
  • concat_arrays: Concatenate zero or more Arrays
  • compression: Compressed stream classes
  • get_stringr_pattern_options: Get stringr pattern options
  • unify_schemas: Combine and harmonize schemas
  • read_parquet: Read a Parquet file
  • value_counts: table for Arrow objects
  • hive_partition: Construct Hive partitioning
  • read_message: Read a Message from a stream
  • vctrs_extension_array: Extension type for generic typed vectors
  • write_csv_arrow: Write CSV file to disk
  • ChunkedArray: ChunkedArray class
  • FragmentScanOptions: Format-specific scan options