dtplyr

Overview

dtplyr provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code.

See vignette("translation") for details of the current translations, and table.express and rqdatatable for related work.

Installation

You can install from CRAN with:

install.packages("dtplyr")

Or try the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("tidyverse/dtplyr")

Usage

To use dtplyr, you must at least load dtplyr and dplyr. You may also want to load data.table so you can access the other goodies that it provides:

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

Then use lazy_dt() to create a “lazy” data table that tracks the operations performed on it.

mtcars2 <- lazy_dt(mtcars)

You can preview the transformation (including the generated data.table code) by printing the result:

mtcars2 %>% 
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k))
#> Source: local data table [3 x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
#>     keyby = .(cyl)]
#> 
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

Printing is best reserved for debugging, though. To indicate that you’re done with the transformation and want to access the results, use as.data.table(), as.data.frame(), or as_tibble():

mtcars2 %>% 
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k)) %>% 
  as_tibble()
#> # A tibble: 3 × 2
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9
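If you only want to inspect the generated data.table code without printing any rows, show_query() can be called on the lazy pipeline. The following is a minimal sketch: the summary column name mean_mpg and the variable names query and result are illustrative choices, not part of the original example.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars2 <- lazy_dt(mtcars)

# Build the pipeline lazily; nothing is computed at this point.
query <- mtcars2 %>%
  filter(wt < 5) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))  # mean_mpg is an illustrative name

# Print only the translated data.table call.
show_query(query)

# Materialise the result once the translation looks right.
result <- as_tibble(query)
```

Keeping the lazy pipeline in a variable lets you check the translation first and collect the result afterwards with the same object.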

Why is dtplyr slower than data.table?

There are two primary reasons that dtplyr will always be somewhat slower than data.table:

  • Each dplyr verb must do some work to convert dplyr syntax to data.table syntax. This takes time proportional to the complexity of the input code, not the size of the input data, so the overhead should be negligible for large datasets. Initial benchmarks suggest that the overhead should be under 1ms per dplyr call.

  • To match dplyr semantics, mutate() does not modify in place by default. This means that most expressions involving mutate() must make a copy that would not be necessary if you were using data.table directly. (You can opt out of this behaviour in lazy_dt() with immutable = FALSE.)
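The effect of the second point can be seen by comparing the generated code for the two settings. This is a sketch under assumptions: the names dt, lazy_copy and lazy_inplace are illustrative, and the exact text printed by show_query() may differ between dtplyr versions.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Start from a data.table so that lazy_dt() wraps this object directly.
dt <- as.data.table(mtcars)

lazy_copy    <- lazy_dt(dt)                     # default: immutable = TRUE
lazy_inplace <- lazy_dt(dt, immutable = FALSE)  # allow modification by reference

# With the default, the generated call is expected to copy the input first.
lazy_copy %>% mutate(l100k = 235.21 / mpg) %>% show_query()

# With immutable = FALSE, the call is expected to use := on the input directly.
lazy_inplace %>% mutate(l100k = 235.21 / mpg) %>% show_query()

# Evaluating the in-place pipeline modifies dt itself, so the new column
# should now be visible on the original object.
invisible(lazy_inplace %>% mutate(l100k = 235.21 / mpg) %>% as.data.table())
"l100k" %in% names(dt)
```

Opting out of immutability saves the copy, but every other reference to dt sees the change, so it is best kept to cases where the input is not reused elsewhere.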

Code of Conduct

Please note that the dtplyr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Version: 1.3.1
License: MIT + file LICENSE
Maintainer: Hadley Wickham
Last Published: March 22nd, 2023
Monthly Downloads: 430,682
Functions in dtplyr (1.3.1)

  • left_join.dtplyr_step: Join data tables
  • group_modify.dtplyr_step: Apply a function to each group
  • head.dtplyr_step: Subset first or last rows
  • nest.dtplyr_step: Nest
  • pivot_longer.dtplyr_step: Pivot data from wide to long
  • mutate.dtplyr_step: Create and modify columns
  • slice.dtplyr_step: Subset rows using their positions
  • group_by.dtplyr_step: Group and ungroup
  • filter.dtplyr_step: Subset rows using column values
  • transmute.dtplyr_step: Create new columns, dropping old
  • unite.dtplyr_step: Unite multiple columns into one by pasting strings together
  • select.dtplyr_step: Subset columns using their names
  • separate.dtplyr_step: Separate a character column into multiple columns with a regular expression or numeric locations
  • lazy_dt: Create a "lazy" data.table for use with dplyr verbs
  • intersect.dtplyr_step: Set operations
  • pivot_wider.dtplyr_step: Pivot data from long to wide
  • summarise.dtplyr_step: Summarise each group to one row
  • relocate.dtplyr_step: Relocate variables using their names
  • replace_na.dtplyr_step: Replace NAs with specified values
  • rename.dtplyr_step: Rename columns using their names
  • distinct.dtplyr_step: Subset distinct/unique rows
  • count.dtplyr_step: Count observations by group
  • arrange.dtplyr_step: Arrange rows by column values
  • complete.dtplyr_step: Complete a data frame with missing combinations of data
  • expand.dtplyr_step: Expand data frame to include all possible combinations of values
  • collect.dtplyr_step: Force computation of a lazy data.table
  • dtplyr-package: dtplyr: Data Table Back-End for 'dplyr'
  • drop_na.dtplyr_step: Drop rows containing missing values
  • fill.dtplyr_step: Fill in missing values with previous or next value
  • .datatable.aware: dtplyr is data.table aware