Learn R Programming

dplyr

Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation “by group”. You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").

If you are new to dplyr, the best place to start is the data transformation chapter in R for data science.

Backends

In addition to data frames/tibbles, dplyr makes working with other computational backends accessible and efficient. Below is a list of alternative backends:

  • dtplyr: for large, in-memory datasets. Translates your dplyr code to high performance data.table code.

  • dbplyr: for data stored in a relational database. Translates your dplyr code to SQL.

  • sparklyr: for very large datasets stored in Apache Spark.

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

Development version

To get a bug fix or to use a feature from the development version, you can install the development version of dplyr from GitHub.

# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

Cheat Sheet

Usage

library(dplyr)

starwars %>% 
  filter(species == "Droid")
#> # A tibble: 6 × 14
#>   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
#>   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
#> 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
#> 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
#> 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
#> 4 IG-88     200   140 none       metal       red               15 none  masculi…
#> 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
#> # … with 1 more row, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

starwars %>% 
  select(name, ends_with("color"))
#> # A tibble: 87 × 4
#>   name           hair_color skin_color  eye_color
#>   <chr>          <chr>      <chr>       <chr>    
#> 1 Luke Skywalker blond      fair        blue     
#> 2 C-3PO          <NA>       gold        yellow   
#> 3 R2-D2          <NA>       white, blue red      
#> 4 Darth Vader    none       white       yellow   
#> 5 Leia Organa    brown      light       brown    
#> # … with 82 more rows

starwars %>% 
  mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>%
  select(name:mass, bmi)
#> # A tibble: 87 × 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172    77  26.0
#> 2 C-3PO             167    75  26.9
#> 3 R2-D2              96    32  34.7
#> 4 Darth Vader       202   136  33.3
#> 5 Leia Organa       150    49  21.8
#> # … with 82 more rows

starwars %>% 
  arrange(desc(mass))
#> # A tibble: 87 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jabba De…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
#> 2 Grievous     216   159 none       brown, wh… green, y…       NA   male  mascu…
#> 3 IG-88        200   140 none       metal      red             15   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 5 Tarfful      234   136 brown      brown      blue            NA   male  mascu…
#> # … with 82 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(
    n > 1,
    mass > 50
  )
#> # A tibble: 8 × 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Droid        6  69.8
#> 2 Gungan       3  74  
#> 3 Human       35  82.8
#> 4 Kaminoan     2  88  
#> 5 Mirialan     2  53.1
#> # … with 3 more rows

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. For questions and other discussion, please use community.rstudio.com or the manipulatr mailing list.


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Copy Link

Version

Install

install.packages('dplyr')

Monthly Downloads

1,402,758

Version

1.0.10

License

MIT + file LICENSE

Issues

Pull Requests

Stars

Forks

Maintainer

Hadley Wickham

Last Published

September 1st, 2022

Functions in dplyr (1.0.10)

context

Context dependent expressions
case_when

A general vectorised if
between

Do values in a numeric vector fall in specified range?
compute

Force computation of a database query
copy_to

Copy a local data frame to a remote src
distinct

Subset distinct/unique rows
do

Do anything
dplyr-package

dplyr: A Grammar of Data Manipulation
cumall

Cumulativate versions of any, all, and mean
defunct

Defunct functions
desc

Descending order
dim_desc

Describing dimensions
filter_all

Filter within a selection of variables
count

Count observations by group
funs

Create a list of function calls
mutate

Create, modify, and delete columns
group_split

Split data frame by groups
group_nest

Nest a tibble using a grouping specification
filter-joins

Filtering joins
filter

Subset rows using column values
make_tbl

Create a "tbl" object
ranking

Windowed rank functions.
mutate-joins

Mutating joins
progress_estimated

Progress bar with estimated time.
pull

Extract a single column
explain

Explain details of a tbl
group_by_all

Group by a selection of variables
dplyr_tidy_select

Argument type: tidy-select
group_by_drop_default

Default value for .drop argument of group_by
mutate_all

Mutate multiple columns
n_distinct

Efficiently count the number of unique values in a set of vectors
ident

Flag a character vector as SQL identifiers
if_else

Vectorised if
na_if

Convert values to NA
group_data

Grouping metadata
group_map

Apply a function to each group
nest_by

Nest by one or more variables
dplyr_data_masking

Argument type: data-masking
dplyr_extending

Extending dplyr with new data frame subclasses
near

Compare two numeric vectors
sample_n

Sample n rows from a table
nest_join

Nest join
new_grouped_df

Low-level construction and validation for the grouped_df and rowwise_df classes
select

Subset columns using their names and types
se-deprecated

Deprecated SE versions of main verbs.
group_cols

Select grouping variables
rename

Rename columns
distinct_prepare

Same basic philosophy as group_by_prepare(): lazy_dots comes in, list of data and vars (character vector) comes out.
nth

Extract the first, last or nth value from a vector
rows

Manipulate individual rows
order_by

A helper function for ordering window function output
setops

Set operations
slice

Subset rows using their positions
starwars

Starwars characters
scoped

Operate on a selection of variables
sql

SQL escaping.
rowwise

Group input by rows
recode

Recode values
same_src

Figure out if two sources are the same (or two tbl have the same source)
summarise_all

Summarise multiple columns
src_tbls

List all tbls provided by a source.
distinct_all

Select distinct rows by a selection of variables
src_dbi

Source for database backends
src

Create a "src" object
summarise_each

Summarise and mutate multiple columns.
src_local

A local source
summarise

Summarise each group to fewer rows
with_order

Run a function with one order, translating result back to original order
storms

Storm tracks data
top_n

Select top (or bottom) n rows (by value)
tbl_vars

List variables provided by a tbl.
tbl_ptype

Return a prototype of a tbl
glimpse

Get a glimpse of your data
location

Print the location in memory of a data frame
grouped_df

A grouped data frame.
group_by

Group by one or more variables
group_trim

Trim grouping structure
lead-lag

Compute lagged or leading values
relocate

Change column order
vars

Select variables
with_groups

Perform an operation with temporary groups
select_all

Select and rename a selection of variables
select_vars

Select variables
reexports

Objects exported from other packages
tbl_df

Coerce to a tibble
tidyeval-compat

Other tidy eval tools
tbl

Create a table from a data source
band_members

Band membership
add_rownames

Convert row names to an explicit variable.
across

Apply a function (or functions) across multiple columns
all_vars

Apply predicate to all variables
arrange

Arrange rows by column values
auto_copy

Copy tables to same source, if necessary
arrange_all

Arrange rows by a selection of variables
backend_dbplyr

Database and SQL generics.
bench_compare

Evaluate, compare, benchmark operations of a set of srcs.
all_equal

Flexible equality comparison for data frames
bind

Efficiently bind multiple data frames by row and column
coalesce

Find first non-missing element
c_across

Combine values from multiple columns
combine

Combine vectors
common_by

Extract out common by variables
check_dbplyr

dbplyr compatibility functions