pivot_wider()
"widens" data, increasing the number of columns and
decreasing the number of rows. The inverse transformation is
pivot_longer()
.
Learn more in vignette("pivot", "tidyr")
.
Note that pivot_wider()
is not and cannot be lazy because we need to look
at the data to figure out what the new column names will be.
If you have a long running query you have two options:
(temporarily) store the result of the query via compute()
.
Create a spec before and use dbplyr_pivot_wider_spec()
- dbplyr's version
of tidyr::pivot_wider_spec()
. Note that this function is only a temporary
solution until pivot_wider_spec()
becomes a generic. It will then be
removed soon afterwards.
# S3 method for tbl_lazy
pivot_wider(
data,
...,
id_cols = NULL,
id_expand = FALSE,
names_from = name,
names_prefix = "",
names_sep = "_",
names_glue = NULL,
names_sort = FALSE,
names_vary = "fastest",
names_expand = FALSE,
names_repair = "check_unique",
values_from = value,
values_fill = NULL,
values_fn = ~max(.x, na.rm = TRUE),
unused_fn = NULL
)dbplyr_pivot_wider_spec(
data,
spec,
...,
names_repair = "check_unique",
id_cols = NULL,
id_expand = FALSE,
values_fill = NULL,
values_fn = ~max(.x, na.rm = TRUE),
unused_fn = NULL,
error_call = current_env()
)
A lazy data frame backed by a database query.
Unused; included for compatibility with generic.
A set of columns that uniquely identifies each observation.
Unused; included for compatibility with the generic.
A pair of
arguments describing which column (or columns) to get the name of the
output column (names_from
), and which column (or columns) to get the
cell values from (values_from
).
If values_from
contains multiple values, the value will be added to the
front of the output column.
String added to the start of every variable name.
If names_from
or values_from
contains multiple
variables, this will be used to join their values together into a single
string to use as a column name.
Instead of names_sep
and names_prefix
, you can supply
a glue specification that uses the names_from
columns (and special
.value
) to create custom column names.
Should the column names be sorted? If FALSE
, the default,
column names are ordered by first appearance.
When names_from
identifies a column (or columns) with
multiple unique values, and multiple values_from
columns are provided,
in what order should the resulting column names be combined?
"fastest"
varies names_from
values fastest, resulting in a column
naming scheme of the form: value1_name1, value1_name2, value2_name1, value2_name2
. This is the default.
"slowest"
varies names_from
values slowest, resulting in a column
naming scheme of the form: value1_name1, value2_name1, value1_name2, value2_name2
.
Should the values in the names_from
columns be expanded
by expand()
before pivoting? This results in more columns, the output
will contain column names corresponding to a complete expansion of all
possible values in names_from
. Additionally, the column names will be
sorted, identical to what names_sort
would produce.
What happens if the output has invalid column names?
Optionally, a (scalar) value that specifies what each
value
should be filled in with when missing.
A function, the default is max()
, applied to the value
in each cell in the output. In contrast to local data frames it must not be
NULL
.
Optionally, a function applied to summarize the values from
the unused columns (i.e. columns not identified by id_cols
,
names_from
, or values_from
).
The default drops all unused columns from the result.
This can be a named list if you want to apply different aggregations to different unused columns.
id_cols
must be supplied for unused_fn
to be useful, since otherwise
all unspecified columns will be considered id_cols
.
This is similar to grouping by the id_cols
then summarizing the
unused columns using unused_fn
.
A specification data frame. This is useful for more complex pivots because it gives you greater control on how metadata stored in the columns become column names in the result.
Must be a data frame containing character .name
and .value
columns.
Additional columns in spec
should be named to match columns in the
long format of the dataset and contain values corresponding to columns
pivoted from the wide format.
The special .seq
variable is used to disambiguate rows internally;
it is automatically removed after pivoting.
The execution environment of a currently
running function, e.g. caller_env()
. The function will be
mentioned in error messages as the source of the error. See the
call
argument of abort()
for more information.
The big difference to pivot_wider()
for local data frames is that
values_fn
must not be NULL
. By default it is max()
which yields
the same results as for local data frames if the combination of id_cols
and value
column uniquely identify an observation.
Mind that you also do not get a warning if an observation is not uniquely
identified.
The translation to SQL code basically works as follows:
Get unique keys in names_from
column.
For each key value generate an expression of the form:
value_fn(
CASE WHEN (`names from column` == `key value`)
THEN (`value column`)
END
) AS `output column`
Group data by id columns.
Summarise the grouped data with the expressions from step 2.
memdb_frame(
id = 1,
key = c("x", "y"),
value = 1:2
) %>%
tidyr::pivot_wider(
id_cols = id,
names_from = key,
values_from = value
)
Run the code above in your browser using DataLab