This is an overview of the usage of vec_ptype2()
and vec_cast()
and
their role in the vctrs coercion mechanism. Related topics:
For an example of implementing coercion methods for simple vectors,
see ?howto-faq-coercion
.
For an example of implementing coercion methods for data frame
subclasses, see
?howto-faq-coercion-data-frame
.
For a tutorial about implementing vctrs classes from scratch, see
vignette("s3-vector")
.
The coercion system in vctrs is designed to make combination of multiple inputs consistent and extensible. Combinations occur in many places, such as row-binding, joins, subset-assignment, or grouped summary functions that use the split-apply-combine strategy. For example:
vec_c(TRUE, 1)
#> [1] 1 1vec_c("a", 1)
#> Error in `vec_c()`:
#> ! Can't combine `..1` <character> and `..2` <double>.
vec_rbind(
data.frame(x = TRUE),
data.frame(x = 1, y = 2)
)
#> x y
#> 1 1 NA
#> 2 1 2
vec_rbind(
data.frame(x = "a"),
data.frame(x = 1, y = 2)
)
#> Error in `vec_rbind()`:
#> ! Can't combine `..1$x` <character> and `..2$x` <double>.
One major goal of vctrs is to provide a central place for implementing
the coercion methods that make generic combinations possible. The two
relevant generics are vec_ptype2()
and vec_cast()
. They both take
two arguments and perform double dispatch, meaning that a method is
selected based on the classes of both inputs.
The general mechanism for combining multiple inputs is:
Find the common type of a set of inputs by reducing (as in
base::Reduce()
or purrr::reduce()
) the vec_ptype2()
binary
function over the set.
Convert all inputs to the common type with vec_cast()
.
Initialise the output vector as an instance of this common type with
vec_init()
.
Fill the output vector with the elements of the inputs using
vec_assign()
.
The last two steps may require vec_proxy()
and vec_restore()
implementations, unless the attributes of your class are constant and do
not depend on the contents of the vector. We focus here on the first two
steps, which require vec_ptype2()
and vec_cast()
implementations.
vec_ptype2()
Methods for vec_ptype2()
are passed two prototypes, i.e. two inputs
emptied of their elements. They implement two behaviours:
If the types of their inputs are compatible, indicate which of them is the richer type by returning it. If the types are of equal resolution, return any of the two.
Throw an error with stop_incompatible_type()
when it can be
determined from the attributes that the types of the inputs are not
compatible.
A type is compatible with another type if the values it represents are a subset or a superset of the values of the other type. The notion of “value” is to be interpreted at a high level, in particular it is not the same as the memory representation. For example, factors are represented in memory with integers but their values are more related to character vectors than to round numbers:
# Two factors are compatible
vec_ptype2(factor("a"), factor("b"))
#> factor()
#> Levels: a b# Factors are compatible with a character
vec_ptype2(factor("a"), "b")
#> character(0)
# But they are incompatible with integers
vec_ptype2(factor("a"), 1L)
#> Error:
#> ! Can't combine `factor("a")` <factor<4d52a>> and `1L` <integer>.
Richness of type is not a very precise notion. It can be about richer
data (for instance a double
vector covers more values than an integer
vector), richer behaviour (a data.table
has richer behaviour than a
data.frame
), or both. If you have trouble determining which one of the
two types is richer, it probably means they shouldn’t be automatically
coercible.
Let’s look again at what happens when we combine a factor and a character:
vec_ptype2(factor("a"), "b")
#> character(0)
The ptype2 method for <character>
and <factor<"a">>
returns
<character>
because the former is a richer type. The factor can only
contain "a"
strings, whereas the character can contain any strings. In
this sense, factors are a subset of character.
Note that another valid behaviour would be to throw an incompatible type
error. This is what a strict factor implementation would do. We have
decided to be laxer in vctrs because it is easy to inadvertently create
factors instead of character vectors, especially with older versions of
R where stringsAsFactors
is still true by default.
Each ptype2 method should strive to have exactly the same behaviour when the inputs are permuted. This is not always possible, for example factor levels are aggregated in order:
vec_ptype2(factor(c("a", "c")), factor("b"))
#> factor()
#> Levels: a c bvec_ptype2(factor("b"), factor(c("a", "c")))
#> factor()
#> Levels: b a c
In any case, permuting the input should not return a fundamentally different type or introduce an incompatible type error.
The classes that you can coerce together form a coercion (or subtyping) hierarchy. Below is a schema of the hierarchy for the base types like integer and factor. In this diagram the directions of the arrows express which type is richer. They flow from the bottom (more constrained types) to the top (richer types).
A coercion hierarchy is distinct from the structural hierarchy implied by memory types and classes. For instance, in a structural hierarchy, factors are built on top of integers. But in the coercion hierarchy they are more related to character vectors. Similarly, subclasses are not necessarily coercible with their superclasses because the coercion and structural hierarchies are separate.
As a class implementor, you have two options. The simplest is to create an entirely separate hierarchy. The date and date-time classes are an example of an S3-based hierarchy that is completely separate. Alternatively, you can integrate your class in an existing hierarchy, typically by adding parent nodes on top of the hierarchy (your class is richer), by adding children node at the root of the hierarchy (your class is more constrained), or by inserting a node in the tree.
These coercion hierarchies are implicit, in the sense that they are
implied by the vec_ptype2()
implementations. There is no structured
way to create or modify a hierarchy, instead you need to implement the
appropriate coercion methods for all the types in your hierarchy, and
diligently return the richer type in each case. The vec_ptype2()
implementations are not transitive nor inherited, so all pairwise
methods between classes lying on a given path must be implemented
manually. This is something we might make easier in the future.
vec_cast()
The second generic, vec_cast()
, is the one that looks at the data and
actually performs the conversion. Because it has access to more
information than vec_ptype2()
, it may be stricter and cause an error
in more cases. vec_cast()
has three possible behaviours:
Determine that the prototypes of the two inputs are not compatible.
This must be decided in exactly the same way as for vec_ptype2()
.
Call stop_incompatible_cast()
if you can determine from the
attributes that the types are not compatible.
Detect incompatible values. Usually this is because the target type is too restricted for the values supported by the input type. For example, a fractional number can’t be converted to an integer. The method should throw an error in that case.
Return the input vector converted to the target type if all values are
compatible. Whereas vec_ptype2()
must return the same type when the
inputs are permuted, vec_cast()
is directional. It always returns
the type of the right-hand side, or dies trying.
The dispatch mechanism for vec_ptype2()
and vec_cast()
looks like S3
but is actually a custom mechanism. Compared to S3, it has the following
differences:
It dispatches on the classes of the first two inputs.
There is no inheritance of ptype2 and cast methods. This is because the S3 class hierarchy is not necessarily the same as the coercion hierarchy.
NextMethod()
does not work. Parent methods must be called explicitly
if necessary.
The default method is hard-coded.
The determination of the common type of data frames with vec_ptype2()
happens in three steps:
Match the columns of the two input data frames. If some columns
don’t exist, they are created and filled with adequately typed NA
values.
Find the common type for each column by calling vec_ptype2()
on
each pair of matched columns.
Find the common data frame type. For example the common type of a grouped tibble and a tibble is a grouped tibble because the latter is the richer type. The common type of a data table and a data frame is a data table.
vec_cast()
operates similarly. If a data frame is cast to a target
type that has fewer columns, this is an error.
If you are implementing coercion methods for data frames, you will need
to explicitly call the parent methods that perform the common type
determination or the type conversion described above. These are exported
as df_ptype2()
and df_cast()
.
Being too strict with data frame combinations would cause too much pain because there are many data frame subclasses in the wild that don’t implement vctrs methods. We have decided to implement a special fallback behaviour for foreign data frames. Incompatible data frames fall back to a base data frame:
df1 <- data.frame(x = 1)
df2 <- structure(df1, class = c("foreign_df", "data.frame"))vec_rbind(df1, df2)
#> x
#> 1 1
#> 2 1
When a tibble is involved, we fall back to tibble:
df3 <- tibble::as_tibble(df1)vec_rbind(df1, df3)
#> # A tibble: 2 x 1
#> x
#> <dbl>
#> 1 1
#> 2 1
These fallbacks are not ideal but they make sense because all data frames share a common data structure. This is not generally the case for vectors. For example factors and characters have different representations, and it is not possible to find a fallback time mechanically.
However this fallback has a big downside: implementing vctrs methods for your data frame subclass is a breaking behaviour change. The proper coercion behaviour for your data frame class should be specified as soon as possible to limit the consequences of changing the behaviour of your class in R scripts.