Create a reference matrix, useful for visualisation, with evenly spread and
combined values. Usually used to generate predictions using get_predicted().
See this vignette for a tutorial on how to create a visualisation matrix using
this function.
Alternatively, this function can also be used to extract the "grid" columns
from objects generated by emmeans and marginaleffects.
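A minimal sketch of the typical workflow, assuming the insight package is installed and using the built-in iris data. The model formula and the choice of focal predictor are illustrative assumptions, not part of the original documentation.

```r
library(insight)

# Fit a simple linear model on the built-in iris data (illustrative choice)
model <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Build a grid with 5 evenly spread Petal.Length values;
# Species is not in `by`, so it is fixed (default: reference level)
grid <- get_datagrid(model, by = "Petal.Length", length = 5)

# Generate predictions for each row of the grid
grid$Predicted <- as.numeric(get_predicted(model, data = grid))
grid
```

Because the non-focal predictor is fixed, each row of the grid differs only in Petal.Length, which makes the predictions easy to plot against that variable.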
get_datagrid(x, ...)

# S3 method for data.frame
get_datagrid(
  x,
  by = "all",
  factors = "reference",
  numerics = "mean",
  preserve_range = FALSE,
  reference = x,
  length = 10,
  range = "range",
  ...
)

# S3 method for numeric
get_datagrid(x, length = 10, range = "range", ...)

# S3 method for factor
get_datagrid(x, ...)

# S3 method for default
get_datagrid(
  x,
  by = "all",
  factors = "reference",
  numerics = "mean",
  preserve_range = TRUE,
  reference = x,
  include_smooth = TRUE,
  include_random = FALSE,
  include_response = FALSE,
  data = NULL,
  verbose = TRUE,
  ...
)

# S3 method for emmGrid
get_datagrid(x, ...)

# S3 method for slopes
get_datagrid(x, ...)
Reference grid data frame.
x: An object from which to construct the reference grid.
...: Arguments passed to or from other methods (for instance, length or range
to control the spread of numeric variables).
by: Indicates the focal predictors (variables) for the reference grid
and at which values focal predictors should be represented. If not specified
otherwise, representative values for numeric variables or predictors are
evenly distributed from the minimum to the maximum, with a total number of
length values covering that range (see 'Examples'). Possible options for by are:
"all", which will include all variables or predictors.
a character vector of one or more variable or predictor names, like c("Species", "Sepal.Width"), which will create a grid of all combinations of unique values. For factors, all levels are used; for numeric variables, a range of length length (evenly spread from minimum to maximum); and for character vectors, all unique values.
a list of named elements, indicating focal predictors and their representative values, e.g. by = list(Sepal.Length = c(2, 4), Species = "setosa").
a string with assignments, e.g. by = "Sepal.Length = 2" or by = c("Sepal.Length = 2", "Species = 'setosa'"). Note the use of single and double quotes to assign strings within strings.
There is special handling of assignments with brackets, i.e. values defined inside [ and ]. For numeric variables, the value(s) inside the brackets should be one of:
  two values, indicating minimum and maximum (e.g. by = "Sepal.Length = [0, 5]"), for which a range of length length (evenly spread from the given minimum to maximum) is created.
  more than two numeric values, e.g. by = "Sepal.Length = [2,3,4,5]", in which case these values are used as representative values.
  a "token" that creates pre-defined representative values:
    for mean and -/+ 1 SD around the mean: "x = [sd]"
    for median and -/+ 1 MAD around the median: "x = [mad]"
    for Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum): "x = [fivenum]"
    for terciles, including minimum and maximum: "x = [terciles]"
    for terciles, excluding minimum and maximum: "x = [terciles2]"
    for quartiles, including minimum and maximum: "x = [quartiles]"
    for quartiles, excluding minimum and maximum: "x = [quartiles2]"
    for minimum and maximum value: "x = [minmax]"
    for 0 and the maximum value: "x = [zeromax]"
For factor variables, the value(s) inside the brackets should indicate one or more factor levels, like by = "Species = [setosa, versicolor]".
Note: the length argument will be ignored when using bracket tokens.
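The forms of by described above can be sketched as follows, using the built-in iris data and assuming the insight package is installed. The object names g1 to g5 are placeholders chosen for this illustration.

```r
library(insight)

# Character vector: all Species levels crossed with 3 evenly spread
# Sepal.Width values
g1 <- get_datagrid(iris, by = c("Species", "Sepal.Width"), length = 3)

# Named list: explicit representative values for each focal predictor
g2 <- get_datagrid(iris, by = list(Sepal.Length = c(2, 4), Species = "setosa"))

# String assignment with two bracketed values: treated as minimum and
# maximum, expanded to `length` evenly spread values
g3 <- get_datagrid(iris, by = "Sepal.Length = [0, 5]", length = 6)

# Token: mean and -/+ 1 SD around the mean (`length` is ignored here)
g4 <- get_datagrid(iris, by = "Sepal.Length = [sd]")

# Factor levels inside brackets
g5 <- get_datagrid(iris, by = "Species = [setosa, versicolor]")

# Number of rows in each grid
sapply(list(g1, g2, g3, g4, g5), nrow)
```

In every case the variables not named in by are kept in the grid but held constant.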
The remaining variables not specified in by will be fixed (see also arguments factors and numerics).
factors: Type of summary for factors. Can be "reference" (set at the
reference level), "mode" (set at the most common level) or "all" to
keep all levels.
numerics: Type of summary for numeric values. Can be "all" (will
duplicate the grid for all unique values), any function ("mean",
"median", ...) or a value (e.g., numerics = 0).
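As a hedged illustration of how factors and numerics fix the non-focal variables, the following sketch varies only Sepal.Width in the iris data. It assumes the insight package is installed; the object names are placeholders.

```r
library(insight)

# Defaults: Species fixed at its reference level, other numerics at their mean
g_ref <- get_datagrid(iris, by = "Sepal.Width", length = 3)

# Keep every Species level and fix the other numerics at their median
g_all <- get_datagrid(
  iris,
  by = "Sepal.Width", length = 3,
  factors = "all", numerics = "median"
)

# Fix the other numerics at a constant value
g_zero <- get_datagrid(iris, by = "Sepal.Width", length = 3, numerics = 0)

unique(g_ref$Species)       # a single level
unique(g_all$Species)       # all levels
unique(g_zero$Petal.Length) # the constant supplied via `numerics`
```

Note that factors = "all" multiplies the number of rows, since every level is crossed with the focal values.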
preserve_range: In the case of combinations between numeric variables
and factors, setting preserve_range = TRUE will drop the observations
where the value of the numeric variable is originally not present in the
range of its factor level. This leads to an unbalanced grid. Also, if you
want the minimum and the maximum to closely match the actual ranges, you
should increase the length argument.
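The effect of preserve_range can be seen by comparing the two settings on the same focal predictors. This is a sketch on the iris data, assuming the insight package is installed; the object names are placeholders.

```r
library(insight)

# Balanced grid: every Species level gets all 10 Petal.Length values
full <- get_datagrid(
  iris,
  by = c("Species", "Petal.Length"),
  length = 10, preserve_range = FALSE
)

# Unbalanced grid: Petal.Length values outside the observed range of a
# given Species are dropped for that Species
kept <- get_datagrid(
  iris,
  by = c("Species", "Petal.Length"),
  length = 10, preserve_range = TRUE
)

nrow(full)
nrow(kept)
table(kept$Species)
```

With a small length, few grid values fall inside the narrow range of some levels, which is why increasing length helps the grid track the observed ranges more closely.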
reference: The reference vector from which to compute the mean and SD.
Used when standardizing or unstandardizing the grid using effectsize::standardize.
length: Length of numeric target variables selected in by. This argument
controls the number of (equally spread) values that will be taken to represent the
continuous variables. A larger length will increase precision, but can also
substantially increase the size of the datagrid (especially in case of interactions).
If NA, will return all the unique values. In case of multiple continuous target
variables, length can also be a vector of different values (see examples).
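A short sketch of the length argument, using the iris data and assuming the insight package is installed. The object names are placeholders.

```r
library(insight)

# One length per focal predictor: 3 values of Sepal.Length crossed with
# 2 values of Petal.Length
g_vec <- get_datagrid(
  iris,
  by = c("Sepal.Length", "Petal.Length"),
  length = c(3, 2)
)
nrow(g_vec)

# length = NA: use the unique observed values instead of an even spread
g_unique <- get_datagrid(iris, by = "Sepal.Length", length = NA)
nrow(g_unique)
length(unique(iris$Sepal.Length))
```

Because the grid is a full combination of the focal values, its size grows multiplicatively with each additional focal predictor.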
range: Option to control the representative values given in by, if
no specific values were provided. Use in combination with the length argument
to control the number of values within the specified range. range can be
one of the following:
"range" (default), will use the minimum and maximum of the original data vector as end-points (min and max).
if an interval type is specified, such as "iqr", "ci", "hdi" or "eti", it will spread the values within that range (the default CI width is 95%, but this can be changed by adding, for instance, ci = 0.90). See IQR() and bayestestR::ci(). This can be useful for a more robust spread that skips extreme values.
if "sd" or "mad", it will spread by this dispersion index around the mean or the median, respectively. If the length argument is an even number (e.g., 4), it will have one more step on the positive side (i.e., -1, 0, +1, +2). The result is a named vector. See 'Examples'.
"grid" will create a reference grid that is useful when plotting predictions, by choosing representative values for numeric variables based on their position in the reference grid. If a numeric variable is the first predictor in by, values from minimum to maximum of the same length as indicated in length are generated. For numeric predictors not specified first in by, mean and -1/+1 SD around the mean are returned. For factors, all levels are returned.
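The range options can be compared on a single focal predictor. This sketch uses the iris data and assumes the insight package is installed; the base R line at the end is only there to show which values "sd" is described as producing, and the object names are placeholders.

```r
library(insight)

# Spread around the mean in steps of 1 SD: -1 SD, mean, +1 SD
g_sd <- get_datagrid(iris, by = "Sepal.Length", range = "sd", length = 3)

# Spread within the interquartile range instead of min/max
g_iqr <- get_datagrid(iris, by = "Sepal.Length", range = "iqr", length = 3)

# "grid": the first numeric predictor gets an even spread, later numeric
# predictors get mean and -/+ 1 SD, factors get all levels
g_grid <- get_datagrid(
  iris,
  by = c("Sepal.Length", "Petal.Length", "Species"),
  range = "grid"
)

g_sd$Sepal.Length
g_iqr$Sepal.Length

# For comparison, the same quantities computed directly in base R
mean(iris$Sepal.Length) + c(-1, 0, 1) * sd(iris$Sepal.Length)
```

The "grid" option is mainly a convenience for plotting: it keeps the x-axis variable detailed while limiting the other predictors to a few representative values.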
include_smooth: If x is a model object, decide whether smooth terms
should be included in the data grid or not.
include_random: If x is a mixed model object, decide whether random
effect terms should be included in the data grid or not. If include_random
is FALSE, but x is a mixed model with random effects, these will still be
included in the returned grid, but set to their "population level" value
(e.g., NA for glmmTMB or 0 for merMod). This ensures that common predict()
methods work properly, as these usually need data with all variables in the
model included.
include_response: If x is a model object, decide whether the response
variable should be included in the data grid or not.
data: Optional, the data frame that was used to fit the model. Usually,
the data is retrieved via get_data().
verbose: Toggle warnings.
get_predicted()